tb-eval

terminal-bench terminal-bench@2.0 · anthropic/claude-opus-4-7

10 hard tasks, 3 trials each, run with the claude-code agent under harbor.

2 pass · 3 fail · 25 pending · pass rate 40%
task  t1  t2  t3
01  llm-inference-batching-scheduler
02  train-fasttext
03  feal-differential-cryptanalysis
04  path-tracing
05  regex-chess
06  make-mips-interpreter
07  torch-pipeline-parallelism
08  write-compressor
09  dna-assembly
10  fix-code-vulnerability
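
The summary counts are consistent with a pass rate computed over finished trials only: 2 passes out of 2 + 3 finished trials gives 40%, with the 25 pending trials left out of the denominator. The sketch below checks that arithmetic. The function name `pass_rate` and the exclusion of pending trials are assumptions inferred from the numbers, not taken from the tb-eval source.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Pass rate over finished trials only (assumption: pending trials are excluded)."""
    finished = passed + failed
    if finished == 0:
        return 0.0  # nothing has finished yet, so report 0 rather than divide by zero
    return passed / finished


# Counts from the summary line: 10 tasks x 3 trials = 30 trials in total.
total_trials = 10 * 3
passed, failed, pending = 2, 3, 25
assert passed + failed + pending == total_trials

print(f"pass rate {pass_rate(passed, failed):.0%}")  # pass rate 40%
```

Under this reading the rate will shift as pending trials finish. Dividing by all 30 trials instead would give about 7%, which does not match the figure shown.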