tb-eval
terminal-bench terminal-bench@2.0 · anthropic/claude-opus-4-7
10 hard tasks, 3 trials each, run with the claude-code agent under harbor.
2 pass · 3 fail · 25 pending · pass rate 40%
| task | t1 | t2 | t3 |
|---|---|---|---|
| 01 llm-inference-batching-scheduler | | | |
| 02 train-fasttext | | | |
| 03 feal-differential-cryptanalysis | | | |
| 04 path-tracing | | | |
| 05 regex-chess | | | |
| 06 make-mips-interpreter | | | |
| 07 torch-pipeline-parallelism | | | |
| 08 write-compressor | | | |
| 09 dna-assembly | | | |
| 10 fix-code-vulnerability | | | |
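
The summary counts are consistent with a pass rate computed over finished trials only, with pending trials excluded: 2 / (2 + 3) = 40%, and 2 + 3 + 25 = 30 = 10 tasks × 3 trials. A minimal sketch of that arithmetic follows. It assumes this is how the page derives the figure, and `pass_rate` is a hypothetical helper, not part of tb-eval or harbor.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of finished trials that passed (hypothetical helper).

    Assumption: pending trials are excluded from the denominator.
    """
    finished = passed + failed
    if finished == 0:
        return 0.0  # nothing has finished yet, so report 0 rather than divide by zero
    return passed / finished


# Counts copied from the summary line of the page.
passed, failed, pending = 2, 3, 25
total = passed + failed + pending

print(total)                               # 30
print(f"{pass_rate(passed, failed):.0%}")  # 40%
```

Under this assumption the rate will shift as the 25 pending trials finish, so 40% reflects only the 5 trials completed so far.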