New Benchtalks with @jyangballin: on ProgramBench (frontier models scored 0% at launch) and the lineage/future of coding benchmarks, from SWE-bench/InterCode to now
01:29 ProgramBench launch and reception
03:41 Why artifact-level evaluation, not code-level
06:03 Why models love Python
08:29 ProgramBench as a research tool
12:45 From SWE-bench & InterCode to ProgramBench
17:47 How to grade a coding model
21:53 The position paper & humans in the loop
25:01 Managing quality with agents-in-the-loop
28:40 Internet access and benchmark integrity
35:26 Where models may surpass human abilities
38:56 When a model hits 80% on ProgramBench
43:55 Benchmarks worth paying attention to
46:24 What benchmark do you wish existed
49:32 Will benchmarks still look like benchmarks in 5 years
52:02 How to contribute to ProgramBench
If you are working in fields like AI or web3 or VR right now, you are living through and creating important history.
Enjoy it and keep notes.
May victory be yours.
Neural networks really, truly learn. It's not a fancy trick.
This is one of the most remarkable things humans have ever figured out, and the implications are difficult to overstate.