Working on EPS-SG satellites, HPC engineer, PhD in tropical cyclone modelling, weather and climate enthusiast, operational processing...
Any questions welcome!
Senator Chris Murphy speaks the truth! Trump's Ukraine "peace deal" is just a mafia corruption scheme to enrich Trump's family and friends by selling out Ukraine.
@ChrisMurphyCT
Yann LeCun (Chief AI Scientist, Meta, @ylecun), @PimDeWitte (CEO, General Intuition), and Aude Durand (Kyutai, @aude_drn), talk about world models, embodied agents, Yann's new company, and the limitations of LLMs
0:00 - Introduction to World Models
5:00 - Why World Models, Intuition & Introducing Yann's new company
10:00 - Architectures + Merging Language & Interaction Data towards General Agents
20:00 - Open Source, Sovereign AI & @kyutai_labs Partnership
Keynote for #aiPULSE2025 at Station F in Paris 🇫🇷
"I'd like to be able to think of something and have ChatGPT respond to it": Merge Labs, Sam Altman's new start-up, could upend brain-computer interfaces and tie us even more closely to artificial intelligence https://t.co/wgBWy0aGaY via @techandco
CBS canceled Colbert’s show just THREE DAYS after Colbert called out CBS parent company Paramount for its $16M settlement with Trump – a deal that looks like bribery.
America deserves to know if his show was canceled for political reasons.
Watch and share his message.
An Apple study calls into question the progress in AI "reasoning" touted by OpenAI, Google, and Anthropic: their LRMs suffer a "complete accuracy collapse" when faced with complex problems https://t.co/MaCkb4VLTU via @developpez
The Ultimate LLM Benchmark list:
SimpleBench: https://t.co/51rkwsB7pZ
SOLO-Bench: https://t.co/Zymtspj83V
AidanBench: https://t.co/5lpH3CGhl0
SEAL by Scale: https://t.co/mAFyIfod7V (particularly the MultiChallenge leaderboard)
LMArena: https://t.co/CIOyTQ9ufe (with Style Control)
LiveBench: https://t.co/1fsq2IOsy1
ARC-AGI: https://t.co/bKh8xsI9WX
Thematic Generalization by LechMazur: https://t.co/W9FIyRedE6
(other ones by Lech Mazur: https://t.co/vPDH3Aj5OO, https://t.co/SrtUI7KYEZ, ...)
EQBench: https://t.co/g7zmT8Ilkq (especially the Longform writing leaderboard)
Fiction-Live Bench: https://t.co/NSA1d7LEGe
MC-Bench: https://t.co/JpXYWvjk3Z (ordered by winrate, not by Elo)
TrackingAI - IQ Bench: https://t.co/rWoTwz1eu9
Dubesor LLM: https://t.co/FyF32AKDa4
Balrog-AI: https://t.co/ZLwDpixw2E
Misguided Attention: https://t.co/2VMdPg5J4m
Snake-Bench: https://t.co/dEcvZYsVqz
SmolAgents LLM: https://t.co/iBock5Q4V4 (just because of GAIA and SimpleQA)
Context-Arena (MRCR and Graphwalks): https://t.co/bXn2wwMK6L
OpenCompass: https://t.co/GQbKwZDq8k
HHEM (Hallucination Benchmark): https://t.co/Z23lcd7XMc
Coding, Math and Agentic Benchmarks
Aider-Polyglot-Coding: https://t.co/aRGODg2PUA
BigCodeBench: https://t.co/HxNMp3GLk9
WebDev-Arena: https://t.co/sQB8tBLekG
WeirdML: https://t.co/38CA9RBml4
Symflower Coding: https://t.co/WxYMXjcHpZ
PHYBench: https://t.co/gyp0bGXxzt
MathArena: https://t.co/QVzZSeW9t9
Galileo Agent: https://t.co/Igs3TW3s1I
XLANG Agent: https://t.co/NZwxnbGMry
Important for tracking AI take-off
METR long task benchmarks: https://t.co/IYzI5SGUFd (incl. RE Bench)
PaperBench: https://t.co/uLLybqtwIg
SWE-Lancer: https://t.co/amsmTZYK7n
MLE-Bench: https://t.co/2DkbRKdVA5
SWE-Bench: https://t.co/TFJyzWqURA
other classics I ALWAYS want to see when a new model is released
GPQA-Diamond: https://t.co/t2HV6IiyaC
SimpleQA: https://t.co/lhpHqQcxJf
Tau-bench: https://t.co/TZT7fQc6cc
SciCode: https://t.co/J8HmPK9kiU
MMMU: https://t.co/rMHpsZQvRJ
Humanity's Last Exam (HLE): https://t.co/OHSoyPZ9nY
Overview for classical benchmarks (GPQA, SimpleQA, AIME, MMLU, ...)
Simple-Evals: https://t.co/3sQysnQaVd
Vellum AI: https://t.co/E1g047GWk7
Artificial Analysis: https://t.co/sALmriQ4qC
Benchmarks I literally don't care about - saturated / no signal
MMLU, HumanEval, BBH, DROP, MGSM, basically all math benchmarks like GSM8K, MATH, AIME
OpenAI catches a lot of shit because it promised the public so much and is now falling short. But don't forget that it remains miles ahead of:
• Meta
• xAI
• Microsoft
• DeepSeek
which all get off lightly because they've only ever promised little or nothing!
AI models currently have a 50% chance of doing something that takes a human expert one hour.
This doubles every 7 months.
In 2 years? They could automate full workdays. In 4 years? A full month.
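As a rough check on that arithmetic, here is a minimal sketch that extrapolates the stated trend: a one-hour task horizon at 50% success, doubling every 7 months. The function name and the assumption that the trend stays perfectly exponential are illustrative and not taken from METR's methodology.

```python
def task_horizon_hours(months_ahead, start_hours=1.0, doubling_months=7.0):
    """Task length (in human-expert hours) completed at 50% success
    after `months_ahead` months, assuming a constant doubling time."""
    return start_hours * 2 ** (months_ahead / doubling_months)


# Horizons at the two dates mentioned in the post.
two_years = task_horizon_hours(24)
four_years = task_horizon_hours(48)

print(f"In 2 years: about {two_years:.0f} hours")
print(f"In 4 years: about {four_years:.0f} hours")
```

Under these assumptions the horizon is roughly 11 hours after two years, which exceeds an 8-hour workday, and roughly 116 hours after four years, which is closer to three 40-hour weeks than to a full work month. The post's figures are therefore best read as round approximations.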
I discuss the most important graph in AI today with Beth Barnes, the CEO of METR, which uncovered this rule of AI progress.
Her bottom line: "It really doesn't seem like 2 years would be surprising for recursively self-improving AI."
Beth also explains where company safety testing fails, why there are no true closed-weight models, how AI undermines leading powers, why she's come around on open weighting, and why models might be about to start playing dumb much more often.
Enjoy! Available on the 80,000 Hours Podcast in all apps. Links below.
1:51 Can we see AI scheming in the chain of thought?
12:50 Alignment faking
17:33 We have to test models before they're even used inside AI companies
31:56 Every 7 months models can do tasks twice as long
51:31 METR's research finds AIs are solid at AI research already
58:18 AI may turn out to be strong at novel and creative research
1:07:55 Recursively self-improving AI might even be here in two years
1:14:29 Could evaluations backfire?
1:39:55 Do we need external auditors doing AI safety tests?
1:54:09 Why not work at AI companies
2:08:40 The new more dire situation has forced changes to METR's strategy
2:21:49 Overrated: Interpretability research
2:32:55 Overrated: Major AI companies' contributions to safety research
2:39:15 Could we ban using AI to enhance AI, or is that just naive?
2:45:31 Open-weighting models is often good
2:50:22 What we can learn about AGI from the nuclear arms race
3:10:43 AI is more like bioweapons because it undermines the leading power
3:42:09 What research METR plans to do next