🧩 tasks are now modular — each lives in its own file.
“suites” are going away → easier contributions, faster iteration.
explore all tasks available in lighteval here:
https://t.co/3fPGau70MP
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”
We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly.
The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains.
https://t.co/lrJioBmpbT
A big part of our mission at Thinking Machines is to improve people’s scientific understanding of AI and work with the broader research community. Introducing Connectionism today to share some of our scientific insights.
Fei-Fei Li (@drfeifei) on limitations of LLMs.
"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."
Language is purely generated signal.
We were able to reproduce the strong findings of the HRM paper on ARC-AGI-1.
Further, we ran a series of ablation experiments to get to the bottom of what's behind it.
Key findings:
1. The HRM model architecture itself (the centerpiece of the paper) is not an important factor.
2. The outer refinement loop (barely mentioned in the paper) is the main driver of performance.
3. Cross-task transfer learning is not very helpful. What matters is training on the tasks you will test on.
4. You can use much fewer data augmentations, especially at inference time.
Finding 2 & 3 mean that this approach is a case of *zero-pretraining test-time training*, similar to the recently published "ARC-AGI without pretraining" paper by Liao et al.
One perk of working on @AtlasIA projects: we get to confirm big-lab findings with limited community budget💪
We finetuned Qwen2.5-VL at two scales to find the sweet spot for LR × batch size and saw patterns validating DeepSeek’s scaling laws 📈
(https://t.co/KhsHPzeWs4).
@AnassAb01@Omar_H_ maybe this is the "report" you want: https://t.co/1suFJYMK31
Indeed the information mentioned by detafour is WRONG ! Maybe they misunderstood the 21st slide (which is the exact opposite of what they mentioned)
Nevertheless, we are still not at the top of our game yet !!!
@AnassAb01@Omar_H_ Also, i guess it's worth to mention that generally most the funding we have is internal (local VCs), while south africa and egypt lead given the British and GCC VCs respectively.
Not justifying falling behind here, but maybe one of the reasons!
🚨 New blog: The AI Evaluation Chart Crisis 📝
From misleading bar heights to missing error bars, recent model launches have sparked debate on AI evals. In our new blogpost, we dig into what’s broken, why it matters and how they should be presented 👇
https://t.co/5KnVw8a2mf
We have a long history of using games to measure progress in AI. 🎮
That’s why we’re helping unveil the @Kaggle Game Arena: an open-source platform where models go head-to-head in complex games to help us gauge their capabilities. 🧵
🚨 AI Evals Crisis: Officially kicking off the Eval Science Workstream 🚨
We’re building a shared scientific foundation for evaluating AI systems, one that’s rigorous, open, and grounded in real-world & cross-disciplinary best practices👇 (1/2)
https://t.co/AQdEKtJS3l
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models
"We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences."
Just released: A Parquet-converted version of the Newspaper Navigator dataset on @huggingface!
📰3M+ visual annotations from historic US newspapers from @ChronAmLOC
🗂️ Bounding boxes, OCR, metadata + IIIF crop URLs
📸 Covers photos, cartoons, comics, maps & more
NVIDIA released new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖
> first reasoning model for robotics 😱
> based on Qwen 2.5-VL-7B, use with @huggingface transformers or vLLM 🤗
> comes with SFT & alignment dataset and a new benchmark 👏
I'm excited to share our new pre-print
ShiQ: Bringing back Bellman to LLMs!
https://t.co/yWMT6M0nuT
In this work, we propose a new, Q-learning inspired RL algorithm for finetuning LLMs 🎉
(1/n)