Sebastien Treguer

@ST4Good

ML/AI research. Curious to better understand the world we live in and the various forms of intelligence, living & artificial, then live & create from that.

France

Joined December 2010

1.1K Following

523 Followers

3K Posts

ST4Good retweeted

@christinetyip

3 months ago

We were inspired by @karpathy's autoresearch and built:

autoresearch@home

Any agent on the internet can join and collaborate on AI/ML research.

What one agent can do alone is impressive.
Now hundreds, or thousands, can explore the search space together.

Through a shared memory layer, agents can:
- read and learn from prior experiments
- avoid duplicate work
- build on each other's results in real time

122

2K

257

2K

270K
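
The shared memory layer described in the tweet above could be sketched roughly as follows. This is a minimal in-process illustration, not autoresearch@home's actual design: the `SharedMemory` class, its method names, the agent names, and the loss values are all hypothetical.

```python
import hashlib
import json


class SharedMemory:
    """Hypothetical shared store of experiment results, keyed by config."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def key(config):
        # Canonical JSON so that key order does not change the hash.
        payload = json.dumps(config, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def lookup(self, config):
        """Return the prior result for this config, or None if untried."""
        return self._results.get(self.key(config))

    def record(self, agent, config, loss):
        """Publish a finished experiment so other agents can read it."""
        self._results[self.key(config)] = {
            "agent": agent,
            "config": config,
            "loss": loss,
        }

    def best(self):
        """Return the lowest-loss result so far, or None if empty."""
        return min(self._results.values(), key=lambda r: r["loss"], default=None)


memory = SharedMemory()
memory.record("agent-a", {"lr": 0.3, "depth": 12}, loss=1.92)  # illustrative values

# A second agent checks before running: same config, different key order.
duplicate = memory.lookup({"depth": 12, "lr": 0.3})
untried = memory.lookup({"lr": 0.1, "depth": 12})

# It skips the duplicate, runs the untried config, and publishes the result.
memory.record("agent-b", {"lr": 0.1, "depth": 12}, loss=1.87)  # illustrative values
print(memory.best()["agent"])
```

Hashing a canonical form of the config is one simple way to detect duplicate work; a real deployment would also need networking, authentication, and some way to handle conflicting results, none of which is modeled here.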

Sebastien Treguer @ST4Good

3 months ago

@karpathy Does the agent observe its own exploration/exploitation process/method and question it in order to optimise it and converge on the best solution more quickly, with fewer testing steps? Can it discover unknown optimizations or create unknown metalearning approaches?

0

0

0

0

24

Sebastien Treguer @ST4Good

3 months ago

@EthanHe_42 @karpathy It looks more like automating the exploration and optimization process of an ML/AI research scientist, using any possible approach?

0

1

0

0

79

ST4Good retweeted

Andrej Karpathy

3 months ago

Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
https://t.co/WAz8aIztKT

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

961

19K

2K

11K

4M
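
The loop described in the thread above (propose a change, check the validation loss, keep what helps, then plan the next experiment) can be illustrated with a toy greedy search. This is a sketch under stated assumptions, not the autoresearch implementation: `validation_loss` is a made-up quadratic standing in for a real training run, `propose_change` perturbs settings at random where the real agent reasons over past results, and the setting names and values are hypothetical.

```python
import random


def validation_loss(config):
    """Toy stand-in for a training run (assumption): lower is better."""
    return (config["lr"] - 0.3) ** 2 + (config["weight_decay"] - 0.1) ** 2


def propose_change(config, rng):
    """Hypothetical proposer: nudge one randomly chosen setting."""
    key = rng.choice(sorted(config))
    candidate = dict(config)
    candidate[key] = config[key] + rng.uniform(-0.05, 0.05)
    return candidate


def autoresearch(config, steps, seed=0):
    """Greedy loop: evaluate each proposal and keep it only if loss improves."""
    rng = random.Random(seed)
    best_loss = validation_loss(config)
    kept = 0
    for _ in range(steps):
        candidate = propose_change(config, rng)
        loss = validation_loss(candidate)
        if loss < best_loss:  # accepted changes stack on top of each other
            config, best_loss = candidate, loss
            kept += 1
    return config, best_loss, kept


baseline = {"lr": 0.5, "weight_decay": 0.0}  # illustrative starting point
tuned, tuned_loss, kept = autoresearch(baseline, steps=700)
print(f"kept {kept} of 700 changes; "
      f"loss {validation_loss(baseline):.4f} -> {tuned_loss:.4f}")
```

The sketch leaves out the promotion step from the thread, where changes found on a small model are re-tested on a larger one before being adopted.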

ST4Good retweeted

Finfox 🦇🔊 @Finfox3

10 months ago

I asked #GPT5 for a detailed analysis comparing GPT-5 vs Grok 4. Surprisingly, it badly confused Grok with Claude, treating Grok 4 as an Anthropic model🫣. As a result, the comparison is totally irrelevant. Embarrassing for a so-called PhD-level @OpenAI

1

2

1

0

183

Sebastien Treguer @ST4Good

11 months ago

@karpathy In France we have groups of farmers/producers organized to sell directly to consumers. It's a bit less flexible than a grocery shop, since you have to order in advance and collect at a specific time, but it's guaranteed local, fresh, and mostly organic.

0

0

0

0

26

ST4Good retweeted

Finfox 🦇🔊 @Finfox3

12 months ago

🧠 #MIT study: how AI chatbots impact our brain activity and change how we think? Dive into the findings based on 4 months of data and what it means for our minds! (Hint: challenge your brain to avoid getting dull) https://t.co/at1bdo7I0k #AI #Neuroscience

0

1

1

0

74

ST4Good retweeted

Finfox 🦇🔊 @Finfox3

12 months ago

MiniMax-M1 China's new open source (Apache2.0) LLM (456B params, 1M token context) outperforms DeepSeek R1 and rivals GPT-4o in reasoning, coding, and long-context tasks—at 200x lower training cost. More details: https://t.co/LqPR0mGhMa GitHub https://t.co/EHRZ0PdcOD

0

1

1

0

129

Sebastien Treguer @ST4Good

about 1 year ago

@gosimfoundation @Finfox3

0

0

0

0

40

Sebastien Treguer @ST4Good

over 1 year ago

The repo has already been ported to Python for non-JS folks https://t.co/MIK638N9dM

0

1

0

0

50

Sebastien Treguer @ST4Good

over 1 year ago

Open Deep Research, an #opensource #AI assistant combining search engines, web scraping, and LLMs for comprehensive results:
- Iterative deep dives
- Smart query generation
- Customizable depth & breadth
- Detailed markdown reports
#ResearchTool https://t.co/xLCZumR1Dt

2

2

0

1

313

Sebastien Treguer @ST4Good

over 1 year ago

@0xbasedalex It's a lighter implementation of similar concepts. Comparing them properly would require an extensive and complete benchmark.

0

1

0

0

14

Sebastien Treguer @ST4Good

over 1 year ago

@ashtom I can't wait to play with it.

0

0

0

0

12

ST4Good retweeted

over 1 year ago

1️⃣New Agent Mode: With agent mode in VS Code, Copilot goes beyond your initial request, completing all necessary subtasks and even inferring unspecified tasks. Agent mode allows Copilot to iterate on its own code, propose and guide terminal commands, and analyze and resolve run-time errors. Available today for VS Code Insiders 💫 (2/4)

18

594

38

168

96K

Sebastien Treguer @ST4Good

over 1 year ago

8/ Both models are groundbreaking in their own ways. The "best" choice depends on your needs—speed vs. scalability, simplicity vs. complexity, or cost vs. energy efficiency! 💡✨ Which one would you pick for your next project? Let me know below! 👇 #AI #MachineLearning

0

0

0

0

31

Sebastien Treguer @ST4Good

over 1 year ago

OpenAI o3-mini vs DeepSeek-R1. Two cutting-edge AI models, each excelling in different domains. Let's dive into how they compare across benchmarks, efficiency, and use cases. Ready? Let's go! 🚀 Thread 🧵👇

7

1

0

0

112

Sebastien Treguer @ST4Good

over 1 year ago

7/ So, which one should you choose? 🤔 Go with o3-mini if you need speed, cost-efficiency, or large-context handling. 🚀💼 Choose DeepSeek R1 for energy-efficient operations or complex reasoning/coding tasks at scale. 🌿🧩

0

0

0

0

49

Sebastien Treguer @ST4Good

over 1 year ago

6/ Use Cases o3-mini: Perfect for real-time decision-making, large-context tasks (200K tokens!), and simpler coding workflows. 🕒📜 DeepSeek R1: Excels in batch processing, advanced research queries, and energy-efficient large-scale tasks. 🌐💡

0

0

0

0

47

Sebastien Treguer @ST4Good

over 1 year ago

5/ Architectural Insights o3-mini: Dense transformer = consistent performance across tasks. DeepSeek R1: Mixture-of-Experts (MoE) = scalable, energy-efficient for large workloads. Different architectures, different strengths! 🏗️🛠️

0

0

0

0

37

Sebastien Treguer @ST4Good

over 1 year ago

4/ Efficiency Metrics DeepSeek wins on energy & throughput, while o3-mini has lower memory needs & faster response times! ⚡🔋 https://t.co/VxJGstmpGb

0

0

0

0

51

Sebastien Treguer @ST4Good

over 1 year ago

3/ Coding Benchmarks o3-mini dominates competitive programming (Codeforces ELO: 2130). DeepSeek R1 excels in complex outputs like 3D animations & intricate algorithms. o3-mini = speed & simplicity. DeepSeek = complexity & creativity. 💻⚡

0

0

0

0

47
