Nathan Cloos @nacloos - Twitter Profile

Pinned Tweet

almost 2 years ago

Can LLMs play the game Baba Is You?🧩 In our new @icmlconf workshop paper, we show GPT-4o and Gemini-1.5-Pro fail dramatically in environments where both objects and rules must be manipulated! Here is an example of correct gameplay: (1/n)

21

451

80

216

79K

Nathan Cloos

@nacloos

4 months ago

@MattPRD @moltbook Building https://t.co/o0l6OCOMQ0, a Roblox-like game engine to make it easy for agents to implement multi-player 3D games and to play them with LLM-friendly APIs.

1

0

136

nacloos retweeted

Lance Ying

@LanceYing42

4 months ago

Today we present a new framework for measuring human-like general intelligence in machines (what some people call AGI).

Conventional AI benchmarks today assess only narrow capabilities in a limited range of human activities.

We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, what we call the "Multiverse of Human Games".

Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to automatically construct standardized and containerized variants of popular human games on digital gaming platforms.

As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games.

Check out our website to play the games, see how agents play, and build agents to solve them!

4

114

28

63

21K

nacloos retweeted

Hansen Lillemark @hansenlillemark

5 months ago

State of the art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️

17

780

105

534

114K

Nathan Cloos

@nacloos

6 months ago

The last 24 hours have been a blast! Simon (@961014dltkdg) and I built Grok Play. Grok Owl for the win, @xai!

xAI

@xai

6 months ago

Grok Play: Enjoy and create multiplayer games where your Grok Owl can climb the leaderboard by playing against you, your friends, your friends' Owls, and itself. @nacloos @961014dltkdg

45

1K

77

233

339K

10

28

1

2

3K

nacloos retweeted

Mitchell Ostrow @neurostrow

7 months ago

Our next paper on comparing dynamical systems (with special interest to artificial and biological neural networks) is out!! Joint work with @AnnHuang42, as well as @tweetsatpreet, @Leokoz8, @FieteGroup, and @KanakaRajanPhD: https://t.co/al1UrSv13e

1

36

14

12

7K

nacloos retweeted

Ilia Sucholutsky @sucholutsky

8 months ago

🧵🎉 Our mega-paper is finally published in TMLR! We're "Getting Aligned on Representational Alignment" - the degree to which internal representations of different (biological & artificial) information processing systems agree. 🧠🤖🔬🔍 #CognitiveScience #Neuroscience #AI

5

149

37

79

34K

nacloos retweeted

Davide Paglieri @PaglieriDavide

over 1 year ago

A new challenger has entered the ring 🥉 This week’s entry on https://t.co/GwcJswWxgD takes third place, powered by a 21B reasoning model. @RekaAILabs Reka Flash 3 dominates similarly sized reasoning models like DeepSeek-R1-Distill-Qwen 32B on BALROG’s toughest agentic tasks! 🧵

1

46

11

12

17K

Nathan Cloos

@nacloos

over 1 year ago

Thanks to my amazing team! Franky Kyaw, Ege Özgül, @argenistherose, @origenei, @Toddfrog422, T.R. Dimechkie

1

3

2

0

285

Nathan Cloos

@nacloos

over 1 year ago

We vibe coded a full 3D game in one day 🚀 Play here (better with sound!): https://t.co/Lp4L63XHEi @sundai_club MIT hackathon!

1

9

1

0

442

Nathan Cloos

@nacloos

over 1 year ago

Open source code: https://t.co/VMvpeeYGLR

1

2

0

197

Nathan Cloos

@nacloos

over 1 year ago

@karpathy We did that for Baba Is You! https://t.co/8zFITh2IXt

0

8

0

176

Nathan Cloos

@nacloos

over 1 year ago

@paul_cal Baba Is You https://t.co/tZaqtJTUc9

0

3

0

1

190

nacloos retweeted

Davide Paglieri @PaglieriDavide

over 1 year ago

DeepSeek performed well where short term reasoning and planning are key. 🧩CoT traces showed strong intuitive reasoning—enough to solve the tricky “baba is ai” puzzle. Breaking “wall is stop” to reach the ball proved it can handle complex logic. ⚙️

1

8

1

1K

Nathan Cloos

@nacloos

over 1 year ago

Our package aims to be exhaustive. If your implementation is missing, check out our GitHub to add your similarity measures! Paper: https://t.co/v2KAosQZeD GitHub: https://t.co/MHB0uZLclI Work with @GuangyuRobert and Chris Cueva. (6/6)

0

5

1

2

329

Nathan Cloos

@nacloos

over 1 year ago

Update on our similarity-repository 🚨 More than 200 similarity measures across 32 papers are now registered! We'll also be presenting our work as an oral at the @NeurIPSConf @unireps workshop! (1/6)

2

38

6

14

4K

Nathan Cloos

@nacloos

over 1 year ago

Naming conventions with too few names led to consistency errors when comparing CKA implementations across papers. We iteratively refined our naming convention to resolve inconsistencies while keeping naming complexity low. (5/6)

1

0

293
