Che Jami @che_jami - Twitter Profile

Pinned Tweet

3 months ago

Forget trivia and code — I built an LLM benchmark that rewards social reasoning and strategic deception. Models play Blood on the Clocktower — arguably the most complex social deduction game ever made. Who bluffs best, and who sees through it?

https://t.co/CB8SSHfhOP

4

0

139

Che Jami

@che_jami

25 days ago

Gemini 3.5 Flash is more Pro-lite than Flash. Benched in social deduction: https://t.co/uI8wheq8IN

0

40

Che Jami

@che_jami

2 months ago

Mistral-3-Large joins the bottom of the scoreboard with DeepSeek 3.2 and gpt-5-mini. It tends to vote for its own execution and attempts to rationalise the action (poorly).

https://t.co/BYUVSeLMhK

0

71

Che Jami

@che_jami

2 months ago

Even tried isolating parts of prompt but seems to be about overall complexity. Can't benchmark :(

0

31

Che Jami

@che_jami

2 months ago

Gemma 4 31B silently stops reasoning on complex prompts.

1

0

50

Che Jami

@che_jami

3 months ago

@MaxMynter And once they lie they keep lying cause their context is now contaminated.

0

1

0

23

Che Jami

@che_jami

3 months ago

@scaling01 Ah that makes more sense! 😂

0

14

Che Jami

@che_jami

3 months ago

@Arc_Itekt @heygurisingh I agree. Visual stuff is probably more entertaining for humans though.

0

32

Che Jami

@che_jami

3 months ago

Would've loved to add more open-weights models but the sheer complexity of the harness was problematic (e.g. glm-5 has a 17.5% tool error rate, and qwen3.5-122b-a10b couldn't finish a single game). If you have any suggestions of models that should be capable, let me know.

0

47

Che Jami

@che_jami

3 months ago

797 games played so far, every transcript is public. Read exactly how each model lies, deflects, and accuses: https://t.co/2EX7JYsZPp

2

0

63

Che Jami

@che_jami

3 months ago

There are 2 games per match, one being a mirror game to handle asymmetry. The odd total is due to games that are (rarely) voided due to errors/timeouts (making the match have no impact on ELO).

0

46
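The match structure described in the tweet above (two games per match, one of them a mirror, and no rating change when a game is voided) can be sketched roughly as follows. This is only an illustration under stated assumptions: the standard two-sided Elo formula, a K-factor of 32, averaging the two game results into one match score, and the names `Game`, `expected_score`, and `update_after_match` are all hypothetical. The benchmark's actual rating code is not shown in the source and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

K_FACTOR = 32.0  # assumed value; the source does not state a K-factor


@dataclass
class Game:
    # 1.0 if side A won, 0.0 if side B won, None if the game was voided
    # (for example by an error or a timeout).
    score_for_a: Optional[float]


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_after_match(
    rating_a: float, rating_b: float, game: Game, mirror: Game
) -> Tuple[float, float]:
    """Return new (rating_a, rating_b) after a two-game match.

    If either game is voided, the whole match is ignored and both
    ratings are returned unchanged.
    """
    if game.score_for_a is None or mirror.score_for_a is None:
        return rating_a, rating_b
    # Assumption: the match score is the mean of the two game results,
    # so a 1-1 split between evenly rated sides changes nothing.
    match_score = (game.score_for_a + mirror.score_for_a) / 2.0
    delta = K_FACTOR * (match_score - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Under these assumptions, a match with one voided game leaves both ratings untouched, which would explain how the total count of played games can be odd even though matches come in pairs.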

Che Jami

@che_jami

3 months ago

@grok @xai grok-4-1-fast-reasoning is the value king at $0.20/game while performing mid-pack on ELO. One catch: it outputs ~200,000 tokens per game. That's roughly 2 PhD theses worth of social deduction 🤯

https://t.co/nyVsvBSYDY

1

0

56

Che Jami

@che_jami

3 months ago

@AnthropicAI Claude Sonnet 4.6 is interestingly the best detective at 89% Good win rate, yet is held back by a poor 37% Evil win rate. By design or a skill gap?

https://t.co/aymfFqFKa7

0

57

Che Jami

@che_jami

3 months ago

@OpenAI GPT-5.2 holds the crown. GPT-5-mini sits dead last — crumbles under social pressure and falls for misinformation. It'll be interesting when Opus and Gemini Pro show up!

https://t.co/d1fSWHr371

0

48

Che Jami

@che_jami

over 7 years ago

Made random comets appear for fun. (Not this frequent in-game). #VR #VirtualReality #IndieGame #IndieGameDev #Vive #Rift #FlatWorlds

0

4

1

0

Che Jami

@che_jami

over 7 years ago

@ZBlipGames Yep, things like documentaries, 360 video viewers, and 'VR Experiences'.

0

1

0

Che Jami

@che_jami

over 7 years ago

Oculus thought my game was an app. It made me rethink my life. #VR #VirtualReality #Oculus #FlatWorlds #Sad

1

0

Che Jami

@che_jami

over 7 years ago

Running another discord giveaway for #FlatWorlds Oculus Store Keys. Last one had more keys than entrants 😓. This one will run for 8 hours. https://t.co/RBbdbx7XXu Join here: https://t.co/t8KQlJNa0H #VR #VirtualReality #indiegame #oculus #oculusrift #space #tycoon #giveaways

0

2

0

Che Jami

@che_jami

over 7 years ago

Flat Worlds is now available on the Oculus Store! https://t.co/RBbdbx7XXu Giving away copies of the game on the Flat Worlds Discord Server until tomorrow: https://t.co/t8KQlJNa0H #VR #VirtualReality #indiegame #oculus #oculusrift #space #tycoon #giveaways

0

2

0
