Rishi Mehta @rishicomplex - Twitter Profile

4 days ago

@JeremyNguyenPhD yeah this is a fair point, writing good jokes is actually very hard and the human baseline here is very strong

0

29

Rishi Mehta

@rishicomplex

5 days ago

Made a little benchmark called RoastBench - it compares frontier models on their roast jokes. The models roast 10 personalities from Comedy Central roasts I enjoyed, and I manually rank their jokes. I also mark the ones that made me laugh. LLMs are way worse than top humans.

rishicomplex's tweet photo. https://t.co/kHRFcPy6Ry

6

24

1

4

3K

Rishi Mehta

@rishicomplex

5 days ago

@ehalm_ I think that's part of it but they also don't seem to understand what's funny

0

39

Rishi Mehta

@rishicomplex

5 days ago

you can check out all the jokes at https://t.co/jPx3Eb9r0O.

0

2

0

123

Rishi Mehta

@rishicomplex

5 days ago

The models can kind of figure out the beginnings of a setup but their punchlines just fall flat. It's like they don't yet have a good model for what causes a human to laugh.

rishicomplex's tweet photo. https://t.co/QRaUAQGady

1

3

0

1

391

Rishi Mehta

@rishicomplex

9 days ago

New opus! It's smarter, more reliable, and uses its tokens better.

Claude

@claudeai

9 days ago

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

claudeai's tweet photo. https://t.co/EufxL7T1kb

4K

67K

9K

8K

15M

3

16

1

0

1K

Rishi Mehta

@rishicomplex

18 days ago

@karpathy Welcome!

0

148

Rishi Mehta

@rishicomplex

29 days ago

Sara's team is awesome, apply if you're excited about aligning Claude!

Sara Price

@sprice354_

29 days ago

This is important and challenging work. If you are excited about contributing please consider applying - particularly by joining the Anthropic Fellows program!

2

40

2

0

3K

0

2

0

2

1K

Rishi Mehta

@rishicomplex

about 2 months ago

new opus in town

Claude

@claudeai

about 2 months ago

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.

claudeai's tweet photo. https://t.co/PtlRdpQcG5

5K

81K

10K

12K

14M

0

3

0

396

Rishi Mehta

@rishicomplex

about 2 months ago

what's up with the X algorithm it's just cycling the same 10 posts for me

0

1

0

237

rishicomplex retweeted

Nat McAleese @__nmca__

about 2 months ago

at long last we have built and chosen not to release the zero-day machine from the classic sci-fi tale “please do not release the zero-day machine”

__nmca__'s tweet photo. https://t.co/RPi8Nv7JxN

17

3K

151

327

130K

Rishi Mehta

@rishicomplex

about 2 months ago

Claude Mythos Preview is a substantial jump, especially on cyber, and requires a new kind of cooperation for society to adapt

Anthropic

@AnthropicAI

about 2 months ago

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. https://t.co/NQ7IfEtYk7

2K

44K

7K

16K

31M

0

8

0

449

Rishi Mehta

@rishicomplex

2 months ago

@sideboared @FakePsyho In the case of humans, per the quote in the paper it appears they can reset the action count

0

32

Rishi Mehta

@rishicomplex

2 months ago

@andreasorob @fchollet In the case of the human participants, from the quote in the paper it appears they can reset the action count in the middle of a game, which the AI can't do

0

2

0

120

Rishi Mehta

@rishicomplex

2 months ago

@fchollet according to your paper: "Participants were limited to a single attempt per environment and could not revisit previously completed levels. However, they were allowed to reset the current level at any time. In some cases, participants reset levels after reaching a solution in order to improve efficiency, though this typically increased total interaction time." So humans could play around with the task a bunch, and then just reset the game when they figured it out to get the optimal trajectory? Is AI allowed to do this?

François Chollet

@fchollet

2 months ago

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.

236

3K

340

725

622K

1

25

1

5

3K

Rishi Mehta

@rishicomplex

2 months ago

@RyanPGreenblatt Possibly not because it looks like they cheated by giving humans infinite retries https://t.co/28vaYFF32O

1

5

0

1

697