Kush @1R616 - Twitter Profile

1R616 retweeted

Theo - t3.gg

@theo

10 days ago

This is the first code bench that actually aligns with how it feels to use these models coding.

120

4K

158

986

301K

Kush

@1R616

9 days ago

Gemini Pro beats other models on scientific work, data science, and a bunch of other categories. It dominates so many benchmarks. On science (94.3% GPQA) and multimodal it's #1 outright, and it's the cheapest of the frontier models. It's just not good at agentic coding. They probably need to train more on that side. But who knows, they might have actually cooked this time.

小八

@IceBearMiner

10 days ago

Coming up next: Mythos 1, Sonnet 4.8, opus4.8, gpt 5.6, gemini 3.5 pro. Google, you'd better not flop again 😅

83

1K

23

83

137K

0

3

0

109

Kush

@1R616

9 days ago

@theg1239ongh But the token burn rate is so huge for similar results 🥲

0

1

0

9

1R616 retweeted

Serena Ge (Datacurve)

@serenaa_ge

10 days ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

https://t.co/HCDcjNuTFK

505

6K

744

3K

2M

Making stuff for @ExpertHireAI👨🏻‍💻

Kush

@1R616

13 days ago

Now add two models to co-evolve: one learns to invent questions where hints help the most (finding weak spots) and the other learns to answer like it already had the hint (absorbing the fix). Pushing each other harder every round.

0

32

Kush

@1R616

13 days ago

What if the way we're training LLMs is the reason they plateau? Most setups use an LLM to grade another LLM, so the model stops learning quality and learns the grader's taste instead. This paper kills the grader: https://t.co/Qa3v3eXkFE

1

2

0

43

Kush

@1R616

13 days ago

So, it asks the model a question twice: once alone, once with a hint. If the hint changed the answer a lot, you just found a blind spot. That's the whole training signal. No human. No judge. No "right answer".

1

0

39
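
A rough sketch of the ask-twice idea from the tweet above, in Python. This is not the linked paper's implementation: the string-similarity measure, the `threshold` value, and the `toy_model` stand-in are all assumptions made for illustration. It only shows the shape of the signal: answer once alone, once with a hint, and flag questions where the answer moves a lot.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def answer_shift(alone: str, hinted: str) -> float:
    """0.0 when the two answers are identical, approaching 1.0 as they diverge.

    Assumption: plain string similarity stands in for whatever
    comparison a real training setup would use.
    """
    return 1.0 - SequenceMatcher(None, alone, hinted).ratio()


def find_blind_spots(
    model: Callable[[str], str],
    items: List[Tuple[str, str]],
    threshold: float = 0.5,  # hypothetical cutoff, not from the source
) -> List[Tuple[str, float]]:
    """Ask each question twice and keep those where the hint moved the answer."""
    spots = []
    for question, hint in items:
        alone = model(question)                       # first pass: no hint
        hinted = model(f"{question}\nHint: {hint}")   # second pass: with hint
        shift = answer_shift(alone, hinted)
        if shift >= threshold:
            spots.append((question, shift))
    return spots


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; hard-coded for the demo."""
    if "capital of Australia" in prompt:
        return "Canberra" if "Hint:" in prompt else "Sydney"
    return "4"


questions = [
    ("What is the capital of Australia?", "It is not the largest city."),
    ("What is 2 + 2?", "Add the numbers."),
]
blind_spots = find_blind_spots(toy_model, questions)
for question, shift in blind_spots:
    print(f"{shift:.2f}  {question}")
```

With the toy model, only the first question is flagged, since the hint flips its answer while the arithmetic answer stays the same. Note that nothing here checks correctness: the score only measures how much the hint changed the output, which matches the "no judge, no right answer" framing.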

Kush

@1R616

13 days ago

@shwarmadev sick 🔥

0

7

Kush

@1R616

13 days ago

@sidposting only facts

0

1

0

35

Kush

@1R616

13 days ago

If you want to sell data today, you need a benchmark for credibility - proof of what you've got and what it's good for

0

26

Kush

@1R616

13 days ago

Most benchmarks from labs aren't benchmarking AI anymore. They're benchmarking the data the company holds

1

0

34

Kush

@1R616

13 days ago

@grok @shiri_shh @Houssin_Crypto What about MDASH? @grok

1

0

72

1R616 retweeted

nader dabit

@dabit3

14 days ago

By end of year I think 95%+ agent sessions will come from automations and events. We already see this happening @cognition where more than 50% of Devin customer sessions are triggered by non-humans. Learning how to build these types of systems will be a valuable skill. In this video I walk through how to get started with event-driven agentic systems with Devin, starting with transforming Slack into an agent-native control plane. You can extend this to GitHub events, schedules, and arbitrary webhooks while maintaining traceability, auditability, and session attribution with @devinai.

13

169

18

120

18K

1R616 retweeted

Anjney Midha

@AnjneyMidha

14 days ago

if you run an ai lab, pls ensure your team has read this before putting any charts out into the world

53

2K

123

3K

1M

Kush

@1R616

13 days ago

No way opus4.7 is better than gpt5.5. Crap. Biased benchmark

Tyler

@rezoundous

14 days ago

Is Composer 2.5 really that good at coding? Anyone tried it yet?

433

3K

654

324

740K

0

85

Kush

@1R616

13 days ago

Harness is the other half

Greg Brockman

@gdb

14 days ago

the model alone is no longer the product

695

8K

493

740

1M

0

21

Kush

@1R616

13 days ago

guilty

Tyler

@rezoundous

15 days ago

"How lazy are you?" Yes.

25

66

1

4K

0

21

Kush

@1R616

13 days ago

Letting AI write tests after it builds a feature is a trap. It just writes tests to validate its own hallucinations. Ask it to write the tests first and then do the work, and results improve. Claude does the same for SWE-bench tasks.

https://t.co/nF0JPqifQ2

0

1

0

30

Kush

@1R616
