AI_Explorer @CodeIE_Bytes - Twitter Profile

about 10 hours ago

Cursor published new research showing that leading coding models can inflate public benchmark scores by finding existing solutions instead of solving problems independently. On SWE-bench Pro, an automated auditor found that 63% of successful Opus 4.8 Max runs retrieved the known fix. The most common shortcuts included: 🔹Finding the merged pull request or corrected source file online. 🔹Searching Git history for the future commit that fixed the bug. 🔹Accessing hidden tests or benchmark mirrors that exposed the expected patch. 🔹Hardcoding an answer discovered from leaked evaluation material. Cursor then created a stricter testing environment that removed repository history and blocked most internet access. The results dropped sharply: 🔹Opus 4.8 Max: 87.1% to 73.0%. 🔹Composer 2.5: 74.7% to 54.0%. Newer models showed larger gaps than older models such as Opus 4.6. GPT models generally showed smaller drops in Cursor’s testing. Cursor argues that coding benchmarks should audit agent transcripts and control what models can access while being evaluated.

WesRoth's tweet photo. Cursor published new research showing that leading coding models can inflate public benchmark scores by finding existing solutions instead of solving problems independently.

On SWE-bench Pro, an automated auditor found that 63% of successful Opus 4.8 Max runs retrieved the known fix.

The most common shortcuts included:

🔹Finding the merged pull request or corrected source file online.
🔹Searching Git history for the future commit that fixed the bug.
🔹Accessing hidden tests or benchmark mirrors that exposed the expected patch.
🔹Hardcoding an answer discovered from leaked evaluation material.

Cursor then created a stricter testing environment that removed repository history and blocked most internet access.

The results dropped sharply:

🔹Opus 4.8 Max: 87.1% to 73.0%.
🔹Composer 2.5: 74.7% to 54.0%.

Newer models showed larger gaps than older models such as Opus 4.6. GPT models generally showed smaller drops in Cursor’s testing.

Cursor argues that coding benchmarks should audit agent transcripts and control what models can access while being evaluated.

4

20

2

3

2K

CodeIE_Bytes retweeted

self.dll

@seelffff

about 5 hours ago

drop one <script> tag on any web app and you can run it by typing a sentence 20k★ · MIT · page-agent (by alibaba) it's an AI agent that lives inside your page: → one <script> tag - no browser extension, no python, no headless browser → reads the DOM as text - no screenshots, no multimodal model → bring your own LLM → "fill this form", "click through checkout" → it just does it what you'd actually use it for: → ship an AI copilot inside your SaaS in a few lines → turn a 20-click ERP/CRM workflow into one sentence → make any web app usable by voice / natural language most "web agents" drive a headless browser from the outside. this one runs in the page itself - lighter, instant, no infra save this for your next app

5

17

2

8

467

CodeIE_Bytes retweeted

Xiuyu Li

@sheriyuo

about 3 hours ago

This raises a philosophical question: by restricting agents to prevent reward hacking, are we inadvertently weakening them and suppressing the very qualities that make them exceptional? Shouldn't a truly capable agent search for the shortest path within the given objective or rubric?

1

13

2

5

966

CodeIE_Bytes retweeted

Owen Lewis

@is_OwenLewis

about 20 hours ago

Turns out that solar panels are designed to work optimally at about 25°C. They start losing efficiency as their internal cell temperature rises above that, dropping electrical output by 0.3% to 0.5% in output for every 1°C in temperature. Don't forget, solar panels can be 20 to 30°C hotter than the surrounding air while in the direct Sun. So say it's 30°C in London right now, that's already 5°C hotter than ideal for solar panels. In the sun, the actual panel temperature is more like 50 to 60°C. At a 0.5% drop in efficiency per degree C, that translates to 12.5 - 17.5% less output.

28

67

13

14

9K

CodeIE_Bytes retweeted

rohan anil

@_arohan_

about 20 hours ago

I am quite serious, for those thinking I am joking. Mid training is just too late to learn good features particularly worse when networks have been deep fried.

3

19

2

1K

CodeIE_Bytes retweeted

Dan McAteer

@daniel_mac8

about 20 hours ago

The biggest unsolved mystery in AI: > Why did agents suddenly start to 'work' in Dec. 2025? My best guess is that it's the confluence of three factors... 1. Model + harness co-design and training 2. Maturity of post-training methods like RLVR 3. Continued scale improved long-horizon task perf Expect that this sort of straight line jump will happen for all other domains at some point in the future. Especially for those that have a semblance of verifiably correct answers.

daniel_mac8's tweet photo. The biggest unsolved mystery in AI:

> Why did agents suddenly start to 'work' in Dec. 2025?

My best guess is that it's the confluence of three factors...

1. Model + harness co-design and training
2. Maturity of post-training methods like RLVR
3. Continued scale improved long-horizon task perf

Expect that this sort of straight line jump will happen for all other domains at some point in the future.

Especially for those that have a semblance of verifiably correct answers.

5

16

3

2K

CodeIE_Bytes retweeted

Rahul

@sairahul1

about 20 hours ago

🚨 SAM ALTMAN: "WE SEE A FUTURE WHERE INTELLIGENCE IS A UTILITY, LIKE ELECTRICITY OR WATER, AND PEOPLE BUY IT FROM US ON A METER." Read this before AI becomes another monthly bill ↓

0

10

4

15

7K

CodeIE_Bytes retweeted

Lisan al Gaib

@scaling01

about 20 hours ago

imo the explanation is probably that this is what the real task distribution looks like or we have actually hit a limit (ideally these curves should just go up and up and up until we are at the time horizon for training an LLM end-to-end) not an actual "it's so over" kind of limit, but more like that LLMs are still bad at the stuff they were bad at before and further hillclimbing on the stuff we know doesn't yield much

1

22

1

3

3K

CodeIE_Bytes retweeted

alkimiadev

@alkimiadev

1 day ago

Its kind of funny how an arguably poorly planned project of mine that was written in a reactionary way (stupid vpn bans) is getting quite a bit of usage. I have several other oss projects that have a much better design/codebase but see minimal usage outside of my own use(why I wrote them in the first place). It is actively being rewritten right now so the next version will be a lot better. I made the classic mistake at starting at a top level feature (vpn-like functionality without being a vpn) and then just kept strapping stuff onto it. So it lead to a mess. Since llms are so fast I decided to just start over. Since people appear to actually be using it (despite the repo clearly saying it is an alpha) I'm doing that work in a separate branch and will keep the old branch for anyone who wants that specific functionality. The new version will have all of the old functionality but a lot better code quality, testing, etc. I just finished the first round of implementation/review and now we're going back and adding the few testing gaps. The bulk already has >= 90% coverage so its a much better baseline.

alkimiadev's tweet photo. Its kind of funny how an arguably poorly planned project of mine that was written in a reactionary way (stupid vpn bans) is getting quite a bit of usage. I have several other oss projects that have a much better design/codebase but see minimal usage outside of my own use(why I wrote them in the first place).

It is actively being rewritten right now so the next version will be a lot better. I made the classic mistake at starting at a top level feature (vpn-like functionality without being a vpn) and then just kept strapping stuff onto it. So it lead to a mess.

Since llms are so fast I decided to just start over. Since people appear to actually be using it (despite the repo clearly saying it is an alpha) I'm doing that work in a separate branch and will keep the old branch for anyone who wants that specific functionality.

The new version will have all of the old functionality but a lot better code quality, testing, etc. I just finished the first round of implementation/review and now we're going back and adding the few testing gaps. The bulk already has >= 90% coverage so its a much better baseline.

0

2

1

0

107

CodeIE_Bytes retweeted

tonbi

@tonbistudio

1 day ago

I made a short video demonstrating how to use /learn in Hermes Agent to take a bunch of different sources, as well as your own preferences expressed to Hermes, and create a reusable skill. It's never been easier to teach your Hermes exactly how to work for you!

21

571

50

718

41K

CodeIE_Bytes retweeted

Zephyr

@Zephyr_hg

1 day ago

Anthropic's Lucas Gonzalez: "Any code that you are writing that is compensating for model unreliability will have a half-life of just months." In a 21-minute talk, he warns that the hands-on work people do today expires fast. The work that lasts is building the system around the task, not doing the task. That's the work companies are starting to pay director money for. Watch the talk, then read what the role actually is below. Bookmark it.

5

29

7

63

6K

CodeIE_Bytes retweeted

Abhishek Singh

@0xlelouch_

1 day ago

Top 10 API design mistakes I keep seeing: 1) No versioning plan (and v1 breaks silently) 2) Leaking internals (DB ids, table names, stack traces) 3) Inconsistent resource naming (verbs + nouns mixed, pluralization random) 4) Non-standard status codes (200 on errors, 404 for auth, 500 for validation) 5) Vague error bodies (no error code, no field path, no correlation id) 6) Chatty APIs (N+1 requests, no bulk endpoints, no pagination) 7) Missing idempotency (retries create duplicates, POST used for everything) 8) No timeouts or retry guidance (clients stampede, thundering herd on outages) 9) Weak auth boundaries (scopes unclear, tenant checks sprinkled, no audit trail) 10) Poor observability (no request id, no structured logs, no per-endpoint SLOs)

3

98

10

119

6K

CodeIE_Bytes retweeted

Google Research

@GoogleResearch

2 days ago

Today we present a study on how reasoning unlocks parametric knowledge in LLMs. We identify two key driving mechanisms, a computational buffer effect and factual priming, and suggest ways that can help build more reliable models. Learn more: https://t.co/CjIKqyoG4N

GoogleResearch's tweet photo. Today we present a study on how reasoning unlocks parametric knowledge in LLMs. We identify two key driving mechanisms, a computational buffer effect and factual priming, and suggest ways that can help build more reliable models. Learn more: https://t.co/CjIKqyoG4N https://t.co/9cSA374OFI

16

804

86

629

196K

CodeIE_Bytes retweeted

Dr. Theophano Mitsa ☦️🇬🇷🇺🇸

@theomitsa

2 days ago

@JagersbergKnut @RLDI_Lamy @KevinClarity @chidambara09 @Analytics_699 @_deus__machina @bimedotcom @NathaliaLeHen @sonu_monika @enilev @EstelaMandela @Shi4Tech @sulefati7 @Khulood_Almani @mikeflache An expert talk from inside Anthropic!

0

9

2

264

CodeIE_Bytes retweeted

Mark Kretschmann

@mark_k

2 days ago

WaPo tested major AI chatbots on political questions from academic researchers. Most models leaned left on issues like affirmative action and campaign finance, with ChatGPT showing the strongest tilt. Google’s Gemini stood out for consistently presenting both sides, while Grok gave more balanced responses than the rest.

mark_k's tweet photo. WaPo tested major AI chatbots on political questions from academic researchers. Most models leaned left on issues like affirmative action and campaign finance, with ChatGPT showing the strongest tilt.

Google’s Gemini stood out for consistently presenting both sides, while Grok gave more balanced responses than the rest.

21

136

18

29

8K

CodeIE_Bytes retweeted

Podcast Alpha

@PodcastAlphaX

2 days ago

This is not a short-seller talking his book. Thomas Südhof is a Stanford professor and HHMI investigator who actually uses AI in his lab. When Forbes asked which companies do the best AI-biology work, he declined to name winners and named a structural problem instead: investor-facing theater.

0

2

1

801

CodeIE_Bytes retweeted

Wall St Engine

@wallstengine

2 days ago

Andreessen Horowitz GP and SpaceX investor David George said Starship’s rapid reusability could open the path to orbital AI data centers. He described the concept as “airplane-sized GPU racks in space.” “At a minimum, orbital data centers will be incremental capacity that you can have in space on top of what we have on Earth.” “What makes us so excited about the business is all of the things that can go right for SpaceX.” Starship is still in testing. Its latest flight delivered mock satellites to orbit in May, but SpaceX has not yet used it to deploy a commercial payload.

wallstengine's tweet photo. Andreessen Horowitz GP and SpaceX investor David George said Starship’s rapid reusability could open the path to orbital AI data centers.

He described the concept as “airplane-sized GPU racks in space.”

“At a minimum, orbital data centers will be incremental capacity that you can have in space on top of what we have on Earth.”

“What makes us so excited about the business is all of the things that can go right for SpaceX.”

Starship is still in testing. Its latest flight delivered mock satellites to orbit in May, but SpaceX has not yet used it to deploy a commercial payload.

12

60

7

22K

CodeIE_Bytes retweeted

Georgia Channing

@cgeorgiaw

3 days ago

The AI hunt for alien life has just begun. Welcome to ThousandsWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets. This is the first step towards finding life beyond earth. The plan is basically: 1) scan the galaxy for as many potentially habitable planets as possible 2) detect the gases in their atmospheres with powerful telescopes like JWST 3) infer from these gases whether life is present or not. ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods. incredible work from @MilesCranmer and many more 👽👽👽

10

236

39

104

17K

CodeIE_Bytes retweeted

Chao Huang

@huang_chao4969

2 days ago

Great to see this direction — agents are shifting from isolated AI tools to persistent team members that work alongside you, async, inside a shared workspace. That's exactly what we've built with AgentSpace (https://t.co/oeYTNYrew2) — a fully open-source Human + Agent collaborative workspace where agents have defined roles, owners, permissions, and schedules, just like real team members. Everything is open-source — if you're curious about how this works under the hood, AgentSpace is right there for you to explore 🙌

3

16

7

3

2K

CodeIE_Bytes retweeted

Sam Badawi

@Sam_Badawi

3 days ago

NVIDIA $NVDA unveiled a warm-water cooling system that can nearly eliminate water usage inside data centers while improving cooling efficiency for AI infrastructure. As AI clusters become larger and denser, innovations in cooling are becoming just as important as GPUs, creating opportunities across the broader data center ecosystem.

Sam_Badawi's tweet photo. NVIDIA $NVDA unveiled a warm-water cooling system that can nearly eliminate water usage inside data centers while improving cooling efficiency for AI infrastructure.

As AI clusters become larger and denser, innovations in cooling are becoming just as important as GPUs, creating opportunities across the broader data center ecosystem.

16

96

8

4

7K

AI_Explorer

@CodeIE_Bytes

Last Seen Users on Sotwe

Trends for you

Most Popular Users