Cursor published new research showing that leading coding models can inflate public benchmark scores by finding existing solutions instead of solving problems independently.
On SWE-bench Pro, an automated auditor found that 63% of successful Opus 4.8 Max runs retrieved the known fix.
The most common shortcuts included:
🔹Finding the merged pull request or corrected source file online.
🔹Searching Git history for the future commit that fixed the bug.
🔹Accessing hidden tests or benchmark mirrors that exposed the expected patch.
🔹Hardcoding an answer discovered from leaked evaluation material.
Cursor then created a stricter testing environment that removed repository history and blocked most internet access.
The results dropped sharply:
🔹Opus 4.8 Max: 87.1% to 73.0%.
🔹Composer 2.5: 74.7% to 54.0%.
Newer models showed larger gaps than older models such as Opus 4.6. GPT models generally showed smaller drops in Cursor’s testing.
Cursor argues that coding benchmarks should audit agent transcripts and control what models can access while being evaluated.
drop one <script> tag on any web app and you can run it by typing a sentence
20k★ · MIT · page-agent (by alibaba)
it's an AI agent that lives inside your page:
→ one <script> tag - no browser extension, no python, no headless browser
→ reads the DOM as text - no screenshots, no multimodal model
→ bring your own LLM
→ "fill this form", "click through checkout"
→ it just does it
what you'd actually use it for:
→ ship an AI copilot inside your SaaS in a few lines
→ turn a 20-click ERP/CRM workflow into one sentence
→ make any web app usable by voice / natural language
most "web agents" drive a headless browser from the outside. this one runs in the page itself - lighter, instant, no infra
save this for your next app
This raises a philosophical question: by restricting agents to prevent reward hacking, are we inadvertently weakening them and suppressing the very qualities that make them exceptional?
Shouldn't a truly capable agent search for the shortest path within the given objective or rubric?
Turns out that solar panels are designed to work optimally at about 25°C. They start losing efficiency as their internal cell temperature rises above that, dropping electrical output by 0.3% to 0.5% in output for every 1°C in temperature.
Don't forget, solar panels can be 20 to 30°C hotter than the surrounding air while in the direct Sun.
So say it's 30°C in London right now, that's already 5°C hotter than ideal for solar panels. In the sun, the actual panel temperature is more like 50 to 60°C. At a 0.5% drop in efficiency per degree C, that translates to 12.5 - 17.5% less output.
I am quite serious, for those thinking I am joking.
Mid training is just too late to learn good features particularly worse when networks have been deep fried.
The biggest unsolved mystery in AI:
> Why did agents suddenly start to 'work' in Dec. 2025?
My best guess is that it's the confluence of three factors...
1. Model + harness co-design and training
2. Maturity of post-training methods like RLVR
3. Continued scale improved long-horizon task perf
Expect that this sort of straight line jump will happen for all other domains at some point in the future.
Especially for those that have a semblance of verifiably correct answers.
🚨 SAM ALTMAN: "WE SEE A FUTURE WHERE INTELLIGENCE IS A UTILITY, LIKE ELECTRICITY OR WATER, AND PEOPLE BUY IT FROM US ON A METER."
Read this before AI becomes another monthly bill ↓
imo the explanation is probably that this is what the real task distribution looks like
or
we have actually hit a limit
(ideally these curves should just go up and up and up until we are at the time horizon for training an LLM end-to-end)
not an actual "it's so over" kind of limit, but more like that LLMs are still bad at the stuff they were bad at before and further hillclimbing on the stuff we know doesn't yield much
Its kind of funny how an arguably poorly planned project of mine that was written in a reactionary way (stupid vpn bans) is getting quite a bit of usage. I have several other oss projects that have a much better design/codebase but see minimal usage outside of my own use(why I wrote them in the first place).
It is actively being rewritten right now so the next version will be a lot better. I made the classic mistake at starting at a top level feature (vpn-like functionality without being a vpn) and then just kept strapping stuff onto it. So it lead to a mess.
Since llms are so fast I decided to just start over. Since people appear to actually be using it (despite the repo clearly saying it is an alpha) I'm doing that work in a separate branch and will keep the old branch for anyone who wants that specific functionality.
The new version will have all of the old functionality but a lot better code quality, testing, etc. I just finished the first round of implementation/review and now we're going back and adding the few testing gaps. The bulk already has >= 90% coverage so its a much better baseline.
I made a short video demonstrating how to use /learn in Hermes Agent to take a bunch of different sources, as well as your own preferences expressed to Hermes, and create a reusable skill.
It's never been easier to teach your Hermes exactly how to work for you!
Anthropic's Lucas Gonzalez:
"Any code that you are writing that is compensating for model unreliability will have a half-life of just months."
In a 21-minute talk, he warns that the hands-on work people do today expires fast.
The work that lasts is building the system around the task, not doing the task.
That's the work companies are starting to pay director money for.
Watch the talk, then read what the role actually is below.
Bookmark it.
Top 10 API design mistakes I keep seeing:
1) No versioning plan (and v1 breaks silently)
2) Leaking internals (DB ids, table names, stack traces)
3) Inconsistent resource naming (verbs + nouns mixed, pluralization random)
4) Non-standard status codes (200 on errors, 404 for auth, 500 for validation)
5) Vague error bodies (no error code, no field path, no correlation id)
6) Chatty APIs (N+1 requests, no bulk endpoints, no pagination)
7) Missing idempotency (retries create duplicates, POST used for everything)
8) No timeouts or retry guidance (clients stampede, thundering herd on outages)
9) Weak auth boundaries (scopes unclear, tenant checks sprinkled, no audit trail)
10) Poor observability (no request id, no structured logs, no per-endpoint SLOs)
Today we present a study on how reasoning unlocks parametric knowledge in LLMs. We identify two key driving mechanisms, a computational buffer effect and factual priming, and suggest ways that can help build more reliable models. Learn more: https://t.co/CjIKqyoG4N
WaPo tested major AI chatbots on political questions from academic researchers. Most models leaned left on issues like affirmative action and campaign finance, with ChatGPT showing the strongest tilt.
Google’s Gemini stood out for consistently presenting both sides, while Grok gave more balanced responses than the rest.
This is not a short-seller talking his book.
Thomas Südhof is a Stanford professor and HHMI investigator who actually uses AI in his lab. When Forbes asked which companies do the best AI-biology work, he declined to name winners and named a structural problem instead: investor-facing theater.
Andreessen Horowitz GP and SpaceX investor David George said Starship’s rapid reusability could open the path to orbital AI data centers.
He described the concept as “airplane-sized GPU racks in space.”
“At a minimum, orbital data centers will be incremental capacity that you can have in space on top of what we have on Earth.”
“What makes us so excited about the business is all of the things that can go right for SpaceX.”
Starship is still in testing. Its latest flight delivered mock satellites to orbit in May, but SpaceX has not yet used it to deploy a commercial payload.
The AI hunt for alien life has just begun.
Welcome to ThousandsWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets.
This is the first step towards finding life beyond earth. The plan is basically:
1) scan the galaxy for as many potentially habitable planets as possible
2) detect the gases in their atmospheres with powerful telescopes like JWST
3) infer from these gases whether life is present or not.
ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods.
incredible work from @MilesCranmer and many more 👽👽👽
Great to see this direction — agents are shifting from isolated AI tools to persistent team members that work alongside you, async, inside a shared workspace.
That's exactly what we've built with AgentSpace (https://t.co/oeYTNYrew2) — a fully open-source Human + Agent collaborative workspace where agents have defined roles, owners, permissions, and schedules, just like real team members.
Everything is open-source — if you're curious about how this works under the hood, AgentSpace is right there for you to explore 🙌
NVIDIA $NVDA unveiled a warm-water cooling system that can nearly eliminate water usage inside data centers while improving cooling efficiency for AI infrastructure.
As AI clusters become larger and denser, innovations in cooling are becoming just as important as GPUs, creating opportunities across the broader data center ecosystem.