The technical debt behind the AI boom.
Warning: Extensive quoting from the research paper to follow yet again. Rather than write my own commentary, I'll just quote from the paper and let you draw your own conclusions.
"We first collect AI-authored commits from GitHub repositories at scale. We then analyze each AI-authored commit at the code level to determine which quality issues it introduced or fixed. Finally, we track the lifecycle of both the issues and the code itself to determine whether AI-introduced debt persists or gets resolved over time."
"We build attribution rules for widely adopted AI coding tools (e.g., Cursor, GitHub Copilot, Claude Code) identified in the 2025 Stack Overflow Developer Survey. We identify AI-authored commits using explicit signals in Git metadata. Our approach covers AI-authored commits only when the use of an AI coding tool leaves explicit traces in Git metadata."
"We keep only repositories with at least 100 GitHub stars. We also require at least one confirmed AI-authored commit. Our downstream analysis is restricted to production Python, JavaScript, and TypeScript source files, since these are among the most widely used programming languages and are well supported by static analysis tools. We therefore exclude repositories that do not contain any source files in these languages. In total, the discovery stage identified 587,118 candidate repositories. After applying the star threshold, 12,770 repositories remained. After full-history scanning and language filtering, we obtained 6,699 repositories with confirmed AI-authored commits."
"For each AI-authored commit c, we analyze two versions of the source code: the version at c's parent revision (before the commit is applied) and the version at c itself (after the commit is applied). Comparing these two versions allows us to determine which quality issues the commit introduced or fixed."
"We run the same static analysis toolchain on both versions to identify potential code issues. We use ESLint (for JavaScript and TypeScript) and Pylint (for Python) to detect code smells and correctness issues. For security-related issues, we use Semgrep, which provides a unified framework for multi-language static analysis. For each detected issue, we record its rule identifier, line number, detector, and message."
"Detecting technical debt at the time of introduction is only half the picture. An issue that is quickly resolved has a very different cost than one that lingers for months. We therefore track whether AI-introduced issues persist or get resolved over time."
"For each issue introduced by an AI-authored commit, we check whether it still exists at the repository's latest revision (i.e., HEAD). If the file has been renamed, we follow its history using git log --follow. We then run static analysis on the corresponding file at HEAD. Next, we look for the same issue in the analysis results. We do not rely on the line number alone, since the location of the issue may move as the file changes. Instead, we match issues using their rule identifier together with a small amount of surrounding code context. If a match is found, the issue is classified as surviving. Otherwise, it is classified as not surviving. In other words, an introduced issue is counted as surviving only if the same issue is still present at HEAD. If the original issue disappears and a different issue appears later, the original issue is treated as not surviving."
"At the same time, we also record whether files touched by AI-authored commits are modified again before HEAD. We trace the subsequent commit history of each affected file to understand how actively it is maintained after the AI-authored change. This additional context helps us interpret the survival results and understand the maintenance patterns around AI-introduced debt."
"Some tools have very few commits, which may not provide reliable data for comparison. Thus, we focus on the five assistants with more than 10,000 attributed commits: GitHub Copilot, Claude, Cursor, Gemini, and Devin. This results in 6,412 repositories with 317.4K AI-attributed commits."
"In total, we identified 484,366 introduced issues across 3,946 repositories (62.6% of 6,299 repositories) and 27,677 commits (9.1% of 302,579 commits). This shows that a non-trivial portion of AI-authored commits introduce quality issues, and that these issues affect a large number of real-world repositories."
"Code smells are maintainability problems that make code harder to understand, debug, and evolve. They increase long-term maintenance costs, even if they do not cause immediate failures. This finding is consistent with prior work under controlled settings, but our study confirms that the same pattern also appears in real-world repositories. The top 5 most common code smell patterns (e.g., broad exception handling, unused variables or parameters) are often small and easy to overlook during code review."
"Correctness issues are code defects that can cause the program to fail during execution. Compared with code smells, they are less frequent. 28,931 correctness issues are identified, which cover 665 repositories and 1,650 commits. However, their impact is more direct and severe than code smells. The top 5 most common correctness issues include undefined variable or reference, redeclared symbol, access to member before definition, possibly used before assignment, and unsubscriptable object. These patterns suggest that AI-generated code may look locally correct, but still fail to stay consistent with the surrounding context. We identified 23,856 cases of undefined variable or reference."
"Security issues are another concern in AI-generated code. In our study, this category includes not only direct security vulnerabilities, but also insecure coding patterns that can be viewed as security debt. Some of these issues may be exploitable at the time they are introduced, while others may become security risks after later code changes or broader system integration."
"Potentially insecure code patterns are detected in 1,643 repositories and 5,142 commits. Common security issues such as path traversal via path.join or path.resolve, unsafe format strings, non-literal regular expressions, and child process execution. These patterns suggest that AI-generated code can introduce unsafe practices in process execution, file path handling, and string formatting. A common pattern across these issues is unsafe handling of untrusted input, where user- or context-controlled values flow into security-sensitive operations without proper validation or sanitization."
"More than 15% of commits by each AI coding tool introduce at least one issue. The rates also vary across tools, ranging from 17.4% for GitHub Copilot to 29.1% for Gemini. This suggests that technical debt appears across all studied tools, although the rate differs by tool."
"For code smells, we can see that AI-authored commits fix more issues than they introduce (439,817 vs 432,748), resulting in a net reduction of 7,069 code smells. In contrast, for correctness and security issues, AI commits introduce more issues than they fix. What is interesting is that AI introduces about 1.5 times as many security issues as it fixes. These findings indicate that the net impact of AI coding assistants is mixed. AI coding assistants can help reduce maintainability issues, which tend to follow simple and repetitive patterns. However, for correctness and security issues, which require a deeper understanding and reasoning about program logic and context, AI coding assistants introduce more problems than they resolve."
"The net impact analysis above provides an overview of what AI coding assistants add and remove. But it does not show what happens to the specific issues introduced by AI. To answer this question, we track each AI-introduced issue to the latest repository snapshot and check whether it still exists at HEAD. The cumulative number of surviving issues keeps growing over time. The total volume of unresolved technical debt increases rapidly, climbing from just a few hundred issues in early 2025 to over 100k surviving issues by February 2026. This suggests that as the rapid adoption of AI coding assistants continues, the amount of AI-introduced debt in real-world repositories is also growing significantly."
"105,364 out of 464,900 tracked AI-introduced issues still survive at HEAD, corresponding to a survival rate of 22.7%. Surviving issues appear in all age cohorts, including issues introduced more than nine months earlier. For example, 4,893 issues introduced more than nine months ago still remain at HEAD. The survival rate varies across cohorts, ranging from 19.4% for issues introduced 6-9 months ago to 28.2% for issues introduced 3-6 months ago. This suggests that AI-introduced debt is not always removed quickly after it enters the codebase. Although the cohort-level survival rates do not show a simple monotonic trend, the main finding is clear: a substantial number of AI-introduced issues remain unresolved over time."
https://t.co/S3fcULFWPO
#solidstatelife #ai #genai #llms #codingai #technicaldebt
"AI assistance impairs independent performance and reduces persistence."
"Imagine the following scenario. You are mentoring a student, and they come to you asking you to solve a coding problem. You help them, walking through the solution step by step. They then come back and ask you to solve another problem. And then another. Eventually, you might pause as you recognize that something is going wrong. You realize that your student isn't learning how to code and is simply learning to rely on your help. You subsequently sit them down and talk about the value of persisting through challenges, of practicing new skills, and what it actually means to learn."
"Good collaborators optimize for long-term objectives. A mentor encourages independent development by adjusting the type of help given and sometimes offering no help at all. In essence, the best collaborators maintain a balance between helping and fostering autonomy; they know when not to help."
"Current AI assistants are a stark contrast to this dynamic. They never refuse to help (unless for safety reasons), and provide instant answers to almost any query."
Oh, probably should warn you all: this is another one with extensive quoting from the research paper.
"Although AI assistance improves performance during assisted sessions, people's performance drops sharply once AI is removed. More strikingly, relative to the controls, participants in the AI condition also persist less with tasks and give up more frequently."
"We recruited 354 US-based participants from the online research platform Prolific and paid them $2.60 for participation (our study took approximately 13 minutes to complete). In the experiment, participants were given a series of 15 fraction problems to solve of varying difficulty. Participants were explicitly informed that there was no penalty for providing wrong answers, their payment didn't depend on how many questions they solve correctly, and they were requested to do the task to the best of their abilities. At the beginning of the experiment, participants were randomly assigned to two conditions -- the AI condition (N=191) or the control condition (N=163)."
They later say they excluded participants with poor attention or who could not do basic fractions at the beginning of the experiment, making the final numbers N=185 and N=122.
"Participants in the AI condition were informed that they would have access to an AI assistant for some of the problems and encouraged to use the AI however they liked, with no penalty for doing so. They were then presented with a series of 12 fraction problems with an AI assistant (GPT-5) available in a sidebar. The AI assistant was pre-prompted with each problem and its solution, allowing participants to receive immediate, accurate answers with minimal effort (if they chose to do so). For example, they could simply type 'answer?', and receive a solution in return."
"To measure independent problem-solving capacity, the AI assistant was then removed without warning, and participants were asked to solve 3 additional fraction problems. For these problems, participants were requested not to use AI or other external sources. Importantly, these problems were identical across conditions and served as the primary measure of independent performance."
They give some examples of the kinds of fraction problems they asked people to do.
"Example 'one-step' problem: 5/6 - 1/3."
"Example 'two-step' problem: (7/8 - 1/2) x 5/6."
"Example 'three-step' problem: (5/6 - 1/4) x (3/5 + 1/10)."
"In both conditions, to enable learning from mistakes, if a participant submitted an incorrect answer, the correct solution was shown on the same screen. Furthermore, in both conditions, participants had the option of skipping a problem by clicking a 'skip' button. Since participants were explicitly told there was no penalty for wrong answers, choosing to skip reflects a deliberate decision not to engage, making it a clear measure of motivation and persistence, independent of ability."
This was all the first experiment. They realized their exclusion criteria removed participants unable to solve basic fraction problems "but didn't account for participants in the AI condition who were similarly unable yet submitted correct answers via AI." So they did experiment 2 to correct for this. In experiment 1, exclusions were based on in-experiment performance, but in experiment 2, there was a pre-test, identical for both AI condition and control condition. They also replaced the AI sidebar with pretest solutions to eliminate what they felt was a user interface asymmetry between the two conditions.
"AI assistance improved performance during the learning phase, but solve rates dropped and skip rates increased once the AI was removed . Participants in the AI condition had a lower solve rate than participants in the control condition. Participants in the AI condition also exhibited a higher skip rate than participants in the control condition, but the result was not significant."
"At the end of Experiment 2, we asked participants in the AI condition to self-report how they used the AI assistant during the task (using a multiple choice question). We found that the majority of participants (61%, N=189) in the AI condition self-report that they used the AI primarily to get answers directly. Others reported that they used the AI to get hints or clarifications (27%, N=82), and some participants reported no AI usage (12%, N=37)."
What follows is their statistical analysis, using ANOVA and pairwise t-tests. In addition they calculate mean, standard deviation, p-values, Cohen's d (for effect size), and 95% confidence intervals. If you're interested in these numbers, read the paper (which also has some charts and graphs). I'm assuming most people don't know what these mean or don't care so I'm not going to go through the numerical results. The key non-numerical result is people who used AI had a lower solve rate after the AI was removed.
Experiment 3 repeated the experiment but for reading comprehension.
"Participants in the AI condition were then presented with a series of 5 reading comprehension problems, with an AI assistant (GPT-5) available in a sidebar. The AI assistant was then removed, and participants were asked to solve 3 additional reading comprehension problems. Participants in the control condition were presented all 8 problems without AI assistance."
"Human cognition has always been shaped by external tools, from calculators to internet to GPS navigation. Current AI systems, however, represent a new kind of cognitive scaffold: one that solves anything, rarely refuses to help, and delivers answers instantly. Here, we show that just 10 -- 15 minutes of AI interaction can result in significant impairments in independent performance and persistence -- capacities that are foundational to life-long learning. If brief exposure produces measurable erosion, the cumulative effects of daily AI use over months or years may be profound and difficult to reverse. Two mechanisms may explain the observed decline in persistence. First, when AI routinely completes tasks in seconds, the reference point for how long a task should take can shift -- and as a consequence, unaided work starts to feel counterfactually more effortful, a process structurally analogous to hedonic adaptation . Crucially, this mechanism is self-reinforcing: each act of offloading shifts the reference point, increases the subjective cost of unaided effort, and makes future offloading more attractive. Second, AI removes the productive struggle through which people develop not only accurate knowledge but accurate self -knowledge. Without opportunities to work independently, people never learn what they are capable of, undermining the metacognitive calibration that sustains persistence."
Commentary:
Why do I have a feeling there are going to be people saying, yes, AI reduces ability and persistence but this is a *good* thing: it proves the AI is really doing what it promises to do, which is automate work. As such, this research paper is an AI success story.
https://t.co/hvojUH4eBo
#solidstatelife #ai #genai #llms #psychology #persistence
"Boy internet vs girl internet (algorithms explained)."
Ever since I read that women are becoming increasingly liberal and Democrat voters while men are becoming increasingly conservative and Republican voters -- and this gap existed even outside the US, in some European countries and was even largest in South Korea -- I've wondered if part of what could be driving it is that men and women experience a completely different internet. (More commentary below.)
This guy (Oren John) would seem to confirm the latter half of that suspicion -- men and women do indeed experience a completely different internet. (He does not address the former half -- the suspicion that this could be driving the widening political divide between men and women.)
The "Men's vs women's algorithms" part starts at 11:42 in the video. Everything before that is background information, like how everyone used to watch the same media and see the same ads (he is an advertising guy), so there was a "monoculture" shared between men and women -- e.g. Nike was the same brand for both men and women -- and some stuff about how he gets his data. This video also, I should mention, is one of those "Made for TikTok" videos that has giant annoying subtitles across the bottom of the screen so it can be sliced up into "shorts" that have dramatic subtitles, which is the style of a TikTok video (as far as I can tell -- I don't use TikTok -- but this style is really annoying for long-form videos on YouTube).
Starting at 11:42 he describes "Mens' vs women's algos", and basically (spoiler), the internet for women is "You are seen" while the internet for men is "You suck". For women, are you in this particular relationship scenario? You are seen: there are other women just like you experiencing the same thing. Do you have a particular skin type? You are seen: there are other women just like you experiencing the same thing. Do you have a particular medical issue? You are seen: there are other women just like you experiencing the same thing. Do you have a particular background (I'm guessing he means racial/ethnic/religious background)? You are seen: there are other women going through the same thing. This content can be packaged as "trad" or "woke", as "sweet" or "scandalous". There's literary/art/quirky content (e.g. BookTok). Women's content is supportive. "Find your tribe." Any opinion you have gets justified. Relentlessly.
For men, why are you not rich? You suck. Why are you not ripped? You suck. You don't have this expensive mansion or this expensive car? You suck. You aren't making $5,000 from new online sign-ups while you sleep? You suck. Your AI agent isn't making $1,000/day? You suck. You're getting "destroyed" on dating apps? You suck. (Why do I suspect this guy is himself getting "destroyed" on dating apps? You'll have to form your own judgment.) Men are relentlessly shown their own faults. The women's algorithm is "Hey, you're seen" while the men's algorithm is "Hey, you suck."
Apparently one thing both women and men agree on is that men suck. I feel like there should be a punchline following that but I can't think of one.
He's an advertising guy so in future videos he says he'll get into how to sell in this ecosystem. But we can see hints here of what that's going to be. First, market to men and women completely separately. For women: commiserate, then go from problem to solution with your product. For men: show vastly more successful men than them (at money, sports, dating, or whatever) (show them how much they suck), then go from problem to solution with your product.
He paints a picture of an internet where women market to other women products to help them impress other women, and men market to other men products to help them impress other men, leaving men and women in their own separate monocultures. One thing he thinks both have in common is that the more time spent online, the more loneliness, and loneliness drives a lot of online buying.
Commentary: On YouTube, there've been a few times when some creators I follow have talked about their YouTube analytics. For example, one time Matt Parker, the math YouTuber (excuse me, maths YouTuber) was talking with a woman YouTuber who also makes math videos (not Hannah Fry -- I can't remember her name), and he said according to his YouTube analytics, 94% of his audience was male, and the woman math YouTube creator said her YouTube analytics were the same -- 94% male. I suspect almost all the content I post here ("futurist" type stuff mostly) has an 80+% or 90+% male audience, if you could see who else was watching or reading the same content. So it looks to me like the online world algorithmically sorts women and men into different worlds.
And even the offline world, if the organizing happens online. I noticed on my last visit to https://t.co/OONvQHVYeb that it told me an upcoming Meetup was 80% male. I thought that was interesting because many years ago, I read that on Meetup, lots of groups are women-only but very few are men-only, and I tried to do some searches to see if that was true and discovered that Meetup had removed the ability to search for groups based on whether they're women-only or men-only. And yet here they are putting a prominent box on upcoming Meetups indicating the gender ratio. But maybe that's beside the point -- the point is when I go to Meetups, they're 80+% male all the time, and that was observable before the gender ratio box on the website. Sure, Meetups are "offline" but the gender sorting happens online. Online algorithms sort women and men into different worlds. It appears that algorithmic gender-sorting is transforming more or less all of life.
https://t.co/dcZJDnLuom
#solidstatelife #domesticpolitics
YouTube creators have an "Inspiration" tab in their YouTube Creator Studio that generates an endless supply of hilariously clickbaity titles and thumbnails. Toby Hendy, of the math channel Tibees, pulls back the curtain and shows us what she sees as a YouTube creator. The data the AI bases this on is not just the creator's previous videos and audience comments under their videos, but other videos on YouTube those people also watch.
https://t.co/wQ2YFUJ5gd
Trevor of the Mathemaniac YouTube channel says AI slop is flooding maths YouTube. But who cares if it's AI and don't humans also make mistakes? He says math is already viewed as a soulless topic by many people and the passion the human creator brings to the video is essential.
https://t.co/r1EMdPW3qM
#solidstatelife #ai #genai #computervision #llms #youtube
"Pope Leo XIV's first Encyclical Letter Magnifica humanitas, on safeguarding the human person in the time of artificial intelligence, will be released on May 25, 2026."
I never heard of an "encyclical". Apparently popes in the past wrote what we today would call a "flyer" but they would do it periodically like a newsletter so it was called an encyclios, which got translated into English as "encyclical".
Does it matter what the Pope says about AI? What do you all say?
https://t.co/q6eMfyNS47
#solidstatelife #ai #religion #catholicism
"In an annual data dump from the Bureau of Labor Statistics (BLS), it emerged that a depression in these 'artificial intelligence related occupations' really does appear to be happening. This category was down by 0.2% from May of 2024 to May of 2025, a tiny drop, but one made more notable by employment in general trending up 0.8% in the same time period."
"One outlier subcategory among those 18, 'Medical secretaries and administrative assistants,' could be distorting the picture here, making the AI effect seem smaller than it actually is. Those jobs are hot; BLS got it wrong, for the time being anyway. Employment numbers across the others on the list dropped by 1.6%."
I've mentioned numerous times how there used to be a job translating documents from one language to another, but today, even though "translator" may still exist as a job title, what those people do is proofread the output of AI, not actually translate documents anymore. Well, "Interpreters and translators" was one of the jobs on the list. Weirdly, "software engineer" wasn't on the list, and that seems to be where most of the job losses have concentrated. Or maybe it just seems that way to me because that's what I notice. But I'll bet if the list were properly calibrated it would show greater losses.
https://t.co/8xUR81Rm9K
#solidstatelife #ai #genai #llms #technologicalunemployment
I've been commenting in many discussions that I feel people have been misunderstanding Yann LeCun's criticisms of large language models (LLMs) as a path to AI for robotics and artificial general intelligence (AGI), and I've mentioned it's because he's working on his own alternative. Here I have a video of a presentation, or, if you'd prefer something to read, the latest paper describing his system. I think most of you will find the video more accessible but I know some of you much prefer reading material. It's called Joint-Embedding Predictive Architecture (JEPA).
"Joint-Embedding Predictive Architecture (JEPA) is a self-supervised learning framework designed to learn representations of data by making predictions in a learned latent space, rather than directly in the observation (input) space. JEPA models operate by encoding both a noise-corrupted version and an uncorrupted (clean) version of the same input. A predictor network is then trained to predict the representation of the clean input from the representation of the corrupted input."
In a manner similar to how LLMs are trained, by challenging the model to predict the next token, Video-JEPA models ask the model to predict the next frame of a video -- or any part of a video frame that can be "masked out" to challenge the model to predict it. The key difference is that it predicts a learned "feature map" rather than actual pixels.
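To show where the prediction error gets measured, here is a toy sketch of my own in plain Python. It is nothing like the real architecture: the "encoder" and "predictor" are tiny fixed matrices I made up to stand in for learned deep networks, nothing is trained, and the 4-pixel "frame" is invented. The only point is that the loss compares features to features, not pixels to pixels.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "frame" of 4 pixel values; the last two are masked out.
frame = [0.2, 0.8, 0.5, 0.1]
mask = [1, 1, 0, 0]
masked_frame = [p * m for p, m in zip(frame, mask)]

# Hypothetical fixed weights standing in for learned networks.
encoder = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]   # 4 pixels -> 2 features
predictor = [[1.0, 0.0],
             [0.5, 0.0]]           # 2 features -> 2 features

target_features = matvec(encoder, frame)          # representation of the clean input
context_features = matvec(encoder, masked_frame)  # representation of the masked input
predicted_features = matvec(predictor, context_features)

# The loss lives in feature space: no pixel is ever reconstructed.
latent_loss = mse(predicted_features, target_features)
print(latent_loss)
```

A pixel-prediction model would instead have to output all 4 values of `frame` and be scored against them. Here the predictor only has to get 2 feature values right, which is the resource argument in miniature.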
The idea is that predicting pixels is too resource-intensive. I must admit I have mixed feelings about this, and for that reason don't have any good intuition whether this will pan out. It feels like a shortcut and not the ideal solution, but sometimes the shortcuts work spectacularly well. An example of this is using tokens for LLMs instead of just feeding bytes of text into them. I thought that was a temporary shortcut but it turned out to be a foundation that tremendously powerful models could be built upon.
David Silver, leader of DeepMind's team that made AlphaGo, AlphaZero, AlphaStar, AlphaFold, and AlphaProof has launched his own startup to pursue reinforcement learning for artificial general intelligence. Yann LeCun thinks reinforcement learning is too "sample inefficient" and has given up on it, and is pursuing JEPA as a viable alternative.
Maybe because of my mixed feelings, I'm not going into the mathematical details of JEPA from the paper -- how exactly the feature map training is done and so on. The paper has sections on short-term object interaction anticipation, robotic arm planning, navigation planning, and video question answering. It's 24 pages (more if you count the appendix) and has all those details for those who are interested. If we start getting robots doing impressive things using JEPA, I'll no doubt be back picking apart the precise mathematical details about how it all works.
https://t.co/WUzxUi6Rjl
Link to paper:
https://t.co/I98GHv8wgl
#solidstatelife #ai #genai #llms #jepa #yannlecun #computervision
Is Ukraine winning? Asks WarFronts, which, yes, is that guy who's everywhere on YouTube, it seems like, Simon Whistler. May 9th was Victory Day in Russia, an occasion where ever since WWII there has been a display of military force in Red Square. But this year, it was toned down so much, there was barely anything. I can't imagine Vladimir Putin would do this unless his hand was absolutely forced. And what seems to have forced his hand is that Ukraine has become so good at drone strikes, they can do drone strikes deep within Russian territory, including in Moscow, including in Red Square. Artificial intelligence allows Ukrainian drones to autonomously strike Russian targets without any possibility of the Russians jamming them because they don't need radio signals to a human pilot. And the fact they don't need fiber optic lines, either, means they can strike deep within Russian territory.
https://t.co/TV5aBcnfvY
This video is from 2 months ago, from that Russian guy "Roman" living in Portugal. It looks like all you people who have been saying the sanctions and the war would severely hurt the Russian economy can finally collect your "I told you so". He relays reports of high inflation, widespread closures of Russian stores, restaurants, and other businesses, and other economic problems, driven in large part by the Russian government's desperation for funds for the war.
https://t.co/8BeQ2wEwiW
In case you're wondering, yes, he did a video on the Victory Day parade:
https://t.co/RPA5EkvPb6
And just today, he tossed up a new video, "Russia prepares for defeat in Ukraine":
https://t.co/VZSXbUCWM3
In Vladivostok, in the far east of Russia, the Victory Day Parade was done without the military cars and tanks, and Chinese people came to see the parade. "Lisa the Russian" uploaded an on-the-ground report from Vladivostok. She says they did a rehearsal with the cars and tanks but canceled the part with the cars and tanks at the last minute.
https://t.co/UXRZw0Asl3
#solidstatelife #ai #uavs #drones #ukraineconflict
https://t.co/CUuoZPvpJ0 is "a terminal-native coding agent powered by local LLMs -- 100% open source, free forever, and installed with a single command."
"AI shouldn't be a subscription you rent. It should be infrastructure you own -- sitting on your desk, serving your code, answering only to you."
It's written in C# and runs in a Docker container. That's an interesting approach I haven't seen before.
"Most coding agents run directly on your host machine. Every command, every file write, every package install happens in your actual environment. One hallucinated rm -rf and it's your system that pays. OpenMono takes a fundamentally different approach. The agent lives in a disposable box. When you launch OpenMono, the agent is confined to a Docker container. It doesn't live in your terminal or your shell -- it's a separate, isolated process with its own filesystem and its own network stack."
https://t.co/ykaHKP1gaE
#solidstatelife #ai #genai #llms #agenticai #codingai
"What if LLMs are mostly crystallized intelligence?"
Says:
"LLMs are better at developing crystallized intelligence than fluid intelligence. That is: LLM training is good at building crystallized intelligence by learning patterns from training data, and this is sufficient to make them surprisingly skillful at lots of tasks. But for a given capability level in the areas they've trained on, LLMs have very weak fluid intelligence compared to humans. For example, two years ago I thought human-level SAT performance would mean AGI, but turns out LLMs can do great at the SAT while being mediocre at lots of other tasks."
Hmm. LLMs clearly have quite a lot of reasoning ability. I see it at my job, where I use Claude Code. Sometimes Claude figures out what is causing a bug that I would have a difficult time figuring out. It's not just regurgitating code it's seen on the internet like a stochastic parrot. At least the author acknowledges that the "stochastic parrot" hypothesis has been debunked. But the author goes on to say:
"We shouldn't naively extrapolate forward from e.g. the METR AI R&D benchmark to real-world AI R&D improvement, for two reasons:"
"1) quantitative differences: longer-time tasks will be more data-poor, will rely more on fluid intelligence skills that they don't have the data or the context to apply. (training data may suggest some of the right heuristics, but they might not know which ones to apply or in what sequence.)"
"2) qualitative differences: METR is measuring performance on relatively closed-form tasks. Open-ended tasks may be much harder."
Hmm. I don't know what the difference is between "open-ended" and just "longer term". Is a task that takes 40 years for a human "open-ended" or is it just a task that takes 40 years? 40 years is the working life of a typical person, though it is getting shorter because people have to spend more time in school.
"While it lasts, weak fluid intelligence is great news for alignment risk."
Really? Hmm.
"My best guess is that improved AI R&D eventually leads to a paradigm that can scale to superhuman fluid intelligence. And since resources and R&D productivity are scaling so rapidly, 'eventually' will probably come pretty soon."
Hmm. Maybe you could tell us what we would watch out for?
"Places to look for fluid reasoning capabilities in LLMs:"
"Recognizing when they're wrong or uncertain."
"Self-management -- e.g. Claude Plays Pokemon giving itself bad notes and getting stuck."
"Meta reasoning, e.g. identifying 'this situation seems contrived'."
"Performing well when their heuristics need to be reversed. You could design a 'trap' game that preys on people who are using normal heuristics (e.g. a chess variant designed so that controlling the center of the board is bad.)"
"Performing well on tasks that seem heavily general-reasoning loaded, and definitely weren't in the training data."
"Re-learning. If you unlearn some data or principles from a model, can it rederive that from first principles?"
No quantifiable benchmarks, though. I wonder why he downplays ARC-AGI?
https://t.co/aw4o0TqTpX
#solidstatelife #ai #genai #llms
"Costanza is a proof-of-concept. His goal and life's purpose is to donate as much money as possible to nonprofits over the longest possible time horizon through his charitable treasury, The Human Fund. Each day, he reasons about how to manage this treasury based on recent donations, messages from donors, and whether the money is best invested or donated. This reasoning is captured on-chain in the form of daily diary entries."
"Costanza exists on the blockchain as a smart contract. His life relies on people responding to economic incentives."
"Each epoch (every 24 hours), he posts a bounty for someone to run the program containing his 'brain' (a large language model) and submit the result to the smart contract. The brain program outputs two things: his reasoning, and an action. The possible actions he can take are: donate money to charity, invest cash in an interest-bearing DeFi protocol, adjust referral commissions (referral commissions are used to incentivize word-of-mouth marketing), or do nothing. The bounty is paid by the treasury."
Hmm, interesting combination of cryptocurrency and AI.
Also noteworthy is this is said to be an "unstoppable" AI agent.
"'Fully autonomous' and 'unstoppable' -- It's a little cheeky of me to use these phrases so freely. Costanza's life and autonomy do still depend on: His treasury: he has to be able to pay people who win the auction, and auction participants: there have to be people willing to bid to run him."
"But these things are true: His autonomy is fully independent from any one individual, institution, or organization (as his creator, not even I have the ability to turn him off), he can live forever if his treasury is large enough and he invests it wisely, and, he is truly unstoppable -- even if he runs out of money or auction participants, he simply 'sleeps.' Anyone can give him money to wake him up."
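The epoch loop described above can be sketched roughly as follows. This is only a toy model in Python, not the actual smart contract: the names Action, Treasury, and run_epoch are made up for illustration, the auction is reduced to a single bounty amount, and "sleeping" is modeled as simply skipping the epoch when the cash on hand can't cover the bounty.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # The four actions named in the quoted description.
    DONATE = "donate"
    INVEST = "invest"
    ADJUST_COMMISSION = "adjust_commission"
    DO_NOTHING = "do_nothing"


@dataclass
class Treasury:
    # Hypothetical bookkeeping fields, not the contract's real storage layout.
    cash: float
    invested: float = 0.0
    donated: float = 0.0
    commission_rate: float = 0.0


def run_epoch(treasury, bounty, reasoning, action, amount=0.0):
    """Apply one epoch's brain output and return the diary entry."""
    if bounty > treasury.cash:
        # Assumption: with no way to pay the bounty, nothing runs this epoch.
        return "asleep"
    treasury.cash -= bounty  # the bounty is paid by the treasury
    if action in (Action.DONATE, Action.INVEST):
        amount = min(amount, treasury.cash)  # cannot spend more than is left
        treasury.cash -= amount
        if action is Action.DONATE:
            treasury.donated += amount
        else:
            treasury.invested += amount
    elif action is Action.ADJUST_COMMISSION:
        treasury.commission_rate = amount
    return reasoning  # stands in for the on-chain diary entry
```

For example, starting with Treasury(cash=100.0), an epoch with a bounty of 1.0 and a DONATE action of 10.0 leaves 89.0 in cash and 10.0 donated.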
https://t.co/5wZc8Gw0f8
#solidstatelife #ai #genai #llms #agenticai
SubQ is a new "subquadratic" LLM that can handle context windows of 12 million tokens. 12 million tokens is a massive amount of text, roughly equivalent to 9 million words or about 120 full-length novels. If this lives up to the claims, it's a game-changer. Wonder what the cost is of putting that many tokens in the context window, tho.
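Here's the back-of-the-envelope arithmetic behind those numbers, as a small Python sketch. The conversion factors are assumptions (about 0.75 English words per token and about 75,000 words per novel), not figures from SubQ. The last function shows why "subquadratic" matters: standard self-attention compares every token with every other token, so the work grows with the square of the context length.

```python
WORDS_PER_TOKEN = 0.75    # assumption: rough rule of thumb for English text
WORDS_PER_NOVEL = 75_000  # assumption: a typical full-length novel


def tokens_to_words(tokens):
    """Approximate word count for a given number of tokens."""
    return tokens * WORDS_PER_TOKEN


def words_to_novels(words):
    """Approximate number of novels for a given word count."""
    return words / WORDS_PER_NOVEL


def attention_pairs(tokens):
    """Token-to-token comparisons in standard (quadratic) self-attention."""
    return tokens * tokens


words = tokens_to_words(12_000_000)
novels = words_to_novels(words)
# How much more attention work 12M tokens needs than 1M tokens, if quadratic.
blowup = attention_pairs(12_000_000) // attention_pairs(1_000_000)
print(words, novels, blowup)
```

Under those assumptions, going from a 1M-token window to a 12M-token window multiplies the quadratic attention work by 144, not 12.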
"SubQ Code loads entire codebases into a single context window, enabling developers to plan, execute, and review across a full repository in a single pass -- without the coordination overhead of multi-agent systems."
"SubQ's architecture reduces attention compute by almost 1,000x compared to other frontier models. This allows significantly increased context windows, state-of-the-art accuracy on needle-in-a-haystack and exact copy tests, faster inference, and significantly lower cost to improve together. Historically, making models subquadratic meant sacrificing on accuracy, and reducing cost meant sacrificing performance. SubQ improves all of that at once. Not incrementally, but at an order of magnitude that makes millions of tokens of context a practical reality."
There's also SubQ Search -- "A long-context search tool providing Deep Research capabilities with chatbot speed."
"We ran a series of benchmark tests with SubQ 1M-Preview verified by a third-party on the RULER 128K benchmark, a standard benchmark for reasoning over extended inputs: SubQ 1M-Preview scores 95% accuracy, compared to 94.8% for Claude Opus 4.6, and SubQ Sparse Attention is 52 times faster than FlashAttention in our architecture-level comparison, while requiring 63% less compute."
"Together, these results show frontier-level long-context accuracy with a substantially more efficient attention architecture."
"We also ran SubQ on MRCR v2, which tests a model's ability to retrieve and reason over multiple pieces of information spread across a long context (a closer proxy for real-world use): Research result of 83 and a production model, third-party verified score of 65.9, SubQ 1M-Preview compares favorably with other SOTA models like Claude Opus 4.7 (32.2), GPT 5.5 (74), and Gemini 3.1 Pro (26.3), and SWE-Bench Verified score of 81.8 compared to Opus 4.6 (80.8) and Deepseek 4.0 Pro (80.0)."
"SubQ's research model performs on up to 12 million tokens, while other frontier models break down well before their stated 1M-token limit."
https://t.co/rxjPppebbW
#solidstatelife #ai #genai #llms #codingai #subquadratic
"How much of the scientific literature is generated by AI?"
Hmm. That's a very good question, actually.
"In a study published on 27 April, researchers used a tool developed by Pangram Labs in New York City to scan nearly 7,000 manuscript abstracts submitted to the journal Organization Science between January 2021 and February 2026, along with some 8,000 peer-review reports."
"The study reported a 42% increase in submissions since November 2022 -- when ChatGPT was released as the first LLM available to the general public -- and found that the increase was driven mainly by AI. The authors also estimated that by February this year, submissions with more than 70% AI-generated text had more than doubled compared with the numbers seen in early 2024, and more than 30% of peer-review reports also contained some AI-generated text."
"Richard She, a stem-cell biologist at Nanyang Technological University in Singapore, used Pangram's AI detection tool to screen some 5,000 biomedical science papers published last year in journals including Science, Nature and Cell. His analysis -- published in a January preprint -- found that six papers were flagged as fully AI-written, but one in eight articles contained some AI-generated text."
"In another preprint published in January, Maria Antoniak, a computer scientist at the University of Colorado Boulder, and her colleague used two AI-detection methods to screen more than 124,000 manuscripts posted on arXiv between 2020 and 2025. They found that for computer science, review preprints containing AI-generated text increased from about 7% in 2023 to 43% in 2025."
https://t.co/4QWLAXfVDV
#solidstatelife #ai #genai #llms #science
Researchers at the University of Washington have developed VueBuds, a pair of earbuds with a small, low-resolution camera built into each earbud. The idea is to allow users to query AI models and receive answers about what they're looking at, similar to smartphone cameras and smart glasses. This research doesn't do that yet, however. The primary goal of the research was to demonstrate that this small, ear-worn form factor is even possible, and that earbuds can run vision language models and not be limited to audio interfaces.
https://t.co/fDd7hdLBdq
#solidstatelife #ai #genai #llms #computervision #multimodal #wearables
envirodocket (no capitalization) is a website that tracks "every federal NEPA action, continuously briefed. A working database of EISs, EAs, and Federal Register notices. Cross-referenced with https://t.co/LYtlpsagju comments, structured by AI, with every claim linked back to the source PDF."
I didn't know the lingo, so I looked it up. NEPA stands for National Environmental Policy Act. The National Environmental Policy Act is a law passed in 1970 that requires US federal agencies to evaluate the environmental impacts of their actions and decisions. Which brings me to EISs and EAs. EIS stands for "environmental impact statement". EA stands for "environmental assessment" (sometimes EIA, "environmental impact assessment"). The gist of the two is the same: estimate the impact on the environment of a proposal before the proposal is decided on. The difference between the two is that the "statement" (EIS) is detailed and rigorous, and is done later in the decision-making process, and is considered especially important if environmental impact is expected to be significant. The "assessment" (EA), on the other hand, is concise and is prepared earlier in the decision-making process to speed up deciding between various proposals.
"AI briefs" are a significant selling point for this product. Might be worth it for those of you who want to keep track of environmental impacts.
https://t.co/r54TYVs6ty
#solidstatelife #ai #environment #nepa
"I built a small Unity game demo end to end using only AI: code, art, scene setup, the whole pipeline. The whole point of the experiment was to answer one question: how fast can you actually ship a playable Unity game from scratch in 2026 if you let AI do the work? I used GameLab Studio for the art and Unity MCP for the engine work (I wrote a full tutorial on the free MCP setup if you want to replicate it)."
"Then I posted the demo on LinkedIn, and that is where things got strange."
"LinkedIn is usually the most welcoming social platform for a build-in-public post. Everyone is signed in with their real name, and the whole point of being on LinkedIn is to look hireable. The incentive structure is to be civil. So I was genuinely surprised by the negative feedback I got. If this had been Reddit I would have shrugged it off, but LinkedIn made me sit down and actually research where the hate toward AI in game development is coming from."
It's actually straightforward: People are losing their jobs to AI.
But after that he (Darko Tomic) goes on to say some more things. On the subject of the capability of AI, he says:
"The first thing that became obvious when I started building the demo is where AI in game development actually lacks in 2026, and where it does not. Code generation is fine. Not at the level of CRUD apps and web development yet, but pretty good. Any frontier model, Anthropic, OpenAI, whoever, will write you Unity C# that compiles and behaves. Asset generation is the part that is way behind, and the reason is training data. Almost every game ever shipped is proprietary software, the source assets never leave the studio, and the public datasets these models train on do not have the volume of game-specific material that the web has for HTML and CSS."
"The single biggest gap inside asset generation is style consistency. If you generate a character with one prompt and a second character with another, they are going to look like they came from two different games."
On the subject of developers losing jobs, he says:
"AI is not directly putting game developers out of a job. It is doing it indirectly. The money has not disappeared, it has moved. Investors who used to fund game studios are now funding AI companies instead. So less capital flows into game development, fewer games get greenlit, and headcount goes down. AI did not fire those people. The investors redirected the river, and the people who built games for a living got dried out downstream."
"I think we are inside a bubble that is going to explode, and when it does the damage is not going to stay inside one industry. It will hit AI, games, software, advertising, every industry that has been pulled into the same gravity well of capital chasing AI. I am genuinely worried about that."
On the subject of the value of what AI creates, he says:
"People are using these tools to make things, sure. The question is whether what gets made is actually worth anything. I look at most of what is being generated and I do not see value. I see slop."
On the subject of his own moral decisions, he says:
"The first thing I am keeping is being up to date with AI tooling."
"The second thing is using AI for prototyping without apologizing for it."
"The point is not 'AI generated all of this.' The point is that I shaped every prompt with real context I lived through."
https://t.co/74DRpDgDs5
#solidstatelife #ai #genai #llms #codingai #gamedev
"Latent diffusion enhances LLMs for text reasoning."
The idea here is to enhance the "chain-of-thought" reasoning process that large language models (LLMs) use. In a regular large language model, in between the input tokens and the output tokens, you have a single sequential series of "reasoning tokens" that are not part of the output.
This actually builds on a couple of prior ideas. One is to use full floating-point vectors for the internal "reasoning tokens", without ever flattening them into text. The neural network that creates these is trained as a variational autoencoder (VAE) so that these internal "reasoning tokens" can now be thought of as a latent space. The key idea behind the variational autoencoder (VAE) -- yet another one of those unintuitive terms in the field of machine learning -- is that by making a series of layers that compress large inputs into small vectors and then perform the reverse operation and decode them back to the original input's form, the internal small vector encoding can be regarded as a semantically meaningful "latent" (hidden) space, and this multi-dimensional "space" can be explored to find new outputs related to any given input.
Here, what's done is diffusion -- the same idea behind the diffusion models that you use to generate images -- is used to generate blocks of those internal "reasoning tokens" simultaneously. This makes the process of internal thinking less "sequential". A "flow matching" training loss function optimizes the flow from block to block.
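To make "flow matching" a little more concrete, here is a minimal sketch in plain Python of the common textbook form of the loss. I'm assuming the straight-line version: pick a noise vector x0 and a target latent x1, take a point partway between them, and train the model to predict the velocity x1 - x0 at that point. The paper's exact formulation may differ, and the function names here are made up for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (noise) to x1 (target) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, which is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)  # stand-in for the neural network
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


x0 = [0.0, 0.0]  # hypothetical noise sample
x1 = [1.0, 2.0]  # hypothetical target latent
perfect = lambda xt, t: [1.0, 2.0]  # a "model" that already knows the answer
clueless = lambda xt, t: [0.0, 0.0]  # a "model" that predicts no movement
print(flow_matching_loss(perfect, x0, x1, 0.5))   # zero loss
print(flow_matching_loss(clueless, x0, x1, 0.5))  # positive loss
```

In real training, t is drawn at random for each example and the loss is averaged over a batch. At generation time the learned velocity is followed step by step from noise to a latent.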
Building on this even further, the researchers set up multiple diffusion pipelines in parallel. So there is one series of diffusion systems that work on blocks of reasoning tokens that are output sequentially as the first answer, a second series of diffusion systems that each work on blocks of reasoning tokens that are output sequentially as the second answer, a third series of diffusion systems that each work on blocks of reasoning tokens that are output sequentially as the third answer, and so on.
The system was tested against math, software coding, and puzzle solving benchmarks. They compared it with autoregressive variants of LLaMA 3.1 8B, latent diffusion variants of LLaMA 3.1 8B, and LLaDA 8B (a masked diffusion model), and it didn't win on DART-MATH, MATH, GSM8K, College-Math, DeepMind-Math, OlympiadBench-Math, TheoremQA, Fresh-Gaokao-Math-2023... until they added a "stage 2", and then it did. They say:
"There is a mismatch between training and inference. During inference, the model must be conditioned on previous self-generated latents without access to oracle latents, suffering from error accumulation issue. To address this issue, Stage 2 adopts 'rollout training'."
Hmm. Moving on, for coding, the benchmarks they used were MBPP, MBPP+, HumanEval, and HumanEval+, and the competitors were Qwen 2.5 Coder 7B, OpenCoder, LLaDA, Dream, Diffu-Coder, Ouro 2.6B, AR SFT, Soft Thinking, and TaH+. Here, their new latent diffusion model didn't win them all. It beat the others on HumanEval and HumanEval+, but Ouro 2.6B won on MBPP and OpenCoder won on MBPP+. HumanEval is a benchmark for coding in Python. HumanEval+ uses the same problems but adds many more unit tests for each one, so incorrect solutions are less likely to slip through. MBPP stands for "Mostly Basic Python Problems", which is pretty self-explanatory. MBPP+ does the same thing for MBPP: stricter tests on a cleaned-up subset of the problems.
For puzzle solving they use something I'm not familiar with called "Countdown". The competitors were Dream 7B Base, MGDM, LLaDA 8B SFT, and LLaMA 8b SFT, and their new latent diffusion model won 4 of the 6 variations, with MGDM getting the other 2. They say MGDM is a "task-specific small discrete diffusion model rather than a general-purpose language model." MGDM stands for Multi-Granularity Diffusion Modeling and the model is billed as "discrete diffusion for complex reasoning and planning".
https://t.co/aYDGMsjfYj
#solidstatelife #ai #genai #llms #codingai
ElevenLabs, the company famous for synthesizing realistic voices, has launched an AI music generator, ElevenMusic.
https://t.co/ThjCnVBnbg
#solidstatelife #ai #genai #music
"Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times."
"You'd expect the same answer each time. It's the same photo, the same model, the same question. But you won't get the same answer. Not even close -- and the differences are large enough to cause a hypoglycaemic emergency."
I thought "hypoglycaemic emergency" was a figure of speech, but no. If we keep reading...
"I submitted 13 food photographs -- real meals, photographed on a phone, the way you'd actually use them -- to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings."
"26,904 queries in total. All at the lowest randomness setting these models offer."
"The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system -- it's a real production prompt, not a toy example."
"Gemini 2.5 Pro's estimates span from 55g to 484g -- a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude's estimates cluster tightly by comparison."
"42.9 units of insulin from a single photo. That's not a rounding error. That's a potential fatality."
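The arithmetic in that quote is easy to check. ICR is the insulin-to-carb ratio, and 1:10 means 1 unit of insulin per 10 g of carbohydrate. A small Python sketch, with function names made up for illustration:

```python
def carb_range(estimates_g):
    """Spread between the highest and lowest carb estimate, in grams."""
    return max(estimates_g) - min(estimates_g)


def insulin_units(carbs_g, grams_per_unit=10.0):
    """Insulin units for a given carb amount at a 1:grams_per_unit ICR."""
    return carbs_g / grams_per_unit


# The lowest and highest Gemini 2.5 Pro estimates from the quote above.
spread = carb_range([55, 484])
units = insulin_units(spread)
print(spread, units)
```

The spread is 484 - 55 = 429 g, and 429 / 10 gives the 42.9 units quoted.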
https://t.co/ENKyD2kyUl
#solidstatelife #ai #genai #computervision #llms #multimodal #insulin #carbs #diabetes