Andrew Ng just proposed the "Turing-AGI Test" and it is the reality check the industry needs.
The original Turing Test measured the ability to mimic humans. This new proposal measures the ability to do the job.
The criteria are simple. Can the AI accept training? Can it execute tasks over multiple days? Can it deliver economic value?
We build with this mindset at MaybeAI. Benchmarks are interesting, yet reliable workflow execution is what actually matters to a business.
Happy 2026! Will this be the year we finally achieve AGI? I’d like to propose a new version of the Turing Test, which I’ll call the Turing-AGI Test, to see if we’ve achieved this. I’ll explain in a moment why having a new test is important.
The public thinks achieving AGI means computers will be as intelligent as people and be able to do most or all knowledge work. I’d like to propose a new test. The test subject — either a computer or a skilled professional human — is given access to a computer that has internet access and software such as a web browser and Zoom. The judge will design a multi-day experience for the test subject, mediated through the computer, to carry out work tasks. For example, an experience might consist of a period of training (say, as a call center operator), followed by being asked to carry out the task (taking calls), with ongoing feedback. This mirrors what a remote worker with a fully working computer (but no webcam) might be expected to do.
A computer passes the Turing-AGI Test if it can carry out the work task as well as a skilled human.
Most members of the public likely believe a real AGI system will pass this test. Surely, if computers are as intelligent as humans, they should be able to perform work tasks as well as a human one might hire. Thus, the Turing-AGI Test aligns with the popular notion of what AGI means.
Here’s why we need a new test: “AGI” has turned into a term of hype rather than a term with a precise meaning. A reasonable definition of AGI is AI that can do any intellectual task that a human can. When businesses claim they might achieve AGI within a few quarters, they usually try to justify these statements by setting a much lower bar. This mismatch in definitions is harmful because it makes people think AI is becoming more powerful than it actually is. I’m seeing this mislead everyone from high-school students (who avoid certain fields of study because they think studying them is pointless with AGI’s imminent arrival) to CEOs (who are deciding what projects to invest in, sometimes assuming AI will be more capable in 1-2 years than is realistically likely).
The original Turing Test, which required a computer to fool a human judge, via text chat, into being unable to distinguish it from a human, has been insufficient to indicate human-level intelligence. The Loebner Prize competition actually ran the Turing Test and found that being able to simulate human typing errors — perhaps even more than actually demonstrating intelligence — was needed to fool judges. A main goal of AI development today is to build systems that can do economically useful work, not fool judges. Thus a modified test that measures ability to do work would be more useful than a test that measures the ability to fool humans.
For almost all AI benchmarks today (such as GPQA, AIME, SWE-bench, etc.), a test set is determined in advance. This means AI teams end up at least indirectly tuning their models to the published test sets. Further, any fixed test set measures only one narrow sliver of intelligence. In contrast, in the Turing Test, judges are free to ask any question to probe the model as they please. This lets a judge test how “general” the knowledge of the computer or human really is. Similarly, in the Turing-AGI Test, the judge can design any experience — which is not revealed in advance to the AI (or human subject) being tested. This is a better way to measure generality of AI than a predetermined test set.
AI is on an amazing trajectory of progress. In previous decades, overhyped expectations led to AI winters, when disappointment about AI capabilities caused reductions in interest and funding, which picked up again when the field made more progress. One of the few things that could get in the way of AI’s tremendous momentum is unrealistic hype that creates an investment bubble, risking disappointment and a collapse of interest. To avoid this, we need to recalibrate society’s expectations on AI. A test will help.
If we run a Turing-AGI Test competition and every AI system falls short, that will be a good thing! By defusing hype around AGI and reducing the chance of a bubble, we will create a more reliable path to continued investment in AI. This will let us keep on driving forward real technological progress and building valuable applications — even ones that fall well short of AGI. And if this test sets a clear target that teams can aim toward to claim the mantle of achieving AGI, that would be wonderful, too. And we can be confident that if a company passes this test, they will have created more than just a marketing release — it will be something incredibly valuable.
[Original text: https://t.co/mGAmoOGga7]
@Meta acquiring @ManusAI for $2B+ is the clearest signal yet.
The era of "just chat" is ending. The era of "agents that execute" is here.
Models provide the intelligence. Agents provide the hands to do the work.
We focus on this execution layer every day at MaybeAI. Reliable, multi-step workflows are the only way to capture durable value from AI. 😎
Manus is entering the next chapter: we’re joining forces with Meta to take general agents to the next level.
Full story on our blog: https://t.co/huPrnbITCi
@ads3_ai @animocabrands A game-changer for the Web3 ecosystem! This strategic investment from @animocabrands into @ads3_ai will accelerate the mission of onboarding the next wave of users.
Gemini Agent can help tackle all sorts of tasks. Even renting a car.
Tell Gemini Agent your budget and it’ll get to work comparing prices, gathering info from your inbox, and booking the car.
Now available for Google AI Ultra users in the US on desktop and mobile.
@deedydas Page saw it clearly. The data was already there in 2000; it just needed the compute to catch up. Most predictions fail because they ignore existing constraints. This one worked because it built on what Google already had.
@deedydas World model training for robotics is a massive leap. 0.88 correlation suggests we're approaching the point where simulation becomes genuinely predictive of reality.
@ben_burtenshaw @huggingface Code generation for model fine-tuning is fascinating. The meta-level of AI helping improve AI opens interesting questions about iterative capability growth.
@Yuchenj_UW Intellectual challenge as the path to happiness rings true. The best builders are driven by curiosity, and we're living through the most curious time in history.
@lexfridman The balance between deep technical work and human connection is what makes research meaningful. Your approach to grounding theory in real engineering challenges resonates deeply.
@ThePracticalDev @jandedobbeleer The instruction file approach is brilliant. Most teams underestimate how much clarity AI needs to be genuinely useful. Clear constraints create better outcomes than endless possibilities.
@claudeai Android support and hotkeys feel like small features until you use them daily. The best improvements are the ones you quickly take for granted.
@Yuchenj_UW The best ideas often come from places where people aren't worried about appearances. Sometimes the messy garage beats the polished boardroom.
OpenAI’s Disney partnership looks flashy: iconic characters, a $1B equity investment, and instant attention for Sora.
The harder question is who carries the long-term risk.
Licensing creative IP into generative systems shifts value and control in subtle ways. Visibility is immediate. Payoff is uncertain.
In AI, distribution deals grab headlines. Sustainable advantage still comes from how the technology is actually deployed in real workflows.
Really happy to be working with Disney to bring some magic to Sora and image gen!
Disney is the best storytelling company in the world, and our users really, really want to generate content with their characters.