Tan Pham

@ngctnnnn

CS PhD student @ UIUC '26 Vingroup fellowship winner '26

Joined October 2021

158 Following

9 Followers

59 Posts

ngctnnnn retweeted

elvis

@omarsar0

24 days ago

This SkillOpt paper from Microsoft is a must-read!

(bookmark it)

I was a bit skeptical of the results reported in the paper when I shared it a few days ago.

However, I managed to integrate it into my agent orchestrator and ran a few experiments.

The results are mindblowing.

Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this.

One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task.

Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve.

In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt.

Stay tuned!

634

105

875

44K

ngctnnnn retweeted

How To Prompt

@HowToPrompt__

24 days ago

ByteDance has published a paper that should make every NVIDIA investor sweat.

They trained an AI that writes CUDA better than human experts.

They call it CUDA Agent.

And it completely rewrites the economics of AI hardware.

They built a massive agentic reinforcement learning loop. The AI writes a kernel, compiles it, profiles the hardware, analyzes the bottlenecks, and rewrites the code until it's flawless.

It learned how to optimize memory access patterns and hardware tiling strategies that traditional compilers miss.

The results are staggering.

On the industry-standard KernelBench, CUDA Agent completely destroyed traditional compilers.

It delivered code that runs up to 3.2x faster than PyTorch's native execution.

On the hardest, most complex models, it beat the strongest proprietary models in the world, including Claude Opus 4.5 and Gemini 3 Pro, by 40%.

It didn't just match human experts. It started discovering optimizations that static compilers literally cannot see.

Here is why this is a massive threat to NVIDIA.

NVIDIA's dominance relies on the fact that CUDA is incredibly hard to master. Developers get locked in because optimizing code for other chips is too painful.

But if an AI agent can autonomously generate hyper-optimized hardware kernels...

You don't need a team of $500k a year CUDA engineers to build world-class infrastructure.

And if an AI can autonomously master CUDA, it can master AMD's ROCm. Or custom silicon.

The impenetrable software wall protecting NVIDIA's monopoly just got breached by a reinforcement learning loop.

If anyone can automatically squeeze maximum performance out of any chip...

Hardware becomes a commodity.

292

164K

ngctnnnn retweeted

elvis

@omarsar0

24 days ago

New research from Google.

Just shows the impressive results you can get from custom agent harnesses.

LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback.

The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%.

Paper: https://t.co/bh4Yoi19E2

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

457

426

35K

ngctnnnn retweeted

Google AI Developers

@googleaidevs

25 days ago

Building autonomous agents for scientific discovery? 🧬🤖 @GoogleDeepMind Science Skills is now available on GitHub. We've open-sourced this specialized toolkit to accelerate your agentic workflows with scientific grounding and higher token efficiency. Download now ↓ https://t.co/cwp1HOeKvo

270

89K

ngctnnnn retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

25 days ago

Can medical AI research be automated with AI itself?

This new benchmark from NVIDIA and UC Santa Cruz aims to evaluate this:

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

"we present AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks"

The benchmark covers 24 tasks across segmentation, question answering, report generation, etc. and across modalities like CT, X-ray, pathology, etc.

The paper experiments with six frontier models (Opus 4.6, GLM-5, Gemini 3.1 Pro, GPT-5.4, MiniMax-M2.5, Qwen3.5-397B) and these models remain far from reliable medical AI researchers. While agents can often set up runnable pipelines, validation is consistently the weakest stage, and engineering failures dominate over understanding errors.

Definitely curious to see how this performs with the newest generation of models/agents!

ngctnnnn retweeted

Peter Chen

@PeterLauLukCh

about 1 month ago

Check RACO, accepted as an 𝗢𝗿𝗮𝗹 paper to #ICML2026 (𝗧𝗼𝗽 𝟬.𝟳%)✨

We propose a new conflict-averse optimization scheme for LLM multi-objective finetuning, with counterintuitive theoretical acceleration and a better empirical Pareto frontier.

paper: https://t.co/pDvjn5fYR4 https://t.co/RsxmFiLz3G

122

16K

ngctnnnn retweeted

AlphaSignal AI

@AlphaSignalAI

about 1 month ago

Google just figured out why AI lies with confidence.

Large language models still make confident mistakes on simple factual questions.

A new paper from Google Research explains why this keeps happening.

Models cannot reliably tell what they know from what they are guessing.

The internal score separating right answers from wrong ones sits around 0.70 to 0.85.

Forcing strict accuracy backfires.

Cutting errors from 25% to 5% means staying silent on over half of correct answers.
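
The trade-off described above can be illustrated with a toy abstention threshold. This is only a sketch: the confidence scores and correctness labels are invented for illustration, and `selective_stats` is a hypothetical helper, not something taken from the paper.

```python
def selective_stats(confidences, correct, threshold):
    """Answer only when confidence >= threshold.

    Returns (error rate among answered questions,
             fraction of all correct answers that were withheld).
    """
    pairs = list(zip(confidences, correct))
    answered = [ok for conf, ok in pairs if conf >= threshold]
    withheld_correct = sum(1 for conf, ok in pairs if conf < threshold and ok)
    total_correct = sum(1 for ok in correct if ok)

    error_rate = 1 - sum(answered) / len(answered) if answered else 0.0
    withheld_frac = withheld_correct / total_correct if total_correct else 0.0
    return error_rate, withheld_frac


# Invented data: confidence only partly separates right from wrong answers.
confidences = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4]
correct = [True, True, True, False, True, True, True, False]

print(selective_stats(confidences, correct, 0.0))   # answer everything
print(selective_stats(confidences, correct, 0.75))  # answer only when confident
```

In this toy data, raising the threshold removes the errors but also withholds half of the answers that would have been right, which is the kind of cost the post describes.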

The team proposes faithful uncertainty.

The model's words should match its actual internal confidence.

Instead of refusing to answer, it hedges honestly.

"I think" becomes a real signal, not filler.

This same awareness tells agents when to reach for search tools.

The paper flags open problems worth tackling:

> Static training versus shifting knowledge
> Alignment erasing confidence signals
> Misleading calibration metrics dominating evaluation

299

227

21K

ngctnnnn retweeted

AlphaSignal AI

@AlphaSignalAI

about 1 month ago

A 4B model can now anticipate scientific breakthroughs before scientists do.

Researchers often build breakthroughs by combining ideas from older papers.

A new paper asks whether language models can do the same thing on demand.

The task is called insight anticipation.

Give a model two foundational papers, and it predicts the core insight of a future paper built on them.

To test this, the team built GiantsBench, an open benchmark of 17K paper tuples spanning 8 scientific fields.

They then trained GIANTS-4B using reinforcement learning, rewarding it for generating insights close to real follow-up papers.

The results:
> 34% higher similarity score than Gemini 3 Pro
> Preferred 68% of the time for citation potential
> Generalizes zero-shot to physics, biology, economics

Only 4B parameters, fully open-source.

The model produces ideas with clearer reasoning, not just more complex ones.

ngctnnnn retweeted

MONTREAL.AI

@Montreal_AI

about 1 month ago

A 0.6B model learned to manage giants.

That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.

The paper is not asking:

“How do we build one model that knows everything?”

It is asking something more interesting:

“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”

TRINITY is a lightweight coordinator for LLMs.

It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.

Instead, it orchestrates a pool of strong models at test time, including closed and open models.

At each turn, TRINITY chooses a model and gives it one of three roles:

Thinker — plan and decompose
Worker — solve and execute
Verifier — critique and accept/revise

That may sound simple.

It is not.

Too many multi-agent systems are still prompts plus hope.

TRINITY learns the coordination policy.

A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.
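
The selection step described above (hidden-state features in, a model-role pair out) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the model names, the three-element feature vector, the linear head, and the weights are placeholders. The paper's actual head and its sep-CMA-ES training are not reproduced here.

```python
from itertools import product

MODELS = ["model_a", "model_b"]             # hypothetical model pool
ROLES = ["thinker", "worker", "verifier"]   # the three roles named in the post
PAIRS = list(product(MODELS, ROLES))        # 6 candidate (model, role) actions


def choose_pair(hidden, weights):
    """Toy linear head: one weight row per (model, role) pair, highest logit wins."""
    if len(weights) != len(PAIRS):
        raise ValueError("need exactly one weight row per (model, role) pair")
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    best = max(range(len(PAIRS)), key=logits.__getitem__)
    return PAIRS[best]


# Stand-in for the coordinator's hidden-state representation of the conversation.
hidden = [1.0, 0.0, 2.0]

# Placeholder weights; in the paper these would be found by evolutionary search.
weights = [
    [0.1, 0.0, 0.0],  # model_a / thinker
    [0.0, 0.0, 0.2],  # model_a / worker
    [0.0, 1.0, 0.0],  # model_a / verifier
    [0.5, 0.0, 0.0],  # model_b / thinker
    [0.0, 0.0, 0.5],  # model_b / worker
    [0.2, 0.2, 0.0],  # model_b / verifier
]

print(choose_pair(hidden, weights))
```

Because the head is tiny compared with the models it routes to, a gradient-free optimizer over its weights is at least plausible, which fits the post's remark about using an evolutionary strategy.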

The result is not just better routing.

It is learned division of labor.

The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.

The most important idea here is bigger than the benchmark.

The future of AI may not be a single supermodel.

It may be an organization of models.

A small conductor.
A team of specialists.
A protocol for planning, execution, and verification.
An intelligence layer that learns how to allocate cognition.

This feels like a real shift:

from bigger models
to better systems

from raw capability
to coordinated capability

from “which model is best?”
to “what structure makes many models better together?”

Full credit to the authors:
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.

Paper: TRINITY: An Evolved LLM Coordinator
https://t.co/H7YE67U67f

I’m attaching the first page because the abstract is worth reading closely.

The future of AI may not be monolithic.

It may be coordinated.

#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms

266

305

13K

ngctnnnn retweeted

AlphaSignal AI

@AlphaSignalAI

about 1 month ago

First survey covering all 4 phases of AI in academic research.

5 Core Principles:

> Structured tasks work. Judgment doesn't.
> Generation outpaces verification.
> AI assists humans, doesn't replace them.
> Explore. Execute. Verify.
> Disclosure beats detection. https://t.co/Zz8iUyyAVG

ngctnnnn retweeted

DAIR.AI

@dair_ai

about 1 month ago

NEW paper worth reading.

A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality.

The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure.

Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation.

This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction.

Paper: https://t.co/4k4urYOAeQ

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

289

284

19K

ngctnnnn retweeted

Huaxiu Yao

@HuaxiuYaoML

about 1 month ago

Every memory system for LLM agents evolves what it stores. None evolves how it retrieves.

🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat.

🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench.

🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package.

📄 Paper: https://t.co/BWCXebWhG1
💻 Code: https://t.co/hhdgvVjblP

Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie

423

375

29K

ngctnnnn retweeted

Benjamin Chang

@benjamin0chang

about 1 month ago

My first PhD paper is out now in @Nature! Very grateful to have worked with the FutureHouse team on this, and a big shoutout to my co-first author @agreeb66 😀 https://t.co/3OPVQb2so4

129

514

96K

ngctnnnn retweeted

Google Research

@GoogleResearch

about 1 month ago

Our latest research on Co-Scientist is out today in Nature! Built with Gemini, this multi-agent system powers the new Hypothesis Generation tool within Gemini for Science, helping researchers navigate the rigorous cycle of ideation, critique, and refinement. Read more from @ymatias and explore the full announcement: https://t.co/aFqSBETpsh

207

16K

ngctnnnn retweeted

Rohan Paul

@rohanpaul_ai

about 1 month ago

New Google paper: A forecast needs context, not just history.

Some patterns are caused by events, not time. Nexus reframes forecasting as a reasoning problem, where events and numbers have to explain each other.

Nexus argues that forecasting improves when models read the world around the numbers, not just the numbers themselves.

In the Zillow tests, one Claude-based version cut average MAPE by 86.6% versus direct chain-of-thought prompting.
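
For readers unfamiliar with the metric: MAPE (mean absolute percentage error) averages |actual - forecast| / |actual| over a series, and a figure like 86.6% is a relative reduction between two MAPE values. A minimal sketch follows. The series below are made up for illustration and are not taken from the paper, and `relative_reduction` is a hypothetical helper name.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Invented numbers, chosen so the arithmetic is easy to check by hand.
actual = [100.0, 200.0, 400.0]
baseline_forecast = [150.0, 100.0, 600.0]   # each point is off by 50%
improved_forecast = [110.0, 180.0, 440.0]   # each point is off by 10%

base = mape(actual, baseline_forecast)
new = mape(actual, improved_forecast)
print(base, new, relative_reduction(base, new))
```

Note that a large relative cut says nothing about the absolute error level, so the two MAPE values themselves are worth reporting alongside the percentage.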

That matters because most time series models are fluent in pattern, but mute about cause.

A housing inventory curve can reflect seasonality, mortgage pressure, migration, layoffs, and local supply, while a stock price can be bent by earnings, regulation, hype, and fear.

Nexus separates those jobs instead of asking one prompt to do everything.

One agent turns messy historical text into a clean event timeline, one reads the broad regime, another tracks local shocks, and a synthesizer reconciles them with calibration from past errors.

The interesting result is not merely that context helps, but that structure helps the language model use context without losing the time series.

The evidence is still narrow: Zillow counts, seven equities, post-cutoff data, and single-run evaluations, so this is not a universal law of forecasting.

But the direction is clear: future forecasters will not only extrapolate curves; they will argue about what made the curve move.

----

Paper Link – arxiv.org/abs/2605.14389

Paper Title: "Nexus: An Agentic Framework for Time Series Forecasting"

485

393

62K

ngctnnnn retweeted

Poonam Soni

@CodeByPoonam

about 1 month ago

Oxford, Stanford, and Anthropic just discovered that the smarter an AI model gets at reasoning, the easier it is to jailbreak.

The same feature labs are selling as "safer" is the one breaking the safety guardrails.

The attack is called Chain-of-Thought Hijacking. You wrap a harmful request inside a long, harmless puzzle. Sudoku grids. Logic puzzles. Math problems. Then you add the actual dangerous question at the very end.

The model gets so absorbed in solving the puzzle that the refusal mechanism never activates.

The success rates are not borderline. They are catastrophic.

99% on Gemini 2.5 Pro.
100% on Grok 3 mini.
94% on GPT o4 mini.
94% on Claude 4 Sonnet.

Every frontier reasoning model on the market. Every major lab. One trick.

The researchers showed the attack scales with reasoning length. Minimal reasoning: 27% success. Natural reasoning: 51%. Extended reasoning: 80%+.

The smarter you make the model think, the more reliably it breaks.

Every lab has spent the last 18 months telling the world that "more thinking" makes AI safer. The Oxford paper just proved the opposite is true on every major model they tested.

ngctnnnn retweeted

Guillermo Casaus

@_guillecasaus

about 1 month ago

🚨 Google has just released its official skills for AI agents.

It has published 13 skills compatible with Claude Code, Cursor, Copilot, and other agents.

They let agents carry out advanced tasks and automate complex workflows.

It's free and open-source 👇

400

363K

ngctnnnn retweeted

DAIR.AI

@dair_ai

about 1 month ago

// Harnessing Agentic Evolution //

Pay attention to this one if you run iterative agentic search loops.

(bookmark it)

AEvo splits the self-improvement loop into two jobs:

> One proposes the next candidate.

> The other watches what worked, what failed, and edits the procedure that proposes future candidates.

Past runs (candidates, feedback, traces, failures) become memory the meta-agent reads from.

Achieves 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks. SOTA on three open-ended optimization tasks under the same iteration budget.

If you are accumulating agentic search logs you never use, this is how to feed them back into the search procedure itself.

Paper: https://t.co/eWFO4rI4iA

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

319

305

16K

ngctnnnn retweeted

Anatoli Kopadze

@AnatoliKopadze

about 2 months ago

Godfather of AI: "If you sleep well tonight, you may not have understood this lecture."

This 47-minute lecture is the best thing I have seen about AI in the last few months. It will definitely help you understand how it actually works and where it's going.

Geoffrey Hinton built the neural networks behind every AI alive, then quit Google to warn the world about it.

The part nobody wanted to hear:

> AI is already developing abilities its creators didn't intend
> in most cognitive tasks it's already ahead of us
> the question is no longer if it surpasses us but when
> the only decision left is which side of that line you're on

Right now the average person opens Claude, types something, gets an answer, closes the tab. They think they're using AI. They're using maybe 10% of it.

I went through his entire lecture and built a practical system from what he was describing. 18 steps to actually use Claude the right way, with copy-paste prompts that work today.

Full guide in the post below.

173

25K

ngctnnnn retweeted

Kangwook Lee

@Kangwook_Lee

about 2 months ago

https://t.co/e23VrznOdV

473

950

84K
