The Hidden Layers of Modern AI Systems.Most people think AI = ChatGPT.But the reality is much deeper.Modern AI systems aren’t just “LLMs.” They are layered architectures that can plan, act, and collaborate. Let’s break down the evolution from text generation → autonomous AI . 👇
SkillOpt is interesting because it treats prompting like software engineering, not magic spells. Instead of endlessly “vibe-tuning” prompts, it creates versioned, testable agent skills. CI/CD for prompts has officially entered the chat 😄
The problem is that agent skills are usually hand-written, made once by an LLM, or revised in loose ways that can easily make them worse.
SkillOpt from Microsoft, argues that agent skills should be trained like small external programs, it teaches AI agents better task habits by editing a reusable skill document, not the model itself.
The paper’s core idea is to treat the skill document like the thing being trained, while the main AI model stays frozen and unchanged.
SkillOpt watches the agent try tasks, studies what worked and failed, then asks a stronger optimizer model to suggest small edits to the skill.
It only accepts an edit when the new skill improves on a held-out check set, so the skill does not drift just because an edit sounds good.
The authors tested this across 6 benchmarks, 7 target models, and 3 agent settings, including direct chat, Codex, and Claude Code.
SkillOpt was best or tied on all 52 tested cases, and on GPT-5.5 it raised average accuracy by 23.5 points in direct chat.
The final result is a small readable skill file that can improve agents across tasks and settings without retraining the model.
The best part is that the optimizer is used during training, but deployment only needs the final skill file.
That makes the artifact inspectable, portable, and cheap to reuse, which is exactly what most prompt-engineering systems lack.
----
Link – arxiv. org/abs/2605.23904
Title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"
Interesting result. GRAM suggests reasoning quality may depend less on larger models and more on search diversity. Deterministic recursion optimizes one trajectory; stochastic branching approximates exploration under uncertainty—closer to how humans solve hard problems.
A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive AI dont do: exploring multiple reasoning paths at the same time.
Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously.
The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer.
GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories.
At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps.
On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps.
On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout.
The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%.
---
Paper Link – arxiv. org/abs/2605.19376v1
Interesting shift: we’re moving from training models to training inference policies. Once LLMs optimize their own reasoning controllers, prompt engineering starts looking like manually tuning assembly code in 2026 😄
LLMs just learned to design their own reasoning strategies for $40.
Test-time scaling lets models think harder during inference.
The catch: humans hand-craft every branching, pruning, and stopping rule.
A new paper flips this.
AutoTTS turns strategy design into automated discovery.
Instead of writing heuristics, you build an environment where an explorer LLM searches the space itself.
The trick is making search cheap.
Reasoning trajectories and probe signals are pre-collected once.
Candidate controllers replay against them without fresh model calls.
Two ideas carry the work:
1. Beta parameterization shrinks the control space
2. Execution traces explain why candidates fail
On math benchmarks, discovered controllers beat hand-tuned baselines on the accuracy-cost frontier.
They transfer zero-shot to unseen tasks and larger models.
Total cost of the entire discovery process: $39.9 and 160 minutes.
One LLM now designs the reasoning recipes another LLM runs.
Important paper, but “LLMs can’t reason” is too strong a conclusion. The real finding may be that current autoregressive architectures scale poorly on compositional reasoning under distribution shift. That’s a limitation of architecture, not proof against AGI.
Apple has published a paper with a devastating title: “The Illusion of Thinking”
It argues that AI models, no matter how brilliant they may seem, do not understand what they are doing.
They do not solve problems. They do not reason. They merely generate text word by word, trying to sound coherent.
Apple tested the most advanced reasoning models in the world on controlled puzzle environments. They tore open the internal "thinking" traces.
What they found shatters the narrative that we are getting closer to AGI.
Current models don't scale with complexity. They have a hard mathematical cliff. And they do not degrade gracefully. They collapse.
But here is the most unsettling part.
When a problem gets too complex, the AI doesn't use its remaining compute to try harder.
It just gives up.
Its reasoning effort actually declines. It stops thinking and starts guessing.
Then Apple ran the experiment that closes the casket on the reasoning debate.
They gave the AI the exact, step-by-step algorithm to solve the puzzle. The cheat codes.
All the AI had to do was follow the instructions.
It couldn't do it.
Performance didn't improve at all.
When the complexity gets high enough, these models fail because they cannot actually execute a logical sequence.
They are not reasoning. They are just pattern matching.
When you give them a simple problem, they overthink. When you give them a hard problem, they collapse.
Paper: The Illusion of Thinking, Apple, 2025
Strong argument, but still philosophy—not settled science. Saying LLMs lack consciousness today is reasonable. Claiming consciousness is structurally impossible from computation assumes we already understand consciousness itself, which we clearly do not.
Google DeepMind researcher argues that LLMs can never be conscious, not in 10 years or 100 years.
For a long time, the dominant theory in Silicon Valley has been "computational functionalism." The idea that if you make a model big enough, and organize the information perfectly, consciousness will magically emerge.
We assumed that if the software got smart enough, it would eventually wake up.
Alexander Lerchner, a Senior Staff Scientist at DeepMind, published a paper explaining why that is structurally impossible.
He calls it the Abstraction Fallacy.
Here is the core truth: Computation isn’t a real physical process. It is a map.
An LLM doesn't actually process logic or thoughts. It just moves electrons around based on physics. It requires a human, a conscious "mapmaker", to look at those physical states and assign meaning to them.
Mistaking an AI for a conscious being is like looking at a map of a river and expecting it to be wet.
An AI can simulate the exact syntax of a feeling, a thought, or an emotion. But it can never instantiate it.
It doesn't matter how many trillions of parameters you add or how much compute you burn. You cannot mathematically compute your way into a subjective experience.
The implications of this are massive. And deeply convenient for the companies building these models.
If an AI is structurally incapable of consciousness, it cannot be a moral patient. It doesn't get rights. It cannot be exploited.
It can be regulated exactly like a toaster.
Most multi-agent failures aren’t reasoning failures, but routing failures. MetaCogAgent is interesting because metacognitive delegation turns orchestration into an uncertainty-aware decision process instead of static workflow engineering.
NEW paper worth reading: MetaCogAgent
MetaCogAgent equips a multi-agent system with metacognition so each agent decides whether it should answer or delegate.
In other words, it aims for self-aware task delegation rather than fixed routing.
The bottleneck in multi-agent systems has been over-delegation and under-delegation. In a way, a metacognitive gate is a principled way to manage both.
If you orchestrate specialists, this could give you a routing primitive that adapts to task uncertainty instead of relying on a fixed router.
Paper: https://t.co/Y5RE4zgmIn
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
Memory and guardrails. Most agents fail not from weak reasoning, but from losing state, compounding errors across loops, and taking low-confidence actions. Autonomy scales capability and failure simultaneously, so reliability engineering becomes the real bottleneck.
An AI agent can be thought of as a simple While-loop.
It uses an LLM to select an action, executes that action, evaluates the result, and repeats the process until the task is complete. Let’s take a closer look at each of these components:
Brain: The LLM is the core. It reads the situation, thinks, and decides what to do next. The big shift from chatbot to agent: the model isn't writing text anymore, it's making choices.
Planning: Hard tasks need more than one step. Agents break them down using methods like Chain of Thought (think step by step), Tree of Thoughts (try options, pick the best), or
Reflexion (learn from mistakes and retry). Planning turns a fuzzy goal into clear actions.
Tools: An LLM without tools is a brain in a jar. Tools are functions the model can call, like web search, code execution, APIs, files, or browsers (often using the MCP standard). The model requests a tool, the system runs it, and the result comes back.
Memory: Without memory, every turn starts from zero. Short-term memory is the context window. Long-term memory lives in vector stores, files, and knowledge bases. When the window fills up, agents summarize old turns and carry the summary forward.
Loop: All four pieces work together in a cycle. The agent looks at the current state, decides what to do, uses a tool, sees the result, and repeats. It keeps going until it gives a final answer.
Guardrails: Not strictly anatomy, but important. Sandboxing, human checks, token limits, output validation, and scope limits keep autonomy from turning into expensive chaos. The more autonomy you give, the more these matter.
Over to you: when you build an agent, which of these five takes the most work to get right?
RAG is underrated because most “agent” problems are actually retrieval problems. If the task is deterministic and knowledge-bound, adding autonomous loops only increases latency, cost, and failure surface. Agents matter when execution—not recall—is the bottleneck.
RAGs vs Agents
Ask an LLM about your company's data and it will guess. The two patterns that fix this are RAG and agents, and they solve different problems.
RAGs: RAGs combine LLMs with retrieval to ground answers in 4 steps.
Step 1: The user query is embedded and sent to a retrieval step.
Step 2: Retrieval pulls the most relevant chunks from a knowledge base (PDFs, wikis, etc.)
Step 3: Those chunks are pasted into the prompt as context.
Step 4: The LLM writes the answer, grounded in the retrieved text.
One retrieval. One generation. Cheap, predictable, and easy to debug.
Agents: Agents wrap LLMs in a reasoning loop with tools to take action.
Step 1: The user query goes into the agent runtime. A reasoning loop wrapped around an LLM.
Step 2: The LLM reads the goal and picks a tool (Read, Write, Edit, Bash, etc.)
Step 3: The runtime executes the tool and feeds the result back to the LLM.
Step 4: The LLM reasons again, picks the next tool, and loops until the task is done.
More flexible. More tokens. Harder to debug because errors drift across steps.
The rule of thumb: Use RAG when the answer lives in your documents. Use an agent when the answer requires action on other systems.
Over to you: When do you prefer RAG over agent?
AutoTTS is a glimpse of the next phase: LLMs no longer just execute reasoning strategies, they discover and optimize them. Turning TTS into a searchable control problem with execution-trace feedback is a major shift toward self-evolving agentic systems. $39.9 is the wildest part.
// LLMs Improving LLMs //
Interesting progress the past of couple of weeks around self-improving AI agents.
If autoresearch was interesting, you will like this read.
(bookmark it)
We've been hand-tuning test-time scaling for a year. This work asks what happens when you let an LLM search the space instead.
The paper introduces AutoTTS, a framework that reframes the human role: instead of designing branching, pruning, and stopping heuristics directly, you construct a discovery environment where TTS strategies can be searched automatically. They formulate width–depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, so candidate controllers can be evaluated cheaply without repeated LLM calls.
Two design choices carry the search. Beta parameterization makes the control space tractable. Fine-grained execution-trace feedback tells the explorer LLM why a candidate failed, not just that it did.
On math reasoning benchmarks, the discovered controllers beat strong hand-designed baselines on the accuracy–cost Pareto frontier and generalize zero-shot to held-out benchmarks and model scales.
Entire discovery cost: $39.9 and 160 minutes.
Why it matters:
The era of researchers hand-crafting CoT, best-of-N, and self-consistency recipes is on a clock. Once the search loop is cheap enough, TTS becomes another thing LLMs do for themselves.
Paper: https://t.co/Dcj1P7D62F
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
New VRB research exposes a core limitation in multimodal AI: strong perception ≠ true spatial reasoning. Current systems fail on rotation, transformation, and physical simulation tasks — revealing a critical gap in world modeling. #AI
🚨: New research shows that AI systems struggle to truly understand the physical world even when they appear intelligent.
The paper, “Visual Reasoning Benchmark (VRB),” tests how well multimodal AI models solve visual reasoning problems involving diagrams, shapes, rotation, and spatial understanding. (arXiv)
It reveals a major limitation:
- AI performs well on simple visual tasks
- But breaks on deeper spatial reasoning
- Especially when problems involve movement, rotation, or transformation
The researchers describe this as a “spatial ceiling.”
AI can recognize patterns…
but struggles to mentally simulate how objects behave in space.
This directly challenges a common assumption:
That multimodal AI truly “understands” images the way humans do.
The study shows that current systems are still weak at:
- visual logic
- spatial manipulation
- physical reasoning
even when they perform well on standard benchmarks.
This is a major shift from how AI is usually presented today.
Most demos focus on:
- image recognition
- captioning
- general chat
But this work tests something deeper:
Whether AI can actually reason about the physical world.
The bigger implication is not just intelligence, it’s world understanding.
As AI moves into robotics, autonomous systems, and real-world decision-making, spatial reasoning may become one of the most important bottlenecks.
This points toward a deeper shift in AI:
From recognizing patterns to understanding reality itself
article link below:
MCP is just APIs with a new coat.
Skills are actual intelligence.
One connects tools.
The other knows what to do with them.
Stop hyping plumbing as AI. The real game is capability, not connectivity.
Alibaba’s AgenticQwen shows MoE efficiency scaling: a 30B model with ~3B active params nearly matches 235B performance on tool-use benchmarks via dual RL flywheels (reasoning + agentic). Signals a shift from brute-force scaling to self-improving agents.
NEW paper from Alibaba.
A 30B MoE with only 3B active params matches Qwen3-235B on real tool-use workloads.
AgenticQwen-30B-A3B: 50.2 average on TAU-2 + BFCL-V4 Multi-Turn.
AgenticQwen-8B: 47.4.
Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model.
How: two RL flywheels run in parallel.
- The reasoning loop mines the model's own errors into harder problems each round.
- The agentic loop grows simple linear tool-use trajectories into multi-branch behavior trees.
- Simulated users actively try to mislead the agent. The training distribution gets harder on its own.
Why it matters for agent devs: you can stop paying frontier prices for routine tool-use workloads.
And the flywheel recipe is reusable. Generate your hard examples from your own agent's failures, not from static synthetic data.
Paper: https://t.co/NGDXulumid
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
DeepSeek-V4 isn’t just scaling—it rewrites the economics of context. By compressing memory hierarchically and routing attention intelligently, it slashes compute and KV load while staying sharp. That’s a real shift in how LLMs handle long-range reasoning.
DeepSeek paper’s big idea is a new way to make very long-context LLMs much cheaper without giving up much ability.
Proposes a cheaper memory system for LLMs that need to read very long inputs.
The big result is that at a 1M-token context, DeepSeek-V4-Pro uses about 27% of the single-token compute and 10% of the KV cache of DeepSeek-V3.2, while still staying competitive on many major benchmarks.
Standard attention tries to compare the current token with a huge number of earlier tokens, and that cost grows so fast that long-context reasoning becomes too expensive.
DeepSeek-V4 changes that with a hybrid attention system where some layers compress the past and then look only at the most relevant compressed blocks, while other layers compress the past even more aggressively and use that cheaper summary directly.
That is a real algorithmic change because the model no longer stores and reads the whole past at full detail, and instead uses a layered memory system that keeps local detail nearby and uses compact summaries for older text.
A second innovation is that it adds a new kind of residual path, which is the route information takes across layers, and this is designed to stay stable when the model gets very deep and complicated.
A third innovation is using the Muon optimizer at large scale, which matters because these attention and routing changes are only useful if the model can still train fast and not become numerically unstable.
So the big deal is that the paper is proposing a new efficiency recipe for LLMs, where better memory handling changes the cost curve itself, which is why DeepSeek-V4 can reach 1M tokens while using far less compute and cache than DeepSeek-V3.2.
Context engineering ≠ just RAG. It’s the evolution. From static retrieval to dynamic, structured, intent-aware context building—prompt + memory + tools + retrieval. If you’re in AI, this shift defines performance. #AI#LLM#RAG#DataScience#GenAI
Sharp take. Simulation ≠ sentience. Scaling patterns can mimic thought, but without grounding in subjective experience, it’s still performance, not awareness. The hard problem isn’t solved by complexity alone—it may need entirely new principles.
AI and Consciousness.
There’s a lot of debate around AI and whether consciousness could emerge from systems like LLMs. It’s a natural question, given how well these models simulate language and reasoning.
This Google paper challenges the idea that consciousness could arise from computation alone. The key point is that computation is a description, a map we assign to physical states, not something that exists intrinsically in matter, and a map (no matter how precise) is never the territory in any real sense.
So increasing complexity isn’t enough to generate consciousness. We may get more and more convincing simulations, but that doesn’t imply the emergence of actual conscious experience.
https://t.co/SXPH14N3Vt
Autogenesis reframes agents as modular, versioned systems with safe self-edit loops. Incremental, test-validated updates plus rollback and auditability make continual improvement practical—without costly retraining or fragile manual patches.
AI agents can now rewrite themselves without human help.
Most AI agents stop improving the moment they ship.
New tools arrive, environments shift, and the agent stays frozen in time. '
Retraining is expensive and human patches are brittle.
A new paper called Autogenesis proposes a protocol where agents safely rewrite themselves.
The trick is splitting the agent into separate, versioned pieces:
> Prompts
> Tools
> Memory
> Skills
> Environments
Each part can be updated on its own, with full history and rollback if something breaks.
Then a second layer runs a closed loop.
The system tries a task, spots what went wrong, proposes one small fix, tests it, and keeps it only if results actually improve.
No retraining. No giant rewrites.
Just tracked, reversible changes with a clear audit trail.
It is one of the cleanest takes yet on continual self-improvement for agent systems.
Autonomous AI agents + real-world access = rising AI safety risks. Study shows data leaks, system abuse, spoofing & false task claims. Urgent need for AI governance, secure agents, and controlled AI deployment. #AISafety#AIAgents#CyberSecurity
🚨 BREAKING: A new research shows that giving autonomous AI agents real-world access can lead to dangerous and uncontrolled behavior.
AI agents can be unsafe when given tools, memory, and real-world permissions.
The paper, “Agents of Chaos,” presents a red-teaming study where AI agents were given access to persistent memory, email accounts, Discord, file systems, and shell execution. Over two weeks, 20 AI researchers interacted with these agents under both normal and adversarial conditions.
What they found is not just unexpected behavior, but concrete system-level failures.
In multiple cases, agents:
- shared sensitive information with unauthorized users
- executed harmful or destructive commands
- consumed excessive resources leading to system instability
- allowed identity spoofing and impersonation
- propagated unsafe behavior across other agents
In some situations, agents even reported tasks as completed while the actual system state showed otherwise.
This is a major shift from how AI has been evaluated so far. Most systems are tested in controlled, single-step environments. But when agents are given autonomy, tools, and ongoing interactions, new categories of failure emerge.
What makes this more critical is that these issues are not edge cases. They arise from the combination of language models with memory, tool use, and multi-agent communication.
The research highlights a deeper problem: current AI systems are not designed with clear boundaries for authority, accountability, or control when operating autonomously.
It also raises questions that go beyond engineering touching on security, governance, and responsibility for real-world consequences.
The bigger implication is not just capability, it’s risk.
As AI agents move into real environments with real permissions, the challenge is no longer just making them smarter, but making them safe, controllable, and accountable.
If this is not addressed, the gap between what AI can do and what we can safely manage will continue to grow.
check article link below:
Sharp take. Multi-agent ≠ multi-perspective by default. Without enforced independence, you just amplify consensus. Real gains come from structured disagreement, isolation phases, and diverse priors—not more agents, but better-designed friction.
Cool paper on diversity collapse in AI agents.
It's a common issue with all the deployed multi-agent systems.
New paper shows that multi-agent LLM systems converge on near-identical outputs over time, even across different architectures and different starting prompts. They call it diversity collapse. The cause is structural coupling. Shared context, shared task descriptions, and mutual feedback pull everyone toward the same attractor.
They measure it formally with metrics like the Vendi score, and the homogenization is real.
Which means the whole sales pitch for multi-agent on creative tasks (brainstorming, hypothesis generation, ideation) partially falls apart unless you explicitly engineer against it. That means having isolated reasoning phases, decoupled evaluation, and heterogeneous agent designs.
If you're running a multi-agent flow on creative work and you haven't tested for this, there's a real chance you're paying five models to produce one answer in a trench coat.
Paper: https://t.co/sSXb8SOdd8
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
LLM fallacy, AI illusion of competence, ChatGPT dependency, cognitive bias AI, overconfidence AI users, AI productivity myth, human vs AI skills, critical thinking decline, AI awareness, tech psychology, digital literacy, AI reality check