RLVR has become the recipe for agentic post-training. But for Computer-Use Agents, the bottleneck is not the algorithm, it is the data. 🐌
🚀 We introduce CUA-Gym: a scalable, lightweight synthesis engine that turns arbitrary task queries into verifiable RLVR data for computer-use agents. The largest open CUA RLVR dataset to date:
🎯 32,122 verifiable RLVR tasks with programmatic setup scripts + rewards
🌐 110 environments: 16 desktop apps + 94 synthesized mock web apps
🏆 Qwen3.5-based CUA models trained with GSPO reach 72.6% on OSWorld-Verified and 56.6% on WebArena
📄 Paper: https://t.co/cdvHJPzgb1
🏠 Homepage: https://t.co/kvhaOQxNx7
🤗 Dataset: https://t.co/w5vOIRdchR
💻 Codebase: https://t.co/CcRlNTlS1c
🧩 Environments: https://t.co/fNZ6YAI8LD
🧵[1/6]
Your agent shouldn’t just chat about work. It should use the apps where work happens.
We provide CLI Apps in nanobot via CLI-Anything. Install app adapters from Settings, mention them in chat, and let your agent use them safely.
Available today as a source preview, and coming in the next release.
This feels like a big step toward personal agents that actually do work.
remind me if anyone still cares about how multi-turn convs should be masked 🤔
last systematic review seems to be Instruction Tuning With Loss Over Instructions [Neurips'24]
Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments. 🐌
🚀 We introduce NanoRollout, a lightweight open infra (900 lines core code) for digital agent rollout at scale, validated with three workloads:
🏋️ Large batchsize (4K) SWE Agent RL -> surpasses DeepSWE-32B
🧪 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
⚡ Fast evaluation on coding/cua/unified agent -> finish
Check our Blog: https://t.co/IBNqqbLqra
Codex grew programmatic policies with no neural nets: max score on Breakout, and SOTA-level scores on MuJoCo.
Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm.
https://t.co/1ZaIneleuW
Code-as-a-service is eating software from the bottom up.
Anything that exists mainly to solve a narrow, repetitive workflow is vulnerable.
Disk cleanup. File triage. Log analysis. Batch transforms. Data cleanup. Internal glue tools.
I’m already doing these with Codex / Claude Code instead of dedicated apps. 😵
True or false: a lot of software isn’t a product moat, it’s just a temporary wrapper around a workflow that models can now execute directly
CLI-Anything × CLI-Hub v0.3.0 is out! A fun release for all of us.
With general-purpose agents plus CLI-Hub × CLI-Anything, we made one general agent handle complicated tasks that usually live in very different toolchains:
- a FreeCAD Curiosity-style rover
- a Blender orbital relay drone scene with motion
- a real 2026 game played through generated CLIs (shout out to @t1anyufan !!)
v0.3.0 brings together several pieces that made these demos easier to build, inspect, and share:
- meta preview bundles and trajectories
- updated skills and docs for agent usage
- more real-world complex software and services converted into agent-native forms
The common thread here is reachability. Once software becomes reachable to an agent, the agent can inspect it, control it, recover from mistakes, and gradually turn intent into artifacts.
CLI-Hub now includes 66 software harnesses across 30 categories, with more community contributions coming in. One fun signal from command usage: roughly 20% comes from humans, and 80% comes from agents.
A few takeaways from this round:
1. The CLI is only the transport layer.
Harness engineering is the elephant in the room: the hard and valuable work is turning existing software and services into agent-native surfaces that can be inspected, controlled, previewed, and recovered from.
2. Better harnesses improve agents without retraining them.
A good harness is like a bridge to a new world. The agent does not need a new brain for every destination; once the bridge is stable enough, it can carry over its existing planning, coding, debugging, and iteration skills, then use them to create new things in FreeCAD, Blender, video editors, games, and beyond.
3. Preview is for both humans and agents.
A preview trajectory gives the agent a feedback loop. It connects “I ran this command” to “the artifact now looks like this.” That makes long creative and build tasks less blind for agents, and much easier for people to follow.
4. CLI-Hub is becoming the distribution layer for these harnesses.
The goal is to move beyond one-off scripts for individual demos and make high-quality harnesses easier to find, install, test, improve, and reuse across projects.
Our current view: stronger harnesses are one of the most practical ways to expand what general agents can do today. Longer term, we expect agentic model training to absorb these environments and feedback loops more directly.
CLI-Hub: https://t.co/9yJ1yA0Kv4
Release: https://t.co/Q203cHUWyz
Code-as-a-service is eating software from the bottom up.
Anything that exists mainly to solve a narrow, repetitive workflow is vulnerable.
Disk cleanup. File triage. Log analysis. Batch transforms. Data cleanup. Internal glue tools.
I’m already doing these with Codex / Claude Code instead of dedicated apps. 😵
True or false: a lot of software isn’t a product moat, it’s just a temporary wrapper around a workflow that models can now execute directly
👩💻 A new tool for making Android apps is here: Android CLI!
It's the primary interface for Android development from the terminal and is designed to make your agents more efficient, effective, & capable of following the latest development best practices!