Another interesting Microsoft release, besides their models.
Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
https://t.co/0XTFzT1ukT
@classiclarryd if the target is a dense model, would it make sense to begin training as an MoE and then densify (maybe by pruning less important experts) for faster dense training?
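A rough sketch of the "prune less important experts" step, assuming importance is measured by how often the router picks each expert. That criterion, the `experts_to_keep` helper, and the toy routing log are all hypothetical; the question above does not specify any of them.

```python
from collections import Counter

def experts_to_keep(routing_choices, num_experts, keep):
    """Return ids of the `keep` most frequently routed experts (hypothetical criterion)."""
    counts = Counter(routing_choices)
    # Rank by descending usage; break ties by lower expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])

# Toy routing log: one chosen expert id per token (made-up data).
choices = [0, 2, 2, 3, 2, 0, 1, 2, 3, 3]
kept = experts_to_keep(choices, num_experts=4, keep=2)
print(kept)
```

Usage frequency is only one possible proxy; whether the surviving experts can be merged into a dense layer without hurting quality is exactly the open question.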
Jensen just launched NVIDIA Cosmos 3.
Pitched as the first fully open omnimodel for physical AI: a mixture-of-transformers (reasoning + generation) with native vision reasoning and generation across text, image, video, sound, and action.
Tops open-model leaderboards on physics, world generation, and action policy.
Three jobs in one:
- VLM for robots and autonomous vehicles
- world model that simulates environments and predicts future states
- backbone for world-action models trained on specific tasks
Three options:
- Super (32B): for post-training robotics models that need the highest physics accuracy and generation quality.
- Nano (8B): for high-quality video and action reasoning in fractions of a second.
- Edge: coming soon, for real-time inference at the edge.
Quiet quick release: VoxCPM2 running on the Apple Neural Engine. Using cached voices, I’m getting a decent ~0.5s TTFB and ~0.6 RTF on an M4 Air.
https://t.co/f3xaSzeijZ
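For readers unfamiliar with the numbers: RTF (real-time factor) is usually defined as synthesis time divided by the duration of the audio produced, so anything below 1.0 is faster than real time. A minimal sketch of that arithmetic, assuming that standard definition; the function names and the 10-second clip are illustrative, not from the release.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    return synthesis_seconds / audio_seconds

def synthesis_time(audio_seconds, rtf):
    """Estimated wall-clock time to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf

# Illustrative: a 10 s clip at the quoted ~0.6 RTF.
estimate = synthesis_time(10.0, 0.6)
print(f"~{estimate:.1f} s to synthesize 10 s of audio")
```

TTFB (time to first byte) is separate: it is the delay before the first audio chunk arrives, which is what matters for streaming playback.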
We’re training models wrong, and it’s due to ChatGPT. Even the modern coding agents used daily still rely on message-based exchanges: they send messages to users, to themselves (CoT), and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵 Be created by instruction-tuning for the stream format
🔵 Simplify user and tool-use UX, removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵 Be fast: they can predict and read tokens in all streams in parallel in each forward pass, improving latency
🔵 Encode a separation of concerns more easily, improving security
🔵 Provide a legible form of parallel/cont. reasoning through many internal streams. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post? :) Yes, but I don’t feel so bad about being outshipped by 23 hours by such a cool report on their side. I’ll link a 2nd thread below with a more direct comparison. I actually think the two are complementary in interesting ways.
restarted a convo (with V4's + 3 more papers) that was ≈48 hours old. cache hits
so they do store cache for "days", not minutes to hours
Gemini's default TTL is 1 hour, Claude's is 5 minutes
Nah bros I don't think they have > V4 kv efficiency, whatever Reiner Pope says
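The comparison above boils down to checking a conversation's age against a cache TTL. A tiny sketch of that check, using the TTL figures exactly as quoted in the post (not independently verified) and a simplified model that ignores any TTL refresh on reuse; `would_hit` is a hypothetical helper.

```python
from datetime import timedelta

def would_hit(age, ttl):
    """Simplified model: a cached prefix survives only while age <= ttl."""
    return age <= ttl

age = timedelta(hours=48)  # the ≈48-hour-old conversation from the post
ttls = {
    "Gemini default (as quoted)": timedelta(hours=1),
    "Claude default (as quoted)": timedelta(minutes=5),
}
results = {name: would_hit(age, ttl) for name, ttl in ttls.items()}
for name, hit in results.items():
    print(f"{name}: {'hit' if hit else 'miss'}")
```

Under those quoted defaults a 48-hour-old prefix would miss on both, which is why a hit at that age suggests retention on the order of days.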