Our mission is to make it easy for anyone to deploy a robot to help them in the real world.
We wrote an intuitive guide to understanding modern robotics, aimed at readers who understand technology but not AI robotics.
We hope this short blog post leaves you with the core principles and sparks further curiosity.
BREAKING NEWS: Anthropic's latest model will NOT help you if it thinks your ML research/ML engineering is interesting, and/or will secretly degrade its IQ so that the average engineer won't notice. We are already seeing the moderation in Anthropic's latest model filter out our GPU inference research and programming 😭
For the past few years my research focus has been on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme:
Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs!
1/
Today we release a study on decoupling the benefits of subword tokenization for language model training, by simulating each suspected benefit one at a time inside a 1.7B byte-level pretraining pipeline.
We formulate seven hypotheses for why subword LLMs outperform byte-level LLMs (covering computational efficiency, structural priors over subword boundaries and positions, and the optimization objective) and implement each as a controlled intervention against a byte-level baseline. Three of the seven move the validation loss at this scale; the rest either have negligible effect or hurt.
Validated at 1.7B parameters on fineweb-edu with a LLaMA-3 architecture, with 68M-parameter replications in the appendix.
The work was led by Théo Gigant, Bowen Peng, and Jeffrey Quesnelle.
Paper: https://t.co/Blk7YdVLnc
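To make the computational-efficiency hypothesis concrete, here is a rough sketch of why byte-level inputs are longer than subword inputs. It is not taken from the paper: `byte_ids` and `word_pieces` are hypothetical helpers, and whitespace splitting is only a crude stand-in for a real subword tokenizer.

```python
def byte_ids(text):
    """Byte-level 'tokenization': one ID in [0, 255] per UTF-8 byte."""
    return list(text.encode("utf-8"))


def word_pieces(text):
    """Crude stand-in for a subword tokenizer (assumption): split on whitespace."""
    return text.split()


text = "Subword tokenizers shorten sequences."
b = byte_ids(text)
w = word_pieces(text)

# The byte-level model has to process this many more positions per piece.
bytes_per_piece = len(b) / len(w)
print(len(b), len(w), bytes_per_piece)
```

A real subword vocabulary sits somewhere between these two extremes, so the ratio here only shows the direction of the effect, not its size.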
Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks.
Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade.
Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.
The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.
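A toy sketch of the select-then-ablate idea described above, under loud assumptions: neurons are scored by the absolute difference in mean activation between the two prompt sets, and ablation means zeroing. The function names and the synthetic activations are hypothetical, and the scoring rule may differ from the one CNA actually uses.

```python
import random


def top_contrastive_neurons(pos_acts, neg_acts, fraction=0.001):
    """Indices of the neurons whose mean activation differs most between sets.

    Assumption: score = |mean over positive prompts - mean over negative prompts|.
    """
    n_neurons = len(pos_acts[0])

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    scores = [abs(mean(pos_acts, j) - mean(neg_acts, j)) for j in range(n_neurons)]
    k = max(1, round(n_neurons * fraction))
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def ablate(activations, neuron_ids):
    """Zero the selected neurons and leave every other activation untouched."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activations)]


# Synthetic MLP activations: two planted neurons respond to the behavior.
random.seed(0)
n_neurons = 2000
planted = [17, 1234]


def fake_acts(shift):
    row = [random.gauss(0.0, 0.1) for _ in range(n_neurons)]
    for j in planted:
        row[j] += shift
    return row


pos = [fake_acts(+3.0) for _ in range(8)]  # prompts eliciting the behavior
neg = [fake_acts(-3.0) for _ in range(8)]  # prompts eliciting its opposite

circuit = top_contrastive_neurons(pos, neg, fraction=0.001)
ablated = ablate(pos[0], circuit)
print(circuit)
```

In a real model the activations would come from forward hooks on the MLP layers, and the ablation would be applied during generation rather than to a stored vector.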
Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context.
It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss.
During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score all pyramid heads, a top-k cascade selects a small hierarchical dense sub-sequence, and after a sorting pass that enforces causality, standard attention handles token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model.
We validated this using 530M-parameter Llama-3 models across 50B tokens, with up to 1M-token benchmarks across 32 B200s under context parallelism.
The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.
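The pool, score, select, sort, attend pipeline can be sketched in a heavily simplified single-head form. This is not the released implementation. Everything here is an assumption made for illustration: one sequence serves as queries, keys, and values, pooling is a plain average, entries are scored by L2 norm, and a single flat top-k stands in for the cascade.

```python
import math


def avg_pool(seq, width=2):
    """Average adjacent vectors in groups of `width` (one pyramid level up)."""
    chunks = (seq[i:i + width] for i in range(0, len(seq), width))
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]


def build_pyramid(seq, levels=3):
    """Level 0 is the raw sequence; each higher level halves the length."""
    pyramid = [seq]
    for _ in range(levels - 1):
        pyramid.append(avg_pool(pyramid[-1]))
    return pyramid


def select_top_k(pyramid, k):
    """Score every pyramid entry, keep the k best, then sort by end position."""
    entries = []
    for level, vecs in enumerate(pyramid):
        span = 2 ** level  # number of raw tokens one entry at this level covers
        for i, v in enumerate(vecs):
            score = math.sqrt(sum(x * x for x in v))  # assumed score: L2 norm
            entries.append((score, (i + 1) * span - 1, level, v))
    best = sorted(entries, key=lambda e: e[0], reverse=True)[:k]
    # Sorting pass: order by last covered position so causal masking is valid.
    return sorted(best, key=lambda e: (e[1], e[2]))


def causal_attention(vectors):
    """Standard softmax attention with a causal mask; Q = K = V = vectors."""
    d = len(vectors[0])
    out = []
    for t, q in enumerate(vectors):
        past = vectors[:t + 1]
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in past]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, past)) / z
                    for j in range(d)])
    return out


seq = [[float(i % 3), 1.0] for i in range(8)]
pyramid = build_pyramid(seq, levels=3)
selected = select_top_k(pyramid, k=4)
mixed = causal_attention([v for _, _, _, v in selected])
print([len(level) for level in pyramid], len(mixed))
```

The point of the design is that the expensive step, attention, only ever sees the short selected sub-sequence, so an ordinary dense attention routine can be reused unchanged.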