Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet.
This launch [excitement] includes audio tags! 🗣️🏷️ Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output!
The model supports 70+ languages (24 of which are high-quality evaluated languages, including Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below.
Have been doing some stem remixing work with Stable Audio 3.
- Medium model
- init_audio holding the original audio file
- init_noise_level between 0.4 and 0.5 seems to be the sweet spot
- Empty prompts
Introducing HRM-Text.
An ultra-lean 1B-parameter reasoning language model designed to deliver strong general performance with a fraction of the data, compute, and infrastructure.
Trained on just 40B structured tokens, HRM-Text achieves competitive performance while using ~1/1000 of the training data of comparable models.
The kicker? The full model trains in roughly one day on a $1,000 budget.
This opens the door to a new generation of AI that is powerful, accessible, and radically easier to adapt. Theories and research concepts once deemed too expensive to test are officially back in the game.
Sapient Intelligence invites you to help us shape a new paradigm for general intelligence.
Reproduced HRM-Text XL (1B).
Training completed in ~38 hours wall-clock on 16 H200 GPUs, and evaluation performance matches the numbers reported in the paper.
Great job, team!
W&B report:
https://t.co/bjoMNQ043k
Using Stable Audio 3 to generate variations of an existing loop.
Unconditional generation (no prompt), renoising the latents to 0.5, and just using different seeds seems to generate a nice neighbourhood around the original. Generally keeps the harmonic context and feel.
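A minimal sketch of the "renoise and reseed" idea above, in plain NumPy rather than the Stable Audio API. It assumes a linear (rectified-flow style) noising rule in which a noise level of 0.5 blends the latent half-and-half with Gaussian noise; the real model's schedule may differ. `renoise` and the latent shape are hypothetical. A sampler would then denoise each renoised latent from that level with no prompt.

```python
import numpy as np

def renoise(latent: np.ndarray, noise_level: float, seed: int) -> np.ndarray:
    """Blend a clean latent with seeded Gaussian noise (assumed linear rule)."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(latent.shape)
    return (1.0 - noise_level) * latent + noise_level * noise

# Hypothetical latent for one loop: (channels, frames).
original = np.random.default_rng(0).standard_normal((64, 256))

# One starting point per seed; each stays partly anchored to the original.
variations = [renoise(original, 0.5, seed) for seed in (1, 2, 3)]
```

Because half of every starting point is still the original latent, the variations share its broad structure while the seed decides the rest.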
Your RL post-training may be sabotaging your LLM's test-time scaling!
Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*.
We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test-time search performance, even on the original scalar.
Your drifting model is secretly a fixed point for the Wasserstein gradient flow on...
...the KL?
...an approximation to the Sinkhorn?
...Is it even a Wasserstein gradient flow at all?
https://t.co/QJLh86Hi0d
@liwenliang @agalashov @JamesTThorn @ValentinDeBort1 @ArnaudDoucet1
I gave it this video with a prompt like this. It also works with an image input.
Prompt: Generate a 3x3 split screen video based on different details you see here. Make each cell different, varying the perspective, composition, zoom, angle, camera movement (some static, some moving). Make some of the cells extreme close-ups with detailed textures. Keep it photorealistic, handheld, raw. Only natural sounds.
[preprint] Diffusion models meet MCMC!
Diffusion model samplers are biased due to discretisation
The fix: Metropolis-type adjustment on corrector steps
Challenge: no access to the density ratio, only the score
Insight: the score (and some maths) is all you need...
[1/3]
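To make the challenge concrete: a Metropolis-type accept/reject step needs the ratio p(y)/p(x), but the log-ratio equals the line integral of the score along the segment from x to y (gradient theorem). The sketch below approximates that integral by quadrature and uses it inside a Metropolis-adjusted Langevin step. It is an illustration under assumptions, not necessarily the construction in the preprint; `score`, `log_ratio_from_score`, `log_q`, and `mala_step` are hypothetical names, and the target is a standard Gaussian so the result can be checked.

```python
import numpy as np

def score(x):
    """Score of a standard Gaussian: grad log p(x) = -x (toy target)."""
    return -x

def log_ratio_from_score(x, y, n_points=33):
    """Approximate log p(y) - log p(x) with a trapezoid rule on the
    integral over t in [0, 1] of score(x + t (y - x)) . (y - x)."""
    ts = np.linspace(0.0, 1.0, n_points)
    d = y - x
    vals = np.array([score(x + t * d) @ d for t in ts])
    dt = ts[1] - ts[0]
    return dt * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def log_q(to, frm, step):
    """Log density (up to a constant) of the Langevin proposal frm -> to."""
    mean = frm + step * score(frm)
    return -np.sum((to - mean) ** 2) / (4.0 * step)

def mala_step(x, step, rng):
    """One Metropolis-adjusted Langevin step that only ever calls the score."""
    y = x + step * score(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    log_alpha = log_ratio_from_score(x, y) + log_q(x, y, step) - log_q(y, x, step)
    accept = np.log(rng.uniform()) < log_alpha
    return (y if accept else x), accept

rng = np.random.default_rng(0)
x = np.zeros(2)
samples = []
for _ in range(2000):
    x, _ = mala_step(x, 0.5, rng)
    samples.append(x)
samples = np.array(samples)
```

For the Gaussian toy target the integrand is linear in t, so the trapezoid rule recovers the log-ratio exactly; for a learned score the quadrature is only an approximation.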
Guide with examples, not rewards
Controlling what a pretrained generative model produces is still mostly a choice between three slow options: fine-tune it, attach a reward network, or search at inference. We found flow matching allows a fourth, and it costs almost nothing.
In deterministic interpolants, the velocity of the flow is determined by where the trajectory is headed: the endpoint mean. Shift that mean, and the entire flow shifts with it.
This turns control into a matter of reference. Change the examples that define the endpoint, and you change the direction the model follows. The examples need not be perfect. They only need to point the flow toward the attribute you want.
Color, identity, style, and structure, all controllable through examples. 🧵
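For a linear interpolant x_t = (1 - t) x_0 + t x_1, the velocity can be written as (E[x_1 | x_t] - x_t) / (1 - t), which is one way to see why shifting the endpoint mean shifts the whole flow. The toy sketch below assumes that linear form and, as a simplification, treats the endpoint mean as a constant vector rather than a learned function of x_t and t. `velocity` and `integrate` are hypothetical names, not the authors' code.

```python
import numpy as np

def velocity(x, t, endpoint_mean):
    """Velocity of a linear interpolant given the (assumed) endpoint mean."""
    return (endpoint_mean - x) / (1.0 - t)

def integrate(x0, endpoint_mean, n_steps=50):
    """Euler integration of the flow from t = 0 to t = 1."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_steps):
        t = k / n_steps
        x = x + velocity(x, t, endpoint_mean) / n_steps
    return x

x0 = np.array([0.3, -1.2])
base_mean = np.array([1.0, 1.0])
shift = np.array([0.5, 0.0])  # e.g. a direction derived from reference examples

end_base = integrate(x0, base_mean)
end_shifted = integrate(x0, base_mean + shift)
```

In this toy setting the trajectory lands on whatever endpoint mean it is given, so moving the mean by `shift` moves the result by the same amount without touching anything else.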
🧠 We introduce "Generative Recursive Reasoning"!
Recursive Reasoning Models like HRM, TRM, and Looped Transformers are deterministic: same input, same reasoning, every time. They collapse the entire space of plausible reasoning paths into a single attractor.
Our model GRAM (Generative Recursive reAsoning Models) turns recursion itself into a stochastic latent trajectory. Multiple hypotheses, alternative solution strategies, and inference-time scaling not just by depth, but by width: parallel trajectory sampling.
And here's the kicker: the same formulation that gives us conditional reasoning p(y|x) also makes GRAM a general generative model p(x).
With only 10M params:
• Sudoku-Extreme: 97.0% (TRM 87.4%)
• ARC-AGI-1: 52.0%
• ARC-AGI-2: 11.1%
• N-Queens coverage: 90%+
Paper: https://t.co/JC7EyXYc9Y
Project page: https://t.co/LRT1dQiWLZ
w/
Junyeob Baek @JunyeobB (KAIST),
Mingyu Jo @pyross0000 (KAIST),
Minsu Kim @minsuuukim (KAIST & Mila),
Mengye Ren @mengyer (NYU),
Yoshua Bengio @Yoshua_Bengio (Mila),
Sungjin Ahn @SungjinAhn_ (KAIST)
Check out our new work on examining what LLMs learn and when!
We posit that LLMs have an implicit curriculum where they learn gradually more complex skills, and attempt to uncover some details of how this curriculum develops over time across model families.
Real-world models are here! Stoked to share how we're bringing real-world locations to life by integrating Street View into Genie. Try it now at https://t.co/j6c1N38tRS and read the blog for more info:
https://t.co/6ZOi9d9rah
We're dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video.
It combines Gemini's intelligence with our generative media systems - representing a leap forward in world understanding, multimodality, and editing 🧵
Fixed-point iterations for parallelizing nonlinear dynamics are all the rage:
- Newton for RNNs
- Picard for diffusion models
- Jacobi for parallel decode of LLMs
But how do these techniques relate, and when should you use them?
We show you how in our new paper 🧵
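The shared idea can be sketched on a toy recurrence x_{t+1} = f(x_t): guess the whole trajectory, then update every time step at once from the previous guess, repeating until nothing changes. This is the plain Jacobi/Picard-style iteration, written as an illustration and not the paper's algorithm; `f`, `sequential_rollout`, and `parallel_fixed_point` are hypothetical names. Newton-type variants instead linearize f around the current guess, which typically needs fewer iterations.

```python
import numpy as np

def f(x):
    """Toy nonlinear transition."""
    return np.tanh(1.5 * x + 0.1)

def sequential_rollout(x0, T):
    """Reference: apply f one step at a time."""
    xs = [x0]
    for _ in range(T):
        xs.append(f(xs[-1]))
    return np.array(xs)

def parallel_fixed_point(x0, T, max_iters=None, tol=0.0):
    """Update all T steps at once from the previous guess of the trajectory."""
    max_iters = T if max_iters is None else max_iters
    traj = np.full(T + 1, x0, dtype=float)  # initial guess: constant trajectory
    for it in range(1, max_iters + 1):
        new = traj.copy()
        new[1:] = f(traj[:-1])              # one vectorized call covers every step
        if np.max(np.abs(new - traj)) <= tol:
            return new, it
        traj = new
    return traj, max_iters

seq = sequential_rollout(0.2, 20)
par, iters = parallel_fixed_point(0.2, 20)
```

After k iterations the first k steps are exact, so the loop needs at most T iterations; the payoff comes when it converges in far fewer, since each iteration is a single parallel call.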
Today we released the code for our CVPR 2026 paper, Flowception.
Flowception bridges fully bidirectional sequence modeling and autoregressive generation by inserting frames via learned order, then denoising them with continuous flow.
Website: https://t.co/BujPaGRJ7h
Code: https://t.co/VJNBFBvIbV