Jason Stanley

6 months ago

1/5 🚀Apriel-1.6-15B-Thinker: a 15B multimodal reasoner scoring 57 on the Artificial Analysis Intelligence Index - approaching the performance of ~200B-scale frontier models while remaining an order of magnitude smaller.

🧠Model weights: https://t.co/cynqpZaphz

📄Blog: https://t.co/vt2Egf712K

💬Chat demo: https://t.co/aHq62h3iXr

@SathwikTejaswi @sagardavasam @tscholak @NVIDIAAI @nvidianewsroom @togethercompute @turingcom @ArtificialAnlys



jstanl retweeted

6 months ago

2/5 Apriel-1.6 mid-training includes:
• Depth-upscaling with 35% curated high-quality data + 15% NVIDIA Nemotron + replay
• Multi-stage CPT for expanded synthetic reasoning + document/chart/OCR-style multimodal tasks
• 49K context length text-only CPT for long-context stability
All trained on the NVIDIA DGX GB200, totaling ~10K GPU-hours.

6 months ago

A lot of talk about AI for operations skips this architectural layer. If we don’t get the pattern right, we’ll build systems that are hard to steer or secure. I sketched four design patterns and their risk profiles here: https://t.co/YeGIge7acb


6 months ago

We’re starting to run high-stakes systems (grids, fabs, hospitals) with 3 layers at once:
- deterministic cores (simulators, rules)
- surrogate models (physics + ops)
- large reasoning models
My new essay is about the architecture between them: https://t.co/YeGIge7acb


6 months ago

Physics surrogates and operational surrogates are not interchangeable, and the difference matters.
- One emulates simulators; the other mirrors messy telemetry.
- They fail differently, can be poisoned differently, and need different monitoring.
- Bridging them and deciding who gets to veto whom is a governance problem.

6 months ago

Same ingredients, very different risk:
- Cores as hard gates around a planner
- Surrogates as drivers, cores as auditors
- Watchdogs wrapped around legacy systems
- Parallel models where disagreement is an escalation signal
Pattern choice defines the risk and attack surface.
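Two of the patterns named above can be illustrated with a minimal Python sketch. It is not taken from the linked essay: the names `Decision`, `hard_gate`, and `parallel_check`, the numeric limits, and the tolerance are all hypothetical, chosen only to show how a deterministic gate and a disagreement check might differ in shape.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Decision:
    action: str               # "apply" or "escalate"
    value: Optional[float]    # value to act on, if any
    reason: str


def hard_gate(proposal: float, lower: float, upper: float) -> Decision:
    """Core as a hard gate: a planner proposal outside fixed limits never passes."""
    if lower <= proposal <= upper:
        return Decision("apply", proposal, "within deterministic limits")
    return Decision("escalate", None, "proposal violates a hard limit")


def parallel_check(estimates: Sequence[float], tolerance: float) -> Decision:
    """Parallel models: a spread above `tolerance` is treated as an escalation signal."""
    spread = max(estimates) - min(estimates)
    if spread > tolerance:
        return Decision("escalate", None, f"models disagree by {spread:.2f}")
    mean = sum(estimates) / len(estimates)
    return Decision("apply", mean, "models agree within tolerance")
```

In this sketch the gate can only block or pass a single proposal, while the parallel check surfaces disagreement without deciding which model is right, which is one way the two patterns expose different failure modes.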

6 months ago

I end with design principles for AI + agents: make dependencies observable, structure incidents for cross-org patterns, and move from single-agent tests to portfolio stress. Full piece: https://t.co/KZR4L04P0I

6 months ago

Cloudflare’s outage and insurers pulling back from AI cover both point at the same thing: systemic risk built on shared models, gateways and infrastructure. New blog post on what we can learn from finance, aviation and cyber: https://t.co/KZR4L04P0I


6 months ago

In my blog post I look at three sectors that already treat systemic risk as first-class: finance (macro-prudential tools and stress tests), aviation (incident reporting and fleet-wide directives), and cyber (voluntary standards and federated sensing).

6 months ago

@harleyf @Shopify And as CTV says, Shopify has more than 1 merchant! 😅

6 months ago

New blog post on logit steering and sparse autoencoders (SAEs), how they reshape AI security + reliability, open new attack surfaces, and why cost-efficient monitoring matters. Internal Representations as a Governance Surface for AI https://t.co/Z85cWH9i24

6 months ago

Automated red-teaming is moving from fixed rubrics to learning evaluation systems. AMIS co-evolves jailbreak prompts and the judge’s scoring template. Great for dense signal—but are we optimizing for benchmark ASR or threat-modelled risk? https://t.co/A9FKYd38Gm

6 months ago

So many AI benchmarks lack strong construct validity, and even the ones that have it are used in a way that suggests teams misunderstand the construct they map to.

6 months ago

We are hiring a Sr Applied / Frontier Research Scientist focused on secure, trustworthy agents for enterprise. @ServiceNowRSRCH If you care about agent security & reliability, apply: https://t.co/87GLCfSkeG

6 months ago

@ServiceNowRSRCH is hiring a Sr Applied / Frontier Research Scientist focused on secure, trustworthy agents for enterprise. If you care about agent security & reliability, apply: https://t.co/87GLCfSkeG

jstanl retweeted

Alexandre

@alexpiche_

7 months ago

In-flight weight updates have gone from a “weird trick” to a must for training LLMs with RL in the last few weeks. If you want to understand the on-policy and throughput benefits, here’s the CoLM talk @DBahdanau and I gave: https://t.co/p3KMZLFg4l


jstanl retweeted

Rishabh Agarwal

@agarwl_

7 months ago

Don't sleep on PipelineRL -- this is one of the biggest jumps in compute efficiency of RL setups that we found in the ScaleRL paper (also validated by Magistral & others before)!

What's the problem PipelineRL solves? In RL for LLMs, we need to send weight updates from trainer to generator (to generate data from our latest policy being trained).

(Conventional PPO-off-policy) A naive approach would be to "start generators on a batch, wait for all sequences to complete, update the model weights for both trainers and generators, and repeat." Unfortunately, this approach leads to idle generators and low pipeline efficiency due to heterogeneous completion times.

(Pipeline-RL) Instead, we simply let the generators continue generating tokens, without discarding or finishing ongoing generations, whenever we need to do a weight update -- doing an "in-flight" weight update. As such, our KV caches for these generations would be stale (as they would come from an LLM with earlier copy(ies) of the weights), but this is ok (see below).
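The idle-generator problem described above can be illustrated with a toy simulation. This is a rough sketch under stated assumptions, not the PipelineRL implementation: the function names, the one-token-per-step timing model, and the `update_every` parameter are all hypothetical, and KV caches are not modelled. It only counts how many slot-steps are spent generating and how many weight versions contribute tokens to each sequence.

```python
def conventional_utilization(lengths, num_slots):
    """Batch-synchronous: every batch waits for its longest sequence."""
    busy = 0
    total = 0
    for i in range(0, len(lengths), num_slots):
        batch = lengths[i:i + num_slots]
        busy += sum(batch)                  # slot-steps spent generating
        total += max(batch) * num_slots     # slot-steps elapsed for the batch
    return busy / total


def inflight_simulation(lengths, num_slots, update_every):
    """In-flight: slots refill immediately; weights change every `update_every` steps."""
    queue = list(lengths)
    slots = [None] * num_slots              # each slot: [tokens_left, versions_seen]
    version, step, busy = 0, 0, 0
    versions_per_seq = []
    while queue or any(s is not None for s in slots):
        for k in range(num_slots):          # refill free slots without waiting
            if slots[k] is None and queue:
                slots[k] = [queue.pop(0), set()]
        for k in range(num_slots):          # one token per active slot
            if slots[k] is not None:
                slots[k][1].add(version)
                slots[k][0] -= 1
                busy += 1
                if slots[k][0] == 0:
                    versions_per_seq.append(len(slots[k][1]))
                    slots[k] = None
        step += 1
        if step % update_every == 0:        # weight update; generation keeps going
            version += 1
    return busy / (step * num_slots), versions_per_seq


lengths = [2, 8, 3, 8, 1, 8, 2, 8]          # heterogeneous completion lengths
sync_util = conventional_utilization(lengths, num_slots=4)
inflight_util, versions = inflight_simulation(lengths, num_slots=4, update_every=4)
print(sync_util, inflight_util, versions)
```

In this toy setting the in-flight variant keeps slots busier than the batch-synchronous one, at the cost that some sequences contain tokens produced under more than one weight version, which is the staleness the tweet refers to.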



jstanl retweeted

7 months ago

ServiceNow AI Research presents PipelineRL — one of the most impactful efficiency tricks in modern RL training. An elegant solution to a noisy, expensive problem. Worth the read 👇

jstanl retweeted