1️⃣We are excited to open-source syftr: a powerful tool for automatically finding Pareto-optimal generative AI flows! syftr searches a large search space of agentic and non-agentic flows to surface optimal tradeoffs between accuracy, cost and latency.
🧵
@MainzOnX If the compiler does too much it takes a lot of effort to keep extracting perf from different generations of even same hardware (triton->gluon for example). If we keep enriching the DSL as hardware evolves we do less work in the compiler and let AI do more heavy lifting.
@MainzOnX Agreed with all the points above. My personal speculation is that rich DSLs coupled with a simple thin compiler which only implements the most obvious tricks that we don’t need LLMs to rediscover from scratch will win out. Humans still need to read code that AI generates.
we published autokernel on arxiv
inspired by @karpathy 's autoresearch, we applied the same keep/revert agent loop to GPU kernel optimization
you give it any pytorch model, it profiles it, ranks bottlenecks by amdahl's law, writes triton or CUDA C++ replacements, and runs 300+ experiments overnight with no human in the loop
- 5.29x over pytorch eager on rmsnorm
- 2.82x on softmax
- beats torch.compile by 3.44x on softmax and 2.94x on cross entropy
- #1 on the vectorsum_v2 B200 leaderboard
- single prompt triton FP4 matmul that beats CUTLASS by up to 2.15x
every candidate passes a 5-stage correctness harness before any speedup counts, and the whole thing runs at ~40 experiments/hour so you wake up to a faster model
arxiv: https://t.co/FjXSIG7qp9
github: https://t.co/45z8Z7nP3N
Introducing Critique, a new multi-model deep research system in M365 Copilot.
You can use multiple models together to generate optimal responses and reports.
Breaking News: The U.S. was responsible for a missile strike on an Iranian school, an ongoing military investigation found. The inquiry said the strike — which Iranian officials said killed at least 175 people — was the result of a targeting mistake. https://t.co/88FIdIJOQi
AI is writing a growing share of the world's software. No one is formally verifying any of it.
New essay: "When AI Writes the World's Software, Who Verifies It?"
https://t.co/8zjS9FkdA8
RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong."
Can we train LLMs on this human-AI interaction?
We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵
What if you could bring your task and have a system generate a set of AI workflows which work well out-of-the-box! No manual trial-and-error on which of the numerous agents (from single-agent to multi-agent workflows) to use. We built exactly that https://t.co/qrsbAXaHFs
We took a leaf out of this literature and find that by cross-pollinating search across many different datasets (metatraining), one can find a set of flows we term as "silver bullets" that can do well (in the Pareto-sense) *across* tasks *without* running syftr from scratch.
Does RL actually learn positively under random rewards when optimizing Qwen on MATH? Is Qwen really that magical such that even RLing on random rewards can make it reason better?
Following prior work on spurious rewards on RL, we ablated algorithms. It turns out that if you deploy algorithms like Reinforce and REBEL (a generalization of Natural Policy Gradient), RL does not learn under random rewards. These two simple algorithms simply behave as we would expect in this case.
GRPO and PPO indeed can behave strangely. They can learn positively or negatively, depending on different random seeds. The clipping heuristic introduces certain bias in the objective function, which causes such unexpected behaviors (this even happens in bandit which has nothing to do w/ LLM or reasoning). Perhaps it is time to abandon the clipping heuristic...
@allenainie Thank you for the kind words and for making Trace! We love how easy it is to weave into complicated workflows. Excited for what Trace is cooking next.
A different and interesting work from my ex-colleague Dey: How do you generate Pareto frontier for the agentic workflow?
Many practical applications must balance cost vs performance for agents and this pioneering work shows the way!
✨Meet syftr, a new OSS framework to find the best RAG workflows (both agentic and not) balancing cost/latency/accuracy using multi-objective Bayesian Optimization
1️⃣We are excited to open-source syftr: a powerful tool for automatically finding Pareto-optimal generative AI flows! syftr searches a large search space of agentic and non-agentic flows to surface optimal tradeoffs between accuracy, cost and latency.
🧵
5️⃣syftr is made possible thanks to:
Ray for distributed search orchestration. @anyscalecompute
LlamaIndex for building advanced workflows. @llama_index
HuggingFace Datasets for fast dataset interfaces. @huggingface
Starting with question-answering and actively expanding tasks