Andriy Mulyar @Andriy_Mulyar - Twitter Profile

4 days ago

Today, we’re excited to announce our $50M Series B, led by @GreenfieldVC (formerly TPG Capital), with participation from @lightspeed and @notablecap. 🚀

At @PatronusAI, we develop simulations and evals to train and improve AI. The first phase of AI was built on static benchmarks, but that era is over now. As agents are used to solve longer and longer tasks, they need to practice in dynamic, living worlds to get better. Simulations are the critical infrastructure powering this next phase.

As a company, we’re behind the most influential research and products in AI evaluation, like FinanceBench, Lynx, and Percival. And things have moved at the speed of light since. ⚡ We partner with the world's leading frontier AI labs and enterprises, and our revenue has grown more than 15x over the past year.

Additionally, today, we’re introducing a preview of the first Digital World Model for AI agent training and simulation: Patronus-DWM. Digital World Models are language diffusion world models that predict realistic environment behaviors and steer agent actions across digital workflows. Just as physical world models predict how objects move through space, we’re developing the equivalent for the digital world: predicting how agents act in digital workflows, then using that to scale the creation of high-quality training data for LLMs. Digital World Models help us push the frontier of ultra long horizon workflows, and unlock a new class of self-improving RL environments. This is our scalable approach to simulating all of the world’s intelligence.

The round was also joined by @datadoghq, @SamsungVentures, @gokulr, @factorialcap, and a large cohort of amazing AI leaders and researchers across @AnthropicAI, @OpenAI, @GoogleDeepMind, @nvidia, @Recursive_SI, and more. ✨

It has been the ride of a lifetime. But we’re just getting started. The best is yet to come.

"Do not go gentle into that good night, Rage, rage against the dying of the light" - Dylan Thomas (1954)

26

270

23

142

37K

Andriy Mulyar

@andriy_mulyar

5 days ago

Exciting launch from Engram! Congrats @jxmnop!

Engram

@EngramLab

6 days ago

https://t.co/CGIef5lIBI

168

2K

220

1K

2M

2

0

1

1K

Andriy Mulyar

@andriy_mulyar

24 days ago

@ChainZenit skill issue

0

9

Andriy Mulyar

@andriy_mulyar

24 days ago

533 days ripping the first good coding agent harness!

eric zakariasson

@ericzakariasson

25 days ago

introducing cursor profiles! go claim your handle at https://t.co/6t5lg2jqvg

356

2K

96

716

711K

2

4

0

394


andriy_mulyar retweeted

Jaya Gupta

@JayaGup10

about 1 month ago

https://t.co/eUp5waUIwl

60

488

61

893

405K

Andriy Mulyar

@andriy_mulyar

29 days ago

a good harness is hard to find

1

2

0

288

andriy_mulyar retweeted

Jesse Michael Han

@jessemhan

about 1 month ago

most of the effort for @mathematics_inc's spherepacking formalization was spent on compressing and cleaning up a first-pass 500K LOC formalization to <200K LOC. creating infrastructure that can scale autoformalization to the frontiers of mathematics, software, and everything else is our #1 priority here - if this excites you, come work with us!

1

63

9

22

11K

Andriy Mulyar

@andriy_mulyar

about 1 month ago

skill issue, been switching between 2.5 and opus since 2.5 came out - great combo. there's a reason surgeons have both scalpels and saws in the same kit

Dan Kulkov

@DanKulkov

about 1 month ago

cancelled after 30 min of using composer 2.5 is fucking retarded

214

2K

33

256

539K

0

2

0

385

andriy_mulyar retweeted

Zachary Lipton

@zacharylipton

about 1 month ago

deep learning research was the original vibe math

14

310

23

31

74K

andriy_mulyar retweeted

Garry Tan

@garrytan

about 1 month ago

The companies I love working with in office hours are the ones where the founder has a specific, weird, earned insight that nobody else has. Not "AI for X." A genuine edge that came from living inside a problem. The ones that are dying almost always have the same pattern: technically competent founders building something nobody asked for, moving metrics that don't matter, avoiding the conversation with the one user who'd tell them the truth. The lucky thing is that 2nd type of founder can become the 1st kind if they don't stand still, they are willing to talk to people, try things, and always seek high rate of learning.

221

3K

214

1K

758K

Andriy Mulyar

@andriy_mulyar

about 1 month ago

Topics of interest include:
Subword Tokenization. Examination of current techniques such as WordPiece, BPE, and UnigramLM, as well as extensions to improve their efficiency and applicability.
Tokenization for Various Modalities. Techniques of tokenization for images, audio, and video. Study of representation alignment across modalities.
Multilingual Tokenization. Focus on ensuring tokenization methods are equitable and effective across various languages. Identification of relevant failure modes caused by tokenization.
Tokenizer Modification. Methods for updating tokenizers after model training to improve the model’s efficiency or performance without retraining from scratch.
Alternative Approaches to Represent Input. Investigation into alternative input representations for data such as patches, bytes, or pixels.
Tokenization and Statistics. Statistical analysis of subword properties. For instance, the study of compression effectiveness of different tokenization methods.

0

314

Andriy Mulyar

@andriy_mulyar

about 1 month ago

very cool new colm workshop on tokenization just dropped! The Second Tokenization Workshop (TokShop) at COLM 2026 aims to bring together researchers and practitioners from all corners of machine learning to explore tokenization in its broadest sense. https://t.co/frZXHC6mOa

1

8

0

7

1K

Andriy Mulyar

@andriy_mulyar

about 1 month ago

@adelbucetta @nomic_ai to speak to someone you must first get their attention. skip.

0

100

Andriy Mulyar

@andriy_mulyar

about 1 month ago

hiring a growth engineer at @nomic_ai the job: build agentic systems that get us in front of every built environment company in the U.S. (~20k companies). orchestrate agents to automate non-spammy outbound, ad campaigns, linkedin touchpoints, events, seo — all wired together. i've personally been building our internal gtm system myself from scratch the last 6 months with very impressive results - time to scale! this isn't a marketing role. it's an engineering role where your measured output is qualified customer calls and sign ups. if you like low latency feedback loops between prompt and customer demand surges this might be the role for you. link: https://t.co/mr8rxxkrqm

7

48

10

21

8K

andriy_mulyar retweeted

PatronusAI

@PatronusAI

about 2 months ago

Spotlighting our benchmark for agentic search: DETOUR which was accepted to ACL 2026 🎊!

When people try to recall something in conversation, they rarely give a perfect query upfront. They say things like “that movie with the scene where…” or “the paper about…” and the assistant has to ask the right follow-up questions to get there.

Existing search and agent benchmarks often miss this multi-turn, tip-of-the-tongue behavior. To more realistically evaluate it, we introduce DETOUR: Dual-agent based Evaluation Through Obscure Under-specified Retrieval, an interactive benchmark for dual-agent search and reasoning.

DETOUR contains 1,011 prompts across text, image, audio, and video. In the benchmark, a Primary Agent is evaluated on its ability to identify a target entity by querying a consistent Memory Agent, testing whether models can resolve ambiguity through useful follow-up questions.

Current state-of-the-art models still struggle: performance reaches only 36% accuracy across all modalities, showing that today’s agents remain weak at clarification-seeking in underspecified, real-world search settings.

We hope DETOUR helps push the next generation of search agents toward better reasoning, better questions, and more robust multi-turn retrieval.

arXiv Paper: https://t.co/obnKSnjgF0
@getdarshan @anandnk24 @rebeccatqian

1

13

3

863
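To make the dual-agent setup in the DETOUR post above more concrete, here is a toy sketch of a Primary Agent narrowing down a target by asking a Memory Agent follow-up questions. This is not the benchmark's implementation: the function names (`memory_agent`, `pick_question`, `primary_agent`), the attribute-dictionary representation, and the yes/no question format are all assumptions made for illustration.

```python
def memory_agent(target, question):
    """Hypothetical Memory Agent: answers yes/no consistently from the target's attributes."""
    attribute, value = question
    return target.get(attribute) == value


def pick_question(remaining):
    """Pick an (attribute, value) pair that splits the remaining candidates, if any."""
    for attrs in remaining.values():
        for attribute, value in attrs.items():
            matches = sum(1 for a in remaining.values() if a.get(attribute) == value)
            if 0 < matches < len(remaining):
                return (attribute, value)
    return None


def primary_agent(candidates, target, max_turns=5):
    """Hypothetical Primary Agent: asks follow-ups until one candidate is left."""
    remaining = dict(candidates)
    turns = 0
    while len(remaining) > 1 and turns < max_turns:
        question = pick_question(remaining)
        if question is None:
            break  # no question can separate the remaining candidates
        answer = memory_agent(target, question)
        attribute, value = question
        # Keep only candidates consistent with the Memory Agent's answer.
        remaining = {
            name: attrs
            for name, attrs in remaining.items()
            if (attrs.get(attribute) == value) == answer
        }
        turns += 1
    guess = next(iter(remaining)) if len(remaining) == 1 else None
    return guess, turns


# Made-up candidates, used only to exercise the loop.
candidates = {
    "A": {"type": "movie", "year": 1999},
    "B": {"type": "movie", "year": 2010},
    "C": {"type": "paper", "year": 2010},
}
guess, turns = primary_agent(candidates, candidates["B"])
```

In this toy version the score would simply be whether `guess` matches the target; a real benchmark would put language models behind both roles and work with free-form questions.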

Andriy Mulyar

@andriy_mulyar

about 2 months ago

@tmophoto 'good coding agents'

1

0

25

Andriy Mulyar

@andriy_mulyar

about 2 months ago

the fact you can't use good coding agents to debug why your internet connection isn't working is frustrating

1

2

0

620

Andriy Mulyar

@andriy_mulyar

2 months ago

more like startups are terrible at declaring incidents of infra downtime

Steve Derico

@stevederico

2 months ago

@GergelyOrosz github uptime since microsoft acquisition

4

150

12

15

11K

1

2

0

971

Andriy Mulyar

@andriy_mulyar

2 months ago

@f_ili_p_ziva Because our users don't get to read our codebase. Also docs that only document your code are crap. The real advantage is the agent can see code + marketing copy + relevant slack conversations internally to make sure the docs actually short-circuit getting users to value.

1

0

32

Andriy Mulyar

@andriy_mulyar

2 months ago

What's the most useful agent that you have running on autopilot? Mine is a product documentation housekeeper. Every day at 9am, it looks at all the codebase changes in the last 24 hours and identifies gaps in the product docs. It flags them in slack and proposes a fix.

1

2

0

605
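The documentation housekeeper described above could be approximated, very roughly, by a check like the one below. It is a minimal sketch under stated assumptions, not the author's system: `find_doc_gaps` and `format_report` are hypothetical names, the inputs are plain strings rather than a real repository or Slack connection, and "gap" is reduced to "a public top-level Python function that the docs never mention".

```python
import re


def find_doc_gaps(changed_sources, docs_text):
    """Return (path, name) pairs for public top-level functions missing from the docs.

    changed_sources: dict mapping file path to source text changed in the last day
    (how that dict is collected, e.g. from version control, is left out here).
    """
    gaps = []
    for path, source in changed_sources.items():
        # Top-level `def name(` only; names starting with "_" are treated as private.
        for name in re.findall(r"^def ([A-Za-z]\w*)\(", source, flags=re.MULTILINE):
            if name not in docs_text:
                gaps.append((path, name))
    return sorted(gaps)


def format_report(gaps):
    """Build the message a scheduler could post to a chat channel."""
    if not gaps:
        return "No documentation gaps found."
    lines = [f"- {path}: `{name}` is not mentioned in the docs" for path, name in gaps]
    return "Possible documentation gaps:\n" + "\n".join(lines)


# Made-up inputs, used only to exercise the functions.
changed = {
    "api.py": "def export_csv(rows):\n    pass\n\n"
              "def _helper():\n    pass\n\n"
              "def import_csv(path):\n    pass\n"
}
docs = "Use import_csv to load data."
report = format_report(find_doc_gaps(changed, docs))
```

A scheduled job would run this once a day and hand `report` to whatever posts messages; proposing the actual fix is the part an agent would add on top.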
