Ashutosh Baheti @abaheti95 - Twitter Profile

Pinned Tweet

15 days ago

In 1945, Vannevar Bush imagined a machine to extend a scientist's memory. He called it the MemEx. 80 years later, we built one for LLM agents. Tool outputs become Python objects; only print statements reach the model's context. 🧵 https://t.co/YyrGsn3TB7

Ashutosh Baheti

@abaheti95

about 22 hours ago

Scriptable subagents are the key ingredient that makes dynamic workflows possible! At @DbrxMosaicAI, we built the MemEx harness, which allows composing subagents via Python code 🐍. Subagents can also return live objects for further manipulation. Read more here: https://t.co/YyrGsn3TB7
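
To make "composing subagents via Python code" concrete, here is a minimal sketch. It does not use the MemEx API. `summarize_table` is a hypothetical stand-in for a subagent, and a standard-library thread pool stands in for whatever the harness uses to launch subagents. The point is only that each subagent hands back a live Python object that the parent can keep manipulating.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_table(name, rows):
    """Hypothetical subagent: returns a live dict, not a text summary."""
    return {"table": name, "n_rows": len(rows), "max": max(rows)}


# Made-up inputs for illustration only.
tables = {"orders": [3, 9, 4], "refunds": [1, 2], "visits": [7, 7, 8, 2]}

# Start every subagent eagerly, then compose their results in plain Python.
with ThreadPoolExecutor() as pool:
    futures = {n: pool.submit(summarize_table, n, r) for n, r in tables.items()}
    results = {n: f.result() for n, f in futures.items()}

# The parent works on the returned objects directly.
largest = max(results.values(), key=lambda r: r["n_rows"])
print(largest["table"])
```

Only the final `print` would need to reach a model's context in a setup like this; the per-table results stay available as ordinary variables.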

Thariq

@trq212

1 day ago

https://t.co/R6exTuF7P8

Ashutosh Baheti

@abaheti95

10 days ago

@samdotb Check out MemEx from @DbrxMosaicAI. We built a programmable scratchpad for LLM agents that does just that and a lot more! https://t.co/FRKTls9xEa

abaheti95 retweeted

Prithviraj (Raj) Ammanabrolu

@rajammanabrolu

12 days ago

Ever wished we had fewer X-training hyphenates? Pre, mid, post, etc. Why not just Training?

Trying to bridge the divides (and get all our friends into one team again), we intro *Introspective X Training*, an offline RL inspired method that scales effectively across any LLM stage by annotating your data with a thinking reward generated language critique!

Up to 2.8x FLOP efficiency + 5-10 point score gains (esp with math and code) at any stage from scratch to 24T tokens on 8b (active) sized models!! We burned much compute ablating so you wouldn't have to.

Moral of the story is ‼️don't throw out any data via filtering, just feedback condition it‼️

You can spend FLOPs up front on inference to *classify* data quality and then train so that tokens aren't all treated equally based on the feedback starting early in training itself. Right now they're really only separated out much later during mid/post training.

This improves overall compute efficiency and gives us benchmark perf not possible with just baseline methods!

Paper here: https://t.co/9oSYwQEpbi

Thanks to @BrandoCui and @GXiming for leading this w/ @__SyedaAkter @davidjesusacu @hyunw_kim @jaehunjung_com Yuxiao Qu @shrimai_ @YejinChoinka

abaheti95 retweeted

Ben Clavié

@bclavie

15 days ago

Extremely excited to see this hit the timeline the same day I give a talk where I spend 2 minutes ranting about how As We May Think might be the most relevant essay to today's information retrieval world. And on top of that, it's great work going in the right direction!

abaheti95 retweeted

Ivan Zhou

@ivanzhouyq

15 days ago

We're pushing the frontier of enterprise agents that reason over massive amounts of structured and unstructured data at @databricks. A recurring barrier is that agents burn tokens reading data and grow fuzzy as their context fills up. MemEx is an elegant solution. It lifts performance on both frontier and smaller OSS models, while significantly cutting the cost and latency of complex agentic tasks.

abaheti95 retweeted

Databricks AI Research

@DbrxMosaicAI

15 days ago

New research from Databricks: the context window is the only persistent substrate today's LLM agents have, and it floods fast. A single SQL query can return millions of rows that ride along in every subsequent turn, even when only one cell ever mattered. We hit this constraint every day in the agents we run in production, from Genie to Agent Bricks' Supervisor Agent to KARL.

In a new post from the Databricks research team, we introduce MemEx: a programmable Python scratchpad that lets agents transform, slice, and persist tool outputs as typed objects in a live kernel. Same observe-act loop. Different action space.

Across nine frontier and open-weight models on two enterprise agentic tasks (OfficeQA Pro and Enterprise Structured Retrieval):
• Frontier models (Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro) gain 2 to 5 accuracy points at 25 to 30% lower cost
• Qwen 122B and Qwen 397B nearly double accuracy at 40 to 50% lower cost
• Four of the five points on the OfficeQA Pro cost-accuracy Pareto frontier are MemEx configurations

MemEx extends the code-as-action line (CodeAct, Anthropic Programmatic Tool Calling, Cloudflare Code Mode) with persistent scope across turns, eager spawn_agent for parallel sub-agents that share the parent's namespace, typed submit() for validated returns, and live-object scope injection. Built on aroll, the same Databricks agentic rollouts framework already powering those production systems.

MemEx is rolling out across Databricks first-party agents and Agent Bricks soon. If you build on Databricks agents today, you'll be able to try it.

Full write-up: https://t.co/WmyAQAmWEd
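
The idea of "persistent scope across turns" with only printed text reaching the model can be illustrated with a toy sketch. This is not the MemEx implementation or its API. `Scratchpad`, `run_turn`, and `fake_sql_tool` are hypothetical names chosen for illustration, and the sketch uses Python's built-in `exec` with a shared namespace in place of a real kernel.

```python
import contextlib
import io


class Scratchpad:
    """Toy persistent scratchpad: variables survive across turns."""

    def __init__(self, tools):
        # Tools are injected into the namespace as plain callables.
        self.scope = dict(tools)

    def run_turn(self, code):
        """Execute agent-written code and return only what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.scope)
        return buffer.getvalue()


def fake_sql_tool(query):
    # Hypothetical stand-in for a tool that returns a large result set.
    return [{"region": f"r{i % 4}", "revenue": i} for i in range(100_000)]


pad = Scratchpad({"run_sql": fake_sql_tool})

# Turn 1: the large result stays in a variable; only its size is printed.
obs1 = pad.run_turn("rows = run_sql('SELECT * FROM sales')\nprint(len(rows))")

# Turn 2: the same object is still in scope and can be sliced further.
obs2 = pad.run_turn(
    "total = sum(r['revenue'] for r in rows if r['region'] == 'r1')\nprint(total)"
)
```

In this sketch the 100,000-row result never appears in `obs1` or `obs2`; the strings a model would see contain only the two printed numbers. A real harness would also need sandboxing, which a bare `exec` does not provide.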

abaheti95 retweeted

Shubham Toshniwal @ShubhamToshniw6

15 days ago

Agents are bottlenecked by the current tool-calling based harness. Outputs get flattened to text, added to context, and re-parsed each turn. The model spends most of its tokens transcribing. We just shipped MemEx where the agent gets supercharged with a Python scratchpad!

Ashutosh Baheti

@abaheti95

15 days ago

🙏 Huge thanks to my co-authors @ShubhamToshniw6, Arnav Singhvi, @kristahopsalong, @seankski, @sam_havens, Jonathan Li, @Mdjxjxnsk, @j_nadan_chang, @WenSun1, @alexrtrott, @jefrankle, Xing Chen, and @matei_zaharia. MemEx is the future of agentic harnesses!

Ashutosh Baheti

@abaheti95

15 days ago

🚀 MemEx is rolling out across @databricks's first-party agents and Agent Bricks. Full write-up (numbers, design, trace analysis): https://t.co/YyrGsn3TB7

Ashutosh Baheti

@abaheti95

15 days ago

Same pattern for test-time scaling. We aggregated 8 Qwen rollouts of OfficeQA-Pro. The Tool Calling aggregator worked from lossy summaries (full traces don't fit in context). The MemEx aggregator received the full trajectories as scope variables, and won.
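
As a rough illustration of aggregating rollouts that are held as scope variables, here is a small sketch. The record layout and the majority-vote rule are assumptions made for this example; the thread does not say how the MemEx aggregator combines trajectories. What the sketch shows is that the full records remain addressable while only a compact summary is printed.

```python
from collections import Counter

# Hypothetical rollout records; real trajectories would be far larger.
trajectories = [
    {"id": i, "steps": ["search", "read", "answer"], "final_answer": ans}
    for i, ans in enumerate(["42", "42", "17", "42", "17", "42", "8", "42"])
]


def aggregate(rollouts):
    """Majority vote over final answers (an assumed rule, for illustration)."""
    votes = Counter(r["final_answer"] for r in rollouts)
    winner, count = votes.most_common(1)[0]
    # Full traces stay in scope, so supporting rollouts can be inspected later.
    supporters = [r["id"] for r in rollouts if r["final_answer"] == winner]
    return winner, count, supporters


winner, count, supporters = aggregate(trajectories)
print(f"{winner} ({count}/{len(trajectories)} rollouts)")
```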

Ashutosh Baheti

@abaheti95

15 days ago

📈 On complex long-horizon enterprise tasks like OfficeQA Pro and Enterprise Structured Retrieval: Frontier models like Opus 4.6: +5pp at 30% less cost. OSS like Qwen3.5-122B: doubles, 18% → 36%. Same agent. Same model. Same tools. Same prompts. Different action space.

Ashutosh Baheti

@abaheti95

15 days ago

🤖 We ran MemEx on the agents' OWN trajectories. An audit agent loaded 6 of them (3 MemEx, 3 Tool Calling) into Python scope and classified failure modes. MemEx had 2x fewer search/execution errors. Retrieval stays in variables, never copied between calls.

Ashutosh Baheti

@abaheti95

15 days ago

At Databricks, 🧞Genie hits this wall every day! Its queries span an entire workspace and pull data from tables, vector indices, and other sources via many tool calls. Here's how MemEx can convert complex workflows like these into streamlined code with far less token repetition.

abaheti95 retweeted

Julia Neagu

@julianeagu

19 days ago

I'm building a new team at @databricks AI Research and we're hiring. We're focused on one of the hardest open problems in AI right now: how do you measure and continuously improve agents that operate on enterprise data at scale. We're looking for founding engineers to build the flywheel that turns evaluation results directly into better agents — from development and training all the way to production. If you want to work on problems that actually matter at the frontier of AI research, I'd love to talk. Link in comments 👇

Ashutosh Baheti

@abaheti95

19 days ago

@KushaSareen @LakshyAAAgrawal @Cameron_Chann @rish2k1 @agarwl_ @Devvrit_Khatri @inderjit_ml @profjoeyg @KurtKeutzer Interesting. What is the performance in the no prompt case after FST? Does the model without GEPA prompt also improve as much as with the GEPA prompt?

Ashutosh Baheti

@abaheti95

23 days ago

Ash Ketchum is basically a phd advisor. He brings pokemon in, evolves them, and then just as they get good, he sets them free!

Ashutosh Baheti

@abaheti95

27 days ago

🧞 is out of the bottle and answering every enterprise question I throw at it. The pace of agent development has been incredible @databricks. Excited for what's next. Lots more to come!

Matei Zaharia @matei_zaharia

27 days ago

Genie has transformed how Databricks users work with data, with 3x the accuracy of generic agents. We're sharing some of the research behind it and what makes building data agents challenging. Super proud of our research team's impact with this! https://t.co/eLB2ElVo8S

abaheti95 retweeted

Databricks AI Research

@DbrxMosaicAI

about 2 months ago

Most enterprise questions don't live in one dataset. They span structured systems and unstructured sources like documents, reviews, and reports.

In our latest research, we show how Agent Bricks Supervisor Agent handles this by decomposing queries across structured and unstructured tools, then synthesizing results over multiple reasoning steps.

The results across STaRK and KARLBench: 20%+ improvement over SoTA baselines, with the biggest gains on tasks requiring tight integration of structured and unstructured data.

All built declaratively: no custom code, just precise instructions and the right tools. https://t.co/EBSM6iU89g

abaheti95 retweeted

Matei Zaharia @matei_zaharia

about 2 months ago

As AI reasoning gets good enough, we think memory will be the next bottleneck for agents. Can your agent improve with more experience? We call this Memory Scaling, and it's related but different from continual learning. A few examples and challenges: https://t.co/raIa0U7MPs
