Yang

@Yangg40

Prev AWS 7x hackathon winner Research @UCLA Bridging research and product

Los Angeles

Joined May 2025

62 Following

1 Follower

24 Posts

Yangg40 retweeted

Diana

@sdianahu

2 days ago

the cost of shipping code went to zero

taste didn't

but "taste" sounds mystical and unfixable, so nobody teaches it.

here's the unmystical version: taste is just an eval you haven't written down yet

how you choose what to measure is what matters

1/8

616

557

69K

Yangg40 retweeted

Nikhil Pareek

@itsjustnikhil

2 days ago

What does it cost to evaluate 100% of our agent runs? If you're running an LLM-as-judge, the number that comes back is high enough that you end up sampling 10% and moving on.

But sampling 10% doesn't really make evaluation cheaper. The 90% you skipped still has broken outputs in it, you just don't see them until a user does. That's still a cost. It's not on the bill, it just shows up later on a different line, and nobody traces it back to the sampling call.

Here's what else doesn't show up on that bill: the engineer-months to build the pipeline, the time to keep it from rotting every time prompts change, the drift checks, the trace storage, the compliance audit when someone asks what you actually evaluated, the observability vendor charging you per span.

Add it all up and "what evals cost" stops being a token line and becomes a total-cost-of-ownership number. Usually a much bigger, much more uncomfortable one.

So we built a calculator that counts all these parameters. Put in your volume, your sampling rate, and what one bad incident costs you, and it shows your real annual cost end to end - then runs the same all-in math across every option: frontier judge, self-host, human review, and our own eval model (TURING), which comes out around 99% cheaper per call.

The honest surprise for most people: evaluating 100% of your traffic costs less than you think. Don't take the number from me though, run it on your own volume and sampling rate, frontier judge vs TURING side by side - https://t.co/7jpwnFqkil

175
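The total-cost framing in the post above can be illustrated with a small sketch. This is not the calculator linked in the post. Every field name, the cost model, and the simplification that failures in judged runs are caught at no incident cost are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EvalCostInputs:
    """Hypothetical inputs; none of these names come from the linked calculator."""
    runs_per_year: int            # total agent runs per year
    sampling_rate: float          # fraction of runs sent to the judge, 0.0 to 1.0
    judge_cost_per_run: float     # dollars per judged run
    failure_rate: float           # assumed fraction of runs with a broken output
    incident_cost: float          # dollars per broken output that reaches a user
    fixed_overhead: float = 0.0   # yearly pipeline upkeep, storage, audits, vendor fees


def annual_eval_cost(inp: EvalCostInputs) -> dict:
    """Return a breakdown of yearly evaluation cost under the assumptions above."""
    if not 0.0 <= inp.sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")

    # The visible line item: what the judge charges for the sampled runs.
    judge_bill = inp.runs_per_year * inp.sampling_rate * inp.judge_cost_per_run

    # The hidden line item: failures in the runs that were never judged.
    missed_failures = inp.runs_per_year * (1.0 - inp.sampling_rate) * inp.failure_rate
    hidden_failure_cost = missed_failures * inp.incident_cost

    total = judge_bill + hidden_failure_cost + inp.fixed_overhead
    return {
        "judge_bill": judge_bill,
        "hidden_failure_cost": hidden_failure_cost,
        "fixed_overhead": inp.fixed_overhead,
        "total": total,
    }
```

Calling `annual_eval_cost` twice, once with `sampling_rate=0.1` and once with `sampling_rate=1.0`, shows how the judge bill and the hidden failure cost trade off for a given volume. Which setting comes out cheaper depends entirely on the numbers you put in.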

Yang @Yangg40

5 days ago

@HenryL_AI Given the reasons weak models don't improve as much, it would be interesting to see whether we can train small models to use the harness better. Since that's already the case for models trained for agentic tool calling, maybe an SLM trained on harness use and tool calling is ideal for CS agents or similar.

161

Yang @Yangg40

5 days ago

@trajectorylabs Pretty cool stuff - though determining success/failure for external-facing agents seems difficult. A customer leaving a negative review because they didn't get the refund they wanted probably can't be classified as a model failure. Maybe use an LLM to analyze the traces?

164

Yangg40 retweeted

elvis

@omarsar0

5 days ago

Very good advice on self-improving agents. (bookmark it)

This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks.

What I have found is that stronger models do not always evolve better agents.

The current belief in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat.

New research shows that intuition is mostly wrong.

The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left.

This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side.

Paper: https://t.co/8kJwR7NhmV

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

742

109

55K

Yang @Yangg40

5 days ago

You can find the paper here: https://t.co/mlbWeqwNdo By Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, and Yang Liu (Nanyang Technological University and Zhejiang University)

Yang @Yangg40

5 days ago

We see it in 3rd-party repos/skills, but it makes me wonder how long until it's a default in code agents. CODESKILL (2605.25430) provides a method to automatically build and maintain a compact skill bank that boosts coding agent performance - no manual prompt engineering.

https://t.co/B7iUWEkINr

Yang @Yangg40

6 days ago

@serafimcloud Love the UI - will try it myself. If you want to collab, happy to give you an API that drives https://t.co/smUibuZO54 - it backfills all arXiv data daily - but it doesn't have the updated benchmark data you have. Looking to find an alternative after PwC

156

Yang @Yangg40

7 days ago

@Vtrivedy10 agreed - but imo the problem is not so much running the models on every arxiv paper (been there, done that) - but exactly the "connecting the dots" from high impact past the training cutoff... the model can't granularly connect paper X to product Y

Yang @Yangg40

7 days ago

@Vtrivedy10 Really cool concept - maybe if proprietary models grow in cost (which they probably will) - this becomes the sustainable future. How did you benchmark the performance of the smaller model?

Yang @Yangg40

7 days ago

@omarsar0 I love your work with DAIR and, funnily enough, I recently built an MCP to query arXiv: https://t.co/smUibuZO54 Since you're working on DAIR, wondering if you'd want to trade notes on research agents

132

Yang @Yangg40

7 days ago

@victor207755822 This is cool work! I'm also trying to build an auto-research agent - but I started with the retrieval first: https://t.co/smUibuZO54 How do you keep track of what the agent has/hasn't seen and how papers relate to each other over an extended time?

125

Yang @Yangg40

7 days ago

3)

Self-Correcting RAG https://t.co/2sgwYpyYv7

LLMs Should Express Uncertainty https://t.co/t27bdwFTCc

LatentAudit https://t.co/TCWfRMkQHT

Guaranteeing Knowledge Integration https://t.co/W9vWReX0EI

Facet-Level Tracing of Evidence Uncertainty https://t.co/KceT8z1RDb

Yang @Yangg40

7 days ago

1/ ~23,000 new AI papers hit arXiv in the last 2 months. I had AI rank all of them, clustered the top 250, and 3 themes jumped out that will make up the hot startups of 2027.

Yang @Yangg40

7 days ago

2)

The Defense Trilemma https://t.co/L73TwGIuyB

The Two Boundaries https://t.co/WOwp9QwlZT

The Granularity Mismatch https://t.co/hYva2BRNqS

Intent-to-Execution Integrity https://t.co/mM2srfK9Uc

Aligning Provenance with Authorization https://t.co/Ij7cYtACAY
