One year ago I accepted a $1M/year offer from @alexandr_wang to leave Uber to lead Scale's ML data engine optimization team.
The next week I reneged on it after @sdianahu gave @erikqu_ and me an offer to join the first-ever YC spring batch.
Now we serve the same customers that Scale does, and we do it better. Starting a company through YC has been the best thing to ever happen to us.
Apply to YC, even if you apply late. Happy to review your application now or in the future, DMs open!
Computer-use evals like OSWorld still don’t really test personal assistant use cases: logged-in accounts, user data, personalized workflows, or realistic desktop/web environments.
so we made MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents, with 184 tasks across 17 popular website clones, seeded with realistic user data and centered on Michael Scott’s hypothetical desktop.
It’s easy to adopt if you already use OSWorld-style runners - I view it as a personalization-focused, more realistic upgrade for CUA evals.
Website: https://t.co/sYplwXU40w
Paper: https://t.co/BmshW8P2TK
Code: https://t.co/xNpv6cspdl
Excited to see labs running SWE-Marathon for long horizon task evaluation!
As agents continue to solve more and more ambiguous tasks, CUA-as-a-judge may end up being the only way to validate whether the development of user interfaces actually takes effect!
🚀🚀🚀
Amy’s meticulous with most aspects of her life, but especially lately with her health.
A lot of the gap is not access to AI, but the literacy to use it well: bringing context, tracking signals, and becoming a better observer and collaborator in your own care.
I’m an AI researcher turned brain tumor patient, and recently I used the models to crack my mystery fatigue faster than my PCP could.
I believe everyone can do the same with their own symptoms. Here’s how:
Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch?
Rewrite a JAX codebase in PyTorch?
Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
Inference Chips for Agent Workflows
@sdianahu
Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result.
That gap is where purpose-built silicon wins.
Supply Chain 2.0 for Semiconductors
@sdianahu
A single advanced AI chip crosses a dozen countries and takes five months to build, managed mostly with spreadsheets and phone calls.
Real-time allocation tracking, multi-tier risk monitoring, and export compliance tooling barely exist, which is exactly why this is a startup opportunity and not a feature inside SAP.
Background Computer Use
Computer Use in Codex has some deep OS-level wizardry. Codex can see/click/type in apps in the background, without taking over your computer, and you can work in parallel.
@AriX and team absolutely crushed here. Windows soon.
interestingly this led to multiple DMs and intros.
the kind of talent required is basically “new grads with high agency”. this is significant bcz it's not a traditional role. you need to be:
> quantitative enough to understand what makes good training data / reward design
> operationally obsessive enough to manage a cluster of contractors
> potentially brand new out of college
and it makes sense: the domain is this new and moving this fast, so adaptation is the key.
If you want to learn the skills you’ll need to become a founder, our open roles are below!
We’re always looking out for strong engineers who have a knack for staying up to date with LLM training research.
👇
I grew up at my dad's medical practice. Quickly, I realized that clinicians just want to treat patients, not deal with administrative tasks.
So, my cofounder @nandaguntupalli and I are now building Taiga, a full-stack medical billing service for independent practices. We handle coding, claim submission, and denials so providers can focus on patients. We’re already working with practices and helping them resolve their trickiest claims.
I’ll be at Pri-Med West in Anaheim later this week. If you run or work at a small practice, I’d love to buy you coffee and learn about your billing workflow :)
https://t.co/7Dn7VKJLKP