• research scientist at @meta • virtualization is the name of the game rn • work on Linux kernel • dm me if you are building on ai agents or infra!
I don't think a lot of input tokens go into producing code. Almost all of the tokens spent in a mature product/codebase go to investigations.
For sending this patch: https://t.co/k7ludOHRr5
I ended up using roughly 1.2 million tokens across debugging verbose kernel debug logs, writing reproducers that never got checked in, and understanding the codebase.
Everyone wants multi-agent workflows, but the mental model required to build them is not straightforward. It's the same roadblock as multithreaded programming where you're trying to manage a bunch of distinct contexts at once. Until you nail the communication layer, the shared goals just get lost in translation.
Is anyone in my network currently working on Optimal Transport for generative modeling of multimodal data? I'm expanding my scope into this space and would love some recommendations for introductory papers and resources. What should I be reading?
There’s a massive blind spot in the benchmarks. By the time an issue makes it to GitHub with a reproducible state, 80% of the hardest engineering work is already done. Current benchmarks hand models extremely precise problem statements. But in the real world, like when debugging the Linux kernel, you rarely start out knowing what the problem actually is. All a user will report is “the app is OOMing, and increasing memory doesn’t help.” Digging into that requires intuition built from past issues. The root cause could be a memory leak, memory fragmentation, or a race condition where threads acquire memory and never release it, leading to starvation. We desperately need benchmarks with highly ambiguous starting conditions to test whether a model can navigate a state with multiple distinct root-cause scenarios. Right now, models like Opus easily get stuck in loops during open-ended investigations. They rarely move forward unless I ask them to check hypotheses A, B, or C. The next frontier for SWE evals should also include cases where the model has to figure out what's actually broken in the first place.
C is the most secure language right now because it has no package manager. Zero supply chain attacks. You want to know if a number is even? You don't npm install. You write x & 1 and manipulate those bits yourself.
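For anyone who wants the bit trick spelled out, here is a minimal sketch. The helper names `is_even` and `is_even_signed` are made up for illustration and are not from any library. The only assumption is a C11 compiler (for `static_assert`). The signed variant converts to unsigned first, since that conversion is defined modulo a power of two and so keeps parity intact for negative values.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Hypothetical helper: the lowest bit of an unsigned value is 0
 * exactly when the value is even. */
bool is_even(unsigned int x)
{
    return (x & 1u) == 0u;
}

/* Hypothetical signed variant: converting int to unsigned int is
 * defined modulo UINT_MAX + 1, which is even, so parity is preserved
 * for negative inputs as well. */
bool is_even_signed(int x)
{
    return ((unsigned int)x & 1u) == 0u;
}

/* Compile-time sanity checks of the same bit test. */
static_assert((6u & 1u) == 0u, "6 should test as even");
static_assert((7u & 1u) == 1u, "7 should test as odd");
```

Calling `is_even(10u)` gives true and `is_even_signed(-3)` gives false, no dependency tree required.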
@LeylaKuni @upInYerCommentz Assuming they could figure out a way to do the cabling, a lot of old office buildings are in downtown areas. There’s only so much power that can be drawn from the power lines today, and getting more power into DCs means all of the other buildings around them lose capacity.
LLMs are just like me during my entrance exam. Look at the options, think “a human definitely can’t walk more than 10 km in 20 minutes,” and then derive an answer!
🚨 New Paper! 🚨
One of my first Ph.D. papers found that LLMs can answer multiple-choice questions without seeing the question 🤔
At #ACL2026, I'm presenting a follow-up showing that current reasoning LLMs can still do this! And quite similarly to a clever test-taker 🧑🎓🧵