How do we setup end-to-end testing and QA with this? Any advice? I mean already using gstack but is there any better way? how to write best PRDs with testing and QA prompts with MCPs/agents/Skills setup? @garrytan@Saboo_Shubham_
I don't think long running tasks are a problem anymore for agents, this is Codex running continuously for the last 53 hours, the problem now is how to define your problem in such a way that these hours are never wasted...@OpenAIDevs@ClaudeDevs@claudeai@OpenAI
I don't think long running tasks are a problem anymore for agents, this is Codex running continuously for the last 53 hours, the problem now is how to define your problem in such a way that these hours are never wasted...@OpenAIDevs@ClaudeDevs@claudeai@OpenAI
Qwen 3.5 has the best SLMs to fine-tune!
Its 4B model is really smart if you train it on a well structured dataset.
I fine-tuned the model on a 135M dataset generated by Codex 5.5 + DeepSeek v4 Pro.
I achieved 96%+ accurate results with Qwen 3.5 4B.
And 95% on Qwen 3.5 2B (that only requires 3.5GB RAM).
For context, on the same pipeline:
> Sonnet 4.6 achieved 89%
> GPT 5.4 Mini achieved 85%
> Haiku 4.5 achieved 72%
I don't trust evals, so I ran a 7000+ row hard-boundary test, and the results of Qwen 3.5 were consistent.
A 4B fine-tuned model beating a 20x bigger model in accuracy and latency is no joke.
It cost me $173 in total to generate the dataset and cover the cloud GPU cost to fine-tune both models.
I said this before, and I'll say it again: not everything requires a 1T-parameter LLM. We need ELMs (Expert Language Models) that are specialized for one domain only.
ELMs > LLMs.
I'll be writing more about how SLM fine-tuning works. So stay tuned.
Yes! my solo-authored paper Reward Hacking Benchmark was accepted to ICML :)))
We put LLM agents in a tool-rich sandbox, give them multi-step workflows, and measure when they solve the intended task vs take unexpected shortcuts (like monkeypatching files at runtime!)
1/3
India has never been short on talent.
We've been short on top-down focus in deep technology sectors.
@narendramodi ji picked Space Tech in 2020. Look what 5 years did: 300+ companies. $700M+ raised.
Skyroot — $100M, India's first private rocket
Pixxel — $95M, hyperspectral constellation live
Agnikul — $86M, world's first 3D-printed engine rocket Digantara — $50M, full-stack space surveillance
Dhruva, Bellatrix, and Galaxeye are building the rest.
And this week, GalaxEye put up the world's first OptoSAR satellite — India's largest privately built bird at 190 kg — on a Falcon 9.
Talent was never the bottleneck. Focus from the top was.
Me: are you stupid why cant you just choose....
ChatGPT: Relax, you’re right on the core point:....
Is it only me or is this low-key offensive at this point lol
I am proud to publish my personal stack but the coolest thing I have enjoyed so far is getting direct feedback from thousands of others who tell me what they want
And I can launch a fix that same day. Or even build the feature with them in mind.
https://t.co/xPjlf0WgWY
Dang! I have my first AI agent running. I built a small workflow for myself to identify spam emails using Google Studio. The best part about the tool is that you can define the rules. For example, this is one of the rules (image):
Spam emails are my biggest personal problem. I was wasting at least 30 minutes a day marking emails as spam, even with different filters. And I’m addicted to seeing my inbox empty.
So, to everyone who has been sending me unwanted emails: please spam my inbox now. 🙂
I use a very specific prompt to push Claude to check its work and do a lot of testing and thinking about perf and refactoring. I find I can do big features (4K LOC+ with full testing) in about an hour.