On DeepWiki and increasing malleability of software.
This started partly as a post of appreciation for DeepWiki, which I routinely find very useful and which I think more people would benefit from knowing about. I went through a few iterations of use:
Their first feature was that it auto-builds wiki pages for github repos (e.g. nanochat here) with quick Q&A:
https://t.co/DQHXagUwK0
Just swap "github" to "deepwiki" in the URL for any repo and you can instantly Q&A against it. For example, yesterday I was curious about "how does torchao implement fp8 training?". I find that in *many* cases, library docs can be spotty and outdated and bad, but directly asking questions to the code via DeepWiki works very well. The code is the source of truth and LLMs are increasingly able to understand it.
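The URL swap can be sketched in a few lines of Python. This is only an illustration: it assumes the target host is deepwiki.com and that the repo path carries over unchanged, and the karpathy/nanochat path below is used purely as an example input.

```python
from urllib.parse import urlsplit, urlunsplit


def to_deepwiki(url: str) -> str:
    """Rewrite a github.com repo URL to the matching DeepWiki URL.

    Assumption: DeepWiki mirrors the GitHub path under deepwiki.com.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {url}")
    # Keep scheme, path, query and fragment; only the host changes.
    return urlunsplit(
        (parts.scheme, "deepwiki.com", parts.path, parts.query, parts.fragment)
    )


# Example input (illustrative repo path).
print(to_deepwiki("https://github.com/karpathy/nanochat"))
```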
But then I realized that in many cases it's even more powerful not to be the direct (human) consumer of this information/functionality, but to give your agent access to DeepWiki via MCP. So e.g. yesterday I faced some annoyances with using the torchao library for fp8 training and I had the suspicion that the whole thing really shouldn't be that complicated (wait, shouldn't this be a Function like Linear except with a few extra casts and 3 calls to torch._scaled_mm?) so I tried:
"Use DeepWiki MCP and Github CLI to look at how torchao implements fp8 training. Is it possible to 'rip out' the functionality? Implement nanochat/fp8.py that has identical API but is fully self-contained"
Claude went off for 5 minutes and came back with 150 lines of clean code that worked out of the box, with tests proving equivalent results. That allowed me to delete torchao as a repo dependency, and for some reason I still don't fully understand (I think it has to do with internals of torch compile), this simple version runs 3% faster. The agent also found a lot of tiny implementation details that actually do matter, that I might have naively missed otherwise, and that would have been very hard for maintainers to keep documented: tricks around numerics, dtypes, autocast, meta device, and torch compile interactions. So I learned a lot from the process too. This is now the default fp8 training implementation for nanochat:
https://t.co/3i5cv6grWm
Anyway, TLDR: I find this combo of DeepWiki MCP + GitHub CLI quite powerful for "ripping out" any specific functionality from any GitHub repo and targeting it at the very specific use case that you have in mind, and it actually kind of works now in some cases. Maybe you don't download, configure and take a dependency on a giant monolithic library; maybe you point your agent at it and rip out the exact part you need. Maybe this informs how we write software more generally to actively encourage this workflow, e.g. building more "bacterial code": code that is less tangled, more self-contained, more dependency-free, more stateless, and much easier to rip out from the repo (https://t.co/iKJUoHiIpl).
There are obvious downsides and risks to this, but it is fundamentally a new option that was not possible or economical before (it would have cost too much time) but now, with agents, it is. Software might become a lot more fluid and malleable. "Libraries are over, LLMs are the new compiler" :). And does your project really need its 100MB of dependencies?
✨ The Definitive Guide to Testing LLM Applications by LangChain ✨
Reviewing agent responses can be a time-consuming and daunting process, from defining criteria for style and accuracy, to spotting new regressions.
After partnering with hundreds of companies to enhance their agents, we've put together a comprehensive guide of best practices for testing throughout the development lifecycle.
In this guide, you'll find:
• Tips for testing across the product lifecycle
• Methods for building a dataset & defining testing metrics
• Templates for evaluating agents, with visual examples
... and much more!
👉 Get the guide here: https://t.co/4d8j3ZpnqY
After reading a lot of real dev feedback on Codex 5.3 and Claude Opus 4.6, the biggest takeaway isn’t “which is better” - it’s how they change your behavior.
The real advantage isn’t the model.
It’s knowing when to use which - and having the discipline to review like it matters.
In just one week, the SOTA benchmark has been shattered THREE times:
📅 Nov 18: Gemini 3 Pro (76.2%)
📅 Nov 19: GPT-5.1-Codex-Max (77.9%)
📅 Nov 24: Opus 4.5 (80.9%) 🤯
We are witnessing history unfold in real-time. What a time to be building! 📈🔥
#AI #TechNews #SOTA
@cursor_ai
In the new update, Shift+Cmd+S no longer means “Save As.”
It means “Search Agents.”
My code is lost, but at least my agents are easy to find. 🫠 #DevProblems
Found a new best friend: caffeinate -d. My Mac display now stays awake as long as I do. No more unwanted naps during my late-night coding sessions!
#MacBook #Developer #Productivity
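For anyone who wants the same behavior from a script, here is a minimal Python sketch that builds a caffeinate command line. The helper names caffeinate_cmd and run_awake are hypothetical. It relies only on the -d flag (keep the display awake) and the -t flag (timeout in seconds), plus caffeinate's ability to wrap another command; it falls back to running the command directly when caffeinate is not installed.

```python
import shutil
import subprocess


def caffeinate_cmd(seconds=None, command=None):
    """Build an argv list for `caffeinate -d` (hypothetical helper)."""
    cmd = ["caffeinate", "-d"]  # -d: prevent the display from sleeping
    if seconds is not None:
        cmd += ["-t", str(seconds)]  # -t: hold the assertion for N seconds
    if command:
        # caffeinate stays active for as long as the wrapped command runs.
        cmd += list(command)
    return cmd


def run_awake(command):
    """Run `command`, keeping the display awake when caffeinate is available."""
    if shutil.which("caffeinate") is None:
        # Not on macOS (or caffeinate missing): run the command as-is.
        return subprocess.run(command).returncode
    return subprocess.run(caffeinate_cmd(command=command)).returncode


print(caffeinate_cmd(seconds=3600))
```

Calling run_awake(["python", "train.py"]) would keep the display on for the duration of that one command rather than indefinitely.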
+1 for "context engineering" over "prompt engineering".
People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. But in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few-shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting... Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down. Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits.
On top of context engineering itself, an LLM app has to:
- break up problems just right into control flows
- pack the context windows just right
- dispatch calls to LLMs of the right kind and capability
- handle generation-verification UI/UX flows
- a lot more - guardrails, security, evals, parallelism, prefetching, ...
So context engineering is just one small piece of an emerging thick layer of non-trivial software that coordinates individual LLM calls (and a lot more) into full LLM apps. The term "ChatGPT wrapper" is tired and really, really wrong.
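To make "packing the context window just right" concrete, here is a minimal Python sketch of one possible approach: greedily keep the highest-priority sections that fit a token budget. Everything in it is an assumption for illustration: the Section type, the priority scheme, and the word-count stand-in for a real tokenizer. It is not how any particular LLM app does it.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    text: str
    priority: int  # lower number = more important (hypothetical scheme)


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_context(sections: list[Section], budget: int) -> str:
    """Keep the most important sections that fit, emitted in original order."""
    chosen = set()
    used = 0
    # Visit sections from most to least important; skip any that would overflow.
    for idx, sec in sorted(enumerate(sections), key=lambda p: p[1].priority):
        cost = estimate_tokens(sec.text)
        if used + cost <= budget:
            chosen.add(idx)
            used += cost
    return "\n\n".join(
        f"## {sec.name}\n{sec.text}"
        for idx, sec in enumerate(sections)
        if idx in chosen
    )


sections = [
    Section("task", "Summarize the bug report in two sentences.", 0),
    Section("examples", "Input: crash on save. Output: Saving fails with a crash.", 1),
    Section("history", "word " * 50, 3),
    Section("retrieved", "The save handler writes to a temp file first.", 2),
]
prompt = pack_context(sections, budget=30)
print(prompt)
```

With a budget of 30, the long history section is the one that gets dropped: too much context is trimmed before the task description or examples are.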
Put Manus AI to the test for coding & research vs. tools like WindSurf/Claude & OpenAI DeepResearch. 🧐 Was it worth the hype (and the $$)? The results might surprise you. Full breakdown & cost analysis: 👇
https://t.co/EYxcQuVPAd
#AI #Manus #LLM #Coding #Developer #OpenAI
Smart Tool Selection 🧠 (1/4)
Not everything should be an LLM tool!
Building intelligent systems? Your architecture decisions matter more than you think.
A thread on why strategic method selection creates more responsive, efficient LLM applications...
#AIEngineering
Smart Tool Selection 🧠 (2/4)
Use traditional methods when:
Execution happens at predictable points
They run regardless of LLM decisions
Performance and latency are critical
Consistent behavior is required
These form your application's reliable backbone.
#DevTips
Smart Tool Selection 🧠 (3/4)
Convert to LLM tools when:
Execution depends on conversation context
Requires LLM judgment to decide "if/when/how"
Forms part of complex reasoning chains
Usage patterns are unpredictable
This is where LLMs truly shine!
#LLMDevelopment
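The split described in this thread can be sketched in Python roughly as follows. All names are hypothetical: sanitize and audit_log play the role of traditional methods that run on every turn, TOOLS holds what the model may choose to call, and fake_model is a stand-in for the LLM's "if/when/how" decision, not a real model call.

```python
from typing import Callable, Optional

# LLM tools: whether they run depends on the conversation.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda arg: f"order {arg}: shipped",
}


def sanitize(message: str) -> str:
    # Traditional method: runs on every turn, regardless of LLM decisions.
    return message.strip()


def audit_log(log: list[str], message: str) -> None:
    # Traditional method: predictable point, consistent behavior.
    log.append(message)


def fake_model(message: str) -> Optional[tuple[str, str]]:
    # Hypothetical stand-in for the LLM deciding whether to call a tool.
    if "order" in message.lower():
        return ("lookup_order", message.split()[-1])
    return None


def handle_turn(message: str, log: list[str]) -> str:
    clean = sanitize(message)      # always runs
    audit_log(log, clean)          # always runs
    decision = fake_model(clean)   # the model decides if a tool is needed
    if decision is None:
        return "no tool needed"
    name, arg = decision
    return TOOLS[name](arg)


log: list[str] = []
print(handle_turn("  where is order 42 ", log))
print(handle_turn("hello", log))
```

Logging and sanitizing never pass through the model, so they stay fast and deterministic; only the order lookup is exposed as a tool.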
We've just unveiled ERNIE 4.5 & X1! 🚀
As a deep-thinking reasoning model with multimodal capabilities, ERNIE X1 delivers performance on par with DeepSeek R1 at only half the price. Meanwhile, ERNIE 4.5 is our latest foundation model and new-generation native multimodal model.
Plus, our AI chatbot ERNIE Bot has now been made free to individual users ahead of schedule. Both models are now freely accessible to all ERNIE Bot users via its official website: https://t.co/hJjfLaKsEN.
Been digging into LangMem for my blog—cool way to give agents long-term memory so they learn over time. Check it out: https://t.co/zGRdxc6RtE
#AI #Tech #langchain #langmem
Struggling with Python chaos? Meet UV, my new go-to for lightning-fast package management. Say goodbye to pip woes & hello to locked deps & seamless workflows. Read why it's a game-changer + grab my cheatsheet: https://t.co/zlv8hiAqNY #Python #UV #Coding
Claude 3.7 Sonnet drops today: 70.3% on SWE-bench, hybrid reasoning that flips between fast and deep, 45% fewer refusals, and pro-level coding for web apps. Anthropic just redefined AI power. #AIRevolution #CLAUDE37