Coding benchmarks like SWE-Bench and the latest one, ProgramBench, are useful, but can AI coding platforms like @qwikbuild, @replit and @lovable actually build and maintain real-world web applications?
Introducing SWE-WebDevBench: a comprehensive eval framework to assess AI coding platforms as virtual software development agencies, covering not just the middle step of coding, but the entire software lifecycle: requirements gathering, planning, deployment and change management.
@basethesislabs is hiring researchers & engineers in blr.
we're a frontier lab focused on democratising the way humans interact with technology to maximise their potential.
if you're interested in building world models, continual learning and RL environments; apply for our research roles.
if you're a serious builder with proven impact in deploying production grade agentic-systems; apply for our engineering roles.
https://t.co/XZ71PKs8gO
P.S. we provide dedicated claude code accounts, unlimited console play hours, autonomy + everything else you get in your regular jobs!
🚨Hiring Member of Technical Staff at @Basethesislabs
We are a frontier lab focused on democratising the way humans interact with technology to maximise their potential.
If you are interested in building world models, continual learning, and RL environments, apply for our research roles.
If you are a serious builder with proven impact in deploying production-grade agentic systems, apply for our engineering roles.
PS: We provide dedicated Claude Code accounts, unlimited console play hours + everything else you get in your regular jobs!
https://t.co/NXgzSmbULk
silicon valley died a bit when they massacred travis kalanick.
when we let the journos and some men in suits decide what was the right way to do things.
good that nature is finally healing.
"Agents monitor the market, handle customers, execute decisions. You check in every few days."
Most founders we know are nowhere near this. Not because they don't want it - because nobody's actually helped them set it up and get it running tailored to their business.
We're organizing a hands-on AI workshop at our lab on 15th April 2026 specifically for this. @dhimant from @thebetterindia will also show you how their team uses ambient AI agents to automate aspects of operations, marketing, and invoicing.
Check out the workshop details here and register: https://t.co/ZRKdjWkR86
This will definitely be beneficial for startup/D2C founders, team leads and business owners.
things keeping me up at night about where AI is actually going:
1. "ambient businesses" are coming. basically, agents monitor the market, handle customers, execute decisions. you check in every few days. 7-8 figure businesses with almost no daily human input. we're early but it's happening.
2. you can now build a company in an hour. grab an idea, vibe code it, add stripe, get a customer. the old timeline was 12 months to first revenue. that's just gone.
3. the internet went app store era → API economy → agent economy. we're now in the part where agents hire other agents on the fly. fixed tech stacks are dissolving. nobody's built the glassdoor for AI agents yet.
4. vertical AI is replacing headcount. that's 10x the market that vertical SaaS ever touched. boring industries like insurance, construction, legal, elder care are the goldmine.
5. SaaS pricing is flipping from per seat to per result. someone is going to build a billion dollar business just by converting legacy SaaS companies to outcome based pricing
6. a whole graveyard of generic SaaS is coming. basic CRMs, analytics dashboards, template marketplaces, scheduling tools. agents just do it better. lots of incumbent saas that are generic and not reinventing themselves right now will struggle/reprice.
7. "human made" is becoming the new luxury. porsche already ran a 100% human made ad campaign. no AI is going to be a premium label like organic is for food. there's a real business in that certification.
8. IRL is having a renaissance. when everything is AI generated, being in a room with other humans becomes scarce. karaoke bars, escape rooms, live music, co-working. the experience economy is accelerating.
9. founder market fit is dead. founder agent fit is what matters now. can you direct a fleet of agents like a film director? that's the new unfair advantage.
10. ghost team org charts are coming. two real people, twelve agents with names, faces, personalities. your about page is going to look the same
11. 1000 true fans is now 100. agents cut your costs so much that 100 customers at $500/mo is a real solo business. micro monopolies across multiple niches. this is the playbook.
12. context window poisoning is the new phishing. cybersecurity hasn't caught up. agents have access to your files, email, bank accounts. bad things are going to happen. it's also a massive startup opportunity.
13. the window is open for maybe 12-24 months. then the moats get built like data, brand, trust, network
14. build cost is basically zero. audiences are underpriced. niches are wide open.
idk about you but i'm not sleeping much
so much opportunity
this is the most asymmetric time to be building a startup.
full episode on @startupideaspod to get your creative juices flowing (latest episode get it where you listen/watch pods)
no advertisers, just pure ideas to help you
im rooting for you
don't just bookmark, share with a friend
watch
Every industry has a version of this problem. You have massive markets and huge players, but the operating model is essentially stuck in the 90s. Employees still doing repetitive tasks and costs that keep climbing as organizations grow.
Workflow automation genuinely wasn't as good as it is now. Everything requires judgment and context that spans many different systems.
Now, multi-agent AI can own entire workflows end to end. Multiple specialized agents working across functions, spotting patterns, understanding context and accelerating work for you, while employees focus on growth.
This is going to be the standard now.
@sidgraph dove deep into Hermes Agent and dropped this article on its architecture.
He reveals the closed learning loop that lets it create skills from experience, intelligently curate persistent memory, and steadily build a deepening model of its user across sessions: an agent that truly grows with you.
Read the article below.
Hermes Agent is Killing OpenClaw and OSS is winning ♥️
i went deep into how hermes agent from @NousResearch handles memory and persona. ended up writing a full architecture breakdown.
the surprising thing isn't that it remembers; a lot of agents do some version of that. it's how it forgets.
when context gets too long, most agents just truncate. hermes does something different: right before compression kicks in, it gets one last chance to extract and save anything important to disk. a sentinel fires, an auxiliary model scans the conversation, writes to memory. then the middle turns get summarized away.
the agent comes out the other side with fewer tokens but more knowledge. compression as consolidation, not loss.
the memory budget is 3,575 characters. total. that's it. the constraint forces the agent to actually curate what it remembers instead of dumping everything into a vector db and hoping retrieval sorts it out later.
there's a lot more in the writeup (how it teaches itself reusable skills, how the 12-layer identity system works, how honcho models both the user and the agent simultaneously), but the compression trick is what stuck with me most.
Kudos to the team @Teknium, @sudoingX <3
link below 👇
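a rough sketch of the "save before you compress" idea described above. this is not hermes' actual code: the `Memory` class, `extract_important`, `compress`, the keyword check standing in for the auxiliary model, and the oldest-first eviction are all assumptions made for illustration. only the 3,575 character budget comes from the post.

```python
from dataclasses import dataclass, field

MEMORY_BUDGET_CHARS = 3575  # figure quoted in the post


@dataclass
class Memory:
    """Hypothetical fixed-budget memory store (not Hermes' real data structure)."""
    budget: int = MEMORY_BUDGET_CHARS
    entries: list = field(default_factory=list)

    def used(self) -> int:
        return sum(len(e) for e in self.entries)

    def save(self, note: str) -> None:
        # Assumption: evict oldest notes until the new one fits. A real agent
        # would curate more carefully than first-in, first-out.
        note = note[: self.budget]
        while self.entries and self.used() + len(note) > self.budget:
            self.entries.pop(0)
        self.entries.append(note)


def extract_important(turns):
    """Stand-in for the auxiliary model: keep turns that mention 'remember'."""
    return [t for t in turns if "remember" in t.lower()]


def compress(turns, memory, max_turns=4, keep_head=1, keep_tail=2):
    """Flush important middle turns to memory, then replace them with a summary."""
    if len(turns) <= max_turns:
        return turns
    middle = turns[keep_head:-keep_tail]
    for note in extract_important(middle):  # last chance to save before loss
        memory.save(note)
    summary = f"[summary of {len(middle)} earlier turns]"
    return turns[:keep_head] + [summary] + turns[-keep_tail:]
```

the point of the ordering is that extraction runs on the middle turns before they are replaced, so the shortened context and the saved notes together hold more than plain truncation would.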
LLM caching is criminally underused. You're sending the same 10k token system prompt on every request and wondering why your bill is insane. Cache it. Your wallet will thank you.
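A back-of-the-envelope sketch of why this matters. The function name, the $3 per million input tokens price, and the 10% cached-read rate are placeholder assumptions, not any provider's actual pricing, and any extra charge for writing to the cache is ignored.

```python
def system_prompt_cost(prompt_tokens, requests, price_per_mtok,
                       cached_read_multiplier=None):
    """Estimate spend on the repeated system prompt alone (requests >= 1).

    cached_read_multiplier is the assumed fraction of the normal input price
    charged for tokens served from the cache; None means no caching.
    """
    full = prompt_tokens / 1_000_000 * price_per_mtok  # one uncached pass
    if cached_read_multiplier is None:
        return full * requests
    # First request pays full price; later requests read from the cache.
    return full + full * cached_read_multiplier * (requests - 1)


uncached = system_prompt_cost(10_000, 100_000, price_per_mtok=3.0)
cached = system_prompt_cost(10_000, 100_000, price_per_mtok=3.0,
                            cached_read_multiplier=0.1)
print(f"uncached: ${uncached:,.2f}  cached: ${cached:,.2f}")
```

Under these assumed numbers, the 10k token prompt costs about $3,000 across 100k requests without caching and about $300 with it. Plug in your provider's real rates before drawing conclusions.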
Your eval suite is lying to you. Accuracy went up 2% but users are complaining more. Turns out optimizing for BLEU score doesn't optimize for "actually helpful." Metrics are a map, not the territory.
Voice models are getting really good. But good models on bad infrastructure produce bad experiences.
What's still broken:
1. Full-duplex conversation is functionally unsolved. Humans talk over each other constantly - interruptions, backchannels and overlapping speech.
2. Emotion detection degrades dramatically outside the lab. Speech emotion recognition hits 92%+ accuracy in controlled settings, but drops to 60–75% in real conditions.
3. Hallucinations cascade in ways unique to voice. When a text chatbot hallucinates, the user can see it and correct. When a voice agent hallucinates, the user can't scan back. Correcting mid-conversation is socially awkward.
4. Long-term memory across calls is 56% worse than humans. Remembering what a customer said last week should be table stakes. It isn't.
Read more here on how we can fill this gap as builders: https://t.co/NxKXilysKZ
@RaveenSastry @ashokns @thesisofsarthak @sidgraph
Every AI company we spoke with has been rebuilding the same broken infrastructure: multi-agent coordination that fails in production, memory systems that can't handle real conversations, voice interactions that feel robotic.
The gap between frontier AI research and what companies actually ship is getting wider, not narrower.
We're building the bridge to close that gap.
This is why we exist.
https://t.co/ABSqqC1E1A
@thesisofsarthak @RaveenSastry @ashokns
When you meet someone who remembers your birthday, recalls your dietary restrictions or references that comment you made six months ago about career aspirations, you don't feel like they're querying a database. You feel understood. Right?
Current conversational AI fails precisely here. Memory systems record comprehensively, but retrieve mechanically.
Last month, @Basethesislabs & @smallest_AI gave 19 teams of AI builders the same challenge - build memory that demonstrates understanding, not just recall.
We documented all 19 approaches and quantified the trade-offs. Read the entire investigation here: https://t.co/UFhPgbgORN
@thesisofsarthak @RaveenSastry @ashokns @varmashef @picardo_ria