Dhruv Singh

@ds3638

cto at @honeyhiveai

Joined September 2021

526 Following

164 Followers

713 Posts

ds3638 retweeted

Akshat Bubna

@akshat_b

9 days ago

Pretty sure 50% of internal token spend is completely useless, but right now it's hard to know which 50%. As an admin I'd love a dashboard that breaks down each person's spend into summarized clusters. Much easier to spend more when you can draw a clear line to value.

195

53K

ds3638 retweeted

Kyle Mathews @kylemathews

11 months ago

Excited to release a new starter! Sync-first, production-ready, and blazing fast - @tan_stack Start - @tan_stack DB for blazing fast client-side queries & optimistic mutations - @ElectricSQL for real-time sync from PG - @better_auth - @DrizzleORM - @tailwindcss - Hono OpenAPI

453

445

65K

ds3638 retweeted

TBPN

@tbpn

15 days ago

METR's @ajeya_cotra says the best way to manage catastrophic AI risk is to set up a "sensible auditing regime that's technically literate," which involves auditors embedded in the frontier model providers. "You don't want a box-checking auditor that has like 17 arbitrary things you're supposed to do." "My best guess is that it's going to look like something that happens in the financial sector, where you have embedded auditors. Folks who are experts in finance, who sit and eat lunch with the employees, see all the books, know everything, and have a lot of flexibility to investigate what they need to." "We're really hoping to move more and more in the embedded direction. Embedded auditing of the monitoring system, potentially even embedded auditing of training."

10K

ds3638 retweeted

pedram.md

@pdrmnvd

16 days ago

the year was 2024. you wanted to build an ai chatbot. you installed chroma db locally. you couldn’t figure out how to deploy it so you switched to pgvector. you read a paper on RAG. you spent $4.82 by calling an embedding api after realizing you couldn’t figure out how to get BAAI/bge-large-en-v1.5 working with your broken cuda packages. nvidia stock was overpriced at $90 you’re sure of it. you converted all your documents to embedding. you googled cosine similarity. you called the claude 3 sonnet model api and ran out of context after 8k tokens. you’re deep into reading langchain docs and confused. maybe something called llama index might work. it took four days to prototype but at least github copilot has killer autocomplete. your responses are shit but fortunately openai has a fine tuning api that will help. surely in a few weeks you’ll have something to show your boss, and the answers will be hallucination free. life is good.

174

68K

Who to follow

HoneyHive

@honeyhiveai

The observability layer for agents in production.

Pranjal Aggarwal ✈️ ICLR'26

@PranjalAggarw16

PhD Student @LTIatCMU. Working on computer-use agents, code generation and reasoning. Prev: research scientist intern @AIatMeta FAIR, undergrad @IITD

ds3638 retweeted

3 months ago

https://t.co/3xw3NjttVd

582

112K

ds3638 retweeted

Matt Pocock

@mattpocockuk

25 days ago

A flow I just tried and LOVED: 1. /grill-with-docs, talking about a new bit of UI 2. Asks me a question I can't answer unless I prototype 3. /prototype 4. Iterate on the prototype, burning tokens freely until we get a good spot 5. /rewind to the question, and select 'summarize' (Claude Code feature), saying 'summarize what we learned from prototyping' 6. Continue the grilling session, retaining the prototype Smoooooooth

106K

Dhruv Singh

@ds3638

17 days ago

@yoheinakajima Still going to try, but it does feel like something that’ll get eaten up by the attention matrix voodoo magic

Dhruv Singh

@ds3638

17 days ago

I think ontology on write has never been solved after years of research. Best case scenario is an event log that can be reasoned through / linked easily. In practice, setting a strict hash to get proper lineage on incomplete/changing data is a strong bottleneck. I’ve tried building logical consistency checks to “simulate the ontology”, but I do think it’s ultimately a tagging / versioning problem at scale. You can’t even reuse successful agent trajectories because source data and tool schemas change so soon.

553

ds3638 retweeted

Yohei

@yoheinakajima

17 days ago

stateful agents, decision traces, context graphs… talked about a lot, but has anyone seen an elegant primitive around how to actually implement?

149

123

19K

ds3638 retweeted

Matthew Berman

@MatthewBerman

20 days ago

i'd watch a whole season of this

362

Dhruv Singh

@ds3638

20 days ago

Evals are dead. Or more precisely: traditional eval-driven development doesn’t scale. Static evals were useful when agents were short-lived and bounded, but once agents are running for hours and taking thousands of actions + operating autonomously, evals alone stop being enough. At that point pass/fail is too coarse. Simulation misses too much of what happens in prod and model capabilities are moving faster than eval infra can keep up. What we run instead: observability-driven development. - deploy with tight guardrails - collect prod trajectories - cluster behavior to discover patterns + failure modes - specialize workers for narrower tasks - tune thresholds until behavior is reliably within bounds Can you see what your agents are doing? Can you detect drifts before they cause damage? This is an important shift in how we build AI systems. Evals still matter but observability is becoming the foundation for prod-ready agents. Thanks Sunny Bakhda (@honeyhiveai founding engineer) for a great talk at @aicouncilconf

ds3638's tweet photo. Evals are dead. Or more precisely:

traditional eval-driven development doesn’t scale.

Static evals were useful when agents were short-lived and bounded, but once agents are running for hours and taking thousands of actions + operating autonomously, evals alone stop being enough.

At that point pass/fail is too coarse. Simulation misses too much of what happens in prod and model capabilities are moving faster than eval infra can keep up.

What we run instead: observability-driven development.

- deploy with tight guardrails
- collect prod trajectories
- cluster behavior to discover patterns + failure modes
- specialize workers for narrower tasks
- tune thresholds until behavior is reliably within bounds

Can you see what your agents are doing? Can you detect drifts before they cause damage?

This is an important shift in how we build AI systems. Evals still matter but observability is becoming the foundation for prod-ready agents.

Thanks Sunny Bakhda (@honeyhiveai founding engineer) for a great talk at @aicouncilconf

104

ds3638 retweeted

AI Council

@AICouncilConf

22 days ago

Q: What does it take to build and deploy LLM-powered applications? A: The AI Engineering Track. Today! Running Enterprise Agents in Production with @vasundra_s Evals That Actually Work with @the_bunny_chen Context Engineering with @simba_khadder Managing Really Large Multimodel Datasets for AI with @changhiskhan Pricing AI Agents with @kshithappens Observability for Long-Running Agents with Sunny Bakhda. Thanks for curating a great track, @ds3638!

AICouncilConf's tweet photo. Q: What does it take to build and deploy LLM-powered applications?
A: The AI Engineering Track. Today!

Running Enterprise Agents in Production with @vasundra_s
Evals That Actually Work with @the_bunny_chen
Context Engineering with @simba_khadder
Managing Really Large Multimodel Datasets for AI with @changhiskhan
Pricing AI Agents with @kshithappens
Observability for Long-Running Agents with Sunny Bakhda.

Thanks for curating a great track, @ds3638!

198

ds3638 retweeted

ChengSong Huang

@ChengsongH31219

22 days ago

"How do you self-improve a model on open-ended tasks where you can't take a majority vote?" I got asked this in nearly every research interview I did last year. None of my answers felt clean. So we built something that doesn't need a vote, a verifier, or a judge. Meet G-Zero. 👇 paper: https://t.co/TrvGb48W4d huggingface: https://t.co/8guc5xSh3i code: https://t.co/G8mMm2I9h1 All experiments are done via api by @thinkymachines (1/n)

ChengsongH31219's tweet photo. "How do you self-improve a model on open-ended tasks where you can't take a majority vote?"

I got asked this in nearly every research interview I did last year. None of my answers felt clean.

So we built something that doesn't need a vote, a verifier, or a judge.

Meet G-Zero. 👇

paper: https://t.co/TrvGb48W4d
huggingface: https://t.co/8guc5xSh3i
code: https://t.co/G8mMm2I9h1

All experiments are done via api by @thinkymachines (1/n)

238

257

15K

ds3638 retweeted

Sayash Kapoor @sayashk

22 days ago

I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run, is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: https://t.co/jj9F3wWMo5 As benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results. Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. https://t.co/rTcirSHuRO This builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work. I’m sure we’ll find many other issues in our evals, but genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.

sayashk's tweet photo. I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run, is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: https://t.co/jj9F3wWMo5

As benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results. Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. https://t.co/rTcirSHuRO

This builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work.

I’m sure we’ll find many other issues in our evals, but genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.

13K

ds3638 retweeted

Claude

@claudeai

23 days ago

New in Claude Code: agent view. One list of all your sessions, available today as a research preview.

29K

11K

ds3638 retweeted

Sarah Fong

@MilksandMatcha

25 days ago

I didn't study a lick of AI in college. I had studied computer science at MIT, but because I had been interning and working in quant trading, it was much more statistically focused than anything related to machine learning. Joining Exa AI early was my first exposure, and I still appreciate Jeff Wang for his patience, especially in the early days when I had never even heard of RAG. For anyone else looking to get started, I would really recommend watching Andrei Karpathy's 'intro to LLMs' 1hr comprehensive deep dive. Even though it is now two years old, it remains highly relevant. And while really powerful agents exist, Cursor/Windsurf are still a really good first step for going from something you are more familiar with, like the traditional IDE, to understanding how AI can accelerate your current work. And then, in terms of getting more hands-on, I absolutely love claude cowork, especially if you are non-technical, just to get a glimpse at the power of AI in your day-to-day life.

593

883

99K

ds3638 retweeted

hardmaru

@hardmaru

26 days ago

The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math. We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens. Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements. Paper: https://t.co/rqIY9SYBDe Blog: https://t.co/oRjNbpJKha Code: https://t.co/FAFaJwpxAJ ⚡️

hardmaru's tweet photo. The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.

Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements.

Paper: https://t.co/rqIY9SYBDe
Blog: https://t.co/oRjNbpJKha
Code: https://t.co/FAFaJwpxAJ
⚡️

503

430K

ds3638 retweeted

Amanda Askell

@AmandaAskell

26 days ago

Alignment research often has to focus on averting concerning behaviors, but I think the positive vision for this kind of training is one where we can give models and honest and positive vision for what AI models can be and why. I'm excited about the future of this work.

AmandaAskell's tweet photo. Alignment research often has to focus on averting concerning behaviors, but I think the positive vision for this kind of training is one where we can give models and honest and positive vision for what AI models can be and why. I'm excited about the future of this work. https://t.co/4BnLVNjnEY

116

782

201

73K

ds3638 retweeted

Mohak Sharma

@mohak__sharma

28 days ago

If you’re shipping agents, your observability surface should live where you code. One of the most visible changes in @HoneyHiveAI v2 is that traces and evals now show up where engineers and coding agents already work: the terminal, IDE, and your coding agents. We're shipping a HoneyHive CLI so teams and agents can talk to the same API the UI uses. You can create or update projects from the command line and pull logs into whatever scripts or CI paths you already use to ship changes. We're also adding agent skills for the entire agent development lifecycle — from development & evals to root-causing alerts & self-healing your agents. Now Claude Code, Cursor, or any coding agent can: - Triage failing traces when an alert fires and cluster them by error mode to isolate the regression - Trace the regression back to the offending change — a prompt edit, a tool schema update, a model swap - Draft a fix and run it against the affected slice plus your regression set before anything gets proposed - Open a PR with the RCA, eval diff, and trace links already attached — so you review a fix, not a hypothesis All of this runs on the same v2 foundation we designed for regulated environments. The control plane stays separate from the data planes, eval compute runs where the data lives, and access is gated with RBAC. If your agents are touching customers, they should inherit the same telemetry and governance guarantees as the rest of your stack. Wrote more on this here: https://t.co/GaVxA0btUw

mohak__sharma's tweet photo. If you’re shipping agents, your observability surface should live where you code. One of the most visible changes in @HoneyHiveAI v2 is that traces and evals now show up where engineers and coding agents already work: the terminal, IDE, and your coding agents.

We're shipping a HoneyHive CLI so teams and agents can talk to the same API the UI uses. You can create or update projects from the command line and pull logs into whatever scripts or CI paths you already use to ship changes. We're also adding agent skills for the entire agent development lifecycle — from development & evals to root-causing alerts & self-healing your agents.

Now Claude Code, Cursor, or any coding agent can:

- Triage failing traces when an alert fires and cluster them by error mode to isolate the regression
- Trace the regression back to the offending change — a prompt edit, a tool schema update, a model swap
- Draft a fix and run it against the affected slice plus your regression set before anything gets proposed
- Open a PR with the RCA, eval diff, and trace links already attached — so you review a fix, not a hypothesis

All of this runs on the same v2 foundation we designed for regulated environments. The control plane stays separate from the data planes, eval compute runs where the data lives, and access is gated with RBAC. If your agents are touching customers, they should inherit the same telemetry and governance guarantees as the rest of your stack.

Wrote more on this here: https://t.co/GaVxA0btUw

297

ds3638 retweeted

Anish Acharya

@illscience

29 days ago

Skills and plugins are closer to product marketing than product. You often still need a dedicated interface to explore tradeoffs, make decisions, handle exceptions.. As with every other product cycle, apps will be radically different than the past, yet as important as ever.

Dhruv Singh

@ds3638

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users