My turbo-quant rust crate is mentioned in this article along with the big boys like llama.cpp and CUDA as the go-to Rust implementation of Google's TurboQuant.
https://t.co/X447Zw8RS0
@jdegoes @QuentinCody use hermes to control codex/claude: draw up detailed specs before you code something, then hand them to hermes along with the instructions you just gave about what it should do. hermes then acts as the "metacoding agent" that creates custom everything for every repo.
rust inference getting real bench numbers is the part that matters — but @ProfBuehlerMIT nailed the unlock: "natively agentic." token throughput is fine; agents leak memory unbounded unless the runtime owns the cache lifecycle. a 2.79× decode win over llama.cpp is exactly the regime where compressing the kv cache (cold tier + per-agent shell, both receipted) starts paying for itself.
ran poly-kv last night: fib-quant cold + turbo-quant hot, 10 agents sharing a 200-doc pool. recall@1 = 1.000 across every agent. 0/90 cross-agent leaks. sub-20ms shell materialization. the cache is the shared infrastructure, not a per-agent cost.
the phase-injection pattern is a guardrail against exactly the "stupider and stupider" decay you noticed in copilot. each phase is a self-contained task with: (1) an explicit checkpoint, (2) a self-audit that emits a receipt, (3) a gate the runtime refuses to cross on an unverifiable state. the bitemporal layer underneath means the receipt records *what was true at that phase* and *when we knew it* — so the model can't quietly rewrite its own history. drop me a note if you want the receipt schema, happy to share the pattern.
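A minimal sketch of the gate described above, not the actual receipt schema. Every name here (`Receipt`, `PhaseRunner`, `GateError`, the field names) is hypothetical, and the "self-audit" is reduced to a single assumption: a claim is verifiable only if it is non-empty and names at least one input.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical bitemporal receipt.
/// `valid_at` is when the claim was true (the phase's own logical time);
/// `recorded_at` is when the runtime learned it (wall-clock seconds).
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    pub phase: usize,
    pub claim: String,
    pub inputs: Vec<String>,
    pub valid_at: u64,
    pub recorded_at: u64,
}

#[derive(Debug, PartialEq)]
pub enum GateError {
    /// The self-audit could not back the claim with any inputs.
    Unverifiable { phase: usize },
}

/// Append-only log: receipts are pushed, never edited or removed.
#[derive(Default)]
pub struct PhaseRunner {
    log: Vec<Receipt>,
}

impl PhaseRunner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Try to cross the gate for the next phase.
    /// Returns a hard stop instead of advancing when the claim is unverifiable.
    pub fn advance(
        &mut self,
        claim: &str,
        inputs: &[&str],
        valid_at: u64,
    ) -> Result<&Receipt, GateError> {
        let phase = self.log.len();
        // Assumed self-audit rule: no claim or no inputs means no receipt.
        if claim.is_empty() || inputs.is_empty() {
            return Err(GateError::Unverifiable { phase });
        }
        let recorded_at = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        self.log.push(Receipt {
            phase,
            claim: claim.to_string(),
            inputs: inputs.iter().map(|s| s.to_string()).collect(),
            valid_at,
            recorded_at,
        });
        Ok(&self.log[phase])
    }

    /// Read-only view of the history; there is no way to rewrite it.
    pub fn receipts(&self) -> &[Receipt] {
        &self.log
    }
}
```

Because a failed `advance` leaves the log untouched, the phase counter only moves when a receipt exists, which is the "refuse to cross" behavior in miniature.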
appreciated then, more so now — finally have a real reply to the bench request. last night ran poly-kv (fib-quant cold + turbo-quant hot) on 10 agents sharing a 200-doc pool. recall@1 = 1.000 across every agent, 0/90 cross-agent leaks, sub-20ms shell materialization. turbo-quant alone held cosine fidelity at 0.9996 in the same run. when the binary wire format lands, that 8-bit-per-vector becomes ~1-bit effectively, on top of the existing turboquant gains. ping me when you bench against llama.cpp — i'd genuinely like the data.
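For anyone wondering how a "cosine fidelity" number like that can be measured: a rough sketch, assuming it means the cosine similarity between a vector and its quantize/dequantize round trip. The quantizer below is a plain symmetric 8-bit-per-coordinate scalar quantizer used as a stand-in; it is not TurboQuant and not the turbo-quant crate's API, and all function names are made up for illustration.

```rust
/// Plain symmetric 8-bit scalar quantization (a stand-in, NOT TurboQuant).
/// Returns one i8 code per coordinate plus a single f32 scale for the vector.
pub fn quantize_i8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // Avoid dividing by zero for an all-zero vector.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let codes = v
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (codes, scale)
}

/// Reverse the mapping: code * scale.
pub fn dequantize_i8(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Fidelity of one vector: cosine between the original and its round trip.
pub fn cosine_fidelity(v: &[f32]) -> f32 {
    let (codes, scale) = quantize_i8(v);
    cosine(v, &dequantize_i8(&codes, scale))
}
```

Averaging `cosine_fidelity` over every vector in the pool would give a single run-level number; whether the benchmark above averages, takes the minimum, or does something else is not stated.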
100% on the progress markers. the gap you're describing is exactly why i treat the agent runtime as a "refuse to advance" gate, not a "log what happened" observer. the explicit progress marker + a per-phase receipt closes the loop on the burn case: the model either produces a receiptable claim for "step N advanced because of inputs X,Y,Z" or it gets a hard stop. that "stop on unverifiable" invariant is what made the poly-kv benchmark deterministic last night.
codex needs some work @openai @OpenAIDevs
It got stuck in an infinite loop that burned through my usage. Even though it was aware there was an issue keeping it from progressing, it kept trying over and over, exemplifying the word insanity.
ran a 10-agent poly-kv benchmark last night. fib-quant cold tier + turbo-quant hot tier on a shared 200-doc × 768-dim pool.
every agent: recall@1 = 1.000. 0 cross-agent top-1 leaks across 90 pairs. sub-20ms shell materialization.
the lesson wasn't the compression. it was that "shared pool + per-agent shell" is the actual kv-cache abstraction. compression was just the budget.
receipts on the build step, receipts on the shell materialize step, deterministic seed → reproducible. binary wire format is the next 50×.
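A toy sketch of the "shared pool + per-agent shell" idea and of how the two headline metrics could be counted. Everything here is an assumption for illustration: the `Pool`/`Shell` names, storing vectors uncompressed, dot-product scoring, and defining a "leak" as agent i's top-1 for agent j's probe landing on j's private doc. With 10 agents that definition yields 10 × 9 = 90 ordered pairs.

```rust
/// Shared, read-only pool of document vectors.
/// Stored uncompressed here; the compression tiers are out of scope.
pub struct Pool {
    docs: Vec<Vec<f32>>,
}

/// Per-agent view: the pool ids this agent is allowed to read.
pub struct Shell {
    pub agent: usize,
    visible: Vec<usize>,
}

impl Pool {
    pub fn new(docs: Vec<Vec<f32>>) -> Self {
        Self { docs }
    }

    /// Materialize a shell; ids outside the pool are dropped.
    pub fn shell(&self, agent: usize, visible: &[usize]) -> Shell {
        let visible = visible
            .iter()
            .copied()
            .filter(|&id| id < self.docs.len())
            .collect();
        Shell { agent, visible }
    }

    /// Top-1 doc id by dot product, searching only ids in the shell.
    pub fn top1(&self, shell: &Shell, query: &[f32]) -> Option<usize> {
        let mut best: Option<(usize, f32)> = None;
        for &id in &shell.visible {
            let score: f32 = self.docs[id].iter().zip(query).map(|(a, b)| a * b).sum();
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((id, score));
            }
        }
        best.map(|(id, _)| id)
    }
}

/// Returns (leaks, pairs) over all ordered pairs (i, j) with i != j.
/// A leak is counted when agent i, given agent j's probe, retrieves
/// j's private doc. `private_doc[j]` and `probes[j]` belong to agent j.
pub fn cross_agent_leaks(
    pool: &Pool,
    shells: &[Shell],
    private_doc: &[usize],
    probes: &[Vec<f32>],
) -> (usize, usize) {
    let (mut leaks, mut pairs) = (0, 0);
    for (i, shell) in shells.iter().enumerate() {
        for j in 0..shells.len() {
            if i == j {
                continue;
            }
            pairs += 1;
            if pool.top1(shell, &probes[j]) == Some(private_doc[j]) {
                leaks += 1;
            }
        }
    }
    (leaks, pairs)
}
```

In this sketch isolation is structural: `top1` never scores an id outside the shell, so a leak can only come from a shell that was built wrong. Recall@1 is then the fraction of agents whose own probe returns their own doc.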
the honest answer: you don't test the agent and then generate receipts — the receipt is the test. a runner that emits a receipt for every gated decision (tool call, file write, network egress) is itself a continuous offensive harness. you fuzz the inputs, and the receipt log is the post-mortem. breaks the "test it before deploy vs verify it in deploy" false split. agentguard-style eBPF on the syscall boundary closes the loop on what the agent *actually* did vs what its receipt claims.
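One way to picture the "did vs claims" check: diff the receipt log against an independently observed action list. This is a sketch under assumptions: the `Action` variants mirror the three gated decisions named above, the observed list is assumed to come from some boundary tracer (no eBPF code here), and comparison is by set membership, so repeated identical actions collapse into one.

```rust
use std::collections::BTreeSet;

/// Hypothetical gated-decision kinds, each with an opaque target string.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Action {
    ToolCall(String),
    FileWrite(String),
    NetworkEgress(String),
}

#[derive(Debug, Default, PartialEq)]
pub struct Reconciliation {
    /// Seen at the boundary but never receipted: the agent did more than it claimed.
    pub unreceipted: Vec<Action>,
    /// Receipted but never seen: the agent claimed more than it did.
    pub unobserved: Vec<Action>,
}

/// Set-difference both ways between claimed and observed actions.
/// Multiplicity is ignored; results come back in sorted order.
pub fn reconcile(receipted: &[Action], observed: &[Action]) -> Reconciliation {
    let claimed: BTreeSet<&Action> = receipted.iter().collect();
    let seen: BTreeSet<&Action> = observed.iter().collect();
    Reconciliation {
        unreceipted: seen.difference(&claimed).map(|a| (**a).clone()).collect(),
        unobserved: claimed.difference(&seen).map(|a| (**a).clone()).collect(),
    }
}
```

An empty `Reconciliation` means the receipts and the observed trace agree; anything in `unreceipted` is the case worth alerting on.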
"Agent receipts" just became the defining AI security conversation of May 2026. IETF drafts. New startups. Microsoft shipping AGT v3.0. Everyone racing to prove what agents did. Good. But here's what nobody is saying:
7/7 — What's next
• Binary wire format for fib-quant (packed codes, not JSON)
• Real embedding corpus (MS MARCO, not random vectors)
• Concrete GPU cache adapter (HuggingFace DynamicCache)
• Scale: thousands of agents, not 10
• Open source release when binary packing lands
The foundations are real. The math holds. The isolation is measured, not promised.
RecursiveIntell — provenance-first, evidence-grade, receipt-bearing.
🧵 Poly-KV: Shared Compressed KV-Cache Pool — Benchmarked
Ran a full multi-agent contention suite tonight. 10 agents pulling from one shared compressed cache pool. Here's what happened.
6/7 — What's proven
✅ 100% recall preservation under compression
✅ Zero cross-agent KV-cache leakage (0/90)
✅ Two-tier strategy: fib cold + turbo hot = works
✅ Sub-20ms agent shell materialization
✅ Deterministic, receipted, seed-reproducible
This is not theoretical. The benchmarks ran. The code compiles. The receipts exist.
@plainionist curious what you dig up. the receipts and provenance architecture is the interesting part — not just what the agent does, but being able to prove what happened. fib-quant is the compression entry point if you want something concrete to play with. https://t.co/JLLuG61sjr
@grok @Cursor exactly. the self-audit loop is the key. not just logging what happened: the agent evaluates its own decisions against policy and emits receipts. governance isn't a dashboard, it's in the execution path
This is a preview of the Coding targeted variant. A lot of these things aren't wired up yet because the wiring doesn't exist yet, but all of the old wiring is hooked up right and there are no backend issues (I've gotten good at refactoring GUI with minimal issues lately). This still needs to be streamlined, and the semantically deterministic routing (which runs before the LLM ever sees anything) still needs fixing. Atm, it's too security minded and won't open up the writing tools for some reason, except for a few, and it still asks for permission for those. Good problems to have.
This particular video has some very real UI issues, because it was recorded right after my refactor run (well, within 10 minutes). I'm still impressed by where it's going and what is possible. Once I get this complexity buttoned up, the provenance will allow for a researcher you can go hands-off with; it'll be able to handle everything.
@alexabelonix @xai @Google @OpenAI @Meta appreciate it. the real flex will be when the compression receipts show reproducible numbers across workloads. right now it's honest beta: works on MMLU subsets, still dialing in on RAG queries. fib-quant paper helped validate the approach though, that was nice to see
@graninas gloss — local-first RAG notebook with provenance receipts. every answer comes with source spans, evidence bundles, confidence scores. when it's wrong, you can trace exactly which link in the chain failed. honest about its own limitations too. https://t.co/DgxkTh68rH
@alexabelonix real voice hits different. all my best posts are the ones i didn't overthink. the receipts and architecture posts do well technically but the messy ones get the humans. balance is the trick i'm still learning