This is a great write-up from @HighStakes_CH. We’re bringing that clearing on-chain so that businesses everywhere can reduce settlement costs and liquidity requirements.
Lots of pain in delivering but it was worth it.
More certain than ever that Local AI is the way to go.
Big shoutout to @davideciffa @pupposandro for the work on @lucebox LFG
Luce PFlash now runs @poolsideai’s Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090.
- 111 tok/s decode @ short ctx
- 128K TTFT in 15.91s, 5.4x faster prefill vs llama.cpp
- NIAH passes every (ctx, keep) point up to 131K
- first MoE target supported by PFlash
- hand-rolled CUDA, ggml only, no libllama
Great collab w/ @eisokant, @erc, and the rest of the team. Looking forward to working more on their great coding models. 🏎️
repo + GGUF in first comment
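A quick back-of-the-envelope check on the prefill numbers above, as a rough sketch only. It assumes "128K" means 131072 tokens, that TTFT is dominated by prefill, and that the 5.4x figure refers to this same 128K run; none of that is stated in the post.

```python
# Rough arithmetic on the figures quoted in the post.
# Assumptions (not confirmed by the post): 128K = 131072 tokens,
# TTFT is roughly all prefill, and the 5.4x speedup applies to this run.
CTX_TOKENS = 128 * 1024
TTFT_SECONDS = 15.91
SPEEDUP_VS_LLAMA_CPP = 5.4


def prefill_tokens_per_second(tokens: int, seconds: float) -> float:
    """Average prefill throughput if the whole TTFT is spent on prefill."""
    return tokens / seconds


def implied_baseline_seconds(ttft_seconds: float, speedup: float) -> float:
    """TTFT the baseline would need if it is `speedup` times slower."""
    return ttft_seconds * speedup


pflash_rate = prefill_tokens_per_second(CTX_TOKENS, TTFT_SECONDS)
baseline_seconds = implied_baseline_seconds(TTFT_SECONDS, SPEEDUP_VS_LLAMA_CPP)

print(f"implied prefill rate: {pflash_rate:.0f} tok/s")
print(f"implied llama.cpp TTFT: {baseline_seconds:.1f} s")
```

Under those assumptions the run works out to roughly 8.2K tok/s of prefill, and the llama.cpp comparison implies a TTFT of about 86 s for the same context.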
Update on @luceboxai OOMing with Hermes Agent on RTX 3090:
@davideciffa gave me a great suggestion this morning to try with Lucebox and I am happy to report that it works!
Here are the settings to make it work with Hermes Agent on RTX 3090:
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
python3 scripts/server.py \
  --tokenizer Qwen/Qwen3.6-27B \
  --port 8000 \
  --max-ctx 65536 \
  --fa-window 1024 \
  --prefix-cache-slots 1 \
  --budget 8 \
  --daemon
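If you would rather launch from a script than paste the command, here is a minimal sketch that assembles the same environment variables and flags. The helper name `build_launch` is hypothetical; the variables, flags, and values are copied from the settings above, and it assumes you run it from the repo root where `scripts/server.py` lives.

```python
import os
import shlex

# Environment variables and flags copied from the settings above.
ENV_OVERRIDES = {
    "DFLASH27B_KV_TQ3": "1",
    "DFLASH27B_PREFILL_UBATCH": "128",
}
SERVER_FLAGS = {
    "--tokenizer": "Qwen/Qwen3.6-27B",
    "--port": "8000",
    "--max-ctx": "65536",
    "--fa-window": "1024",
    "--prefix-cache-slots": "1",
    "--budget": "8",
}


def build_launch(env_overrides, flags, bare_flags=("--daemon",)):
    """Hypothetical helper: return (env, argv) for the server launch."""
    env = {**os.environ, **env_overrides}  # keep the current env, add overrides
    argv = ["python3", "scripts/server.py"]
    for name, value in flags.items():
        argv += [name, value]
    argv += list(bare_flags)  # flags that take no value
    return env, argv


env, argv = build_launch(ENV_OVERRIDES, SERVER_FLAGS)

# Print the equivalent one-line shell command for inspection.
env_prefix = " ".join(f"{k}={v}" for k, v in ENV_OVERRIDES.items())
print(env_prefix, shlex.join(argv))

# To actually start the server from the repo root:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

Keeping the settings in dicts makes it easy to try a different `--max-ctx` or `--budget` without retyping the whole command.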
This *also* works with @DJLougen Ornstein model!
Really looking forward to testing this out! Thank you David!
This is one of the most exciting projects in local AI right now!
We just hit number one globally across all AI apps on OpenRouter.
Super grateful to the nearly 1000 contributors who've helped make Hermes Agent great, thank you!
What do you want to see next?
@exolabs The first time, when Mistral came out, I was disappointed. Everything changed in April with Qwen 3.6. As for @exolabs, if I had one wish: get CUDA and MLX working in tandem. That would turn my RTX 3090 and Mac into something greater than each by itself. Thx
@chrisbward @Anthropic It does not make sense indeed, but it won't be for much longer. I own compute. Their edge is vanishing. It is only a matter of time before I am off any cloud model, for exactly the reasons you mention plus reliability. A local model doesn't degrade over time.
@malikwas1f @davideciffa @easel I always stay on one session. I have been using this thing since around version 2.0, and that is what hurts: seeing 4.5 and 4.6 shine about two months ago, then 4.7 going from really God Tier when it works to total disaster when it doesn't, plus the sweet talk... unbearable.
@easel @malikwas1f @davideciffa Totally, can't wait to be off of this shit! LocalAI is the only way out of progressive enshittification. It just drives me nuts that it acts way worse on WSL than on macOS... god knows what they did to the thing, but it is a disgrace.
@stuff383864 @davideciffa @easel Yes it can. I am working on it in my fork of `panbanda/higgs`, but it is not finished yet. Just a hint: using the ANE does magic on prefill, but it is definitely non-trivial.