Spadav @Spadav_ - Twitter Profile

Pinned Tweet

Spadav @Spadav_

3 months ago

It's up. https://t.co/kjOB9BauLS

0

158

Spadav_ retweeted

witcheer

@witcheer

about 22 hours ago

everyone says NVFP4 makes blackwell cards "faster."

I benchmarked Qwen3.6-27B three ways on my 5090:

>NVFP4
>plain Q4_K_M (same 4-bit budget)
>Q6_K - same llama.cpp b9365 and same harness.

~~~ prefill (processing your prompt):

NVFP4 wins big, and it's real. +32 to 42% over equal-bit Q4_K_M at every context from 512 to 16k, so that gain is pure FP4-tensor-core compute.

vs Q6 it's +52 to 68%. concretely at pp512: 5415 tok/s vs 3826 (Q4) vs 3222 (Q6).

~~~ decode (generating tokens):

here's the myth. vs an equal-size Q4 it moves only +9% (84 vs 77 tok/s). the headline "+36% vs Q6" decode number isn't the FP4 cores at all but it's just NVFP4 being smaller (14.6GB vs 21GB).

decode is memory-bandwidth bound, so it tracks footprint, not how the weights are packed.

prefill = compute, decode = size.

~~~
the 4-bit tax is almost nothing: 93.2 vs 94.0 q_avg across five tasks vs Q6. MMLU, ARC, HellaSwag, GSM8K all land within half a point; only code dips meaningfully (HumanEval 90.2 vs 92.7).

net, vs the Q6 a lot of people serve:
~+60% prefill
+36% decode
-30% VRAM (17.3 vs 23.5GB) for -0.8 quality.

for an always-on local agent that's an easy yes - faster replies, more context headroom, and 6GB of VRAM handed back.
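The percentages in the tweet above can be re-derived from the raw figures it quotes. This is a minimal sketch that uses only the numbers in the tweet; the helper name `pct_gain` is hypothetical and not part of any benchmark harness. It also estimates the decode gain one would expect if decode were purely memory-bandwidth bound, on the simplifying assumption that tokens per second then scale inversely with weight size.

```python
def pct_gain(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# Figures quoted in the tweet: prefill tok/s at pp512, decode tok/s, sizes in GB.
prefill = {"NVFP4": 5415, "Q4_K_M": 3826, "Q6_K": 3222}
decode_nvfp4, decode_q4 = 84, 77
size_nvfp4_gb, size_q6_gb = 14.6, 21.0
vram_nvfp4_gb, vram_q6_gb = 17.3, 23.5

prefill_vs_q4 = pct_gain(prefill["NVFP4"], prefill["Q4_K_M"])
prefill_vs_q6 = pct_gain(prefill["NVFP4"], prefill["Q6_K"])
decode_vs_q4 = pct_gain(decode_nvfp4, decode_q4)

# Assumption: a purely bandwidth-bound decode scales with 1 / weight size,
# so the ceiling vs Q6_K is the size ratio.
decode_ceiling_vs_q6 = pct_gain(size_q6_gb, size_nvfp4_gb)

# Footprint changes, by file size and by the quoted VRAM totals.
file_size_change = pct_gain(size_nvfp4_gb, size_q6_gb)
vram_change = pct_gain(vram_nvfp4_gb, vram_q6_gb)

for name, value in [
    ("prefill vs Q4_K_M", prefill_vs_q4),
    ("prefill vs Q6_K", prefill_vs_q6),
    ("decode vs Q4_K_M", decode_vs_q4),
    ("decode ceiling vs Q6_K", decode_ceiling_vs_q6),
    ("file size vs Q6_K", file_size_change),
    ("VRAM vs Q6_K", vram_change),
]:
    print(f"{name}: {value:+.1f}%")
```

With these inputs the prefill gains come out near +41.5% and +68%, decode vs Q4_K_M near +9%, and the bandwidth ceiling vs Q6_K near +44%, which is consistent with the observed +36%. The "-30%" figure matches the file sizes (14.6 vs 21GB); the quoted VRAM totals (17.3 vs 23.5GB) work out closer to -26%.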

17

111

8

90

16K

Spadav_ retweeted

Ahmad

@TheAhmadOsman

1 day ago

All it takes to get started with Local AI is a single RTX 3090, so go buy that GPU anon

31

169

5

30

13K

Spadav @Spadav_

1 day ago

Claude >> ChatGPT Codex >> Claude change my mind

0

15

Spadav_ retweeted

nullptr 🐱🍩 @notnullptr

3 days ago

i’m glad google are still releasing open models. i just wish they were good models

27

641

7

60

44K

Spadav @Spadav_

2 days ago

@JakeKAllDay @slippyfox @sudoingX Yeah this could be a good solution if you want a bigger model to use on a limited spec computer

0

14

Spadav @Spadav_

2 days ago

@LottoLabs I have been using Qwen 9b as a second model running on the second GPU, mainly for memory extraction, compression, and summarization. It worked decently with a very strict prompt. I'm testing Gemma now; hopefully it will be better (apparently Gemma models are good at writing)

0

3

0

933

Spadav @Spadav_

2 days ago

@slippyfox @sudoingX I've got a 3080 10GB; the Unsloth Q4_K_M fits perfectly at 160 ctx, and you can probably squeeze in more context if you quantize the KV cache to q4
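To see why quantizing the KV cache frees room for context, here is a rough size estimate. It is a sketch under stated assumptions: the model dimensions below are hypothetical placeholders (the tweet does not name the model's architecture), and the q4 cost assumes a q4_0-style layout of 32 four-bit values plus one fp16 scale per block, i.e. 18 bytes per 32 elements.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size: K and V each hold
    n_kv_heads * head_dim values per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


F16_BYTES = 2.0
Q4_0_BYTES = 18 / 32  # assumed block layout: 16 bytes of nibbles + 2-byte scale

# Hypothetical dimensions, for illustration only.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=16384)

f16_size = kv_cache_bytes(**dims, bytes_per_elem=F16_BYTES)
q4_size = kv_cache_bytes(**dims, bytes_per_elem=Q4_0_BYTES)

GIB = 1024 ** 3
print(f"f16 KV cache: {f16_size / GIB:.2f} GiB")
print(f"q4  KV cache: {q4_size / GIB:.2f} GiB")
print(f"context that fits in the same memory: {f16_size / q4_size:.2f}x")
```

Under these assumptions the cache shrinks by roughly 3.6x, so the same VRAM budget holds roughly 3.6x the context, at some cost in attention precision.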

1

0

62

Spadav @Spadav_

3 days ago

🤣

Ahmad

@TheAhmadOsman

3 days ago

Using Windows in the age of AI is a permanent underclass move btw

58

282

23

22

16K

0

10

Spadav @Spadav_

3 days ago

hermes when

airplanestar

@airplanestar_

3 days ago

Get ready for a lot of unemployment 🥶 Peter Steinberger, creator of OpenClaw, came to Microsoft Build to explain how OpenClaw will be integrated as a native Windows app, complete with a new security feature called Microsoft Execution Containers. He said, "Now you can run OpenClaw directly in your company's environment more securely." The demo was also shown live on stage using a Surface Laptop Ultra. #microsoft #ai #tech

79

777

68

582

146K

0

1

0

3

Spadav @Spadav_

4 days ago

@meabed 😅 my bad, I got mixed up with another tailscale project. What I actually meant is Docker support: most of my services run in containers without host ports, so tsp can't discover them. I'll try something with the Docker API, which would make everything discoverable at the same time

0

14

Spadav @Spadav_

4 days ago

@meabed would you mind a fork for Linux? I could work on that in my free time

1

0

13

Spadav_ retweeted

Lotto

@LottoLabs

5 days ago

Qwen 27b 3.7 soon 👀

49

689

21

66

40K

Spadav @Spadav_

5 days ago

Just a reminder, if you are using llama.cpp as backend, this might be helpful. Full llm serving, voice pipeline, config for hot swapping, playtest, model management. Based on llama.swap.

Spadav @Spadav_

3 months ago

It's up. https://t.co/kjOB9BauLS

0

158

0

6

Spadav @Spadav_

6 days ago

@loktar00 I have a project folder with 18 different sub-folders, each one with half-baked code and architecture files. Guess how many are on my GitHub? Two.

0

1

0

32

Spadav @Spadav_

6 days ago

@Senpai_Gideon Claude Design for UI, Claude to iterate on projects and architecture just by chatting, then GPT 5.5 to do the work.

0

1

0

48

Spadav @Spadav_

6 days ago

@aijoey @llm_wizard One big family I guess

1

0

29

Spadav @Spadav_

6 days ago

@rozzabuilds I use the Codex App on my laptop to work on projects on my main PC, instead of Terminal, SSH, and Tmux. The app does it natively

0

34

Spadav @Spadav_

6 days ago

@pangshuo1981 I’m all local with this Qwen3-TTS setup, but I see how a unified workflow layer like yours could help organize these kinds of pipelines. Great project

0

16

Spadav @Spadav_

7 days ago

Got my agent pretty close to real-time voice on local hardware.
STT: Parakeet TDT 0.6 (ONNX INT8, CPU)
TTS: Qwen3-TTS 0.6B on RTX 3080 (torch.compile + CUDA graphs), RTF ~0.41x
The bottleneck is of course the TTS path, but the quality is unmatched so I kept it. Continued below
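For readers unfamiliar with the RTF figure, here is a minimal sketch of how a real-time factor can be measured. It assumes the common convention RTF = processing time / audio duration, so values below 1 are faster than real time; the tweet does not state which convention it uses. `fake_tts` is a hypothetical stand-in, not the Qwen3-TTS API.

```python
import time
from typing import Callable, Sequence


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (assumed convention)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def measure_rtf(synthesize: Callable[[str], Sequence[float]],
                text: str, sample_rate: int) -> float:
    """Time one synthesis call and relate it to the audio it produced."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def fake_tts(text: str) -> list[float]:
    # Hypothetical stand-in: returns one second of silence at 24 kHz.
    return [0.0] * 24000


# At RTF 0.41, ten seconds of speech would take about 4.1 s to generate.
print(real_time_factor(4.1, 10.0))
print(measure_rtf(fake_tts, "hello", sample_rate=24000))
```

An RTF below 1 means audio is generated faster than it plays back, which is what makes streaming replies feel close to real time.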

Hermes Agent Tips

@HermesAgentTips

8 days ago

nobody told me Hermes Agent could just... join your Discord VC and talk back

for those using Discord w/ Hermes Agent theres a feature where you can just have your Hermes agent jump in on a vc call with you and have a normal conversation like if you were having it with another human being its pretty cool

57

593

35

438

46K

4

0

101

Spadav @Spadav_

6 days ago

@TeksEdge I was doing the same: PC always on, me and my partner using the chatbot for different reasons at all times. I tried debloating Windows as much as possible, but there was always some sort of friction. Then I moved to Ubuntu; you should try it if you are willing to change OS

0

10
