El Deffo @eldeffo - Twitter Profile

11 days ago

@bnjmn_marie @WaleedAhmad1a10 any plans to do larger models (122b mtp, stepfun 3.7, v4 flash, 397b nex, mimo etc.) at low (iq1-q3) quants? would be nice to be able to squeeze more intelligence into commodity hardware and have it be usable still.

1

0

358

El Deffo @eldeffo

24 days ago

@0xSero REAP is idiotic. replicate SlimQwen instead.

0

8

El Deffo @eldeffo

29 days ago

@RyanNg26101 @Montrey82631182 donbas "locals" only joined after russia already occupied crimea and RU started the same shit there. video clearly shows well trained, professional soldiers in matching uniforms. also, finding 1000 local bandits in a region with millions of people, doesn't justify an occupation.

0

2

1

0

28

El Deffo @eldeffo

about 1 month ago

@HououinTyouma @0xSero it's been done and it works (e.g. ExpertFlow)

0

7

El Deffo @eldeffo

about 1 month ago

@0xSero sorry, ExpertFlow predictor paper: https://t.co/R6djY0fUuc

0

9

El Deffo @eldeffo

about 1 month ago

@0xSero read the FlashMoE predictor paper. but it's possible to just use statistics, e.g. https://t.co/ra8acHJGZE

1

0

30

El Deffo @eldeffo

about 1 month ago

@0xSero well, ram+cpu can probably just be ranked as well, as it's possible for them to be faster than gpus for inference

0

2

El Deffo @eldeffo

about 1 month ago

@0xSero it would be much easier/better to figure out layered caching instead:
L1 - GPUs, ranked separately by both compute for prefill & vram speed for generation
L2 - ram+cpu
L3 - ssd
there has been quite a bit of research & implementations too, but vllm and llama wouldn't let the PRs in

1

0

28
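A minimal sketch of the layered idea from the post above, assuming per-expert (or per-layer) hit counts are already available from usage statistics. The names `Tier` and `place_by_hotness`, the uniform item size, and the greedy policy are illustrative assumptions, not an API of vllm or llama.cpp:

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str           # e.g. "gpu", "ram", "ssd" (hypothetical labels)
    capacity_gb: float   # space available for cached items in this tier


def place_by_hotness(hit_counts, size_gb, tiers):
    """Greedy placement: the most frequently used items go to the fastest tier.

    hit_counts: dict mapping item id -> observed usage count
    size_gb:    size of one item (assumed uniform for simplicity)
    tiers:      list of Tier, ordered fastest first
    """
    free = [t.capacity_gb for t in tiers]
    placement = {}
    # Hottest first; ties broken by id so the result is deterministic.
    for item in sorted(hit_counts, key=lambda k: (-hit_counts[k], k)):
        for i, tier in enumerate(tiers):
            if free[i] >= size_gb:
                free[i] -= size_gb
                placement[item] = tier.name
                break
        else:
            raise ValueError(f"no tier has room for {item}")
    return placement


if __name__ == "__main__":
    tiers = [Tier("gpu", 2.0), Tier("ram", 2.0), Tier("ssd", 100.0)]
    counts = {"e0": 50, "e1": 5, "e2": 90, "e3": 1, "e4": 20}
    print(place_by_hotness(counts, 1.0, tiers))
```

A real system would also have to move items between tiers as the statistics change, which is the part that needs engine support.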

El Deffo @eldeffo

about 2 months ago

@bnjmn_marie there's some talk about broken chat templates in 3.6 on hf

0

1

0

214

El Deffo @eldeffo

2 months ago

@leftcurvedev_ @JamesNumb3rs connect your display to motherboard iGPU, you'll save yourself 1-3GB of VRAM and will be able to use Q4

0

3

El Deffo @eldeffo

2 months ago

@Posledniskaut you've somehow got way too many typos in the word "kokot".

0

5

0

203

El Deffo @eldeffo

3 months ago

@mudler_it you should compare UD_Q4_K_M instead

0

35

El Deffo @eldeffo

3 months ago

@bnjmn_marie even with -np X and perhaps several server copies at the same time? maybe it would be nice to have a proper comparison of the engines too.

0

2

0

66

El Deffo @eldeffo

3 months ago

@bnjmn_marie generally the most interesting question is what can you fit into 11GB, 15GB, 23GB, 31GB... past that, it's just macs and rtx pros, and those can run almost anything anyway.

0

10

El Deffo @eldeffo

3 months ago

@bnjmn_marie how so? llama with quants is consistently faster than vllm, at least every time I tried. also, maybe the test battery could be reduced: small models - Q4_K_M, maybe IQ4_NL & IQ3_XXS, + some smarter Q2s on 200B+ models? those are probably the only ones that really need to be tested

2

0

96

El Deffo @eldeffo

3 months ago

@LenSeaside @0xSero you can fit 27B UD-IQ3_XXS into 12GB, but only if you connect the display to the motherboard.

0

48

El Deffo @eldeffo

3 months ago

@LenSeaside @stevibe 27B UD-IQ3_XXS [-ngl 65 + 36K Q4 kv cache] 1100-1200pp 36-37 t/s but only if you connect the display to motherboard/CPU's iGPU, that will get you 1-3GB VRAM back.

0

1

0

23
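A rough back-of-the-envelope check of why the 1-3GB matters, assuming the nominal 3.06 bits per weight that llama.cpp lists for IQ3_XXS. UD variants mix quant types, so the real file size will differ, and the display overhead used here is just a value from the range quoted in the post:

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def headroom_gib(vram_gib: float, weights: float, display_gib: float) -> float:
    """VRAM left for KV cache and compute buffers after weights and display."""
    return vram_gib - weights - display_gib


if __name__ == "__main__":
    w = weights_gib(27, 3.06)  # 27B parameters at ~3.06 bpw (assumed)
    print(f"weights: {w:.2f} GiB")
    # Display attached to the GPU (assumed 2 GiB overhead) vs. to the iGPU.
    print(f"headroom, display on GPU:  {headroom_gib(12, w, 2.0):.2f} GiB")
    print(f"headroom, display on iGPU: {headroom_gib(12, w, 0.0):.2f} GiB")
```

Under these assumptions the weights alone take most of a 12GB card, so the space reclaimed from the display is roughly what makes room for the KV cache.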

El Deffo @eldeffo

3 months ago

@0xSero btw. https://t.co/ra8acHJGZE

1

4

0

80

El Deffo @eldeffo

4 months ago

@0xSero if llama had api to move layers in and out of VRAM, with this info the performance gains could be quite substantial

1

0

195
