@STUD_MAN_X@thsottiaux Yes, they can shard the full MoE weights across many GPUs using expert parallelism, and each token only computes the routed active experts. But the full weights still need to be stored within each serving replica. They can’t keep one central copy and stream experts to many fleets.
@STUD_MAN_X@thsottiaux If you say “5T total and 100B active”, that means only ~100B params are used for compute per token, but the full 5T weights must still be sharded somewhere.
They also can’t rely on heavy offloading at scale, because that would destroy throughput.
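A rough back-of-envelope sketch of the point above: weight memory follows the total parameter count, while per-token compute follows the active count. The 5T / 100B figures are the example from the reply; the BF16 storage (2 bytes per parameter) and the "about 2 FLOPs per active parameter per token" rule of thumb are assumptions for illustration, not known details of any specific model.

```python
# Back-of-envelope: MoE memory scales with TOTAL params,
# per-token compute scales with ACTIVE params.

def weight_bytes(total_params: float, bytes_per_param: float) -> float:
    """Bytes needed to hold all weights (weights only, no KV cache)."""
    return total_params * bytes_per_param


def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2.0 * active_params


TOTAL_PARAMS = 5e12     # "5T total" from the example above
ACTIVE_PARAMS = 100e9   # "100B active" from the example above
BYTES_PER_PARAM = 2.0   # assumption: BF16 storage

if __name__ == "__main__":
    mem_tb = weight_bytes(TOTAL_PARAMS, BYTES_PER_PARAM) / 1e12
    active_share = ACTIVE_PARAMS / TOTAL_PARAMS
    print(f"weights to shard per replica: {mem_tb:.1f} TB")
    print(f"share of params used per token: {active_share:.1%}")
    print(f"approx FLOPs per token: {flops_per_token(ACTIVE_PARAMS):.2e}")
```

Under these assumptions, every token touches only a small share of the parameters, yet each replica still has to hold the whole weight set somewhere in GPU memory.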
@STUD_MAN_X@thsottiaux Q4 is never the actual training weight precision. Models are usually pre-trained in FP16/BF16 or FP32, and then quantized to Q4 later. After that, they may run Q4-oriented post-training or fine-tuning to improve performance in the quantized format.
@STUD_MAN_X@thsottiaux They use MoE, no doubt, but this time it is most probably a recurrent (looped) transformer. And precision matters: if it didn't, Kimi K2.6 could run on 8x H100 in FP32, and it can't.
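To see why precision decides what fits on a node, here is a minimal sketch of the largest parameter count whose weights alone fit on 8 GPUs. It assumes the 80 GB H100 variant and counts weights only (no KV cache, activations, or runtime overhead), so real limits are lower. The precision table and function names are illustrative, and no specific model size is assumed.

```python
# Upper bound on parameter count that fits on one node, by precision.
# Assumptions: 8 GPUs, 80 GB each, weights only.

GPU_MEM_BYTES = 80e9
NUM_GPUS = 8
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int4": 0.5}


def max_params(precision: str,
               num_gpus: int = NUM_GPUS,
               gpu_mem_bytes: float = GPU_MEM_BYTES) -> float:
    """Largest parameter count whose weights fit in total GPU memory."""
    return num_gpus * gpu_mem_bytes / BYTES_PER_PARAM[precision]


def fits(total_params: float, precision: str) -> bool:
    """True if the weights alone fit on the node at this precision."""
    return total_params <= max_params(precision)


if __name__ == "__main__":
    for name in BYTES_PER_PARAM:
        print(f"{name}: up to {max_params(name) / 1e9:.0f}B params")
```

Going from FP32 to 4-bit raises the ceiling 8x on the same hardware, which is the whole argument for why weight precision matters at serving time.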
@STUD_MAN_X@thsottiaux It is not 4x costlier than 5.4. I think it is a larger model and they did not quantize it to 4-bit, but it is definitely token-efficient: I let it build my project from the plan.md, and it used 40% fewer output tokens than 5.4.
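A small sketch of how token efficiency offsets a higher per-token price. The 40% figure is the one observation from the reply above; the price multipliers are hypothetical inputs, not actual pricing, and only output tokens are considered.

```python
# Effective cost of a task = price per token x tokens used.
# A pricier model that uses fewer tokens costs less than its list
# price multiplier suggests.

def effective_cost_ratio(price_multiplier: float,
                         token_reduction: float) -> float:
    """Cost of the new model relative to the old one for the same task.

    price_multiplier: new per-token price / old per-token price (hypothetical).
    token_reduction: fraction of output tokens saved, e.g. 0.4 for 40%.
    """
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return price_multiplier * (1.0 - token_reduction)


if __name__ == "__main__":
    for multiplier in (2.0, 4.0):  # hypothetical list-price multipliers
        ratio = effective_cost_ratio(multiplier, 0.4)
        print(f"{multiplier:.0f}x per-token price -> {ratio:.1f}x per task")
```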
@deedydas That's just web traffic data from Similarweb, not the actual user count. If you look at the App Store / Play Store download stats, you will get the actual view of how dominant ChatGPT is in the consumer ecosystem.
ChatGPT has 1.3B+ downloads on mobile
Claude has 40M+ downloads
@f_demaku@thsottiaux Join Trusted Access for Cyber to reduce the guardrails on cyber-related work; otherwise it is quite painful. Most probably your Agent.md or skills contain something related to code review like "try penetrating", or some other cyber-related phrase.
@scaling01 Gemini 3.5 Flash has higher error rates, so it needs multiple runs (more tokens) to successfully complete the same task that other models can do in a single run.
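The retry argument can be sketched with a simple model: if each run succeeds independently with probability p, the expected number of runs until the first success is 1/p, so expected tokens scale by the same factor. The independence assumption and all numbers below are illustrative, not measured error rates for any model.

```python
# Expected token cost when a task may need several runs to succeed.
# Assumption: runs are independent with a fixed success probability.

def expected_attempts(success_rate: float) -> float:
    """Mean number of runs until the first success (geometric distribution)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def expected_tokens(tokens_per_run: float, success_rate: float) -> float:
    """Mean total tokens spent to get one successful completion."""
    return tokens_per_run * expected_attempts(success_rate)


if __name__ == "__main__":
    # Hypothetical comparison: a cheap-per-run model that fails half the
    # time versus a model that succeeds on the first run.
    print(expected_tokens(1000, 0.5))
    print(expected_tokens(1500, 1.0))
```

In this hypothetical case the model that looks cheaper per run ends up spending more tokens per completed task.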