@corsaren Yeah, few people on X realized #2 - this kind of async multi-stream "think + say + tool" has been hacked together in various ways across a few consumer voice products. Still nice to see it coming from a model company, which must mean some advancement at the model layer.
@nrehiew_ Great read - been wondering why some models are much better at long-horizon practical SWE tasks, i.e. whether it's due to a teacher trained on diverse, high-quality human data or a crazy amount of compute in the stages around OPD. Seems both eventually point to the importance of on-policy data!
@dylan_works_ Consolidation itself is okay imo, but the current approach is definitely far from effective, not to mention those "strategic forgetting" approaches, which give a false impression of mirroring the human brain. Maybe the key is to simulate the ICL process to extract the real "experience".
@eliebakouch Noticed some pretty wild hallucinations today when analyzing code samples, and I noticed that it’s doing thinking/backtracking directly in the output (not in thinking). Wondering if that has to do with the regression here.
@teortaxesTex So input:cached:output for v4f is 1:1/50:2 while for v4p it's 1:1/120:2. That unique ratio really shows the scenario they want to utility-maxx for, or the nature of their underlying attention optimizations.
@badlogicgames @deepseek_ai Yeah, several people already raised this. I've forwarded this feedback to their deployment team again, so hopefully they'll fix it quickly.
@basedjensen When your definition of AGI is letting 1 billion users talk freely with their AI friend who can remember 1 million context. Terrifying execution and focus.
@MParakhin Genuine question - why frame the 35% as a benchmark issue rather than a data-pipeline one? Reverse-engineering from completed workflows structurally can't produce edit trajectories (no starting state) or Q&A (no workflow artifact). Feels like the gap was baked in upstream already
@michaelyli__ Softmax renormalizes over survivors, so eviction shifts mass onto the remaining keys regardless of their true relevance. Your scoring avoids evicting high-attention blocks, but doesn't bound this drift. Is the cliff at high compression (fig 3r/8) due to mass redistribution accumulating?
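To make the renormalization point concrete, here's a minimal sketch in plain Python. It is a toy, not the paper's method: the logits, the choice of evicted keys, and the helper name `evict_and_renormalize` are all made up for illustration. It only shows that when softmax is recomputed over the surviving keys, every survivor's weight gets scaled by the same factor 1 / (1 - evicted mass), whatever its relevance.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def evict_and_renormalize(scores, keep):
    """Hypothetical helper: softmax over only the kept key indices.

    Returns a dict mapping original key index -> renormalized weight.
    """
    kept = sorted(keep)
    weights = softmax([scores[i] for i in kept])
    return dict(zip(kept, weights))

# Illustrative attention logits for 6 keys (made-up numbers).
scores = [2.0, 1.0, 0.5, 0.5, 0.0, 0.0]
full = softmax(scores)

# Evict the three lowest-scoring keys and renormalize over the survivors.
keep = [0, 1, 2]
compressed = evict_and_renormalize(scores, keep)

# Attention mass that the evicted keys held under the full softmax.
evicted_mass = sum(full[i] for i in range(len(scores)) if i not in keep)

# Every survivor is inflated by the same factor, independent of relevance.
scale = 1.0 / (1.0 - evicted_mass)

for i in keep:
    print(i, round(full[i], 4), round(compressed[i], 4))
print("evicted mass:", round(evicted_mass, 4), "scale:", round(scale, 4))
```

In this single-step toy the drift is exactly the evicted mass spread proportionally over survivors; whether it compounds across layers or decoding steps is the open question above.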
@teortaxesTex Yeah, this one is confusing to read, i.e. 1) Mythos is indeed memorizing more problems for marginal improvement, 2) sweb-v and sweb-p have very different leakage shapes. Strange to conclude that "memorization does not explain improvements".