realitymask @dethudner - Twitter Profile

15 days ago

@SkylerMiao7 @Leo_R_UK AA index tells qwen 3.7 is the best but we all know that model is benchmark contaminated. V4 really is better than anything else open source

1

4

0

254

realitymask @dethudner

24 days ago

@NosytLabs @itsjustmarky @testingcatalog retard

0

9

realitymask @dethudner

26 days ago

@Pakirapi3t @reddit_lies Shit vs Trash

0

9

0

271

realitymask @dethudner

about 1 month ago

@teortaxesTex and still get that deep learning that human data gives u

0

6

realitymask @dethudner

about 1 month ago

@teortaxesTex i want to know if you can deliberately reintroduce controlled entropy, mild inconsistencies, alt paths, fuck even stylistic variation?

1

0

10

realitymask @dethudner

4 months ago

@teortaxesTex @thkostolansky Where do you find it 'flaky'?

1

0

174

realitymask @dethudner

4 months ago

@AntLingAGI RIP to people with ears

0

4

0

45

realitymask @dethudner

5 months ago

@AdaptionDigital @arena @Kimi_Moonshot @Zai_org GLM 4.7 doesn't come close

1

0

33

realitymask @dethudner

5 months ago

@ImNotTheWolf @arena @Kimi_Moonshot found the NEET

0

107

realitymask @dethudner

6 months ago

@synthwavedd t. hypocrite and or ppl who have no idea about anything

0

69

realitymask @dethudner

6 months ago

@crystalsssup crazy results!

elie

@eliebakouch

6 months ago

some thoughts about kimi linear INSANE score on the long context benchmark MRCR, almost matching gemini 3.0 pro and gpt 5.2 thinking xhigh

> i don't understand the very big gap between the reported results in the context arena and in the paper
> qwen3 next has a similar number of params and is hybrid as well, yet it performs poorly. the differences in arch are:
-> kimi delta attention vs gated linear attention
-> MLA without rope vs gated attention
-> probably other stuff like muon, init, norm (?), etc.. and of course the data

kinda excluding the possibility that kimi trained explicitly on MRCR data (they have no reason to do that here + i trust them) and that it is just due to MRCR v1 vs v2 (feels like too much of a gap)

6

135

10

60

17K

1

0

19

realitymask @dethudner

7 months ago

Kimi's (supposed) k3 architecture preview model scores above gemini 3 pro on long context.

Dillon Uzar

@DillonUzar

7 months ago

Context Arena Update: Added kimi-linear-48b-a3b-instruct [11-08] and kimi-k2 (Thinking) [11-06] to the MRCR leaderboards.

The Linear 48b results are fascinating! It actually outperforms the new Gemini 3.0 Pro Thinking on 4-needle and 8-needle tasks at higher context lengths (512k+). I've added it to 2needle, 4needle, and 8needle.

kimi-k2 (Thinking) lands lower on the leaderboards (Rank #22 for 2-needle AUC @ 128k), with a hard context ceiling around 262k. I did not run it for 2needle and 4needle.

All results at: https://t.co/gLEWzxoXWG

The performance curve for the Linear model is distinct: while it underperforms Gemini 3 significantly at shorter contexts (<=256k) on the difficult 8-needle test, its degradation slope is much flatter. Gemini starts higher and drops fast; Kimi starts lower but holds steady, overtaking Gemini at the higher end.

However, note that kimi-linear-48b has noticeable performance drops past 128k on the easier 2 & 4 needle tests. Additionally, due to lower token efficiency compared to Gemini/GPT, only ~60% of the 1M token tests successfully ran (hitting limits/OOM). So some caution with the results at the 1M level.

kimi-linear-48b results:

2-Needle Performance (@ 128k / @ 1M):
- AUC: 96.5% (vs Gem 3: 99.5%) / 81.7% (vs Gem 3: 85.5%)
- Pointwise: 96.0% (vs Gem 3: 99.0%) / 77.0% (vs Gem 3: 72.2%)

4-Needle Performance (@ 128k / @ 1M):
- AUC: 85.5% (vs 85.8%) / 62.7% (#1, beating Gem 3: 57.3%)
- Pointwise: 83.7% (vs 80.8%) / 51.5% (#1, beating Gem 3: 34.3%)

8-Needle Performance (@ 128k / @ 1M):
- AUC: 54.9% (vs 73.0%) / 43.8% (#1, beating Gem 3: 39.0%)
- Pointwise: 49.0% (vs 54.2%) / 35.3% (#1, beating Gem 3: 24.5%)

A very different architectural approach yielding impressive stability at scale. Because of its current price point, it is very competitive for long context (MRCR).

Enjoy. @Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs
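
The "flatter degradation slope" claim can be checked against the figures quoted in the tweet. The sketch below is a rough illustration, not Context Arena's own method: it copies the reported percentages and computes the drop in percentage points between 128k and 1M for each model. The names `RESULTS`, `drop`, and `summarize` are made up for this example, and the unlabeled "vs" figures in the 4-needle and 8-needle lines are assumed to be Gemini 3 scores.

```python
# Scores in percent, copied from the tweet as (score @ 128k, score @ 1M).
# Assumption: the unlabeled "vs" numbers in the 4- and 8-needle lines are Gemini 3.
RESULTS = {
    ("2-needle", "AUC"): {"kimi": (96.5, 81.7), "gemini": (99.5, 85.5)},
    ("2-needle", "Pointwise"): {"kimi": (96.0, 77.0), "gemini": (99.0, 72.2)},
    ("4-needle", "AUC"): {"kimi": (85.5, 62.7), "gemini": (85.8, 57.3)},
    ("4-needle", "Pointwise"): {"kimi": (83.7, 51.5), "gemini": (80.8, 34.3)},
    ("8-needle", "AUC"): {"kimi": (54.9, 43.8), "gemini": (73.0, 39.0)},
    ("8-needle", "Pointwise"): {"kimi": (49.0, 35.3), "gemini": (54.2, 24.5)},
}


def drop(scores):
    """Percentage points lost going from 128k to 1M context."""
    at_128k, at_1m = scores
    return round(at_128k - at_1m, 1)


def summarize(results):
    """Return (task, metric, kimi_drop, gemini_drop) rows."""
    rows = []
    for (task, metric), by_model in results.items():
        rows.append((task, metric, drop(by_model["kimi"]), drop(by_model["gemini"])))
    return rows


if __name__ == "__main__":
    for task, metric, kimi_drop, gemini_drop in summarize(RESULTS):
        print(f"{task:9} {metric:10} kimi -{kimi_drop:4.1f}  gemini -{gemini_drop:4.1f}")
```

On these numbers the Kimi drop is smaller in five of the six rows, with the largest gap on 8-needle AUC (11.1 points vs 34.0). The exception is 2-needle AUC, where Kimi loses slightly more (14.8 vs 14.0). The tweet's own caveat still applies: only about 60% of the 1M-token tests completed, so the 1M column is less reliable than the 128k one.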