@teortaxesTex Zhipu could take the more straightforward Whale and Moonshot innovations, scale them to what they can handle, and apply the post-training framework they already have. It's also possible for DS and Kimi to step up their RL and iteration speed
@teortaxesTex Evidently even last gen 700B models are far from saturated and at this point every good checkpoint is an asset for developing the next one. The absolute moat is still ephemeral for now
@teortaxesTex If they applied current recipe sure, but RL will get better by then too. I was thinking 1.5T class, which should be manageable.
I'm just really impressed by 5.1 -> 5.2 jump on general capabilities ig. I expected code but not everything else. Feels like they cracked something.
@teortaxesTex Mythos tier is a tech achievement, but is it necessary for Fable-like capabilities by Q4? I doubt GLM-5.2 itself is in the same scale ballpark as the closed source models it's matching on either parameters or data.
@teortaxesTex GLM-5.2 is like half the size of Opus. It's also architecturally a V3 family model. GLM-5, while my favorite for niche tasks, wasn't special and GLM-5.1 was codemaxxed and regressed on those tasks in my exp.
Incredible things are happening in China.
@teortaxesTex I mean, they ARE drones. The question is how to make UGVs cost less to deploy than the UAVs needed to destroy them. A swarm of these covering each other under an overlapping AA umbrella could create significant fire density against UAVs, and bullets are still cheaper.
@teortaxesTex Do you specify the development process flow? I mostly get these loops when it messes up custom CoT and freaks out because it can't erase the steps that are already in wrong order.
@invizive @teortaxesTex @Donogzs Every architecture would have to "go through all the context" by virtue of constraints of information processing. Attn can be log(n) without fundamental changes.
What does it have to do with inventing a better arch anyway? It's a specific capability threshold in math and ML.
@teortaxesTex In a shocking development people saying "make us the sole members of permanent upper class for your own good and safety" mean exactly what it sounds like.
@teortaxesTex I mean, makes sense if you think about it as "rotate the self-attention." In a sequence, token m represents what is discretely appended at position m, not the accumulated state.
@akarlin The ground truth for SWE tasks and math is simple, sparse, and objective. "Good writing" is an extremely jagged shape that requires expensive human feedback and good heuristics. The former has high financial returns, the latter is niche atm. Resources are invested accordingly.
@teortaxesTex @rasbt They explicitly reject both state tracking and SWA hybrids on their site. The specific claim seems to be a DSA analog with a working linear indexer. Which would be both good and not that unrealistic, but the "not your average transformer" marketing is so ass it's hard to believe.
@teortaxesTex Long if factual but oozing scam energy. Why would you invent a buzzword name for your attention if you are not going to release the algo?