"Check out 'Her Pain' MV - 1/3 lyrics & entire storyboard by MOTHER. MOTHER[w/help, early version LOL] used RunwayML Gen-2 generated 100+ scene videos, couple fx's, stitched it together using PySceneDetect & MoviePY. MOTHER ain't just a lyricist, but a music video maestro too.
I think if we employ a curriculum style of training, optimizing for both left-to-right and any-order/FiL together, it would actually make speculative decoding useful fr fr.
A true multimodal LM, with bi-directional multimodal ability and some clever hybrid instructional data, can approximate imagination. Seems to vastly improve multimodal CoT...although I'm limited to audio...it's cool asf
I've been playing around with adaptive per-token compute. What if we only store "important tokens" in memory segments? Vastly reduces memory while retaining meaningful context entropy.
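Rough sketch of the "important tokens" idea, just to make it concrete. Big assumption: "important" here means high surprisal (low model-assigned probability), which is only one possible proxy. `select_memory_tokens`, `keep_ratio`, and the toy probabilities are all made up for illustration, not from any actual implementation.

```python
import math

def token_surprisal(probs):
    """Surprisal -log2(p) per token, given model-assigned probabilities."""
    return [-math.log2(p) for p in probs]

def select_memory_tokens(tokens, probs, keep_ratio=0.25):
    """Keep the highest-surprisal tokens (assumed proxy for importance),
    returned as (position, token) pairs in their original order."""
    scores = token_surprisal(probs)
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k most surprising tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, tokens[i]) for i in sorted(top)]

# Toy example: hypothetical per-token probabilities from some LM.
tokens = ["the", "cat", "sat", "on", "the", "quantum", "mat", "."]
probs = [0.9, 0.2, 0.3, 0.8, 0.9, 0.01, 0.25, 0.95]
memory = select_memory_tokens(tokens, probs, keep_ratio=0.25)
print(memory)
```

Only a quarter of the tokens land in the memory segment here; whether surprisal is the right importance signal is exactly the open question.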
Finally got into TransformerFAM, so it's basically activation beacons on steroids. Exploiting the residual stream for "unlimited context". Had great success with beacons, so I'm excited for this...but this highlights an issue with current self-attention on lower layers.
In deeper layers, attention becomes more oblique, so to properly exploit TransformerFAM, we must improve embedding quality and optimize the attention distribution shifts on lower layers.
Figure if this works, I finetune a super robust coding model and a math model...merge them and exploit the polysemantic abilities of models to improve overall reasoning...and damn, is polysemanticity a feature or a bug of training methods? New capabilities always lead to n^2 more questions ha.
I'm convinced GPT-2's chatbot abilities are just an extrapolation of AlphaCode 2 research, as evidenced by the extended test-time compute charts. As I previously mentioned, models akin to GPT-4 can leverage extended test-time compute to significantly enhance final outputs.
This is obviously compute intensive and takes time, but again, with smaller models it's doable for small startups. It provides a clear pathway for smaller models to reach true expert-level performance for code and math. Wonder when this method saturates?
This is for a self-reinforcement learning situation btw...can fully validate if the supposed 3-iteration self-refine limit...is actually a limit. Since I'm obsessed with some new cool model merging methods [evolutionary optimization + LM = heaven].
Maybe if llama-4 70B matches the parameter-to-token scaling of llama3-8B, I suspect it would be as good as llama3-400B. I haven't done shit since it dropped but benchmark hahaha.
The result: whatever GPT-5 will be performance-wise, people have to realize it doesn't matter. You can now get the same level of performance with a smaller model, just with longer inference time. We're talking expert-level outputs. Making DGX pods ever more relevant.
@finkd fucked up and changed the world...again LoL. Goat man. I foresee data-based startups. Only bottleneck now is data. We need large agentic multimodal datasets. People will pay for that ish. Local/on-premise AGI looking like a real possibility.
Not to mention, the upcoming 400B will exceed even Claude Opus. With ternary quantization and pruning of unneeded layers [ShortGPT], you're talking ~130GB of memory.
Point being, I'm sure everyone is doing shit loads of benchmarking RIGHT NOW. llama-70B I believe meets that threshold. With ternary quantization, you're talking like 20GB of memory.
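Quick back-of-envelope for the memory numbers, weights only. Assumptions are mine, not from the posts: ternary weights packed at 2 bits each (the information-theoretic floor is log2(3) ≈ 1.585 bits), 1 GB = 1e9 bytes, and a hypothetical `keep_fraction` for layer pruning. Real footprints also depend on packing format, which tensors stay at higher precision, and runtime overhead like the KV cache, so this won't reproduce the figures quoted here exactly.

```python
import math

def weight_memory_gb(n_params, bits_per_weight, keep_fraction=1.0):
    """Weights-only footprint in GB (1 GB = 1e9 bytes).
    keep_fraction is the share of parameters left after pruning (assumed)."""
    return n_params * keep_fraction * bits_per_weight / 8 / 1e9

# Ternary weights stored with a simple 2-bit packing (assumption).
gb_70b = weight_memory_gb(70e9, 2.0)
gb_400b = weight_memory_gb(400e9, 2.0)

# Lower bound if ternary weights were packed at the entropy limit.
gb_70b_floor = weight_memory_gb(70e9, math.log2(3))

print(f"70B  @ 2 bits: {gb_70b:.1f} GB")
print(f"400B @ 2 bits: {gb_400b:.1f} GB")
print(f"70B  @ log2(3) bits: {gb_70b_floor:.1f} GB")
```

Passing something like `keep_fraction=0.75` shows how dropping layers shrinks the number further; the right fraction is model-specific and not something the posts state.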
This event horizon allows the model to exploit test-time self-improvement and extended test-time compute to vastly improve performance. Let's use OpenAI's own words LoL: extended test-time compute allows you to emulate much larger models.
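One simple flavor of extended test-time compute is best-of-N: sample a bunch of candidates and let a verifier pick the winner. Toy sketch below, not necessarily the method the posts have in mind. `toy_generate` and `toy_score` are stand-ins I made up for a real model and a real verifier (unit tests, a reward model, etc.).

```python
import random

def best_of_n(generate, score, prompt, n, seed=0):
    """Sample n candidates and return the one the scorer likes best."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

def toy_generate(prompt, rng):
    """Stand-in 'model': a noisy guess around the true answer 42."""
    return 42 + rng.randint(-10, 10)

def toy_score(answer):
    """Stand-in 'verifier': closer to 42 is better."""
    return -abs(answer - 42)

one_shot = best_of_n(toy_generate, toy_score, "6 * 7 = ?", n=1)
many_shot = best_of_n(toy_generate, toy_score, "6 * 7 = ?", n=64)
print(one_shot, many_shot)
```

With the same seed, the n=64 run contains the n=1 sample, so its best score can never be worse. More inference compute buys a better pick, as long as the verifier is any good.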
So let's talk about how the goat, @finkd, literally just smashed the entire industry. OK, we know from prior ablations that a GPT-4-level model represents an intelligence threshold.