@SemiAnalysis_ I think the transformer architecture (sans attention) is the real critical piece! Attention can be swapped out at the head level. The transformer architecture just lets you build giant systems without gradients blowing up.
Top 5 on BabyLM (non-challenge submission) with my custom constant-memory linear-scaling StateHead architecture.
Beat several strong baselines on the Strict-Small track (10M words). Trained efficiently on my MacBook Air.
Lots of performance still on the table!
Working on an experiment where I take a transformer model and replace the attention heads with super-basic RNNs. Each head has its own state.
Calling it StateHead, and I'm getting super unexpected results compared to vanilla transformers and even Mamba2 with the same layers/heads/params.
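Rough sketch of the idea, not the actual StateHead code: every head is a tiny RNN with its own state, and the head states get concatenated where attention would concatenate head outputs. The Elman-style tanh cell, the weight init, and all names here (`make_head`, `head_step`, `state_head_layer`) are my assumptions for illustration. Output projection, residuals, and layer norm are left out.

```python
import math
import random


def make_head(d_model, d_head, rng):
    """Random weights for one hypothetical head: input and recurrent matrices."""
    scale = 1.0 / math.sqrt(d_model)
    w_in = [[rng.uniform(-scale, scale) for _ in range(d_model)] for _ in range(d_head)]
    w_rec = [[rng.uniform(-scale, scale) for _ in range(d_head)] for _ in range(d_head)]
    return w_in, w_rec


def head_step(head, state, x):
    """One assumed Elman-style update: new_state = tanh(W_in x + W_rec state)."""
    w_in, w_rec = head
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[i], x))
            + sum(w * si for w, si in zip(w_rec[i], state))
        )
        for i in range(len(state))
    ]


def state_head_layer(heads, xs, d_head):
    """Run every head over the sequence; each head keeps its own state."""
    states = [[0.0] * d_head for _ in heads]
    outputs = []
    for x in xs:  # one pass over the tokens: linear in sequence length
        states = [head_step(h, s, x) for h, s in zip(heads, states)]
        # Concatenate head states where attention would concatenate head outputs.
        outputs.append([v for s in states for v in s])
    return outputs, states


rng = random.Random(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
heads = [make_head(d_model, d_head, rng) for _ in range(n_heads)]
xs = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(5)]
outputs, final_states = state_head_layer(heads, xs, d_head)
```

The only thing carried between tokens is `n_heads * d_head` floats, no matter how long the sequence gets. That is the constant-memory part, and the single loop over tokens is the linear-scaling part.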