haybales @hybls - Twitter Profile

haybales @hybls

2 days ago

@cremieuxrecueil Anthropic has really killer models and aesthetic but the rest of their company kinda sucks

0

317

haybales @hybls

4 days ago

@theo Need CUDA or MLX native for the work I’m doing. Whenever they update the Mac minis I will be there

0

69

haybales @hybls

5 days ago

@SemiAnalysis_ I think the transformer architecture (sans-attention) is the real critical piece! Attention can be swapped out at the head level. The transformer architecture just lets you build giant systems without gradients blowing up

0

6

0

4

2K

haybales @hybls

5 days ago

Top 5 on BabyLM (non-challenge submission) with my custom constant-memory, linear-scaling StateHead architecture. Beat several strong baselines on the Strict-Small track (10M words). Trained efficiently on my MacBook Air. Lots of performance still on the table!

hybls's tweet photo: https://t.co/QUWA5xAzc7

0

39

haybales @hybls

8 days ago

@Halo Hate the fat halo. Even if it’s lore-inaccurate I liked the elegance of the big skinny ring arcing overhead

0

793

haybales @hybls

10 days ago

@ID_AA_Carmack Should we bury our data centers

0

11

haybales @hybls

16 days ago

All models ~1M params, trained for 1 epoch on 20M tokens of TinyStories.

0

14

haybales @hybls

16 days ago

Working on an experiment where I take a transformer model and replace the attention heads with super-basic RNNs; each head has its own state. Calling it StateHead, and I'm getting super unexpected results compared to vanilla transformers and even Mamba2 at the same layer/head/param counts
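
For readers who want a concrete picture of the idea in the post above, here is a minimal sketch of what "replace each attention head with a basic RNN that keeps its own state" could look like. It is not the author's StateHead code: the tanh (Elman-style) update, the random weight init, the even split of the model dimension across heads, and the names `StateHead`, `state_head_layer`, and `matvec` are all assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class StateHead:
    """Hypothetical head: a tiny tanh RNN with its own hidden state.

    Assumption: the posts only say "super-basic RNN"; the exact update
    rule used here is a guess for illustration.
    """

    def __init__(self, d_head, rng):
        scale = 1.0 / math.sqrt(d_head)

        def init():
            return [[rng.uniform(-scale, scale) for _ in range(d_head)]
                    for _ in range(d_head)]

        self.d_head = d_head
        self.w_in = init()   # input-to-state weights
        self.w_rec = init()  # state-to-state weights

    def run(self, xs):
        # The state has a fixed size, so memory does not grow with sequence
        # length and the cost is linear in the number of tokens.
        state = [0.0] * self.d_head
        outputs = []
        for x in xs:
            pre = [a + b for a, b in zip(matvec(self.w_in, x),
                                         matvec(self.w_rec, state))]
            state = [math.tanh(p) for p in pre]
            outputs.append(state)
        return outputs


def state_head_layer(tokens, heads):
    """Split each token vector across heads, run every head's RNN
    independently, then concatenate per-head outputs (as multi-head
    attention does with its heads)."""
    d_head = heads[0].d_head
    per_head = []
    for i, head in enumerate(heads):
        chunk = [tok[i * d_head:(i + 1) * d_head] for tok in tokens]
        per_head.append(head.run(chunk))
    return [[v for head_out in per_head for v in head_out[t]]
            for t in range(len(tokens))]


rng = random.Random(0)
heads = [StateHead(d_head=4, rng=rng) for _ in range(2)]
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
out = state_head_layer(tokens, heads)
print(len(out), len(out[0]))  # 5 tokens in, 5 vectors of width 8 out
```

In a full model this layer would sit where multi-head attention normally does, inside the usual residual and normalization wrapper, which is left out here.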