FrontiersMind @frontiersmind - Twitter Profile

2 days ago

Thank you, everyone, for the love and support. We will keep bringing great models and techniques for training LLMs efficiently!!

Rohan Paul

@rohanpaul_ai

3 days ago

This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. It reached about 1.7 to 1.8 times faster prefill at large context lengths.

Standard attention makes every token run through every attention head, even when some heads are not useful for that token. The paper’s idea, called Grouped Query Experts, keeps the normal key and value cache from grouped-query attention but routes each token to only a few query-head experts.

Grouped Query Experts sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost. This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful.

The authors trained 250M-parameter models on 30B tokens and compared the method with a normal grouped-query attention baseline. The best version matched the baseline’s average accuracy, 56.04 versus 55.86, while using 9 of 16 query-attention computations.

The result shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on.

----

Link – arxiv.org/abs/2606.20945

Title: "Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention"
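A minimal NumPy sketch of the routing idea described in the post, not the paper's implementation. Everything specific here is an assumption made for illustration: 16 query heads sharing 4 key-value heads, head 0 acting as the always-on shared head, a linear router that picks the top 8 of the remaining 15 heads per token (so 9 of 16 are active), and softmax gate weights. The names `routed_gqa` and `w_router` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def routed_gqa(x, wq, wk, wv, w_router, n_q_heads=16, n_kv_heads=4, top_k=8):
    """Causal grouped-query attention with per-token query-head routing.

    Head 0 is always active (assumed shared head); each token additionally
    activates its top_k highest-scoring heads among the other heads.
    """
    T, _ = x.shape
    d_head = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per shared K/V head

    q = (x @ wq).reshape(T, n_q_heads, d_head)
    k = (x @ wk).reshape(T, n_kv_heads, d_head)  # K/V layout is plain GQA
    v = (x @ wv).reshape(T, n_kv_heads, d_head)

    # Router: one score per token for each non-shared query head.
    logits = x @ w_router                          # (T, n_q_heads - 1)
    gates = softmax(logits)
    top = np.argsort(-logits, axis=-1)[:, :top_k]  # chosen routed heads
    routed = np.zeros((T, n_q_heads - 1), dtype=bool)
    np.put_along_axis(routed, top, True, axis=1)
    active = np.concatenate([np.ones((T, 1), dtype=bool), routed], axis=1)

    # Reference computation: every head is evaluated and inactive ones are
    # zeroed afterwards. A real kernel would skip them to save compute.
    causal = np.tril(np.ones((T, T), dtype=bool))
    out = np.zeros((T, n_q_heads, d_head))
    for h in range(n_q_heads):
        kv = h // group
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        scores = np.where(causal, scores, -np.inf)
        out[:, h] = softmax(scores) @ v[:, kv]

    # Shared head gets weight 1; routed heads get their gate value if chosen.
    weights = np.concatenate([np.ones((T, 1)), gates], axis=1) * active
    out = out * weights[:, :, None]
    return out.reshape(T, n_q_heads * d_head), active

rng = np.random.default_rng(0)
T, d_model, n_q, n_kv, d_head = 6, 32, 16, 4, 8
x = rng.normal(size=(T, d_model))
wq = rng.normal(size=(d_model, n_q * d_head))
wk = rng.normal(size=(d_model, n_kv * d_head))
wv = rng.normal(size=(d_model, n_kv * d_head))
w_router = rng.normal(size=(d_model, n_q - 1))

out, active = routed_gqa(x, wq, wk, wv, w_router)
print(active.sum(axis=1))  # active query heads per token: 1 shared + 8 routed
```

The key and value projections are untouched by the router, which is why the key-value cache keeps the same shape as in ordinary grouped-query attention; only the query-side work varies per token.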

FrontiersMind retweeted the Rohan Paul (@rohanpaul_ai) post quoted above.
