GRadient INformation makes MoE 😁
achieves 79.4 on MMLU with 6.6B active parameters & occasionally answers the strawberry question correctly
highlights:
- push 16x3.8B to reach 14B capacity
- trained experts have expertise
- trained routing invents a shared expert
- sound gradient
Microsoft GenAI is looking for a summer intern to work on Sparse LLMs. If you are interested, please DM me or send a resume to yaliu10 at microsoft dot com