How do you scale Transformers to infinite depth while ensuring numerical stability? In fact, LayerNorm is not enough.
But *shaping* the attention mechanism works!
https://t.co/4DbIfYMQr3
w/ @ChuningLi @mufan_li @bobby_he @THofmann2017 @cjmaddison @roydanroy