So what about compressing the input during encoding, into what we call “nuggets”? Surprisingly, we found that transformers can use 10% or even 5% of the tokens to represent the texts with negligible information loss. How did we achieve that?
We found that many approaches to long-range transformers (such as sparsified pattern, recurrence, and kernels) don’t actually translate into NLP task performance (https://t.co/54pkBWR4q2, #EACL2023). Can we fix it?