arXiv preprint:
When does removing LayerNorm help?
Main finding: DyT (Dynamic Tanh) is not uniformly helpful; it behaves like a regime-dependent implicit regularizer.
Paper: https://t.co/bRz1CYmX3c
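For readers who have not seen DyT before, here is a minimal sketch of what gets swapped for LayerNorm, assuming DyT refers to Dynamic Tanh, i.e. the elementwise map gamma * tanh(alpha * x) + beta. The function names, the scalar gamma and beta, and the default alpha of 0.5 are illustrative assumptions, not details taken from the paper above.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over one token's feature vector (no affine parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, applied elementwise.

    alpha is a learnable scalar in practice; gamma and beta are usually
    per-channel vectors but are kept as scalars here for brevity
    (an assumption of this sketch).
    """
    return [gamma * math.tanh(alpha * v) + beta for v in x]

features = [-4.0, -0.5, 0.0, 0.5, 4.0]
print(layer_norm(features))  # rescaled using the mean and variance of the vector
print(dyt(features))         # squashed elementwise, no statistics computed
```

Unlike LayerNorm, the DyT function computes no statistics across the feature vector; it only squashes each value, which is one way to see why its effect could depend on the training regime.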
KAN
Kolmogorov-Arnold Networks
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"),
Knowledge Fusion of LLMs
Is it possible to merge existing models into a more potent model?
Approaches such as weight merging and model ensembling have already shown that this can work.
This work proposes FuseLLM, whose core idea is to externalize knowledge from multiple LLMs and transfer their capabilities to a target LLM.
It leverages the generative distributions of source LLMs to externalize both their collective knowledge and individual strengths and transfer them to the target LLM through continual training.
To put it simply, the idea is to benefit from the strengths of all the LLMs and combine them into one integrated model.
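Here is a rough sketch of how that transfer could look for a single token position. It assumes the source models' vocabularies are already aligned with the target's, and it makes two illustrative choices that the summary above does not specify: the fused distribution is the source distribution with the lowest cross-entropy on the gold token, and the training loss mixes the usual language-modeling loss with a fusion loss through a hypothetical weight `lam`. All function names are made up for this example.

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold token under one distribution."""
    return -math.log(probs[gold])

def fuse_min_ce(source_probs, gold):
    """Pick the source distribution with the lowest cross-entropy on the gold token."""
    return min(source_probs, key=lambda p: cross_entropy(p, gold))

def fusion_loss(target_probs, fused_probs):
    """Cross-entropy of the target distribution against the fused soft label."""
    return -sum(q * math.log(p) for p, q in zip(target_probs, fused_probs) if q > 0)

def combined_loss(target_probs, source_probs, gold, lam=0.9):
    """Weighted sum of the language-modeling loss and the fusion loss.

    lam is a hypothetical mixing weight chosen for illustration.
    """
    fused = fuse_min_ce(source_probs, gold)
    clm = cross_entropy(target_probs, gold)
    return lam * clm + (1 - lam) * fusion_loss(target_probs, fused)

# Toy vocabulary of 3 tokens; the gold next token is index 2.
source_a = [0.2, 0.3, 0.5]
source_b = [0.1, 0.1, 0.8]
target = [0.3, 0.3, 0.4]

fused = fuse_min_ce([source_a, source_b], gold=2)
loss = combined_loss(target, [source_a, source_b], gold=2)
print(fused, loss)
```

In continual training, a loss of this kind would be averaged over all token positions in the corpus, so the target model is pulled toward whichever source model is most confident about the correct token at each position.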
The paper finds that FuseLLM can improve the performance of the target model across a range of capabilities such as reasoning, common sense, and code generation.
By the way, you can also perform the fusion among fine-tuned LLMs that specialize in specific tasks.
This continues to be an interesting research area, so I hope to document more of the new ideas and findings I come across.