Ever got out-of-memory errors while training neural nets?⚡️
Happy to present Iterative Patch Selection, a simple method to process arbitrarily large images with consumer-grade GPUs, at #ICLR2023
📜Paper: https://t.co/yUYqYy0ETG
🐍Code: https://t.co/3RVeQgKrYE
Find out more 🧵
For verifiable rewards, how could this be scaled beyond easily verifiable math and coding problems to arbitrary tasks? Or could it be that a few math/coding problems are sufficient to learn general reasoning across tasks?
Is it the quality of the base model?
Is it the training process (RL vs. SFT)?
Is it PPO vs. GRPO for RL?
Is it verifiable rewards vs. using a reward model?
@giffmana@ylecun Thanks for your post. Just wanted to add that if you work with very high resolution images (megapixel/gigapixel) and small GPUs, IPS Transformer might be interesting: https://t.co/pfHv7Movfv
@mkwng cool idea. just wanted to try on your linked website but got an error when inserting a random location+distance: "An error occurred while generating the trail. Please try again."
🚀 Excited to share our latest work "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding" now on arXiv! We're taking strides in making language models faster & more efficient on text generation tasks like translation & summarization.🔍 [https://t.co/QJbUDQOwAV]
I ran hundreds if not thousands of LoRA & QLoRA experiments to finetune open-source LLMs, and here’s what I learned:
1. Despite the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
2. QLoRA presents a trade-off that might be worthwhile if you're constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
3. When finetuning LLMs, the choice of optimizer shouldn't be a major concern. While SGD on its own is suboptimal, there's minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn't significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
6. If you're incorporating LoRA, ensure it's applied across all layers, not just to the Key and Value matrices, to maximize model performance.
7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank's value.
8. 7B models can be finetuned efficiently within a few hours on a single GPU possessing 14 Gb of RAM.
With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.
Ever got out-of-memory errors while training neural nets?⚡️
Happy to present Iterative Patch Selection, a simple method to process arbitrarily large images with consumer-grade GPUs, at #ICLR2023
📜Paper: https://t.co/yUYqYy0ETG
🐍Code: https://t.co/3RVeQgKrYE
Find out more 🧵