Benjamin Bergner

@bergbenj

PhD student at HPI

Berlin

Joined July 2014

107 Following

165 Followers

60 Posts

Pinned Tweet

Benjamin Bergner @bergbenj

about 3 years ago

Ever got out-of-memory errors while training neural nets?⚡️
Happy to present Iterative Patch Selection, a simple method to process arbitrarily large images with consumer-grade GPUs, at #ICLR2023
📜Paper: https://t.co/yUYqYy0ETG
🐍Code: https://t.co/3RVeQgKrYE
Find out more 🧵

1

5

0

0

541

Benjamin Bergner @bergbenj

about 1 year ago

🚀 Heading to #CVPR2025?
Check out our Token Cropr poster: a token pruning method that boosts inference throughput across quite a few vision tasks!
📍 Friday, 06/13 | 4–6 PM CDT
📌 ExHall D, Poster #416
🔗 https://t.co/JzJ5Xne1HZ
👀 @CVPR #CVPR25

0

2

0

0

210

Benjamin Bergner @bergbenj

over 1 year ago

@Swarooprm7 Various open questions: https://t.co/V47KXNLWTp

Benjamin Bergner @bergbenj

over 1 year ago

What are the main reasons why DeepSeek-R1, even the Zero version, works so well?

1

0

0

1

194

0

0

0

0

104

Benjamin Bergner @bergbenj

over 1 year ago

For verifiable rewards, how could this be scaled beyond easily verifiable math and coding problems to arbitrary tasks? Or could it be that a few math/coding problems are sufficient to learn general reasoning across tasks?

0

0

0

1

40

Benjamin Bergner @bergbenj

over 1 year ago

What are the main reasons why DeepSeek-R1, even the Zero version, works so well?

1

0

0

1

194

Benjamin Bergner @bergbenj

over 1 year ago

Is it the quality of the base model? Is it the training process (RL vs. SFT)? Is it PPO vs. GRPO for RL? Is it verifiable rewards vs. using a reward model?

1

0

0

0

60

Benjamin Bergner @bergbenj

almost 2 years ago

@giffmana @ylecun Thanks for your post. Just wanted to add that if you work with very high resolution images (megapixel/gigapixel) and small GPUs, IPS Transformer might be interesting: https://t.co/pfHv7Movfv

1

5

0

1

511

Benjamin Bergner @bergbenj

almost 2 years ago

@jxmnop Do you have an example for such a new task/dataset?

0

0

0

0

73

Benjamin Bergner @bergbenj

almost 2 years ago

@scottgeng00 Is the answer also NO if you train on much more synthetic data than you could retrieve from the original dataset?

0

0

0

0

69

Benjamin Bergner @bergbenj

about 2 years ago

@ylecun Why not window attention in the first layers, followed by global attention?

0

0

0

0

99

Benjamin Bergner @bergbenj

about 2 years ago

@mkwng cool idea. just wanted to try on your linked website but got an error when inserting a random location+distance: "An error occurred while generating the trail. Please try again."

1

3

0

0

503

Benjamin Bergner @bergbenj

about 2 years ago

@jxmnop You can combine large encoders with small decoders for efficient generation: https://t.co/EQHUlFaVyD

0

0

0

0

18

Benjamin Bergner @bergbenj

over 2 years ago

@Euclaise_ A bit related: https://t.co/xN6kc7Hh2a

0

0

0

0

17

Benjamin Bergner @bergbenj

over 2 years ago

When do you open the Berlin office?

over 2 years ago

Join the @xAI London office!

2K

13K

2K

349

15M

1

0

0

0

223

bergbenj retweeted

Andrii Skliar 🇺🇦 @avskliar

over 2 years ago

🚀 Excited to share our latest work "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding" now on arXiv! We're taking strides in making language models faster & more efficient on text generation tasks like translation & summarization.🔍 [https://t.co/QJbUDQOwAV]

2

52

14

23

11K

Benjamin Bergner @bergbenj

over 2 years ago

@giffmana @AdeptAILabs @brainshawn @XiaohuaZhai Great work. Have you also trained smaller models that you would be able to release, similar to Microsoft's GIT?

1

2

0

0

281

Benjamin Bergner @bergbenj

over 2 years ago

Regarding point 8: Doesn't memory usage/training time depend on sequence length and training dataset? What have you used for your reported numbers?

Sebastian Raschka

over 2 years ago

I ran hundreds if not thousands of LoRA & QLoRA experiments to finetune open-source LLMs, and here’s what I learned:
1. Despite the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
2. QLoRA presents a trade-off that might be worthwhile if you're constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
3. When finetuning LLMs, the choice of optimizer shouldn't be a major concern. While SGD on its own is suboptimal, there's minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn't significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
6. If you're incorporating LoRA, ensure it's applied across all layers, not just to the Key and Value matrices, to maximize model performance.
7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank's value.
8. 7B models can be finetuned efficiently within a few hours on a single GPU possessing 14 Gb of RAM. With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.

27

1K

217

1K

367K

1

2

0

0

4K

Benjamin Bergner @bergbenj

about 3 years ago

Check It Out #ICLR #ICLR2023

0

2

1

0

219

Benjamin Bergner @bergbenj

about 3 years ago

Visit the #ICLR poster session for a chat! Poster session 2, Mon 1 May 16:30 - 18:30 CEST MH1-2-3-4 #25 Kigali, Rwanda

0

0

0

0

115

Benjamin Bergner @bergbenj

about 3 years ago

Last but not least: This project would not have been possible without @LippertChr and @aravindhm_ Thank you!

1

0

0

0

53
