Looking forward to catching some sun at #NeurIPS2025 this week! I’ll be at 2 workshops presenting this work at the poster sessions:
- Continual and Compatible Foundation Model Updates (CCFM)
- Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
@ilyasut says the age of scaling is over - good thing we put this paper out in time!
Many recent embedding models are finetuned versions of pretrained LLMs.
We asked 🤓: How does retrieval performance scale with pretraining FLOPs?
📄 paper: https://t.co/xKbLeDr5aa
At NeurIPS and want to learn some tips and tricks for speeding up LLM training?
I'll be presenting our work on MosaicBERT, an encoder optimized for fast pretraining, today (Tues., Dec. 12), 5:15-7:15 CST.
Come say hi!
https://t.co/6zTLtkUtSR
It's the most wonderful time of the year...come see us and the @databricks team at #NeurIPS2023 for a week of talks, parties, and connection! First up: join our Expo Day talk at 10 AM tomorrow to learn more about optimizing and reasoning on #LLM inference.
@PatronusAI There is so much work to be done in the LLM eval space in industry, and I’ve been impressed by the thoughtful work @PatronusAI has done: the UX of their platform, the focus on a real industry need, and the meticulous curation of the EnterprisePII dataset. Well done!
📦 To evaluate the coding capabilities of LLMs, you need to execute the code. But what if the LLM spits out malicious code?😱
With MosaicML, you can now evaluate #LLMs on code gen benchmarks (e.g., HumanEval) in an effortless, end-to-end secure framework.
https://t.co/mDD4ic7msb
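To make the "you need to execute the code" point concrete, here is a minimal sketch of checking one generated solution against its unit tests. It is an illustration only, not MosaicML's framework: the function name `passes_tests` and its parameters are hypothetical, and a child process with a timeout is NOT a security boundary. A real setup would run untrusted code in an isolated sandbox.

```python
import subprocess
import sys
import tempfile


def passes_tests(candidate_code, test_code, timeout_s=10.0):
    """Return True if the candidate plus its tests exit cleanly.

    Hypothetical helper for illustration. The code runs in a separate
    interpreter with a timeout, which guards against hangs and crashes
    but does not stop malicious code from touching the machine.
    """
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                # -I starts Python in isolated mode (ignores user site
                # packages and PYTHON* environment variables).
                [sys.executable, "-I", "-c", program],
                cwd=workdir,          # scratch directory, deleted afterwards
                capture_output=True,  # keep model output off our stdout
                timeout=timeout_s,    # kill runaway generations
            )
        except subprocess.TimeoutExpired:
            return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` counts as a pass, while a candidate that returns `a - b` fails the same test.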
If you're at #ICML 🌴on Saturday, make sure to check out the https://t.co/H7Rfblz4lM workshop on efficient training of LLMs!
@abhi_venigalla and @jefrankle will be at our poster on optimized pretraining of MosaicBERT ⚡️🚄
📜workshop paper: https://t.co/MjHrqiyj6J
🧵
The MPT suite of large language models (LLMs) by MosaicML has become incredibly popular. But, what makes these models so special? Although there are a variety of reasons for the popularity of MPT, I find these models to be especially useful due to a few unique components…
Fully open-source. MPT models, including MPT-7B and MPT-30B, carry an Apache 2.0 license, meaning that they can be used commercially without any limitations. Plus, these models are accompanied by an entire open-source code repository for fine-tuning, evaluating, or even pre-training these models from scratch (see replies for more details). Given that pre-training a base LLM is the most prohibitive/expensive component of any LLM-based system, the MPT foundation series is a great starting point for building specialized LLMs that solve domain-specific problems.
Fast inference. MPT models are based upon a typical, decoder-only transformer architecture. But, they make a few key modifications to this architecture, including:
- Low precision layer norm
- Flash Attention
- ALiBi (instead of normal positional embeddings)
Due to these modifications, MPT models perform inference very quickly (i.e., 1.5-2X faster than similarly-sized LLaMA models) with HuggingFace inference pipelines. Plus, MPT models are completely compatible with libraries like FasterTransformer, which could be used to further boost inference speed.
Context length. Due to their use of ALiBi, MPT-7B and MPT-30B can handle large context windows and can even extrapolate to context lengths beyond those seen during training. To show this, MPT-7B was fine-tuned on data with a 64K token context length (derived from the books3 corpus of fiction novels). Researchers at MosaicML found that the resulting MPT-7B-StoryWriter model could handle these long contexts and extrapolate even further, to context windows as large as 84K tokens. They even fed it the entire text of The Great Gatsby and had it generate an epilogue!
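For readers unfamiliar with ALiBi, here is a minimal sketch of the idea, not MosaicML's implementation. Instead of adding positional embeddings, each attention head adds a penalty to its attention scores that grows linearly with the distance between query and key. The function names, the power-of-two head count, and the causal masking with -inf are assumptions for illustration; the slope schedule is the geometric sequence described in the ALiBi paper.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two (the simple case in the paper).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Build a [num_heads][seq_len][seq_len] bias to add to attention scores.

    Entry [h][i][j] is -slope_h * (i - j) for keys at or before the query
    (j <= i) and -inf for future keys (causal mask, an assumption here).
    """
    neg_inf = float("-inf")
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


if __name__ == "__main__":
    # Nothing here is learned or tied to a maximum length, so the same
    # function works for any seq_len, including ones longer than training.
    short = alibi_bias(num_heads=8, seq_len=4)
    longer = alibi_bias(num_heads=8, seq_len=16)
    print(short[0][3])       # penalties for the last query in head 0
    print(len(longer[0]))    # 16
```

Because the bias depends only on the relative distance between tokens, it can be generated for sequence lengths never seen in training, which is the property the extrapolation claim above relies on.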
Performance. Finally, MPT models perform really well. MPT-7B achieves performance on par with LLaMA-7B across a variety of standard benchmarks. MPT-30B lags slightly behind LLaMA-30B and Falcon-40B on text-based tasks, but it tends to perform better on programming tasks. Plus, MPT-30B seems to exceed the quality of GPT-3. Put simply, these base models are high-quality and serve as a great foundation for creating open-source alternatives to proprietary systems like ChatGPT or GPT-4.
Big news: we've agreed to acquire @MosaicML, a leading generative AI platform. I couldn’t be more excited to join forces once the deal closes. https://t.co/L4TyrruUEU
Boom! We at @MosaicML plan to unite with an amazing group of colleagues at @Databricks! And don’t worry, still the same great @MosaicML taste: our brand, products, and mission remain. But, going bigger, much bigger. So watch out for more from a truly amazing team! Bravo team!
Working on the MPT-30B-Chat and Instruct models was incredibly exciting. The team, the software, and the hardware were all exceptional (FYI, H100s are _really_ fast)
🚨 A few months ago we announced that you can train Stable Diffusion from scratch for less than $125k using the MosaicML platform.
A major price drop is coming...and we have the training run to back it up. Stay tuned for a major announcement this week!
@calumbirdo Thanks for highlighting the graphic I made for the @MosaicML blog post about training GPT-3 quality models for <$500k! I can’t wait to see what kind of graphs you add to your calculator for us visual learners.
https://t.co/AJkxRmaSRs