Made With ML

@MadeWithML

Learn how to responsibly develop, deploy & manage machine learning. Maintained by @GokuMohandas

Learn machine learning →

Joined May 2019

3 Following

10K Followers

649 Posts

MadeWithML retweeted

Goku Mohandas @GokuMohandas

almost 2 years ago

Excited to share our end-to-end LLM workflows guide that we’ve used to help our industry customers fine-tune and serve OSS LLMs that outperform closed-source models in quality, performance and cost. https://t.co/u9hvVj7E24 1/🧵

237

211

20K

MadeWithML retweeted

Guillermo Rauch

@rauchg

over 2 years ago

An AI-generated clone of HN built with @nextjs App Router ◆ Uses PPR and streaming Node.js SSR ◆ Fully dynamic, fresh data from Postgres ◆ All the UIs bootstrapped with @v0 ◆ Content via @mistralai 8x7B and @anyscalecompute Tools What I've learned 🧵 https://t.co/HSbl34jzXY

896

104

821

286K

MadeWithML retweeted

Anyscale

@anyscalecompute

over 2 years ago

@rauchg Glad you're finding it useful! Check out our accompanying blog post and the evaluation experiments we ran comparing across a suite of open-source and proprietary LLMs: https://t.co/AXzFK7iHFF

MadeWithML retweeted

Guillermo Rauch

@rauchg

over 2 years ago

Very impressed with @anyscalecompute's endpoints, which support tools / function calling.

2LOC to play with Mixtral as a replacement for GPT 🤯 https://t.co/TNIeIkFUg7

385

194

85K

MadeWithML retweeted

Goku Mohandas @GokuMohandas

over 2 years ago

It's been nice to see small jumps in output quality in our RAG applications from chunking experiments, contextual preprocessing, prompt engineering, fine-tuned embeddings, lexical search, reranking, etc. but we just added Mixtral-8x7B-Instruct to the mix and we're seeing a 🤯 step-function improvement (even compared to gpt-4-0613, see eval reports below)!

🤔 @pcmoritz and I are curious if others are seeing quality jumps like this in their applications and any ideas as to why (MoE, training cutoffs, etc.)?

- 📜 Blog: https://t.co/QHgOXPT7S0
- 💻 Code: https://t.co/GMNrsHAhpY
- 📓 Notebook: https://t.co/UPXSkwDt6h
- 🚀 Endpoints: https://t.co/4tSu0zGM3G (Mixtral at $0.5 / million tokens on @anyscalecompute)

438

483

117K

MadeWithML retweeted

Robert Nishihara

@robertnishihara

over 2 years ago

The Llama Guard model is now available on Anyscale Endpoints. Get started here: https://t.co/SBYL7T5NQO Example:

17K

MadeWithML retweeted

Robert Nishihara

@robertnishihara

over 2 years ago

One of the most common asks we get is for public (and reproducible) performance benchmarks. LLM inference performance benchmarks are subtle, and this is a rapidly evolving space, so numbers quickly become stale. But to make comparisons, we need to be talking about the same thing.

So today, we are open sourcing our benchmark suite. We invite the community to run it and to collaborate with us on augmenting it. 🎉🎉 https://t.co/TcVOddYxWM

Alright, so what metrics are we measuring 🤔

The following are fundamental metrics. In general we are interested not just in the mean but in the distribution (P50, P90, ...) 📊

⚫️ Time to first token: The time before the LLM returns the first token. This matters for streaming applications, especially chatbots.
⚫️ Inter-token latency: The average time between consecutive tokens. We choose to include the time to first token in this measurement to avoid the degenerate case where some systems start streaming very late.
⚫️ End-to-end latency: The time from submitting the query to receiving the full response.
⚫️ Cost (e.g., per token): API providers can often trade off one of the other metrics for cost. For example, you can reduce latency by running the same model on more GPUs or using higher-end GPUs. When running models yourself on dedicated compute, replace cost with throughput (tokens / second).

Interestingly, these metrics can be traded off against each other. Some considerations when doing so 👇

1⃣ A given model (e.g., Code Llama 34B) can be deployed in many ways on the same hardware (say a p4de instance). E.g., N replicas each with 8/N GPUs.
2⃣ Pipeline parallelism can improve throughput at the expense of latency.
3⃣ Time to first token is bottlenecked by compute whereas inter-token latency is bottlenecked by GPU memory bandwidth.

This is just the start. We hope to work with other LLM providers to increase the level of rigor around benchmarking and provide some consistency around the metrics that we report.

Read the blog post here: https://t.co/sTEQDrQFUm
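The latency metrics described in the post above can be sketched from per-token arrival timestamps. This is a minimal illustration, not the open-sourced benchmark suite itself: the `RequestTrace` record, the function names, and the nearest-rank percentile are all assumptions made for the example. Inter-token latency here follows the post's definition by including the time to first token, so it is the total elapsed time divided by the number of output tokens.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record type)."""
    submitted_at: float
    token_times: List[float]  # arrival time of each output token, ascending


def time_to_first_token(trace: RequestTrace) -> float:
    """Time from submitting the query until the first token arrives."""
    return trace.token_times[0] - trace.submitted_at


def end_to_end_latency(trace: RequestTrace) -> float:
    """Time from submitting the query until the last token arrives."""
    return trace.token_times[-1] - trace.submitted_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Average time per token, with time to first token included.

    Including the first-token wait penalizes systems that start
    streaming very late, as the post describes.
    """
    return end_to_end_latency(trace) / len(trace.token_times)


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile (an assumed choice; other definitions exist)."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up traces, only to show how the distribution is reported.
    traces = [
        RequestTrace(0.0, [0.40, 0.45, 0.50, 0.55]),
        RequestTrace(0.0, [0.90, 0.95, 1.00, 1.05]),
        RequestTrace(0.0, [0.30, 0.36, 0.42, 0.48]),
    ]
    ttfts = [time_to_first_token(t) for t in traces]
    for p in (50, 90):
        print(f"TTFT P{p}: {percentile(ttfts, p):.2f}s")
```

Reporting P50 and P90 instead of only the mean keeps a few slow requests from being hidden by many fast ones.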

119

40K

MadeWithML retweeted

Sanyam Bhutani

@bhutanisanyam1

over 2 years ago

The definitive guide to RAG in production! 🙏

@GokuMohandas walks us through implementing RAG from scratch, building a scalable app

It now has updated discussion on embedding fine-tuning, re-ranking and effectively routing requests

I think this is easily the most complete discussion on building RAG Apps

A perfect weekend read to code along:

https://t.co/Io8lHvip6U

555

740

110K

MadeWithML retweeted

Robert Nishihara

@robertnishihara

over 2 years ago

We updated our production RAG application guide with a number of new sections:

☑️ When to fine-tune embeddings
☑️ When to augment vector-based retrieval with traditional lexical search
☑️ When to rerank retrieved context
☑️ How to update & reindex as data changes

Importantly, building a RAG application is not a one-time effort. Data constantly evolves (for example, if you build a Q&A chatbot for your documentation, you have to remember that your documentation and APIs are changing underneath you). As a consequence, you have to make sure that data is reindexed and that the application continues to work. We describe our process in the guide.

MadeWithML retweeted

Goku Mohandas @GokuMohandas

over 2 years ago

Added some new components (fine-tuning embeddings, lexical search, reranking, etc.) to our production guide for building RAG-based LLM applications. Combination of these yielded significant retrieval and quality score boosts (evals included).

Blog: https://t.co/6LUe8Z6DMm https://t.co/zrDwkBmYZ4

205

129

38K

MadeWithML retweeted

Adithyan @adithyan_ai

over 2 years ago

I burned 🔥 $2000 in finetuning so you don't have to.
I fine-tuned models with @OpenAI and @anyscalecompute API endpoints with 50 million tokens. Here are the results I wish I knew before getting into finetuning.

If you just want a quick snapshot, look at the figure. A longer explanation follows, explaining my findings.

I am not an expert and not deep into theory of AI models. I just want to get the BEST model performance at the CHEAPEST possible price for my USE-CASE. And quickly deploy that to prod.

I picked one specific simple USE-CASE. Summarizing text in a very specific tone, voice and a very specific structure.

Trained both models with close to 50M tokens (~37M words). In short,

- Anyscale costs 40X cheaper to finetune.
- Anyscale costs 56x cheaper to finetune.

Comparing the outputs, I get on par performance from llama-13b-fine-tuned as gpt-3.5-fine-tuned. Finetuning smaller models is clearly the way to go for simpler use-cases!

I don't understand OpenAI's offering for fine-tuning here. They need to step up their game. They need to either reduce the price or offer flexibility to compete with open-source fine-tuning models.

I am going to run another experiment which is a way more complicated use-case. It would be interesting to see who wins here. I suspect @OpenAI Turbo will have an edge here (otherwise the pricing does not make sense).

P.S.: I also know I can finetune models locally & directly without API. Like I said, I am not deep into theory yet. I tried this in @huggingface with their auto-train framework. But it was just not as easy as plugging in via API calls. There were adapters and stuff, and I got quickly lost. But I am reading up and will try to start including them in the comparisons too. If anyone is aware of other managed/otherwise solutions for finetuning let me know please.

670

857

373K

MadeWithML retweeted

AI at Meta

@AIatMeta

almost 3 years ago

Anyscale Endpoints enables AI application developers to easily swap closed models for the Llama 2 models — or use open models along with closed models in the same application.

168

59K

MadeWithML retweeted

ray

@raydistributed

almost 3 years ago

The team @MetaAI has done a tremendous amount to move the field forward with the Llama models. We're thrilled to collaborate to help grow the Llama ecosystem. https://t.co/DyOoUZSoJ7

78K

MadeWithML retweeted

Turing Post

@TheTuringPost

almost 3 years ago

3 free MLOps courses you should know about:

▪️ MLOps Course, @GokuMohandas
▪️ CS 329S: Machine Learning Systems Design @Stanford
▪️ MLOps Zoomcamp @Al_Grigor

🧵 https://t.co/1wYgE6CHJt

278

368

45K

MadeWithML retweeted

LlamaIndex 🦙

@llama_index

almost 3 years ago

New LLM integration 🔥: @anyscalecompute endpoints allows any developer to easily run + finetune open-source LLMs through an API.

Best of all you get the full power of Ray Serve/Train for scalable/efficient training and inference ⚡️

Big s/o to kylehh: https://t.co/5NK1zy35T3 https://t.co/uJkzGgI5wX

MadeWithML retweeted

Niantic Engineering @NianticEng

almost 3 years ago

Later this month, Niantic will present at Ray Summit 23 and our own @dreamingleo89 wrote about how we are using Ray to improve multiple aspects of our scanning and mapping infrastructures, and we're just getting started. https://t.co/yd8ZQga6Je

11K

MadeWithML retweeted

kourosh hakhamaneshi

@CyrusHakha

almost 3 years ago

🤔 Fine-tuning LLMs: LoRA or Full-Parameter? Which should you choose? Uncover the insights in our latest technical blog. 🔗 Link: https://t.co/3pvQ9TAksF 🧵 Thread (1/N) 👇

205

131

35K

MadeWithML retweeted

Sanyam Bhutani

@bhutanisanyam1

almost 3 years ago

High signal ML for developers guide! 🙏

Building Machine Learning Applications in real world involves a lot of moving parts and ideas. This series covers all of them really well

Made with ML by @GokuMohandas is the best no-nonsense collection of guides with every module covering important aspects of ML life cycle

For anyone looking for crispy tutorials on best practises, start here:

https://t.co/mIN3E7V0Uc

119

12K

MadeWithML retweeted

Anyscale

@anyscalecompute

almost 3 years ago

Save cloud costs while keeping quality high with your open source LLM -> Llama 2 is about as factually accurate as GPT-4 for summaries and is 30 times cheaper https://t.co/b3C7gkOajx via @anyscalecompute #ML #AI #ArtificialIntelligence #LLM

MadeWithML retweeted

Goku Mohandas @GokuMohandas

almost 3 years ago

A very comprehensive case study on fine-tuning Llama-2 across three different tasks👇

- code for distributed fine-tuning w/ @raydistributed + @huggingface Accelerate + @MSFTDeepSpeed
- data prep + eval + baselines
- when to & not to fine-tune
- using perplexity for checkpointing
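The last item, using perplexity for checkpointing, can be illustrated with a small sketch. The case study's own code is not shown in this post, so everything below is an assumption: the function names are hypothetical, and the per-token natural-log probabilities are assumed to come from running each checkpoint on a held-out evaluation set. Perplexity is the exponential of the mean negative log-likelihood, and the checkpoint with the lowest value is kept.

```python
import math
from typing import Dict, List


def perplexity(token_log_probs: List[float]) -> float:
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def best_checkpoint(eval_log_probs: Dict[str, List[float]]) -> str:
    """Return the checkpoint name with the lowest held-out perplexity.

    `eval_log_probs` maps a hypothetical checkpoint name to the log-probs
    that checkpoint assigned to the same evaluation tokens.
    """
    return min(eval_log_probs, key=lambda name: perplexity(eval_log_probs[name]))


if __name__ == "__main__":
    # Made-up numbers, only to show the selection step.
    runs = {
        "step-100": [math.log(0.25)] * 3,
        "step-200": [math.log(0.5)] * 3,
    }
    print(best_checkpoint(runs))
```

Lower perplexity means the checkpoint assigns higher probability to the held-out text, which is why it can serve as a checkpoint selection signal when no task-specific metric is available.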

21K
