Charles @csris - Twitter Profile

csris retweeted

about 2 years ago

🚀Excited to be recognized for a second year by @FortuneMagazine in their Top 50 AI Startups list! We have come so far in the past year and a huge thank you to the now over 60,000 developers building on the Together API. Thank you!

2

20

7

2

9K

csris retweeted

Together AI @togethercompute

over 2 years ago

Great comparison of options for Llama-2 inference, concluding by saying, "Overall, we found Together did best overall across cost, throughput and accuracy followed closely by MosaicML." This is only going to get better. Stay tuned. https://t.co/UEY73m5QP3

togethercompute's tweet photo. Great comparison of options for Llama-2 inference, concluding by saying, "Overall, we found Together did best overall across cost, throughput and accuracy followed closely by MosaicML."

This is only going to get better. Stay tuned.

https://t.co/UEY73m5QP3 https://t.co/CmuTiON0sw

0

17

10

4

2K

csris retweeted

Dan Fu

@realDanFu

over 2 years ago

Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023! Let's dive in to what's new with the paper and the new goodies from this release: Monarch matrices are an expressive and hardware-efficient set of matrices that generalize the FFT -- and can be used to represent all sorts of fun linear transforms, from Hadamard transforms to Toeplitz matrices and more. Monarch mixer (M2) uses Monarch matrices to mix information both along the sequence (replacing attention) and along the model dimension. M2 replaces attention in Transformers with gated convolutions, and replace the linear layers in MLPs with sparse block-diagonal matrices. The result are architectures that scale sub-quadratically in both sequence length and model dimension! Back in July, we released a short blog post (https://t.co/j06MHahncb) with @togethercompute about using Monarch matrices to train some more efficient BERT models -- matching BERT-base in quality with 27% fewer parameters, and with long-context inference throughput. With this release, we're excited to announce two new M2-BERT-large models -- the 260M version matches BERT-large in downstream GLUE score with 24% fewer parameters (and also has much faster long-context throughput). Our paper also has a whole set of theoretical goodness that we didn't get to in our blog post. For causal language modeling -- e.g. GPT-style or decoder-only language modeling -- we need to parameterize the Monarch matrices to make sure that the sequence mixing is causal. This ensures that you can train with next token prediction, GPT-style. We use a mix of polynomial theory to interpret Monarch matrices as bivariate polynomial evaluation, and then causality is just a matter of keeping the degrees in check. (If you're familiar with the FFT convolution theorem, this is equivalent to the padding trick to turn the circular convolution into a causal convolution). Using this theory, we can train M2-GPT models -- fully sub-quadratic in the sequence length. In a weird twist, we found that we can get rid of the MLP layers entirely, and still match GPT performance... wild! Check out our paper, code, and blog post for more details: Paper: https://t.co/v7xbMQpjF3 Code: https://t.co/aYtUMkcOTT Blog: https://t.co/j06MHahncb With @simran_s_arora, @Jessica_Grogan_, Isys Johnson, @EyubogluSabri, @ai_with_brains, @bfspector, @MichaelPoli6, Atri Rudra, and @HazyResearch Building on a lot of great work from great folks, including @tri_dao @_albertgu @davidwromero @srush_nlp @BeidiChen @exnx @BlinkDL_AI @MaxMa1987 @ramin_m_h and many many more! And of course, couldn't have done this work without support from @StanfordHAI @StanfordAILab @StanfordCRFM. In collaboration with @togethercompute. Check out our paper for more, and please reach out if you have ideas about usage or questions! https://t.co/v7xbMQpjF3 And look forward to more soon ;)

realDanFu's tweet photo. Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023!

Let's dive in to what's new with the paper and the new goodies from this release:
Monarch matrices are an expressive and hardware-efficient set of matrices that generalize the FFT -- and can be used to represent all sorts of fun linear transforms, from Hadamard transforms to Toeplitz matrices and more.

Monarch mixer (M2) uses Monarch matrices to mix information both along the sequence (replacing attention) and along the model dimension.

M2 replaces attention in Transformers with gated convolutions, and replace the linear layers in MLPs with sparse block-diagonal matrices. The result are architectures that scale sub-quadratically in both sequence length and model dimension!

Back in July, we released a short blog post (https://t.co/j06MHahncb) with @togethercompute about using Monarch matrices to train some more efficient BERT models -- matching BERT-base in quality with 27% fewer parameters, and with long-context inference throughput.
With this release, we're excited to announce two new M2-BERT-large models -- the 260M version matches BERT-large in downstream GLUE score with 24% fewer parameters (and also has much faster long-context throughput).

Our paper also has a whole set of theoretical goodness that we didn't get to in our blog post. For causal language modeling -- e.g. GPT-style or decoder-only language modeling -- we need to parameterize the Monarch matrices to make sure that the sequence mixing is causal. This ensures that you can train with next token prediction, GPT-style.
We use a mix of polynomial theory to interpret Monarch matrices as bivariate polynomial evaluation, and then causality is just a matter of keeping the degrees in check. (If you're familiar with the FFT convolution theorem, this is equivalent to the padding trick to turn the circular convolution into a causal convolution).
Using this theory, we can train M2-GPT models -- fully sub-quadratic in the sequence length. In a weird twist, we found that we can get rid of the MLP layers entirely, and still match GPT performance... wild!

Check out our paper, code, and blog post for more details:
Paper: https://t.co/v7xbMQpjF3
Code: https://t.co/aYtUMkcOTT
Blog: https://t.co/j06MHahncb

With @simran_s_arora, @Jessica_Grogan_, Isys Johnson, @EyubogluSabri, @ai_with_brains, @bfspector, @MichaelPoli6, Atri Rudra, and @HazyResearch

Building on a lot of great work from great folks, including @tri_dao @_albertgu @davidwromero @srush_nlp @BeidiChen @exnx @BlinkDL_AI @MaxMa1987 @ramin_m_h and many many more!

And of course, couldn't have done this work without support from @StanfordHAI @StanfordAILab @StanfordCRFM. In collaboration with @togethercompute.

Check out our paper for more, and please reach out if you have ideas about usage or questions! https://t.co/v7xbMQpjF3

And look forward to more soon ;)

5

277

59

150

81K

Charles @csris

almost 3 years ago

@janekm @vipulved @togethercompute When `stream_tokens` is true, the response is a stream of Server-Sent Events where partial result events are JSON objects and the final event is the string "[DONE]". I'll try to post some sample code (in Python) later today. More details: https://t.co/qMueyU4Cgq

0

1

0

44

Who to follow

Jordan Hwang

@jordanhwang8

Mastering the art of the deadpan

Jack Chou

@smallchou

🤷🏻‍♂️. Past lives: startup founder, CPO @Affirm, Head of Product @Pinterest, early @LinkedIn.

csris retweeted

about 3 years ago

Today we are excited to introduce and open-source BLOOMChat a multilingual chat LLM. Built on top of the BLOOM model (@BigscienceW), we further train the model on conversational data from @togethercompute @databricks @laion_ai @huggingface. Some interesting observations: (1/6)

SambaNovaAI's tweet photo. Today we are excited to introduce and open-source BLOOMChat a multilingual chat LLM. Built on top of the BLOOM model (@BigscienceW), we further train the model on conversational data from @togethercompute @databricks @laion_ai @huggingface. Some interesting observations: (1/6) https://t.co/hquSX0y8dY

4

288

77

87

112K

csris retweeted

Together AI @togethercompute

about 3 years ago

The first RedPajama models are here! The 3B and 7B models are now available under Apache 2.0 license, including instruction-tuned and chat versions! This project demonstrates the power of the open-source AI community with many contributors ... 🧵 https://t.co/msO4afBQEK

togethercompute's tweet photo. The first RedPajama models are here! The 3B and 7B models are now available under Apache 2.0 license, including instruction-tuned and chat versions!
This project demonstrates the power of the open-source AI community with many contributors ... 🧵 https://t.co/msO4afBQEK https://t.co/ekLeidmd0q

16

836

212

377

518K

csris retweeted

Together AI @togethercompute

about 3 years ago

In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens it is exciting to see the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B. In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B!

togethercompute's tweet photo. In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens it is exciting to see the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B.
In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B! https://t.co/VhdkTRPWWb

12

501

77

142

336K

csris retweeted

Together AI @togethercompute

about 3 years ago

Training our first RedPajama 7B model is going well! Less than half way through training (after 440 billion tokens) the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile. Details at https://t.co/2lvceBsJlb