Roger Waleffe @RWaleffe - Twitter Profile

RWaleffe retweeted

Bryan Catanzaro

@ctnzr

6 months ago

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months. https://t.co/v7MKIy7Oe4

RWaleffe retweeted

Bryan Catanzaro

@ctnzr

10 months ago

Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate. Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus. Links to the models, datasets, and tech report are here: https://t.co/NqSYULoiW3

RWaleffe retweeted

Bryan Catanzaro

@ctnzr

about 1 year ago

Nemotron-H: A family of Hybrid Mamba-Transformer LLMs.
* Hybrid architecture means up to 3X faster at the same accuracy
* Trained in FP8
* Great for VLMs
* Weights and instruct versions to come soon.

https://t.co/h3dLuDuiUz https://t.co/PbuF7UbV7j

RWaleffe retweeted

Bryan Catanzaro

@ctnzr

almost 2 years ago

An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
* 7% attention, the rest is Mamba2
* MMLU jumps from 50 to 53.6%
* Training efficiency is the same
* Inference cost is much less

https://t.co/x62otbC5uN https://t.co/bBfFYEt0a0

Roger Waleffe @RWaleffe

almost 2 years ago

@WesleyYue @ctnzr The authors of the RULER benchmark observed something similar for some Transformers they tested.

Roger Waleffe @RWaleffe

almost 2 years ago

@WesleyYue @ctnzr See our discussion here:

RWaleffe retweeted

Theo Rekatsinas @thodrek

about 2 years ago

Data pruning to reduce pretraining costs is hot, but fancy pruning can take just as long to select data as to train on all of it! Patrik, @Rwaleffe, and @vmageirakos's work at #ICLR2024 tomorrow shows how a simple, low-cost tweak to random sampling outperforms trendy methods!

Roger Waleffe @RWaleffe

almost 3 years ago

@DisseminatePod Thanks for having me on the podcast Jack!

RWaleffe retweeted

Disseminate: The Computer Science Research Podcast @DisseminatePod

almost 3 years ago

🚨 "MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks" with @RWaleffe is available now! 🎧 Listen on Spotify ➡️ https://t.co/PgTrDblzJx ☕️ Support the podcast ➡️ https://t.co/tVHEIk5EgN

Roger Waleffe @RWaleffe

about 3 years ago

@rishiyer Thanks for sharing this!! Continuous random exploration had also been part of our motivation.

Roger Waleffe @RWaleffe

about 3 years ago

Not convinced about using random sampling for data pruning? Think twice! In our recent work, we introduce Repeated Sampling of Random Subsets: https://t.co/jk2dWHpocl, where we sample a subset of data at each epoch of training instead of only once at the beginning!
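
The idea in the tweet above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `repeated_random_subsets` and `static_random_subset` and their parameters are hypothetical, and each epoch's subset is assumed to be drawn independently of the previous ones.

```python
import random


def repeated_random_subsets(num_examples, subset_size, num_epochs, seed=0):
    """Yield a freshly sampled random subset of example indices for each epoch.

    Hypothetical helper: indices within one epoch are unique, and each
    epoch's subset is drawn independently of earlier epochs.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        # Resample at every epoch instead of fixing one subset up front.
        yield rng.sample(range(num_examples), subset_size)


def static_random_subset(num_examples, subset_size, num_epochs, seed=0):
    """Baseline for comparison: sample once and reuse that subset every epoch."""
    rng = random.Random(seed)
    subset = rng.sample(range(num_examples), subset_size)
    for _ in range(num_epochs):
        yield list(subset)


def coverage(subsets):
    """Count how many distinct examples were seen across all epochs."""
    seen = set()
    for subset in subsets:
        seen.update(subset)
    return len(seen)


if __name__ == "__main__":
    n, k, epochs = 1000, 100, 10
    print("static coverage:  ", coverage(static_random_subset(n, k, epochs)))
    print("repeated coverage:", coverage(repeated_random_subsets(n, k, epochs)))
```

Both variants train on the same number of examples per epoch; the difference is that resampling lets the model see more of the dataset over the course of training, whereas a one-time subset never sees the discarded examples.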

Roger Waleffe @RWaleffe

about 3 years ago

@BlackHC Regardless of which ‘viewpoint’ one chooses to look at our method with, this algorithm had yet to be studied extensively (empirically and theoretically).

Roger Waleffe @RWaleffe

about 3 years ago

@BlackHC If the sampling of S’ across rounds is done without replacement (instead of with replacement), then our method can also be seen as training on the full dataset but with early stopping after a few epochs (discussed in the paper). This version is particularly useful for analysis.

Roger Waleffe @RWaleffe

about 3 years ago

Joint work with Patrik Okanovic @vmageirakos Kostis Nikolakakis @aminkarbasi @DKalogerias @nmervegurel @thodrek

Roger Waleffe @RWaleffe

about 3 years ago

See the preprint here: https://t.co/kT897KdZgz for extensive evaluations together with the convergence analysis and discussion on its generalization.

RWaleffe retweeted

PyKEEN @keenuniverse

over 3 years ago

Marius, another amazing KGE (and more) library is now auto-formatting its code with black as of https://t.co/bCxU19K73t 🚀 @JasonMohoney @RWaleffe nice job :)
