Thomas Sounack @tsounack - Twitter Profile

Pinned Tweet

12 months ago

Very excited to share the release of BioClinical ModernBERT! Highlights:
- biggest and most diverse biomedical and clinical dataset for an encoder
- 8192 context
- fastest throughput with a variety of inputs
- sota results across several tasks
- base and large sizes

(1/8)

tsounack retweeted

TRIPODStatement @TRIPODStatement

4 months ago

We are setting out to develop some new recommendations (TRIPOD-CODE) to provide guidance on reporting the availability & structure of code for predictive AI healthcare tools

Watch this space & read the protocol here

https://t.co/XKoXyBYzdY

#transparency #code #reproducibility https://t.co/hBQGC4tavf

Thomas Sounack @tsounack

5 months ago

Another interesting finding was that simple fine-tuning allowed these small models to consistently return parsable JSON outputs. This may be worth exploring if you plan to use small LLMs for a structured output generation task
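
One way to check this on your own model is to count how many generations parse cleanly. The snippet below is a minimal sketch, not the Medslice code: `parse_rate` and the sample strings are hypothetical, and it only tests syntactic validity with Python's standard `json` module.

```python
import json


def parse_rate(outputs):
    """Return the fraction of strings that json.loads accepts (hypothetical helper)."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            # Truncated or chatty generations end up here.
            pass
    return ok / len(outputs)


# Made-up model outputs, for illustration only.
samples = [
    '{"hpi": "Patient reports chest pain."}',
    '{"hpi": "Truncated output',
    'Sure! Here is the JSON: {"hpi": "..."}',
    '{"assessment_and_plan": "Continue aspirin."}',
]
rate = parse_rate(samples)
print(f"parsable: {rate:.0%}")
```

A stricter check would also confirm that the parsed object contains the keys the task expects, since `json.loads` accepts any valid JSON value.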

Thomas Sounack @tsounack

5 months ago

Our Medslice paper was just accepted at @JAMIAOpen! We provide a pipeline to extract clinically relevant sections of medical notes (HPI, Interval Hx, Assessment and Plan) using fine-tuned language models.

Thomas Sounack @tsounack

5 months ago

These small open-sourced LLMs can run on laptops (and even good smartphones), meaning that any institution can run them securely behind their firewall. This is significant since HIPAA-compliant LLM access is still rare for medical institutions.

Thomas Sounack @tsounack

9 months ago

@MaziyarPanahi Yes, we tested our model on sequence and token classification for both biomedical and clinical datasets. You can find the results and our analysis in our preprint: https://t.co/g6YA9R3bqU

Thomas Sounack @tsounack

9 months ago

Want to continue training an encoder on your own data, but not sure where to start? Our step-by-step guide for reproducing the BioClinical ModernBERT training was just released! 1/5

Thomas Sounack @tsounack

9 months ago

If you would like to see more details about a certain aspect of the guide, please don't hesitate to reach out! Your contributions are welcome and will be acknowledged.

Link to our HF collection: https://t.co/6zXS33GrY9
Link to our paper: https://t.co/UzEtzltiRr

5/5

Thomas Sounack @tsounack

9 months ago

If you are working with a lot of biomedical and/or clinical text, consider continuing MLM training of BioClinical ModernBERT on your own data! The resulting encoder will be much easier to fine-tune on your various downstream tasks (embedding model for RAG, classifier...) 4/5
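
For readers new to MLM, the core idea is that a fraction of the input tokens is hidden and the encoder is trained to recover them. The snippet below is a simplified sketch of that corruption step only, not the recipe from the guide: `mask_tokens`, the whitespace tokenization, and the example sentence are assumptions made for illustration. Real training would use the model's tokenizer and a library data collator, and many implementations also substitute random tokens or leave some selected tokens unchanged.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return (inputs, labels).

    labels holds the original token at masked positions and None elsewhere,
    so only masked positions would contribute to a training loss.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            inputs.append(MASK_TOKEN)
            labels.append(tok)  # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Hypothetical clinical sentence, split on whitespace for simplicity.
tokens = "patient denies chest pain or shortness of breath".split()
inputs, labels = mask_tokens(tokens, mask_rate=0.3)
print(" ".join(inputs))
```

The `mask_rate` argument is exposed so that different masking rates can be compared on the same text.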

Thomas Sounack @tsounack

12 months ago

Exciting work from @neumll!

NeuML @neumll

12 months ago

🧬🔬⚕️ Building on the popularity of our PubMedBERT Embeddings model, we're excited to release a long context medical embeddings model! It's built on the great work below from @tsounack Model: https://t.co/AFF9CKa8Tb Paper: https://t.co/JJH6Tx30GJ https://t.co/pSXJg2nBBa

Thomas Sounack @tsounack

12 months ago

Exciting to see BioClinical ModernBERT (base) ranked #2 among trending fill-mask models - right after BERT!

The large version is currently at #4.

Grateful for the interest, and can’t wait to see what projects people apply it to! https://t.co/EFZ8Fim4OB

Thomas Sounack @tsounack

12 months ago

GitHub link: https://t.co/Q04t1zBR5Q

Thomas Sounack @tsounack

12 months ago

BioClinical ModernBERT GitHub repo is online! It contains:
- Our continued pretraining config files
- Performance eval code
- Inference speed eval code

Step-by-step guide on how to continue ModernBERT or BioClinical ModernBERT pretraining coming in the next few days!

tsounack retweeted

Mike Dupont @introsp3ctor

12 months ago

https://t.co/xGJeik3UZb https://t.co/2vHAxRfLX2 next demo visualizing BioClinical-ModernBERT-base embeddings on a sphere

tsounack retweeted

𝚐𝔪𝟾𝚡𝚡𝟾 @gm8xx8

12 months ago

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

→ Built on ModernBERT with 8K context, RoPE, and fast unpadded inference

Trained via two-phase continued pretraining:

- Phase 1: 160.5B tokens (PubMed + PMC + 20 diverse clinical datasets)
- Phase 2: 2.8B clinical-only tokens for specialization
- Retains biomedical knowledge, improves clinical performance

Architecture & Training
- Alternating local/global attention
- 50K+ vocab for medical terms
- MLM rate: 30% → 15% w/ WSD scheduler
- Base (150M) and Large (396M) models trained on H100s

Performance (Base)
- ChemProt: 89.9 F1 (↑ vs BioBERT 89.5)
- Phenotype: 58.1 F1 (↑ vs Clinical BERT 25.8)
- DEID NER: 82.7 F1 (↑ vs BioMed-RoBERTa 81.1)
- SOTA on 4/5 tasks
- 71–75k tokens/sec at all sequence lengths
- Outperforms baseline encoders (BioBERT, Clinical BERT, Clinical-BigBird, etc.)
- Outperforms BigBird, Longformer, Clinical-ModernBERT in throughput

𝘚𝘖𝘔𝘌𝘛𝘐𝘔𝘌𝘚 𝘠𝘖𝘜 𝘑𝘜𝘚𝘛 𝘕𝘌𝘌𝘋 𝘈 𝘉𝘌𝘙𝘛

PAPER: https://t.co/KYVFyXUDOC

Thomas Sounack @tsounack

12 months ago

@robot__fan @antoine_chaffin Should be up now!

Thomas Sounack @tsounack

12 months ago

@SunJacques_ Thanks Jacques! Was cleaning up the repo, it should be accessible now.

tsounack retweeted

Joseph Pollack #Ï 🎗️ @josephpollack

12 months ago

we are so back "Mitochondria is the powerhouse of the [MASK]."
