Very excited to share the release of BioClinical ModernBERT!
Highlights:
- biggest and most diverse biomedical and clinical dataset for an encoder
- 8192 context
- fastest throughput with a variety of inputs
- sota results across several tasks
- base and large sizes
(1/8)
We are setting out to develop some new recommendations (TRIPOD-CODE) to provide guidance on reporting the availability & structure of code for predictive AI healthcare tools
Watch this space & read the protocol here
https://t.co/XKoXyBYzdY
#transparency #code #reproducibility
Another interesting finding was that simple fine-tuning allowed these small models to consistently return parsable JSON outputs. This may be worth exploring if you plan to use small LLMs for a structured output generation task
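One way to put a number on "consistently parsable" is to count how many raw model outputs load as a JSON object with the fields you expect. The snippet below is a minimal sketch, not the evaluation used in the paper: `parse_rate`, the sample strings, and the `hpi` key are all made up for illustration.

```python
import json


def parse_rate(outputs, required_keys=()):
    """Fraction of raw outputs that parse as a JSON object containing
    every key in required_keys. Returns 0.0 for an empty list."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            # Not valid JSON (or not a string at all): count as a failure.
            continue
        if isinstance(obj, dict) and all(k in obj for k in required_keys):
            ok += 1
    return ok / len(outputs)


# Hypothetical model outputs: one clean object, one with chatty preamble,
# one truncated mid-object.
samples = [
    '{"hpi": "3 days of cough", "plan": "chest x-ray"}',
    'Sure! Here is the JSON: {"hpi": "..."}',
    '{"hpi": "fever"',
]
rate = parse_rate(samples, required_keys=("hpi",))
```

Comparing this rate before and after fine-tuning on the same prompts gives a quick check of whether the structured-output behavior actually improved.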
Our Medslice paper was just accepted at @JAMIAOpen!
We provide a pipeline to extract clinically relevant sections of medical notes (HPI, Interval Hx, Assessment and Plan) using fine-tuned language models.
These small open-sourced LLMs can run on laptops (and even good smartphones), meaning that any institution can run them securely behind their firewall. This is significant since HIPAA-compliant LLM access is still rare for medical institutions.
@MaziyarPanahi Yes, we tested our model on sequence and token classification for both biomedical and clinical datasets. You can find the results and our analysis in our preprint: https://t.co/g6YA9R3bqU
Want to continue training an encoder on your own data, but not sure where to start?
Our step-by-step guide for reproducing the BioClinical ModernBERT training was just released!
1/5
If you would like to see more details about a certain aspect of the guide, please don't hesitate to reach out! Your contributions are welcome and will be acknowledged.
Link to our HF collection: https://t.co/6zXS33GrY9
Link to our paper: https://t.co/UzEtzltiRr
5/5
If you are working with a lot of biomedical and/or clinical text, consider continuing MLM training of BioClinical ModernBERT on your own data!
The resulting encoder will be much easier to fine-tune on your various downstream tasks (embedding model for RAG, classifier...)
4/5
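For readers new to MLM training, here is a rough sketch of the masking step it relies on: some tokens are hidden and the model is trained to recover them. This is not the BioClinical ModernBERT training code. The 15% rate and 80/10/10 split are the original BERT defaults, used here as assumptions, and `mask_tokens`, `mask_id`, and the toy token ids are hypothetical. In practice a library collator (for example `DataCollatorForLanguageModeling` in Hugging Face `transformers`) handles this for you.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    Each position is selected with probability mlm_probability. Selected
    positions keep the original token as their label; the input is replaced
    by mask_id (80%), a random token id (10%), or left unchanged (10%).
    Unselected positions get IGNORE_INDEX as their label.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise keep the original token in the input
    return inputs, labels


# Toy example with made-up token ids and a made-up mask id.
ids = list(range(10, 60))
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=100,
                             mlm_probability=0.3, rng=random.Random(0))
```

The loss is computed only where `labels` differs from `IGNORE_INDEX`, so the model is scored only on the positions it had to reconstruct.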
🧬🔬⚕️ Building on the popularity of our PubMedBERT Embeddings model, we're excited to release a long context medical embeddings model!
It's built on the great work below from @tsounack
Model: https://t.co/AFF9CKa8Tb
Paper: https://t.co/JJH6Tx30GJ
https://t.co/pSXJg2nBBa
Exciting to see BioClinical ModernBERT (base) ranked #2 among trending fill-mask models - right after BERT!
The large version is currently at #4.
Grateful for the interest, and can't wait to see what projects people apply it to!
BioClinical ModernBERT GitHub repo is online! It contains:
- Our continued pretraining config files
- Performance eval code
- Inference speed eval code
Step-by-step guide on how to continue ModernBERT or BioClinical ModernBERT pretraining coming in the next few days!