Chemprop v2: An Efficient, Modular Machine Learning Package for Chemical Property Prediction
1. Chemprop v2 is a novel rewrite of the original Chemprop software, significantly enhancing its speed, modularity, and usability. This version now supports multi-GPU training, which allows for faster and larger model training, making it a powerful tool for computational molecular design.
2. The new version maintains the predictive accuracy of its predecessor while improving execution time by about a factor of two and reducing memory usage by a factor of three. These enhancements make Chemprop v2 more efficient and scalable for real-world applications.
3. Chemprop v2 introduces a modular architecture that makes it easier to extend and integrate into Python-based workflows. The package is now organized into five subpackages: data, featurizers, nn, models, and uncertainty, each serving a specific purpose in the machine learning pipeline.
4. The Command Line Interface (CLI) has been improved with more logging statements for better debugging and support for various data splitting methods through the astartes package. Hyperparameter optimization is now handled by Ray Tune, which samples more effective parameter combinations.
5. Chemprop v2 includes extensive Jupyter notebook tutorials and comprehensive documentation, making it accessible to both beginners and advanced users. The tutorials cover a range of topics, including model training, inference, hyperparameter optimization, and model interpretation.
6. The performance of Chemprop v2 has been benchmarked against the original version, showing comparable results in property prediction tasks. This confirms the correctness of the implementation and assures users of its reliability.
7. The package now supports uncertainty estimation, Shapley value analysis, and other advanced features, providing deeper insights into model predictions and enabling more informed decision-making in chemical property prediction.
💻Code: https://t.co/JaycFeej15
📜Paper: https://t.co/FvKESXDuxd
#Chemprop #MachineLearning #ChemicalPropertyPrediction #DeepLearning #ComputationalChemistry #Python #MultiGPU #OpenSource
It's great to be back in Boston for the #AIChE#AIChE2025#AIChEAnnual conference this week! If you’ll be there, I’d love to connect. I'll be presenting on Monday and Thursday:
🎯 Talk: Data-Driven Strategy Selection for Multi-Fidelity Modeling in Autonomous Molecular Discovery
🗓️ Date: Monday, November 3, 2025
⏰ Time: 9:15 AM–9:30 AM
📍 Location: Hynes Convention Center, Room 210
🧪 Session: Automated Molecular and Materials Discovery: Integrating Machine Learning, Simulation, and Experiment
🎯 Talk: Toward Accurate Prediction of Near-Infrared Absorption: Physics-Based Calculations, Machine Learning, and the Crucial Role of Data
🗓️ Date: Thursday, November 6, 2025
⏰ Time: 1:42 PM–1:54 PM
📍 Location: Hynes Convention Center, Room 210
🧠 Session: Machine Learning for Soft and Hard Materials II: Hard Matter and Methods
Looking forward to connecting with people at #ACSFall2025! I'll be giving a talk on Wednesday morning:
"AI-driven chemistry research and undergraduate skill development with Chemprop"
Session: Harnessing the Power of AI in Undergraduate Research and Training
Day/Time: Wednesday (8/20) 8:00 AM - 8:20 AM
Location: Room 140A - Walter E. Washington Convention Center
Thanks to @BonnieHallGV , @sudeep_chemcomp, and Joe Reczek for organizing and for the invitation!
Benchmarking uncertainty quantification for protein engineering @PLOSCompBiol
1. Machine learning (ML) models have accelerated protein engineering, but estimating model uncertainty remains critical for optimizing biological properties and guiding experimental designs effectively.
2. The study benchmarks multiple uncertainty quantification (UQ) methods, including Bayesian ridge regression, Gaussian processes, CNN ensembles, dropout methods, and evidential learning, using diverse protein datasets.
3. Diverse datasets from the FLIP benchmark, such as GB1, AAV, and Meltome landscapes, were used to evaluate these methods under varying conditions, including distributional shifts and different protein representations.
4. Key findings show no single UQ method excels universally across all scenarios. Gaussian processes often yield better calibration, while CNN ensembles provide high accuracy but may lack calibration.
5. UQ estimates influence active learning (AL) and Bayesian optimization (BO) efficiency. However, uncertainty-informed sampling doesn't always outperform simpler greedy strategies, especially in early AL stages.
6. Pretrained protein embeddings (ESM-1b) outperform one-hot encodings in high domain-shift scenarios but exhibit mixed results under low domain-shift conditions.
7. Results highlight the need for further advancements in UQ methods and sampling strategies to ensure robust applications in protein design, especially in cases involving complex data distributions.
8. This work provides actionable recommendations for using UQ in protein engineering, emphasizing task-specific evaluation and potential integration of diverse ML architectures.
@kevinpgreenman@avapamini
💻Code: https://t.co/A8SLh2SHJ8
📜Paper: https://t.co/pF2CgkFYl9
#ProteinEngineering #MachineLearning #Bioinformatics #UncertaintyQuantification #ActiveLearning
We compared the calibration of various machine learning uncertainty estimation methods for protein engineering.
No method excels across all scenarios, and uncertainty-based strategies for optimization often did not outperform methods without uncertainty.
📢New preprint out! We constrain the molecular generation space to follow the "symmetry" of patented molecules that are likely to be synthesizable. Achieved with "symmetry-aware" fragment decomposition, and a constrained Monte Carlo Tree Search generator. https://t.co/NWidW2Wx9y
Zero-shot extrapolation for out-of-distribution (OOD) chemical property prediction is an important step towards high-performance materials discovery. Check out our spotlight at the #NeurIPS AI for Accelerated Materials Design Workshop! https://t.co/wHxezk4zD7
We are delighted to host our first class of ACT-CMS (Accelerating Curricular Transformation in the Computational Molecular Sciences) Faculty Fellows for their workshop this week! They are developing learning activities to integrate programming & computation into their courses.😀
What does the development of artificial intelligence say about the human soul? What leads scientific-minded people to convert to Christianity? The 2024 convention of the Society of Catholic Scientists will tackle both these questions and many more. https://t.co/pXEMmtpo6C
🎉 We're thrilled to announce our 1st class of MolSSI Faculty Fellows, who will collaborate with us over the next 2 years to create cutting-edge curricular resources, integrating programming, data, & computational education into their courses. Welcome! 👏
https://t.co/9MiPyb7OkW
Congrats to Dr. Kevin Greenman on defending his Ph.D. thesis at MIT! His groundbreaking work in chemical engineering uses AI to optimize experiments, aiding novel molecule discovery for displays and biomedical imaging.
He continues his research CatholicTech 🚀📚
Meet Professor Kevin Greenman: Final-year Ph.D. candidate in chemical engineering and computation at MIT, NSF graduate research fellow. Watch the clip to learn more!
#ChemicalEngineering#CatholicScientists#MIT#Innovation
Scripts to reproduce this study are available on GitHub (https://t.co/hhmyc7sh3d) and archived on Zenodo. Chemprop is available on GitHub (https://t.co/cSYHBrUPUd) and can be installed via PyPI or Conda. Documentation is available online, including a tutorial workshop on YouTube.
Check out our new paper on Chemprop in @JCIM_JCTC! We describe and benchmark many additional features we've added to the package since its initial release and paper in 2019. https://t.co/w1zf2rvwzK