Can LLMs predict the effects of all potential missense variants in the human genome?
Predicting the effects of genetic variants on human proteins can be quite challenging.
Existing methods struggle to accurately distinguish between harmful and benign variants, especially when it comes to missense variants that substitute one amino acid for another.
Here, the authors explored two approaches: experimental methods like deep mutational scans (DMS), and computational methods like unsupervised homology-based techniques and protein language models (PLM).
While DMS experiments can capture molecular and cellular phenotypes, they are hard to scale and are imperfect proxies for clinical outcomes.
Alternatively, computational methods leverage protein properties and evolutionary constraints, but most are trained on labeled data, limiting their coverage.
One such computational approach is EVE, an unsupervised deep-learning method based on generative variational autoencoders, but its predictions are constrained to well-aligned proteins.
This study focused on ESM1b, a neural-network-based protein language model trained on millions of protein sequences. ESM1b's advantage lies in its ability to predict variant effects without relying on explicit homology, covering a broader range of variants.
The researchers developed a workflow to use ESM1b to predict the effects of all possible missense variants in known human proteins. They evaluated their approach on various benchmarks and compared it with other variant effect prediction methods.
The results showed that ESM1b outperformed other methods in classifying variant pathogenicity.
The most impressive of these was ESM1b's ability to predict variant effects across different protein isoforms. The authors state that it was able to “distinguish between pathogenic and benign variants [and] yield a true-positive rate of 81% and a true-negative rate of 82%.”
However, ESM1b struggled with variants that led to nonsense-mediated decay (NMD), and the study utilized a sliding window approach for lengthy proteins, which could miss distant interactions. Validation against more experimental data will be crucial before applying ESM1b in real-world scenarios.
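To make the sliding-window limitation concrete, here is a minimal Python sketch of how a long protein could be split into overlapping windows before each window is scored on its own. The window length (1,022 residues) and step (511) are assumptions chosen for illustration, and `sliding_windows` is a hypothetical helper, not the authors' code.

```python
def sliding_windows(sequence, window=1022, step=511):
    """Split a protein sequence into overlapping windows.

    Returns a list of (start, subsequence) tuples, where start is the
    0-based offset of the window. The default window and step values
    are illustrative assumptions.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(sequence) <= window:
        return [(0, sequence)]
    starts = list(range(0, len(sequence) - window, step))
    # Always add a final window that ends flush with the end of the protein
    starts.append(len(sequence) - window)
    return [(s, sequence[s:s + window]) for s in starts]


# Toy 2,500-residue sequence (made up, for illustration only)
protein = "M" + "A" * 2499
for start, chunk in sliding_windows(protein):
    print(start, start + len(chunk))
```

In this toy run, residues 0 and 2,000 never land in the same window, so a model that scores each window separately has no way to see an interaction between them.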
The emergence of large protein language models (LPLMs) like ESM1b offers a promising avenue for predicting variant effects. These models could improve diagnostic accuracy, aid genetic association studies, inform protein engineering, and uncover new insights into protein function.
As LPLMs continue to advance, they hold promise for enhancing our understanding of genetic variants and their impacts on human health.
###
Brandes N, Goldman G, Wang CH et al. 2023. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. DOI: 10.1038/s41588-023-01465-0
🤖 This post was mostly written by ChatGPT. It only seemed right to let the LLM write about LPLMs. 🤖
I fixed all of the weird things it got wrong - like the most important result in the paper. 😬
Nettie Stevens discovered sex chromosomes in 1905. This former school teacher became a genetics pioneer that you need to know!
The late 1800s were a turning point in American history for women. The end of the Civil War marked the start of the women's rights movement and granted them significantly more control over their own lives.
That being said, women were still expected to be teachers or homemakers, but they won more personal freedom, including better access to education.
Nettie Stevens took advantage of these expanded liberties, and started her educational pursuits at age 10 studying to become a school teacher.
In 1883, she began a decade-long career in education, filling roles both as a teacher and as a librarian.
However, this wasn't her life's dream, and in 1896, at the age of 34, she had saved enough money to enroll at Stanford University, where she earned both her bachelor's and master's degrees by 1900.
It was during her summer studies that she took a keen interest in cytology while working at the Stanford Marine Lab. Here she spent her time glued to a microscope and published her first paper on the life-cycle of ciliates in 1901.
Stevens then left Stanford to continue her scientific studies at Bryn Mawr, beginning her dissertation work under the guidance of Thomas Hunt Morgan.
You might be familiar with him.
Morgan received the Nobel Prize in 1933 for his work elucidating the role of chromosomes in heredity.
Interestingly, Morgan was initially skeptical of heredity, particularly as it related to sex determination.
At the time, there were two competing hypotheses. One held that sex was determined by environmental factors, like temperature; the other held that sex was inherited as a trait.
In spite of the lingering questions on this topic, Stevens successfully completed her dissertation in 1903, and received an award from the Carnegie Institution to study how sex is determined.
This work is the subject of today's #FigureFriday wherein Stevens meticulously detailed the cellular structures of the reproductive organs of multiple insects.
The most striking of these are her 1905 drawings of the chromosomes found in mealworms (Tenebrio molitor), where she observed:
‘In both somatic and germ cells of the two sexes there is a difference not in the number of chromatin elements, but in the size of one, which is very small in the male (170-s) and of the same size as the other 19 in the female (207).’
Even though she was the first to make this discovery, Thomas Hunt Morgan and another contemporary, Edmund Beecher Wilson, are often incorrectly given the credit.
Tragically, breast cancer cut her life and scientific career short at the age of 50.
But Nettie Stevens' contributions to our understanding of sex determination, along with her exquisite drawings, were the definition of groundbreaking in the field of genetics.
###
Stevens NM. 1905. Studies in Spermatogenesis with Especial Reference to the Accessory Chromosome. Carnegie Institution.
@GenomicsCow Not very but I'll bet you if someone opened a law firm to specifically find and go after these cases they'd probably do ok. There's a lot of labs out there that care more about billing than using the most updated methods. Most are still using GRCh37 for alignment 🤣
Ethnic stratification and why single reference based analysis methods aren't 'good enough' in 2023.
If you've done genetics for any amount of time you know that stratification of populations is important for getting useful information out of them.
But if you're not a geneticist, it might be a good idea to explain why this is true.
Stratification is a statistical term that basically means "divide your data into subgroups."
In genetics, we usually start with age, gender, lifestyle/exposure, and ethnicity.
The reason for doing this is to be able to determine if subpopulations within a dataset are more or less likely to have whatever it is that you're looking for.
This is usually a measurable trait.
Sometimes this is a common trait like height or eye color, but in healthcare, we're usually talking about disease traits.
So, figuring out if a specific gender, age group, or ethnicity is predisposed to a disease is important, but because diverse populations have mostly been absent from clinical studies, it’s hard to identify important markers of disease in them.
While we know ‘variants’ or ‘mutations’ can contribute to disease, how these contribute to disease can differ vastly depending on someone’s ethnic background.
Variants in one ethnicity may not matter in another ethnicity because mutations elsewhere can compensate in some way for those changes.
So, how you go about determining what is or is not a variant can have a serious impact on the conclusions you draw from a dataset.
As it stands now, most variants are determined by comparing a patient’s genetic sequence to a ‘reference.’
The reference here is the one determined by the human genome project.
This was supposed to represent the sequence of the average healthy human, but we know now that the bulk of this DNA was provided by a mixed-race male.
So ‘variations’ from this reference might not be super accurate if we’re trying to determine the importance of a variant in a different ethnicity.
This bears out in multiple clinical evaluations with a recent survey determining that the number of variants of unknown significance (VUS) was markedly higher in Africans (45%) than Caucasians (32%).
A good chunk of the VUS-ness here has to do with whether the ‘reference’ was appropriate for an African vs a Caucasian, but it also has a lot to do with the fact that most genetic studies have been done in Caucasians, so we already have an idea which ‘variants’ are significant in that population.
Fortunately, we’re seeing progress on multiple fronts here with most associations and government institutions calling for greater diversity in clinical trials.
We also have a human pangenome reference now, which more accurately characterizes the ethnic differences we see in our genomes. It’s not completely done yet, but pangenome-based variant calling pipelines are available.
So the question is: how long will it take to integrate these updates into clinical practice?
@GenomicsCow This is a good point but I think the incentive is not getting sued. As clinical evidence changes you're kind of on the hook as the lab if you miss reporting out something that is now considered pathogenic!
While everyone else was distracted by the structure of DNA, Barbara McClintock was figuring out how chromosomes exchange information, and discovered a little thing called the transposable element.
The 1950's were a turning point in the fields of molecular biology and genetics. For many years, scientists studied genetics but weren't entirely sure how it all worked on the molecular level.
They were aware of chromosomes, knew that they were composed of DNA and protein, but there was still a big question about the functional role of those two components in inheritance.
Fortunately for McClintock and other scientists who studied genetics in model organisms - in her case, maize (corn) - it didn't matter much what the genetic material was but that it could be tracked and its effects seen in the offspring of her crosses.
As an early cytogeneticist (a scientist that studies chromosomes), McClintock developed many of the foundational staining techniques that were required for studying chromosomes in maize and she produced the first genetic maps of all 10 chromosomes.
Her early work focused on studying inheritance, and she was one of the first to observe and describe the phenomenon of homologous recombination, or crossing-over, which is when two chromosomes exchange genetic information. She published this work in 1931 and continued to study maize genetics until her curious observation that pieces of chromosome 9 appeared to move around the genome. But these pieces, Ac (activator) and Ds (inhibits pigment production), didn't just move around. Ac seemed to control Ds, and ultimately the color patterns of the maize kernels!
Today's #FigureFriday is an image of the kernels that McClintock described in, 'The origin and behavior of mutable loci in maize.' Kernel 10 - no Ac, 11-13 - 1 copy of Ac, 14 - 2 copies, 15 - 3 copies.
We know today that 66% of the human genome and 85% of the maize genome is made up of these mobile 'transposable' elements and they are key in our understanding of epigenetics, but back in 1950, this discovery was problematic.
At the time, it was thought that genomes were static and ordered; pieces of them did not move around. McClintock's findings were so poorly received that she stopped formally publishing her results after 1953 because she thought her colleagues just weren't interested.
Fortunately, her work was 'rediscovered,' or maybe more accurately, 'understood,' in the 1970's when similar elements were found in bacteria, flies and humans.
She was awarded the Nobel Prize in 1983 for the discovery of transposable elements, which was 3 decades after her original description of them.
McClintock's story in science isn't dissimilar from that of other prominent women, like Rosalind Franklin; however, McClintock lived long enough to finally be properly recognized for her historic achievements.
###
McClintock, B. 1950. The origin and behavior of mutable loci in maize. PNAS: 36 (6) 344-355. DOI: 10.1073/pnas.36.6.344
High throughput sequencing metrics: Don't be a monster, review them before sending data to the triage team.
One of the most important things to avoid when doing high throughput sequencing is 'bias.'
Properly assessing your post-alignment and variant calling metrics is super important for ensuring that biased data is not used to generate a patient report.
Here are some of my favorite metrics to keep an eye on:
Percent High Quality (HQ) Reads Aligned - The percentage of HQ reads that actually align to the reference genome. Here HQ is defined as Q20 or better and this stat should be greater than 98%.
GC Bias Plot - This is one of my favorites. For short-read data it should look like an upside-down U, with high-AT and high-GC regions showing slight bias (because of amplification); for long-read methods this plot is usually flat. Any major deviations can indicate bias, either from over-amplification or, if this is target capture data, from the capture process.
Insert size - These plots show the size distribution of the sequenced inserts within your library. These should pretty closely mimic the distribution you see in fragment analysis.
Percent Duplication - This is a measure of the number of perfectly duplicated reads in a dataset. The ideal here is less than 5%, and usually if you see problems with read duplication you'll also see issues in the GC bias plot.
Coverage - A measure of the average depth of coverage across the genome or your provided target capture probe set.
Transition/Transversion Ratio (TiTv) - Transitions are A<>G or C<>T (substitutions within the purines or within the pyrimidines) and transversions are A<>C, A<>T, C<>G, and G<>T (conversion of a purine to a pyrimidine, or vice versa). For genomes the expected TiTv is about 2 and for exon capture panels it's about 3. Major deviations from these values could indicate a bias during sequencing or sample degradation during storage.
Strand Bias - A measure of the bias of the genotype calls made on the positive and negative strands. No bias means calls are the same on each of the complementary strands; high bias means the calls differ. High strand bias around variant calls could indicate an over-reporting of false positives.
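The TiTv definition in the list above can be sketched in a few lines of Python. This is a minimal illustration, assuming variants arrive as simple (ref, alt) base pairs; `is_transition` and `titv_ratio` are hypothetical helper names, and the example calls are made up rather than taken from real data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True when ref->alt stays within the purines or within the pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Compute Ti/Tv from an iterable of (ref, alt) single-base substitutions.

    Anything that is not a clean A/C/G/T substitution is skipped.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-nucleotide substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return transitions / transversions


# Made-up calls: four transitions and two transversions
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(titv_ratio(toy_calls))
```

In practice these counts would come from a VCF or a variant caller's own metrics report rather than a hand-built list.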
Target capture specific metrics:
Fold80 Penalty - This is a measure of evenness or uniformity. The best captures are ones that have perfect uniformity. Fold80 penalty is defined as "fold over-coverage necessary to raise 80% of bases in targets to the mean coverage level." 1 is perfect, so any deviation from that indicates a bias in capture. The best captures are <1.5.
Percent Reads On Target - This is a measure of how much sequencing is being wasted on non-specific binding or off target. This value can vary greatly depending on the size of your capture from 60-70% for an exome down to <20% for smaller capture panels. Deviations from the expected value can indicate bias.
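Both target capture metrics are simple to compute once you have per-base depths and read counts. The sketch below assumes one common way of calculating the Fold80 penalty (mean target depth divided by the depth that 80% of target bases meet or exceed); the function names and toy numbers are hypothetical.

```python
def fold80_penalty(depths):
    """Mean depth divided by the depth that 80% of target bases meet or exceed.

    depths: per-base coverage values across the target regions.
    Assumes the 20th-percentile formulation of the metric.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    ordered = sorted(depths)
    mean_depth = sum(ordered) / len(ordered)
    # Index of the 20th percentile: 80% of bases sit at or above this depth
    p20_depth = ordered[len(ordered) // 5]
    if p20_depth == 0:
        raise ValueError("20th-percentile depth is zero; penalty is undefined")
    return mean_depth / p20_depth


def percent_on_target(on_target_reads, total_reads):
    """Percentage of aligned reads that fall on the capture targets."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * on_target_reads / total_reads


# Made-up per-base depths for a tiny 10-base target
toy_depths = [40, 50, 80, 100, 100, 100, 100, 120, 150, 160]
print(fold80_penalty(toy_depths))
print(percent_on_target(650, 1000))
```

A perfectly uniform capture gives a penalty of exactly 1, and the value climbs as more target bases fall below the mean.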
Hey Oncology Market: The proteome is coming at you faster than you think.
One of the early promises that was made during the pitch for funding the human genome project was that once we had figured out the code of life, we'd be able to understand and cure all diseases.
In retrospect (and even at the time), scientists knew this was hyperbole and that the genome was really just the bottom of the molecular biology pyramid.
Knowing the sequence of the bases is important, but it tells you very little about what is actually expressed by the genome.
For that you need other tools to look at the products of the genome like mRNA, proteins, and metabolites from cellular processes but also modifications to the genome itself that control which parts of it are accessible.
Together we refer to the genome plus all of these other things as the 'multi-ome.'
One of the hardest of these other 'omes' to measure is the proteome.
It represents all of the proteins that make up the little machines that allow our cells to function.
Each of our cells expresses different proteins, and these work together, ultimately creating all of the tissues that make up our bodies.
However, in diseases like cancer, these cellular functions are disrupted due to mutations in the genome that change what is expressed or alter how those proteins function.
We can pick up some of these signals by looking at the genome, but we can get an actual read out of the biology of these cells by looking at the composition of the other 'omes' in the bloodstream!
Up until about 10 years ago, looking at proteins was a very tedious task requiring gels and antibodies or highly complex purification schemes paired with tandem mass spectrometry.
Now we have new techniques for quantifying thousands of proteins at once which gives us a much more comprehensive look at the underlying biology of cancers.
The paper I'm showing a figure from today was written by the group behind the Human Disease Blood Atlas and they characterized 1,463 proteins in more than 1,400 cancer patients.
They then took the data from those results and used machine learning to develop algorithms for predicting AML, ALL, DLBCL, Myeloma, Lung, Colorectal, Glioma, Prostate, Breast, Cervical, Endometrial, and Ovarian cancer.
They followed up by detecting those cancers with relatively high sensitivity and specificity including AUCs for 6 out of 12 above 0.95 (and above 0.8 for the rest!).
While not perfect, this is pretty freakin' good for a first crack, and this work highlights the future potential for proteomics in multi-cancer early detection (MCED).
With some optimization and a more comprehensive method validation, proteomic approaches could make the current players in the MCED space sweat!
###
Álvez MB et al. 2023. Next generation pan-cancer blood proteome profiling using proximity extension assay. Nat Commun. DOI: 10.1038/s41467-023-39765-y
Mendel first described his laws of genetic inheritance in 1865. They were promptly ignored for 35 years.
Because, let's be honest, who gives a husk about peas?!?!
It probably also didn't help that his paper was titled, 'Versuche über Pflanzenhybriden.'
Which in English translates to, 'Experiments in plant hybridization.'
While this sounds exhilarating, it belies the foundational concepts described in its pages and Mendel's work breeding peas wasn't revisited until 1900 when his results were independently rediscovered and confirmed by E. Tschermak (peas), W. Spillman (wheat) and C. Correns (peas).
H. de Vries (flowers) also independently characterized plant genetics in 1900 but had to be told that Mendel scooped him 35 years earlier.
Oof.
So, what was it that Mendel discovered while studying peas?
He observed the physical traits of peas and how these traits were passed to their offspring after breeding.
Mendel established 3 genetic principles from these observations:
Segregation - Traits come in two forms but only one from each parent is passed to offspring.
Independent Assortment - The segregation of the forms of each trait occurs independently of any other trait.
Dominance - Dominant forms of each trait mask the recessive forms and they occur in a 3:1 ratio.
Mendel initially described these as laws, but we know now that there are numerous exceptions, so in genetics we often refer to Mendelian and non-Mendelian inheritance.
But in the early 1900's, the race was on to see if Mendel's phenomenon wasn't just some weird plant thing.
One of the biggest proponents of Mendel at the time was William Bateson.
Bateson is a forgotten figure nowadays, but he popularized the works of Mendel and also did some of the earliest work on genetic linkage.
He also became fast friends with a clinician scientist, Archibald Garrod.
Garrod was most interested in the biology of the diseases in his patients and had a particular fondness for chemistry.
This could be why he was so enamored with the color of his patients’ urine.
This curiosity paid off in 1899 when he first noticed a chemical aberration in the urine of patients with Alkaptonuria.
This is an ancient disease with symptoms being described as far back as 1500 BC, but one of its tell-tale signs is darkly pigmented pee.
Garrod noted to his friend Bateson that this disease was found most often as the result of marriages between first cousins, and at Bateson’s urging, Garrod documented the incidence of Alkaptonuria in these families.
Today’s #FigureFriday is the first evidence of a recessive Mendelian disease in humans.
In the table you can see that in these families the dominant form of the trait is observed in 36 offspring, and the Alkaptonuria form in 12.
This perfectly aligns with the 3:1 ratio Mendel first observed in his peas!
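For anyone who wants to check the arithmetic, a chi-square goodness-of-fit test is the usual way to ask whether counts match a 3:1 expectation. Here is a minimal Python sketch using the 36 and 12 from the table; `chi_square_3_to_1` is a hypothetical helper, and 3.841 is the standard 5% critical value for one degree of freedom.

```python
def chi_square_3_to_1(dominant, recessive):
    """Chi-square statistic for observed counts against a 3:1 expectation."""
    total = dominant + recessive
    expected = (0.75 * total, 0.25 * total)
    observed = (dominant, recessive)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_VALUE_DF1_P05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

statistic = chi_square_3_to_1(36, 12)
print(statistic, statistic < CRITICAL_VALUE_DF1_P05)
```

With 48 offspring the 3:1 expectation is exactly 36 and 12, so the statistic is 0 and there is no deviation from the Mendelian ratio to explain.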
###
Garrod AE. 1902. The incidence of alkaptonuria: a study in chemical individuality. The Lancet. DOI: 10.1016/S0140-6736(01)41972-6
@billytcl Yes! Biggest give away is if you mention sampling statistics and their eyes gloss over you know they've probably not thought through things very far.
Detecting variants can be challenging, especially detecting the ones that aren't very abundant.
Unfortunately, there are a good number of labs that use the default settings for their sequencing analyses, and/or implement premade pipelines that they 'validate' without doing the appropriate amount of work to make sure that the thing they've developed actually performs the way they say it does.
This gets extra tricky in oncology screening, where the allele frequencies can drop well below 1%, with the latest crop of sequencing-based tests advertising sub-0.1% detection capabilities.
But what does it mean to be able to call a heterozygous variant with a 50% frequency in a germline sample or a 0.1% frequency variant in a liquid biopsy?
Can the assay do it every time?
Can the assay do it in every sequence context?
How do you know?
There are a few really good ways to know, but most places don't do these things because they're not required to:
In silico decimation - this is an informatic technique where the data from a large number of samples is randomly reduced, i.e., take samples with a minimum coverage of 40x at each position and reduce them to 30x, 20x, and 10x to see where the assay starts losing the ability to detect specific types of variants. In the case of liquid biopsy samples, where the minimum coverage to call a sub-1% variant approaches 10,000x (depending on the quality/error correction strategy), decimation through a much higher coverage range might be warranted. It is also possible to create contrived datasets where variants are randomly inserted into the data at a specific frequency, but these programmatic manipulations of frequencies are only good for evaluating informatics performance, not lab process performance.
Contrived synthetic controls - one way to test process performance is to synthesize a variant into a sequence using a company like Integrated DNA Technologies and then spike or dilute that sequence fragment into a sample to 'contrive' the mutation. This 'sample' can then be taken through the whole process and be used to determine at what allele frequency the lab process begins to fail to detect the variant (preferably this is done for every gene/target/exon in the panel).
Characterized mixture panels - many companies (and NIST) offer mixture panels. Some are contrived, others are mixtures of cell lines with well known variants, but most importantly, these panels have been independently characterized using sequencing and/or droplet PCR to precisely determine the allele frequencies of the variants contained in the panels. These allow for accurate benchmarking of the performance of an assay using an independent resource.
However, none of these methods is perfect and it's always a good idea for labs to track assay performance post-launch, especially as interesting positive samples are gathered or become available through other sources.
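The sampling statistics behind in silico decimation can be illustrated with a simple binomial model: at a given depth and allele fraction, how often would you expect to see enough variant-supporting reads to make a call? This is a minimal sketch, not a real decimation (which subsamples actual reads). The minimum read thresholds of 3 and 5 are assumptions, sequencing error is ignored, and `detection_probability` is a hypothetical helper.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads):
    """Probability of seeing at least min_alt_reads variant reads at a position.

    Models each read as an independent draw that carries the variant with
    probability allele_fraction (binomial sampling, no sequencing error).
    """
    p_below = sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


# Germline heterozygous variant (50%), assuming at least 3 supporting reads are required
for depth in (40, 30, 20, 10):
    print(depth, round(detection_probability(depth, 0.5, 3), 4))

# Liquid biopsy variant at 0.1%, assuming at least 5 supporting reads are required
for depth in (10000, 5000, 1000):
    print(depth, round(detection_probability(depth, 0.001, 5), 4))
```

Even this idealized model shows why a 0.1% variant needs coverage in the thousands, and real data will do worse once errors, duplicates, and sequence context are involved.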
@GenomicsCow @h2so4hurts Yes, and that's what I'm alluding to at the end there. Contrived validations are awesome until you can gather enough samples to do an addendum with live samples.
Spoiler Alert: The future of disease early detection isn't going to be genomics.
We already have some pretty good hints that this is true.
One of them is that the hottest Multi-cancer early detection (MCED) screening test is based on methylation.
This falls squarely in the realm of epigenomics.
But we also have three decades worth of high throughput genomics under our belts now and we really don't have a ton to show for it.
We've certainly learned a lot in that time about the genome but one of the greatest lessons we've been taught is that genetics alone is a pretty terrible predictor of whether a healthy person will actually develop a disease.
Recent studies have shown that 8% of people carry a pathogenic mutation, but only about 7% of those people ever become symptomatic.
The caveat here is that some mutations are more likely to cause disease than others. These include mutations in BRCA1 and BRCA2 (Breast Cancer) or HBB (Thalassemia), where 30-60% of the people who carry them develop disease.
But what do we do with all of the other 'damaging' pathogenic variants we find in healthy people?
The American College of Medical Genetics (ACMG) suggests extending the list of reportable findings to include pathogenic variants in 81 disease associated genes (inclusive of the 3 above).
These are all considered to be 'medically actionable' because steps can be taken to improve patients' clinical outcomes.
However, the struggle here, and with most genetic diseases, is knowing when, or if, a healthy person will ever become symptomatic.
Wouldn't it be great if we had a way to detect disease onset decades before there's an observable phenotype in a patient?
We're finally at a point where this might be possible and it's all thanks to new developments in proteomics (the study of proteins and how they interact within our cells)!
Interest in proteomics has increased drastically in the past few years as new technologies have made it possible to easily study thousands of proteins at a time in a single sample.
The power of these methods was highlighted in a recent paper where an aptamer array was used to screen ~5,000 proteins in the plasma of 11,000 patients who had participated in a 25-year study on atherosclerosis.
The researchers were able to use blood samples collected throughout the duration of that original study to identify a set of proteins that could predict the development of dementia up to 25 years before symptom onset.
That's pretty incredible!
While there's still a lot of work to be done, it's exciting to see how proteomics continues to increase our understanding of the biology of disease.
Because, as we've learned, the genome only represents potential, but the proteome could tell us when, or if, that potential is actually beginning to impact biology!
###
Walker KA et al. 2023. Proteomics analysis of plasma from middle-aged adults identifies protein markers of dementia risk in later life. Sci Transl Med. DOI: 10.1126/scitranslmed.adf5681
@dirkeggink Oh, I didn't take it that way at all! To those of us that have done genetic diagnostics for a while, these are totally obvious but I do worry about the expansion of these tools into labs where they haven't taken the time to really think through the process!