The GTEx V10 has been released. Compared to release V8, V10 has 12% more RNA-seq samples and 23% more samples used for eQTL analyses. V10 release includes new smallRNA-seq data from 16,760 tissue samples, including gene expression counts and cis-QTLs. It also includes deep somatic WGS(~195X) from 5-7 tissue samples from each of 55 donors (199 samples in total)
https://t.co/2VVMYapNHj
#UKBiobank #AllofUsResearch #GenomicsEngland #AI #Drug @hmg_journal
Interested in 3' UTR changes in scRNA-seq? Check out our:
* CPA-Perturb-seq paper
* UCSC Perturb-seq tracks (see who regulates your favorite 3’ UTR!)
* Seurat extension for polyA site analysis (PASTA)
* Deep neural perturbation network (APARENT-Perturb)
https://t.co/XXYEBHV1iG
Today, RGC has published an important paper in @Nature today, describing an analysis of close to a million human exomes (n=983,578) as a single variant call set (!). This is the largest and most diverse rare variant database created so far. This impressive feat is accomplished by a large @RegeneronDNA team led by my wonderful colleague @suganthibala.
@SunKat_y et al. Nature 2024
https://t.co/JiJQxCFKBT
What kind of insights can we learn from sequencing ~980k exomes? Below is a summary of the major findings from the paper.
Background of RGC
Regeneron Genetics Center (RGC) was established in 2014 just on time when major pharma companies started entering into the human genomics playfield. Last year, RGC celebrated its 10th year anniversary. I've written about the origin story of RGC before (https://t.co/LoZRQOpkK1).
The business model of RGC is simple and efficient. It collaborates with academic institutions across the world and provide sequencing as free service in exchange for access to genotypic and phenotypic data.
The first successful collaboration was made with Geisinger Health system (GHS) to sequence 100,000 individuals, which was soon followed by an avalanche of large collaborations. Some of our largest collaborators include UK Biobank (N=500k), GHS (N=175k) and Mexico City Prospective Study (N=150k). Today, RGC has more than 300 collaborations around the world. Just a few months ago, it surpassed the milestone of 2 million exomes. What is described in the current paper is only a fraction of that sample.
Diversity of samples
The 980k exome dataset come from a diverse set of samples. 23% (n=190k) of the participants are of non-European ancestries, the largest proportion to date for any similar datasets created so far. This includes both outbred populations and special populations enriched with communities with long-standing cultural history consanguineous and endogamous unions.
When it comes to human genetics, diversity is the key to making discoveries. Almost everyone agrees, and the field is embracing it now. But RGC is way ahead of the game. Just a few months ago, RGC partnered with other companies and laid the first foundational stone of what will become in a few years from now the world's largest genomics resource comprising half a million African Americans and Africans (https://t.co/JE0Yycm8E1).
Variant survey
Human genome is ~3 billion base pairs long. ~1% of which (~30 million base pairs), containing exons, is targeted by exome sequencing. By sequencing 980k exomes, the authors have captured ~16.5 million unique variants. That is, on average, one per every two base pairs across the exome.
The main goal of concentrating on exomes is to capture deleterious spelling errors in the genome, resulting in either loss or substantial decrease or, sometimes, increase in gene function. The authors have identified
- ~1.1 million predicted loss of function variants (pLOFs), ~50% of which are singletons (that is, seen in just one individual)
- ~10 million missense variants, 40% of which are singletons.
As expected, African ancestry groups had more variants (18% more) than any other ancestry group.
Footprints of selection
pLOFs in the human genomes are like bullet holes in aircraft returning from war. The genes untouched or rarely hit by the pLOFs are the most critical genes, without which life is probably impossible.
Studying ~980k exomes, the authors have identified ~4000 genes that are depleted of pLOFs, suggesting they are indispensable. For more than 20% of these genes, we are learning their critical requirement for normal life for the first time. Previous datasets were not able to quantify their mutation constraints because of the shorter length. Most of these genes were not linked to a human disease yet. The current list will inspire many Mendelian discoveries in the near future.
Regional selection
We have 10 times more missense variants than pLOFs, which means we can zoom into within genes and study which parts of a gene are indispensable and which parts aren't.
Not all parts of a protein are critical, but some parts are. For example, DNA binding regions of transcription factor protein, catalytic sites of an enzyme protein, transmembrane domains that forms the pore of channel proteins etc. With a knowledge of ~10 million missense variants from 980,000 humans, such critical regions are now starting to light up, illuminating the most crucial regions of proteins. For example, here is a trace of missense tolerance across different domains of cancer gene KRAS. Human genetics shows that the first 80 amino acids as the most critical region of KRAS, falling under the top 1 percentile of regional missense constrain metric.
Human knockouts
The function of a gene in an organism is understood, typically, by studying the phenotypic consequences of deleting the gene. We cannot do such experiments in humans. But fortunately, Nature has already done this mutagenesis experiments for us. By studying naturally occurring human knockouts, we can assess the consequences of completely inhibiting a gene. This is crucial data for drug developers, as it informs about safety of drugs that act by inhibiting a gene or its product.
Studying the pLOFs across 980k humans, the authors have found 4.686 genes with at least one human knockout, suggesting that a life without these genes is likely possible. In line with that, the authors find that these genes are the ones that were mutationally least constrained (that is, they are enriched for pLOFs). For >1700 genes, we are learning for the first time humans completely lacking these genes do exist in this world. This is an incredible resource for drug development.
Clinical genetics insights
One of the most important use case of reference variant databases is to help clinical geneticists to identify disease causing variants in the patients. Historically, variant databases have been biased towards European populations. As a result, clinical geneticists struggle when they study exomes of non-European ancestry patients and often label the suspected variants as variants of unknown significance (VUS), because of a lack of proper reference database.
Cross-referencing the clinvar database with RGC dataset, the authors find European ancestry groups had more variants labelled "pathogenic" in Clinvar than African ancestry groups. Conversely, African ancestry groups had more VUS than European ancestry groups. This is not because Africans are protected from pathogenic variants, but simply reflect current databases are ignorant to clinically important variants in non-European ancestry individuals. With growing diverse databases such as the current one from RGC, the situation will soon change.
Conclusion
RGC has created one of the largest reference database for studying human exomes. The implications of this resource are many, spanning all areas of human biology from basic science to drug discovery.
Congrats to all my colleagues (@SunKat_y et al.) on this incredible accomplishment. And thanks to all RGC collaborators and research participants without whom such a dataset wouldn't exist.
As part of v4, we are happy to announce the launch of the #gnomAD forum https://t.co/7Lkg4FHjFA. This will be a place for our users to help each other, discuss the data and ask questions. #ASHG23
Lastly, I'd like to thank @NeBanovich & @MustafaRaoofMD for giving me the opportunity to work on the project & for the training in cancer genetics/genomics. It's been a great journey for me in Banovich lab & I wouldn't be where I am now without the training from Nick (12/12).
I am super excited to share our latest work using scRNA-seq to characterize cell types, states & molecular programs in disseminated appendiceal neoplasms:
https://t.co/cR6ua9p880 (1/n)
We would like to thank the co-authors for your help with sample prep, & feedback on the manuscript. We thank @EvanDMee1 for helping with GEO data submission. Most importantly, we would like to thank the patients and donors, without you this work would never been completed. (11/n)
Today I said farewell to the @NeBanovich and the lab to move on to a new position with @kbrowngenetics. I’m excited for the new position and will definitely miss TGen and Banovich Lab.
Bittersweet day. Today is @linhbui2811's last day in the lab. She was the first postdoc to join my group in January of 2019. In that time she's been an integral member of the lab. She's moving on to a staff scientist position at @theNCI with @kbrowngenetics. We'll miss you Linh!