Work led by Ryan Synk, with @cenksahinalp (NCI), Prashant
Pandey (Northeastern), and @ramani_d
Preprint: https://t.co/NpjSKTB6fL
Code and trained checkpoint to follow.
Searching the NIH Sequence Read Archive - 50+ petabases of DNA reads - for matches to a query sequence could transform biology. But existing tools either don't scale (BLAST) or break under sequencing error (k-mer methods). Our new preprint, LOCALE, takes a different approach. 🧵
The broader point: the bottleneck for petabase-scale sequence search is no longer scalability alone; it's robustness to sequencing error and biological variation at scale.
Recasting search as dense retrieval opens a path to the full ANN / quantization toolkit from IR.