To validate my PPP fraud detector I needed ground truth labels.
There were none.
So I scraped 3,400 DOJ press releases, extracted defendant business names, and matched them back to 11.3 million loan records.
325 validated matches. ~84% precision.
Built from scratch.
That entity resolution problem, not the model, was the hardest part of the whole project.
In real fraud detection, it always is.
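For a sense of what matching press-release names to loan records involves, here is a minimal sketch, not the actual pipeline. It assumes simple name normalization plus a fuzzy string ratio from Python's standard library. The suffix list, the `normalize` and `best_match` helpers, and the 0.9 threshold are all hypothetical choices for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"llc", "inc", "corp", "co", "ltd",
                  "incorporated", "corporation", "company"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def best_match(defendant: str, borrowers: list[str], threshold: float = 0.9):
    """Return (borrower, score) for the closest borrower name,
    or None if no candidate reaches the (assumed) threshold."""
    target = normalize(defendant)
    best, best_score = None, 0.0
    for borrower in borrowers:
        score = SequenceMatcher(None, target, normalize(borrower)).ratio()
        if score > best_score:
            best, best_score = borrower, score
    return (best, best_score) if best_score >= threshold else None


# Example: a name from a press release vs. borrower names in loan data.
print(best_match("ACME Holdings LLC",
                 ["Acme Holdings, Inc.", "Zenith Plumbing Co"]))
```

A linear scan like this would be far too slow against 11.3 million records. A real run would likely need some form of blocking (for example, comparing only within the same state or ZIP code) and manual review of the candidates, which is where a precision figure comes from.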
https://t.co/cvIkdg6TEh