https://t.co/qg4IrfIiMp
exists so Tamil Nadu’s crimes are tracked, remembered, and questioned, because data is the one thing they can’t spin. Feedback is welcome 🙃
11/ The lesson: scraping + infra was the easy 80%.
Getting a free local LLM to classify incidents like a careful analyst — without Gemini’s reasoning — is where all the real work is
9/ Backfilled history via GDELT back to 2021.
GDELT rate-limits silently: it returns an empty response instead of an error. Took a few failed runs to even notice the data was missing. The backfill isn’t live yet; it’s still in the dev environment.
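A minimal sketch of guarding against that silent failure, assuming the fetch step returns the response body as a string. The names `fetch_with_empty_check` and `EmptyResponseError`, the retry count, and the delays are all hypothetical, not the project's actual code. The idea is simply to treat an empty body as a failure and back off, rather than accepting it as "no articles".

```python
import time


class EmptyResponseError(RuntimeError):
    """Raised when every attempt came back with an empty body."""


def fetch_with_empty_check(fetch, query, max_attempts=4, base_delay=5.0,
                           sleep=time.sleep):
    """Call fetch(query) and treat an empty body as a rate-limit signal.

    `fetch` is any callable returning the response body as a string
    (hypothetical; plug in the real HTTP call). Retries with exponential
    backoff and raises if the body is still empty after max_attempts.
    """
    for attempt in range(max_attempts):
        body = fetch(query)
        if body and body.strip():
            return body
        # Empty or whitespace-only body: assume we were throttled.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise EmptyResponseError(
        f"empty response for {query!r} after {max_attempts} attempts"
    )
```

Raising at the end matters more than the retries: a backfill run that fails loudly is easier to notice than one that quietly stores nothing.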
10/ What’s next:
• Tighten the relevance filter
• Confidence scores + a review queue
• Close the reasoning gap (fine-tune / scale the model)
• Harden GDELT ingestion
8/ Root cause: the relevance filter was too loose.
It let political commentary and court procedure through, then force-fit them into a crime category that didn’t apply.
Garbage in, confidently-labeled garbage out.
6/ Taxonomy started too broad: it even included civic/infrastructure news.
That diluted everything. A “crime map” with potholes on it isn’t a crime map.
Cut it to crime-only: 8 categories.
5/ Dedup was a real problem — same event, 5 outlets, 5 different headlines.
Built 3 layers:
URL hash → title hash → fuzzy title similarity
Kills near-duplicates without merging genuinely separate incidents.
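The three layers above can be sketched roughly like this. It is an illustration under assumptions, not the project's code: the normalization rules, the `Deduper` class, the 0.85 threshold, and the use of `difflib.SequenceMatcher` for the fuzzy step are all choices made for the example.

```python
import hashlib
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def normalize_title(title):
    # Lowercase and collapse punctuation/whitespace to single spaces.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def url_key(url):
    # Assumption: query string and fragment are tracking noise, so drop them.
    parts = urlsplit(url.strip())
    canonical = parts.netloc.lower() + parts.path.rstrip("/")
    return hashlib.sha256(canonical.encode()).hexdigest()


def title_key(title):
    return hashlib.sha256(normalize_title(title).encode()).hexdigest()


class Deduper:
    def __init__(self, threshold=0.85):
        self.threshold = threshold  # hypothetical cutoff; tune on real data
        self.url_hashes = set()
        self.title_hashes = set()
        self.titles = []

    def is_duplicate(self, url, title):
        u, t, norm = url_key(url), title_key(title), normalize_title(title)
        # Layer 1: same URL. Layer 2: same normalized title.
        if u in self.url_hashes or t in self.title_hashes:
            return True
        # Layer 3: fuzzy similarity against titles already seen.
        for seen in self.titles:
            if SequenceMatcher(None, norm, seen).ratio() >= self.threshold:
                return True
        # New article: remember it for future comparisons.
        self.url_hashes.add(u)
        self.title_hashes.add(t)
        self.titles.append(norm)
        return False
```

The cheap exact checks run first, so the pairwise fuzzy comparison only runs for articles that survive them. The threshold is the knob that trades near-duplicate removal against accidentally merging separate incidents.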
4/ Fix: self-hosted model. Ollama + gemma3:4b on my own GPU.
Zero marginal cost per request.
Tradeoff: a 4B model reasons a lot worse than Gemini — and that tradeoff became the whole project
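For anyone curious what the local call looks like, here is a rough sketch against Ollama's local `/api/generate` endpoint on its default port. The prompt wording, the function names, and the category list passed in are assumptions for illustration; the real prompt and the 8 categories are not shown in this thread.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_payload(article_text, categories, model="gemma3:4b"):
    # Hypothetical prompt; the real one is not shown in the thread.
    prompt = (
        "Classify this news article. Reply with JSON containing the keys "
        "district, category, severity. category must be one of: "
        + ", ".join(categories) + "\n\n" + article_text
    )
    # stream=False returns one JSON object; format="json" requests JSON output.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_classification(ollama_body, categories):
    """Return the model's answer as a dict, or None if it is unusable."""
    try:
        answer = json.loads(json.loads(ollama_body)["response"])
    except (ValueError, KeyError, TypeError):
        return None
    # A small model can invent labels, so reject anything off-list.
    if not isinstance(answer, dict) or answer.get("category") not in categories:
        return None
    return answer


def classify(article_text, categories):
    data = json.dumps(build_payload(article_text, categories)).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_classification(response.read().decode(), categories)
```

Returning `None` for malformed or off-list answers leaves room to send those articles to a human instead of force-fitting a label.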
To answer a few questions people asked me:
1/ I built a real-time crime intelligence system for Tamil Nadu.
It scrapes news, classifies every incident by district + category + severity, and plots it on a live map.
Here’s the build log — including everything that broke.
3/ v1 used the Gemini API for the LLM step.
Worked in testing. Then broke at volume — cost scaled with every article, and I hit rate limits during traffic spikes.
A cloud LLM per article doesn’t scale for a news firehose.