Building Docstrange, open-source document intelligence for developers. Turning messy PDFs into clean Markdown for RAG pipelines. Sharing what I learn about doc
https://t.co/GrLXfd8wKK
Table extraction is still a hard problem for these models. Gemini 2.5 Flash is indeed very good at IDP tasks. More such insights, please!
It's officially live! The Intelligent Document Processing Leaderboard.
Featuring 16 datasets, over 9,000 documents, and 6 distinct tasks.
More models will be added soon, or you can evaluate on your own: the datasets and code are open-source too!
https://t.co/bnT4H7DpM9
@vanstriendaniel @allen_ai Faced the same issue with this bench. I think it should be fair to eval model + some postprocessing.
If the model over-extracts but knows what it's extracting, it can selectively remove things from the output to make it compliant with the bench?
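A rough sketch of what "model + some postprocessing" could look like, assuming the model returns labeled fields and the bench only scores a fixed set of them. The field names and the `filter_to_schema` helper are hypothetical, not taken from any specific benchmark:

```python
def filter_to_schema(extracted: dict, allowed_fields: set) -> dict:
    """Keep only the fields the benchmark scores; drop everything else."""
    return {key: value for key, value in extracted.items() if key in allowed_fields}


# Hypothetical model output that over-extracts relative to the bench schema.
model_output = {
    "invoice_number": "INV-001",
    "total": "42.00",
    "footer_note": "Thank you for your business",
}

# Hypothetical set of fields the bench actually evaluates.
benchmark_fields = {"invoice_number", "total"}

cleaned = filter_to_schema(model_output, benchmark_fields)
print(cleaned)
```

This only works if the model labels what it extracts; unlabeled extra text would need a different kind of filter.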
@GptMaestro Any good benchmarks to test graph performance of this plugin vs native graph databases? Maybe scope of queries possible, latency, etc.?
Which metrics do you think matter for a graph DB built for agents?
@ProfAdebay @Lunexalith @teortaxesTex Self-host, yes. Locally, probably not in the near future. These are going to be very large models, no economies of scale locally, etc.
@LandoTakingOver @Lunexalith @teortaxesTex The only argument could be that so far they have only unlocked large-scale distillation capability, so their best models can't be better than US models. Then they might as well release them.
A big assumption though!
Looking at mem benchmarks, most try to evaluate mem systems by giving a model access to memory and seeing how well it does the task.
Wouldn't a good bench be to evaluate the system directly? E.g., give it a lot of tax filings and measure what % of the tax code it can infer from them?
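One way the "% of tax code inferred" idea could be scored, as a minimal sketch. It assumes the ground-truth rules and the rules the memory system inferred can both be reduced to comparable identifiers, which is a big simplification; all rule names below are made up:

```python
def rule_recall(inferred: set, ground_truth: set) -> float:
    """Fraction of ground-truth rules that the memory system recovered."""
    if not ground_truth:
        return 0.0
    return len(inferred & ground_truth) / len(ground_truth)


# Hypothetical ground-truth rules that the filings were generated from.
true_rules = {"standard_deduction", "child_credit", "capital_gains_rate", "filing_deadline"}

# Hypothetical rules the memory system inferred after reading the filings.
inferred_rules = {"standard_deduction", "filing_deadline", "made_up_rule"}

score = rule_recall(inferred_rules, true_rules)
print(f"recall: {score:.2f}")
```

Recall alone ignores wrongly inferred rules like `made_up_rule`, so a real bench would probably want a precision term too.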
So far, multi-agent setups were needed for more direct practical purposes like context management. Reading mythos's system card (if true), you would need a multi-agent setup to minimize reward hacking, set up accountability, and manage the model's psyche, just like how you build organizations!
"When a metric becomes the target, it stops being a good metric" - Goodhart's Law
last few days GLM-OCR has been trending after it claimed 95% on OmniDocBench, which is higher than Gemini-3-pro
in reality GLM-OCR is way worse than the story these benchmarks paint, let's see how
full disclaimer: i've been working in this space for the last 7 years with @nanonets
@heyrimsha @Wealth_Pill We tested it on documents slightly different from the ones in popular benchmarks, and it doesn't do well. The model is clearly benchmaxxed
@_karthik https://t.co/Tx2YeGoAJE
The last batch of frontier models became better than finetuned ones; the next wave of finetuned ones should surpass them again. These small VLMs are where a lot of architecture development is happening IMO
@bindureddy For specific tasks like OCR, there's still some delta between SOTA and flash variants; however, small domain-specific models are doing better and better
https://t.co/Tx2YeGoAJE