every company wants to be #1 in their own benchmark
worked with @micro1_ai to have an independently validated benchmark
huge s/o @ArthBohra @donaldwu_ and the rest of the team for making this happen
Many companies are #1 in a benchmark they crafted.
We worked with @micro1 to create an independently audited benchmark that measures document extraction performance on long documents.
The results of LongExtractBench show the nuances companies are likely to find in the real world. micro1 tested frontier models with max reasoning and document processing platforms with their strongest configurations, and found notable precision/recall and completion tradeoffs across most of them.
Reducto’s Deep Extract leads the industry by a wide margin. 🧵
everyone is #1 on their own benchmark.
really grateful to @micro1_ai for independently sourcing the real world data and manually correcting the ground truth to produce this evaluation.
one surprising takeaway: you cannot take reliability for granted. none of the other benchmarked systems achieved > 95% completion rate. that's a difference you feel in production.
I understand that "99.6%" feels benchmaxx'ed, but we worked very hard to optimize the pipeline and made accuracy our top priority.
It's still not perfect, because in a production system "0.4%" still means you need a human in the loop to QA the results. We will keep improving it.
Huge shout out to @ArthBohra and the team. They iterated on this for months and turned a cool demo into a reliable production tool!
Today we're publishing LongExtractBench, a benchmark commissioned by @reductoai and independently validated by micro1.
We evaluated seven production document extraction systems across the same 225 complex enterprise documents. The benchmark was intentionally difficult: documents averaged 358 pages and contained roughly 88,700 ground-truth fields each. Every system was evaluated using the configuration documented in the benchmark methodology.
Key findings:
• Reducto Deep Extract was the only system to successfully complete all 225 documents.
• Direct frontier LLM baselines achieved substantially lower completion rates on long, complex documents.
• In this benchmark, dedicated extraction platforms achieved higher completion rates than the direct frontier LLM baselines.
• Recall was the clearest differentiator. Precision remained high across systems, but recall ranged from 33.8% to 99.6%, highlighting which systems consistently captured the information contained in long, complex documents.
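For readers less familiar with the metrics named above, here is a minimal sketch of how field-level precision, recall, and completion rate are commonly computed. It is an illustration only: the benchmark's actual scoring rules are in the full report and may differ. The function names, the exact-match rule, the choice to score only completed documents, and the toy data are all assumptions made for this example.

```python
from typing import Optional

Fields = dict[str, str]  # field name -> extracted value


def score_document(predicted: Fields, truth: Fields) -> tuple[int, int, int]:
    """Return (correct, n_predicted, n_truth) for one document.

    Assumption: a predicted field is correct only on an exact value match.
    """
    correct = sum(1 for key, value in predicted.items() if truth.get(key) == value)
    return correct, len(predicted), len(truth)


def summarize(runs: list[tuple[Optional[Fields], Fields]]) -> dict[str, float]:
    """Aggregate metrics over (prediction, ground truth) pairs.

    A prediction of None marks a document the system failed to complete.
    Assumption: precision and recall are computed over completed documents only.
    """
    completed = [(pred, truth) for pred, truth in runs if pred is not None]
    correct = n_pred = n_truth = 0
    for pred, truth in completed:
        c, p, t = score_document(pred, truth)
        correct += c
        n_pred += p
        n_truth += t
    return {
        "completion_rate": len(completed) / len(runs) if runs else 0.0,
        "precision": correct / n_pred if n_pred else 0.0,
        "recall": correct / n_truth if n_truth else 0.0,
    }


# Hypothetical toy data: one completed document, one failed run.
truth_1 = {"vendor": "Acme", "total": "100", "date": "2024-01-01", "po": "77"}
pred_1 = {"vendor": "Acme", "total": "100"}  # everything returned is right, half is missing
truth_2 = {"vendor": "Globex"}

result = summarize([(pred_1, truth_1), (None, truth_2)])
print(result)
```

In this toy case every returned field is correct, so precision is 1.0, but only half of the ground-truth fields were captured, so recall is 0.5, and one of two documents finished, so the completion rate is 0.5. That is the pattern described above: precision can stay high while recall and completion separate systems.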
The full report includes the benchmark methodology, limitations, and reproducibility resources. Check out the report and results in the comments below.
If these estimates from McKinsey hold true, we will be spending approx $7T for ~216 GW of incremental compute by 2030. For us to keep pace with this unprecedented buildout, walking down the full supply chain picture gets pretty insane
Kinect (@trykinect) turns every e-commerce store into an AI-powered storefront that actually sells.
As customers shop, online shopping assistants leverage what each customer is looking for in the moment, adapt to every visitor in real time, and capture buying intent data that stores have never had before.
Congrats on the launch, @Kratik_ag & @VarunKand!
https://t.co/6jOPPs9sUx