We co-authored a new post on the @awscloud ML Blog with Greg Fina, Jim Fratantoni, and ND Ngoka, walking through how @Pulse__AI extracts clean structured data from financial documents that break traditional OCR and turns it into a fine-tuning set for Amazon Nova Micro on Bedrock.
Every enterprise that processes documents eventually hits the same wall. Extracting data from PDFs is one thing, but putting data back into a PDF is where the workflow breaks down.
Most enterprise PDFs don't have defined form fields at all. Government forms arrive as scanned images, insurance claims are flat PDFs, and healthcare intake forms have been printed and photocopied beyond recognition. Traditional tools have nothing to work with.
@Pulse__AI Form Fill uses our vision-language models to analyze the full document layout, detect where fillable regions exist based on visual context, and place values in the correct locations. You provide structured data or natural language instructions, and Pulse returns a filled PDF.
Healthcare teams are already using this to automate pre-auth forms, CMS-1500 claims, and provider enrollment packets through a single API call.
Full blog post: https://t.co/fWaKnjt8op
The @Pulse__AI research team just published a new blog: Why LLMs Cannot Own Enterprise Document Parsing.
The core argument: reading a document and ingesting one are different problems. An LLM can read a PDF and give you something clean-looking. Ingestion means preserving exact text, numbers, table structure, and source evidence across millions of documents, deterministically and auditably. That's what enterprise workflows in finance, healthcare, legal, and insurance actually need.
The dangerous failure mode of LLM-only parsing is silent: often markdown may look polished, tables look coherent, but underneath, rows are dropped and decimals are incorrectly parsed. One wrong field can corrupt a downstream model, claim, or contract review.
The right architecture is hybrid: computer vision for layout, OCR for text, table-specific extraction for structure, and language models for schema mapping, with deterministic validation for business rules.
Regardless, we ran our newest model, @Pulse__AI Ultra 2, on RD-TableBench.
Pulse Ultra 2 is state-of-the-art on RD-TableBench across both mean and median similarity on the full 1,000-sample set.
Pulse Ultra 2 (SOTA): 0.9172 mean / 0.9684 median
Reducto (2024): 0.9024 mean / 0.9278 median
Reducto Agentic (2026): 0.8842 mean / 0.9308 median
Reducto (2026): 0.8665 mean / 0.8957 median
Mean reflects aggregate performance across the dataset. Median reflects how the model handles a typical document, without being skewed by a handful of unusually easy or hard samples.
Read our in-depth analysis here on other benchmarks in the space: https://t.co/tf4uSBzMHs
🧵Our team at @Pulse__AI has been analyzing the table extraction benchmarks the industry is converging on. Most don't reflect how these systems get used in enterprise workflows, and in some cases the ground truth itself has issues that change how you'd interpret the leaderboard.
The full analysis with examples is in the comments. One specific benchmark worth diving into deeper:
We audited all 1,000 RD-TableBench ground truth files against their source images. 43 (4.3%) contain verifiable errors across three categories: scrambled text or wrong structure (28), garbled OCR including watermark bleed and hallucinated characters (13), and buffer artifacts with random digit sequences attached to real numeric values (2).
On top of that, 89 of the 1,000 ground truth files are byte-for-byte identical to a single provider output shipped in the same dataset. In 15 of the 43 error cases, the ground truth and this provider share the exact same specific error while independent providers do not. Column headers in the same wrong word order ('tropical forest Arid scrub' instead of 'Arid tropical scrub forest'). Watermark text like pulled into cell content in both, and identical garbled character sequences on CJK samples.
Benchmarks are becoming the basis for how document intelligence is being measured, and ground truth quality is the foundation they rest on. We think it's worth the community being able to evaluate that alongside the numbers.
Table extraction is one of the hardest unsolved problems in enterprise document intelligence, because tables encode relationships where position is meaning and a number landing in the wrong cell doesn't just look wrong, it breaks downstream analytics, compliance workflows, and financial models. As the space matures, benchmarks are how the industry measures progress, so we took a close look at several of the public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench.
What we found is that every one of them has a structural issue that lets providers look better than they actually are. TEDS conflates formatting with structure so that HTML convention differences register as errors. Spatial tolerance parameters can hide missing headers while still reporting high content accuracy. Linearizing 2D tables into 1D sequences means column swaps can align well despite being structurally wrong. And when we audited the 1,000 ground truth files in RD-TableBench against their source images, we identified 43 with verifiable errors, with 89 of them byte-identical to one provider's output and 15 containing the same specific error as that provider while independent providers got it right.
This is what motivated us to build PulseBench-Tab: multilingual by design with 2D-aware scoring, structure cleanly separated from formatting, and everything open from the dataset to the scoring code to the per-sample results.
Our goal is to help move the industry forward, and we'd love the community to pitch in on how we can continue to improve.
Full breakdown, with examples, scoring impact, and the corrected ground truth analysis, linked below.
Read the full analysis: https://t.co/PAq3xT03eu
We took a close look at several public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench, and found that every one of them has a structural/methodology issue that doesn’t properly evaluate document intelligence at scale.
1/ OmniDocBench's TEDS metric conflates formatting with structure. A 3x3 table with identical content scores differently based on whether it uses <thead> wrappers or plain <tr>.
2/ SCORE-Bench's spatial tolerance parameter can hide serious failures. Drop a financial table's header row, shift data up by one, and the benchmark reports high content accuracy while the headers are gone.
3/ ParseBench relies on frontier VLMs for ground truth, introducing model bias into the benchmark itself. Its table metric treats records as unordered bags, so column transposition and row reordering go unpenalized. The table set is also 503 pages, 54.5% from a single SERFF source, and English-only.
4/ RD-TableBench linearizes 2D tables into 1D sequences, so column swaps can align well despite being structurally wrong. We also audited all 1,000 ground truth files and found 43 with verifiable errors and 89 byte-identical to one provider's output.
That's why we built PulseBench-Tab, a benchmark that is multilingual by design, uses 2D-aware scoring that preserves horizontal and vertical adjacency, cleanly separates structure from formatting, and is fully open from the dataset to the scoring code. Full breakdown by the @Pulse__AI team in comments.
Pulse just open-sourced PulseBench-Tab, a frontier benchmark for table extraction across 9 languages, alongside T-LAG, a new scoring metric that evaluates structure and content in one unified score.
After processing billions of pages for Fortune 50 enterprises, the largest investment firms, and the leading AI startups, @Pulse__AI is opening up the evaluation methodology behind how they train and measure their models.
Most existing benchmarks miss the structural relationships that determine whether a table was actually understood, and T-LAG fixes that.
Pulse now sets the standard for what production-grade extraction actually requires.
Congrats on the launch, @sid_mnk and @ritvikpandey21!
https://t.co/KmZhnknczY
Ultra 2 from @Pulse__AI, our highest-accuracy document extraction model and the one now running in production at some of the largest banks and Fortune 100 enterprises. A year of work went into a new training corpus, a new annotation pipeline, and an entirely new architecture built from the ground up.
The previous generation routed every document through a chain of independent models: OCR, layout detection, table detection, cell extraction, structure reconstruction. Each hop was a new failure point, and upstream mistakes cascaded downstream.
Pulse Ultra 2 replaces that entire pipeline with a unified end-to-end architecture. One model, one forward pass, jointly solving layout understanding, text recognition, cell detection, and structure prediction.
The accuracy gains show up most on the hardest document layouts: merged cells, borderless tables, multilingual documents across 100+ languages, and complex multi-table pages.
Given the high computational demand, we're rolling out with specific rate limits for non-enterprise accounts while we scale capacity. Reach out to our team with the link below for higher rate limits or dedicated capacity.
Footnote extraction is now available on the @Pulse__AI platform.
Every footnote is returned as a structured object with its reference marker, full text, and positional metadata linking it back to exactly where it was cited in the document.
Available now here: https://t.co/t9oQu3Md5U
The dirty secret of document AI: most accuracy problems aren't model problems. They're scoping problems.
If your schema is running against the full document, you're asking your model to find signal across cover pages, appendices, boilerplate, and blank pages. It doesn't know what matters. You do.
A billion pages taught us the fix is upstream: define which pages are relevant before extraction runs. Clean inputs, clean outputs.
@Pulse__AI is launching Split for public access today.