Brian Raymond

2 months ago

@rak_garg which syllable should we drop?

3 months ago

Today we’re announcing that Unstructured is now natively embedded inside Teradata Vantage. This isn’t an integration or a connector: our platform runs as a native service inside one of the most trusted enterprise data platforms in the world. Teradata customers run some of the most critical workloads in financial services, healthcare, defense, and government. These are organizations where data sovereignty isn’t a nice-to-have; it’s a requirement. Now they can transform 70+ file types of unstructured content into AI-ready data, directly inside Vantage. No external pipelines. No additional infrastructure. We founded Unstructured with a simple conviction: the biggest unlock for enterprise AI is better data. This is what that looks like at scale.

3 months ago

We’re now natively embedded inside Teradata Vantage! @Teradata customers can transform 70+ file types into AI-ready data — no external pipelines, no data movement, full governance. Available across cloud, on-prem, and air-gapped environments. Read more: https://t.co/pghmWeyCf4

4 months ago

2025 was our biggest year yet for enterprise deployments into customer VPCs. Every single one taught us something about what actually breaks in production. Here are five of the most impactful lessons our Field Engineering team learned this past year:

1. Environment volatility isn't an edge case; it's the default. Proxy servers silently intercept traffic. Custom certificate authorities break TLS handshakes. API calls vanish into black holes. We stopped treating these as surprises and built pre-deployment validation tools that catch them before the first container even spins up.
2. Configuration shouldn’t require a "tear down." In the early days, changing a single parameter meant rebuilding the stack. Working in diverse customer environments taught us that adaptability can’t be an afterthought. Now, we build runtime configurability into the platform from day one, allowing us to adjust critical settings on the fly without downtime.
3. Answers are needed in minutes, not hours. Debugging in a customer's VPC with limited access is a high-pressure exercise. To solve this, we built AI-powered log analyzers that surface root causes instantly. What used to take half a day of manual digging now takes minutes.
4. Documentation is its own product. Our runbooks aren't written in a vacuum; they are forged from support tickets and troubleshooting sessions. By sharing these "living documents" with customers, we empower their teams to handle routine issues independently, reserving our engineering syncs for genuinely new challenges.
5. On-prem isn't always the answer. This year, we saw a shift: many teams (especially those already utilizing Snowflake or Splunk) wanted enterprise-grade security without the overhead of managing the infrastructure. In response, we built Dedicated Instances (DI). DIs provide a fully managed, isolated environment with complete network separation, offering the security of on-prem with none of the environmental volatility.

Deploying into customer environments is an exercise in humility. You can't anticipate every variable, but you can build systems that adapt to the reality of enterprise environments.

Reflecting on the past three years, it’s clear why the demand for on-prem deployment is skyrocketing. The GenAI ecosystem grew up around incredible open-source tools (Unstructured, Weaviate, Chroma, LlamaIndex, LangChain, Crew AI, etc.) that are easy to run on-prem. However, as builders graduate these prototypes into production, they hit the "Enterprise Wall": friction with internal infra teams, security audits, and complex networking. At Unstructured, we’ve realized that success doesn’t come from a great product alone; it comes from being the partner that helps builders navigate those wickets and move to production at speed.
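Lesson 1 above mentions pre-deployment validation for proxies and custom certificate authorities. The post doesn't describe how those tools work, so the following is only a minimal sketch of what such checks could look like using Python's standard library. The function names are hypothetical and are not part of any Unstructured product.

```python
import socket
import ssl
from typing import Mapping, Optional

# Conventional proxy environment variables; tools read either upper or lower case.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def find_proxy_settings(env: Mapping[str, str]) -> dict:
    """Return any non-empty proxy-related variables found in `env`."""
    found = {}
    for name in PROXY_VARS:
        for key in (name, name.lower()):
            if env.get(key):
                found[key] = env[key]
    return found


def check_tls(host: str, port: int = 443,
              cafile: Optional[str] = None,
              timeout: float = 5.0) -> Optional[str]:
    """Attempt a direct TLS handshake (not routed through any proxy).

    Returns None on success, otherwise a short error description.
    Pass `cafile` to trust a custom certificate authority bundle.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return None
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return f"{type(exc).__name__}: {exc}"
```

A pre-flight script might call `find_proxy_settings(os.environ)` and then `check_tls(...)` against each endpoint the deployment needs, reporting every failure before any container is started.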

7 months ago

Couldn't be more excited to share v1 of our industry benchmarks. New models and preprocessing strategies are announced almost every week, and up until today, it's been almost impossible to draw conclusions with any confidence.

What we found:
- Different model pipelines excel at different tasks: some configurations optimize for table extraction, others for content fidelity, and others for structural understanding
- Trade-offs exist: a pipeline with the highest table accuracy might have slightly different hallucination characteristics than one optimized for element alignment
- Consistent leadership: across the diversity of configurations, Unstructured pipelines consistently outperform competitors on key metrics
- Informed choice: you can select the pipeline that best matches your specific document types and downstream requirements

Early next month we will be open sourcing our labeled dataset so that everyone can independently evaluate the performance of their preprocessing pipelines. No cherry picking, overfitting to specific docs, or obfuscation. With Unstructured, you can count on full transparency about which models we're using, how we're using them, and how they're performing.

7 months ago

New benchmarks show Unstructured outperforming other leading document parsing tools across the metrics that actually matter: content fidelity, hallucination control, and table accuracy. We’ve released the numbers and open sourced the framework behind them. See the results, understand the trade-offs, and test it yourself. 👉 https://t.co/OkRQN1UfQx

7 months ago

Incredibly excited for all the new goodness we'll be rolling out this month. Starting things off, we've decided to massively simplify the user experience. No feature gating, no multiple pricing tiers, etc. If you sign up through our website you can expect:

1) Access to every feature, connector, model, etc. We're not holding anything back.
2) No complexity around pricing. Go nuts and run wild DAGs with multiple VLM/LLMs. Create custom schemas, custom metadata, etc. One flat cost of $.03/page. Try doing anything similar with hyperscalers and you're looking at >$.10/page. That's with janky connectors. Try doing the same thing with an IDP startup and you won't be able to: they don't provide GenAI-ready data.
3) We continuously test every connector, every API, every pipeline and deliver world class SLAs and uptime numbers. When we say we're "enterprise ready," it's not just a bumper sticker.

We're committed to building tools that developers love. If you have any feedback at all, don't hesitate to drop me a note [email protected]. I'd love to hear from you.

7 months ago

Our new pricing is live. 15,000 free pages. No expiration. Full access to every feature. When you’re ready to scale, it’s just $0.03 per page: one simple price for every connector, every transform strategy, and everything you need to build production-grade ETL pipelines. Let’s go → https://t.co/uA6EMG6Qab
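As a rough worked example of the pricing above, the sketch below estimates a bill from a page count. It assumes the 15,000 free pages are deducted once before the $0.03 per-page rate applies. The post does not spell out how the free allowance interacts with billing, and the function name `estimated_cost_usd` is purely illustrative.

```python
FREE_PAGES = 15_000        # free allowance quoted in the post
PRICE_PER_PAGE_CENTS = 3   # $0.03 per page, kept in cents to avoid float drift


def estimated_cost_usd(pages: int) -> float:
    """Estimate cost in USD, assuming free pages are deducted first (an assumption)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    billable = max(0, pages - FREE_PAGES)
    return billable * PRICE_PER_PAGE_CENTS / 100


if __name__ == "__main__":
    for n in (10_000, 15_000, 115_000):
        print(f"{n:>7} pages -> ${estimated_cost_usd(n):,.2f}")
```

Under that assumption, 115,000 pages would leave 100,000 billable pages, or $3,000.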

8 months ago

Excited to announce our OEM partnership with @IBM. Together, we’re helping enterprises move from messy, unstructured content to trusted, AI-ready data at scale. https://t.co/ELH3VheXyP + Unstructured = reliable AI. Special thanks to @ritika_gunnar, Ed Calvesbert, Emily Fontaine, and Stevan Slusher for helping to make this a reality!! You can read more here: https://t.co/xDOmhsr8qg

8 months ago

@clairevo 100000%

8 months ago

https://t.co/HQDgmnpvtz

8 months ago

OCR document parsing metrics create a fundamental problem: they punish generative models for being "different," even when they're actually right. A model can extract every piece of information perfectly, but if it formats the output slightly differently than expected, it gets penalized by rigid evaluation criteria.

This mismatch between technical accuracy and practical utility led us to develop SCORE, a new semantic evaluation framework for generative document parsing. Rather than fixating on exact format matching, SCORE evaluates what truly matters: whether a vision-language model actually understood and preserved the document's content, structure and meaning. You can explore the full methodology in our paper: https://t.co/30Onttt504

This philosophy of prioritizing substance over form extends to how we think about parsing solutions more broadly. The best outcomes come from flexibility: choosing approaches that align with your specific data, workflows, and business requirements. That's precisely why Unstructured doesn't lock you into a single parser or model. Our platform provides multiple parsing strategies and seamlessly integrates with leading vision-language models (from Claude to GPT-4o to Gemini), with new options continuously added as the field advances.

Every solution we offer is rigorously benchmarked against real-world scenarios. As new models and techniques emerge, we evaluate them, optimize their performance, and make them available to you. This means you're never constrained by today's capabilities and continuously benefit from tomorrow's advancements without the friction of constant migration or integration work.
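To make the format-versus-content problem concrete, here is a toy comparison. It is not SCORE (the paper linked above defines that); it only contrasts strict string equality with a simple token-overlap F1. The sample strings and function names are invented for illustration.

```python
import re
from collections import Counter


def exact_match(pred: str, ref: str) -> bool:
    """Rigid metric: any formatting difference counts as a miss."""
    return pred == ref


def token_f1(pred: str, ref: str) -> float:
    """Format-tolerant metric: F1 over lower-cased alphanumeric tokens."""
    pred_tokens = Counter(re.findall(r"\w+", pred.lower()))
    ref_tokens = Counter(re.findall(r"\w+", ref.lower()))
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference text and a model output that renders it as a table.
ref = "Total: $1,200\nDue: 2024-01-31"
pred = "| Total | $1,200 |\n| Due | 2024-01-31 |"

if __name__ == "__main__":
    print("exact match:", exact_match(pred, ref))
    print("token F1:   ", token_f1(pred, ref))
```

Here the two strings carry the same tokens in different layouts, so the exact-match check fails while the token-level score does not penalize the table formatting. A real semantic metric would also need to account for structure and ordering, which this toy ignores.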

9 months ago

DIY data processing solutions work, until their maintenance costs eclipse their value.

We’re proud that thousands of teams run production systems on Unstructured OSS. That’s why we open-sourced it in the first place.

Over the years, we’ve seen a clear pattern emerge. Teams start strong with DIY pipelines built on OSS: prototypes succeed, workflows get automated, and RAG pipelines begin powering real applications. But as usage grows (more documents, more complexity, more enterprise requirements), maintaining these systems quietly becomes many full-time jobs. Teams find themselves spending cycles on orchestration, retries, scaling, and troubleshooting, rather than building the features that differentiate their product.

From helping hundreds of customers navigate this transition, one lesson stands out: knowing when to consider managed infrastructure isn’t about whether OSS works; it’s about recognizing when your engineering effort could be better spent elsewhere. Managed infrastructure allows teams to focus on delivering value while still benefiting from the base capabilities OSS provides, with the added advantage of continuous improvements, advanced parsing, extensive metadata enrichment features, and more, all handled by a team of experts.

In our upcoming webinar, we’ll share these signals and lessons in more detail, helping you decide when it’s time to move from OSS to the Unstructured ETL+ platform:

📷 When to Move from Open Source to the Unstructured ETL+ Platform https://t.co/dUoNSd4GJ9

_Brian_Raymond retweeted

Daniel Schofield @UnstructuredDan

9 months ago

Notice something different? We just had a glow up. ✨ But our website update isn’t just skin deep. What really matters is what’s inside: a guide to taming your unstructured data, whatever form it takes. Stop dilly dallying. Get your data. 👉 https://t.co/k55aFWcoGJ @_Brian_Raymond @ctmaddock #RAG #AI #GenAI #ETL #UnstructuredData #LLM #MCP #EnterpriseAI #RAGinProduction #Transformation #Quality #LLMready #SourceConnectors #DocumentTransformation #DocTransformation #Transform #OCR #DocumentAI #VLM #StructuredData #DataQuality #Parsing #Unstructured #TheGenAIDataCompany

_Brian_Raymond retweeted

9 months ago

At @UnstructuredIO, we often get the question "how well do you perform on scanned forms that include handwriting?"

These documents are notoriously among the most difficult to ingest cleanly and reliably, yet they remain ubiquitous across many industries and are especially prevalent in healthcare, insurance, and similar domains.

Our short answer? Brilliantly. But we encourage you to see for yourself via our free trial! → https://t.co/83noec85wV

Our industry-leading VLM partitioner is designed to tackle the most complex documents across all business domains, but it is especially powerful when it comes to scanned, rotated/skewed, and/or handwritten documents. Parsing these documents with less sophisticated parsers results in one or more of the following: strings of gibberish characters due to inaccurate OCR; signatures treated as blobs; form fields lost; checkboxes ignored; marginal notes dropped entirely; or worse.

By leveraging state-of-the-art models and grounding our VLM partitioner in a rich document element ontology, we produce rich, clean parses of these documents, without collapsing the document's structural context:
- Handwritten fields captured as structured inputs with handwriting transcribed
- Checkboxes encoded as checkboxes, not flattened text
- Signatures and logos preserved distinctly
- Page numbers and layout context retained
- Layouts and sections captured

The result: even your most complex, analog-origin documents are parsed into a consistent, auditable structure that downstream systems (data entry, RAG, compliance, analytics) can trust.

See an example below: a scanned, tilted, complex medical form, filled in by hand with dummy data, on the left and our parsed, rendered, stylized HTML on the right. Of course, where VLMs and handwriting are concerned, very few parses will be 100% perfect, but even for complex, messy forms like this, you can often expect accuracy in the very high 90s for both layout and textual content from our partitioner. This example evaluated at roughly 98% or better for both content and layout accuracy.

Want to learn more? Join us for my upcoming webinar:
Document Transformation Quality Series: Pushing the Boundaries of Document Transformation Quality - https://t.co/8OB8iIvH4M

#DocumentAI #Handwriting #ScannedDocs #VLM #Ontology #DataQuality #ScannedForms

_Brian_Raymond retweeted

10 months ago

We're excited to announce that Unstructured has joined the @PalantirTech FedStart program to accelerate our path to FedRAMP High and IL-5 compliance.

Through FedStart, we will leverage Palantir’s proven security and accreditation expertise to fast-track the deployment of secure, GenAI-ready data pipelines for federal agencies. This enables agencies to transform complex, multimodal data into AI-ready outputs at speed, while meeting the highest compliance standards.

By streamlining compliance, we remove the data engineering bottlenecks that slow AI adoption, helping agencies focus resources on mission-critical GenAI applications instead of infrastructure and integration hurdles. This collaboration underscores our mission: turning unstructured multimodal data into a strategic advantage for the federal government.

Read the full PR here → https://t.co/zxGGktlvSo

#AI #GenAI #ETL #ETL+ #RAG #UnstructuredData #LLM #MCP #Compliance #Security #EnterpriseAI #RAGinProduction #LLMready #Unstructured #TheGenAIDataCompany

_Brian_Raymond retweeted

11 months ago

📢 National Archives: MLK Files Now Available for RAG Use

Unstructured has extracted the full text from the newly declassified files related to the assassination of Dr. Martin Luther King Jr. We’ve released a structured, machine-readable corpus designed to support research, analysis, and responsible GenAI applications, especially in RAG pipelines.

This is just the beginning. More context and resources will follow.

📂 Access the dataset: https://t.co/Syx8jtMRFL

#MLKFiles #UnstructuredData #GenAI #RAG #ResponsibleAI #DocumentProcessing #AI #GenAI #ETL #ETL+ #RAG #UnstructuredData #LLM #MCP #EnterpriseAI #LLMready #DataEngineering #Unstructured #TheGenAIDataCompany

@USNatArchives @ABC @nytimes @washingtonpost @CNN @nbc @NBCNews @abcnews @Reuters @AP @NPR @TIME @BBC

_Brian_Raymond retweeted

11 months ago

ICYMI: We’re hiring! We’re building big things, and we need smarties like you to do it 🤓 Come join our team of passionate, innovative engineers today 👉 https://t.co/8a8pwnEbDT #WhateverItIsWeCanStructureIt

_Brian_Raymond retweeted

11 months ago

📣 We’re hiring! We’re growing faster than ever, and we need more incredible engineers to join our team. Could that be you? 👀👇

🔹 Field Infrastructure Engineer
🔹 Solutions Engineer (Post-Sales)
🔹 Solutions Architect (Pre-Sales)
🔹 Principal Software Engineer
🔹 Staff Software Engineer
🔹 Founding SMB Account Exec

Unstructured is ranked #24 on Fast Company’s Most Innovative Companies List, trusted by 82% of the Fortune 1000, and used by over 65,000 organizations worldwide.

👉 https://t.co/IBuPmraciR

Let’s build something great together!

#WhateverItIsWeCanStructureIt #AI #GenAI #ETL #ETL+ #RAG #UnstructuredData #LLM #MCP #EnterpriseAI #RAGinProduction #LLMready #Unstructured #TheGenAIDataCompany #NowHiring #LifeAtUnstructured #UnstructuredCulture #JoinOurTeam

_Brian_Raymond retweeted