NVIDIA Nemotron 3 Ultra is now live!
Frontier accuracy, 5X greater speed, 30% lower cost.
Deploy however you need - on-premise, on the cloud, or at the edge.
Model is live on HuggingFace under the OpenMDW 1.1 license.
https://t.co/IOfAwv3jB6
Every synthetic dataset comes with two scores 🎯
One for synthetic data quality — measured by comparing distributions, correlations, and structure to the original data.
One for privacy — produced by running membership and attribute inference attacks to test for training-data leakage.
⭐️ Star the repo and try it out: https://t.co/BVrmb8ZiIQ
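To make "comparing distributions" concrete, here is a minimal sketch of one possible per-column quality score. It uses total variation distance as a stand-in metric; this is an illustrative assumption, not the scoring method NeMo Safe Synthesizer actually uses, and the function names are hypothetical.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """Half the L1 distance between two empirical categorical distributions."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )


def column_quality_score(real, synthetic):
    """Map the distance to a 0-1 score; 1.0 means identical distributions."""
    return 1.0 - total_variation_distance(real, synthetic)


# Toy columns: the synthetic column shifts some mass from "A" to "B".
real_col = ["A", "A", "B", "C"]
synth_col = ["A", "B", "B", "C"]
score = column_quality_score(real_col, synth_col)
print(f"quality score: {score:.2f}")
```

A real evaluation would also cover correlations and structure, and the privacy score would come from actual attack simulations, which this sketch does not attempt.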
The most valuable data in any enterprise is the data you're not allowed to use 🔒
Patient records, financial transactions, customer logs, and more.
So we open-sourced NeMo Safe Synthesizer, NVIDIA's end-to-end pipeline for unlocking that data by optimizing the tradeoff between quality and privacy.
Apache 2.0.
NeMo Safe Synthesizer takes your sensitive dataset through four stages:
🧹 strip PII
🔐 fine-tune privately
🎲 generate safely
🛡️ evaluate against real attacks
All configurable.
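For a feel of what the first stage involves, here is a deliberately simple sketch of PII stripping with regular expressions. The patterns, placeholder tokens, and function name are assumptions for illustration only; the post does not say how NeMo Safe Synthesizer detects PII, and production detection is far more thorough than two regexes.

```python
import re

# Hypothetical patterns: emails and simple 10-digit phone numbers only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def strip_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


redacted = strip_pii("Contact jane.doe@example.com or 555-123-4567.")
print(redacted)
```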
Custom is the rule in synthetic data. Frameworks should be built that way.
The right design isn’t to make custom possible. It’s to make custom feel like every other component: typed config, dependency-ordered, declarative.
No more glue code around the framework.
Plugins do exactly that. Shipped in NeMo Data Designer v0.6.0. 🔌
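What "typed config, dependency-ordered, declarative" can look like in practice, as a minimal sketch: this is not the Data Designer plugin API. `ColumnConfig` and `execution_order` are hypothetical names, and the ordering uses Python's standard `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class ColumnConfig:
    """Hypothetical typed config for one column generator."""
    name: str
    depends_on: tuple[str, ...] = ()


def execution_order(columns):
    """Return column names so each column comes after its dependencies."""
    graph = {c.name: set(c.depends_on) for c in columns}
    return list(TopologicalSorter(graph).static_order())


# Declared in arbitrary order; the framework works out the run order.
columns = [
    ColumnConfig("sql", depends_on=("schema", "prompt")),
    ColumnConfig("prompt", depends_on=("topic",)),
    ColumnConfig("schema", depends_on=("topic",)),
    ColumnConfig("topic"),
]
order = execution_order(columns)
print(order)
```

The point of the design: a custom component only declares what it needs, and ordering stays the framework's job, so no glue code is required.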
🔌 🎨 Have it your way with Data Designer plugins!
I have been waiting for this release since we open sourced Data Designer. As of v0.6.0, plugins are stable and out of experimental mode 🙌
Need a custom seed reader for an internal corpus? A simulator-backed column generator? A processor that formats records for your trainer?
Now you can package that logic as a plugin with a typed config, then share it with your team or the broader community. Data Designer handles the framework work, so your customization plugs into the same declarative workflow.
Check out the Dev Note by @fujikanaeda and me for more info:
🔗 Dev Note
https://t.co/LaSejx5cHq
🔗 First-party plugins
https://t.co/yWjs6wri7p
Nemotron 3 Nano Omni shipped. The long-doc reasoning is real - and synthetic data did a lot of the heavy lifting.
Proud of the NeMo Data Designer team's contribution. Full SDG story in the dev note👇
🔗 https://t.co/oA0GJA3EqD
Meet Nemotron 3 Nano Omni 👋
Our latest addition to the Nemotron family is the highest-efficiency open multimodal model with leading accuracy.
30B parameters. 256K context length. 🧵👇
Every language deserves training data designed for it. Not translated into it. Personas are how you get there, built into NeMo Data Designer. First Korean persona dataset, #1 on @huggingface in a day. Congrats to the team!
Super cool to see Nemotron-Personas-Korea hit #1 on @huggingface. It's also the first Korean persona dataset.
Huge shoutout to the team and the dev community pushing open datasets forward 💚
Physical AI is transforming manufacturing, from design to the factory floor. 🦾
Hear how leaders from @ABBRobotics, @JLR_News, and @TulipInterfaces are using AI-powered simulation, synthetic data, and real-time video analytics to unlock new levels of efficiency across the entire product lifecycle.
Watch the full video ➡️ https://t.co/upn5TWsoE6
Shipped with @vanstriendaniel, @davidberenstei, and @Wauplin from @huggingface. 🧡
The best SDG features don’t come from a roadmap. They come from the people actually using the tools.
🔗 https://t.co/VN28asjVHk
Synthetic datasets on @huggingface should ship with the pipeline that built them. 🤗
NeMo Data Designer now does exactly that.
.push_to_hub() uploads the dataset and the pipeline behind it.
One line to reproduce 👇
DataDesignerConfigBuilder.from_config(hf_url)
That one line loads the entire pipeline — models, columns, processors. Fork it, rerun it, ship v2.
The HF URL is the synthetic dataset artifact. Not just where the parquet lives. ✨
The pipeline: sampling → prompt gen → schema gen → SQL gen → quality waterfall 🔓
300K generated. 96.5K kept. 68% rejected 🗑️
Shipped in Nemotron Super v3 SFT 🚀 Full pipeline open-sourced in NeMo Data Designer.
Fork it. Swap your schemas. Generate your own.
https://t.co/0t1js7risk
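A quality waterfall is a chain of filters where each stage only sees what survived the previous one. The sketch below is an assumption-heavy illustration: the stage names, thresholds, and record fields are hypothetical, not the stages used in the actual pipeline. The last line simply checks the arithmetic behind the 68% figure.

```python
def run_waterfall(records, stages):
    """Apply filters in order; return survivors and per-stage rejection counts."""
    rejected = {}
    for name, keep in stages:
        survivors = [r for r in records if keep(r)]
        rejected[name] = len(records) - len(survivors)
        records = survivors
    return records, rejected


# Hypothetical stages, cheapest check first.
stages = [
    ("non_empty", lambda r: bool(r["sql"].strip())),
    ("is_select", lambda r: r["sql"].lstrip().upper().startswith("SELECT")),
    ("judge_score", lambda r: r["score"] >= 0.8),
]
candidates = [
    {"sql": "SELECT id FROM orders", "score": 0.9},
    {"sql": "", "score": 0.9},
    {"sql": "DROP TABLE orders", "score": 0.95},
    {"sql": "SELECT * FROM users", "score": 0.4},
]
kept, rejected = run_waterfall(candidates, stages)

# 300K generated, 96.5K kept, as stated above.
rejection_rate = 1 - 96_500 / 300_000
print(len(kept), rejected, f"{rejection_rate:.1%}")
```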
+15 points on BIRD 📈
Not from scaling parameters. From engineering the training data.
Our synthetic data pipeline raised Nemotron Super’s text-to-SQL accuracy from 26.77% to 41.80%, outperforming GPT-OSS-120B (38.25%).
What we learned 🧵↓
Most text-to-SQL datasets assume the happy path. Clean schemas, 2-5 tables, obvious names.
Production? 50 tables. sales_orders next to sales_orders_archive. Dates as text. Currency with $ symbols. JSON blobs hiding critical flags 🫠
So we built the mess on purpose — distractor tables, distractor columns, dirty data by taxonomy.
The model learns to ignore noise.
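Building the mess on purpose could look roughly like this. It is a sketch under stated assumptions: the helper names, the `_archive` suffix rule, and the noise column names are made up for illustration and are not taken from the released pipeline or its taxonomy.

```python
import random


def add_distractors(schema, rng):
    """Return a copy of the schema with look-alike tables and noise columns."""
    noisy = {table: list(cols) for table, cols in schema.items()}
    for table, cols in schema.items():
        # Distractor table: same columns, confusingly similar name.
        noisy[f"{table}_archive"] = list(cols)
        # Distractor column: plausible but irrelevant to any question.
        noisy[table].append(
            rng.choice(["legacy_flag", "tmp_notes", "etl_batch_id"])
        )
    return noisy


def dirty_currency(amount):
    """Render a number the way messy tables store it: text with a $ symbol."""
    return f"${amount:,.2f}"


schema = {"sales_orders": ["order_id", "order_date", "total"]}
noisy_schema = add_distractors(schema, random.Random(0))
print(noisy_schema)
print(dirty_currency(1234.5))
```

Training examples built on the noisy schema still have queries that touch only the real tables and columns, which is what teaches the model to ignore the rest.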
Transform PDF, CSV, DOCX, TXT, or any other files into structured synthetic datasets via Unsloth Data Recipes.
Build and edit your datasets visually via a graph-node workflow and use them for fine-tuning. Powered by @NVIDIA DataDesigner.
NVIDIA released DataDesigner as open source and hardly anyone is talking about it. It's a tool for generating high-quality synthetic data to train AI models. If you've ever struggled with a dataset that was too small or too expensive, this solves it. Synthetic data is the future of fine-tuning, and NVIDIA just democratized it. https://t.co/0tdp8sOWT7