Maarten Van Segbroeck @mvansegb - Twitter Profile

Pinned Tweet

2 months ago

Synthetic datasets on @huggingface should ship with the pipeline that built them. 🤗 NeMo Data Designer now does exactly that. .push_to_hub() uploads the dataset and the pipeline behind it. One line to reproduce 👇

mvansegb retweeted

Bryan Catanzaro

@ctnzr

15 days ago

NVIDIA Nemotron 3 Ultra is now live! Frontier accuracy, 5X greater speed, 30% lower cost. Deploy however you need - on-premise, on the cloud, or at the edge. Model is live on HuggingFace under the OpenMDW 1.1 license. https://t.co/IOfAwv3jB6

Maarten Van Segbroeck

@mvansegb

29 days ago

Every synthetic dataset comes with two scores 🎯 One for synthetic data quality — measured by comparing distributions, correlations, and structure to the original data. One for privacy — produced by running membership and attribute inference attacks to test for training-data leakage. ⭐️ Star the repo and try it out: https://t.co/BVrmb8ZiIQ

Maarten Van Segbroeck

@mvansegb

29 days ago

The most valuable data in any enterprise is the data you're not allowed to use 🔒 patient records. financial transactions. customer logs, etc. Therefore, we open-sourced NeMo Safe Synthesizer, NVIDIA's end-to-end pipeline to unlock that data by optimizing the tradeoff between quality and privacy. Apache 2.0.

Maarten Van Segbroeck

@mvansegb

29 days ago

NeMo Safe Synthesizer takes your sensitive dataset through four stages: 🧹 strip PII 🔐 fine-tune privately 🎲 generate safely 🛡️ evaluate against real attacks All configurable.

Maarten Van Segbroeck

@mvansegb

about 1 month ago

Custom is the rule in synthetic data. Frameworks should be built that way. The right design isn’t to make custom possible. It’s to make custom feel like every other component: typed config, dependency-ordered, declarative. No more glue code around the framework. Plugins do exactly that. Shipped in NeMo Data Designer v0.6.0. 🔌

Johnny Greco

@johnnypgreco

about 1 month ago

🔌 🎨 Have it your way with Data Designer plugins! I have been waiting for this release since we open sourced Data Designer. As of v0.6.0, plugins are stable and out of experimental mode 🙌 Need a custom seed reader for an internal corpus? A simulator-backed column generator? A processor that formats records for your trainer? Now you can package that logic as a plugin with a typed config, then share it with your team or the broader community. Data Designer handles the framework work, so your customization plugs into the same declarative workflow. Check out the Dev Note by @fujikanaeda and me for more info: 🔗 Dev Note https://t.co/LaSejx5cHq 🔗 First-party plugins https://t.co/yWjs6wri7p

Maarten Van Segbroeck

@mvansegb

about 2 months ago

Nemotron 3 Nano Omni shipped. The long-doc reasoning is real - and synthetic data did a lot of the heavy lifting. Proud of the NeMo Data Designer team's contribution. Full SDG story in the dev note👇 🔗 https://t.co/oA0GJA3EqD

NVIDIA AI

@NVIDIAAI

about 2 months ago

Meet Nemotron 3 Nano Omni 👋 Our latest addition to the Nemotron family is the highest efficiency, open multimodal model with leading accuracy. 30B parameters. 256K context length. 🧵👇

mvansegb retweeted

F.O.L.A

@folaoftech

about 2 months ago

I love Claude 😁❤️

Maarten Van Segbroeck

@mvansegb

about 2 months ago

Every language deserves training data designed for it. Not translated into it. Personas are how you get there, built into NeMo Data Designer. First Korean persona dataset, #1 on @huggingface in a day. Congrats to the team!

NVIDIA AI

@NVIDIAAI

about 2 months ago

Super cool to see Nemotron-Personas-Korea hit #1 on @huggingface. It's also the first Korean persona dataset. Huge shoutout to the team and the dev community pushing open datasets forward 💚

mvansegb retweeted

NVIDIA Omniverse

@nvidiaomniverse

about 2 months ago

Physical AI is transforming manufacturing, from design to the factory floor. 🦾 Hear how leaders from @ABBRobotics, @JLR_News, and @TulipInterfaces are using AI-powered simulation, synthetic data, and real-time video analytics to unlock new levels of efficiency across the entire product lifecycle. Watch the full video ➡️ https://t.co/upn5TWsoE6

Maarten Van Segbroeck

@mvansegb

2 months ago

Shipped with @vanstriendaniel, @davidberenstei, and @Wauplin from @huggingface. 🧡 The best SDG features don’t come from a roadmap. They come from the people actually using the tools. 🔗 https://t.co/VN28asjVHk

Maarten Van Segbroeck

@mvansegb

2 months ago

DataDesignerConfigBuilder.from_config(hf_url) That one line loads the entire pipeline — models, columns, processors. Fork it, rerun it, ship v2. The HF URL is the synthetic dataset artifact. Not just where the parquet lives. ✨

mvansegb retweeted

Eric W. Tramel

@fujikanaeda

2 months ago

updated Nemotron 3 Super tech report now available on arXiv :) https://t.co/Usg6BrZkyy

Maarten Van Segbroeck

@mvansegb

2 months ago

The pipeline: sampling → prompt gen → schema gen → SQL gen → quality waterfall 🔓 300K generated. 96.5K kept. 68% rejected 🗑️ Shipped in Nemotron Super v3 SFT 🚀 Full pipeline open-sourced in NeMo Data Designer. Fork it. Swap your schemas. Generate your own. https://t.co/0t1js7risk
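The yield figures in this tweet can be checked with a few lines of arithmetic. This is only a sketch: the counts are the rounded values quoted above (300K and 96.5K), so the exact totals are an assumption, and the variable names are illustrative.

```python
# Rounded counts quoted in the tweet; the exact totals are assumed.
generated = 300_000
kept = 96_500

rejected = generated - kept
rejection_rate = rejected / generated
keep_rate = kept / generated

print(f"rejected: {rejected:,} ({rejection_rate:.1%})")
print(f"kept:     {kept:,} ({keep_rate:.1%})")
```

Under these assumed counts, roughly two of every three generated records are discarded by the quality waterfall, which matches the quoted 68% once rounded to a whole percent.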

Maarten Van Segbroeck

@mvansegb

2 months ago

+15 points on BIRD 📈 Not from scaling parameters. From engineering the training data. Synthetic data pipeline raised Nemotron Super’s text-to-SQL accuracy from 26.77% → 41.80% — outperforming GPT-OSS-120B (38.25%). What we learned 🧵↓
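For readers who want the "+15 points" figure spelled out, the sketch below recomputes it from the three accuracies quoted in this tweet. The variable names are illustrative, and the relative-gain figure is derived here rather than stated in the source.

```python
# BIRD text-to-SQL accuracies (percent) as quoted in the tweet.
baseline = 26.77      # Nemotron Super before the synthetic data pipeline
tuned = 41.80         # Nemotron Super after
gpt_oss_120b = 38.25  # comparison model

gain = round(tuned - baseline, 2)            # absolute gain in points
margin = round(tuned - gpt_oss_120b, 2)      # lead over GPT-OSS-120B in points
relative_gain = (tuned - baseline) / baseline  # gain relative to the baseline

print(f"absolute gain: {gain} points")
print(f"margin over GPT-OSS-120B: {margin} points")
print(f"relative gain: {relative_gain:.0%}")
```

The absolute gain is in percentage points, which is why it is reported as "+15" even though the relative improvement over the baseline is considerably larger.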

Maarten Van Segbroeck

@mvansegb

2 months ago

Most text-to-SQL datasets assume the happy path. Clean schemas, 2-5 tables, obvious names. Production? 50 tables. sales_orders next to sales_orders_archive. Dates as text. Currency with $ symbols. JSON blobs hiding critical flags 🫠 So we built the mess on purpose — distractor tables, distractor columns, dirty data by taxonomy. The model learns to ignore noise.
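A toy illustration of the idea, not the NeMo Data Designer implementation: the helpers below (`add_distractors`, `dirty_currency`) and the suffix list are hypothetical names invented for this sketch. It clones existing tables under near-duplicate names and renders a numeric amount as dirty text with a $ symbol.

```python
import random

# Hypothetical suffixes for near-duplicate table names.
SUFFIXES = ["_archive", "_backup", "_old", "_staging"]

def add_distractors(schema, rng, n_tables=2):
    """Return a copy of `schema` with near-duplicate distractor tables added."""
    noisy = {name: list(cols) for name, cols in schema.items()}
    picked = rng.sample(sorted(schema), k=min(n_tables, len(schema)))
    for name in picked:
        # Same columns, confusingly similar name.
        noisy[name + rng.choice(SUFFIXES)] = list(schema[name])
    return noisy

def dirty_currency(value):
    """Render a numeric amount as text with a $ symbol and thousands separators."""
    return f"${value:,.2f}"

schema = {
    "sales_orders": ["id", "order_date", "total"],
    "customers": ["id", "name"],
}
noisy_schema = add_distractors(schema, random.Random(0))
print(sorted(noisy_schema))
print(dirty_currency(1234.5))
```

A query generated against `noisy_schema` still has its answer in the original tables, so a model trained on such examples has to learn to pick the right table and to parse the text-encoded amounts.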

Maarten Van Segbroeck

@mvansegb

2 months ago

@andrew_n_carr Thanks Andrew! Just getting started 🚀

Maarten Van Segbroeck

@mvansegb

2 months ago

1.5K+ ⭐ · 135+ forks · growing fast. Need data? Data Designer. Go build.

Python Trending 🇺🇦 @pythontrending

2 months ago

DataDesigner - 🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data. https://t.co/K1Oo8N9Wzt

mvansegb retweeted

Unsloth AI

@UnslothAI

3 months ago

Transform PDF, CSV, DOCX, TXT, or any other file into structured synthetic datasets via Unsloth Data Recipes. Build and edit your datasets visually via a graph-node workflow and use them for fine-tuning. Powered by @NVIDIA DataDesigner.

mvansegb retweeted

CV.YH

@0xCVYH

2 months ago

NVIDIA released DataDesigner as open source and few people are talking about it. It's a tool for generating high-quality synthetic data to train AI models. If you've ever struggled with a dataset that was too small or too expensive, this solves it. Synthetic data is the future of fine-tuning, and NVIDIA just democratized it. https://t.co/0tdp8sOWT7
