Synthetic Data Vault

@sdv_dev

Join our growing ecosystem of #opensource libraries & resources for generating #SyntheticData for different data modalities. Created at @lab_dai, MIT.

Cambridge, MA

Joined September 2020

46 Following

377 Followers

260 Posts

Pinned Tweet

Synthetic Data Vault

@sdv_dev

over 2 years ago

Today, DataCebo launched SDV Enterprise & raised $8.5M in VC. SDV Enterprise is a commercial model of the source-available Synthetic Data Vault (SDV). It makes it easy to develop, manage & deploy #generativeAI models for apps when real data is limited. https://t.co/8DqHCxPDMv

sdv_dev retweeted

Akshay 🚀

@akshay_pachaar

about 1 year ago

Generate synthetic data at scale!

SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data.

Here's how it works in 3 steps:

1️⃣ Train: Point SDV at your real table; it will capture the underlying distributions & relationships.

2️⃣ Generate: Run the trained SDV model to pop out as many look-alike rows as you need—no real data exposed.

3️⃣ Validate: Use SDV’s quality report to see how closely the generated data matches the real stuff; tweak and repeat if you want it tighter.

Class imbalance—solved in one shot! ✨

Key features:

🧠 Multiple models from GaussianCopula to CTGAN
🔗 Single, multi & sequential-table support
🔒 Built-in anonymization & logical constraints
⚙️ Single call does it all `sdv.sample()`

Link to the GitHub repo in next tweet!
____
Share this with your network if you found this insightful ♻️
Follow me ( @akshay_pachaar ) for more insights and tutorials on AI and Machine Learning!
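
The train, generate, and validate steps in the tweet above can be illustrated with a toy, standard-library sketch for a single categorical column. This is not SDV code: the `train`, `generate`, and `tv_complement` helpers and the room-type data are made up for illustration, and the score is simply 1 minus the total variation distance between category frequencies. Note that in SDV 1.x the corresponding calls are `fit` and `sample` on a synthesizer object rather than a top-level `sdv.sample()`.

```python
import random
from collections import Counter


def train(column):
    """Step 1 (train): learn the category frequencies of a real column."""
    counts = Counter(column)
    total = len(column)
    return {category: n / total for category, n in counts.items()}


def generate(model, num_rows, seed=0):
    """Step 2 (generate): draw new values from the learned frequencies."""
    rng = random.Random(seed)
    categories = list(model)
    weights = [model[c] for c in categories]
    return rng.choices(categories, weights=weights, k=num_rows)


def tv_complement(real, synthetic):
    """Step 3 (validate): 1 - total variation distance; 1.0 is a perfect match."""
    real_freq, synth_freq = train(real), train(synthetic)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in categories
    )
    return 1.0 - distance


# Made-up "real" column: 70% BASIC, 25% DELUXE, 5% SUITE.
real_room_types = ["BASIC"] * 70 + ["DELUXE"] * 25 + ["SUITE"] * 5

model = train(real_room_types)
synthetic_room_types = generate(model, num_rows=1000)
score = tv_complement(real_room_types, synthetic_room_types)
print(f"quality score: {score:.3f}")
```

A score close to 1.0 means the synthetic frequencies track the real ones; a real synthesizer also has to capture relationships between columns, which this toy ignores.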

Synthetic Data Vault

@sdv_dev

about 1 year ago

Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced cheaper than basic rooms.

GaussianCopulaSynthesizer automatically learns and maintains these relationships, creating synthetic data that preserves the statistical patterns of your original dataset.

⭐️ Full code: https://t.co/MwjN2e7FBO
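
To make the idea behind a Gaussian copula concrete, here is a minimal standard-library sketch for two numeric columns. It is an illustration of the general technique, not the code behind the link above and not how GaussianCopulaSynthesizer is implemented. The room-size and price data are invented, and the marginals are reproduced by reusing observed values through empirical quantiles, whereas a real synthesizer would typically fit distributions instead.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def normal_scores(values):
    """Map each value to a standard-normal score via its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def empirical_quantile(sorted_values, u):
    """Return the observed value at quantile u (0 <= u <= 1)."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def fit_and_sample(col_a, col_b, num_rows, seed=0):
    """Learn the dependence between two columns, then sample new pairs."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        # Draw a correlated pair of standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        # Map each normal back to the column's own value range.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows


# Invented "real" data: larger rooms cost more, with some noise.
rng = random.Random(42)
sizes = [rng.uniform(15, 80) for _ in range(500)]
prices = [50 + 4 * s + rng.gauss(0, 20) for s in sizes]

synthetic = fit_and_sample(sizes, prices, num_rows=1000)
synth_sizes = [row[0] for row in synthetic]
synth_prices = [row[1] for row in synthetic]
print("real correlation:     ", round(pearson(sizes, prices), 2))
print("synthetic correlation:", round(pearson(synth_sizes, synth_prices), 2))
```

Sampling each column independently would drive the synthetic correlation toward zero, which is how a large room ends up with a bargain price.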

Synthetic Data Vault

@sdv_dev

about 1 year ago

Many businesses collect and store their customers’ GPS locations to help improve their products. But GPS locations may contain precise locations of people’s homes. Businesses are cautious about sharing this data even with internal teams, as it may reveal private information about people they know.

For example, a food delivery application stores the GPS location associated with each delivery. An internal product team wants to use this data to improve the local restaurant recommendations the application makes to users for future orders. The company needs a way to preserve local insights on the best restaurants from the GPS location data without exposing sensitive user locations.

One anonymization approach they could take is replacing every collected GPS location with a randomly chosen one from within the same postal code. Users tend to order from restaurants in the same or neighboring postal codes, so the integrity of local trends is still preserved.

To implement this approach, they would need a dataset that contains the geographic boundaries for each postal code and an algorithm for identifying the postal code from a GPS location. To make this process seamless, we created the MetroAreaAnonymizer.

With just a few lines of code, you can use the MetroAreaAnonymizer to replace GPS locations with a randomly chosen one from the same postal code. MetroAreaAnonymizer is part of our RDT library, which contains many helpful transformations for your raw data.
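
The anonymization approach described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the MetroAreaAnonymizer API: the postal codes `PC-1` and `PC-2`, their rectangular bounding boxes, and the helper names are all hypothetical, and real postal-code boundaries are irregular polygons rather than boxes.

```python
import random

# Hypothetical postal codes with made-up bounding boxes:
# (min_lat, max_lat, min_lon, max_lon). Real boundaries are polygons.
POSTAL_CODE_BOXES = {
    "PC-1": (42.355, 42.375, -71.115, -71.095),
    "PC-2": (42.358, 42.370, -71.090, -71.075),
}


def find_postal_code(lat, lon):
    """Return the postal code whose box contains the point, or None."""
    for code, (min_lat, max_lat, min_lon, max_lon) in POSTAL_CODE_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return code
    return None


def anonymize(lat, lon, rng):
    """Replace a GPS point with a random point in the same postal code."""
    code = find_postal_code(lat, lon)
    if code is None:
        raise ValueError("location is outside every known postal code")
    min_lat, max_lat, min_lon, max_lon = POSTAL_CODE_BOXES[code]
    return rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon)


rng = random.Random(0)
deliveries = [(42.3601, -71.1000), (42.3650, -71.0800)]
anonymized = [anonymize(lat, lon, rng) for lat, lon in deliveries]

for original, replaced in zip(deliveries, anonymized):
    print(find_postal_code(*original), original, "->", replaced)
```

Each delivery keeps its postal code, so neighborhood-level trends survive, while the exact coordinates no longer point at anyone's home.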

📚 Learn more about MetroAreaAnonymizer here: https://t.co/xjICTiEIVG

📚 Learn about RDT here: https://t.co/tIgwSxn6gU

📚 Learn more about the SDV here: https://t.co/aeuBqPb5Xh

#syntheticdata #machinelearning #anonymization #geospatial

Synthetic Data Vault

@sdv_dev

about 1 year ago

Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data.

Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to their entire reader base. They trained an AI model on their real data and used it to generate synthetic data.

Before they can incorporate this synthetic data into the test environment, however, it must meet some minimum criteria for the application to function properly. Here are some examples of criteria that the synthetic data must meet:

1. Data Validity: Primary keys must be unique and non-null.

Many features need to retrieve a specific row in a table using a unique identifier. For example, to authenticate a user, the application needs to find the specific row corresponding to their unique user_id value.

2. Data Structure: Data types, column names, and table names should match those in the real data.

Application code that retrieves or updates data using specific column names, column types, and table names will otherwise raise errors, for example when the application needs to update a user’s settings.

3. Relationship Validity: Each foreign key must have a reference to a valid primary key (also known as referential integrity).

Many features in the app require joining data from multiple tables, like the recommended articles feature. Without referential integrity, the retrieved data might contain a subset or none of the recommended articles for the user.

To help them validate that the synthetic data meets the minimum criteria for usability, they could use the SDV’s Diagnostic Report. This report runs all of our basic data format and validity checks by comparing the real and synthetic data.

The Diagnostic Report is part of our open-source and vendor-neutral SDMetrics library. Synthetic data generated by the default synthesizers in the SDV will always result in 100% diagnostic scores. We call this the SDV Guarantee.

If the SDV ever generates synthetic data that doesn’t score 100% on the Diagnostic Report, then you’ve identified a bug! Please reach out to us on GitHub or Slack and we will prioritize investigating it.
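
The three criteria above can also be checked by hand. The sketch below is a plain-Python illustration of what each check means, not the SDMetrics Diagnostic Report itself. The table contents and function names are invented, and the structure check infers the expected types from the first real row, which is a simplifying assumption.

```python
def check_primary_key(rows, key):
    """Data validity: primary key values are unique and non-null."""
    values = [row.get(key) for row in rows]
    return None not in values and len(set(values)) == len(values)


def check_structure(real_rows, synthetic_rows):
    """Data structure: same column names and value types as the real data."""
    real_schema = {name: type(value) for name, value in real_rows[0].items()}
    return all(
        row.keys() == real_schema.keys()
        and all(isinstance(row[name], real_schema[name]) for name in real_schema)
        for row in synthetic_rows
    )


def check_foreign_keys(child_rows, foreign_key, parent_rows, primary_key):
    """Relationship validity: every foreign key points at an existing parent."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return all(row[foreign_key] in parent_keys for row in child_rows)


# Invented example tables for a news app.
real_users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Ben"}]
synthetic_users = [{"user_id": 7, "name": "Cleo"}, {"user_id": 8, "name": "Dev"}]
synthetic_articles = [
    {"article_id": 100, "user_id": 7},
    {"article_id": 101, "user_id": 8},
]
# A broken child table: user 99 does not exist in synthetic_users.
broken_articles = [{"article_id": 102, "user_id": 99}]

print(check_primary_key(synthetic_users, "user_id"))
print(check_structure(real_users, synthetic_users))
print(check_foreign_keys(synthetic_articles, "user_id", synthetic_users, "user_id"))
print(check_foreign_keys(broken_articles, "user_id", synthetic_users, "user_id"))
```

The last call is the case the recommended-articles example warns about: a join on `user_id` would silently return nothing for that row.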

📚 Learn more about the single-table Diagnostic Report: https://t.co/5ICHqDb5rt

📚 Learn more about the multi-table Diagnostic Report: https://t.co/6W0PYTRnOn

📚 Learn more about the SDV here: https://t.co/aeuBqPbDMP

#dataquality #generativeai #machinelearning #softwaretesting #syntheticdata

Synthetic Data Vault

@sdv_dev

about 1 year ago

One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule.

The one-to-many relationship is a common pattern in database schemas. An interesting variation of this pattern occurs when only some rows are allowed to have connections while others aren’t.

For example, a gym offers a premium membership tier that gives access to extra benefits (like pool access and sauna access). To record the perks available to each member, they use a members table and a benefits table.

Only the rows representing premium members are allowed to have connections to rows in the benefits table while the rows representing basic members are not. This enables the gym to store specific information for a subset of their membership in a separate table in a simple way.

We call this the ForeignToPrimaryKeySubset pattern because only a subset of the primary keys in the parent table have a 1-to-many relationship with the foreign keys in the child table.

If your data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.
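
As a rough illustration of what adhering to this pattern means, the following sketch checks a members table and a benefits table for rows that break the rule. It only validates data; it is not the CAG implementation, and the table contents, column names, and the `find_violations` helper are hypothetical.

```python
# Invented gym data: only premium members may have rows in the benefits table.
members = [
    {"member_id": 1, "tier": "premium"},
    {"member_id": 2, "tier": "basic"},
    {"member_id": 3, "tier": "premium"},
]
benefits = [
    {"benefit_id": 10, "member_id": 1, "perk": "pool"},
    {"benefit_id": 11, "member_id": 1, "perk": "sauna"},
    {"benefit_id": 12, "member_id": 3, "perk": "pool"},
]


def find_violations(members, benefits, allowed_tier="premium"):
    """Return benefit rows whose parent member is missing or not in the allowed tier."""
    allowed_ids = {m["member_id"] for m in members if m["tier"] == allowed_tier}
    return [b for b in benefits if b["member_id"] not in allowed_ids]


print(find_violations(members, benefits))

# A synthetic row that connects a basic member to a perk breaks the pattern.
bad_benefits = benefits + [{"benefit_id": 13, "member_id": 2, "perk": "sauna"}]
print(find_violations(members, bad_benefits))
```

A synthesizer that models only the plain one-to-many relationship could produce rows like the last one, which is the gap the pattern is meant to close.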

📚 Learn more about the ForeignToPrimaryKeySubset pattern here: https://t.co/tVEqBnqvc1

📚 Learn more about the CAG bundle here: https://t.co/8d6fbtrHgn

📚 Learn more about the SDV here: https://t.co/aeuBqPbDMP

#syntheticdata #generativeai #databases #machinelearning #datamodeling

Synthetic Data Vault

@sdv_dev

over 1 year ago

✈️ @Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models.

When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search parameters - route, fare class, trip dates, etc. To build interesting price prediction features for their customers, the Expedia team trains forecasting models on data they’ve collected but they wanted to improve prediction accuracy even further.

🛑 The Challenge

Even though millions of searches are made by users daily, the sheer number of combinations for possible routes, trip dates, and passenger counts is so large that there were a lot of combinations for which the team did not have the price. To develop a robust forecasting model, ideally the team would have at least one search a day for each of the combinations of the search parameters.

🤖 How They Incorporated Synthetic Data

To fill these gaps they built automated software that requests flight prices for specific search parameters. 🎯 Their goal with synthetic searches is to have at least one search a day for their most popular routes for the trip dates that fall within the upcoming months.

During the model training phase, they combine data from real user searches and from synthetic searches to ensure they have better data coverage.
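
One way to picture the gap-filling step is as a set difference between the combinations you want covered and the ones real users already searched today. The sketch below is a guess at the general shape, not Expedia's implementation: search parameters are reduced to a route and a trip date, and the routes, dates, and `plan_synthetic_searches` helper are invented.

```python
from datetime import date, timedelta
from itertools import product


def plan_synthetic_searches(popular_routes, trip_dates, real_searches_today):
    """List (route, trip_date) combinations with no real user search today."""
    covered = set(real_searches_today)
    return [
        combo for combo in product(popular_routes, trip_dates)
        if combo not in covered
    ]


today = date(2025, 1, 6)
trip_dates = [today + timedelta(days=offset) for offset in range(1, 4)]
popular_routes = [("BOS", "SFO"), ("JFK", "LHR")]

# Combinations real users happened to search for today.
real_searches_today = [
    (("BOS", "SFO"), trip_dates[0]),
    (("JFK", "LHR"), trip_dates[2]),
]

gaps = plan_synthetic_searches(popular_routes, trip_dates, real_searches_today)
for route, trip_date in gaps:
    print("schedule synthetic search:", route, trip_date.isoformat())
```

Restricting the plan to popular routes and a limited date window keeps the number of automated requests small, which matters given the limitation discussed later in the thread.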

✅ User Impact

When a user searches for a flight, Expedia shows a chart that visualizes how prices are forecasted to change between now and takeoff. By improving the accuracy of their price forecasts, Expedia helps their users decide if they should book a flight immediately or wait until a forecasted price drop occurs in the future.

🚧 Limitations

Using an automated search based on synthetically created search parameters could interfere with the experience of onsite users who are trying to search for prices. The team took this into consideration and was deliberate about balancing the data retrieval needs of real user searches with the team’s needs for synthetic searches.

📚 Read the Dec 2024 @thenewstack article by Shiyi Pickrell, the SVP of Data and AI at Expedia: https://t.co/Ul9Tz8ejAn

📚 Read the Oct 2023 @Medium article by Andrew Reuben, Senior Machine Learning Scientist at Expedia: https://t.co/CLsJU4l4NS

#syntheticdata #generativeai #machinelearning #openai #travel

Image credit: Expedia

Synthetic Data Vault

@sdv_dev

over 1 year ago

One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule.

Some applications need to store numerical data with different units of measurement in the same column. For example, an online retailer accepts payments in many different currencies and records every transaction in a table. They use an amount column to record the transaction amount and a currency column to record the currency for each transaction.

The transaction amounts associated with each currency might have radically different scales (min-max ranges and distributions) because of the exchange rate. 1 USD (American Dollar) is equivalent to ~1063 ARS (Argentinian Pesos), which is reflected in the transaction amounts. We need a way to instruct the AI model to learn the scales for each currency separately.

To enable SDV synthesizers to model this business logic and generate synthetic data that adheres to it, we created the MixedScales constraint. You can use this constraint whenever the value of one or more categorical columns (like the currency column) determines the scale of a numerical column (like the amount column).

The MixedScales constraint is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.

📚 Learn more about the MixedScales constraint here: https://t.co/QjoGuJKzNV

📚 Learn more about the CAG bundle here: https://t.co/8d6fbtrHgn

#syntheticdata #generativeai #databases #finance #datamodeling
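
The idea of learning a separate scale per currency can be sketched with the standard library. This is a toy illustration, not the MixedScales constraint: it learns only a min-max range per currency and samples uniformly inside it, and the transaction data and helper names are invented.

```python
import random
from collections import defaultdict


def learn_scales(transactions):
    """Learn a separate (min, max) amount range for each currency."""
    by_currency = defaultdict(list)
    for row in transactions:
        by_currency[row["currency"]].append(row["amount"])
    return {cur: (min(vals), max(vals)) for cur, vals in by_currency.items()}


def sample_transactions(scales, currencies, rng):
    """Sample an amount for each requested currency from that currency's range."""
    rows = []
    for currency in currencies:
        low, high = scales[currency]
        rows.append({"currency": currency, "amount": round(rng.uniform(low, high), 2)})
    return rows


# Invented transactions: ARS amounts are roughly 1000x larger than USD ones.
real_transactions = [
    {"currency": "USD", "amount": 5.00},
    {"currency": "USD", "amount": 48.50},
    {"currency": "USD", "amount": 200.00},
    {"currency": "ARS", "amount": 5300.00},
    {"currency": "ARS", "amount": 51500.00},
    {"currency": "ARS", "amount": 212000.00},
]

scales = learn_scales(real_transactions)
rng = random.Random(0)
synthetic = sample_transactions(scales, ["USD", "ARS", "USD", "ARS"], rng)
for row in synthetic:
    print(row)
```

Learning one range for the whole amount column instead would happily produce a 150,000 USD order, which is the failure this kind of rule is meant to prevent.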

Synthetic Data Vault

@sdv_dev

over 1 year ago

Today, we’re excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement: https://t.co/nP3imke08I)

❎ Creating accurate metadata is time consuming, especially for complex multi-table schemas

Metadata provides a deeper context (semantic and statistical) about your data and the synthesizers use this context to generate high quality synthetic data. Without AI connectors, SDV users have to export data from the database, use SDV’s metadata auto-detection feature to establish metadata, and then manually update the metadata to be accurate.

✅ AI connectors automatically generate higher quality metadata

AI connectors automatically infer higher quality metadata using the database schema and our own inference engine, without having to read tables into memory from the database. When benchmarked with 55 datasets stored in 4 different database platforms, metadata generated using AI connectors resulted in 35% higher quality metadata (average score of 0.98) compared to metadata generated using the auto-detection approach (average score of 0.73).

❎ Identifying a referentially sound and representative sample for training data is tricky

Training SDV synthesizers requires loading a representative sample of data from your database into memory. In addition, the data needs to have referential integrity for the synthesizers to learn the proper relationships. Approaches to identifying a high quality, referentially sound sample of data can be tedious and time-consuming to implement.

✅ AI connectors use a built-in algorithm to generate a training dataset and guarantee referential integrity

With AI connectors, we created an algorithm called Referential First Search (RFS) that guarantees that the real data used to train the model is a subset with referential integrity. When benchmarked with 7 datasets stored in 5 different databases, training data created using AI connectors achieved an average of 18% higher quality data score over the standard approach of random subsampling and then enforcing referential integrity afterward.

Read more about AI connectors and how to access them in our latest product announcement here: https://t.co/nP3imke08I

#syntheticdata #generativeai #machinelearning #databases

Synthetic Data Vault

@sdv_dev

over 1 year ago

SDV Enterprise v0.23.0 is out 🎉

This release enhances your ability to program your synthesizer to find certain patterns and recreate them— whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data.

🏆 Improved CAG patterns. Use CarryOverColumns to specify a column that is repeated across many tables with different relationships. The PrimaryToPrimaryKeySubset pattern now works with missing values. See more about these interesting data patterns SDV Enterprise supports in the slides below.

💡 Experiment with new transformers to improve your synthetic data quality. Try applying the new LogScaler and LogitScaler on data that exhibits exponential properties.
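
For readers unfamiliar with these transformations, here is the underlying math as a standard-library sketch. It shows a generic log transform and a generic logit transform with their inverses. The function and parameter names (`constant`, `low`, `high`) are assumptions made for illustration and are not taken from the LogScaler or LogitScaler documentation.

```python
import math


def log_transform(x, constant=0.0):
    """Compress exponentially growing values; requires x > constant."""
    return math.log(x - constant)


def log_inverse(y, constant=0.0):
    """Undo log_transform."""
    return math.exp(y) + constant


def logit_transform(x, low, high):
    """Stretch values bounded in (low, high) onto the whole real line."""
    p = (x - low) / (high - low)
    return math.log(p / (1 - p))


def logit_inverse(y, low, high):
    """Undo logit_transform; the result always lies inside (low, high)."""
    p = 1 / (1 + math.exp(-y))
    return low + p * (high - low)


# Values that grow by a factor of 10 become evenly spaced after the log.
revenue = [10.0, 100.0, 1000.0, 10000.0]
logged = [log_transform(v) for v in revenue]
print([round(v, 3) for v in logged])

# A percentage near its upper bound, mapped out and back again.
score = 97.5
transformed = logit_transform(score, low=0.0, high=100.0)
print(round(transformed, 3), round(logit_inverse(transformed, 0.0, 100.0), 3))
```

Modeling in the transformed space and then inverting keeps generated values positive (log) or inside their bounds (logit).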

📚 Read the full Release Notes here: https://t.co/yOS3a4q82x

📚 Learn more about the SDV: https://t.co/7PnfKPunql

#syntheticdata #generativeai #machinelearning #ai

Synthetic Data Vault

@sdv_dev

over 1 year ago

Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests.

But if you need to test a new application that has no real world usage or collected data, then you need to adopt a different approach.

Instead of training models on your real data to generate synthetic data, you can generate fake test data from scratch that adheres to your database schema. In the SDV, we created a dedicated synthesizer called DayZSynthesizer to support this workflow.

Here are the 3 main steps:

1. Generate baseline metadata

Auto-generate baseline metadata from your database’s schema (for supported databases) or use our Metadata APIs to create a JSON representation of your metadata that mirrors your database schema.

2. Improve the data realism

You can update sdtypes to add semantic meaning to special columns like social security numbers, postal codes, and addresses to improve the format and type of fake data that’s generated. You can also define min-max value ranges for numerical columns, define a fixed set of categories for categorical columns, define datetime ranges, and control the proportion of missing data you’d like for each column.

3. Generate and export fake data 🚀

Generate the rows you need for each table and export the data into your database.

The beauty of this workflow is that every time you make a software change that requires a change in the database schema, you can re-generate fake data with minimal changes to the code you already wrote.
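
The workflow above can be mimicked in miniature with the standard library: describe each column, then generate rows from the description alone, with no real data involved. This is a toy sketch, not DayZSynthesizer and not SDV's metadata format. The `SCHEMA` dictionary layout, its keys, and the `generate_rows` helper are hypothetical.

```python
import random

# Hypothetical column descriptions (not SDV's metadata JSON).
SCHEMA = {
    "user_id": {"type": "id"},
    "age": {"type": "numerical", "min": 18, "max": 90},
    "plan": {"type": "categorical", "categories": ["free", "pro"]},
    "referral_code": {"type": "categorical", "categories": ["A1", "B2"], "missing": 0.5},
}


def generate_rows(schema, num_rows, seed=0):
    """Generate fake rows that respect the ranges, categories, and missing rates."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_rows):
        row = {}
        for name, spec in schema.items():
            if rng.random() < spec.get("missing", 0.0):
                row[name] = None  # honor the requested proportion of missing data
            elif spec["type"] == "id":
                row[name] = i  # sequential, so always unique
            elif spec["type"] == "numerical":
                row[name] = rng.randint(spec["min"], spec["max"])
            else:
                row[name] = rng.choice(spec["categories"])
        rows.append(row)
    return rows


fake_users = generate_rows(SCHEMA, num_rows=200)
for row in fake_users[:3]:
    print(row)
```

When the database schema changes, only the `SCHEMA` description needs updating, which mirrors the point about re-generating fake data with minimal code changes.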

📚 Learn more about DayZSynthesizer here: https://t.co/4RWf25VxrW

📚 Learn more about the Metadata Creation API Here: https://t.co/NozuGc7GxL

📚 Learn more about the SDV here: https://t.co/7PnfKPunql

#syntheticdata #fakedata #machinelearning #generativeai

Synthetic Data Vault

@sdv_dev

over 1 year ago

Last week, we shared a synthetic population dataset for the United States, but this week we’re sharing one published by researchers for the whole world. 🌏

Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans, which matches the 2015 human population count, and ~1.99 billion households.

The Motivation
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior.

According to the authors: “For example, integrated assessment models of climate change typically assume a representative consumer of a single average global or regional consumer.”

By creating a synthetic individuals dataset that’s consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they’re hoping to improve the data and assumptions used in global impact simulations.

Their Data Sources
The team primarily used data from 2 databases:

• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for middle- and high-income countries.

• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.

Households and individuals in the remaining countries were generated using regional statistics. A small number of countries were excluded that were missing reliable, published statistics.

This is a great dataset to explore geospatial visualizations or to build regional or global impact models.

📚 Link to the paper: https://t.co/1Uq61TGmox
🗄️ Link to the dataset: https://t.co/vx07ezoFKF

#syntheticdata #machinelearning #generativeai

Kudos to the researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts

Credit to Nature magazine and the authors for the image showcasing the population coverage and data source for each country.


Synthetic Data Vault

@sdv_dev

over 1 year ago

Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern lets the database user avoid a time-consuming or expensive JOIN query, especially if one of the tables is extremely large or if the database is column-oriented, as many OLAP databases are.

For example, imagine you’re building an #ecommerce orders dashboard that frequently needs to analyze order volume and amounts by the user’s country of origin. With a fully normalized table design, this application would need to aggregate this information by frequently querying and joining both the orders and users tables.

If this query is slow or expensive, you could instead mirror the country of origin information from the users table to the orders table.

We call this the CarryOverColumns pattern because 1 or more columns are carried over from one table to another.

If your real data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in SDV Enterprise.
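
To make the pattern concrete, here is a small sketch using Python's built-in `sqlite3` module. It is not SDV code: the table layout and sample rows are made up for illustration. It shows the JOIN-based query, the equivalent query against the mirrored column, and a check for rows that break the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    -- orders.country mirrors users.country (the carried-over column)
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        amount REAL,
        country TEXT
    );
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES
        (10, 1, 25.0, 'US'),
        (11, 1, 40.0, 'US'),
        (12, 2, 15.0, 'DE');
    """
)

# Normalized approach: join orders to users to find each order's country.
with_join = conn.execute(
    """
    SELECT u.country, COUNT(*), SUM(o.amount)
    FROM orders AS o JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.country ORDER BY u.country
    """
).fetchall()

# Carry-over approach: read the mirrored column, no JOIN required.
without_join = conn.execute(
    """
    SELECT country, COUNT(*), SUM(amount)
    FROM orders GROUP BY country ORDER BY country
    """
).fetchall()

# Rows that violate the pattern: the child value differs from the parent's.
violations = conn.execute(
    """
    SELECT o.order_id
    FROM orders AS o JOIN users AS u ON u.user_id = o.user_id
    WHERE o.country <> u.country
    """
).fetchall()
```

Both queries return the same totals only while every order's country matches its user's country, which is why synthetic data has to preserve the pattern for dashboards like this one to stay correct.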

📚 Learn more about the CarryOverColumns pattern here: https://t.co/KPeAmSpmjH

📚 Learn more about the CAG bundle here: https://t.co/WfKIvYy02v

#syntheticdata #generativeai #databases #machinelearning #datamodeling

Synthetic Data Vault

@sdv_dev

over 1 year ago

James Rineer et al. just released a new #syntheticdata dataset containing millions of synthetic households and individuals in the US. Using publicly available data from the U.S. Census Bureau, they generated:

🏘️ 120,754,708 synthetic households
👥 303,128,287 synthetic individuals
🗄️ 3 Gigabytes of compressed parquet files

The team was very meticulous with many aspects of the data generation. For example, they used external population density sources to place households inside real census block groups instead of just randomly generating locations inside the US.

This is a great dataset for practicing spatiotemporal analysis and visualization. 🗺️📊

Link to the paper: https://t.co/UDMjIxvF8H

Link to the dataset: https://t.co/l2vUTIdnCX

#gis #machinelearning #ai #openai

Collaborators: Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev

Credit to @Nature magazine and the authors for the excellent image.


Synthetic Data Vault

@sdv_dev

over 1 year ago

In 2024, synthetic data routinely made headlines alongside many AI product launches.

Here are our predictions for 2025 🔮

1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.

Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver, the process will lead to much more concrete requirements for tabular synthetic data generators.

2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.

Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution.

3. Every company will, at the very least, experiment with synthetic data in 2025 as part of their broader AI data strategy.

Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.

4. Synthetic data for training AI agents will become a more pressing need.

Enterprises will need additional data to train more robust AI agents and synthetic data can help fill the gap.

5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents.

While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.

📖 Read more about our 2025 predictions and our 2024 recap here: https://t.co/HVohrewK4F

#generativeai #ai #openai #syntheticdata #machinelearning

Synthetic Data Vault

@sdv_dev

over 1 year ago

If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the synthetic data adheres to the same business rules.
⁣
For example, imagine that you’re an online retailer that wants to test, using realistic data, how a new version of your website displays order history. Each order contains product names, their SKUs (stock keeping units), along with some other fields.
⁣
Every SKU value is linked to a unique product name and the generated synthetic data needs to reflect this pattern to help you accurately test the change. A SKU value can’t appear next to different product names in the synthetic data.⁣
⁣
In the SDV, you can define this business rule using the FixedCombinations constraint and require your synthesizer to generate synthetic data that adheres to it.
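
The rule itself is easy to state in code. The sketch below is a plain-Python check, not the SDV constraint API: the helper names and sample rows are hypothetical. It treats a synthetic row as valid only if its combination of SKU and product name also appears somewhere in the real data.

```python
def allowed_combinations(rows, columns):
    """Collect every combination of values seen across the given columns."""
    return {tuple(row[c] for c in columns) for row in rows}


def find_violations(real_rows, synthetic_rows, columns):
    """Return synthetic rows whose combination never occurs in the real data."""
    allowed = allowed_combinations(real_rows, columns)
    return [
        row for row in synthetic_rows
        if tuple(row[c] for c in columns) not in allowed
    ]


# Made-up sample data for illustration.
real = [
    {"sku": "A-100", "product_name": "Kettle"},
    {"sku": "B-200", "product_name": "Toaster"},
]
synthetic = [
    {"sku": "A-100", "product_name": "Kettle"},   # valid pairing
    {"sku": "A-100", "product_name": "Toaster"},  # SKU next to the wrong name
]

bad_rows = find_violations(real, synthetic, ["sku", "product_name"])
```

A check like this is useful after generation as a sanity test, whichever mechanism was used to enforce the rule during generation.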
⁣
📖 Learn more about the FixedCombinations constraint here: https://t.co/kKXhiMhwkj
⁣
🤝Join the SDV community here: https://t.co/FHaiI14lfh⁣
⁣
#generativeai #syntheticdata #machinelearning #openai

Synthetic Data Vault

@sdv_dev

over 1 year ago

An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column’s sdtype. Sdtypes are a key part of the SDV’s Metadata model, which lets you, the expert on the data, provide additional context for the SDV to incorporate.

For example, a column containing the values 75023, 10002, and 10003 could represent any of the following sdtypes based on the dataset:

- Numerical
- Categorical
- Postal Code
- Identifier (or ID)

Each sdtype results in different synthetic data generation behavior for a column, as you can tell from the diagram below. Start by establishing baseline metadata using SDV’s auto-detection feature and then update the sdtype for specific columns to better align with the behavior you expect.
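
As a rough illustration of why the choice matters, the toy sketch below generates new values for the same three inputs under each interpretation. It is a deliberately simplified stand-in written with the standard library, not the SDV's actual generation logic, and every function name here is hypothetical.

```python
import random

values = [75023, 10002, 10003]
rng = random.Random(0)


def as_numerical(observed, n):
    """Treat values as numbers: any integer in the observed range may appear."""
    return [rng.randint(min(observed), max(observed)) for _ in range(n)]


def as_categorical(observed, n):
    """Treat values as labels: only the observed values can reappear."""
    return [rng.choice(observed) for _ in range(n)]


def as_postal_code(observed, n):
    """Treat values as postal codes: new 5-digit strings (format only)."""
    return [f"{rng.randrange(100000):05d}" for _ in range(n)]


def as_id(observed, n):
    """Treat values as identifiers: every generated value is unique."""
    start = max(observed) + 1
    return list(range(start, start + n))


samples = {
    "numerical": as_numerical(values, 5),
    "categorical": as_categorical(values, 5),
    "postal_code": as_postal_code(values, 5),
    "id": as_id(values, 5),
}
```

Even in this toy version, the four interpretations produce visibly different columns from identical inputs, which is the reason to review auto-detected sdtypes before training.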

Learn more about sdtypes here: https://t.co/bRAhZySY7L

#generativeAI #syntheticdata #AI

Synthetic Data Vault

@sdv_dev

over 1 year ago

Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent.

By incorporating synthetic data in your training data, you can achieve a more desirable label balance.

Start by training a generative AI model in the SDV on your real data. Then, use the Conditional Sampling feature to generate synthetic data for just the rows in the minority label class. Because the model is trained on your real data, the generated synthetic data will mirror the column distributions and correlations between the columns in your real data.

By supplementing your training data with synthetic data that’s conditionally sampled from the minority class, you can even achieve a 50-50 class balance.

Learn more about our Conditional Sampling feature here: https://t.co/6gvlKikZGB
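
A quick way to size the request is to work out how many minority rows to add. The helper below is a small arithmetic sketch with a hypothetical name; it does not call the SDV. It solves (minority + x) / (total + x) = target for x, using a made-up table of 1,000,000 rows with 1,000 fraudulent rows as the example.

```python
import math


def minority_rows_needed(total_rows, minority_rows, target_fraction=0.5):
    """Synthetic minority rows to add so the minority reaches target_fraction."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    # Solve (minority + x) / (total + x) = target_fraction for x.
    needed = (target_fraction * total_rows - minority_rows) / (1 - target_fraction)
    # Never return a negative count if the target is already met.
    return max(0, math.ceil(needed))


# Hypothetical fraud table: 1,000,000 rows, 1,000 of them fraudulent.
extra_rows = minority_rows_needed(total_rows=1_000_000, minority_rows=1_000)
```

The resulting count is the number of minority-class rows to request from conditional sampling; a lower `target_fraction` gives a smaller, less aggressive top-up.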
