Today, DataCebo launched SDV Enterprise & raised $8.5M in VC. SDV Enterprise is a commercial model of the source-available Synthetic Data Vault (SDV). It makes it easy to develop, manage & deploy #generativeAI models for apps when real data is limited. https://t.co/8DqHCxPDMv
Generate synthetic data at scale!
SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data.
Here's how it works in 3 steps:
1. Train: Point SDV at your real table; it will capture the underlying distributions & relationships.
2. Generate: Run the trained SDV model to pop out as many look-alike rows as you need, with no real data exposed.
3. Validate: Use SDV's quality report to see how closely the generated data matches the real stuff; tweak and repeat if you want it tighter.
Class imbalance, solved in one shot!
Key features:
- Multiple models, from GaussianCopula to CTGAN
- Single, multi & sequential-table support
- Built-in anonymization & logical constraints
- Sampling is a single call on a fitted synthesizer: `synthesizer.sample()`
Link to the GitHub repo in next tweet!
____
Share this with your network if you found this insightful
Follow me ( @akshay_pachaar ) for more insights and tutorials on AI and Machine Learning!
Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced cheaper than basic rooms.
GaussianCopulaSynthesizer automatically learns and maintains these relationships, creating synthetic data that preserves the statistical patterns of your original dataset.
Full code: https://t.co/MwjN2e7FBO
Many businesses collect and store their customers' GPS locations to help improve their products. But GPS locations may contain the precise locations of people's homes. Businesses are wary of sharing this data even with internal teams, as it may reveal private information about people they know.
For example, a food delivery application stores the GPS location associated with each delivery. An internal product team wants to use this data to improve the local restaurant recommendations the application makes to users for future orders. The company needs a way to preserve local insights on the best restaurants from the GPS location data without exposing sensitive user locations.
One anonymization approach they could take is replacing every collected GPS location with a randomly chosen one from within the same postal code. Users tend to order from restaurants in the same or neighboring postal codes, so the integrity of local trends is still preserved.
To implement this approach, they would need a dataset that contains the geographic boundaries for each postal code and an algorithm for identifying the postal code from a GPS location. To make this process seamless, we created the MetroAreaAnonymizer.
With just a few lines of code, you can use the MetroAreaAnonymizer to replace GPS locations with a randomly chosen one from the same postal code. MetroAreaAnonymizer is part of our RDT library, which contains many helpful transformations for your raw data.
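The idea behind this approach can be sketched in plain Python. This is not the MetroAreaAnonymizer API: the postal-code boxes, `find_postal_code`, and `anonymize_location` are hypothetical names made up for illustration, and real postal-code boundaries are polygons rather than rectangles.

```python
import random

# Hypothetical postal-code boundaries as (min_lat, max_lat, min_lon, max_lon).
# Real boundaries are polygons; rectangles keep the sketch short.
POSTAL_BOXES = {
    "02139": (42.355, 42.375, -71.120, -71.090),
    "02116": (42.340, 42.355, -71.085, -71.065),
}


def find_postal_code(lat, lon):
    """Return the postal code whose box contains the point, or None."""
    for code, (min_lat, max_lat, min_lon, max_lon) in POSTAL_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return code
    return None


def anonymize_location(lat, lon, rng):
    """Replace a GPS point with a random point from the same postal code."""
    code = find_postal_code(lat, lon)
    if code is None:
        raise ValueError("location is outside every known postal code")
    min_lat, max_lat, min_lon, max_lon = POSTAL_BOXES[code]
    return rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon)


rng = random.Random(0)
original = (42.3601, -71.0942)
anonymized = anonymize_location(*original, rng)

# Both points resolve to the same postal code, but the exact location differs.
print(find_postal_code(*original), find_postal_code(*anonymized))
```

The delivery is still attributed to the right neighborhood, which is what the restaurant recommendations need, while the exact address is no longer recoverable from the stored point.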
Learn more about MetroAreaAnonymizer here: https://t.co/xjICTiEIVG
Learn about RDT here: https://t.co/tIgwSxn6gU
Learn more about the SDV here: https://t.co/aeuBqPb5Xh
#syntheticdata #machinelearning #anonymization #geospatial
Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data.
Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to their entire reader base. They trained an AI model on their real data and used it to generate synthetic data.
Before they can incorporate this synthetic data into the test environment, however, it must meet some minimum criteria for the application to function properly. Here are some examples of criteria that the synthetic data must meet:
1. Data Validity: Primary keys must be unique and non-null.
Many features need to retrieve a specific row in a table using a unique identifier. For example, to authenticate a user, the application needs to find the specific row corresponding to their unique user_id value.
2. Data Structure: Data types, column names, and table names should match those in the real data.
Application code that retrieves or updates data using specific column names, column types, and table names will error if any of them differ, for example when the application needs to update a user's settings.
3. Relationship Validity: Each foreign key must have a reference to a valid primary key (also known as referential integrity).
Many features in the app require joining data from multiple tables, like the recommended articles feature. Without referential integrity, the retrieved data might contain a subset or none of the recommended articles for the user.
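The three criteria above can be sketched as simple checks in plain Python. This is only an illustration of what each criterion means, not how the Diagnostic Report is implemented; the function names and the tiny tables are hypothetical, and the structure check only looks at the first row of each table.

```python
def check_primary_key(rows, key):
    """Data validity: primary key values are unique and non-null."""
    values = [row.get(key) for row in rows]
    return None not in values and len(values) == len(set(values))


def check_structure(real_rows, synthetic_rows):
    """Data structure: same column names and the same type per column."""
    def schema(rows):
        # Simplification: infer the schema from the first row only.
        return {name: type(value) for name, value in rows[0].items()}
    return schema(real_rows) == schema(synthetic_rows)


def check_foreign_key(child_rows, fk, parent_rows, pk):
    """Relationship validity: every foreign key references an existing primary key."""
    parent_keys = {row[pk] for row in parent_rows}
    return all(row[fk] in parent_keys for row in child_rows)


real_users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Ben"}]
synthetic_users = [{"user_id": 10, "name": "Cy"}, {"user_id": 11, "name": "Di"}]
synthetic_articles = [
    {"article_id": 100, "user_id": 10},
    {"article_id": 101, "user_id": 99},  # no synthetic user has user_id 99
]

results = {
    "primary_key": check_primary_key(synthetic_users, "user_id"),
    "structure": check_structure(real_users, synthetic_users),
    "foreign_key": check_foreign_key(
        synthetic_articles, "user_id", synthetic_users, "user_id"
    ),
}
print(results)
```

Here the orphaned article row makes the relationship check fail, which is exactly the kind of problem that would break a feature that joins the two tables.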
To help them validate that the synthetic data meets the minimum criteria for usability, they could use the SDV's Diagnostic Report. This report runs all of our basic data format and validity checks by comparing the real and synthetic data.
The Diagnostic Report is part of our open-source and vendor-neutral SDMetrics library. Synthetic data generated by the default synthesizers in the SDV will always result in 100% diagnostic scores. We call this the SDV Guarantee.
If the SDV ever generates synthetic data that doesn't score 100% on the Diagnostic Report, then you've identified a bug! Please reach out to us on GitHub or Slack and we will prioritize investigating it.
Learn more about the single-table Diagnostic Report: https://t.co/5ICHqDb5rt
Learn more about the multi-table Diagnostic Report: https://t.co/6W0PYTRnOn
Learn more about the SDV here: https://t.co/aeuBqPbDMP
#dataquality #generativeai #machinelearning #softwaretesting #syntheticdata
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let's explore an example of one such rule.
The one-to-many relationship is a common pattern in database schemas. An interesting variation of this pattern occurs when only some rows are allowed to have connections while others aren't.
For example, a gym offers a premium membership tier that gives access to extra benefits (like pool access and sauna access). To record the perks available to each member, they use a members table and a benefits table.
Only the rows representing premium members are allowed to have connections to rows in the benefits table while the rows representing basic members are not. This enables the gym to store specific information for a subset of their membership in a separate table in a simple way.
We call this the ForeignToPrimaryKeySubset pattern because only a subset of the primary keys in the parent table have a 1-to-many relationship with the foreign keys in the child table.
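To make the rule concrete, here is a minimal plain-Python check for the gym example. It is not the SDV Enterprise API; the column names (`member_id`, `tier`, `benefit`) and the function name are assumptions made for illustration.

```python
def find_pattern_violations(members, benefits):
    """Return benefit rows whose member is missing or is not a premium member."""
    premium_ids = {m["member_id"] for m in members if m["tier"] == "premium"}
    return [b for b in benefits if b["member_id"] not in premium_ids]


members = [
    {"member_id": 1, "tier": "premium"},
    {"member_id": 2, "tier": "basic"},
    {"member_id": 3, "tier": "premium"},  # premium members may have no benefits yet
]
benefits = [
    {"member_id": 1, "benefit": "pool"},
    {"member_id": 1, "benefit": "sauna"},
    {"member_id": 2, "benefit": "pool"},  # basic member: breaks the pattern
]

violations = find_pattern_violations(members, benefits)
print(violations)
```

Synthetic data that adheres to the pattern would produce an empty list here.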
If your data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.
Learn more about the ForeignToPrimaryKeySubset pattern here: https://t.co/tVEqBnqvc1
Learn more about the CAG bundle here: https://t.co/8d6fbtrHgn
Learn more about the SDV here: https://t.co/aeuBqPbDMP
#syntheticdata #generativeai #databases #machinelearning #datamodeling
@Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models.
When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search parameters: route, fare class, trip dates, etc. To build interesting price prediction features for their customers, the Expedia team trains forecasting models on data they've collected, but they wanted to improve prediction accuracy even further.
The Challenge
Even though users make millions of searches daily, the number of possible combinations of routes, trip dates, and passenger counts is so large that the team had no price for many of them. To develop a robust forecasting model, the team would ideally have at least one search a day for each combination of search parameters.
How They Incorporated Synthetic Data
To fill these gaps, they built automated software that requests flight prices for specific search parameters. Their goal with synthetic searches is to have at least one search a day for their most popular routes for the trip dates that fall within the upcoming months.
During the model training phase, they combine data from real user searches and from synthetic searches to ensure they have better data coverage.
User Impact
When a user searches for a flight, Expedia shows a chart that visualizes how prices are forecasted to change between now and takeoff. By improving the accuracy of their price forecasts, Expedia helps their users decide if they should book a flight immediately or wait until a forecasted price drop occurs in the future.
Limitations
Using an automated search based on synthetically created search parameters could interfere with the experience of onsite users who are trying to search for prices. The team took this into consideration and was deliberate about balancing the data retrieval needs of real user searches with the team's needs for synthetic searches.
Read the Dec 2024 @thenewstack article by Shiyi Pickrell, the SVP of Data and AI at Expedia: https://t.co/Ul9Tz8ejAn
Read the Oct 2023 @Medium article by Andrew Reuben, Senior Machine Learning Scientist at Expedia: https://t.co/CLsJU4l4NS
#syntheticdata #generativeai #machinelearning #openai #travel
Image credit: Expedia
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let's explore an example of one such rule.
Some applications need to store numerical data with different units of measurement in the same column. For example, an online retailer accepts payments in many different currencies and records every transaction in a table. They use an amount column to record the transaction amount and a currency column to record the currency for each transaction.
The transaction amounts associated with each currency might have radically different scales (min-max ranges and distributions) because of the exchange rate. 1 USD (American Dollar) is equivalent to ~1063 ARS (Argentinian Pesos), which is reflected in the transaction amounts.
We need a way to instruct the AI model to learn the scales for each currency separately. To enable SDV synthesizers to model this business logic and generate synthetic data that adheres to it, we created the MixedScales constraint. You can use this constraint whenever the value of one or more categorical columns (like the currency column) determines the scale of a numerical column (like the amount column).
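A small plain-Python sketch shows why the scales need to be learned per currency. It is not the MixedScales implementation: the helper names and the sample transactions are hypothetical, and the ARS amounts are simply the USD amounts multiplied by the ~1063 rate mentioned above.

```python
from collections import defaultdict

transactions = [
    {"currency": "USD", "amount": 12.0},
    {"currency": "USD", "amount": 48.0},
    {"currency": "USD", "amount": 30.0},
    {"currency": "ARS", "amount": 12756.0},
    {"currency": "ARS", "amount": 51024.0},
    {"currency": "ARS", "amount": 31890.0},
]


def learn_scales(rows):
    """Learn a separate (min, max) range of amounts for each currency."""
    ranges = defaultdict(lambda: [float("inf"), float("-inf")])
    for row in rows:
        bounds = ranges[row["currency"]]
        bounds[0] = min(bounds[0], row["amount"])
        bounds[1] = max(bounds[1], row["amount"])
    return {currency: tuple(bounds) for currency, bounds in ranges.items()}


def scale(row, scales):
    """Map an amount to [0, 1] using the range of its own currency.

    Assumes each currency has at least two distinct amounts.
    """
    low, high = scales[row["currency"]]
    return (row["amount"] - low) / (high - low)


scales = learn_scales(transactions)
scaled = [scale(row, scales) for row in transactions]
print(scales)
print(scaled)
```

Once each amount is expressed relative to its own currency's range, the USD and ARS transactions look alike, so a model can learn one shape and map values back to the right scale afterwards. With a single global range, every USD amount would be squeezed next to zero.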
The MixedScales constraint is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.
Learn more about the MixedScales constraint here: https://t.co/QjoGuJKzNV
Learn more about the CAG bundle here: https://t.co/8d6fbtrHgn
#syntheticdata #generativeai #databases #finance #datamodeling
Today, we're excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement: https://t.co/nP3imke08I)
Challenge 1: Creating accurate metadata is time-consuming, especially for complex multi-table schemas
Metadata provides a deeper context (semantic and statistical) about your data, and the synthesizers use this context to generate high quality synthetic data. Without AI connectors, SDV users have to export data from the database, use SDV's metadata auto-detection feature to establish metadata, and then manually update the metadata to be accurate.
Solution: AI connectors automatically generate higher quality metadata
AI connectors automatically infer higher quality metadata using the database schema and our own inference engine, without having to read tables into memory from the database.
When benchmarked with 55 datasets stored in 4 different database platforms, metadata generated using AI connectors was 35% higher quality (average score of 0.98) than metadata generated using the auto-detection approach (average score of 0.73).
Challenge 2: Identifying a referentially sound and representative sample for training data is tricky
Training SDV synthesizers requires loading a representative sample of data from your database into memory. In addition, the data needs to have referential integrity for the synthesizers to learn the proper relationships. Approaches to identifying a high quality, referentially sound sample of data can be tedious and time-consuming to implement.
Solution: AI connectors use a built-in algorithm to generate a training dataset and guarantee referential integrity
With AI connectors, we created an algorithm called Referential First Search (RFS) that guarantees that the real data used to train the model is a subset with referential integrity. When benchmarked with 7 datasets stored in 5 different databases, training data created using AI connectors achieved an average of 18% higher data quality score than the standard approach of random subsampling and then enforcing referential integrity afterward.
Read more about AI connectors and how to access them in our latest product announcement here: https://t.co/nP3imke08I
#syntheticdata #generativeai #machinelearning #databases
SDV Enterprise v0.23.0 is out
This release enhances your ability to program your synthesizer to find certain patterns and recreate them, whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data.
Improved CAG patterns. Use CarryOverColumns to specify a column that is repeated across many tables with different relationships. The PrimaryToPrimaryKeySubset pattern now works with missing values. See more about these interesting data patterns SDV Enterprise supports in the slides below.
Experiment with new transformers to improve your synthetic data quality. Try applying the new LogScaler and LogitScaler on data that exhibits exponential properties.
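To see why log-style transforms help with exponential data, here is a sketch of the underlying math in plain Python. It is not the LogScaler or LogitScaler implementation or API; the function names, the sample values, and the `low`/`high` bounds are assumptions for illustration.

```python
import math


def log_transform(value):
    """Map a positive value onto a log scale."""
    return math.log(value)


def log_inverse(value):
    """Undo log_transform."""
    return math.exp(value)


def logit_transform(value, low, high):
    """Map a value strictly between low and high onto an unbounded scale."""
    p = (value - low) / (high - low)
    return math.log(p / (1 - p))


def logit_inverse(z, low, high):
    """Undo logit_transform; the result always lands between low and high."""
    return low + (high - low) / (1 + math.exp(-z))


# Exponentially growing values become evenly spaced after the log transform.
revenue = [1, 10, 100, 1000, 10000]
logged = [log_transform(v) for v in revenue]
gaps = [b - a for a, b in zip(logged, logged[1:])]
print(gaps)

# A bounded value (for example a percentage) survives the round trip.
z = logit_transform(90.0, low=0.0, high=100.0)
print(logit_inverse(z, low=0.0, high=100.0))
```

A model that struggles with a long-tailed column often has an easier time with the evenly spaced, transformed version, and the inverse function maps generated values back to the original scale.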
Read the full Release Notes here: https://t.co/yOS3a4q82x
Learn more about the SDV: https://t.co/7PnfKPunql
#syntheticdata #generativeai #machinelearning #ai
Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests.
But if you need to test a new application that has no real world usage or collected data, then you need to adopt a different approach.
Instead of training models on your real data to generate synthetic data, you can generate fake test data from scratch that adheres to your database schema. In the SDV, we created a dedicated synthesizer called DayZSynthesizer to support this workflow.
Here are the 3 main steps:
1. Generate baseline metadata
Auto-generate baseline metadata from your database's schema (for supported databases) or use our Metadata APIs to create a JSON representation of your metadata that mirrors your database schema.
2. Improve the data realism
You can update sdtypes to add semantic meaning to special columns like social security numbers, postal codes, and addresses to improve the format and type of fake data that's generated. You can also define min-max value ranges for numerical columns, define a fixed set of categories for categorical columns, define datetime ranges, and control the proportion of missing data you'd like for each column.
3. Generate and export fake data
Generate the rows you need for each table and export the data into your database.
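The workflow above can be sketched in plain Python to show the idea of generating fake rows from a schema alone. This is not the DayZSynthesizer API or the SDV metadata format: `SCHEMA`, `fake_value`, and `generate_rows` are hypothetical, and only ID, numerical, and categorical columns are covered.

```python
import random

# Hypothetical schema description; this is not the SDV metadata format.
SCHEMA = {
    "user_id": {"kind": "id"},
    "age": {"kind": "numerical", "min": 18, "max": 90},
    "plan": {"kind": "categorical", "categories": ["free", "pro", "team"]},
    "referral_code": {
        "kind": "categorical",
        "categories": ["A1", "B2"],
        "missing": 0.5,  # proportion of missing values for this column
    },
}


def fake_value(spec, row_index, rng):
    """Create one fake value that respects the column's declared rules."""
    if rng.random() < spec.get("missing", 0.0):
        return None
    if spec["kind"] == "id":
        return row_index  # unique by construction
    if spec["kind"] == "numerical":
        return rng.randint(spec["min"], spec["max"])
    if spec["kind"] == "categorical":
        return rng.choice(spec["categories"])
    raise ValueError(f"unsupported column kind: {spec['kind']}")


def generate_rows(schema, num_rows, seed=0):
    """Generate fake rows from the schema alone; no real data is needed."""
    rng = random.Random(seed)
    return [
        {column: fake_value(spec, index, rng) for column, spec in schema.items()}
        for index in range(num_rows)
    ]


rows = generate_rows(SCHEMA, num_rows=100)
print(rows[0])
```

When the database schema changes, only the schema description needs updating; the generation code stays the same.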
The beauty of this workflow is that every time you make a software change that requires a change in the database schema, you can re-generate fake data with minimal changes to the code you already wrote.
Learn more about DayZSynthesizer here: https://t.co/4RWf25VxrW
Learn more about the Metadata Creation API here: https://t.co/NozuGc7GxL
Learn more about the SDV here: https://t.co/7PnfKPunql
#syntheticdata #fakedata #machinelearning #generativeai
Last week, we shared a synthetic populations dataset for the United States, but this week we're sharing one published by researchers for the whole world.
Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans, which matches the 2015 human population count, and ~1.99 billion households.
The Motivation
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior.
According to the authors: "For example, integrated assessment models of climate change typically assume a representative consumer of a single average global or regional consumer."
By creating a synthetic individuals dataset that's consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they're hoping to improve the data and assumptions used in global impact simulations.
Their Data Sources
The team primarily used data from 2 databases:
• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for medium and high income countries.
• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.
Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that were missing reliable, published statistics were excluded.
This is a great dataset to explore geospatial visualizations or to build regional or global impact models.
Link to the paper: https://t.co/1Uq61TGmox
Link to the dataset: https://t.co/vx07ezoFKF
#syntheticdata #machinelearning #generativeai
Kudos to researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts
Credit to the Nature Magazine and the authors for the image showcasing the population coverage and data source for each country.
Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern helps the database user avoid the need to run a time-consuming or expensive JOIN query, especially if one of the tables is extremely large or if the database is column-oriented like OLAP databases are.
For example, imagine you're building an #ecommerce orders dashboard that frequently needs to analyze order volume and amounts by the user's country of origin. With a fully normalized table design, this application would need to accumulate this information by frequently querying and joining both the orders and users tables.
If this query was slow or expensive, you could instead mirror the country of origin information from the users table to the orders table.
We call this the CarryOverColumns pattern because 1 or more columns are carried over from one table to another.
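As a minimal plain-Python sketch of what adhering to this pattern means, the check below looks for orders whose mirrored country disagrees with the parent user. It is not the SDV Enterprise API; the column names and the function name are assumptions, and it assumes every order references an existing user.

```python
def find_carry_over_mismatches(users, orders, column):
    """Return orders whose mirrored column differs from the parent user's value."""
    users_by_id = {user["user_id"]: user for user in users}
    return [
        order
        for order in orders
        if order[column] != users_by_id[order["user_id"]][column]
    ]


users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "FR"},
]
orders = [
    {"order_id": 100, "user_id": 1, "country": "US", "amount": 20.0},
    {"order_id": 101, "user_id": 2, "country": "FR", "amount": 35.5},
    {"order_id": 102, "user_id": 2, "country": "US", "amount": 12.0},  # mismatch
]

mismatches = find_carry_over_mismatches(users, orders, "country")
print([order["order_id"] for order in mismatches])
```

A dashboard that reads the country straight from the orders table would silently report wrong totals for mismatched rows, which is why the synthetic data needs to keep the two tables in sync.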
If your real data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.
Learn more about the CarryOverColumns pattern here: https://t.co/KPeAmSpmjH
Learn more about the CAG bundle here: https://t.co/WfKIvYy02v
#syntheticdata #generativeai #databases #machinelearning #datamodeling
James Rineer et al. just released a new dataset containing millions of #syntheticdata records about households and individuals in the US. Using publicly available census data from the U.S. Census Bureau, they generated:
- 120,754,708 synthetic households
- 303,128,287 synthetic individuals
- 3 Gigabytes of compressed parquet files
The team was very meticulous with many aspects of the data generation. For example, they used external population density sources to place households inside real census block groups instead of just randomly generating locations inside the US.
This is a great dataset for practicing spatiotemporal analysis and visualization.
Link to the paper: https://t.co/UDMjIxvF8H
Link to the dataset: https://t.co/l2vUTIdnCX
#gis #machinelearning #ai #openai
Collaborators: Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev
Credit to the @Nature magazine and the authors for the excellent image.
In 2024, synthetic data routinely made headlines alongside many AI product launches. Here are our predictions for 2025:
1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.
Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver, the process will lead to the development of much more concrete requirements for tabular synthetic data generators.
2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.
Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution.
3. Every company will, at the very least, experiment with synthetic data in 2025 as part of their broader AI data strategy.
Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.
4. Synthetic data for training AI agents will become a more pressing need.
Enterprises will need additional data to train more robust AI agents and synthetic data can help fill the gap.
5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents.
While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.
Read more about our 2025 predictions and our 2024 recap here: https://t.co/HVohrewK4F
#generativeai #ai #openai #syntheticdata #machinelearning
If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the synthetic data adheres to the same business rules.
For example, imagine that you're an online retailer that wants to test, using realistic data, how a new version of your website displays order history. Each order contains product names, their SKUs (stock keeping units), along with some other fields.
Every SKU value is linked to a unique product name and the generated synthetic data needs to reflect this pattern to help you accurately test the change. A SKU value can't appear next to different product names in the synthetic data.
In the SDV, you can define this business rule using the FixedCombinations constraint and require your synthesizer to generate synthetic data that adheres to it.
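The rule itself can be sketched as a plain-Python check: a synthetic row is only valid if its SKU and product name appear together somewhere in the real data. This is not the SDV constraint API; the function name and the sample rows are hypothetical.

```python
def find_invalid_combinations(real_rows, synthetic_rows, columns):
    """Return synthetic rows whose column combination never occurs in the real data."""
    allowed = {tuple(row[column] for column in columns) for row in real_rows}
    return [
        row
        for row in synthetic_rows
        if tuple(row[column] for column in columns) not in allowed
    ]


real_orders = [
    {"sku": "A-100", "product_name": "Desk Lamp"},
    {"sku": "B-200", "product_name": "Office Chair"},
]
synthetic_orders = [
    {"sku": "B-200", "product_name": "Office Chair"},
    {"sku": "A-100", "product_name": "Office Chair"},  # SKU paired with the wrong product
]

invalid = find_invalid_combinations(
    real_orders, synthetic_orders, columns=["sku", "product_name"]
)
print(invalid)
```

With the constraint in place, the synthesizer should never produce a row like the second one.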
Learn more about the FixedCombinations constraint here: https://t.co/kKXhiMhwkj
Join the SDV community here: https://t.co/FHaiI14lfh
#generativeai #syntheticdata #machinelearning #openai
An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column's sdtype. Sdtypes are a key part of the SDV's Metadata model, which lets you, the expert of the data, provide additional context for the SDV to incorporate.
For example, a column containing the values 75023, 10002, and 10003 could represent any of the following sdtypes based on the dataset:
- Numerical
- Categorical
- Postal Code
- Identifier (or ID)
Each sdtype results in different synthetic data generation behavior for a column, as you can tell from the diagram below. Start by establishing baseline metadata using SDV's auto-detection feature and then update the sdtype for specific columns to better align with the behavior you expect.
Learn more about sdtypes here: https://t.co/bRAhZySY7L
#generativeAI #syntheticdata #AI
Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent. By incorporating synthetic data in your training data, you can achieve a more desirable label balance.
Start by training a generative AI model in the SDV on your real data. Then, use the Conditional Sampling feature to generate synthetic data for just the rows in the minority label class. Because the model is trained on your real data, the generated synthetic data will mirror the column distributions and correlations between the columns in your real data.
By supplementing your training data with synthetic data that's conditionally sampled from the minority class, you can even achieve a 50-50 class balance.
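How many synthetic minority rows does that take? Here is a small worked calculation in plain Python, assuming a hypothetical dataset of 100,000 rows with 99,900 non-fraudulent and 100 fraudulent rows. The function name is made up for illustration, and the conditional sampling itself is not shown.

```python
import math


def rows_needed_for_balance(majority_count, minority_count, target_share=0.5):
    """Number of synthetic minority rows needed to reach the target minority share.

    Solves (minority + x) / (majority + minority + x) = target_share for x.
    Assumes 0 < target_share < 1.
    """
    total = majority_count + minority_count
    needed = (target_share * total - minority_count) / (1 - target_share)
    return max(0, math.ceil(needed))


majority, minority = 99_900, 100
extra = rows_needed_for_balance(majority, minority)
final_share = (minority + extra) / (majority + minority + extra)
print(extra, final_share)
```

That number is what you would request from the synthesizer when conditionally sampling rows for the minority label.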
Learn more about our Conditional Sampling feature here: https://t.co/6gvlKikZGB