We are introducing EU Inc. to make building and growing a business across the EU faster, simpler, and smarter.
🔸 Start a company in less than 48 hours
🔸 No minimum capital requirement
🔸 Fully online and borderless
@Dexerto Good thing you can block YouTube Shorts with Brave. 🦁
Here's how to do it in our browser:
Android/iOS:
1) Go to Settings -> Media
2) Enable "Block YouTube Shorts"
Desktop:
1) Go to Settings -> Shields -> Content Filtering
2) Enable "YouTube Anti-Shorts"
A free throwback MIT course breaking down how machine learning techniques can be applied to healthcare: https://t.co/TrQlckLh8o (v/@MITOCW)
Here, MIT prof. & CSAIL principal investigator David Sontag discusses how AI can help sort thru medical data (Lecture 1).
Last week, we shared a synthetic population dataset for the United States. This week, we're sharing one that researchers published for the whole world.
Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans, which matches the 2015 human population count, and ~1.99 billion households.
The Motivation
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior.
According to the authors: "For example, integrated assessment models of climate change typically assume a representative consumer of a single average global or regional consumer."
By creating a dataset of synthetic individuals that's consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they're hoping to improve the data and assumptions used in global impact simulations.
Their Data Sources
The team primarily used data from two databases:
• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for middle- and high-income countries.
• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.
Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that lacked reliable published statistics were excluded.
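To make "generated using regional statistics" concrete, here is a minimal sketch of drawing synthetic individuals whose attribute shares match published regional shares. This is not the authors' method: the attribute names and the numbers in REGION_STATS are made up for illustration, and each attribute is drawn independently, which a real generator would not do because it has to preserve relationships between attributes.

```python
import random
from collections import Counter

# Hypothetical published shares for one administrative level 1 region.
# These numbers are invented for illustration; they are not from the paper.
REGION_STATS = {
    "age_group": {"0-14": 0.25, "15-64": 0.65, "65+": 0.10},
    "household_size": {1: 0.20, 2: 0.30, 3: 0.25, 4: 0.25},
}


def sample_individuals(stats, n, seed=0):
    """Draw n synthetic individuals, one independent draw per attribute."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        person = {}
        for attribute, shares in stats.items():
            values = list(shares.keys())
            weights = list(shares.values())
            person[attribute] = rng.choices(values, weights=weights, k=1)[0]
        people.append(person)
    return people


def observed_shares(people, attribute):
    """Share of each attribute value in the synthetic sample."""
    counts = Counter(person[attribute] for person in people)
    return {value: count / len(people) for value, count in counts.items()}


population = sample_individuals(REGION_STATS, n=20_000)
print(observed_shares(population, "age_group"))
```

With a large enough sample, the printed shares land close to the target shares, which is the sense in which a synthetic population is "consistent with" published statistics.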
This is a great dataset for exploring geospatial visualizations or building regional or global impact models.
Link to the paper: https://t.co/1Uq61TGmox
Link to the dataset: https://t.co/vx07ezoFKF
#syntheticdata #machinelearning #generativeai
Kudos to the researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts
Credit to Nature and the authors for the image showcasing the population coverage and data source for each country.
In 2024, synthetic data routinely made headlines alongside many AI product launches. Here are our predictions for 2025 🔮
1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.
Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Although these tools will be tested and will fail to deliver, the process will lead to much more concrete requirements for tabular synthetic data generators.
2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.
Stricter privacy and security regulations and growing customer awareness of privacy will make it harder to use customer data to train AI models. As usable data runs out, companies will turn to synthetic data as a viable solution.
3. Every company will, at the very least, experiment with synthetic data in 2025 as part of their broader AI data strategy.
Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data, as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.
4. Synthetic data for training AI agents will become a more pressing need.
Enterprises will need additional data to train more robust AI agents, and synthetic data can help fill the gap.
5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents.
While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.
Read more about our 2025 predictions and our 2024 recap here: https://t.co/HVohrewK4F
#generativeai #ai #openai #syntheticdata #machinelearning
One of our users exclaimed, "These speedups are insane!" Across 20 datasets, HSA Synthesizer, our multi-table synthesizer in SDV Enterprise, runs in less than 1 minute what takes HMA Synthesizer an hour.
We have been focusing on multi-table synthesizers. A #syntheticdata platform must address the complexity of multi-table enterprise data at scale.
The 70x speedups fundamentally change how one uses #SDV. If you can model that fast and sample even faster, the need to save a model and version it goes away.
What is more interesting is that these speedups were achieved not by increasing the compute required, but by fundamentally changing the algorithms.
We are continuously evolving, and there is more to come.
You can learn more about the trade-offs in this blog: https://t.co/yMOQB30Tx9
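As a quick sanity check on how the numbers above fit together, here is a small sketch of the speedup arithmetic. The runtimes are illustrative round numbers (exactly one hour for the baseline), not measurements, and the function names are ours.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new runtime is than the baseline."""
    return baseline_seconds / new_seconds


def runtime_at_speedup(baseline_seconds, factor):
    """Runtime implied by a given speedup factor over the baseline."""
    return baseline_seconds / factor


one_hour = 60 * 60  # assumed baseline runtime, in seconds

# Going from an hour to exactly one minute is a 60x speedup.
print(speedup(one_hour, 60))  # 60.0

# A 70x speedup over an hour implies a runtime of roughly 51 seconds,
# which is consistent with "less than 1 minute".
print(round(runtime_at_speedup(one_hour, 70), 1))  # 51.4
```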
#syntheticdata, #generativeai, #performance -- @sdv_dev
#OTD in 2016, we submitted the final camera-ready version of the Massachusetts Institute of Technology paper ⭐️ The Synthetic Data Vault ⭐️
The paper said:
"This synthetic data must meet two requirements:
1️⃣ First, it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2️⃣ Second, it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.
In order to meet these requirements, the data must be statistically modeled in its original form, so that we can sample from and recreate it. In our case and in most cases, that form is the database itself. Thus, modeling must occur before any transformations and aggregations are applied."
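The two requirements in the quote can be illustrated with a toy example. The sketch below is not the SDV algorithm: it fits an independent, very simple model per column (a Gaussian for numbers, observed frequencies for categories) on a made-up table, samples new rows, and then checks structural and statistical resemblance. The table, column names, and helper functions are all hypothetical.

```python
import random
import statistics
from collections import Counter

# Toy "real" table; column names and values are made up for illustration.
REAL_ROWS = [
    {"age": 34, "plan": "basic"},
    {"age": 45, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "basic"},
    {"age": 41, "plan": "premium"},
    {"age": 38, "plan": "basic"},
]


def fit(rows):
    """Learn one simple model per column from the table in its original form."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            model[column] = ("categorical", Counter(values))
    return model


def sample(model, n, seed=0):
    """Sample n synthetic rows with the same columns as the fitted table."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                _, mean, stdev = spec
                row[column] = round(rng.gauss(mean, stdev))
            else:
                counts = spec[1]
                row[column] = rng.choices(list(counts), weights=list(counts.values()), k=1)[0]
        rows.append(row)
    return rows


model = fit(REAL_ROWS)
synthetic = sample(model, n=1000)

# Structural resemblance: same columns and value types as the original table.
same_structure = all(
    row.keys() == REAL_ROWS[0].keys()
    and all(type(row[c]) is type(REAL_ROWS[0][c]) for c in row)
    for row in synthetic
)

# Statistical resemblance: a summary statistic roughly matches.
real_mean = statistics.mean(r["age"] for r in REAL_ROWS)
synthetic_mean = statistics.mean(r["age"] for r in synthetic)
print(same_structure, round(real_mean, 1), round(synthetic_mean, 1))
```

Because the synthetic rows keep the original columns and types, code written against the real table can run unchanged on the synthetic one, which is the point of the second requirement.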
Today, #sdv counts millions of downloads and thousands of users, and many additional modules have been added to evaluate #syntheticdata, #benchmark models, and much more.
You can find the original paper here: https://t.co/NwXbBPafWL
#syntheticdata, #generativeai, #tabulardata, #ai, #machinelearning, #DataScience