Plamen

Maths and software. AI, my kids and motorbikes. Former researcher @lab_dai, @MIT. Co-created @sdv_dev. Co-founder of @oassisai, @datacebo and @pythiac_com.

9 months ago

@karpathy I see in the README that this is under the MIT license. Will it stay that way?

Carles Sala

@csalacat

Husband, dad and mediocre web developer

10 months ago

@JustDevZero And since I left Andromeda we've seen each other fewer than 7 times 😅

10 months ago

WoW 7 Years... I think I average 1 tweet per year 😅 #MyXAnniversary

11 months ago

Funny how I can intuitively write scalable software, but building an IKEA cabinet turns me into a caveman discovering tools for the first time. 🐵

about 1 year ago

@levelsio The one victim I see there is the TV sitting on the ground. You need some VC funding for furniture or tv stands.

pvkdeveloper retweeted

Synthetic Data Vault

@sdv_dev

over 1 year ago

Last week, we shared a synthetic populations dataset for the United States but this week we’re sharing one published by researchers for the whole world. 🌏

Marijn Ton et al. released a gigantic synthetic population dataset that represents ~𝟳.𝟯𝟯 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝘂𝗺𝗮𝗻𝘀, which matches the 2015 human population count, and ~𝟭.𝟵𝟵 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝗼𝘂𝘀𝗲𝗵𝗼𝗹𝗱𝘀.

𝗧𝗵𝗲 𝗠𝗼𝘁𝗶𝘃𝗮𝘁𝗶𝗼𝗻
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior.

According to the authors – “𝘍𝘰𝘳 𝘦𝘹𝘢𝘮𝘱𝘭𝘦, 𝘪𝘯𝘵𝘦𝘨𝘳𝘢𝘵𝘦𝘥 𝘢𝘴𝘴𝘦𝘴𝘴𝘮𝘦𝘯𝘵 𝘮𝘰𝘥𝘦𝘭𝘴 𝘰𝘧 𝘤𝘭𝘪𝘮𝘢𝘵𝘦 𝘤𝘩𝘢𝘯𝘨𝘦 𝘵𝘺𝘱𝘪𝘤𝘢𝘭𝘭𝘺 𝘢𝘴𝘴𝘶𝘮𝘦 𝘢 𝘳𝘦𝘱𝘳𝘦𝘴𝘦𝘯𝘵𝘢𝘵𝘪𝘷𝘦 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳 𝘰𝘧 𝘢 𝘴𝘪𝘯𝘨𝘭𝘦 𝘢𝘷𝘦𝘳𝘢𝘨𝘦 𝘨𝘭𝘰𝘣𝘢𝘭 𝘰𝘳 𝘳𝘦𝘨𝘪𝘰𝘯𝘢𝘭 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳.”

By creating a synthetic individuals dataset that’s consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they’re hoping to improve the data and assumptions used in global impact simulations.

𝗧𝗵𝗲𝗶𝗿 𝗗𝗮𝘁𝗮 𝗦𝗼𝘂𝗿𝗰𝗲𝘀
The team primarily used data from two databases:

• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for medium- and high-income countries.

• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.

Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that lacked reliable published statistics were excluded.

This is a great dataset to explore geospatial visualizations or to build regional or global impact models.

📚 Link to the paper: https://t.co/1Uq61TGmox
🗄️ Link to the dataset: https://t.co/vx07ezoFKF

#syntheticdata #machinelearning #generativeai

Kudos to the researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts

Credit to Nature magazine and the authors for the image showcasing the population coverage and data source for each country.

pvkdeveloper retweeted

Synthetic Data Vault

@sdv_dev

over 1 year ago

In 2024, synthetic data routinely made headlines alongside many AI product launches. 𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝗼𝘂𝗿 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻𝘀 𝗳𝗼𝗿 𝟮𝟬𝟮𝟱 🔮

𝟭. 𝗧𝗵𝗲 𝗿𝗶𝘀𝗲 𝗼𝗳 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗜 𝘄𝗶𝗹𝗹 𝗿𝗲𝘀𝘂𝗹𝘁 𝗶𝗻 𝗮 𝗻𝘂𝗺𝗯𝗲𝗿 𝗼𝗳 𝗟𝗟𝗠-𝗯𝗮𝘀𝗲𝗱 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝘁𝗼𝗼𝗹𝘀 𝗳𝗼𝗿 𝘁𝗮𝗯𝘂𝗹𝗮𝗿 𝗱𝗮𝘁𝗮. 𝗡𝗼𝗻𝗲 𝘄𝗶𝗹𝗹 𝗱𝗲𝗹𝗶𝘃𝗲𝗿 𝗼𝗻 𝘁𝗵𝗲 𝗽𝗿𝗼𝗺𝗶𝘀𝗲, 𝗯𝘂𝘁 𝘁𝗵𝗶𝘀 𝗽𝗿𝗼𝗰𝗲𝘀𝘀 𝘄𝗶𝗹𝗹 𝗵𝗲𝗹𝗽 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀 𝗱𝗲𝗳𝗶𝗻𝗲 𝗿𝗲𝗾𝘂𝗶𝗿𝗲𝗺𝗲𝗻𝘁𝘀.
Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver, the process will lead to the development of much more concrete requirements for tabular synthetic data generators.

𝟮. 𝗖𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝘄𝗶𝗹𝗹 𝗳𝗮𝗰𝗲 𝗮 𝗳𝗿𝗲𝗲𝘇𝗲 𝗶𝗻 𝗱𝗮𝘁𝗮 𝗮𝘀𝘀𝗲𝘁 𝗮𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗱𝘂𝗲 𝘁𝗼 𝗿𝗲𝗴𝘂𝗹𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗱𝗲𝗰𝗹𝗶𝗻𝗶𝗻𝗴 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗰𝗼𝗻𝘀𝗲𝗻𝘁.
Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution.

𝟯. 𝗘𝘃𝗲𝗿𝘆 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 𝘄𝗶𝗹𝗹, 𝗮𝘁 𝘁𝗵𝗲 𝘃𝗲𝗿𝘆 𝗹𝗲𝗮𝘀𝘁, 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝘄𝗶𝘁𝗵 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗶𝗻 𝟮𝟬𝟮𝟱 𝗮𝘀 𝗽𝗮𝗿𝘁 𝗼𝗳 𝘁𝗵𝗲𝗶𝗿 𝗯𝗿𝗼𝗮𝗱𝗲𝗿 𝗔𝗜 𝗱𝗮𝘁𝗮 𝘀𝘁𝗿𝗮𝘁𝗲𝗴𝘆.
Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.

𝟰. 𝗦𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗳𝗼𝗿 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 𝘄𝗶𝗹𝗹 𝗯𝗲𝗰𝗼𝗺𝗲 𝗮 𝗺𝗼𝗿𝗲 𝗽𝗿𝗲𝘀𝘀𝗶𝗻𝗴 𝗻𝗲𝗲𝗱.
Enterprises will need additional data to train more robust AI agents and synthetic data can help fill the gap.

𝟱. 𝗘𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀 𝘄𝗶𝗹𝗹 𝗴𝗮𝗶𝗻 𝗯𝗶𝗴 𝗳𝗿𝗼𝗺 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝘁𝗮𝗯𝘂𝗹𝗮𝗿 𝗱𝗮𝘁𝗮 𝗮𝗻𝗱 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀.
While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.

📖 Read more about our 2025 predictions and our 2024 recap here: https://t.co/HVohrewK4F

#generativeai #ai #openai #syntheticdata #machinelearning

pvkdeveloper retweeted

rohit

@seatedro

over 1 year ago

almost 2 years ago

@sdv_dev @datacebo

pvkdeveloper retweeted

almost 2 years ago

One of our users exclaimed, "These speedups are insane!" Our multi-table synthesizer in SDV Enterprise, called HSA Synthesizer, runs in less than 1 minute what takes HMA Synthesizer an hour, across 20 datasets.

❇️ We have been focusing on multi-table synthesizers. A #syntheticdata platform must address the complexity of multi-table enterprise data at scale.

🔥 The 70x speedups fundamentally change how one uses #SDV. If you can model that fast and sample even faster, the need to save the model and version it goes away.

✅ What is more interesting is that these speedups have been achieved not by increasing the compute required, but by fundamentally changing the algorithms.

We are continuously evolving, with more to come. You can learn more about the trade-offs in this blog: https://t.co/yMOQB30Tx9

#syntheticdata, #generativeai, #performance -- @sdv_dev

pvkdeveloper retweeted

almost 2 years ago

#OTD in 2016 we submitted the final camera-ready version of the Massachusetts Institute of Technology paper ⭐️ The synthetic data vault ⭐️

The paper said:
"This synthetic data must meet two requirements:

1️⃣ First, it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.

2️⃣ Second, it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.

In order to meet these requirements, the data must be statistically modeled in its original form, so that we can sample from and recreate it. In our case and in most cases, that form is the database itself. Thus, modeling must occur before any transformations and aggregations are applied."

Today, #sdv counts millions of downloads and thousands of users, and many additional modules have been added to evaluate #syntheticdata, #benchmark models, and much more.

You can find the original paper here: https://t.co/NwXbBPafWL

#syntheticdata, #generativeai, #tabulardata, #ai, #machinelearning, #DataScience

pvkdeveloper retweeted

almost 2 years ago

https://t.co/paTvEJxYLw

pvkdeveloper retweeted