1/ Launching Dataset News — operator notes on datasets, evaluations, and licensing so teams can buy, benchmark, and ship with confidence. Signals > noise. Ops > hype.
👉 https://t.co/A28c9hN2oe
One man in Ireland is trying to lower the price of a Guinness across the country.
He used AI to create the "Guinndex", a map that shows the cost of a pint at almost 2,000 bars. Now bars are lowering their pint prices to compete with each other.
peace and love boss, 2min braindump no particular order w some pointers for your lit review
welcome to the community
+ open invite to join our oss ai data industry standards at croissant, >1.5m ai ready ds brought online across hf, kaggle, google, meta, neurips & counting
As we begin to agent-ify everything SaaS, data becomes a critical performance moat - and discovering data, accessing its value, and connecting to each endpoint becomes a massive, still-unsolved time-suck.
Enter @TryBrickroad.
Looking at the recent distillation allegations from Anthropic and others (Google made the claim two weeks ago, just not on Twitter) through a data lens, a few things stand out.
When access to strong weights is plentiful, data often feels like a second-order problem. Frontier models, and perhaps more strikingly, their downstream users, can get surprisingly far with a thin layer of fine-tuning data, some preference data, and a handful of evals.
Open weights have long been an adoption subsidy: they let a whole ecosystem ship products on frontier-model rails without building much of a supply chain. It’s part of why “data-centric AI” has struggled to break into the mainstream conversation the way data centers have. The other part is simpler - data procurement is really difficult.
If weights tighten and distillation starts getting treated as adversarial extraction, that subsidy ends. Builders have to move upstream. The most interesting projects won’t be the ones with the slickest wrapper. They’ll be the ones who can reliably source, license, and validate the right training and evaluation data, quickly, with provenance they can defend.
That’s also where the real ecosystem risk sits. Countries and enterprises that implicitly planned around “we’ll always have access to a strong model” may find that access becomes policy-mediated, contract-mediated, or both. When that happens, independence comes from controlling inputs and measurement: lawful data access, repeatable procurement, and evaluation discipline.
@TryBrickroad exists for that layer - the liquid data economy.
some more eyecandy on the data multiplexer architecture im working on. each node autonomously provisions, requests and routes data flow on behalf of their downstream stack. it really comes together from 10000 feet when you slice and dice the whole mesh. we are onboarding new nodes w our team @TryBrickroad and rolling out more of our research in the coming weeks: provisioning the right data flow at minimal transaction cost
A not-so-short overview of why we're working flat out to reduce procurement and eval frictions at @TryBrickroad
Deficiencies in available data force labs to overspend on compute or synthetic data to compensate - that's not good. We can help.
The profile of consumers is changing from humans to machines - agents knowing what data to consume, from whom, and when, will be critical to ensuring a smooth transition to fully agentic work processes.
Super stoked to be building the rails to make that happen with the team at @TryBrickroad
I am fairly certain this will be true - but it won't be if we don't solve data procurement.
Data procurement is fundamentally an analogue 1:1 process. It cannot scale without technology - the human data economy will never exceed $400B without tooling like @TryBrickroad
Congratulations to @LuisOala, co-founder and Chief AI Officer at https://t.co/Orn1igsZKH, as well as his co-authors, @ruoxijia, @JiachenWang97, @feiyang_ml, and @dawnsongtweets, on the release of their @NeurIPSConf position paper, "A Sustainable AI Economy Needs Data Deals That Work for Generators."
NeurIPS 2025 papers are out and our very own @LuisOala's paper on Efficient Data Markets is live!
Although we've been shipping daily, we'll be launching to the public in 4 weeks at NeurIPS.
The future of data is coming.
we are working on an animated 3d feature film on data deals written fully in TikZ (; ama
it will be coming to a neurips near you this winter
condensing ideas that lived rent-free in our heads for the last 2y
🫶 @ruoxijia wenjie suqin @JiachenWang97 @feiyang_ml @dawnsongtweets
1/ New post: Refreshing general-knowledge benchmarks without leakage
Here’s how to run evals without rewarding recollection—so you can buy, benchmark, and ship with confidence.
👉 https://t.co/uB5AddyCON
11/ Traps to avoid:
• Tuning on the split you’ll brag about
• Shuffling choices without logging the seed
• Cross-month LiveBench comparisons
• Publishing numbers with no guess baseline or judge config
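Two of the traps above (unlogged shuffle seeds, no guess baseline) are cheap to avoid in code. A minimal sketch, assuming each multiple-choice item is a dict with a `choices` list and an `answer` index; the function names and record layout are hypothetical, not taken from any particular eval harness:

```python
import random


def shuffle_choices(choices, answer_index, seed):
    """Shuffle answer options with a seeded RNG.

    Returns the shuffled options and the new index of the correct answer.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


def guess_baseline(items):
    """Expected accuracy of uniform random guessing: mean of 1 / number of choices."""
    return sum(1 / len(item["choices"]) for item in items) / len(items)


def build_run_record(items, seed):
    """Shuffle every item and keep the seed and guess baseline next to the data.

    Assumption: each item gets seed + position so the order is reproducible
    per item; any deterministic derivation would work as long as it is logged.
    """
    shuffled_items = []
    for offset, item in enumerate(items):
        choices, answer = shuffle_choices(item["choices"], item["answer"], seed + offset)
        shuffled_items.append({"choices": choices, "answer": answer})
    return {
        "seed": seed,
        "guess_baseline": guess_baseline(items),
        "items": shuffled_items,
    }
```

Publishing the returned record (or at least its `seed` and `guess_baseline` fields) alongside the score lets a reader reproduce the option order and see how far the reported number sits above chance.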