Ayush Chaurasia @loldedxd - Twitter Profile

Pinned Tweet

almost 2 years ago

Quick announcement- I no longer maintain yolo or work on adding features & answering issues. I appreciate all the lovely DMs and emails and I try not to ignore any. Also, I haven't kept up with latest developments from past months so I can't really answer those anyways. Thanks!

0

31

2

6

6K

loldedxd retweeted

Sachin

@sachdh

10 days ago

glad that we share the thesis with @EngramLab model specialization is not nice to have; but is a requirement for building truly autonomous agents congrats to @dan_biderman and the whole team

sachdh's tweet photo. glad that we share the thesis with @EngramLab

model specialization is not nice to have; but is a requirement for building truly autonomous agents

congrats to @dan_biderman and the whole team https://t.co/gayd6uiljJ

1

26

1

7

3K

Ayush Chaurasia @loldedxd

13 days ago

@ChanChan_Mao This counts as a post…

0

99

Ayush Chaurasia @loldedxd

16 days ago

Noam is a legend. This is the conclusion of his GLU paper ``` We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence. ``` https://t.co/Kzd851d3uc

0

1

0

76

Who to follow

Aurimas Griciūnas

@Aurimas_Gr

🔨 Founder & CEO @ SwirlAI 📖 Writing about #LLM, #AI, #DataEngineering, #MachineLearning and #Data ✍️ Author of SwirlAI Newsletter.

Darek Kłeczek

@dk21

Machine Learning, Kaggle and occasional pictures from Poland. LLM/AI Research at Snowflake. 4x Kaggle Grandmaster. Personal stuff only.

François Fleuret

@francoisfleuret

Research Scientist @meta (FAIR), Prof. @Unige_en, co-founder @neural_concept_. I like reality.

loldedxd retweeted

Hynek Kydlíček

@HKydlicek

23 days ago

We're launching Macrodata Labs. Me and @gui_penedo have spent the past three years in the trenches working on data for training LLMs. This gave us a unique perspective on how the field has progressed - from GPT-3-era models capable of little more than simple completions to today, where agents are writing a substantial share of the code being shipped. This progress was enabled by just two components: scaling data and compute while being extremely deliberate about what data to use and what not to use. Look at failed training runs and ask researchers what caused them - poor data quality is almost always at the top of the list. While LLMs have undergone this Cambrian explosion, robotics today feels exactly like LLMs did back then. There is still no clear recipe for what will work. Every team has its own opinions on embodiment and architecture, yet they all agree on one thing: the most important problem to solve is data and how to scale it. Nobody knows yet whether the answer lies in simulation data, egocentric data, IMU data, or something completely different. Whatever the answer turns out to be, every team still has to go through the same process: acquiring the data, filtering problematic episodes, synchronizing sensor values, annotating episodes using VLMs, splitting episodes into subtasks, or, in the case of egocentric data, extracting 21-DOF hand annotations. Finally, all of this has to be converted into training-ready datasets before training starts as choosing a bad format for training will waste GPU cycles. These pipelines need to run continuously. Every day, new episodes arrive from in-house data collection efforts and external vendors. Teams not only have to deal with the peculiarities of working with video data, ensuring sensor streams are error-free and avoiding unnecessary video decoding, but also need to support ingestion from whatever formats their data vendors provide. Many teams are solving these problems today, yet you'll quickly discover that 99% of the solutions are collections of one-off scripts, which everyone hates the moment something goes wrong. Researchers end up digging through repositories trying to find the script that performed a particular operation three months ago, not even knowing whether they're looking at the version that was actually run. What people want is something as scalable as Spark and as trackable as Weights & Biases. That is what we created Macrodata Labs to build. Our first step is Refiner, an open-source framework for processing robotics datasets. We designed Refiner to help robotics teams turn raw demonstrations into training-ready datasets. Instead of maintaining collections of one-off scripts, teams can use Refiner to ingest heterogeneous robotics data, synchronize sensors, run annotation workflows, extract signals like hand tracking, split trajectories into subtasks, and continuously process new data as it arrives. Alongside Refiner, we're also launching Refiner Cloud. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, failure recovery, lineage tracking, and observability built in—so teams can focus on what matters most: data, not infrastructure plumbing. We're incredibly fortunate to have the support of @airstreet , @DrysdaleVC , OPRTRS Club, @kimaventures , YG (Alex Yazdi), >commit, @Thom_Wolf , and an amazing group of angels from leading AI labs and technology companies who share our belief that data will be one of the defining challenges in robotics. If this resonates with you, give Refiner a try, and don't hesitate to shoot me a message. We'd love to chat.

HKydlicek's tweet photo. We're launching Macrodata Labs.

Me and @gui_penedo have spent the past three years in the trenches working on data for training LLMs. This gave us a unique perspective on how the field has progressed - from GPT-3-era models capable of little more than simple completions to today, where agents are writing a substantial share of the code being shipped.

This progress was enabled by just two components: scaling data and compute while being extremely deliberate about what data to use and what not to use. Look at failed training runs and ask researchers what caused them - poor data quality is almost always at the top of the list.

While LLMs have undergone this Cambrian explosion, robotics today feels exactly like LLMs did back then. There is still no clear recipe for what will work. Every team has its own opinions on embodiment and architecture, yet they all agree on one thing: the most important problem to solve is data and how to scale it.

Nobody knows yet whether the answer lies in simulation data, egocentric data, IMU data, or something completely different.

Whatever the answer turns out to be, every team still has to go through the same process: acquiring the data, filtering problematic episodes, synchronizing sensor values, annotating episodes using VLMs, splitting episodes into subtasks, or, in the case of egocentric data, extracting 21-DOF hand annotations. Finally, all of this has to be converted into training-ready datasets before training starts as choosing a bad format for training will waste GPU cycles.

These pipelines need to run continuously. Every day, new episodes arrive from in-house data collection efforts and external vendors. Teams not only have to deal with the peculiarities of working with video data, ensuring sensor streams are error-free and avoiding unnecessary video decoding, but also need to support ingestion from whatever formats their data vendors provide.

Many teams are solving these problems today, yet you'll quickly discover that 99% of the solutions are collections of one-off scripts, which everyone hates the moment something goes wrong. Researchers end up digging through repositories trying to find the script that performed a particular operation three months ago, not even knowing whether they're looking at the version that was actually run.

What people want is something as scalable as Spark and as trackable as Weights & Biases.

That is what we created Macrodata Labs to build.

Our first step is Refiner, an open-source framework for processing robotics datasets.

We designed Refiner to help robotics teams turn raw demonstrations into training-ready datasets. Instead of maintaining collections of one-off scripts, teams can use Refiner to ingest heterogeneous robotics data, synchronize sensors, run annotation workflows, extract signals like hand tracking, split trajectories into subtasks, and continuously process new data as it arrives.

Alongside Refiner, we're also launching Refiner Cloud. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, failure recovery, lineage tracking, and observability built in—so teams can focus on what matters most: data, not infrastructure plumbing.

We're incredibly fortunate to have the support of @airstreet , @DrysdaleVC , OPRTRS Club, @kimaventures , YG (Alex Yazdi), >commit, @Thom_Wolf , and an amazing group of angels from leading AI labs and technology companies who share our belief that data will be one of the defining challenges in robotics.

If this resonates with you, give Refiner a try, and don't hesitate to shoot me a message. We'd love to chat.

2

45

8

5

4K

Ayush Chaurasia @loldedxd

26 days ago

@pjreddie You are a legend 🫡

0

276

loldedxd retweeted

LanceDB

@lancedb

28 days ago

1/ May's highlight: stable-worldmodel paper published, standardizing world model pipelines on Lance, with Lance-backed S3 streaming several times faster than HDF5 for small-batch random access.

lancedb's tweet photo. 1/ May's highlight: stable-worldmodel paper published, standardizing world model pipelines on Lance, with Lance-backed S3 streaming several times faster than HDF5 for small-batch random access. https://t.co/7LcDoM4vkt

2

12

3

8

1K

loldedxd retweeted

LanceDB

@lancedb

29 days ago

1/ 3-4x faster data loading on Push-T vs HDF5 or video formats at a fraction of the disk size. stable-worldmodel uses Lance as the data layer — here's the training walkthrough.

1

8

3

1

682

loldedxd retweeted

Gabriele Berton

@gabriberton

about 1 month ago

Great quote that every AI researcher should aspire to: "I do things, which work, rather than being novel, but not working." - @ducha_aiki

15

163

15

46

14K

loldedxd retweeted

Randall Balestriero

@randall_balestr

about 1 month ago

Our stable-worldmodel library: - open-source - well tested and documented - comes with tons of envs and supports your custom ones - collect data, pretrain, plan, evaluate, ... - scalable to any cluster (thanks to our stable-pretraining backend) Time to focus on research!

2

116

11

56

11K

loldedxd retweeted

Lucas Maes

@lucasmaes_

about 1 month ago

Most work still relies on MP4 videos ... creating a major bottleneck that kills GPU utilization and training throughput. Not anymore. We teamed up with the incredible @lancedb team to build blazing-fast data loading with native @huggingface bucket streaming support! ⚡

lucasmaes_'s tweet photo. Most work still relies on MP4 videos ... creating a major bottleneck that kills GPU utilization and training throughput.

Not anymore.

We teamed up with the incredible @lancedb team to build blazing-fast data loading with native @huggingface bucket streaming support! ⚡ https://t.co/OU61ZcgTnp

2

24

6

5

3K

loldedxd retweeted

Lucas Maes

@lucasmaes_

about 1 month ago

This work was co-created with @quentinlldc! Huge thanks to all our incredible contributors and collaborators: @rwgao @luizfysouza @_fracapuano @ylecun @randall_balestr and the whole team 🙏 Paper: https://t.co/52HS8grawi Code: https://t.co/0w3peE2suD

3

39

7

17

18K

loldedxd retweeted

Lucas Maes

@lucasmaes_

about 1 month ago

Would you like to join the research effort on JEPA and World Models easily? After a full year of hard work, we’re excited to finally release stable-worldmodel: an open-source, scalable platform built to accelerate JEPA & World Model research! 📄: https://t.co/gnxGvens5A

lucasmaes_'s tweet photo. Would you like to join the research effort on JEPA and World Models easily?

After a full year of hard work, we’re excited to finally release stable-worldmodel:

an open-source, scalable platform built to accelerate JEPA & World Model research!

📄: https://t.co/gnxGvens5A

42

2K

279

2K

116K

Ayush Chaurasia @loldedxd

about 1 month ago

@kunchenguid the argument for harness is that a custom built harness augments the model. so the model doesnt have to do the heavy lifting alone. if you can get by with a simpler post-training/fine-tuning the model with some harness engineering, its a win.

0

35

Ayush Chaurasia @loldedxd

about 1 month ago

@GiorgiaMeloni man of culture

0

17

loldedxd retweeted

Quentin Lhoest 🤗 @lhoestq

about 2 months ago

The repo is public ! https://t.co/DvTy4B7tv4

0

17

3

12

6K

loldedxd retweeted

Quentin Lhoest 🤗 @lhoestq

about 2 months ago

lance for robotics ? 👀

1

20

5

7

3K

Ayush Chaurasia @loldedxd

about 2 months ago

@HKydlicek @lhoestq @tech_optimist Yeah exactly i agree 🤣 but there’s a bunch of decoding-on-fly flows online so i just wanted to confirm the workflow. I’m working on something similar for training/dataloading with lance for VLM in general. Maybe some of it could be helpful. I’ll DM you!

0

1

0

65

Ayush Chaurasia @loldedxd

about 2 months ago

@lhoestq @HKydlicek @tech_optimist Hi @HKydlicek are you sampling frames from a video on the fly in the loader or this is just reading frames directly from s3.

1

0

41