Loren @murloren - Twitter Profile

Pinned Tweet

Loren @murloren

12 months ago

Last week, I was at CVPR presenting our latest work, which was selected as a ✨highlight✨ paper. @rmcantin

Loren @murloren

about 1 year ago

Very happy to announce that our latest work, "DIV-FF: Dynamic Image Video feature fields for egocentric vision", with @rmcantin and J. Guerrero, has been accepted at CVPR 2025!! https://t.co/p6UglbApSy Take a look 👇🧵

Loren @murloren

3 days ago

Happy to see my work V-JEPA 2.1 on Yann LeCun's slides

Chris Offner @chrisoffner3d

4 days ago

Here is @ylecun's recent lecture on world models at @ETH: https://t.co/TmSMz9FMzI

Loren @murloren

2 months ago

@massiviola01 Yes, but ConvNeXt-Tiny seems to perform better than ViT-Tiny. What is the reason? Maybe its inductive biases?

Loren @murloren

2 months ago

@massiviola01 Any insights into why the performance drop is so big for ViT-Tiny? Is there any intermediate option between the Tiny and Small ViTs?

Loren @murloren

3 months ago

@TrainUplifted @ylecun @AdrienBardes We also released two distilled versions from the ViT-G (2B params): ViT-B (80M params) and ViT-L (300M params)

Loren @murloren

3 months ago

I am very happy to share the result of my internship at FAIR (Meta): "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning", with @ylecun and @AdrienBardes.

Our approach learns dense, spatially coherent features from video while preserving strong global understanding. https://t.co/NVh5CZAGfZ

murloren retweeted

Grigory Sapunov

@che_shr_cat

3 months ago

7/ The numbers are brutal. NYUv2 depth RMSE drops from an unusable 0.682 to SOTA 0.307. ADE20K segmentation mIoU jumps +23.4 points. Ego4D anticipation hits 7.71 mAP. PCA features go from noisy static to sharp object boundaries. https://t.co/1dviLRmC6n

murloren retweeted

Grigory Sapunov

@che_shr_cat

3 months ago

2/ V-JEPA 2.1 by Mur-Labadia, Muckley, and the FAIR team fixes the global-local representation bottleneck. It unifies image and video representation learning into a single encoder. This is a massive step for embodied AI world models. https://t.co/c8040a0YA6

murloren retweeted

Massimiliano Viola @massiviola01

3 months ago

Thread on V-JEPA 2.1🤟

This DEFINITELY flew under the radar: just a few days ago, @AIatMeta released V-JEPA 2.1, taking a massive step toward closing the gap between image and video domains.

For a long time, image backbones were the only option for solving dense vision tasks. This model disagrees, showing that universal spatial understanding also emerges from large-scale video models!🎥

Loren @murloren

3 months ago

@ylecun @AdrienBardes V-JEPA 2.1's new feature maps are worth a look. Genuinely. 👀

- Paper: https://t.co/PN41ZNPrAs
- Code and models: ViT-G (2B), ViT-g (1B), ViT-L (300M) and ViT-B (80M) https://t.co/lX8hZnwH25 https://t.co/uYVOe6s1rj

Loren @murloren

3 months ago

@ylecun @AdrienBardes Global video understanding is possible with excellent dense features!!

- 77.7% Top-1 accuracy on SSv2 (new SOTA, motion-centric dataset)
- Competitive 87.7% on K400 and 85.5% on ImageNet-1K
- Semantically coherent, spatially aligned and temporally consistent features https://t.co/ilMm1jcYcI

murloren retweeted

alphaXiv

@askalphaxiv

3 months ago

Yann LeCun and his team dropped yet another paper!

"V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning"

In this V-JEPA upgrade, they showed that if you make a video model predict every patch, not just the masked ones, AND at multiple layers, you can turn vague scene understanding into dense + temporally stable features that actually understand "what is where".

This key insight drove improvements in segmentation, depth, anticipation, and even robot planning.

murloren retweeted

Pascale Fung

@pascalefung

6 months ago

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs.

• VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture.
• We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction.
• We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding.
• We highlight that our VL-JEPA model, with a unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time.

by @Delong0_0 @MustafaShukor1 @TheoMoutakanni @willyhcchung Jade Lei Yu Tejaswi Kasarla @AllenBolourchi @ylecun @pascalefung https://t.co/oUnjCaMKVv

murloren retweeted

Wes Roth

@WesRoth

6 months ago

Yann LeCun explains that large language models are trained on about 30 trillion words, representing nearly all public internet text. He says it would take a human over 500,000 years to read that much. But a 4-year-old child sees just as much visual data in their first few years of life. This shows how much richer and more complex real-world experience is compared to reading text. Training on the web is huge but it still doesn’t match what a child learns just by living.
