Efstathios Karypidis @k_sta8is - Twitter Profile

Pinned Tweet

over 1 year ago

1/n 🚀 Excited to share our latest work: DINO-Foresight, a new framework for predicting the future states of scenes using Vision Foundation Model features!
Links to the arXiv and Github 👇 https://t.co/9TtcCLhFm3

7

338

63

290

55K

K_Sta8is retweeted

Christian Wolf (🦋🦋🦋) @chriswolfvision

5 days ago

#ECCV2026 paper: A scalar per patch from pre-trained ViTs enables fast moving navigation in the real world

966 *REAL* nav episodes by S. Janny with Dino-v3, Dino-v2, DUNE, VC1, AM-RADIO encoders show that patch features can be bottlenecked to 1 value ➡️ affordances.

1/8 https://t.co/yz07qMch79

1

63

15

42

5K

Efstathios Karypidis @K_Sta8is

7 days ago

🎉Re2Pix accepted at #ECCV2026! 💡Should a world model predict future dynamics and render pixels simultaneously? Re2Pix says no. Forecast in VFM semantic space first 🧠, synthesize pixels second 🎨 Updated Paper and code coming soon. Details👇

Efstathios Karypidis @K_Sta8is

2 months ago

1/n 🔀 Pixel or latent world models? Video world models fall into two camps:
• generate photorealistic frames
• predict semantic features of the future (e.g., DINOv2)
Why choose one? We introduce Re2Pix, a hierarchical approach that combines both. 🧵👇

4

243

32

208

47K

2

76

15

36

10K

K_Sta8is retweeted

Junyao Shi

@JunyaoShi

11 days ago

Academia optimizes for novelty, which has become increasingly orthogonal to making things work. In practice it rewards benchmark-chasing, optics-maxing, and flag-planting. Sadly a major bitter lesson of robotics is: insights from the small-data, bad-system regime don’t transfer to the big-data, good-system one. The novelty we reward and the progress we need are pulling apart.

5

164

9

61

22K

K_Sta8is retweeted

Siqiao Huang

@KnightNemo_

about 1 month ago

In the last couple of months, we have witnessed significant advances in Industry-scale World Models. Yet, for the broader community, the gap between reading about these models and deploying them remains disappointingly wide. Today we're releasing Nano World Models: a minimalist, batteries-included repo for advancing world model science. 🧵 (1/9)

10

352

55

246

48K

K_Sta8is retweeted

Andrei Bursuc @abursuc

about 1 month ago

py123d: D. Dauner et al. did all the dirty work to unify the highly heterogeneous autonomous driving datasets into a single efficient data format.
nuscenes+nuplan+WOD+Physical AI AV, etc., are all there.
This is how you accelerate open-source AD research
https://t.co/tjGbc9vyvb https://t.co/udMxEwWZv9

0

65

15

41

5K

K_Sta8is retweeted

Panagiota Moraiti @panagiotamorai

about 2 months ago

New blog post out! 🍌 What if you could replace your entire computer vision pipeline with a single model and a text prompt? No more chaining separate models for segmentation, depth estimation and surface normal estimation. Just one model, one prompt.

1

2

1

0

88

K_Sta8is retweeted

Sham Kakade

@ShamKakade6

2 months ago

1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width. https://t.co/seJrdtszKJ

17

552

70

432

253K

K_Sta8is retweeted

Thodoris Kouzelis @ThKouz

2 months ago

1/n
Introducing CoReDi: Coevolving Representations for Joint Image–Feature Diffusion

Joint diffusion boosts image generation by injecting semantic features.
But one assumption goes unquestioned: the feature space is fixed.

What if it was learned instead? 🧵👇 https://t.co/WE3DpVZhhU

1

90

8

76

11K

K_Sta8is retweeted

Panagiota Moraiti @panagiotamorai

2 months ago

🚀 My latest contribution to the Roboflow Blog is live! 🦕 I wrote a deep dive into DINOv3, Meta’s self-supervised vision foundation model, exploring how to train it with @roboflow, no coding required. #AI #ComputerVision #DINOv3 #DINO #Roboflow #SelfSupervisedLearning

1

3

1

0

195

K_Sta8is retweeted

Spyros Gidaris @SpyrosGidaris

2 months ago

Revisiting “old-school” self-supervised tasks (rotation prediction) in a new way—using them during instruction tuning to improve visual grounding in MLLMs. Simple idea with nice gains on vision-heavy tasks 👀 Kudos to @sophia_sirko for leading this work https://t.co/jXdxQYLpYm

0

21

4

7

3K

K_Sta8is retweeted

Sophia Sirko-Galouchenko @sophia_sirko

2 months ago

1/n New paper - V-GIFT 🎁

Self-supervised tasks like rotation prediction or colorization were big in 2018.
Do they still matter?

Yes.
We turn them into visual instruction tuning data for MLLMs.

Result: models rely more on the image and perform better on vision tasks 👀 https://t.co/7R1frEliDO

3

86

23

36

12K

K_Sta8is retweeted

Vasiliki Vasileiou @SilaVasileiou

2 months ago

1/n🚀 Our paper “VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction” has been accepted at #CVPRW 2026!
📄 https://t.co/ZCgNNubsDn https://t.co/Fl2HbnJbfk

1

7

1

0

253

Efstathios Karypidis @K_Sta8is

2 months ago

9/n 🎯 We show that explicitly modeling hierarchical semantic structure can build more efficient and temporally consistent video prediction systems.
Paper: https://t.co/tgVtLyPUdV
Code (to be released soon!): https://t.co/6MeEpk1yKT
Joint work with @SpyrosGidaris and N. Komodakis

0

21

0

9

890

Efstathios Karypidis @K_Sta8is

2 months ago

1/n 🔀 Pixel or latent world models? Video world models fall into two camps:
• generate photorealistic frames
• predict semantic features of the future (e.g., DINOv2)
Why choose one? We introduce Re2Pix, a hierarchical approach that combines both. 🧵👇

4

243

32

208

47K

Efstathios Karypidis @K_Sta8is

2 months ago

8/n 👀 Here's what it looks like in practice! Bottom left shows the predicted DINOv2 semantic features guiding the generation. Re2Pix (bottom right) preserves scene structure and object boundaries much better than the baseline!

1

11

0

1

961

