We've uploaded the slides from the #CVPR2026 VGI (Visual General Intelligence) Workshop.
Robert Geirhos
Are generative video models the path towards solving visual intelligence?
https://t.co/mgXhMEd51r
Andrej Karpathy spent 2h showing how he actually uses AI day to day
he's a co-founder of OpenAI and led AI at Tesla, so when he shows how he works, it’s worth watching
and the whole session is just him telling the machine what he wants in simple terms, like he's briefing a coworker
watch what's actually happening the entire time:
> he describes the task in normal words
> it goes off and does the work
> he glances at the result and nudges it with one more sentence
that's the whole skill, and you've had it since you learned to talk
the only gap between that and a worker that runs on its own is handing that sentence a schedule and the tools to act
check his work, then build the version that keeps working when you stop
If your conference talk is good, you should upload it somewhere (I use YouTube).
If your talk is not good enough for you to feel comfortable uploading it somewhere, then it is *definitely* not good enough to present at the conference, and you should fix it.
A year ago I argued ML saved CV. CVPR’26 best paper D4RT shows how vision isn’t being blindly assimilated by AI - it’s undergoing architectural consolidation. Unifying tracking, depth&pose into a 4D model speeds up pose estimation 100x. Geometric vision remains alive & relevant
Introducing D4RT: A unified AI model for 4D scene reconstruction and tracking across space and time. 🎯 Catch the demo with Skanda Koppula at 12 pm at our #CVPR2026 Google booth kiosk! https://t.co/p6SclNe1zi @GoogleDeepMind
Short history of last 6 years of image matching (all by transformers):
2020: @pesarlin SuperGlue
2023: @Vinc3nt_Leroy DUSt3R
2024: @Parskatt RoMa
2025: @jianyuan_wang VGGT
2026: @davnords "hold my beer" (scales LightGlue)
2026: @jianyuan_wang "no, you hold MY beer" (scales VGGT)
This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.
More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a third of our brains are a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage:
1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations
Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral https://t.co/z21CP5iQfu
There are also improvements necessary and pending at the input. Neither audio, text, nor video alone is enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.
TLDR: The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. As for what's worth exploring at the current stage, hot tip: try asking for HTML.
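For anyone who wants to script the HTML tip, here is a minimal sketch, not a definitive workflow. It assumes you already have some way to send a prompt to your LLM and get text back; that part is left out on purpose. The helper names (`with_html_request`, `save_html`) and the exact wording of the appended instruction are made up for illustration.

```python
from pathlib import Path

# Assumed wording of the instruction; adjust to taste.
HTML_SUFFIX = "Structure your response as HTML."


def with_html_request(query: str) -> str:
    """Append the HTML instruction to the end of a query."""
    return f"{query.rstrip()}\n\n{HTML_SUFFIX}"


def save_html(response_text: str, path: Path) -> Path:
    """Write the model's HTML reply to disk so a browser can render it."""
    path.write_text(response_text, encoding="utf-8")
    return path
```

To view the result, the standard library is enough: `webbrowser.open(path.resolve().as_uri())` opens the saved file in your default browser.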
After reading this blog by @willccbb, I fell down a rabbit hole on On-Policy Distillation.
Here's my breakdown: the problem, the existing fixes, why they fall short, and what OPD actually changes.
A thread 🧵
There’s a serious gap in multimodal models – they work with images, but still reason in language, which isn’t that precise for visual stuff.
@deepseek_ai just dropped an idea to solve this: let the model literally point to exact locations in the image while it thinks.
They call it "Thinking with Visual Primitives."
These visual primitives are:
- points (specific locations)
- bounding boxes (areas in the image)
Using them, the model knows exactly what it's referring to and achieves ~77% accuracy on average (vs. 76.5% for Gemini 3 Flash and 71.1% for GPT-5.4)
Plus, only ~80–90 visual tokens are kept in memory after compression thanks to the efficient architecture
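To make the two primitives concrete, here is a rough sketch of how points and bounding boxes could be represented in code. This is an illustration only, not DeepSeek's format: the class names, the pixel-coordinate convention (origin at top-left, `x_min`/`y_min`/`x_max`/`y_max` corners), and the `contains` helper are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A specific location in the image (assumed pixel coordinates)."""
    x: float
    y: float


@dataclass(frozen=True)
class BoundingBox:
    """A rectangular area in the image (assumed corner convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Area of the box; zero if the corners are degenerate or inverted."""
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height

    def contains(self, point: Point) -> bool:
        """True if the point lies inside the box, edges included."""
        return (self.x_min <= point.x <= self.x_max
                and self.y_min <= point.y <= self.y_max)
```

With something like this, a reasoning step can carry an exact reference (a `Point` or a `BoundingBox`) instead of a loose verbal description such as "the object on the left".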
Here is how it works:
The Top AI Papers of the Week (April 26 - May 3)
- Latent Agents
- RecursiveMAS
- OneManCompany
- AgenticQwen-30B-A3B
- Agentic World Modeling
- Agentic Harness Engineering
- From Skill Text to Skill Structure
Read on for more:
Instead of watching an hour of Netflix, watch this 2-hour Stanford lecture.
It will teach you more about how LLMs like ChatGPT and Claude are actually built than most people in top AI companies learn across their entire careers.
Save this.
Anthropic just officially published the blueprint for building a company with Claude Code, and it's mind-blowing😭
CEO: 1 human (who sleeps)
Employees: several AIs
Operations: the AIs split up the tasks among themselves and move forward on their own
Work is literally dying... I've summarized the full guide in French, read it when you have 5 minutes ⤵️
If you want AI to work while you sleep → bookmark this 🔖
The personal knowledge base build, in 60 seconds:
Total setup: 45 minutes this weekend. Then it compounds forever.
1. 5 minutes: Setup
Create 3 folders: raw/, wiki/, outputs/. Drop a CLAUDE.md schema file in the root. Done.
2. 10 minutes: Dump
Copy-paste articles, notes, screenshots, meeting transcripts into raw/. Don't rename. Don't organize.
3. 30 minutes: Let the AI build
Point Claude at the folder. "Read everything in raw/. Compile a wiki following CLAUDE.md rules. Create INDEX.md first."
Walk away. Come back to organized articles, [[linked]] topics, and a searchable index.
4. Ongoing: The compounding loop
Ask questions. Save answers back to raw/. Every query makes the next answer better.
5. Monthly: Health check
Tell the AI to flag contradictions, find unexplained topics, and suggest 3 new articles to fill gaps.
The system gets smarter the longer you use it.
Day 1 it's basic. Day 90 it's a company asset nobody else has.
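If you'd rather script the 5-minute setup step than click through it, here is a minimal sketch. The folder names and `CLAUDE.md` come from the steps above; the function name `scaffold` and the placeholder schema text are assumptions, so replace the placeholder with your own rules.

```python
from pathlib import Path

FOLDERS = ("raw", "wiki", "outputs")

# Hypothetical placeholder; the real schema is whatever rules you want followed.
PLACEHOLDER_SCHEMA = """# CLAUDE.md
- Read sources from raw/ and never modify them.
- Write compiled articles to wiki/ and create INDEX.md first.
- Save generated answers and reports to outputs/.
"""


def scaffold(root: Path) -> list[Path]:
    """Create raw/, wiki/, outputs/ and a starter CLAUDE.md under root."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(exist_ok=True)
        created.append(folder)
    schema = root / "CLAUDE.md"
    # Never overwrite a schema the user has already edited.
    if not schema.exists():
        schema.write_text(PLACEHOLDER_SCHEMA, encoding="utf-8")
    return created
```

Calling `scaffold(Path("knowledge-base"))` is safe to repeat: existing folders and an existing `CLAUDE.md` are left untouched.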