Visual Geometry Grounded Transformer (VGGT) is a 1.2 billion parameter model developed by Meta AI and the University of Oxford, designed for high-speed, accurate 3D scene reconstruction. It acts as a foundational model for 3D vision, capable of inferring camera parameters, depth maps, point maps, and 3D point tracks from one to hundreds of images in a single forward pass.
VGGT-Ω now shows that the quality of VGGT models scales predictably with model and data size!
VGGT-Ω uses only ∼30% of the GPU memory of its predecessor, which allows us to train VGGT-Ω with 15× more supervised data than prior work and to leverage vast amounts of unlabeled video data.
Paper Title: VGGT-Ω
Project: https://t.co/R40nT0FDic
Link: https://t.co/P96gbuzwpu
Seed3D 1.0 explored end-to-end generation from a single image to high-quality 3D models and made notable progress in texture generation.
Seed3D 2.0 introduces a Coarse-to-Fine two-stage generation strategy that decouples "overall structure" from "fine details", allowing them to be optimized separately. This breakthrough tackles major geometry generation challenges, such as sharp edges, thin-walled structures, and complex topologies.
Seed3D 2.0 adopts a unified PBR generative model to jointly model the full set of PBR maps. It adopts an MoE architecture to improve high-resolution material details and boundary precision.
Paper Title: Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content
Project: https://t.co/xRnpvA7Va7
Link: https://t.co/gd7XGlmx5D
Time-Aligned Video-to-Music Generation!
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. V2M-ZERO is a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time.
Prompt: A tense urgent song, with electronic cinematic style, for a trailer action, high energy
Paper Title: V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Project: https://t.co/DtSYFmDjEt
Link: https://t.co/W5giOsEkKK
Vision Banana from DeepMind is a SOTA unified model for both image understanding and generation.
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks.
Paper Title: Image Generators are Generalist Vision Learners
Project: https://t.co/bJA7Lc35l5
Link: https://t.co/t5fyoF7hW7
Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure.
Rectified Dynamic Mesh (R-DMesh) is a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context.
Paper Title: R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
Project: https://t.co/nmOEhZZ6CM
Link: https://t.co/aIRR12qNLS
GOR-IS is a novel framework that enhances both the physical consistency and visual coherence of 3D object removal.
As shown in the example, it simultaneously removes the target object (red dashed box) and its corresponding reflection, while naturally inpainting the region previously occluded by the object.
Paper Title: GOR-IS: 3D Gaussian Object Removal in the Intrinsic Space
Project: https://t.co/Uk99zXHaDn
Link: https://t.co/RZTfqzwK6K
UniVidX is a unified multimodal framework designed to enable versatile video generation.
Recent progress has shown that video diffusion models (VDMs) can be repurposed to solve various multimodal graphics tasks. However, existing approaches typically train separate models for each specific problem setting. This practice not only ignores the joint correlations across modalities, but also locks models into fixed input-output mappings, severely limiting their flexibility.
It proposes three key designs:
1) Stochastic Condition Masking (SCM): by randomly partitioning modalities into clean conditions and noisy targets during training, we enable the model to learn omni-directional conditional generation rather than fixed mappings.
2) Decoupled Gated LoRA (DGL): we attach per-modality LoRAs and activate them when a modality serves as a generation target, thereby preserving the VDM's strong priors.
3) Cross-Modal Self-Attention (CMSA): we explicitly share keys/values across modalities while maintaining modality-specific queries, facilitating information exchange and inter-modal alignment.
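The first two designs can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the modality names, the 50% split probability, the rule that at least one modality must be a target, and the helper names `sample_condition_mask` and `lora_gates` are all hypothetical.

```python
import random

def sample_condition_mask(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumption: each modality becomes a target with probability 0.5, and we
    resample until at least one target exists (otherwise nothing is generated).
    """
    while True:
        is_target = {m: rng.random() < 0.5 for m in modalities}
        if any(is_target.values()):
            break
    conditions = [m for m in modalities if not is_target[m]]
    targets = [m for m in modalities if is_target[m]]
    return conditions, targets

def lora_gates(modalities, targets):
    """Gate each per-modality LoRA: on (1.0) only when that modality is a target."""
    return {m: (1.0 if m in targets else 0.0) for m in modalities}

rng = random.Random(0)
modalities = ["rgb", "depth", "normal", "albedo"]  # hypothetical modality names
conditions, targets = sample_condition_mask(modalities, rng)
gates = lora_gates(modalities, targets)
print("conditions:", conditions)
print("targets:", targets)
print("gates:", gates)
```

Because the split is redrawn at every training step, the same network sees many different input-output combinations, which is the intuition behind learning omni-directional generation rather than one fixed mapping.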
Prompt: “A small otter, wearing blue overalls and yellow safety helmet, is standing on a small ladder, holding a small hammer with its front paws, repairing a small wooden wall with a focused and serious expression. It is located in a workspace filled with tools, with warm lights illuminating the workbench and various small tools neatly hanging on the wall.”
Paper Title: UniVidX: A Unified Multimodal Framework for Versatile Video Generation
Project: https://t.co/9f1WhaHG3h
How Well Does GPT-4o Understand Vision?
This work benchmarks popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard semantic and geometric computer vision tasks using established datasets.
Considering a simple classification task, while MFMs didn't quite match the performance of specialized vision models like Model Soups ViT-G and OpenCLIP H, they showed impressive capabilities. GPT-4o emerged as the standout performer, followed by Gemini 2.0 Flash, Gemini 1.5 Pro, Claude 3.5 Sonnet, Qwen2-VL, o4-mini and Llama 3.2.
Paper Title: How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation
Project: https://t.co/jpSoMaqIcj
Link: https://t.co/UeX6xHEv96
The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling.
While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution.
Stepper is a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion.
Paper Title: Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
Project: https://t.co/A5eEHVBTb8
Link: https://t.co/fULY2fjE7a
Semantic Foam extends the recently proposed Radiant Foam representation to semantic decomposition.
It combines a volumetric Voronoi mesh with an explicit semantic feature field defined at the cell level, enabling direct spatial regularization and improving consistency across views.
It achieves improved object-level segmentation quality compared to prior approaches such as Gaussian Grouping and SAGA.
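To make "a semantic feature field defined at the cell level" concrete, here is a minimal sketch of a Voronoi lookup: a query point takes the feature of the cell whose site is nearest. The site positions, the two-dimensional feature vectors, and the function names are hypothetical, and the actual method's mesh construction and regularization are not shown.

```python
import math

def nearest_cell(point, sites):
    """Index of the Voronoi cell containing `point` (the nearest site)."""
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))

def semantic_at(point, sites, cell_features):
    """Look up the per-cell semantic feature for a 3D query point."""
    return cell_features[nearest_cell(point, sites)]

# Hypothetical toy scene: three cell sites, each with a 2-D semantic feature.
sites = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cell_features = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

feature = semantic_at((0.9, 0.1, 0.0), sites, cell_features)
print(feature)
```

Since features live on cells rather than on individual views, neighboring cells can be compared directly in 3D, which is what makes explicit spatial regularization possible.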
Paper Title: Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
Project: https://t.co/mu3nCv0DDC
Link: https://t.co/ncpC4AvSNU
Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research.
While CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored.
This work proposes a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and proposes a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models.
Paper Title: KD-CVG: A Knowledge-Driven Approach for Creative Video Generation
Project: https://t.co/ZpWEBfz1b5
Link: https://t.co/8ROMmcnjV7
IMU-to-4D is a large foundation model that jointly reasons over human motion, activity descriptions, and 3D scene layouts purely from wearable IMU signals.
Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability.
An alternative? 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure.
Paper Title: Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
Project: https://t.co/tzssIlgSOJ
Link: https://t.co/eixxOAnZd1
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds?
Seeing Fast and Slow: Learning the Flow of Time in Videos!
This project explores how to perceive and manipulate the flow of time in videos through four complementary tasks:
Speed-change detection locates the exact moments when playback speed shifts.
Video speed estimation infers how much a video has been sped up or slowed down.
Extreme temporal super-resolution converts low-FPS, blurry videos into high-FPS, clear counterparts.
Speed-conditioned video generation synthesizes the same event at user-specified temporal speeds.
Together, these capabilities highlight fine-grained temporal perception alongside controllable video generation.
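As a rough picture of what a "speed factor" means for the tasks above, the sketch below re-times a clip by resampling frame indices. It is only an assumption-laden toy (nearest-lower-frame sampling, frames represented as integers, hypothetical function name `change_speed`) and says nothing about how the project's models are trained.

```python
def change_speed(frames, speed):
    """Re-time a clip: output frame i is taken from source frame int(i * speed).

    speed > 1 drops frames (sped up); speed < 1 repeats frames (slowed down).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    out = []
    i = 0
    while True:
        src = int(i * speed)
        if src >= len(frames):
            break
        out.append(frames[src])
        i += 1
    return out

frames = list(range(10))            # stand-in for 10 video frames
fast = change_speed(frames, 2.0)    # 2x speed-up keeps every other frame
slow = change_speed(frames, 0.5)    # 0.5x slow-down shows each frame twice
print(fast)
print(len(slow))
```

Speed estimation can be read as the inverse problem: given only the re-timed clip, recover the factor that was applied.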
Paper Title: Seeing Fast and Slow: Learning the Flow of Time in Videos
Project: https://t.co/iVcYiPoovb
Link: https://t.co/j3077EKltL
SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding!
Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages.
SlideAgent is a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks.
SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues.
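One way to picture the three-level, query-agnostic representation is as a nested record that is built once per document and consulted later for any query. The class names, fields, and the keyword lookup below are hypothetical illustrations, not SlideAgent's actual data model or retrieval method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ElementNote:
    kind: str          # e.g. "chart" or "text box" (hypothetical categories)
    description: str

@dataclass
class PageNote:
    index: int
    summary: str
    elements: List[ElementNote] = field(default_factory=list)

@dataclass
class DeckKnowledge:
    global_summary: str
    pages: List[PageNote] = field(default_factory=list)

def find_elements(deck: DeckKnowledge, keyword: str) -> List[Tuple[int, ElementNote]]:
    """Return (page index, element) pairs whose description mentions the keyword."""
    keyword = keyword.lower()
    return [
        (page.index, element)
        for page in deck.pages
        for element in page.elements
        if keyword in element.description.lower()
    ]

deck = DeckKnowledge(
    global_summary="Quarterly product update",
    pages=[
        PageNote(1, "Title slide", [ElementNote("text box", "Deck title and date")]),
        PageNote(2, "Revenue overview", [ElementNote("chart", "Bar chart of revenue by region")]),
    ],
)
hits = find_elements(deck, "revenue")
print(hits)
```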
Paper Title: SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual
Project: https://t.co/kPiDHSEXH3
Link: https://t.co/czS958OMPP
360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more holistic perspective of our surroundings. However, while existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive.
Argus generates a full 360° panoramic video (visualized as environment maps), with the red box indicating the corresponding region in the generated frame.
Paper Title: Beyond the Frame: Generating 360 Panoramic Videos from Perspective
Project: https://t.co/JXWmgpG2ha
Link: https://t.co/iGrHFYtWO2
Adaptive Patch Transformers (APT) is a method to accelerate vision transformers (ViTs) by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by using larger patch sizes in more homogeneous image regions, and smaller patches in more complex ones.
APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance.
It can be applied to a previously fine-tuned ViT and converges in as little as 1 epoch, enabling training on high-resolution images with minimal compute budgets. It also significantly reduces training and inference time with no performance degradation on high-resolution dense visual tasks, achieving up to 30% faster training and inference on visual QA, object detection and semantic segmentation.
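The token savings can be illustrated with a toy patch selector. This sketch assumes pixel variance as the homogeneity measure, two patch sizes, and a hand-picked threshold; these choices and the function names are assumptions for illustration, not the criterion APT itself uses.

```python
def block_variance(image, r0, c0, size):
    """Variance of pixel values in the size x size block at (r0, c0)."""
    vals = [image[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def adaptive_patches(image, large=4, small=2, threshold=1.0):
    """Return (row, col, size) patches: large where flat, small where detailed.

    Assumes image height and width are multiples of `large`,
    and `large` is a multiple of `small`.
    """
    h, w = len(image), len(image[0])
    patches = []
    for r0 in range(0, h, large):
        for c0 in range(0, w, large):
            if block_variance(image, r0, c0, large) <= threshold:
                patches.append((r0, c0, large))      # homogeneous: one token
            else:
                for dr in range(0, large, small):    # complex: split into small patches
                    for dc in range(0, large, small):
                        patches.append((r0 + dr, c0 + dc, small))
    return patches

# 8x8 toy image: flat everywhere except a checkerboard in the top-left 4x4 block.
image = [[0.0] * 8 for _ in range(8)]
for r in range(4):
    for c in range(4):
        image[r][c] = 10.0 * ((r + c) % 2)

patches = adaptive_patches(image)
uniform_tokens = (8 // 2) * (8 // 2)   # tokens if every patch were 2x2
print(len(patches), "adaptive tokens vs", uniform_tokens, "uniform tokens")
```

In this toy case only the textured quadrant is split, so the image is covered by 7 tokens instead of 16. Fewer tokens mean less attention compute, which is where the throughput gain comes from.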
Paper Title: Accelerating Vision Transformers with Adaptive Patch Sizes
Project: https://t.co/SXnJSEdUP8
Link: https://t.co/3JbdcbcLiC
NUMINA is a training-free framework that tackles numerical misalignment in text-to-video (T2V) diffusion models: the persistent failure of T2V models to generate the correct count of objects specified in prompts (e.g., producing 2 or 4 cats when "three cats" is requested).
Unlike seed search or prompt enhancement approaches that treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA directly identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers.
Paper Title: When Numbers Speak: Aligning Textual Numerals and Visual Instances in
Project: https://t.co/IszwsB0zWA
Link: https://t.co/3obuO8ygEp
Visually-grounded Humanoid Agents!
Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments.
Visually-grounded Humanoid Agents is a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes.
Paper Title: Visually-grounded Humanoid Agents
Project: https://t.co/d244wCGs0k
Link: https://t.co/T8atPKOdwa
Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments.
Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation.
LAMP lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations, enabling robust generalization across diverse manipulation tasks from monocular RGB-D observations and promptable instructions.
Paper Title: LAMP: Lift Image-Editing as General 3D Priors for Open-world
Project: https://t.co/vIHsZPTf5K
Link: https://t.co/dnejqmSFfn