AI Bites | YouTube Channel

@ai_bites

AI tools, papers and hands-on coding to solve problems with AI. Former @UniofOxford @Oxford_VGG Our products:

YouTube →

Joined July 2014

693 Following

2.3K Followers

3.5K Posts

AI Bites | YouTube Channel

@ai_bites

20 days ago

Visual Geometry Grounded Transformer (VGGT) is a 1.2-billion-parameter model developed by Meta AI and the University of Oxford, designed for high-speed, accurate 3D scene reconstruction. It acts as a foundational model for 3D vision, capable of inferring camera parameters, depth maps, point maps, and 3D point tracks from one to hundreds of images in a single forward pass. VGGT-Ω now shows that the quality of VGGT models scales predictably with model and data size! VGGT-Ω uses only ∼30% of the GPU memory of its predecessor, which allows the authors to train VGGT-Ω with 15× more supervised data than prior work and to leverage vast amounts of unlabeled video data. Paper Title: VGGT-Ω Project: https://t.co/R40nT0FDic Link: https://t.co/P96gbuzwpu

365

AI Bites | YouTube Channel

@ai_bites

20 days ago

Seed3D 1.0 explored end-to-end generation from a single image to high-quality 3D models and made notable progress in texture generation. Seed3D 2.0 introduces a Coarse-to-Fine two-stage generation strategy that decouples "overall structure" from "fine details", allowing them to be optimized separately. This breakthrough tackles major geometry generation challenges, such as sharp edges, thin-walled structures, and complex topologies. Seed3D 2.0 adopts a unified PBR generative model to jointly model the full set of PBR maps. It adopts an MoE architecture to improve high-resolution material details and boundary precision. Paper Title: Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Project: https://t.co/xRnpvA7Va7 Link: https://t.co/gd7XGlmx5D

181

AI Bites | YouTube Channel

@ai_bites

21 days ago

Time-Aligned Video-to-Music Generation! Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. V2M-ZERO is a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Prompt: A tense urgent song, with electronic cinematic style, for a trailer action, high energy Paper Title: V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation Project: https://t.co/DtSYFmDjEt Link: https://t.co/W5giOsEkKK

124

AI Bites | YouTube Channel

@ai_bites

21 days ago

Vision Banana from DeepMind is a SOTA unified model for both image understanding and generation. Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, the authors demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. Paper Title: Image Generators are Generalist Vision Learners Project: https://t.co/bJA7Lc35l5 Link: https://t.co/t5fyoF7hW7

181

AI Bites | YouTube Channel

@ai_bites

21 days ago

Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. Rectified Dynamic Mesh (R-DMesh) is a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Paper Title: R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow Project: https://t.co/nmOEhZZ6CM Link: https://t.co/aIRR12qNLS

177

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

GOR-IS is a novel framework that enhances both the physical consistency and visual coherence of 3D object removal. As shown in the example, it simultaneously removes the target object (red dashed box) and its corresponding reflection, while naturally inpainting the region previously occluded by the object. Paper Title: GOR-IS: 3D Gaussian Object Removal in the Intrinsic Space Project: https://t.co/Uk99zXHaDn Link: https://t.co/RZTfqzwK6K

216

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

UniVidX is a unified multimodal framework designed to enable versatile video generation. Recent progress has shown that video diffusion models (VDMs) can be repurposed to solve various multimodal graphics tasks. However, existing approaches typically train separate models for each specific problem setting. This practice not only ignores the joint correlations across modalities, but also locks models into fixed input-output mappings, severely limiting their flexibility. UniVidX proposes three key designs:

1) Stochastic Condition Masking (SCM): by randomly partitioning modalities into clean conditions and noisy targets during training, the model learns omni-directional conditional generation rather than fixed mappings.
2) Decoupled Gated LoRA (DGL): per-modality LoRAs are attached and activated when a modality serves as a generation target, thereby preserving the VDM's strong priors.
3) Cross-Modal Self-Attention (CMSA): keys/values are explicitly shared across modalities while queries stay modality-specific, facilitating information exchange and inter-modal alignment.

Prompt: “A small otter, wearing blue overalls and yellow safety helmet, is standing on a small ladder, holding a small hammer with its front paws, repairing a small wooden wall with a focused and serious expression. It is located in a workspace filled with tools, with warm lights illuminating the workbench and various small tools neatly hanging on the wall.”

Paper Title: UniVidX: A Unified Multimodal Framework for Versatile Video Generation Project: https://t.co/9f1WhaHG3h

126
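The Stochastic Condition Masking and gated-LoRA ideas from the UniVidX post above can be illustrated with a small sketch. This is not the authors' code: the modality names, the function names `stochastic_condition_mask` and `lora_gates`, and the rule that at least one modality is always a target are assumptions made for illustration only.

```python
import random


def stochastic_condition_mask(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumption: at least one modality is always a target, and the set of
    conditions may be empty. The post does not specify the sampling rule.
    """
    names = list(modalities)
    num_targets = rng.randint(1, len(names))       # 1..len(names), inclusive
    chosen = set(rng.sample(names, num_targets))   # targets receive noise
    conditions = [m for m in names if m not in chosen]
    targets = [m for m in names if m in chosen]
    return conditions, targets


def lora_gates(modalities, targets):
    """Gate per-modality LoRAs: on for generation targets, off for conditions."""
    target_set = set(targets)
    return {m: (m in target_set) for m in modalities}


if __name__ == "__main__":
    rng = random.Random(0)
    mods = ["rgb", "depth", "normal", "albedo"]    # hypothetical modality names
    conditions, targets = stochastic_condition_mask(mods, rng)
    print("conditions:", conditions)
    print("targets:   ", targets)
    print("LoRA gates:", lora_gates(mods, targets))
```

Because the split is resampled at every training step, the same network sees each modality both as an input and as an output over time, which is the intuition behind learning omni-directional rather than fixed mappings.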

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

How Well Does GPT-4o Understand Vision? This work benchmarks popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard semantic and geometric computer vision tasks using established datasets. Considering a simple classification task, while MFMs didn't quite match the performance of specialized vision models like Model Soups ViT-G and OpenCLIP H, they showed impressive capabilities. GPT-4o emerged as the standout performer, followed by Gemini 2.0 Flash, Gemini 1.5 Pro, Claude 3.5 Sonnet, Qwen2-VL, o4-mini and Llama 3.2. Paper Title: How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Project: https://t.co/jpSoMaqIcj Link: https://t.co/UeX6xHEv96

488

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

@amentialinked Not our work. We just feature good research projects in AI, vision and graphics from the community

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. Stepper is a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Paper Title: Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas Project: https://t.co/A5eEHVBTb8 Link: https://t.co/fULY2fjE7a

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

Semantic Foam extends the recently proposed Radiant Foam representation to semantic decomposition. It combines a volumetric Voronoi mesh with an explicit semantic feature field defined at the cell level, enabling direct spatial regularization and improving consistency across views. It achieves improved object-level segmentation quality compared to prior approaches such as Gaussian Grouping and SAGA. Paper Title: Semantic Foam: Unifying Spatial and Semantic Scene Decomposition Project: https://t.co/mu3nCv0DDC Link: https://t.co/ncpC4AvSNU

200
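To make the "semantic feature field defined at the cell level" in the Semantic Foam post above concrete, here is a toy sketch of a per-cell lookup on a Voronoi partition. It relies only on the definition of a Voronoi cell (the region closest to its site). The brute-force search, the function names, and the sample sites and features are assumptions for illustration and say nothing about how the paper implements or trains the representation.

```python
import math


def nearest_cell(point, sites):
    """Index of the Voronoi cell containing `point` (the nearest site).

    Brute-force search, used here only to keep the sketch short.
    """
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))


def semantic_feature_at(point, sites, cell_features):
    """Return the semantic feature stored on the cell that contains `point`."""
    return cell_features[nearest_cell(point, sites)]


if __name__ == "__main__":
    # Hypothetical scene with two cells and two-dimensional semantic features.
    sites = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    cell_features = [[1.0, 0.0], [0.0, 1.0]]
    print(semantic_feature_at((1.0, 0.5, 0.0), sites, cell_features))
```

Since every 3D point inside a cell maps to the same stored feature, any two views that observe that point read the same semantics, which is one way to picture the cross-view consistency the post mentions.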

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. While CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This work introduces a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and proposes a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. Paper Title: KD-CVG: A Knowledge-Driven Approach for Creative Video Generation Project: https://t.co/ZpWEBfz1b5 Link: https://t.co/8ROMmcnjV7

134

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

IMU-to-4D is a large foundation model that jointly reasons over human motion, activity descriptions, and 3D scene layouts purely from wearable IMU signals. Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. An alternative? 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Paper Title: Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs Project: https://t.co/tzssIlgSOJ Link: https://t.co/eixxOAnZd1

222

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Seeing Fast and Slow: Learning the Flow of Time in Videos! This project explores how to perceive and manipulate the flow of time in videos through four complementary tasks:

1) Speed-change detection locates the exact moments when playback speed shifts.
2) Video speed estimation infers how much a video has been sped up or slowed down.
3) Extreme temporal super-resolution converts low-FPS, blurry videos into high-FPS, clear counterparts.
4) Speed-conditioned video generation synthesizes the same event at user-specified temporal speeds.

Together, these capabilities highlight fine-grained temporal perception alongside controllable video generation. Paper Title: Seeing Fast and Slow: Learning the Flow of Time in Videos Project: https://t.co/iVcYiPoovb Link: https://t.co/j3077EKltL

489

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding! Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. SlideAgent is a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. Paper Title: SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Project: https://t.co/kPiDHSEXH3 Link: https://t.co/czS958OMPP

142

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more holistic perspective of our surroundings. However, while existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. Argus generates a full 360° panoramic video (visualized as environment maps), with the red box indicating the corresponding region in the generated frame. Paper Title: Beyond the Frame: Generating 360 Panoramic Videos from Perspective Project: https://t.co/JXWmgpG2ha Link: https://t.co/iGrHFYtWO2

201

AI Bites | YouTube Channel

@ai_bites

about 1 month ago

Adaptive Patch Transformers (APT) is a method to accelerate vision transformers (ViTs) by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by using larger patch sizes in more homogeneous image regions, and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance. It can be applied to a previously fine-tuned ViT and converges in as little as 1 epoch, enabling training on high-resolution images with minimal compute budgets. It also significantly reduces training and inference time with no performance degradation on high-resolution dense visual tasks, achieving up to 30% faster training and inference on visual QA, object detection and semantic segmentation. Paper Title: Accelerating Vision Transformers with Adaptive Patch Sizes Project: https://t.co/SXnJSEdUP8 Link: https://t.co/3JbdcbcLiC

197

131

10K
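The core idea in the APT post above, larger patches in homogeneous regions and smaller ones in complex regions, can be sketched as a quadtree-style split. This is only an illustration under stated assumptions: pixel variance with a fixed threshold stands in for whatever homogeneity measure the paper actually uses, the image is a plain 2D list of grayscale floats, and its height and width are assumed to be multiples of `max_size`. The names `region_variance` and `adaptive_patches` are hypothetical.

```python
def region_variance(img, top, left, size):
    """Population variance of the pixel values in a size x size square."""
    vals = [img[r][c]
            for r in range(top, top + size)
            for c in range(left, left + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def adaptive_patches(img, max_size=4, min_size=2, threshold=0.01):
    """Return (top, left, size) patches: big where flat, small where busy.

    Assumptions: variance is the homogeneity test, and image height and
    width are multiples of max_size (sizes are halved at each split).
    """
    patches = []

    def split(top, left, size):
        flat = region_variance(img, top, left, size) <= threshold
        if size <= min_size or flat:
            patches.append((top, left, size))      # keep as a single token
            return
        half = size // 2
        for dt in (0, half):                       # recurse into 4 quadrants
            for dl in (0, half):
                split(top + dt, left + dl, half)

    for top in range(0, len(img), max_size):
        for left in range(0, len(img[0]), max_size):
            split(top, left, max_size)
    return patches


if __name__ == "__main__":
    # 8x8 toy image: flat everywhere except a checkerboard in the top-right.
    img = [[float((r + c) % 2) if r < 4 and c >= 4 else 0.0
            for c in range(8)] for r in range(8)]
    for patch in adaptive_patches(img):
        print(patch)
```

On this toy image the three flat 4x4 blocks each stay a single token and only the checkerboard block is split into four 2x2 patches, so the image yields 7 tokens instead of the 16 a uniform 2x2 grid would produce.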

AI Bites | YouTube Channel

@ai_bites

about 2 months ago

NUMINA is a training-free framework that tackles numerical misalignment in text-to-video (T2V) diffusion models: the persistent failure of T2V models to generate the correct count of objects specified in prompts (e.g., producing 2 or 4 cats when "three cats" is requested). Unlike seed search or prompt enhancement approaches that treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA directly identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers. Paper Title: When Numbers Speak: Aligning Textual Numerals and Visual Instances in Project: https://t.co/IszwsB0zWA Link: https://t.co/3obuO8ygEp

117

AI Bites | YouTube Channel

@ai_bites

about 2 months ago

Visually-grounded Humanoid Agents! Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. Visually-grounded Humanoid Agents is a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: the agents look, perceive, reason, and behave like real people in real-world 3D scenes. Paper Title: Visually-grounded Humanoid Agents Project: https://t.co/d244wCGs0k Link: https://t.co/T8atPKOdwa

145

AI Bites | YouTube Channel

@ai_bites

about 2 months ago

Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. LAMP lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations, enabling robust generalization across diverse manipulation tasks from monocular RGB-D observations and promptable instructions. Paper Title: LAMP: Lift Image-Editing as General 3D Priors for Open-world Project: https://t.co/vIHsZPTf5K Link: https://t.co/dnejqmSFfn

136
