Just published in @ScienceAdvances, our work demonstrating the ability of AI and 3D computer vision to produce automated measurement of human interactions in video data from early child development research -- providing over 100x time savings compared to human annotation and enabling quantitative, big data studies. We use our method, HARMONI, to characterize longitudinal trends in infant and toddler interaction with caregivers, in over 500 hours of video data. Work led by @JenWeng4 together with co-PI @SandersMDMPH and @K_L_Humphreys, and with a great interdisciplinary team including Laura Bravo Sanchez, @bergelsonlab, @akanazawa, @StanfordCERC, and many others!
https://t.co/ouyazVXqIH
1/3 Today, an anecdote shared by an invited speaker at #NeurIPS2024 left many Chinese scholars, myself included, feeling uncomfortable. As a community, I believe we should take a moment to reflect on why such remarks in public discourse can be offensive and harmful.
Our lab at Stanford has postdoc openings! Candidates should have expertise and interests in one or multiple of: multimodal large language models, video understanding (including video-language models), AI for science / biology, or AI for surgery. Please send inquiries by email and see https://t.co/3DB5k5AC2o for more information.
🤗First benchmark on multimodal judge’s feedback for text-to-image generation!!
🏃Come and pick up your personal advice and package to choose the best judge to fine-tune your diffusion model 👉 https://t.co/7ENkz1Fbmx
Paper: https://t.co/F77BCQlJtA
Code: https://t.co/baiK2H1yuq
🌟NEW Paper Alert 🌟
👩‍⚖️MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? (https://t.co/hKQaAOzrf4)
🧐Also wonder about the best judge model to provide feedback for your diffusion models?
We evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias.
Key findings:
👉1. While closed-source VLM judges typically perform better, smaller CLIP-based models offer better text-image alignment and image quality feedback due to extensive pre-training on text-vision corpora. Conversely, VLMs provide more accurate feedback on safety and generation bias, thanks to their stronger reasoning capabilities.
👉2. VLM judges can provide more accurate and stable feedback in natural language (e.g. Poor, Average, Good) than numerical scales.
Led by @ZRChen_AISafety, Yichao Du, Zichen Wen, @AiYiyangZ.
https://t.co/6cwN9yrVOm
🌟Just completed my PhD at @Stanford! 🌟 A huge thanks to my advisor @yeung_levy, my family and friends, committee and collaborators, and everyone who supported me along the way. Excited to start my next chapter at @Waymo, working on foundation models for self-driving cars!
If you’re in Davos, we just started giving a tutorial on Gaussian Splatting at 3DV. With @GKopanas @Snosixtytwo @antoine_guedon
https://t.co/zMVqL2s7z9
https://t.co/MJwOOuXBYu
Thanks @_akhaliq for sharing our work! By treating an LLM as an agent and long-form videos as its environment, and letting the LLM interact with the videos and iteratively decide where to look, we achieve SoTA zero-shot performance and show potential for processing extremely long videos!
Are you hiring top AI talent?
Here is a list of Ph.D. students affiliated with @StanfordAILab who are on the industry and academic job markets this year! This list showcases diverse research areas and 41% of these graduates are URMs!
Check it out: https://t.co/WiTN8FKHhO
Single-View 3D Human Digitalization with Large Reconstruction Models
paper page: https://t.co/JRrI8By7U5
We introduce Human-LRM, a single-stage feed-forward Large Reconstruction Model designed to predict human Neural Radiance Fields (NeRF) from a single image. Our approach demonstrates remarkable adaptability in training using extensive datasets containing 3D scans and multi-view capture. Furthermore, to enhance the model's applicability for in-the-wild scenarios especially with occlusions, we propose a novel strategy that distills multi-view reconstruction into single-view via a conditional triplane diffusion model. This generative extension addresses the inherent variations in human body shapes when observed from a single view, and makes it possible to reconstruct the full body human from an occluded image. Through extensive experiments, we show that Human-LRM surpasses previous methods by a significant margin on several benchmarks.
What are the differences between image datasets? (e.g. ImageNet & ImageNetv2) Errors by one model vs. another? (e.g. CLIP & ResNet) Correct vs. incorrect predictions?
VisDiff can answer by describing differences in image sets w/ language. Work led by @Zhang_Yu_hui and @lisabdunlap!
Check out Adobe's Project Scene Change
The interesting experimental #AI tech automatically composites an actor from one shot into the environment from another without the need for #rotoscoping or #cameratracking
https://t.co/gP3iK2mTXR
#compositing #VFX #motiongraphics
Pls stop at #CVPR2023 poster *Tue AM 110* to learn about GC-KPL: a novel method for learning 3D human keypoints from point clouds w/o human labels.
Project: https://t.co/XYgT2MtAEW
Joint work w/ awesome folks @gorban Jingwei Ji, @MahyarNajibi, Yin Zhou, Dragomir Anguelov, @Waymo
Check out our #CVPR2023 paper on 3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels https://t.co/KspaOpK9Fd
Huge shout out to @JenWeng4 who interned in our team last summer and did all the work!
Have videos of your tennis practice and wish you could put your own motion in 3D? 🎾 👟 🏋🏻
#CVPR2023 We present NeMo, a 3D motion recovery method that is more accurate by leveraging information shared across multiple instances/repetitions!
👇🏻Resources in 🧵
(1/8)
Can you diagnose and rectify a #vision model using #language? Check our work in #ICLR2023!
Our analysis reveals when and how text embeddings can be used as a proxy for image embeddings to debug vision models.
Paper: https://t.co/8sGkwOQhhz
Code: https://t.co/VFzu1AmvUs