[1/8] New social navigation paper + benchmark: SocialNav-SUB 🚶🤖 Recent work puts VLMs on robots for navigation, but can they really interpret scenes and extract key details for social navigation? https://t.co/2rlcIQpf6h
How can robots follow complex instructions in dynamic environments?
🤖 Meet ComposableNav, a diffusion-based planner that enables robots to generate novel navigation behaviors that satisfy diverse instruction specifications on the fly, with no retraining needed.
Just accepted to CoRL 2025
Project: https://t.co/FX3O0ZYYyD
A Thread (1/8)
[8/8] 🤖 SocialNav-SUB is a human-grounded check on whether VLMs understand social navigation scenes ✨ Please read our paper for more info: https://t.co/FpL3kqMFVP #Robotics #VLM #SocialNavigation
[7/8] SocialNavSUB is also fully open-source, actively maintained, and easily extendable to customized prompts and/or additional VLMs! Pull requests are always welcome! https://t.co/IJCjCUo5Io
[6/8] 🧪 Does chain-of-thought (using spatial/spatiotemporal VQAs first) improve social reasoning? Yes. Does BEV context help models? Model-dependent (sometimes a lot). Does better spatial/spatiotemporal context improve social reasoning? Yes.
[4/8] 👥 We collected human data from an IRB-approved human-subject study to construct our benchmark and evaluate whether models align with human judgments in social navigation scenes.
[3/8] SocialNav-SUB features real-world social navigation scenarios built from SCAND: scenarios @ 4 Hz → PHALP tracking → front-view & BEV images with labeled pedestrians. We combine these with a set of carefully designed questions to create our VQA prompts (5k in total).
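For readers who want a concrete picture of the "scenes + questions → VQA prompts" step, here is a minimal Python sketch. It is not the SocialNav-SUB code: the `Scene` and `Question` classes, the `build_prompts` function, the file paths, and the question wording are all hypothetical, and the real benchmark may assemble its prompts differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Hypothetical scene record: two rendered views plus tracked pedestrian IDs."""
    scene_id: str
    front_view: str            # path to the front-view image (assumed layout)
    bev: str                   # path to the bird's-eye-view image (assumed layout)
    pedestrian_ids: tuple      # numbered markers drawn on both views


@dataclass(frozen=True)
class Question:
    """Hypothetical question template; '{pid}' marks a per-pedestrian question."""
    category: str              # "spatial", "spatiotemporal", or "social"
    template: str


def build_prompts(scenes, questions):
    """Pair every scene with every question, expanding per-pedestrian templates."""
    prompts = []
    for scene in scenes:
        for q in questions:
            # Scene-level questions are asked once; per-pedestrian ones once per marker.
            pids = scene.pedestrian_ids if "{pid}" in q.template else (None,)
            for pid in pids:
                prompts.append({
                    "scene_id": scene.scene_id,
                    "images": [scene.front_view, scene.bev],
                    "category": q.category,
                    "text": q.template.format(pid=pid),
                })
    return prompts


# Illustrative data only; none of these values come from the benchmark.
scenes = [
    Scene("scene_a", "scene_a/front.png", "scene_a/bev.png", (1, 2)),
    Scene("scene_b", "scene_b/front.png", "scene_b/bev.png", (1,)),
]
questions = [
    Question("spatial", "Is pedestrian {pid} to the left or right of the robot?"),
    Question("social", "Should the robot slow down in this scene?"),
]
prompts = build_prompts(scenes, questions)
print(len(prompts))
```

Keeping the question category on each prompt makes it straightforward to ask the spatial and spatiotemporal questions before the social ones, as in the chain-of-thought setup mentioned in [6/8].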
[2/8] We introduce SocialNav-SUB: a VQA benchmark to evaluate spatial, spatiotemporal, and social reasoning for real-world social navigation scenarios with object-centric grounding (front view + Bird's-Eye-View (BEV) + numbered markers) to provide rich context to VLMs.