Multi-Object 3D Grounding with Dynamic Modules
and Language Informed Spatial Attention

NeurIPS 2024

Abstract

Multi-object 3D grounding involves locating, in a point cloud, the 3D boxes of the objects referred to by a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach that incorporates three innovations. First, a dynamic vision module that enables a variable and learnable number of box proposals. Second, a dynamic camera positioning module that extracts features for each proposal. Third, a language-informed spatial attention module that better reasons over the proposals to produce the final prediction. Empirically, our method outperforms the state of the art on multi-object 3D grounding by 12.8% (absolute) and is competitive on single-object 3D grounding.

Overall Pipeline



Our proposed D-LISA (Multi-Object 3D Grounding with Dynamic Modules and Language Informed Spatial Attention) is built around a novel vision module that allows a dynamic number of proposal boxes and extracts features from dynamic, per-scene viewpoints. On top of this, we propose a fusion module that is spatially aware and explicitly conditioned on language.
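
To make the two-stage flow concrete, below is a minimal, runnable PyTorch sketch. It is our own illustration under stated assumptions: the module internals (`point_encoder`, `objectness`, `text_encoder`, and the fusion layer) are simple stand-ins invented for this example, not the actual D-LISA architecture. The one idea it demonstrates is keeping a variable number of proposals per scene by thresholding a learned objectness score rather than taking a fixed top-k.

```python
import torch
import torch.nn as nn

class TwoStageGrounderSketch(nn.Module):
    """Illustrative two-stage grounding flow (not the official D-LISA code)."""

    def __init__(self, d=256, max_proposals=64):
        super().__init__()
        self.point_encoder = nn.Linear(6, d)    # stand-in for a 3D backbone (xyz + rgb)
        self.objectness = nn.Linear(d, 1)       # scores candidate proposals
        self.text_encoder = nn.Linear(300, d)   # stand-in for a language encoder
        self.fusion = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.grounding_head = nn.Linear(d, 1)   # per-proposal match score
        self.max_proposals = max_proposals

    def forward(self, points, query_emb):
        # Stage 1: encode points, then keep a *variable* number of proposals
        # by thresholding learned objectness instead of a fixed top-k.
        feats = self.point_encoder(points)                  # (N, d)
        keep = self.objectness(feats).squeeze(-1) > 0       # dynamic proposal count
        F = feats[keep][: self.max_proposals].unsqueeze(0)  # (1, M, d), M varies
        # Stage 2: fuse proposals with the sentence feature and score each one.
        g = self.text_encoder(query_emb).view(1, 1, -1)     # (1, 1, d)
        fused, _ = self.fusion(F, g, g)                     # cross-attend to language
        return self.grounding_head(fused).squeeze(-1)       # (1, M) grounding scores

model = TwoStageGrounderSketch()
points = torch.randn(2048, 6)   # toy point cloud: 2048 points, xyz + rgb
query = torch.randn(300)        # toy pooled query embedding
scores = model(points, query)   # one score per surviving proposal
```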

Language Informed Spatial Attention (LISA)



Given the visual feature matrix $\bm{F}$ and the sentence feature $\bm{g}$, the language-informed spatial attention block updates the visual features with spatial information by balancing visual attention weights against language-guided spatial relations.
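
The following is a minimal PyTorch sketch of how such a block could work; it is our own illustration under assumptions (spatial relations are encoded from pairwise box-center distances, and a sigmoid gate on $\bm{g}$ sets the balance between visual attention and spatial relation scores). The `LISABlock` name, shapes, and parameterization are hypothetical; the paper's block may differ.

```python
import torch
import torch.nn as nn

class LISABlock(nn.Module):
    """Sketch of a language-informed spatial attention block.

    Assumed shapes: F is (M, d) proposal features, g is (d,) pooled
    sentence feature, centers is (M, 3) proposal box centers.
    """

    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        # Language-conditioned gate: decides how much spatial reasoning to mix in.
        self.gate = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
        # Maps pairwise center distances to (unnormalized) spatial relation scores.
        self.spatial = nn.Sequential(nn.Linear(1, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, F, g, centers):
        M, d = F.shape
        # Standard visual attention weights among proposals.
        attn = (self.q(F) @ self.k(F).T) / d ** 0.5         # (M, M)
        # Spatial relations from pairwise box-center distances.
        dist = torch.cdist(centers, centers).unsqueeze(-1)  # (M, M, 1)
        rel = self.spatial(dist).squeeze(-1)                # (M, M)
        # Balance visual vs. spatial scores with the language-driven gate.
        lam = self.gate(g)                                  # scalar in (0, 1)
        weights = torch.softmax((1 - lam) * attn + lam * rel, dim=-1)
        return F + weights @ self.v(F)                      # updated visual features

block = LISABlock(d=256)
F = torch.randn(32, 256)        # features for 32 box proposals
g = torch.randn(256)            # pooled sentence feature
centers = torch.randn(32, 3)    # proposal box centers
updated = block(F, g, centers)  # (32, 256) spatially informed features
```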

Qualitative Results



Qualitative examples from the Multi3DRefer val set in the single target with distractors (ST w/D) category.



Qualitative examples from the Multi3DRefer val set in the multiple targets (MT) category.