Multi-Object 3D Grounding with Dynamic Modules
and Language Informed Spatial Attention
NeurIPS 2024
- Haomeng Zhang
- Chiao-An Yang
- Raymond A. Yeh
Purdue University
Abstract
Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach that incorporates three innovations. First, a dynamic vision module enables a variable and learnable number of box proposals. Second, a dynamic camera positioning module extracts features for each proposal. Third, a language-informed spatial attention module better reasons over the proposals to output the final prediction. Empirically, experiments show that our method outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.
Overall Pipeline
Our proposed method, Multi-Object 3D Grounding with Dynamic Modules and Language Informed Spatial Attention (D-LISA), is designed with a novel vision module that predicts a dynamic number of proposal boxes and extracts their features from dynamic viewpoints per scene. Furthermore, we propose a fusion module that is spatially aware with explicit language conditioning.
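As a rough illustration of this two-stage data flow, the PyTorch sketch below consumes per-candidate features from a detector backbone, keeps a variable number of proposals via a (hard, for simplicity) objectness threshold, and fuses the survivors with the query sentence. The dynamic camera positioning step is omitted, and all names here (DLISAPipeline, objectness, match_head) are illustrative assumptions rather than D-LISA's actual implementation.

```python
import torch
import torch.nn as nn


class DLISAPipeline(nn.Module):
    """Two-stage sketch: dynamic proposals, then language-conditioned fusion.

    Placeholder internals only; the actual D-LISA modules differ.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stage 1 (vision): score candidates; keeping those above a
        # threshold yields a variable number of proposals per scene.
        # A hard threshold is used here for simplicity; making the count
        # learnable would require a differentiable relaxation.
        self.objectness = nn.Linear(dim, 1)
        self.keep_threshold = 0.5
        # Stage 2 (fusion): reason over proposals jointly with the query.
        self.fusion = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.match_head = nn.Linear(dim, 1)

    def forward(self, candidate_feats: torch.Tensor, sent_feat: torch.Tensor):
        # candidate_feats: (M, dim) features of M candidate boxes
        # sent_feat:       (dim,)   pooled feature of the query sentence
        scores = torch.sigmoid(self.objectness(candidate_feats)).squeeze(-1)
        keep = scores > self.keep_threshold              # dynamic proposal count
        proposals = candidate_feats[keep]
        # Prepend the sentence token so fusion is explicitly language-conditioned.
        tokens = torch.cat([sent_feat[None, :], proposals], dim=0)
        fused = self.fusion(tokens[None])[0][1:]         # drop the sentence token
        return keep, self.match_head(fused).squeeze(-1)  # per-proposal scores
```

Proposals whose match score exceeds a chosen cutoff would then be returned as the grounded boxes, which is what lets the model output zero, one, or many targets per query.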
Language Informed Spatial Attention (LISA)
Given the visual feature matrix $\bm{F}$ and the sentence feature $\bm{g}$, the language-informed spatial attention block updates the visual features with spatial information by balancing the visual attention weights against spatial relations under the guidance of the language.
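The sketch below shows one way such a block could be parameterized, assuming each proposal comes with a 3D box center: pairwise center offsets and distances form the spatial term, and a gate predicted from $\bm{g}$ balances it against the usual visual attention logits. The gating form and the spatial descriptor are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class LISABlock(nn.Module):
    """Minimal sketch of a language-informed spatial attention block.

    Mixes visual attention logits with pairwise spatial relations,
    where the mixing weight is predicted from the sentence feature.
    The exact parameterization is an assumption, not the paper's.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Maps a pairwise spatial descriptor (center offset + distance,
        # 4 values) to an additive attention logit.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 1)
        )
        # Predicts, from the sentence feature, how much weight the
        # spatial term receives relative to the visual term.
        self.lang_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats, centers, sent):
        # feats:   (B, M, dim)  proposal features   (F in the text)
        # centers: (B, M, 3)    proposal box centers
        # sent:    (B, dim)     sentence feature    (g in the text)
        B, M, dim = feats.shape
        q, k, v = self.q_proj(feats), self.k_proj(feats), self.v_proj(feats)
        visual_logits = q @ k.transpose(-2, -1) / dim ** 0.5   # (B, M, M)

        # Pairwise spatial descriptor: offset vector plus Euclidean distance.
        offset = centers.unsqueeze(2) - centers.unsqueeze(1)   # (B, M, M, 3)
        dist = offset.norm(dim=-1, keepdim=True)               # (B, M, M, 1)
        spatial_logits = self.spatial_mlp(
            torch.cat([offset, dist], dim=-1)
        ).squeeze(-1)                                          # (B, M, M)

        # Language-predicted balance between visual and spatial terms.
        lam = self.lang_gate(sent).unsqueeze(-1)               # (B, 1, 1)
        attn = ((1 - lam) * visual_logits + lam * spatial_logits).softmax(dim=-1)
        return attn @ v                                        # updated features
```

A quick shape check: `LISABlock(256)(torch.randn(2, 32, 256), torch.randn(2, 32, 3), torch.randn(2, 256))` returns updated features of shape `(2, 32, 256)`.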
Qualitative Results
Qualitative examples on the Multi3DRefer val set in the single target with distractors (ST w/D) category.
Qualitative examples on the Multi3DRefer val set in the multiple targets (MT) category.