Abstract

Embodied AI has made significant progress in acting in unexplored environments. However, work on tasks such as object search has largely focused on efficient policy learning. In this work, we identify several gaps in current search methods: they largely rely on dated perception models, neglect temporal aggregation, and transfer directly from ground truth to noisy perception at test time, without accounting for the resulting overconfidence in the perceived state. We address the identified problems through calibrated perception probabilities and uncertainty across aggregation and found decisions, thereby adapting the models for sequential tasks. The resulting methods can be directly integrated with pretrained models across a wide family of existing search approaches at no additional training cost. We perform extensive evaluations of aggregation methods across both different semantic perception models and policies, confirming the importance of calibrated uncertainties in both the aggregation and the found decisions.

Teaser image

In order to quantify and address these problems, we first evaluate the impact of different semantic perception models and aggregation methods on sequential decision tasks. This differs from pure single-step perception evaluation based on IoU (intersection over union) or precision: we measure results over the full sequence of observations and actions, where early errors may impact or prevent later decisions. The bar plot shows the large perception gap to ground-truth semantics. While newer models can reduce this gap, we find that temporal aggregation at the perception level is key to closing it. To draw meaningful comparisons, we focus on one of the most widely used system structures, modular perception-mapping-policy pipelines, in one of the most explored tasks, ObjectNav.
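For reference, the single-step metric contrasted above, per-class IoU, scores one frame in isolation and is blind to sequential effects. A minimal sketch (the function name and label-map interface are illustrative, not from the paper):

```python
import numpy as np

def iou(pred, gt, cls):
    """Single-frame Intersection-over-Union for one semantic class.

    `pred` and `gt` are integer label maps of identical shape.
    """
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    if union == 0:
        return float("nan")  # class absent in both prediction and ground truth
    return np.logical_and(p, g).sum() / union
```

A sequential evaluation instead rolls perception out over a whole episode, where a single early misclassification can derail every later decision.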

Technical Approach

Overview of our approach
Figure: Overview of modular object search pipelines. A semantic segmentation model classifies the current image. A mapping module then fuses this information into a semantic point cloud and integrates it into a global map. From this map, either an egocentric map is extracted for RL agents or the full map is used by a planner. The agent then determines navigation actions and the found decision for a given target class c. We develop general methods to incorporate calibrated uncertainties into this system, enabling temporal aggregation of the semantic perception and consistent found decisions.
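The loop in the figure can be summarized as a single step function. The component names here (`segment`, `fuse`, `plan`) are illustrative stand-ins for the pretrained segmentation model, the mapping module, and the search policy, not actual interfaces from the paper's codebase:

```python
def search_step(rgb, pose, target_idx, segment, fuse, plan, global_map):
    """One step of a modular perception-mapping-policy pipeline (schematic)."""
    class_probs = segment(rgb)                        # per-pixel class probabilities
    global_map = fuse(global_map, class_probs, pose)  # project + aggregate into the map
    action, found = plan(global_map, target_idx)      # navigation action + found decision
    return global_map, action, found
```

Running this step in a loop until `found` is raised (or a step budget is exhausted) yields the full episode over which perception errors accumulate.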

We introduce uncertainty-based aggregation and found decisions to address the identified problems. While previous methods develop complex, heuristic map-aggregation strategies to cope with overconfident and uncalibrated predicted probabilities, we incorporate calibrated perception models with uncertainty estimation capabilities that can quantify this overconfidence. In a second step, we evaluate these models with a learned search policy and across different semantic models, and find that our conclusions generalize to different search strategies and semantic perception models. The resulting methods integrate directly with existing approaches across a wide range of models without any additional training cost.
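To illustrate the idea, here is a minimal sketch of uncertainty-weighted aggregation and an uncertainty-gated found decision, under assumed interfaces rather than the paper's exact implementation: each map cell keeps a running class-probability estimate, new calibrated predictions are weighted by their confidence (one minus normalized entropy), and "found" fires only when the aggregated target probability is high and the cell is certain. The threshold values `p_min` and `h_max` are hypothetical:

```python
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """Entropy of a categorical distribution, scaled to [0, 1]."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1) / np.log(p.shape[-1])

def fuse(cell_probs, cell_weight, obs_probs):
    """Weighted-average a new calibrated prediction into a map cell.

    Confident (low-entropy) observations receive higher weight.
    """
    w = 1.0 - normalized_entropy(obs_probs)  # confidence weight of this observation
    new_weight = cell_weight + w
    new_probs = (cell_probs * cell_weight + obs_probs * w) / max(new_weight, 1e-12)
    return new_probs, new_weight

def found(cell_probs, target_idx, p_min=0.8, h_max=0.3):
    """Raise 'found' only if the target is likely AND the cell is certain."""
    return cell_probs[target_idx] >= p_min and normalized_entropy(cell_probs) <= h_max

# Toy usage: fuse two observations of a 4-class map cell.
cell, weight = np.full(4, 0.25), 0.0  # uniform prior, zero accumulated evidence
for obs in [np.array([0.70, 0.10, 0.10, 0.10]),
            np.array([0.85, 0.05, 0.05, 0.05])]:
    cell, weight = fuse(cell, weight, obs)
```

The gating in `found` is what prevents a single confident-looking false positive from terminating the episode, which heuristic max-pooling aggregation cannot do.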

Qualitative Results

Qualitative Results
Figure: Semantic maps showing, from left to right, the ground-truth semantics, the aggregated predictions of our Weighted Averaging approach, and the resulting uncertainty map. Circles indicate positions where a target object was falsely detected but, due to the high uncertainty, no false found decision was raised. The uncertainty map is colored from blue to yellow, corresponding to 0.0 to 1.0 normalized entropy.
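The blue-to-yellow coloring of the uncertainty map can be reproduced with a simple linear interpolation in RGB; this is only a sketch, and the figure's actual colormap may differ:

```python
import numpy as np

def entropy_to_rgb(h):
    """Map normalized entropy in [0, 1] to blue (certain) -> yellow (uncertain)."""
    h = np.clip(np.asarray(h, dtype=float), 0.0, 1.0)
    blue = np.array([0.0, 0.0, 1.0])
    yellow = np.array([1.0, 1.0, 0.0])
    # Broadcast the scalar/array weight over the RGB channels.
    return (1.0 - h)[..., None] * blue + h[..., None] * yellow
```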

Code

A software implementation of this project is available in our GitHub repository, released under the GPLv3 license for academic usage. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Sai Prasanna, Daniel Honerkamp, Kshitij Sirohi, Tim Welschehold, Wolfram Burgard, and Abhinav Valada
Perception Matters: Enhancing Embodied AI with Uncertainty-Aware Semantic Segmentation
Proceedings of the International Symposium on Robotics Research (ISRR), 2024.

(PDF) (BibTeX)

Authors

Sai Prasanna

University of Freiburg

Daniel Honerkamp

University of Freiburg

Kshitij Sirohi

University of Freiburg

Tim Welschehold

University of Freiburg

Wolfram Burgard

University of Technology Nuremberg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 417962828 and the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA).