Egocentric Scene Understanding via Multimodal Spatial Rectifier

Tien Do; Khiem Vuong; Hyun Soo Park

In this paper, we study the problem of egocentric scene understanding, i.e., predicting depth and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing geometry prediction models do not apply; (2) dynamic foreground objects, including hands, constitute a large proportion of visual scenes. These challenges limit the performance of existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier, which often produces excessive perspective warp for egocentric images, the multimodal spatial rectifier learns from multiple directions that can minimize the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method for single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) and EPIC-KITCHENS.
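To make the idea of stabilizing an image to one of several reference directions more concrete, the following is a minimal geometric sketch and not the authors' implementation. It assumes the gravity direction is already known in camera coordinates, that a small set of reference directions is given by hand (in the paper these are learned), and that the camera has a known pinhole intrinsic matrix `K`. The function names `rotation_between` and `rectifying_homography` and all numeric values are hypothetical. The sketch picks the reference direction closest to the gravity direction, which keeps the required rotation (and hence the perspective warp) small, and builds the pure-rotation homography `H = K R K^{-1}`.

```python
import numpy as np


def rotation_between(a, b):
    """Smallest rotation R such that R @ a is parallel to b (Rodrigues' formula)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)          # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))     # cos(angle)
    s = np.linalg.norm(v)       # sin(angle)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)    # already aligned
        # Antiparallel case: rotate by 180 degrees about any axis orthogonal to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)


def rectifying_homography(gravity, references, K):
    """Pick the reference direction closest to gravity and return (index, R, H).

    Assumption: H = K @ R @ inv(K) is the homography induced by a pure camera
    rotation R; it maps original pixel coordinates to rectified ones.
    """
    g = np.asarray(gravity, dtype=float)
    g = g / np.linalg.norm(g)
    refs = np.asarray(references, dtype=float)
    refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    idx = int(np.argmax(refs @ g))       # largest cosine = smallest rotation angle
    R = rotation_between(g, refs[idx])
    H = K @ R @ np.linalg.inv(K)
    return idx, R, H


# Hypothetical intrinsics and two hand-picked reference directions.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
references = np.array([[0.0, 1.0, 0.0],    # camera roughly upright
                       [0.0, 0.0, 1.0]])   # camera looking roughly straight down
idx, R, H = rectifying_homography([0.1, 0.9, 0.2], references, K)
```

With a single reference direction, a strongly tilted view would need a large rotation and therefore a large warp; choosing among several references bounds that rotation, which is the intuition behind the multimodal design described above. The resulting `H` could be passed to an image-warping routine to produce the rectified image.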

updated: Thu Jul 14 2022 17:26:00 GMT+0000 (UTC)

published: Thu Jul 14 2022 17:26:00 GMT+0000 (UTC)

arXiv
