Localizing Objects in 3D from Egocentric Videos with Visual Queries

Jinjie Mai; Abdullah Hamdi; Silvio Giancola; Chen Zhao; Bernard Ghanem

With recent advances in video and 3D understanding, novel 4D spatio-temporal challenges fusing both concepts have emerged. In this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle the problem of VQ3D by lifting the 2D localization results of the sister task Visual Queries with 2D Localization (VQ2D) into a 3D reconstruction. Yet, we point out that the low number of Queries with Poses (QwP) from previous VQ3D methods severely hinders their overall success rate and highlights the need for further effort in 3D modeling to tackle the VQ3D task. In this work, we formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. We estimate more robust camera poses, leading to more successful object queries and substantially improved VQ3D performance. In practice, our method reaches a top-1 overall success rate of 86.36% on the Ego4D Episodic Memory Benchmark VQ3D, a 10x improvement over the previous state-of-the-art. In addition, we provide a complete empirical study highlighting the remaining challenges in VQ3D.
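
To make the task statement concrete, the following is a minimal sketch of lifting a 2D detection into 3D and expressing it relative to the query frame's camera. It is not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a known depth at the detection center, and 4x4 camera-to-world poses for both frames; the function names `backproject` and `to_query_frame` are hypothetical.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with depth along the optical axis to a 3D point
    in the camera frame (assumed pinhole model with intrinsics K)."""
    pixel = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ pixel)

def to_query_frame(p_cam, T_world_from_resp, T_world_from_query):
    """Re-express a point given in the response-frame camera coordinates
    in the query-frame camera coordinates.
    Both poses are assumed to be 4x4 camera-to-world transforms."""
    p_h = np.append(p_cam, 1.0)                  # homogeneous coordinates
    p_world = T_world_from_resp @ p_h            # response camera -> world
    p_query = np.linalg.inv(T_world_from_query) @ p_world  # world -> query camera
    return p_query[:3]

# Illustrative values only (not taken from the paper or the benchmark).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_resp = np.eye(4)                 # response camera sits at the world origin
T_query = np.eye(4)
T_query[0, 3] = 1.0                # query camera shifted 1 unit along world x

center_cam = backproject(320.0, 240.0, 2.0, K)
center_query = to_query_frame(center_cam, T_resp, T_query)
```

The second step needs a pose for both the response frame and the query frame, so under these assumptions a query without estimated poses cannot be answered at all, which is why the number of Queries with Poses bounds the achievable success rate.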

updated: Wed Dec 14 2022 01:28:12 GMT+0000 (UTC)

published: Wed Dec 14 2022 01:28:12 GMT+0000 (UTC)

arXiv
