OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Merey Ramazanova; Victor Escorcia; Fabian Caba Heilbron; Chen Zhao; Bernard Ghanem

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and incorporate only visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context: we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on the above datasets.

updated: Wed Oct 26 2022 13:24:39 GMT+0000 (UTC)

published: Thu Feb 10 2022 10:50:52 GMT+0000 (UTC)

arXiv

