OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context

Merey Ramazanova; Victor Escorcia; Fabian Caba Heilbron; Chen Zhao; Bernard Ghanem

Temporal action localization (TAL) is an important task that has been extensively explored and improved for third-person videos in recent years. Recent efforts have also been made to perform fine-grained temporal localization on first-person videos. However, current TAL methods use only visual signals, neglecting the audio modality, which exists in most videos and carries meaningful action information in egocentric videos. In this work, we take a deep look into the effectiveness of audio in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL) to leverage audio-visual information and context for egocentric TAL. To do so, we: 1) compare and study different strategies for where and how to fuse the two modalities; 2) propose a transformer-based model to incorporate temporal audio-visual context. Our experiments show that our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
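The question of "where" to fuse the two modalities can be illustrated with a toy contrast between two generic fusion points. The abstract does not specify which strategies are compared, so the sketch below is only an assumption for illustration: the function names `fuse_features` and `fuse_scores`, the `audio_weight` parameter, and the per-snippet list layout are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def fuse_features(visual: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Feature-level fusion (assumed variant): concatenate the visual and
    audio feature vectors of each temporal snippet before any detector."""
    if len(visual) != len(audio):
        raise ValueError("both modalities must cover the same snippets")
    return [v + a for v, a in zip(visual, audio)]


def fuse_scores(visual_scores: Vector, audio_scores: Vector,
                audio_weight: float = 0.5) -> Vector:
    """Score-level fusion (assumed variant): weighted average of the
    per-snippet action scores produced separately for each modality."""
    if len(visual_scores) != len(audio_scores):
        raise ValueError("both modalities must cover the same snippets")
    return [(1.0 - audio_weight) * v + audio_weight * a
            for v, a in zip(visual_scores, audio_scores)]


if __name__ == "__main__":
    # Two snippets, 2-D visual features and 1-D audio features (made-up values).
    print(fuse_features([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.25]]))
    # Per-snippet scores from two hypothetical single-modality detectors.
    print(fuse_scores([1.0, 0.0], [0.0, 1.0]))
```

Feature-level fusion lets a downstream model see both modalities jointly, whereas score-level fusion keeps the two pipelines independent until the end; which works better for egocentric TAL is the kind of question the paper studies empirically.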

updated: Mon Feb 14 2022 15:30:49 GMT+0000 (UTC)

published: Thu Feb 10 2022 10:50:52 GMT+0000 (UTC)

arXiv
