NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Santhosh Kumar Ramakrishnan; Ziad Al-Halah; Kristen Grauman

Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as the ability to perform zero-shot and few-shot NLQ and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq.
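As a rough illustration of the idea in the abstract, the sketch below turns timestamped narrations into (query, temporal window) samples that could be added to an NLQ training set. It is not the authors' implementation: the `Narration` and `QuerySample` types, the `narrations_to_queries` function, and the fixed-width window around each timestamp are all assumptions made for this example, and the abstract does not say how the windows are actually chosen.

```python
from dataclasses import dataclass


@dataclass
class Narration:
    """A timestamped narration (hypothetical schema, not the Ego4D format)."""
    text: str
    timestamp: float  # seconds from the start of the video


@dataclass
class QuerySample:
    """One training sample for a video query localization model."""
    query: str
    start: float  # window start, in seconds
    end: float    # window end, in seconds


def narrations_to_queries(narrations, video_duration, half_width=2.0):
    """Convert each narration into a (query, temporal window) sample.

    Assumption: the window is a fixed-width span centered on the narration
    timestamp and clipped to the video bounds. The window rule used by NaQ
    itself may differ.
    """
    samples = []
    for n in narrations:
        start = max(0.0, n.timestamp - half_width)
        end = min(video_duration, n.timestamp + half_width)
        samples.append(QuerySample(query=n.text, start=start, end=end))
    return samples
```

Under these assumptions, the resulting samples would simply be concatenated with the manually annotated NLQ training queries before training the localization model.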

updated: Sat Mar 25 2023 04:46:18 GMT+0000 (UTC)

published: Mon Jan 02 2023 16:40:15 GMT+0000 (UTC)

arXiv
