Learning to Ground Instructional Articles in Videos through Narrations

Effrosyni Mavroudi; Triantafyllos Afouras; Lorenzo Torresani


In this paper, we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, and ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. In order to validate our model, we introduce a new evaluation benchmark, HT-Step, obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles. A test server is accessible at https://eval.ai/web/challenges/challenge-page/2082. Experiments on this benchmark as well as zero-shot evaluations on CrossTask demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for matching narration-to-video outperforms the state of the art on the HTM-Align narration-video alignment benchmark by a large margin.
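The two-pathway fusion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`compose`, `fuse`, `ground_steps`), the softmax normalization of step-to-narration scores, the weighted-average fusion with a hypothetical weight `alpha`, and the per-step argmax are all assumptions made for illustration. In particular, the sketch grounds each step independently and does not model the order information or pseudo-label refinement mentioned above.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def compose(step_to_narr, narr_to_frame):
    """Indirect pathway (assumed form): weight each narration's frame
    scores by how strongly the step matches that narration."""
    n_frames = len(narr_to_frame[0])
    indirect = []
    for row in step_to_narr:
        weights = softmax(row)
        indirect.append([
            sum(w * narr_to_frame[n][t] for n, w in enumerate(weights))
            for t in range(n_frames)
        ])
    return indirect

def fuse(direct, indirect, alpha=0.5):
    """Hypothetical fusion: weighted average of the two pathways."""
    return [
        [alpha * d + (1 - alpha) * i for d, i in zip(d_row, i_row)]
        for d_row, i_row in zip(direct, indirect)
    ]

def ground_steps(direct, step_to_narr, narr_to_frame, alpha=0.5):
    """Return, for each step, the index of its best-scoring frame."""
    fused = fuse(direct, compose(step_to_narr, narr_to_frame), alpha)
    return [max(range(len(row)), key=row.__getitem__) for row in fused]

if __name__ == "__main__":
    # Toy scores: 2 steps, 2 narrations, 4 frames (made-up values).
    direct = [[0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.3, 0.5]]
    step_to_narr = [[5.0, 0.0], [0.0, 5.0]]
    narr_to_frame = [[0.0, 0.9, 0.1, 0.0], [0.0, 0.0, 0.2, 0.8]]
    print(ground_steps(direct, step_to_narr, narr_to_frame))
```

In this toy setup, each step matches one narration strongly, so the indirect pathway mostly forwards that narration's frame scores and reinforces the direct step-to-frame scores.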

updated: Tue Jun 06 2023 15:45:53 GMT+0000 (UTC)

published: Tue Jun 06 2023 15:45:53 GMT+0000 (UTC)

arXiv
