A Comprehensive Review of the Video-to-Text Problem

Jesus Perez-Martin; Benjamin Bustos; Silvio Jamil F. Guimarães; Ivan Sipiran; Jorge Pérez; Grethel Coello Said

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information comes from videos, this leads to Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association is made mainly in one of two ways: by retrieving the most relevant descriptions from a corpus, or by generating a new description given a context video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities, called text retrieval from video and video captioning/description. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths with respect to the problem requirements. We also show the progress that researchers have made on each dataset, cover the challenges in the field, and discuss future research directions.

updated: Tue Nov 30 2021 20:18:30 GMT+0000 (UTC)

published: Sat Mar 27 2021 02:12:28 GMT+0000 (UTC)

arXiv
