A Comprehensive Review on Recent Methods and Challenges of Video Description

Alok Singh; Thoudam Doren Singh; Sivaji Bandyopadhyay

Video description involves generating natural language descriptions of the actions, events, and objects in a video. It has various applications: bridging the gap between language and vision for visually impaired people, generating automatic title suggestions based on content, content-based video browsing, and video-guided machine translation [86]. Over the past decade, considerable work has been done in this field on approaches/methods for video description, evaluation metrics, and datasets. To analyze progress on the video description task, a comprehensive survey is needed that covers all phases of video description approaches, with a special focus on recent deep learning approaches. In this work, we report a comprehensive survey of the phases of video description approaches, the datasets for video description, evaluation metrics, open competitions that motivate research on video description, open challenges in this field, and future research directions. We cover the state-of-the-art approaches proposed for each dataset, along with their pros and cons. The availability of numerous benchmark datasets is a basic need for the growth of this research domain. Further, we categorize all the datasets into two classes: open-domain datasets and domain-specific datasets. From our survey, we observe that work in this field is developing at a fast pace, since the task of video description lies at the intersection of computer vision and natural language processing. Still, work on video description is far from the saturation stage because of various challenges, such as the redundancy caused by similar frames, which affects the quality of visual features, the availability of datasets containing more diverse content, and the availability of an effective evaluation metric.

updated: Mon Nov 30 2020 13:08:45 GMT+0000 (UTC)

published: Mon Nov 30 2020 13:08:45 GMT+0000 (UTC)

arXiv
