A Review of Deep Learning for Video Captioning

Moloud Abdar; Meenakshi Kollati; Swaraja Kuraparthi; Farhad Pourpanah; Daniel McDuff; Mohammad Ghavamzadeh; Shuicheng Yan; Abduallah Mohamed; Abbas Khosravi; Erik Cambria; Fatih Porikli

Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in computer vision, natural language processing (NLP), linguistics, and human-computer interaction. In essence, VC involves understanding a video and describing it in natural language. Captioning is used in a host of applications, from creating more accessible interfaces (e.g., low-vision navigation) to video question answering (V-QA), video retrieval, and content generation. This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, and dense video captioning (DVC). We discuss the datasets and evaluation metrics used in the field, as well as the limitations, applications, challenges, and future directions of VC.

updated: Sat Apr 22 2023 15:30:54 GMT+0000 (UTC)

published: Sat Apr 22 2023 15:30:54 GMT+0000 (UTC)

arXiv
