PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning

Yifan Lu; Ziqi Zhang; Yuxin Chen; Chunfeng Yuan; Bing Li; Weiming Hu

Dense Video Captioning (DVC) aims to generate timestamped captions for multiple events in a single video. Semantic information plays an important role in both the localization and the description of events in DVC. We present a semantic-assisted dense video captioning model based on the encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to sufficiently represent the input video. In the decoding stage, we design a classification head, in parallel with the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset under DVC evaluation metrics and performs strongly in the Makeup Dense Video Captioning (MDVC) task of the PIC 4th Challenge.
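The data flow described in the abstract (semantic features fused with visual features, then three parallel heads) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how fusion is performed or how the heads are built, so concatenation as the fusion step, plain linear layers as heads, and every name and number below (`fuse_features`, `linear`, `decode`, the weights) are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def fuse_features(visual: List[Vector], concept_probs: Vector) -> Vector:
    """Fuse by concatenation (an assumption; the abstract does not specify the method)."""
    fused: Vector = []
    for feat in visual:          # one vector per hypothetical visual modality
        fused.extend(feat)
    fused.extend(concept_probs)  # semantic information from a concept detector
    return fused


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def decode(fused: Vector, heads: Dict[str, Tuple[Matrix, Vector]]) -> Dict[str, Vector]:
    """Apply every head to the same fused representation, i.e. in parallel."""
    return {name: linear(fused, w, b) for name, (w, b) in heads.items()}


# Toy inputs: two visual modalities and one concept probability (all hypothetical).
visual = [[1.0, 2.0], [3.0]]
concept_probs = [0.5]
fused = fuse_features(visual, concept_probs)

# Hand-picked weights standing in for trained parameters (hypothetical).
heads = {
    "localization": ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0]),
    "captioning": ([[0.0, 0.0, 1.0, 0.0]], [0.0]),
    "classification": ([[0.0, 0.0, 0.0, 2.0]], [1.0]),
}
outputs = decode(fused, heads)
```

All three heads read the same fused vector, so in a trained model a loss on the classification output would also shape the representation that the localization and captioning heads consume. That is one plausible reading of the "semantic supervision" the abstract mentions.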

updated: Mon Jul 11 2022 07:20:42 GMT+0000 (UTC)

published: Wed Jul 06 2022 10:56:53 GMT+0000 (UTC)

arXiv
