Support-set based Multi-modal Representation Enhancement for Video Captioning

Xiaoya Chen; Jingkuan Song; Pengpeng Zeng; Lianli Gao; Heng Tao Shen

Video captioning is a challenging task that requires a thorough understanding of visual scenes. Existing methods follow a typical one-to-one mapping, which concentrates on a limited sample space and ignores the intrinsic semantic associations between samples, resulting in rigid and uninformative expressions. To address this issue, we propose a novel and flexible framework, the Support-set based Multi-modal Representation Enhancement (SMRE) model, to mine rich information in a semantic subspace shared between samples. Specifically, we propose a Support-set Construction (SC) module that builds a support set to learn underlying connections between samples and obtain semantically related visual elements. During this process, we design a Semantic Space Transformation (SST) module to constrain relative distances and manage multi-modal interactions in a self-supervised way. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate that SMRE achieves state-of-the-art performance.

updated: Thu May 19 2022 03:40:29 GMT+0000 (UTC)

published: Thu May 19 2022 03:40:29 GMT+0000 (UTC)

arXiv
