Lightweight Attentional Feature Fusion for Video Retrieval by Text

Fan Hu; Aozhu Chen; Ziyue Wang; Fangming Zhou; Xirong Li

In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of video retrieval by text. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. Accordingly, we propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. Extensive experiments on four public datasets, namely MSR-VTT, MSVD, TGIF, and VATEX, and on the large-scale TRECVID AVS benchmark evaluations (2016-2020) show the viability of LAFF. Moreover, LAFF is extremely simple to implement, making it appealing for real-world deployment.
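
To make the idea of a learned convex combination concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the architecture, so the tanh projection into a shared space, the single scoring vector `attn`, and all names (`project`, `convex_fuse`, `random_matrix`) are assumptions for illustration, and the parameters are random rather than trained. Each feature is mapped to a common dimension, scored with one dot product, and the softmax of those scores gives non-negative weights that sum to one.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(feature, weight):
    """Map a feature to the shared space: tanh(W @ feature). (Assumed design.)"""
    return [math.tanh(sum(w * x for w, x in zip(row, feature))) for row in weight]


def convex_fuse(features, projections, attn):
    """Fuse features of different sizes into one vector by a convex combination.

    features    -- list of vectors, possibly of different lengths
    projections -- one (d x len(feature)) matrix per feature
    attn        -- scoring vector of length d (hypothetical attention parameter)
    """
    projected = [project(f, W) for f, W in zip(features, projections)]
    # One scalar score per feature; cost grows linearly with the feature count.
    scores = [sum(a * p for a, p in zip(attn, vec)) for vec in projected]
    weights = softmax(scores)  # non-negative and summing to 1
    d = len(attn)
    fused = [sum(w * vec[j] for w, vec in zip(weights, projected)) for j in range(d)]
    return fused, weights


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
# Three stand-in "off-the-shelf" features with different dimensionalities.
features = [[rng.random() for _ in range(n)] for n in (8, 6, 5)]
projections = [random_matrix(d, len(f), rng) for f in features]
attn = [rng.uniform(-1, 1) for _ in range(d)]

fused, weights = convex_fuse(features, projections, attn)
print("weights:", weights)
print("fused:", fused)
```

Because the weights form a convex combination, every component of the fused vector lies between the smallest and largest corresponding components of the projected features. Under these assumptions, the same routine could be applied at either the video or the text end, since it only needs a list of feature vectors.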

updated: Fri Dec 03 2021 10:41:12 GMT+0000 (UTC)

published: Fri Dec 03 2021 10:41:12 GMT+0000 (UTC)

arXiv
