Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals

Lu Jin; Zechao Li; Jinhui Tang

Hashing has been widely applied to multimodal retrieval on large-scale multimedia data because of its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework uses a 2-D convolutional neural network (CNN) as the backbone to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels in the resulting hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact, high-quality hash codes by simultaneously exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The DSMHN is a generic and scalable deep hashing framework for both image-text and video-text retrieval, and it can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks using four widely used multimodal retrieval data sets. Experimental results on both image-text and video-text retrieval tasks demonstrate that the DSMHN significantly outperforms state-of-the-art methods.
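As an illustration of how the three kinds of objectives named in the abstract might be combined, the following is a minimal NumPy sketch, not the authors' formulation. The abstract does not specify the loss functions, so the pairwise negative log-likelihood, the sigmoid cross-entropy on a linear classifier, the quantization penalty, and the names `joint_hashing_loss`, `binarize`, `w_cls`, `alpha`, and `beta` are all assumptions made for this example.

```python
import numpy as np


def joint_hashing_loss(h_img, h_txt, sim, labels, w_cls, alpha=1.0, beta=0.1):
    """Illustrative joint objective for two modality-specific hashing networks.

    h_img, h_txt : (n, k) real-valued network outputs (relaxed hash codes)
    sim          : (n, n) 0/1 matrix; sim[i, j] = 1 if image i and text j share a label
    labels       : (n, c) multi-hot semantic labels
    w_cls        : (k, c) weights of a hypothetical linear classifier on the codes
    alpha, beta  : hypothetical trade-off weights (not taken from the paper)
    """
    # Intermodality similarity term (assumed form): pairwise negative
    # log-likelihood on scaled inner products between image and text codes.
    theta = 0.5 * h_img @ h_txt.T
    inter = np.mean(np.logaddexp(0.0, theta) - sim * theta)

    # Semantic label term (assumed form): sigmoid cross-entropy of the labels
    # predicted from each modality's codes by the shared linear classifier.
    def label_loss(h):
        logits = h @ w_cls
        return np.mean(np.logaddexp(0.0, logits) - labels * logits)

    semantic = label_loss(h_img) + label_loss(h_txt)

    # Quantization term (assumed form): push relaxed outputs toward {-1, +1}.
    quant = np.mean((np.abs(h_img) - 1.0) ** 2) + np.mean((np.abs(h_txt) - 1.0) ** 2)

    return inter + alpha * semantic + beta * quant


def binarize(h):
    """Map relaxed outputs to binary codes in {-1, +1}."""
    return np.where(h >= 0, 1, -1).astype(np.int8)
```

In this sketch the similarity term ties the two streams together, while the label term reflects the stated assumption that the codes should also work well for classification. Swapping in a different pairwise or classification loss only changes the corresponding term.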

updated: Wed Jan 05 2022 03:36:16 GMT+0000 (UTC)

published: Wed Jan 09 2019 10:27:57 GMT+0000 (UTC)

arXiv
