VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search

Xiaopeng Lu; Tiancheng Zhao; Kyusong Lee

Text-to-image retrieval, that is, retrieving relevant images from a large, unlabelled image dataset given a textual query, is an essential task in multi-modal information retrieval. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that substantially improves on existing models in both accuracy and efficiency. We show that VisualSparta outperforms all previous scalable methods on MSCOCO and Flickr30K. It also offers a substantial retrieval speed advantage: for an index of 1 million images, VisualSparta achieves a speedup of over 391x compared to standard vector search. Experiments show that this speed advantage grows with dataset size, because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that achieves real-time search over very large datasets, with a significant accuracy improvement over previous state-of-the-art methods.
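The abstract attributes the speed advantage to an inverted-index implementation. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes each image has already been assigned a sparse set of per-term weights (the toy numbers, the image ids, and the function names `build_inverted_index` and `search` are all hypothetical), and it scores a query by summing the stored weights of its terms. How VisualSparta actually computes those weights is not described in the abstract.

```python
from collections import defaultdict


def build_inverted_index(image_term_weights):
    """Map each term to a list of (image_id, weight) postings.

    `image_term_weights` is a hypothetical precomputed mapping
    image_id -> {term: weight}; zero weights are dropped so the
    index stays sparse.
    """
    index = defaultdict(list)
    for image_id, term_weights in image_term_weights.items():
        for term, weight in term_weights.items():
            if weight > 0:
                index[term].append((image_id, weight))
    return index


def search(index, query, top_k=3):
    """Score images by summing the stored weights of the query terms.

    Only the postings of terms that occur in the query are visited,
    so the cost depends on posting-list lengths rather than on a
    dense comparison against every image vector.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        for image_id, weight in index.get(term, []):
            scores[image_id] += weight
    # Sort by descending score, breaking ties by image id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy weights, invented purely for illustration.
toy_weights = {
    "img_1": {"dog": 2.0, "grass": 1.0},
    "img_2": {"cat": 1.5, "sofa": 1.0},
    "img_3": {"dog": 0.5, "sofa": 0.75, "ball": 0.0},
}

index = build_inverted_index(toy_weights)
results = search(index, "dog sofa")
for image_id, score in results:
    print(image_id, score)
```

Because all per-image weights are computed offline in this sketch, query time reduces to a handful of posting-list lookups and additions, which is the property that lets an inverted index scale to large collections.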

updated: Fri Jan 01 2021 16:29:17 GMT+0000 (UTC)

published: Fri Jan 01 2021 16:29:17 GMT+0000 (UTC)

arXiv
