LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Siqi Sun; Yen-Chun Chen; Linjie Li; Shuohang Wang; Yuwei Fang; Jingjing Liu

Multimodal pre-training has driven great advances in vision-and-language (V+L) research. These large-scale pre-trained models, although successful, inevitably suffer from slow inference because of the enormous computation cost, which comes mainly from cross-modal attention in the Transformer architecture. In real-life applications, this latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which was widely studied even before the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. In fact, LightningDOT achieves a new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume 1000 times more computational hours. Code and pre-training checkpoints are available at https://github.com/intersun/LightningDOT.
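
The retrieval flow described in the abstract (feature indexes extracted offline, dot-product matching at query time, then re-ranking of the shortlist) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoders are omitted, the vectors and image IDs are made up, and `retrieve`, `rerank`, and `slow_scorer` are hypothetical names, with `slow_scorer` standing in for whatever more expensive model re-scores the shortlist.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(
    query_vec: Vector, image_index: Dict[str, Vector], top_k: int
) -> List[Tuple[str, float]]:
    """Score every pre-computed image vector against the query; keep the top_k."""
    scored = [(image_id, dot(query_vec, vec)) for image_id, vec in image_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(
    candidates: List[Tuple[str, float]], slow_scorer: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Re-score only the shortlisted candidates with a more expensive scorer."""
    rescored = [(image_id, slow_scorer(image_id)) for image_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Assumption: these vectors were produced offline by an image encoder.
image_index = {
    "img_a": [1.0, 0.0],
    "img_b": [0.6, 0.8],
    "img_c": [0.0, 1.0],
}

# Assumption: this vector is the text encoder's output for the query.
query_vec = [0.8, 0.6]

# Fast first pass: one dot product per indexed image.
shortlist = retrieve(query_vec, image_index, top_k=2)

# Hypothetical re-ranker scores; a real system would run a heavier model here.
made_up_scores = {"img_a": 0.9, "img_b": 0.7, "img_c": 0.1}
final = rerank(shortlist, lambda image_id: made_up_scores[image_id])

print([image_id for image_id, _ in shortlist])
print([image_id for image_id, _ in final])
```

The point of the split is that the expensive scorer runs only on the `top_k` shortlist, while the first pass over the whole index needs nothing beyond dot products against vectors computed ahead of time.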

updated: Tue Mar 16 2021 00:35:28 GMT+0000 (UTC)

published: Tue Mar 16 2021 00:35:28 GMT+0000 (UTC)

arXiv

