Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching

Nicola Messina; Davide Alessandro Coccomini; Andrea Esuli; Fabrizio Falchi

As web resources and online encyclopedias become more accessible, the amount of data to manage keeps growing. Wikipedia, for example, contains millions of pages written in multiple languages. These pages contain images that often lack textual context, so they remain conceptually floating and are harder to find and manage. In this work, we present the system we designed to participate in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use the data associated with an image (its URL and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content in large online encyclopedias. Specifically, we propose a cascade of two models, both built on the recent Transformer architecture, that efficiently and effectively infers a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach obtains a normalized Discounted Cumulative Gain (nDCG) of 0.53 on the private leaderboard of the Kaggle challenge.
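The proposal and re-rank cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two Transformer models are replaced by toy stand-ins (a dot product over precomputed vectors for the proposal stage, a word-overlap function for the re-rank stage), and all names (`propose`, `rerank`, `ndcg_single`, `overlap_scorer`), the toy data, and the cutoff of 5 are assumptions made for illustration. The point is only the structure: a cheap scorer prunes the full caption pool, and an expensive scorer is applied to the few survivors.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def propose(query_vec: Vector, caption_vecs: List[Vector], k: int) -> List[int]:
    """Stage 1: score every caption with a cheap dot product, keep the top k."""
    scores = [dot(query_vec, c) for c in caption_vecs]
    order = sorted(range(len(caption_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(query: str, captions: List[str], candidates: List[int],
           pair_scorer: Callable[[str, str], float]) -> List[int]:
    """Stage 2: apply the expensive pairwise scorer to the k candidates only."""
    return sorted(candidates, key=lambda i: pair_scorer(query, captions[i]),
                  reverse=True)


def ndcg_single(ranking: List[int], correct: int, cutoff: int) -> float:
    """nDCG when exactly one item is relevant: the ideal DCG is 1,
    so the value is 1 / log2(rank + 1), or 0 if the item is not in the cutoff."""
    for rank, idx in enumerate(ranking[:cutoff], start=1):
        if idx == correct:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def overlap_scorer(query: str, caption: str) -> float:
    """Hypothetical stand-in for a learned re-ranker: shared-word count."""
    return float(len(set(query.split()) & set(caption.split())))


# Toy data (assumed): one query, four captions, caption 1 is the correct one.
captions = ["tower bridge london", "eiffel tower in paris", "a cat", "paris skyline"]
caption_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query_text, query_vec = "eiffel tower paris", [1.0, 0.0]

candidates = propose(query_vec, caption_vecs, k=2)
final = rerank(query_text, captions, candidates, overlap_scorer)
score_before = ndcg_single(candidates, correct=1, cutoff=5)
score_after = ndcg_single(final, correct=1, cutoff=5)
```

With N captions in the pool, the expensive scorer runs k times per query instead of N times, which is how a cascade of this kind keeps inference cost bounded. In this toy case the re-rank stage moves the correct caption from second to first place, so the single-item nDCG rises.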

updated: Tue Jun 21 2022 14:30:14 GMT+0000 (UTC)

published: Tue Jun 21 2022 14:30:14 GMT+0000 (UTC)

arXiv
