Retrieval-Augmented Transformer for Image Captioning

Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

Image captioning models aim to connect vision and language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models, with progress coming either from advances in visual feature extraction or from better modeling of multi-modal connections. In this paper, we investigate an image captioning approach with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at a larger scale.
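
To make the retrieval step in the abstract concrete, here is a minimal sketch of a kNN lookup over an external memory keyed by visual similarity. It is an illustration only, not the authors' implementation: the function names `cosine_similarity` and `retrieve_captions`, the 3-dimensional toy features, and the example captions are all assumptions made for this sketch. In a real system the features would presumably come from a visual encoder, and the retrieved text would be passed on to the kNN-augmented attention layer.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(
    query: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (similarity, caption) pairs whose image features are closest to the query."""
    scored = [(cosine_similarity(query, feat), cap) for feat, cap in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy external memory: hypothetical image features paired with captions.
memory = [
    ([1.0, 0.0, 0.0], "a dog runs on the grass"),
    ([0.0, 1.0, 0.0], "a plate of food on a table"),
    ([0.9, 0.1, 0.0], "a puppy plays in a park"),
]

# Hypothetical feature vector for the input image.
query = [1.0, 0.05, 0.0]

top = retrieve_captions(query, memory, k=2)
retrieved_text = [caption for _, caption in top]
```

With these toy values, `retrieved_text` holds the two dog-related captions, which is the kind of retrieved text a captioning decoder could attend to alongside its own past context.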

updated: Mon Aug 22 2022 07:52:34 GMT+0000 (UTC)

published: Tue Jul 26 2022 19:35:49 GMT+0000 (UTC)

arXiv
