Describing and Localizing Multiple Changes with Transformers

Yue Qiu; Shintaro Yamamoto; Kodai Nakashima; Ryota Suzuki; Kenji Iwata; Hirokatsu Kataoka; Yutaka Satoh

Change captioning aims to detect changes between a pair of images observed before and after a scene change and to generate a natural language description of those changes. Existing change captioning studies have mainly focused on scenes with a single change. However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We address this problem from three aspects: (i) we propose a CG-based multi-change captioning dataset; (ii) we benchmark existing state-of-the-art single-change captioning methods on multi-change captioning; (iii) we further propose Multi-Change Captioning transformers (MCCFormers), which identify change regions by densely correlating different regions in image pairs and dynamically determine which change regions relate to which words in the sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. In addition, existing methods generate a single attention map for multiple changes and lack the ability to distinguish change regions. In contrast, our proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr), indicating its general applicability to change captioning tasks.
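
The phrase "densely correlating different regions in image pairs" can be illustrated with a small sketch. The abstract does not specify the MCCFormers architecture, so the code below is not the authors' implementation; it assumes plain scaled dot-product cross-attention in which every region of the "before" image attends over every region of the "after" image. The function names `softmax` and `cross_attention`, and the toy two-dimensional region features, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(before, after):
    """Let each 'before' region attend over all 'after' regions.

    Assumption: scaled dot-product attention without learned projections.
    Returns (weights, attended), where weights[i][j] is how strongly
    'before' region i attends to 'after' region j, and attended[i] is the
    weighted sum of 'after' features for 'before' region i.
    """
    dim = len(before[0])
    scale = 1.0 / math.sqrt(dim)
    weights, attended = [], []
    for query in before:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in after]
        row = softmax(scores)
        weights.append(row)
        attended.append(
            [sum(row[j] * after[j][d] for j in range(len(after))) for d in range(dim)]
        )
    return weights, attended


# Toy region features (hypothetical): three regions per image, two channels each.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # third region differs

weights, attended = cross_attention(before, after)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each row of `weights` is a distribution over the regions of the other image, so every pair of regions is compared. A real model would add learned query, key, and value projections and multiple heads, and would pass the result to a caption decoder.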

updated: Thu Mar 25 2021 21:52:03 GMT+0000 (UTC)

published: Thu Mar 25 2021 21:52:03 GMT+0000 (UTC)

arXiv
