Learning Combinatorial Prompts for Universal Controllable Image Captioning

Zhen Wang; Jun Xiao; Lei Chen; Fei Gao; Jian Shao; Long Chen

Controllable Image Captioning (CIC) -- generating natural language descriptions of images under the guidance of given control signals -- is one of the most promising directions towards next-generation captioning systems. So far, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one particular control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly use the pretrained GPT-2 as our language model, which helps to bridge the gap between different signal-specific CIC architectures. Then, we reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network that generates the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize prompt-based CIC. Due to its simplicity, ComPro can easily be extended to more complex combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
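
The idea of combining control signals by concatenating their prompts can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the prompt shapes, the signal names, or the exact form of the mask attention, so the function names `combine_prompts` and `build_attention_mask`, the signal names "content" and "structure", and the prefix-style mask (prompts visible to every position, caption tokens attending causally) are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def combine_prompts(prompts_by_signal: Dict[str, List[Vector]],
                    order: List[str]) -> List[Vector]:
    """Concatenate per-signal prompt vectors along the sequence axis.

    Hypothetical helper: each control signal contributes a short list of
    prompt vectors, and a combined signal is their concatenation.
    """
    combined: List[Vector] = []
    for name in order:
        combined.extend(prompts_by_signal[name])
    return combined


def build_attention_mask(num_prompt: int, num_caption: int) -> List[List[int]]:
    """Build an assumed prefix-style mask (1 = may attend, 0 = blocked).

    Rows are query positions, columns are key positions. Prompt keys are
    visible to every query; caption keys are visible only to caption
    queries at the same or a later position (causal decoding).
    """
    total = num_prompt + num_caption
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_prompt:
                mask[q][k] = 1  # every position may read the prompts
            elif q >= num_prompt and k <= q:
                mask[q][k] = 1  # caption tokens attend causally
    return mask


# Toy example: one "structure" prompt vector and two "content" prompt vectors.
prompts = {
    "structure": [[0.1, 0.2]],
    "content": [[0.3, 0.4], [0.5, 0.6]],
}
combined = combine_prompts(prompts, order=["structure", "content"])
mask = build_attention_mask(num_prompt=len(combined), num_caption=2)
```

In this sketch, adding another control signal only lengthens the prompt prefix; the language model and the mask construction stay unchanged, which is one plausible reading of why concatenation makes the extension to combined signals simple.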

updated: Sat Mar 11 2023 07:53:15 GMT+0000 (UTC)

published: Sat Mar 11 2023 07:53:15 GMT+0000 (UTC)

arXiv
