Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Teng Wang; Jinrui Zhang; Junjie Fei; Hao Zheng; Yunlong Tang; Zhe Li; Mingqi Gao; Shanshan Zhao

Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to the user's intent, e.g., attending to specified regions or writing in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation-model-augmented image captioning framework supporting a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by the Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling flexible combinations of different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.
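
To make the two control families in the abstract concrete, the following is a minimal sketch of how visual and language controls might be represented and combined. It is not taken from the Caption-Anything repository: the class names `VisualControl` and `LanguageControl`, the function `build_refinement_prompt`, the default values, and the prompt wording are all hypothetical and chosen only for illustration. The segmentation and captioning stages are omitted; the sketch assumes a raw caption for the selected region is already available and only shows how language controls could be turned into a text instruction for an instruction-following language model.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple


@dataclass
class VisualControl:
    """Hypothetical container for a user region prompt (point, box, or trajectory)."""
    kind: Literal["point", "box", "trajectory"]
    coords: List[Tuple[float, float]]  # pixel (x, y) pairs

    def validate(self) -> None:
        # A point is one (x, y) pair; a box is two corner pairs.
        expected = {"point": 1, "box": 2}
        if self.kind in expected and len(self.coords) != expected[self.kind]:
            raise ValueError(
                f"{self.kind} needs {expected[self.kind]} coordinate pair(s)"
            )
        # A trajectory is an ordered path, so it needs at least two pairs.
        if self.kind == "trajectory" and len(self.coords) < 2:
            raise ValueError("trajectory needs at least 2 coordinate pairs")


@dataclass
class LanguageControl:
    """Hypothetical container for the language controls named in the abstract."""
    sentiment: str = "neutral"
    length: int = 20           # assumed to mean a target word count
    language: str = "English"
    factual: bool = True


def build_refinement_prompt(raw_caption: str, ctrl: LanguageControl) -> str:
    """Turn a raw region caption plus language controls into one text instruction."""
    factual_rule = (
        "Do not add details that are not in the original caption."
        if ctrl.factual
        else "You may add plausible imaginative details."
    )
    return (
        f"Rewrite the caption below in {ctrl.language}, "
        f"with a {ctrl.sentiment} sentiment, in about {ctrl.length} words. "
        f"{factual_rule}\n"
        f"Caption: {raw_caption}"
    )


if __name__ == "__main__":
    box = VisualControl("box", [(10.0, 20.0), (110.0, 220.0)])
    box.validate()
    prompt = build_refinement_prompt(
        "a dog lying on grass",
        LanguageControl(sentiment="positive", length=10, language="French"),
    )
    print(prompt)
```

Keeping the two control types in separate structures mirrors the modular design the abstract describes: any visual control can be paired with any language control without either needing to know about the other.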

updated: Thu Jul 06 2023 13:47:21 GMT+0000 (UTC)

published: Thu May 04 2023 09:48:22 GMT+0000 (UTC)

arXiv

