VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

Junjie Ke; Keren Ye; Jiahui Yu; Yonghui Wu; Peyman Milanfar; Feng Yang

Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
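
The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so this assumes a symmetric, CLIP-style InfoNCE loss over a batch of matched image-comment embeddings. The function names and the temperature value of 0.07 are illustrative assumptions, not the authors' implementation, and the generative (captioning) objective is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form; not taken from the paper).

    image_embs[k] and text_embs[k] are assumed to come from the same
    image-comment pair; every other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should pick its own comment.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: each column should pick its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy embeddings: correctly matched pairs versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is lower when each image embedding lines up with its own comment embedding than when the pairs are swapped, which is the behavior a contrastive objective rewards during pretraining.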

updated: Fri Mar 24 2023 23:57:28 GMT+0000 (UTC)

published: Fri Mar 24 2023 23:57:28 GMT+0000 (UTC)

arXiv
