Iconographic Image Captioning for Artworks

Eva Cetinic

Image captioning is the task of automatically generating textual descriptions of images based only on the visual input. Although this topic has been studied extensively in recent years, few contributions have addressed art historical data. In this context, image captioning faces several challenges: the lack of large-scale datasets of image-text pairs, the complexity of meaning involved in describing artworks, and the need for expert-level annotations. This work aims to address some of those challenges by using a novel large-scale dataset of artwork images annotated with concepts from Iconclass, a classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task. Motivated by the state-of-the-art results achieved in generating captions for natural images, a transformer-based vision-language pre-trained model is fine-tuned on the artwork image dataset. The results are evaluated quantitatively using standard image captioning metrics. The quality of the generated captions and the model's capacity to generalize to new data are explored by applying the model to a new collection of paintings and analyzing the relation between commonly generated captions and artistic genre. The overall results suggest that the model can generate meaningful captions that are more relevant to the art historical context than captions obtained from models trained only on natural image datasets.
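The abstract does not name the metrics used. As an illustration of what an n-gram overlap captioning metric measures, the following is a minimal sketch of unigram BLEU (clipped unigram precision with a brevity penalty) against a single reference. BLEU is assumed here only as a representative example, the function names are hypothetical, and the example captions are invented for illustration rather than taken from the paper's dataset.

```python
import math
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference.

    Each candidate token is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(candidate)


def bleu1(candidate, reference):
    """Unigram BLEU for one candidate against one reference caption."""
    if not candidate:
        return 0.0
    precision = clipped_unigram_precision(candidate, reference)
    # Brevity penalty: captions shorter than the reference are penalized.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision


# Hypothetical captions, tokenized on whitespace.
reference = "madonna with child and angels".split()
generated = "madonna with child".split()
score = bleu1(generated, reference)
```

Here every generated token appears in the reference, so the precision is 1 and the score is reduced only by the brevity penalty for the shorter caption. Published evaluations normally rely on established implementations that combine several n-gram orders and multiple references rather than on a hand-written function like this one.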

updated: Sun Feb 07 2021 23:11:33 GMT+0000 (UTC)

published: Sun Feb 07 2021 23:11:33 GMT+0000 (UTC)

arXiv
