EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata

Chenhao Zheng; Ayush Shrivastava; Andrew Owens

We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image.
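The two steps named in the abstract, converting metadata to text and clustering per-patch embeddings, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Tag: value` text template, the helper names `exif_to_text` and `cluster_patches`, the use of plain k-means with cosine similarity, and the toy 2-D vectors standing in for embeddings that a trained image encoder would produce.

```python
import math


def exif_to_text(exif):
    """Serialize an EXIF tag dictionary as a single 'Tag: value' string.

    The template is an assumed format; the abstract only says the
    metadata is converted to text before being fed to a transformer.
    """
    return ", ".join(f"{tag}: {value}" for tag, value in exif.items())


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_patches(embeddings, k=2, iters=20):
    """Group patch embeddings with k-means under cosine similarity.

    Returns one cluster label per patch. In a splice-localization
    setting, patches whose label differs from the majority would be
    candidate spliced regions (a simplification for illustration).
    """
    points = [normalize(e) for e in embeddings]
    # Deterministic farthest-point initialization: start from the first
    # patch, then add the patch least similar to all chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(min(points, key=lambda p: max(dot(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each patch to its most similar center.
        labels = [max(range(k), key=lambda c: dot(p, centers[c])) for p in points]
        # Recompute each center as the normalized mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return labels


# Toy usage: two patches "from one camera" and two "from another".
text = exif_to_text({"Make": "Canon", "ISOSpeedRatings": 100})
labels = cluster_patches([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

In this toy run the first two patches land in one cluster and the last two in the other. A real pipeline would replace the hand-written vectors with embeddings from the learned patch encoder.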

updated: Wed Jan 11 2023 18:59:16 GMT+0000 (UTC)

published: Wed Jan 11 2023 18:59:16 GMT+0000 (UTC)

arXiv
