Transformer for Image Quality Assessment

Junyong You; Jari Korhonen

Transformer has become the new standard method in natural language processing (NLP), and it is also attracting research interest in the computer vision area. In this paper, we investigate the application of Transformer in Image Quality (TRIQ) assessment. Following the original Transformer encoder employed in Vision Transformer (ViT), we propose an architecture that uses a shallow Transformer encoder on top of a feature map extracted by a convolutional neural network (CNN). Adaptive positional embedding is employed in the Transformer encoder to handle images of arbitrary resolution. Different settings of the Transformer architecture have been investigated on publicly available image quality databases. We have found that the proposed TRIQ architecture achieves outstanding performance. The implementation of TRIQ is published on GitHub (https://github.com/junyongyou/triq).
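
The architecture described in the abstract can be illustrated with a rough NumPy sketch. This is not the published TRIQ code. It assumes a single encoder layer with single-head attention, ReLU in the feed-forward block, an extra learned token whose output is mapped to a scalar score, and one plausible reading of "adaptive positional embedding": a table sized for the largest expected feature map and truncated to the actual number of positions. All names, sizes (`D_MODEL`, `MAX_TOKENS`, 64 CNN channels), and the random weights are hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

C_CNN = 64         # assumed channel count of the CNN feature map
D_MODEL = 32       # assumed Transformer width
D_FF = 64          # assumed feed-forward width
MAX_TOKENS = 1024  # assumed upper bound on feature-map positions


def init(shape):
    """Random stand-in for a learned weight matrix."""
    return rng.normal(0.0, 0.02, shape)


params = {
    "proj": init((C_CNN, D_MODEL)),          # CNN channels -> model width
    "cls": init((1, D_MODEL)),               # extra token that aggregates quality
    "pos": init((MAX_TOKENS + 1, D_MODEL)),  # positional table for the largest input
    "wq": init((D_MODEL, D_MODEL)),
    "wk": init((D_MODEL, D_MODEL)),
    "wv": init((D_MODEL, D_MODEL)),
    "wo": init((D_MODEL, D_MODEL)),
    "w1": init((D_MODEL, D_FF)),
    "w2": init((D_FF, D_MODEL)),
    "head": init((D_MODEL, 1)),              # maps the extra token to a score
}


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def encoder_layer(x, p):
    """One pre-norm encoder layer: self-attention, then feed-forward."""
    h = layer_norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = x + (attn @ v) @ p["wo"]
    h = layer_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]


def predict_quality(feature_map, p):
    """Map a CNN feature map of shape (H, W, C_CNN) to a scalar score."""
    height, width, channels = feature_map.shape
    n_tokens = height * width
    if n_tokens > MAX_TOKENS:
        raise ValueError("feature map exceeds the positional table")
    # Flatten spatial positions into a token sequence and project them.
    tokens = feature_map.reshape(n_tokens, channels) @ p["proj"]
    x = np.concatenate([p["cls"], tokens], axis=0)
    # Assumed adaptive step: keep only as many positions as this image needs.
    x = x + p["pos"][: x.shape[0]]
    x = encoder_layer(x, p)
    return float((layer_norm(x[:1]) @ p["head"])[0, 0])


# Feature maps of different spatial sizes pass through the same parameters.
small = predict_quality(rng.normal(size=(7, 10, C_CNN)), params)
large = predict_quality(rng.normal(size=(12, 16, C_CNN)), params)
```

Because the positional table is sliced rather than fixed to one sequence length, the same weights accept feature maps of any size up to `MAX_TOKENS` positions, which is the property the abstract attributes to the adaptive embedding. The scores here are meaningless, since the weights are untrained.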

updated: Wed Dec 30 2020 18:43:11 GMT+0000 (UTC)

published: Wed Dec 30 2020 18:43:11 GMT+0000 (UTC)

arXiv
