S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention

Chiyu Zhang; Jun Yang; Lei Wang; Zaiyan Dai

This paper presents a new hierarchical vision Transformer for image style transfer, called Strips Window Attention Transformer (S2WAT), which serves as the encoder in an encoder-transfer-decoder architecture. Because its features are hierarchical, S2WAT allows future work to apply techniques proven in other fields of computer vision, such as feature pyramid networks (FPN) or U-Net, to image style transfer. However, when existing window-based Transformers are introduced directly into image style transfer, the stylized images become grid-like. To solve this problem, we propose S2WAT, whose representation is computed with Strips Window Attention (SpW Attention). SpW Attention integrates both local information and long-range dependencies in the horizontal and vertical directions through a novel feature fusion scheme named Attn Merge. Moreover, previous window-based Transformers require the feature resolution to be divisible by the window size, which rules out inputs of arbitrary size. In this paper, we take advantage of padding and un-padding operations so that S2WAT supports inputs of arbitrary size. Qualitative and quantitative experiments demonstrate that S2WAT achieves performance comparable to state-of-the-art CNN-based, flow-based and Transformer-based approaches.
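The padding and un-padding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pad_amount`, `pad_to_window`, `unpad`), the zero fill value, and the choice to pad only the bottom and right edges are assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a feature map of arbitrary size can be extended to a multiple of the window size and then cropped back.

```python
def pad_amount(size, window):
    """Extra rows or columns needed so that `size` becomes a multiple of `window`."""
    return (window - size % window) % window


def pad_to_window(feature, window, fill=0.0):
    """Pad a 2-D feature map (a list of rows) on the bottom and right edges.

    Assumption: constant fill on the bottom/right; the paper may pad differently.
    Returns the padded map and the original (height, width) for later cropping.
    """
    height, width = len(feature), len(feature[0])
    pad_h = pad_amount(height, window)
    pad_w = pad_amount(width, window)
    padded = [row + [fill] * pad_w for row in feature]
    padded += [[fill] * (width + pad_w) for _ in range(pad_h)]
    return padded, (height, width)


def unpad(feature, original_size):
    """Crop a padded feature map back to its original (height, width)."""
    height, width = original_size
    return [row[:width] for row in feature[:height]]


# A 7 x 5 feature map is not divisible by a window size of 4.
feature = [[float(r * 5 + c) for c in range(5)] for r in range(7)]
padded, size = pad_to_window(feature, window=4)   # padded is 8 x 8
restored = unpad(padded, size)                    # back to 7 x 5
```

In this sketch, windowed attention would run on `padded`, whose sides are multiples of the window size, and `unpad` would restore the original resolution afterwards.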

updated: Sat Oct 22 2022 07:56:13 GMT+0000 (UTC)

published: Sat Oct 22 2022 07:56:13 GMT+0000 (UTC)

arXiv
