FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Rui Liu; Hanming Deng; Yangyi Huang; Xiaoyu Shi; Lewei Lu; Wenxiu Sun; Xiaogang Wang; Jifeng Dai; Hongsheng Li

Transformers, as a strong and flexible architecture for modelling long-range relations, have been widely explored in vision tasks. However, when used for video inpainting, which requires fine-grained representation, existing methods still yield blurry edges in fine details because of hard patch splitting. Here we aim to tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. Soft Split divides a feature map into many patches with a given overlapping interval. Conversely, Soft Composition stitches the patches back into a whole feature map, summing the pixels in overlapping regions. These two modules are first used in tokenization before the Transformer layers and in de-tokenization after the Transformer layers, for effective mapping between tokens and features. Sub-patch-level information interaction is thereby enabled for more effective feature propagation between neighboring patches, resulting in vivid synthesized content for hole regions in videos. Moreover, in FuseFormer we insert Soft Composition and Soft Split into the feed-forward network, enabling the 1D linear layers to model 2D structure and further enhancing the sub-patch-level feature fusion ability. In both quantitative and qualitative evaluations, FuseFormer surpasses state-of-the-art methods. We also conduct a detailed analysis to examine its superiority.
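The overlapping split and summed re-composition described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a single-channel 2D feature map stored as a list of lists, a square patch size `k`, and a stride `s`; the function names `soft_split` and `soft_composition` and their parameters are hypothetical names chosen for this illustration. Choosing `s < k` gives overlapping patches, while `s == k` reduces to a hard, non-overlapping split.

```python
def soft_split(fmap, k, s):
    """Cut a 2D map into flattened k x k patches taken every s pixels.

    With s < k, neighboring patches overlap (illustrative assumption:
    no padding, so only fully contained patches are extracted).
    """
    h, w = len(fmap), len(fmap[0])
    patches = []
    for top in range(0, h - k + 1, s):
        for left in range(0, w - k + 1, s):
            # Flatten the patch row by row into a 1D token.
            patch = [fmap[top + i][left + j] for i in range(k) for j in range(k)]
            patches.append(patch)
    return patches


def soft_composition(patches, h, w, k, s):
    """Stitch flattened patches back into an h x w map.

    Values that land on the same pixel (overlapping regions) are summed.
    """
    out = [[0.0] * w for _ in range(h)]
    idx = 0
    for top in range(0, h - k + 1, s):
        for left in range(0, w - k + 1, s):
            patch = patches[idx]
            idx += 1
            for i in range(k):
                for j in range(k):
                    out[top + i][left + j] += patch[i * k + j]
    return out


# Usage: a 4 x 4 map of ones, 2 x 2 patches, stride 1 (overlap of one pixel).
ones = [[1.0] * 4 for _ in range(4)]
tokens = soft_split(ones, k=2, s=1)
recomposed = soft_composition(tokens, h=4, w=4, k=2, s=1)
```

In this sketch, each output pixel of `recomposed` equals the number of patches that cover it, which makes the summation in overlapping regions visible: pixels shared by several patches receive contributions from all of them, which is how information can mix between neighboring patches at the sub-patch level.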

updated: Tue Sep 07 2021 10:13:29 GMT+0000 (UTC)

published: Tue Sep 07 2021 10:13:29 GMT+0000 (UTC)

arXiv
