VMFormer: End-to-End Video Matting with Transformer

Jiachen Li; Vidit Goel; Marianna Ohanyan; Shant Navasardyan; Yunchao Wei; Humphrey Shi

Video matting aims to predict an alpha matte for each frame of a given input video sequence. For the past few years, solutions to video matting have been dominated by deep convolutional neural networks (CNNs), which have become the de facto standard in both academia and industry. However, CNN-based architectures have a built-in inductive bias toward locality and therefore do not capture the global characteristics of an image. They also lack long-range temporal modeling because of the computational cost of processing feature maps from multiple frames. In this paper, we propose VMFormer, a transformer-based end-to-end method for video matting. Given an input video sequence, it predicts the alpha matte of each frame from learnable queries. Specifically, it leverages self-attention layers to build global integration of feature sequences, with short-range temporal modeling on successive frames. We further apply queries to learn global representations through cross-attention in the transformer decoder, with long-range temporal modeling across all queries. In the prediction stage, both the queries and the corresponding feature maps are used to make the final alpha matte prediction. Experiments show that VMFormer outperforms previous CNN-based video matting methods on the composited benchmarks. To the best of our knowledge, it is the first end-to-end video matting solution built upon a full vision transformer with predictions on learnable queries. The project is open-sourced at https://chrisjuniorli.github.io/project/VMFormer/
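
The query-based prediction described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify projections, head counts, normalization, or how feature maps are fused, so the code assumes one query per frame, single-head attention without learned weights, and a sigmoid over query-feature dot products as the prediction step. All function names (`attend`, `predict_alpha`, `matte_video`) are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Single-head attention with a residual connection (assumed form).

    The query attends over `keys`, which also serve as values, and the
    attended vector is added back onto the query.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    attended = [
        sum(w * k[d] for w, k in zip(weights, keys))
        for d in range(len(query))
    ]
    return [q + a for q, a in zip(query, attended)]


def predict_alpha(query, pixel_features):
    """Alpha per pixel: sigmoid of the query-feature dot product (assumed)."""
    return [1.0 / (1.0 + math.exp(-dot(query, f))) for f in pixel_features]


def matte_video(frames, queries):
    """frames: T lists of per-pixel feature vectors; queries: T vectors."""
    # Cross-attention: each frame's query gathers global context
    # from every pixel feature of its own frame.
    queries = [attend(q, feats) for q, feats in zip(queries, frames)]
    # Long-range temporal modeling: each query attends over all T queries.
    queries = [attend(q, queries) for q in queries]
    # Prediction uses both the query and the frame's feature map.
    return [predict_alpha(q, feats) for q, feats in zip(queries, frames)]


random.seed(0)
T, PIXELS, DIM = 3, 4, 8  # toy sizes: 3 frames, 4 pixels each, 8 channels
frames = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PIXELS)]
    for _ in range(T)
]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T)]
alphas = matte_video(frames, queries)
```

Here `alphas` holds one flattened matte per frame, with each value in the range 0 to 1. In a trained model the queries would be learned parameters and the features would come from the transformer encoder; random values stand in for both in this sketch.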

updated: Fri Aug 26 2022 17:51:02 GMT+0000 (UTC)

published: Fri Aug 26 2022 17:51:02 GMT+0000 (UTC)

arXiv
