Object-Region Video Transformers

Roei Herzig; Elad Ben-Avraham; Karttikeya Mangalam; Amir Bar; Gal Chechik; Anna Rohrbach; Trevor Darrell; Amir Globerson

Recently, video transformers have shown great success in video understanding, exceeding CNN performance. Yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them through the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and Epic-Kitchen100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/
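To make the appearance stream more concrete, the following is a minimal NumPy sketch of joint self-attention over patch tokens and object-region tokens. It is not the authors' implementation. The function names (`pool_object_tokens`, `object_region_attention`), the single-frame and single-head setting, the average pooling of patches inside each box, and the residual update of the patch tokens are all assumptions made for illustration. The abstract only states that self-attention is applied over patches and object regions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_object_tokens(patches, boxes, grid_h, grid_w):
    """Build one token per object by averaging the patch tokens inside its box.

    Assumption: boxes are (row0, col0, row1, col1) in patch-grid units,
    half-open, and average pooling stands in for a proper RoI operator.
    """
    d = patches.shape[-1]
    grid = patches.reshape(grid_h, grid_w, d)
    tokens = [grid[r0:r1, c0:c1].reshape(-1, d).mean(axis=0)
              for (r0, c0, r1, c1) in boxes]
    return np.stack(tokens)


def object_region_attention(patches, objects, w_q, w_k, w_v):
    """Single-head self-attention over the concatenation [patches; objects].

    Returns the patch tokens updated with a residual connection (an
    assumption of this sketch) and the full attention matrix.
    """
    tokens = np.concatenate([patches, objects], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v
    n = patches.shape[0]
    # Keep only the patch rows: patches now carry object-aware context.
    return patches + out[:n], attn
```

With a hypothetical 4x4 patch grid, 8-dimensional tokens, and two boxes, `pool_object_tokens` yields 2 object tokens and `object_region_attention` returns 16 updated patch tokens plus an 18x18 attention matrix. The last two columns of that matrix show how strongly each token attends to the object regions.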

updated: Tue Nov 30 2021 15:49:19 GMT+0000 (UTC)

published: Wed Oct 13 2021 17:51:46 GMT+0000 (UTC)

arXiv
