Transformed ROIs for Capturing Visual Transformations in Videos

Abhinav Rai; Fadime Sener; Angela Yao

Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are indirect. We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities such as hands and interacting objects and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and Epic-Kitchens-100.
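The abstract describes the module only at a high level, so the following is an illustrative sketch rather than the authors' implementation. It assumes that ROI boxes (for example hands and objects) are already given in feature-map coordinates, that each ROI is mean-pooled into a single token, and that the tokens are related with plain scaled dot-product self-attention, a generic relation mechanism swapped in here for illustration. The function name `transform_rois`, the box format, and the randomly initialised projection matrices are all hypothetical; in a trained network the projections would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transform_rois(feats, boxes, seed=0):
    """Relate ROI features across space and time and write them back.

    feats: array of shape (T, H, W, C), a mid-level CNN feature map.
    boxes: list of (t, y0, y1, x0, x1) in feature-map cells (assumed format).
    Returns the updated feature map and the ROI-to-ROI attention matrix.
    """
    T, H, W, C = feats.shape
    rng = np.random.default_rng(seed)
    # Hypothetical query/key/value projections (random here, learned in practice).
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

    # 1. Pool each region of interest into one C-dimensional token.
    tokens = np.stack(
        [feats[t, y0:y1, x0:x1].mean(axis=(0, 1)) for t, y0, y1, x0, x1 in boxes]
    )  # (N, C)

    # 2. Let the tokens attend to each other, regardless of frame or position.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))  # (N, N), rows sum to 1
    updated = attn @ v  # (N, C)

    # 3. Add each updated token back into its own region of the feature map.
    out = feats.copy()
    for (t, y0, y1, x0, x1), u in zip(boxes, updated):
        out[t, y0:y1, x0:x1] += u  # broadcast over the ROI's cells
    return out, attn


# Usage: two ROIs in different frames of a 4-frame, 7x7, 8-channel feature map.
feats = np.random.default_rng(1).standard_normal((4, 7, 7, 8))
boxes = [(0, 1, 3, 1, 3), (3, 4, 6, 2, 5)]
out, attn = transform_rois(feats, boxes)
```

Because the output has the same shape as the input and only cells inside the ROIs change, a block like this could in principle be dropped between existing convolutional layers, which is the sense of "plug-and-play" assumed in this sketch.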

updated: Sun Jun 06 2021 15:59:53 GMT+0000 (UTC)

published: Sun Jun 06 2021 15:59:53 GMT+0000 (UTC)

arXiv
