Transformed ROIs for Capturing Visual Transformations in Videos

Abhinav Rai; Fadime Sener; Angela Yao

Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are captured only indirectly. We present TROI, a plug-and-play module for CNNs that reasons between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities, such as hands and interacting objects, and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and EPIC-Kitchens-100.
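To make the idea of transforming regions of interest inside a feature map concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `transform_rois`, the box format, the average pooling, the parameter-free self-attention, and the residual write-back are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transform_rois(feats, rois):
    """Illustrative ROI transformation (hypothetical, not the paper's exact module).

    feats: array of shape (T, C, H, W), per-frame convolutional feature maps.
    rois:  list of (t, y0, y1, x0, x1) boxes in feature-map coordinates,
           e.g. hand and object boxes (assumed to come from a detector).
    """
    out = feats.copy()
    # 1. Average-pool each ROI into one C-dimensional token.
    tokens = np.stack([
        feats[t, :, y0:y1, x0:x1].mean(axis=(1, 2))
        for t, y0, y1, x0, x1 in rois
    ])  # shape (N, C)
    # 2. Relate the tokens across space and time with plain self-attention
    #    (no learned projections, an assumption made to keep the sketch short).
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    updated = softmax(scores, axis=-1) @ tokens  # shape (N, C)
    # 3. Write each updated token back into its own region as a residual,
    #    leaving features outside the ROIs untouched.
    for (t, y0, y1, x0, x1), delta in zip(rois, updated):
        out[t, :, y0:y1, x0:x1] += delta[:, None, None]
    return out
```

In this sketch only the features inside the boxes change, so the output has the same shape as the input and could in principle be passed on to the next convolutional layer, which is one way to read the "plug-and-play" description.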

updated: Sat Nov 05 2022 17:57:37 GMT+0000 (UTC)

published: Sun Jun 06 2021 15:59:53 GMT+0000 (UTC)

arXiv
