TCAM: Temporal Class Activation Maps for Object Localization in Weakly-Labeled Unconstrained Videos

Soufiane Belharbi; Ismail Ben Ayed; Luke McCaffrey; Eric Granger

Weakly supervised video object localization (WSVOL) allows locating objects in videos using only global video tags such as the object class. State-of-the-art methods rely on multiple independent stages: initial spatio-temporal proposals are generated using visual and motion cues, then prominent objects are identified and refined. Localization is done by solving an optimization problem over one or more videos, and video tags are typically used for video clustering. This requires a model per video or per class, which makes inference costly. Moreover, localized regions are not necessarily discriminant, either because of unsupervised motion methods such as optical flow or because video tags are discarded from the optimization. In this paper, we leverage the successful class activation mapping (CAM) methods designed for weakly supervised object localization (WSOL) on still images. A new Temporal CAM (TCAM) method is introduced to train a discriminant deep learning (DL) model that exploits spatio-temporal information in videos, using an aggregation mechanism over consecutive CAMs called CAM-Temporal Max Pooling (CAM-TMP). In particular, activations of regions of interest (ROIs) are collected from CAMs produced by a pretrained CNN classifier to build pixel-wise pseudo-labels for training the DL model. In addition, a global unsupervised size constraint and a local constraint such as a conditional random field (CRF) are used to yield more accurate CAMs. Inference over single independent frames allows parallel processing of a clip of frames and real-time localization. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos indicate that CAM methods (trained on independent frames) can yield decent localization accuracy. Our proposed TCAM method achieves a new state of the art in WSVOL accuracy, and visual results suggest that it can be adapted for subsequent tasks such as visual object tracking and detection. Code is publicly available.
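
To make the aggregation step concrete, the following is a minimal sketch of temporal max pooling over consecutive CAMs, not the authors' implementation. It assumes each CAM is a 2D grid of activations in [0, 1] and that the window for frame t covers frames t-n to t; the function name `cam_temporal_max_pool` and the parameters `t` and `n` are hypothetical and do not come from the abstract.

```python
def cam_temporal_max_pool(cams, t, n):
    """Element-wise max over the CAMs of frames t-n .. t.

    Assumptions (not from the abstract): `cams` is a list of equally sized
    2D lists, one per frame, and the window looks backward from frame t and
    is clipped at the start of the clip.
    """
    start = max(0, t - n)
    window = cams[start:t + 1]
    height, width = len(window[0]), len(window[0][0])
    # For every pixel, keep the strongest activation seen in the window.
    return [
        [max(cam[i][j] for cam in window) for j in range(width)]
        for i in range(height)
    ]


# Toy clip: three 2x2 CAMs.
cams = [
    [[0.1, 0.8], [0.0, 0.2]],
    [[0.3, 0.4], [0.0, 0.9]],
    [[0.2, 0.1], [0.5, 0.3]],
]
pooled = cam_temporal_max_pool(cams, t=2, n=1)
print(pooled)
```

In this sketch, a region that is strongly activated in any frame of the window stays activated in the pooled map, which is one plausible way for ROI activations to be collected across neighbouring frames before pixel-wise pseudo-labels are built.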

updated: Tue Aug 30 2022 21:20:34 GMT+0000 (UTC)

published: Tue Aug 30 2022 21:20:34 GMT+0000 (UTC)

arXiv
