Colar: Effective and Efficient Online Action Detection by Consulting Exemplars

Le Yang; Junwei Han; Dingwen Zhang

Online action detection has attracted increasing research interest in recent years. Existing works model historical dependencies and anticipate the future to perceive how an action evolves within a video segment and thereby improve detection accuracy. However, the existing paradigm ignores category-level modeling and does not pay sufficient attention to efficiency. Within a category, the representative frames exhibit various characteristics. Thus, category-level modeling can provide complementary guidance to temporal dependency modeling. This paper develops an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates exemplary features based on the similarity weights. This is also an efficient mechanism, as both similarity measurement and feature aggregation require limited computation. Based on the exemplar-consultation mechanism, long-term dependencies can be captured by regarding historical frames as exemplars, while category-level modeling can be achieved by regarding representative frames from a category as exemplars. Owing to the complementarity of category-level modeling, our method employs a lightweight architecture but achieves new high performance on three benchmarks. In addition, when a spatio-temporal network is used to process video frames, our method makes a good trade-off between effectiveness and efficiency. Code is available at https://github.com/VividLe/Online-Action-Detection.
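
The exemplar-consultation step described in the abstract (measure similarity between a frame and the exemplars, then aggregate exemplar features by the similarity weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `consult_exemplars`, the use of cosine similarity, the softmax normalization, and the `temperature` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def consult_exemplars(frame, exemplars, temperature=1.0):
    """Hypothetical sketch: aggregate exemplar features weighted by similarity to `frame`.

    frame:     list of floats, the feature of the current frame
    exemplars: list of feature vectors (historical frames or category representatives)
    Cosine similarity and softmax weighting are assumptions, not taken from the paper.
    """

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0 else 0.0

    # Step 1: similarity between the frame and every exemplar.
    sims = [cosine(frame, e) / temperature for e in exemplars]

    # Step 2: turn similarities into weights (numerically stable softmax).
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Step 3: weighted sum of exemplar features.
    dim = len(frame)
    aggregated = [sum(w * e[i] for w, e in zip(weights, exemplars)) for i in range(dim)]
    return aggregated, weights


if __name__ == "__main__":
    current = [1.0, 0.0]
    history = [[1.0, 0.0], [0.0, 1.0]]  # toy exemplar features
    feature, weights = consult_exemplars(current, history)
    print(feature, weights)
```

Under these assumptions the same function would serve both branches the abstract mentions: passing features of historical frames as `exemplars` models long-term dependencies, while passing representative frames of one category models that category. The cost is one similarity and one weighted sum per exemplar, which is consistent with the abstract's claim that the mechanism needs limited computation.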

updated: Tue Mar 22 2022 13:31:53 GMT+0000 (UTC)

published: Wed Mar 02 2022 12:13:08 GMT+0000 (UTC)

arXiv
