PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

Jing Tan; Xiaotong Zhao; Xintian Shi; Bin Kang; Limin Wang

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting might be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging because it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address these challenges, we extend the sparse query-based detection paradigm from traditional TAD and propose PointTAD, a multi-label TAD framework. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism for localizing the discriminative frames at action boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD employs an end-to-end trainable framework based purely on RGB input for easy deployment. We evaluate the proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
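
The abstract does not spell out how detection-mAP is computed. As a rough illustration only, detection-style metrics for temporal segments are commonly built on temporal IoU between predicted and ground-truth segments of the same class. The sketch below assumes that convention; the `Segment` layout, the function names, and the 0.5 threshold are hypothetical and are not taken from the PointTAD code base.

```python
from typing import List, Tuple

# Hypothetical layout: (start_sec, end_sec, class_label)
Segment = Tuple[float, float, str]


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(preds: List[Segment], gts: List[Segment],
                      thresh: float = 0.5) -> List[bool]:
    """Greedy true-positive matching.

    Assumes `preds` is already sorted by descending confidence.
    Each ground-truth segment can be matched at most once, and only
    by a prediction with the same class label.
    """
    used = [False] * len(gts)
    hits = []
    for ps, pe, pc in preds:
        best, best_j = 0.0, -1
        for j, (gs, ge, gc) in enumerate(gts):
            if gc != pc or used[j]:
                continue
            iou = temporal_iou((ps, pe), (gs, ge))
            if iou > best:
                best, best_j = iou, j
        if best_j >= 0 and best >= thresh:
            used[best_j] = True
            hits.append(True)
        else:
            hits.append(False)
    return hits


# Two co-occurring ground-truth actions of different classes.
gts = [(0.0, 10.0, "run"), (2.0, 8.0, "jump")]
preds = [(1.0, 9.0, "run"), (2.0, 8.0, "jump"), (0.0, 10.0, "run")]
hits = match_predictions(preds, gts)
```

Because matching is per class, overlapping actions with different labels do not compete for the same ground truth, which is the situation a multi-label benchmark has to handle. Averaging precision over classes and IoU thresholds would follow from these per-prediction hit flags.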

updated: Sat Oct 22 2022 04:38:38 GMT+0000 (UTC)

published: Thu Oct 20 2022 06:08:03 GMT+0000 (UTC)

arXiv
