Actor-identified Spatiotemporal Action Detection -- Detecting Who Is Doing What in Videos

Fan Yang; Norimichi Ukita; Sakriani Sakti; Satoshi Nakamura

The success of deep learning in video Action Recognition (AR) has motivated researchers to extend related tasks progressively from coarse to fine-grained levels. Whereas conventional AR only predicts an action label for an entire video, Temporal Action Detection (TAD) estimates the start and end time of each action in a video. Taking TAD a step further, Spatiotemporal Action Detection (SAD) localizes each action both spatially and temporally. However, SAD generally ignores who performs the action, even though identifying the actor can also be important. To this end, we propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification. In ASAD, we not only detect the spatiotemporal boundary of each instance-level action but also assign a unique ID to each actor. Multiple Object Tracking (MOT) and Action Classification (AC) are the two fundamental elements of our approach to ASAD. MOT yields the spatiotemporal boundary of each actor and assigns it to a unique actor identity; AC then estimates the action class within the corresponding spatiotemporal boundary. Since ASAD is a new task, it poses many new challenges that existing methods cannot address: i) no dataset has been created specifically for ASAD, ii) no evaluation metrics have been designed for ASAD, and iii) current MOT performance is the bottleneck for obtaining satisfactory ASAD results. To address these problems, we i) annotate a new ASAD dataset, ii) propose ASAD evaluation metrics that account for multi-label actions and actor identification, and iii) improve the data association strategies in MOT to boost MOT performance, which leads to better ASAD results. The code is available at https://github.com/fandulu/ASAD.
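
The two-stage structure described in the abstract (tracking to obtain identity-labelled spatiotemporal boundaries, then classification inside each boundary) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the greedy IoU matcher stands in for a real MOT tracker, and the names `Tube`, `track`, `detect`, the `classify` callback, and the `iou_thresh` default are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    """One actor's spatiotemporal boundary plus its (multi-label) actions."""
    actor_id: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box
    actions: List[str] = field(default_factory=list)


def track(frames: List[List[Box]], iou_thresh: float = 0.3) -> List[Tube]:
    """Greedy IoU data association (an assumed stand-in for a MOT tracker)."""
    tubes: List[Tube] = []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        for tube in tubes:
            # Only tubes observed in the previous frame can be extended.
            if t - 1 not in tube.boxes or not unmatched:
                continue
            prev = tube.boxes[t - 1]
            best = max(unmatched, key=lambda d: iou(prev, d))
            if iou(prev, best) >= iou_thresh:
                tube.boxes[t] = best
                unmatched.remove(best)
        # Each leftover detection starts a new actor identity.
        for det in unmatched:
            tubes.append(Tube(actor_id=len(tubes), boxes={t: det}))
    return tubes


def detect(frames: List[List[Box]],
           classify: Callable[[Tube], List[str]]) -> List[Tube]:
    """Stage 1: assign actor IDs by tracking. Stage 2: label each tube."""
    tubes = track(frames)
    for tube in tubes:
        tube.actions = classify(tube)
    return tubes
```

Given per-frame person detections and any function that maps a tube to a list of action labels, `detect` returns one `Tube` per actor, so each result answers both "who" (`actor_id`) and "what" (`actions`). A practical system would replace the greedy matcher with a stronger association strategy, which the abstract identifies as the main bottleneck.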

updated: Sat Aug 27 2022 06:51:12 GMT+0000 (UTC)

published: Sat Aug 27 2022 06:51:12 GMT+0000 (UTC)

arXiv
