Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection

Rui Dai; Srijan Das; Francois Bremond

In video understanding, most cross-modal knowledge distillation (KD) methods are tailored to classification tasks and focus on discriminative representations of trimmed videos. Action detection, however, requires not only categorizing actions but also localizing them in untrimmed videos. Transferring knowledge about temporal relations is therefore critical for this task, yet it is missing from previous cross-modal KD frameworks. To this end, we aim to learn an augmented RGB representation for action detection by taking advantage of additional modalities at training time through KD. We propose a KD framework consisting of two levels of distillation. On the one hand, atomic-level distillation encourages the RGB student to learn the sub-representation of the actions from the teacher in a contrastive manner. On the other hand, sequence-level distillation encourages the student to learn temporal knowledge from the teacher by transferring the Global Contextual Relations and the Action Boundary Saliency. The result is an Augmented-RGB stream that achieves performance competitive with a two-stream network while using only RGB at inference time. Extensive experimental analysis shows that the proposed distillation framework is generic and outperforms other popular cross-modal distillation methods on the action detection task.
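As a rough illustration of what learning from the teacher "in a contrastive manner" can look like, the sketch below computes an InfoNCE-style loss between student and teacher feature vectors. The abstract does not specify the loss, so the InfoNCE form, the temperature value, and all names (`contrastive_distillation_loss`, `student_feats`, `teacher_feats`) are assumptions made for illustration, not the authors' implementation. Each student feature is pulled toward the teacher feature at the same index and pushed away from the teacher features at other indices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    student_feats[i] and teacher_feats[i] are treated as a positive pair;
    every teacher_feats[j] with j != i acts as a negative for student i.
    """
    s = [l2_normalize(v) for v in student_feats]
    t = [l2_normalize(v) for v in teacher_feats]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student i to every teacher feature.
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching teacher feature.
        total += log_sum_exp - logits[i]
    return total / len(s)


# Toy features: two samples with orthogonal teacher representations.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_distillation_loss(teacher, teacher)        # student matches teacher
swapped = contrastive_distillation_loss(teacher[::-1], teacher)  # student pairs are mismatched
```

In this toy setup, `aligned` is much smaller than `swapped`, which is the behavior a contrastive objective is meant to reward. A real training setup would apply such a loss to learned features from the RGB student and the additional-modality teacher, typically in a deep learning framework rather than plain Python.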

updated: Sun Aug 08 2021 12:04:14 GMT+0000 (UTC)

published: Sun Aug 08 2021 12:04:14 GMT+0000 (UTC)

arXiv
