Multi-view knowledge distillation transformer for human action recognition

Ying-Chen Lin; Vincent S. Tseng

Transformer-based methods have recently been used to improve the performance of human action recognition. However, most of these studies assume that multi-view data is complete, which is not always the case in real-world scenarios. This paper therefore presents a novel Multi-view Knowledge Distillation Transformer (MKDT) framework consisting of a teacher network and a student network. The framework aims to handle incomplete multi-view human action data in real-world applications. Specifically, MKDT uses a hierarchical vision transformer with shifted windows to capture more spatial-temporal information. Experimental results demonstrate that the framework outperforms the CNN-based method on three public datasets.
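The abstract does not give the training objective, so the following is only a minimal sketch of a generic teacher-student distillation loss (soft-target KL divergence plus cross-entropy), not the MKDT loss itself. The function names, the temperature and alpha parameters, and the idea that the teacher's logits come from complete views while the student's come from incomplete ones are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    Assumption: teacher_logits were produced from all views,
    student_logits from a subset of views.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)
    # Standard cross-entropy of the student against the ground-truth label
    ce = -math.log(softmax(student_logits)[label])
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce


# Example: a student that disagrees with the teacher on a 3-class problem
loss = distillation_loss([0.2, 1.0, -0.5], [2.0, 0.1, -1.0], label=0)
```

The `temperature ** 2` factor is the usual rescaling that keeps the soft-target term comparable in magnitude to the cross-entropy term; whether MKDT uses this weighting is not stated in the abstract.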

updated: Sat Mar 25 2023 04:47:31 GMT+0000 (UTC)

published: Sat Mar 25 2023 04:47:31 GMT+0000 (UTC)

arXiv
