Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation

Yongqin Xian; Bruno Korbar; Matthijs Douze; Lorenzo Torresani; Bernt Schiele; Zeynep Akata

Few-shot learning aims to recognize novel classes from a few examples. Although significant progress has been made in the image domain, few-shot video classification is relatively unexplored. We argue that previous methods underestimate the importance of video feature learning and propose to learn spatiotemporal features using a 3D CNN. We propose a two-stage approach that first learns video features on base classes and then fine-tunes the classifiers on novel classes, and we show that this simple baseline outperforms prior few-shot video classification methods by over 20 points on existing benchmarks. To circumvent the need for labeled examples, we present two novel approaches that yield further improvement. First, we leverage tag-labeled videos from a large dataset, using tag retrieval followed by selection of the best clips by visual similarity. Second, we learn generative adversarial networks that generate video features of novel classes from their semantic embeddings. Moreover, we find that existing benchmarks are limited because they focus on only 5 novel classes in each testing episode. We therefore introduce more realistic benchmarks that involve more novel classes (few-shot learning) as well as a mixture of novel and base classes (generalized few-shot learning). The experimental results show that our retrieval and feature generation approaches significantly outperform the baseline approach on the new benchmarks.
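The clip selection step in the retrieval approach can be illustrated with a minimal sketch. The abstract only says that clips are selected "with visual similarities"; the choices here (averaging the few labeled features into a class prototype and ranking candidates by cosine similarity) are assumptions made for illustration, as are the names `cosine` and `select_clips` and the toy 2-D feature vectors. This is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_clips(support_feats, candidate_feats, k):
    """Return indices of the k candidates most similar to the support mean.

    support_feats: features of the few labeled clips of one novel class.
    candidate_feats: features of clips retrieved by tag for that class.
    Assumption: the class is summarized by the mean of its support features.
    """
    dim = len(support_feats[0])
    proto = [sum(f[d] for f in support_feats) / len(support_feats)
             for d in range(dim)]
    ranked = sorted(range(len(candidate_feats)),
                    key=lambda i: cosine(proto, candidate_feats[i]),
                    reverse=True)
    return ranked[:k]

# Toy data: two labeled clips and four tag-retrieved candidates (hypothetical).
support = [[1.0, 0.0], [0.8, 0.2]]
candidates = [[0.0, 1.0], [0.9, 0.1], [1.0, 1.0], [-1.0, 0.0]]
best = select_clips(support, candidates, k=2)
print(best)
```

In a real pipeline the feature vectors would presumably come from the 3D CNN trained on the base classes, and the selected clips would be added to the training set of the novel-class classifier.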

updated: Wed Oct 13 2021 13:31:06 GMT+0000 (UTC)

published: Thu Jul 09 2020 13:05:32 GMT+0000 (UTC)

arXiv
