Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Vivek Rathod; Bryan Seybold; Sudheendra Vijayanarasimhan; Austin Myers; Xiuye Gu; Vighnesh Birodkar; David A. Ross

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
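The abstract does not spell out the detection pipeline, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each video frame and each class name have already been mapped into a shared embedding space by some image-text model. The function names, the toy two-dimensional embeddings, the hand-written motion scores, the equal-weight averaging, and the fixed threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_frames(frame_embs, text_embs):
    """For each frame, map every class name to its cosine similarity."""
    return [{name: cosine(f, t) for name, t in text_embs.items()}
            for f in frame_embs]


def ensemble(scores_a, scores_b, weight=0.5):
    """Late fusion (an assumption): weighted average of two score streams."""
    return [{k: weight * a[k] + (1 - weight) * b[k] for k in a}
            for a, b in zip(scores_a, scores_b)]


def segments(scores, name, threshold):
    """Group consecutive frames scoring >= threshold for `name`.

    Returns (start, end) frame-index pairs with `end` exclusive.
    """
    out, start = [], None
    for i, s in enumerate(scores):
        if s[name] >= threshold and start is None:
            start = i
        elif s[name] < threshold and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(scores)))
    return out


# Toy data: hypothetical 2-D embeddings standing in for real model outputs.
text_embs = {"surfing": [1.0, 0.0], "cooking": [0.0, 1.0]}
frame_embs = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]

image_text_scores = score_frames(frame_embs, text_embs)

# Hypothetical per-frame scores from a second modality (e.g. motion features).
motion_scores = [
    {"surfing": 0.0, "cooking": 1.0},
    {"surfing": 1.0, "cooking": 0.0},
    {"surfing": 1.0, "cooking": 0.0},
    {"surfing": 0.0, "cooking": 1.0},
]

fused = ensemble(image_text_scores, motion_scores)
surf_segments = segments(fused, "surfing", threshold=0.5)
print(surf_segments)
```

Because classes are represented only by their text embeddings, adding a new category in this sketch means adding one entry to `text_embs`, with no retraining step. That is the property that makes such a setup open-vocabulary.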

updated: Tue Jan 10 2023 19:44:37 GMT+0000 (UTC)

published: Tue Dec 20 2022 19:12:58 GMT+0000 (UTC)

arXiv
