Zero-Shot Action Recognition from Diverse Object-Scene Compositions

Carlo Bretti; Pascal Mettes

This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available. For this challenging scenario, the current leading approach is to transfer knowledge from the image domain by recognizing objects in videos using pre-trained networks, followed by a semantic matching between objects and actions. Where objects provide a local view on the content in videos, in this work we also seek to include a global view of the scene in which actions occur. We find that scenes on their own are also capable of recognizing unseen actions, albeit more marginally than objects, and a direct combination of object-based and scene-based scores degrades the action recognition performance. To get the best out of objects and scenes, we propose to construct them as a Cartesian product of all possible compositions. We outline how to determine the likelihood of object-scene compositions in videos, as well as a semantic matching from object-scene compositions to actions that enforces diversity among the most relevant compositions for each action. While simple, our composition-based approach outperforms object-based approaches and even state-of-the-art zero-shot approaches that rely on large-scale video datasets with hundreds of seen actions for training and knowledge transfer.
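The composition idea can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give formulas, so the sketch assumes that a composition's likelihood is the product of independent object and scene probabilities, that a composition is embedded as the mean of its object and scene word vectors, and that diversity is encouraged by a greedy penalty on reusing an object or scene. All function names and the `penalty` parameter are hypothetical.

```python
import math
from itertools import product


def composition_probs(p_obj, p_scene):
    """Likelihood of every (object, scene) pair in a video.

    Assumption of this sketch: object and scene predictions are
    independent, so the pair likelihood is their product.
    """
    return {(i, j): po * ps
            for (i, po), (j, ps) in product(enumerate(p_obj), enumerate(p_scene))}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def composition_similarity(action_vec, obj_vecs, scene_vecs):
    """Similarity between an action and each (object, scene) pair.

    Assumption: a pair is embedded as the mean of its two word vectors.
    """
    sim = {}
    for (i, o), (j, s) in product(enumerate(obj_vecs), enumerate(scene_vecs)):
        pair_vec = [(x + y) / 2 for x, y in zip(o, s)]
        sim[(i, j)] = cosine(action_vec, pair_vec)
    return sim


def select_diverse(sim, k, penalty=0.5):
    """Greedily pick k pairs for one action.

    Hypothetical diversity rule: each pick lowers the score of the
    remaining pairs that reuse an already chosen object or scene.
    """
    chosen, used_obj, used_scene = [], {}, {}
    remaining = set(sim)
    while remaining and len(chosen) < k:
        best = max(
            sorted(remaining),  # sorted for deterministic tie-breaking
            key=lambda ij: sim[ij]
            - penalty * (used_obj.get(ij[0], 0) + used_scene.get(ij[1], 0)),
        )
        chosen.append(best)
        remaining.remove(best)
        used_obj[best[0]] = used_obj.get(best[0], 0) + 1
        used_scene[best[1]] = used_scene.get(best[1], 0) + 1
    return chosen


def action_score(p_comp, sim, chosen):
    """Score an unseen action: similarity-weighted pair likelihoods."""
    return sum(sim[c] * p_comp[c] for c in chosen)


# Toy example: 3 objects, 2 scenes, 2-dimensional word vectors.
obj_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_vecs = [[1.0, 0.0], [0.0, 1.0]]
action_vec = [1.0, 0.0]

p_comp = composition_probs([0.7, 0.2, 0.1], [0.6, 0.4])
sim = composition_similarity(action_vec, obj_vecs, scene_vecs)
top = select_diverse(sim, k=2, penalty=10.0)
score = action_score(p_comp, sim, top)
```

With `penalty=0.0` the selection reduces to a plain top-k over the Cartesian product; a larger penalty pushes the selected compositions toward distinct objects and scenes. Predicting the action for a video would then amount to computing `action_score` for every candidate action and taking the maximum.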

updated: Tue Oct 26 2021 08:23:14 GMT+0000 (UTC)

published: Tue Oct 26 2021 08:23:14 GMT+0000 (UTC)

arXiv

