Learning Multimodal Representations for Unseen Activities

AJ Piergiovanni; Michael S. Ryoo

We present a method for learning a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effects of placing various constraints on the embedding space using paired text and video data. We also propose a method that improves the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. By using unpaired text data, we show that the method can learn a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state of the art.

updated: Tue Jul 07 2020 17:36:54 GMT+0000 (UTC)

published: Thu Jun 21 2018 13:58:49 GMT+0000 (UTC)

arXiv
