Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang; Yan Huang; Kai Niu; Zhanyu Ma; Liang Wang

Actor and action semantic segmentation is a challenging problem that requires joint understanding of actors and actions, and models for it learn to segment from pre-defined actor and action label pairs. However, existing methods for this task fail to distinguish actors that share the same super-category and cannot identify actor-action pairs that fall outside the fixed actor and action vocabulary. Recent studies have extended this task by using textual queries instead of word-level actor-action pairs, so that the actor and action can be specified flexibly. In this paper, we focus on the text-based actor and action segmentation problem, which requires fine-grained actor and action understanding in video. Previous works predicted segmentation masks from the merged heterogeneous features of a given video and textual query, but they ignored the linguistic variation of the textual query and the visual semantic discrepancy of the video, which led to asymmetric matching between the convolved volumes of the video and the global query representation. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and the action individually in two separate modules. We first learn the actor-/action-related content for the video and the textual query, and then match them in a symmetrical manner to localize the target region. The target region, which includes the desired actor and action, is then fed into a fully convolutional network to predict the segmentation mask. The whole model enables joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance on the A2D Sentences and J-HMDB Sentences datasets.

updated: Mon Nov 02 2020 07:32:39 GMT+0000 (UTC)

published: Mon Nov 02 2020 07:32:39 GMT+0000 (UTC)

arXiv
