Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang; Yan Huang; Kai Niu; Linjiang Huang; Zhanyu Ma; Liang Wang

Text-based video segmentation aims to segment an actor in a video sequence, where the actor and the action it performs are specified by a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry. Semantic asymmetry means that the two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually in two separate modules. Specifically, we first learn the actor-/action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the predictions temporally consistent. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
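
The modular matching idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each candidate tube already has a separate actor feature and action feature, that the textual query has been split into an actor embedding and an action embedding, and that cosine similarity serves as the matching score. The function names `cosine_scores` and `select_target_tube`, and the additive combination of the two scores, are hypothetical choices made only for illustration.

```python
import numpy as np

def cosine_scores(candidates, query):
    """Cosine similarity between each row of `candidates` and the vector `query`."""
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return candidates @ query

def select_target_tube(actor_feats, action_feats, actor_query, action_query):
    """Score every candidate tube with two separate modules and pick the best.

    Assumption: the actor module and the action module each compare
    like with like (actor visual feature vs. actor text embedding,
    action visual feature vs. action text embedding), and the final
    score is simply their sum.
    """
    actor_scores = cosine_scores(actor_feats, actor_query)
    action_scores = cosine_scores(action_feats, action_query)
    total = actor_scores + action_scores
    return int(np.argmax(total)), total

# Toy example with three candidate tubes and 2-D features (made-up values).
actor_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
action_feats = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
actor_query = np.array([1.0, 0.0])
action_query = np.array([1.0, 0.0])

best, scores = select_target_tube(actor_feats, action_feats, actor_query, action_query)
```

In this toy setup only the third tube matches both the actor and the action query, so it receives the highest combined score. In the paper's setting the selected tube would then be passed to a segmentation network, a step this sketch leaves out.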

updated: Mon Aug 22 2022 01:49:29 GMT+0000 (UTC)

published: Mon Nov 02 2020 07:32:39 GMT+0000 (UTC)

arXiv
