ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Chen Liang; Yu Wu; Yawei Luo; Yi Yang

Text-based video segmentation is a challenging task that segments the objects referred to by natural language in videos. It requires both semantic comprehension and fine-grained video understanding. Existing methods introduce language representations into segmentation models in a bottom-up manner, which conducts vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction falls short, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language/referring expressions. In fact, people usually describe a target object using its relations with other objects, which may not be easily understood without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance. We first figure out all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin. Qualitative results also show that our results are more explainable.
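The top-down idea in the abstract (first collect candidate objects, then pick the referred one from object-level cues) can be illustrated with a toy sketch. This is not the paper's model: the `Candidate` class, the hand-made feature vectors, the `prefer_left` flag, and the additive scoring rule are all assumptions made purely for illustration. Averaging features over frames stands in for a temporal cue, cosine similarity with a text vector for a text-guided semantic cue, and the mean box center for a positional cue.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical candidate object tracked across frames."""
    name: str
    frame_features: list  # one feature vector (list of floats) per frame
    center_x: list        # normalized box center per frame: 0 = left, 1 = right

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def temporal_mean(vectors):
    """Average per-frame vectors into one video-level vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def score(candidate, text_vec, prefer_left):
    """Toy score: semantic match plus an optional 'on the left' bonus."""
    semantic = cosine(temporal_mean(candidate.frame_features), text_vec)
    mean_x = sum(candidate.center_x) / len(candidate.center_x)
    positional = (1.0 - mean_x) if prefer_left else 0.0
    return semantic + positional

def select_referred(candidates, text_vec, prefer_left=False):
    """Return the candidate with the highest toy score."""
    return max(candidates, key=lambda c: score(c, text_vec, prefer_left))

# Two look-alike dogs and a cat; the text vector [1, 0] stands for "dog".
candidates = [
    Candidate("dog_right", [[1.0, 0.0], [0.9, 0.1]], [0.80, 0.75]),
    Candidate("dog_left", [[1.0, 0.0], [0.9, 0.1]], [0.20, 0.25]),
    Candidate("cat", [[0.0, 1.0], [0.1, 0.9]], [0.10, 0.10]),
]
print(select_referred(candidates, [1.0, 0.0], prefer_left=True).name)  # dog_left
```

In this toy setup the two dogs are indistinguishable by appearance alone, so only the positional cue separates them, which mirrors the abstract's point that referring expressions often rely on relations that local features cannot capture.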

updated: Fri Jan 19 2024 14:43:57 GMT+0000 (UTC)

published: Fri Mar 19 2021 09:31:08 GMT+0000 (UTC)

arXiv
