Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

Xiang Li; Jinglu Wang; Xiaohao Xu; Xiao Li; Bhiksha Raj; Yan Lu

Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods rest on a critical assumption: the referred object must appear in the video. This assumption, which we refer to as semantic consensus, is often violated in real-world scenarios, where the expression may be queried against false videos (videos that do not contain the referred object). In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and to impose it on positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on a point-wise constraint. A new evaluation dataset, R2-Youtube-VOS, is constructed to measure model robustness. Our model achieves state-of-the-art performance on the R-VOS benchmarks Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our R2-Youtube-VOS dataset.
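
The contrast between a point-wise and a structural cycle constraint can be illustrated with a small sketch. This is not the formulation used in the paper, which the abstract does not specify. It is a toy example under stated assumptions: text embeddings are plain 2-D vectors, the "structure" is taken to be the matrix of pairwise cosine similarities, and all function names (`pointwise_gap`, `structural_gap`, `relation_matrix`) are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pointwise_gap(orig, recon):
    # Assumed point-wise constraint: mean squared Euclidean distance
    # between each original embedding and its reconstruction.
    total = sum(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in zip(orig, recon)
    )
    return total / len(orig)

def relation_matrix(embs):
    # Assumed notion of "structure": pairwise cosine similarities.
    return [[cosine(u, v) for v in embs] for u in embs]

def structural_gap(orig, recon):
    # Assumed structural constraint: mean squared difference between
    # the relation matrices of the original and reconstructed sets.
    rel_o, rel_r = relation_matrix(orig), relation_matrix(recon)
    n = len(orig)
    total = sum(
        (rel_o[i][j] - rel_r[i][j]) ** 2
        for i in range(n) for j in range(n)
    )
    return total / (n * n)

def rotate90(v):
    # Rotate a 2-D vector by 90 degrees; all pairwise angles are preserved.
    x, y = v
    return [-y, x]

# Toy embeddings of three expressions (illustrative values only).
orig = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Reconstruction that lands on different points but keeps the same relations.
recon_same_structure = [rotate90(v) for v in orig]

# Reconstruction whose relations have collapsed (a stand-in for a mismatch).
recon_mismatch = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]

print("point-wise, same structure:", pointwise_gap(orig, recon_same_structure))
print("structural, same structure:", structural_gap(orig, recon_same_structure))
print("structural, mismatch:", structural_gap(orig, recon_mismatch))
```

In this toy setting the rotated reconstruction is penalized heavily by the point-wise measure even though its relational structure is unchanged, while the structural measure separates it from the collapsed reconstruction. This gives one intuition for why a structural constraint may tolerate linguistic diversity better than a point-wise one.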

updated: Fri Aug 18 2023 18:48:33 GMT+0000 (UTC)

published: Mon Jul 04 2022 05:08:09 GMT+0000 (UTC)

arXiv
