Joint Visual Grounding and Tracking with Natural Language Specification

Li Zhou; Zikun Zhou; Kaige Mao; Zhenyu He

Tracking by natural language specification aims to locate the referred target in a sequence based on a natural language description. Existing algorithms solve this problem in two steps, visual grounding and tracking, and deploy separate grounding and tracking models to implement the two steps. Such a separated framework overlooks the link between visual grounding and tracking: the natural language description provides global semantic cues for localizing the target in both steps. Moreover, the separated framework can hardly be trained end-to-end. To address these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module that provides temporal clues under the guidance of the global semantic information, which effectively improves adaptability to appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT.
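The unified formulation can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes that grounding uses only language tokens as references and that tracking adds template tokens from an earlier frame, and it stands in for relation modeling with plain scaled dot-product attention. The names `build_references` and `localize`, the 2-D token vectors, and the argmax read-out are all hypothetical choices made for illustration.

```python
import math
from typing import List, Optional, Sequence

Token = Sequence[float]  # one embedded token (word or image patch); hypothetical representation


def _dot(a: Token, b: Token) -> float:
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_references(language: List[Token],
                     template: Optional[List[Token]] = None) -> List[Token]:
    """Assumed convention: grounding passes language only, tracking adds a template."""
    refs = list(language)
    if template:
        refs.extend(template)
    return refs


def localize(test_tokens: List[Token], references: List[Token]) -> int:
    """Return the index of the test-image token that best matches the references.

    Each test token attends over all reference tokens (scaled dot-product
    attention) and is scored by its similarity to the attended reference.
    """
    dim = len(references[0])
    best_index, best_score = 0, -math.inf
    for index, query in enumerate(test_tokens):
        weights = _softmax([_dot(query, ref) / math.sqrt(dim) for ref in references])
        attended = [sum(w * ref[d] for w, ref in zip(weights, references))
                    for d in range(dim)]
        score = _dot(query, attended)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy data: the middle test token is aligned with both the language and the template.
language = [[1.0, 0.0]]
template = [[0.9, 0.1]]
test_image = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

grounding_hit = localize(test_image, build_references(language))           # language only
tracking_hit = localize(test_image, build_references(language, template))  # language + template
```

Both calls go through the same `localize` function, and only the set of references changes. That is the sense in which the two steps can be treated as one task in this sketch.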

updated: Tue Mar 21 2023 17:09:03 GMT+0000 (UTC)

published: Tue Mar 21 2023 17:09:03 GMT+0000 (UTC)

arXiv
