What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Tal Shaharabany; Yoad Tewel; Lior Wolf

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate a hypothesis regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows very convincing results in the novel task of weakly-supervised open-world purely visual phrase-grounding presented in our work. For example, on the datasets used for benchmarking phrase-grounding, our method results in a very modest degradation in comparison to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://talshaharabany/what-is-where-by-looking.
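The inference stage described in the abstract (a captioner proposes phrases for image regions, and a localizer trained with image-text matching places each phrase) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function `ground_without_text`, the use of explicit region proposals, the score threshold, and the per-phrase deduplication are not taken from the paper. The three callables are hypothetical stand-ins for BLIP, CLIP, and the trained localization network, and no real library API is used.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Grounding:
    phrase: str
    box: Box
    score: float


def ground_without_text(
    image: Any,
    regions: Sequence[Box],
    caption_region: Callable[[Any, Box], str],  # hypothetical stand-in for BLIP
    match_score: Callable[[Any, str], float],   # hypothetical stand-in for CLIP
    localize: Callable[[Any, str], Box],        # hypothetical stand-in for the trained localizer
    min_score: float = 0.0,                     # illustrative threshold, not from the paper
) -> List[Grounding]:
    """Return (phrase, box) pairs for an image, with no text given as input."""
    best: Dict[str, Grounding] = {}
    for region in regions:
        phrase = caption_region(image, region)  # hypothesis about this region
        score = match_score(image, phrase)      # how well the phrase fits the image
        if score < min_score:
            continue
        box = localize(image, phrase)           # ground the generated phrase
        # Keep one grounding per distinct phrase (an assumption of this sketch).
        if phrase not in best or score > best[phrase].score:
            best[phrase] = Grounding(phrase, box, score)
    return sorted(best.values(), key=lambda g: g.score, reverse=True)


# Toy stand-ins so the sketch runs without any model weights.
toy_image = {"a dog": (10, 20, 50, 80), "a red ball": (60, 70, 90, 95)}
toy_regions = [(0, 0, 55, 85), (55, 60, 100, 100), (5, 15, 52, 82)]
toy_scores = {"a dog": 0.9, "a red ball": 0.8}


def toy_caption(image, region):
    # Name the object whose box centre lies inside the region.
    for phrase, (x0, y0, x1, y1) in image.items():
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if region[0] <= cx <= region[2] and region[1] <= cy <= region[3]:
            return phrase
    return "background"


def toy_score(image, phrase):
    return toy_scores.get(phrase, 0.0)


def toy_localize(image, phrase):
    return image.get(phrase, (0, 0, 0, 0))


groundings = ground_without_text(
    toy_image, toy_regions, toy_caption, toy_score, toy_localize, min_score=0.5
)
for g in groundings:
    print(g.phrase, g.box, g.score)
```

Passing the models as callables keeps the sketch independent of any particular checkpoint or library; the repository linked above is the place to look for the authors' actual pipeline.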

updated: Sun Jun 19 2022 09:07:30 GMT+0000 (UTC)

published: Sun Jun 19 2022 09:07:30 GMT+0000 (UTC)

arXiv
