Recurrent Instance Segmentation using Sequences of Referring Expressions

Alba Herrera-Palacio; Carles Ventura; Carina Silberer; Ionut-Teodor Sorodoc; Gemma Boleda; Xavier Giro-i-Nieto

The goal of this work is to segment the objects in an image that are referred to by a sequence of linguistic descriptions (referring expressions). We propose a deep neural network with recurrent layers that output a sequence of binary masks, one for each referring expression provided by the user. The recurrent layers in the architecture allow the model to condition each predicted mask on the previous ones, from a spatial perspective within the same image. Our multimodal approach uses off-the-shelf architectures to encode both the image and the referring expressions. The visual branch provides a tensor of pixel embeddings that are concatenated with the phrase embeddings produced by a language encoder. Our experiments on the RefCOCO dataset for still images indicate how the proposed architecture successfully exploits the sequences of referring expressions to solve a pixel-wise task of instance segmentation.
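
The fusion and recurrence described above can be illustrated with a minimal sketch. It is not the authors' implementation: the off-the-shelf image and language encoders are replaced by random vectors, the recurrent layer is reduced to a scalar per-pixel state (a real model would presumably use convolutional recurrent layers so that context is shared spatially), and every name, size, and weight below is hypothetical.

```python
import math
import random


def fuse(pixel_embeddings, phrase_embedding):
    """Tile the phrase embedding over every pixel and concatenate along channels."""
    return [[pixel + phrase_embedding for pixel in row] for row in pixel_embeddings]


def recurrent_step(fused, hidden, w_in, w_rec):
    """Toy per-pixel recurrent update: h_t = tanh(w_in . x_t + w_rec * h_{t-1})."""
    new_hidden = []
    for row_x, row_h in zip(fused, hidden):
        new_row = []
        for x, h in zip(row_x, row_h):
            pre_activation = sum(w * v for w, v in zip(w_in, x)) + w_rec * h
            new_row.append(math.tanh(pre_activation))
        new_hidden.append(new_row)
    return new_hidden


def segment_sequence(pixel_embeddings, phrase_embeddings, w_in, w_rec):
    """Return one binary mask per referring expression, in input order."""
    height, width = len(pixel_embeddings), len(pixel_embeddings[0])
    hidden = [[0.0] * width for _ in range(height)]  # state carried across expressions
    masks = []
    for phrase in phrase_embeddings:
        fused = fuse(pixel_embeddings, phrase)
        hidden = recurrent_step(fused, hidden, w_in, w_rec)
        # Threshold the state to obtain a binary mask for this expression.
        masks.append([[1 if h > 0.0 else 0 for h in row] for row in hidden])
    return masks


# Hypothetical sizes: a 4x5 feature map, 3 visual channels, 2 phrase channels.
rng = random.Random(0)
H, W, C, D = 4, 5, 3, 2
pixels = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
phrases = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3)]
w_in = [rng.uniform(-1, 1) for _ in range(C + D)]  # untrained placeholder weights

masks = segment_sequence(pixels, phrases, w_in, w_rec=0.5)
```

Because the hidden state persists across the loop, the mask predicted for each expression depends on the states produced for the earlier ones, which is the conditioning the abstract attributes to the recurrent layers.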

updated: Tue Nov 05 2019 21:49:55 GMT+0000 (UTC)

published: Tue Nov 05 2019 21:49:55 GMT+0000 (UTC)

arXiv
