SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Qin Liu; Zhenlin Xu; Gedas Bertasius; Marc Niethammer

Click-based interactive image segmentation aims to extract objects with a limited number of user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation. To fill this gap, we propose SimpleClick, the first interactive segmentation method that leverages a plain backbone. Based on the plain backbone, we introduce a symmetric patch embedding layer that encodes clicks into the backbone with minor modifications to the backbone itself. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, a 21.8% improvement over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool.
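
The abstract reports results as NoC@90, that is, the number of clicks needed to reach 90% intersection-over-union (IoU), where lower is better. The following is a minimal sketch of how such a metric can be computed; it is not taken from the paper's code. The function names `noc_at_threshold` and `mean_noc` are hypothetical, and the cap of 20 clicks per image is an assumption based on common practice in interactive segmentation benchmarks rather than something the abstract states.

```python
from typing import Sequence


def noc_at_threshold(
    ious_per_click: Sequence[float],
    threshold: float = 0.90,
    max_clicks: int = 20,  # assumed click budget; not stated in the abstract
) -> int:
    """Return the number of clicks needed to first reach `threshold` IoU.

    `ious_per_click[i]` is the IoU of the predicted mask after click i + 1.
    If the threshold is never reached within `max_clicks`, the sample is
    counted as having used the full budget.
    """
    for n_clicks, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return n_clicks
    return max_clicks


def mean_noc(
    all_ious: Sequence[Sequence[float]],
    threshold: float = 0.90,
    max_clicks: int = 20,
) -> float:
    """Average the per-image click counts over a dataset."""
    if not all_ious:
        raise ValueError("need at least one image")
    total = sum(noc_at_threshold(ious, threshold, max_clicks) for ious in all_ious)
    return total / len(all_ious)


if __name__ == "__main__":
    # Three made-up images: solved in 1 click, in 2 clicks, and never solved.
    toy = [[0.95], [0.50, 0.91], [0.10] * 20]
    print(mean_noc(toy))
```

Under this definition, a dataset-level score such as the 4.15 NoC@90 quoted above would mean that, averaged over images, slightly more than four clicks are needed to reach 90% IoU.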

updated: Sat Mar 11 2023 19:36:34 GMT+0000 (UTC)

published: Thu Oct 20 2022 04:20:48 GMT+0000 (UTC)

arXiv
