Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU

Shipra Jain; Danda Pani Paudel; Martin Danelljan; Luc Van Gool

State-of-the-art object detection and image classification methods perform impressively on more than 9k and 10k classes, respectively. In contrast, the number of classes in semantic segmentation datasets is relatively limited. This is not surprising given the restrictions imposed by the lack of labeled data and the high computational demand of segmentation. In this paper, we propose a novel training methodology to train and scale existing semantic segmentation models to a large number of semantic classes without increasing the memory overhead. In our embedding-based scalable segmentation approach, we reduce the space complexity of the segmentation model's output from O(C) to O(1), where C is the number of classes, propose an approximation method for the ground-truth class probability, and use it to compute the cross-entropy loss. The proposed approach is general and can be adopted by any state-of-the-art segmentation model to scale it gracefully to any number of semantic classes with only one GPU. Our approach achieves similar, and in some cases even better, mIoU on the Cityscapes, Pascal VOC, ADE20k, and COCO-Stuff10k datasets when applied to the DeeplabV3+ model with different backbones. We demonstrate a clear benefit of our approach on a dataset with 1284 classes, bootstrapped from LVIS and COCO annotations, where it achieves three times better mIoU than the DeeplabV3+ model.
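
The abstract does not spell out how the ground-truth class probability is approximated, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the network emits a fixed-size embedding per pixel (so the output size does not grow with C) and that the probability of the true class is estimated with a softmax over the true class plus a few randomly sampled negative classes. The function name `approx_cross_entropy`, the parameter `num_neg`, and the sampling scheme are all hypothetical.

```python
import numpy as np

def approx_cross_entropy(pixel_emb, class_emb, labels, num_neg=16, rng=None):
    """Sampled approximation of the cross-entropy loss (illustrative assumption).

    pixel_emb: (N, d) per-pixel embeddings; d is fixed, independent of C.
    class_emb: (C, d) one embedding per semantic class.
    labels:    (N,) integer ground-truth class per pixel.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    num_classes = class_emb.shape[0]
    n = pixel_emb.shape[0]

    # Candidate classes per pixel: the true class first, then sampled negatives.
    neg = rng.integers(0, num_classes, size=(n, num_neg))
    cand = np.concatenate([labels[:, None], neg], axis=1)  # (N, 1 + num_neg)

    # Similarity between each pixel embedding and its candidate class embeddings.
    logits = np.einsum("nd,nkd->nk", pixel_emb, class_emb[cand])

    # A sampled negative that happens to equal the true class must not count.
    collide = cand[:, 1:] == labels[:, None]
    logits[:, 1:] = np.where(collide, -np.inf, logits[:, 1:])

    # Numerically stable log-sum-exp over the candidates only.
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))

    # Negative log of the approximated ground-truth class probability.
    return float(np.mean(log_z - logits[:, 0]))
```

In this sketch the per-pixel memory is d + num_neg + 1 values rather than C logits, which is one way the cost can stay constant as the number of classes grows.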

updated: Mon Dec 14 2020 13:12:38 GMT+0000 (UTC)

published: Mon Dec 14 2020 13:12:38 GMT+0000 (UTC)

arXiv
