A Novel Upsampling and Context Convolution for Image Semantic Segmentation

Khwaja Monib Sediqi; Hyo Jong Lee

Semantic segmentation, the pixel-wise classification of an image, is a fundamental topic in computer vision owing to its growing importance in robot vision and autonomous driving. It provides rich information about the objects in a scene, such as their boundaries, categories, and locations. Recent methods for semantic segmentation often employ an encoder-decoder structure built on deep convolutional neural networks. The encoder extracts image features using several filters and pooling operations, whereas the decoder gradually recovers the encoder's low-resolution feature maps into a feature map at full input resolution for pixel-wise prediction. However, encoder-decoder variants for semantic segmentation suffer from severe loss of spatial information, caused by pooling operations or strided convolutions, and do not consider the context in the scene. In this paper, we propose a dense upsampling convolution method based on guided filtering to effectively preserve the spatial information of the image in the network. We further propose a novel local context convolution method that not only covers larger-scale objects in the scene but also covers them densely for precise object boundary delineation. Theoretical analyses and experimental results on several benchmark datasets verify the effectiveness of our method. Qualitatively, our approach delineates object boundaries more accurately than current leading methods. Quantitatively, we report new record pixel accuracies of 82.86% and 81.62% on the ADE20K and Pascal-Context benchmark datasets, respectively. Compared with state-of-the-art methods, the proposed method offers promising improvements.
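The pixel accuracy figures quoted above are, in the usual definition, the fraction of labeled pixels whose predicted class matches the ground truth. The following is a minimal sketch of that standard metric in plain Python, not the authors' evaluation code. The function name `pixel_accuracy`, the `ignore_label` parameter, and the toy label maps are all assumptions made for illustration; the abstract does not state how unlabeled pixels are handled.

```python
from typing import Optional, Sequence


def pixel_accuracy(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    ignore_label: Optional[int] = None,
) -> float:
    """Return the fraction of counted pixels where pred equals target.

    pred and target are 2-D label maps (rows of integer class ids).
    Pixels whose ground-truth label equals ignore_label are skipped;
    this handling is an assumption, not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same height")

    correct = 0
    counted = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("pred and target must have the same width")
        for p, t in zip(pred_row, target_row):
            if ignore_label is not None and t == ignore_label:
                continue  # unlabeled pixel: excluded from the metric
            counted += 1
            if p == t:
                correct += 1

    if counted == 0:
        raise ValueError("no labeled pixels to evaluate")
    return correct / counted


# Hypothetical 2x3 label maps; 255 marks an unlabeled pixel.
pred = [[0, 0, 1],
        [2, 1, 1]]
target = [[0, 1, 1],
          [2, 1, 255]]

acc = pixel_accuracy(pred, target, ignore_label=255)
print(f"pixel accuracy: {acc:.2%}")
```

In this toy case four of the five labeled pixels agree, so the metric is 0.8. Pixel accuracy weights every pixel equally, so large classes dominate it; this is why it is commonly reported alongside class-averaged measures.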

updated: Sat Mar 20 2021 06:16:42 GMT+0000 (UTC)

published: Sat Mar 20 2021 06:16:42 GMT+0000 (UTC)

arXiv
