Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation

Bowen Zhang; Yifan Liu; Zhi Tian; Chunhua Shen

Semantic segmentation requires a per-pixel prediction for a given image. Typically, the output resolution of a segmentation network is severely reduced by the downsampling operations in the CNN backbone. Most previous methods employ upsampling decoders to recover the spatial resolution, and various decoders have been designed in the literature. Here, we propose a novel decoder, termed the dynamic neural representational decoder (NRD), which is simple yet significantly more efficient. Because each location on the encoder's output corresponds to a local patch of the semantic labels, we represent these local patches of labels with compact neural networks. This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, which makes it more efficient. Furthermore, these neural representations are dynamically generated and conditioned on the outputs of the encoder network. The desired semantic labels can be efficiently decoded from the neural representations, resulting in high-resolution semantic segmentation predictions. We empirically show that our proposed decoder can outperform the decoder in DeeplabV3+ with only 30% of its computational complexity, and achieve performance competitive with methods using dilated encoders with only 15% of their computation. Experiments on the Cityscapes, ADE20K, and PASCAL Context datasets demonstrate the effectiveness and efficiency of our proposed method.
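To make the idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a linear "controller" maps each encoder feature vector to the weights of a tiny two-layer MLP, and that this MLP is evaluated on relative pixel coordinates inside the corresponding label patch. The abstract specifies neither the inputs nor the layout of the compact networks, so the coordinate input, the names `patch_coords`, `dynamic_patch_decoder` and `controller_w`, and all sizes are illustrative assumptions.

```python
import numpy as np

def patch_coords(s):
    """Relative (y, x) coordinates in [-1, 1] for an s x s patch, shape (s*s, 2)."""
    axis = np.linspace(-1.0, 1.0, s)
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

def dynamic_patch_decoder(features, controller_w, stride, hidden, num_classes):
    """Decode (H, W, C) encoder features into (H*stride, W*stride, K) logits.

    Hypothetical design: every encoder location gets its own tiny MLP
    (2 -> hidden -> K) whose parameters are produced by a linear controller.
    """
    H, W, C = features.shape
    n_w1, n_b1 = 2 * hidden, hidden
    n_w2, n_b2 = hidden * num_classes, num_classes
    n_params = n_w1 + n_b1 + n_w2 + n_b2
    assert controller_w.shape == (C, n_params)

    # Controller: one parameter vector per encoder location (dynamic weights).
    theta = features.reshape(H * W, C) @ controller_w            # (H*W, n_params)
    w1, b1, w2, b2 = np.split(
        theta, [n_w1, n_w1 + n_b1, n_w1 + n_b1 + n_w2], axis=1)
    w1 = w1.reshape(-1, 2, hidden)
    b1 = b1.reshape(-1, 1, hidden)
    w2 = w2.reshape(-1, hidden, num_classes)
    b2 = b2.reshape(-1, 1, num_classes)

    # Evaluate each location's MLP on the same grid of in-patch coordinates.
    coords = patch_coords(stride)                                 # (s*s, 2)
    h = np.maximum(coords[None] @ w1 + b1, 0.0)                   # ReLU, (H*W, s*s, hidden)
    logits = h @ w2 + b2                                          # (H*W, s*s, K)

    # Tile the per-location patches back into one high-resolution map.
    logits = logits.reshape(H, W, stride, stride, num_classes)
    return logits.transpose(0, 2, 1, 3, 4).reshape(
        H * stride, W * stride, num_classes)

# Usage with random stand-in features: a 4 x 4 encoder output at stride 8.
rng = np.random.default_rng(0)
hidden, K, s = 4, 3, 8
feats = rng.standard_normal((4, 4, 8))
ctrl = 0.1 * rng.standard_normal((8, 2 * hidden + hidden + hidden * K + K))
labels = dynamic_patch_decoder(feats, ctrl, s, hidden, K).argmax(axis=-1)
print(labels.shape)  # (32, 32)
```

In this sketch, each patch is produced by a small continuous function of position, so neighbouring pixels within a patch tend to receive similar logits. This is one way to read the "smoothness prior in the semantic label space" mentioned above. The per-location cost is a handful of small matrix products rather than a stack of upsampling convolutions.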

updated: Fri Jul 30 2021 04:50:56 GMT+0000 (UTC)

published: Fri Jul 30 2021 04:50:56 GMT+0000 (UTC)

arXiv
