DWRSeg: Dilation-wise Residual Network for Real-time Semantic Segmentation

Haoran Wei; Xu Liu; Shouchun Xu; Zhongjian Dai; Yaping Dai; Xiangyang Xu

Real-time semantic segmentation plays an important role in intelligent vehicle scenarios. Recently, numerous networks have incorporated information from multi-size receptive fields to facilitate feature extraction in real-time semantic segmentation tasks. However, these methods preferentially adopt massive receptive fields to elicit more contextual information, which may result in inefficient feature extraction. We believe that carefully designed receptive fields are crucial, given the demand for efficient feature extraction in real-time tasks. Therefore, we propose an effective and efficient architecture termed Dilation-wise Residual segmentation (DWRSeg), which uses a different set of receptive field sizes in each stage. The architecture involves (i) a Dilation-wise Residual (DWR) module for extracting features based on different scales of receptive fields at the high level of the network; (ii) a Simple Inverted Residual (SIR) module that uses an inverted bottleneck structure to extract features from the low stage; and (iii) a simple fully convolutional network (FCN)-like decoder for aggregating multiscale feature maps to generate the prediction. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method, which achieves a state-of-the-art trade-off between accuracy and inference speed while also being more lightweight. Without using pretraining or resorting to any training tricks, we achieve 72.7% mIoU on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods. The code and trained models are publicly available.
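
To make the notion of receptive field size concrete, the following minimal sketch computes the receptive field of a stack of dilated convolutions using the standard relation that a kernel of size k with dilation d spans d(k - 1) + 1 input positions. It is not the authors' implementation: the helper names, the two-layer branch layout, and the dilation rates (1, 3, 5) are assumptions chosen for illustration, since the abstract does not state the rates used in the DWR module.

```python
from typing import Iterable, Tuple


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input positions, of a dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers: Iterable[Tuple[int, int, int]]) -> int:
    """Receptive field of a sequential stack of convolutions.

    Each layer is given as (kernel_size, stride, dilation).
    """
    field = 1  # a single output position sees one input position to start
    jump = 1   # distance in input positions between adjacent features
    for kernel_size, stride, dilation in layers:
        field += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return field


# Hypothetical dilation rates, NOT taken from the paper.
ILLUSTRATIVE_DILATIONS = (1, 3, 5)

# Assumed branch layout: a plain 3x3 convolution followed by one dilated
# 3x3 convolution, both with stride 1.
branch_fields = {
    d: receptive_field([(3, 1, 1), (3, 1, d)]) for d in ILLUSTRATIVE_DILATIONS
}

if __name__ == "__main__":
    for d, field in branch_fields.items():
        print(f"dilation={d}: receptive field = {field}")
```

Under these assumptions, larger dilation rates widen the receptive field without adding parameters, which is the kind of trade-off the abstract refers to when it contrasts massive receptive fields with ones matched to each stage.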

updated: Fri Dec 02 2022 13:55:41 GMT+0000 (UTC)

published: Fri Dec 02 2022 13:55:41 GMT+0000 (UTC)

arXiv
