Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN

Guoan Xu; Juncheng Li; Guangwei Gao; Huimin Lu; Jian Yang; Dong Yue

Over the past decade, convolutional neural networks (CNNs) have become prominent in semantic segmentation. Although CNN models perform impressively, their ability to capture global representations is still insufficient, which leads to suboptimal results. Transformers have recently achieved huge success in NLP tasks, demonstrating their advantage in modeling long-range dependencies. They have also attracted tremendous attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but this comes at the cost of degraded local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet effectively combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies. In addition, the carefully designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both benefit training from scratch. Extensive experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs, yet yields 72.8% mIoU at 120 FPS on the Cityscapes test set and 70.5% mIoU at 250 FPS on the CamVid test set using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
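The accuracy figures above are reported as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `mean_iou`, its parameters, and the choice to average only over classes that actually occur are assumptions made for illustration, and benchmark servers typically accumulate counts over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over flattened label sequences (illustrative sketch).

    pred, target: equal-length sequences of integer class labels.
    Classes that appear in neither sequence are left out of the mean.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels marked as "ignore" in the ground truth
        if p == t:
            # A correct pixel counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy usage with three classes and six pixels (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. A reported 72.8% mIoU corresponds to the same quantity computed over the benchmark's classes and test images.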

updated: Tue Feb 21 2023 07:16:53 GMT+0000 (UTC)

published: Tue Feb 21 2023 07:16:53 GMT+0000 (UTC)

arXiv
