A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks

Yunhe Gao; Mu Zhou; Di Liu; Dimitris Metaxas

Transformers have proven successful in a number of natural language processing and vision tasks, but their potential applications to medical imaging remain largely unexplored due to the unique difficulties of this field. In this study, we present UTNetV2, a simple yet powerful backbone model that combines the strengths of convolutional neural networks and Transformers to improve performance and efficiency in medical image segmentation. The design of UTNetV2 includes three key innovations: (1) We use a hybrid hierarchical architecture that introduces depthwise separable convolution into the projection and feed-forward network of the Transformer block, which brings local relationship modeling and desirable properties of CNNs (translation invariance) to the Transformer, thus eliminating the requirement of large-scale pre-training. (2) We propose efficient bidirectional attention (B-MHA), which reduces the quadratic computational complexity of self-attention to linear by introducing an adaptively updated semantic map. The efficient attention makes it possible to capture long-range relationships and correct fine-grained errors in high-resolution token maps. (3) The semantic maps in B-MHA allow us to perform semantically and spatially global multi-scale feature fusion without introducing much computational overhead. Furthermore, we provide a fair comparison codebase of CNN-based and Transformer-based models on various medical image segmentation tasks to evaluate the merits and defects of both architectures. UTNetV2 demonstrated state-of-the-art performance across various settings, including large-scale and small-scale datasets and both 2D and 3D settings.
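The complexity claim in point (2) can be illustrated with a small sketch. The abstract does not give the B-MHA equations, so the code below is only one plausible reading, not the authors' implementation: N tokens and a semantic map of M entries share a single N x M score matrix, and a softmax is taken along each axis so that information flows in both directions. The function names, the omission of learned projections and multiple heads, and the sizes N = 4096 and M = 64 are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def bidirectional_attention(tokens, semantic_map):
    """Hypothetical single-head sketch with no learned projections.

    tokens:       N x d token features
    semantic_map: M x d semantic entries, with M fixed and much smaller than N
    Returns (updated tokens, updated semantic map).
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    # One shared N x M score matrix replaces the N x N matrix of self-attention.
    scores = [[scale * s for s in row]
              for row in matmul(tokens, transpose(semantic_map))]
    # Direction 1: each token reads from the semantic map (softmax over M).
    token_weights = [softmax(row) for row in scores]
    new_tokens = matmul(token_weights, semantic_map)
    # Direction 2: each map entry reads from the tokens (softmax over N).
    map_weights = [softmax(row) for row in transpose(scores)]
    new_map = matmul(map_weights, tokens)
    return new_tokens, new_map


def score_count(n_tokens, n_map=None):
    """Number of attention scores: N*N for self-attention, N*M with a map."""
    if n_map is None:
        return n_tokens * n_tokens
    return n_tokens * n_map
```

Under these assumptions, a 64 x 64 token map (N = 4096) needs `score_count(4096) == 16777216` scores for full self-attention but only `score_count(4096, 64) == 262144` with a 64-entry map. Doubling N doubles the second figure and quadruples the first, which is the quadratic-to-linear reduction the abstract describes.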

updated: Thu Mar 03 2022 03:22:10 GMT+0000 (UTC)

published: Mon Feb 28 2022 22:59:42 GMT+0000 (UTC)

arXiv
