ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation

Zhendi Gong; Andrew P. French; Guoping Qiu; Xin Chen

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in medical image segmentation owing to their ability to extract highly complex feature representations. However, recent studies argue that traditional CNNs lack the capacity to capture long-range dependencies between different image regions. Following the success of Transformer models in natural language processing tasks, the medical image segmentation field has also seen growing interest in Transformers, because of their ability to capture long-range contextual information. Unlike CNNs, however, Transformers lack the ability to learn local feature representations. To fully exploit the advantages of both CNNs and Transformers, we propose a hybrid encoder-decoder segmentation model (ConvTransSeg). It consists of a multi-layer CNN as the encoder for feature learning and a corresponding multi-level Transformer as the decoder for segmentation prediction. The encoder and decoder are interconnected in a multi-resolution manner. We compared our method with many other state-of-the-art hybrid CNN and Transformer segmentation models on binary and multi-class image segmentation tasks using several public medical image datasets covering skin lesions, polyps, cells and brain tissue. The experimental results show that our method achieves the best overall performance in terms of Dice coefficient and average symmetric surface distance, with low model complexity and memory consumption. In contrast to most of the Transformer-based methods we compared, our method does not require pre-trained models to achieve similar or better performance. The code is freely available for research purposes on GitHub (the link will be added upon acceptance).
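
For readers unfamiliar with the Dice coefficient named above, the following is a minimal sketch of the standard definition for binary masks, Dice = 2|P ∩ T| / (|P| + |T|). It is not taken from the authors' code; the function name `dice_coefficient`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     target: Sequence[Sequence[int]]) -> float:
    """Dice overlap between two binary masks given as nested 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0  # pixels that are foreground in both masks
    pred_sum = 0      # foreground pixels in the prediction
    target_sum = 0    # foreground pixels in the ground truth

    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("masks must have the same number of columns")
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0

    total = pred_sum + target_sum
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks (hypothetical data, not from any of the paper's datasets).
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, true_mask)
```

In this toy case the masks share two foreground pixels and each has three, so the score is 2·2 / (3 + 3) = 2/3. For multi-class segmentation, a common approach is to compute this per class and average, though the abstract does not state which variant the authors use.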

updated: Thu Oct 13 2022 14:59:23 GMT+0000 (UTC)

published: Thu Oct 13 2022 14:59:23 GMT+0000 (UTC)

arXiv
