AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation

Xiangyi Yan; Hao Tang; Shanlin Sun; Haoyu Ma; Deying Kong; Xiaohui Xie

Recent advances in transformer-based models have drawn attention to exploring these techniques in medical image segmentation, especially in conjunction with the U-Net model (or its variants), which has shown great success in this task under both 2D and 3D settings. Current 2D-based methods either directly replace convolutional layers with pure transformers or treat a transformer as an additional intermediate encoder between the encoder and decoder of U-Net. However, these approaches only consider the attention encoding within a single slice and do not utilize the axial-axis information naturally provided by a 3D volume. In the 3D setting, convolution on volumetric data and transformers both consume large amounts of GPU memory. One has to either downsample the image or use cropped local patches to reduce GPU memory usage, which limits performance. In this paper, we propose Axial Fusion Transformer UNet (AFTer-UNet), which takes advantage of both convolutional layers' capability of extracting detailed features and transformers' strength in long-sequence modeling. It considers both intra-slice and inter-slice long-range cues to guide the segmentation. Meanwhile, it has fewer parameters and takes less GPU memory to train than previous transformer-based models. Extensive experiments on three multi-organ segmentation datasets demonstrate that our method outperforms current state-of-the-art methods.
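The abstract does not spell out how intra-slice and inter-slice cues are combined, so the following is only a minimal sketch of one generic way to do it, not the authors' implementation. It assumes a factorized scheme: plain single-head self-attention (no learned projections) is applied first among the tokens of each slice, then across slices at each token position. The function names, the tensor layout `(slices, tokens, channels)`, and the entry-count helper are all hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head self-attention without learned projections (assumption).

    tokens: array of shape (..., n, d); attention runs over the n axis.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def intra_then_inter_slice_attention(feats):
    """Hypothetical factorized attention over a stack of slice features.

    feats: array of shape (S, N, d) = (slices, tokens per slice, channels).
    """
    # Intra-slice: the N tokens of each slice attend to one another.
    intra = self_attention(feats)                      # (S, N, d)
    # Inter-slice: at each token position, the S slices attend to one another.
    inter = self_attention(np.swapaxes(intra, 0, 1))   # (N, S, d)
    return np.swapaxes(inter, 0, 1)                    # (S, N, d)


def attention_matrix_entries(num_slices, tokens_per_slice):
    """Count attention-score entries for full 3D vs. factorized attention."""
    full = (num_slices * tokens_per_slice) ** 2
    factorized = (num_slices * tokens_per_slice ** 2
                  + tokens_per_slice * num_slices ** 2)
    return full, factorized


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 16, 8))  # 4 slices, 16 tokens each, 8 channels
    print(intra_then_inter_slice_attention(feats).shape)
    print(attention_matrix_entries(4, 16))
```

Under these assumptions, the entry count illustrates why attending over every token of a volume at once is memory-hungry: the number of attention scores grows with the square of the total token count, while the factorized variant grows much more slowly.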

updated: Wed Oct 20 2021 06:47:28 GMT+0000 (UTC)

published: Wed Oct 20 2021 06:47:28 GMT+0000 (UTC)

arXiv
