Automatic Segmentation of Colorectal Cancer in Abdominal CT Images Using a Deep Learning Network Fusing 3D U-Net and Transformer: A Multicenter, Multidevice Study
Huang Mu, Zhang Daochun, Yang Wei, Zhong Liming, Yuan Wenjing, Jia Ziqi, Tan Xiangliang, Duan Xiaohui, Liu Xian, Chen Weicui
Submitted 2025-07-17 | ChinaXiv: chinaxiv-202507.00341

Abstract

Background: The application of deep learning in medical imaging faces challenges such as time-consuming and labor-intensive data annotation, which restricts the efficiency of its clinical translation. Objective: To investigate the feasibility and efficacy of a deep learning network that integrates three-dimensional U-Net (3D U-Net) and Transformer (TransUNet-Cascade) for automatic segmentation of colorectal cancer (CRC) lesions in abdominal CT images. Methods: A total of 2,180 contrast-enhanced abdominal CT examinations of CRC patients were retrospectively enrolled from Guangdong Provincial Traditional Chinese Medicine Hospital (Center 1), Nanfang Hospital of Southern Medical University (Center 2), and Sun Yat-sen Memorial Hospital of Sun Yat-sen University (Center 3) between January 2018 and May 2023, and divided into a training set (n=1,159), a validation set (n=289), and an external test set (n=732) using weighted random sampling. This study proposes a novel deep learning network, TransUNet-Cascade, which optimizes segmentation accuracy through a multi-stage learning strategy. With manual annotation as the reference standard, model performance was evaluated using the Dice similarity coefficient (DSC), F1 score, 95% Hausdorff distance (HD95), intersection over union (IoU), precision (PRE), and recall (REC). Three-dimensional no new U-Net (3D nnU-Net) was selected as the baseline model, and both networks were systematically trained and compared under a unified dataset and evaluation criteria to comprehensively validate the effectiveness of TransUNet-Cascade in CRC segmentation. Results: In the independent external test set, both deep learning networks segmented arterial phase images better overall than venous phase images. TransUNet-Cascade achieved average arterial phase DSC, F1, HD95, IoU, PRE, and REC values of 0.740, 0.839, 34.084, 0.656, 0.737, and 0.767, respectively, outperforming 3D nnU-Net overall (0.724, 0.838, 35.954, 0.642, 0.730, and 0.744). The model achieved the best segmentation performance for right-sided colon cancer (DSC=0.784), while segmentation of rectal cancer was relatively poorer (DSC=0.622). Conclusion: By combining the advantages of convolutional neural networks and Transformer, TransUNet-Cascade improves the automatic segmentation accuracy of CRC lesions and demonstrates potential for clinical application.

Full Text

Automatic Segmentation of Colorectal Cancer in Abdominal CT Images Using a Deep Learning Network Based on Fused 3D U-Net and Transformer: A Multicenter, Multidevice Study

HUANG Mu¹, ZHANG Daochun², YANG Wei¹, ZHONG Liming¹, YUAN Wenjing³, JIA Ziqi³, TAN Xiangliang⁴, DUAN Xiaohui⁵, LIU Xian³, CHEN Weicui³*

¹School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
²Taizhou Hospital of Zhejiang Province Affiliated to Wenzhou Medical University, Taizhou 317000, China
³Department of Radiology, Guangdong Provincial Traditional Chinese Medicine Hospital, the Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou 510120, China
⁴Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou 510515, China
⁵Department of Radiology, Sun Yat-Sen Memorial Hospital, Sun Yat-sen University, Guangzhou 510235, China

*Corresponding author: CHEN Weicui, Associate chief physician; E-mail: weicuichen@126.com

Abstract

Background: The application of deep learning in medical imaging faces challenges such as time-consuming and labor-intensive data annotation, which hinders clinical translation efficiency.

Objective: To investigate the feasibility and efficacy of a deep learning network (TransUNet-Cascade) that integrates 3D U-Net and Transformer for automatic segmentation of colorectal cancer (CRC) lesions in abdominal CT images.

Methods: A retrospective analysis was conducted on contrast-enhanced abdominal CT images from 2,180 CRC patients at Guangdong Provincial Traditional Chinese Medicine Hospital (Center 1), Nanfang Hospital of Southern Medical University (Center 2), and Sun Yat-sen Memorial Hospital of Sun Yat-sen University (Center 3) between January 2018 and May 2023. The dataset was divided into a training set (n=1,159), validation set (n=289), and external test set (n=732) using weighted random sampling. This study proposes a novel deep learning network model—TransUNet-Cascade—that optimizes segmentation accuracy through a multi-stage learning strategy. Using manual annotations as the gold standard, model performance was evaluated using Dice similarity coefficient (DSC), F1 score, 95% Hausdorff distance (HD95), intersection over union (IoU), precision (PRE), and recall (REC). This study selected 3D no new U-Net (3D nnU-Net) as the baseline comparison model. Under unified dataset and evaluation criteria, systematic training and performance comparison were conducted with the proposed TransUNet-Cascade network to comprehensively validate its effectiveness in CRC segmentation tasks.

Results: In the independent external test set, both deep learning networks demonstrated superior segmentation performance on arterial phase images compared with venous phase images. TransUNet-Cascade achieved average DSC, F1, HD95, IoU, PRE, and REC values of 0.740, 0.839, 34.084, 0.656, 0.737, and 0.767, respectively, on arterial phase images, outperforming 3D nnU-Net (average DSC, F1, HD95, IoU, PRE, and REC values of 0.724, 0.838, 35.954, 0.642, 0.730, and 0.744). The model achieved the best segmentation performance for right-sided colon cancer (DSC=0.784), while rectal cancer segmentation was relatively less effective (DSC=0.622).

Conclusion: By combining the strengths of convolutional neural networks and Transformers, TransUNet-Cascade improves the accuracy of automatic CRC lesion segmentation and demonstrates potential for clinical application.

Keywords: Colorectal neoplasms; Tomography, X-ray computed; Deep learning; Artificial intelligence; Self-attention mechanism; Convolutional neural network

1. Materials and Methods

1.1 Study Subjects

This retrospective study was approved by the Ethics Committee of Guangdong Provincial Traditional Chinese Medicine Hospital (BE2023-142), and informed consent was waived due to the retrospective nature of the study. The study included CRC patients from Guangdong Provincial Traditional Chinese Medicine Hospital (Center 1), Nanfang Hospital of Southern Medical University (Center 2), and Sun Yat-sen Memorial Hospital of Sun Yat-sen University (Center 3) between January 2018 and May 2023. Inclusion criteria were: (1) pathologically confirmed primary CRC after surgery; (2) complete preoperative CT imaging data covering both arterial and venous phases. Exclusion criteria were: (1) poor image quality or severe artifacts; (2) lesions not visually identifiable on CT images. Based on these criteria, 2,180 patients were ultimately enrolled, including 777, 732, and 671 cases from Centers 1, 2, and 3, respectively. The cohort comprised 1,285 males and 895 females, aged 34-87 years, with a mean age of (59±11) years. Data from Centers 1 and 3 were combined and divided into training (n=1,159) and validation (n=289) sets at a 4:1 ratio, while data from Center 2 served as an independent external test set (n=732). Stratified weighted random sampling was used for the training/validation split, with tumor location (left/right colon, rectum, etc.) and scanner model as stratification variables to ensure balanced proportions across subgroups. The external test set retained the original data distribution from Center 2 to evaluate model performance in real clinical scenarios. See Figure 1 [FIGURE:1].
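The paper does not detail the exact weighted random sampling procedure; the following minimal sketch shows one way to obtain a stratified 4:1 train/validation split with scikit-learn, assuming a pandas DataFrame whose hypothetical `tumor_location` and `scanner_model` columns encode the two stratification variables.

```python
# Approximate stratified 4:1 train/validation split; column names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_val(df: pd.DataFrame, seed: int = 42):
    """Split Center 1 + Center 3 cases 4:1, stratified by tumor location and scanner model."""
    # Combine the two stratification variables into a single label per patient.
    strata = df["tumor_location"].astype(str) + "_" + df["scanner_model"].astype(str)
    train_df, val_df = train_test_split(
        df, test_size=0.2, stratify=strata, random_state=seed
    )
    return train_df, val_df
```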

1.2 CT Examination Methods

Abdominal CT scan data were acquired from 4 brands and 7 scanner models. Detailed equipment information and scanning parameters are provided in Table 1 [TABLE:1]. For contrast agents, Centers 1 and 2 used iopromide (Bayer, Germany) at an iodine concentration of 370 mg/mL, while Center 3 used ioversol or iomeprol (Bracco, Italy) at 350 mg/mL. All centers administered contrast via the antecubital vein using a power injector at 2.5-3.0 mL/s and a dose of 1.2-1.5 mL/kg. Arterial phase images were acquired 25-30 s after injection, and venous phase images at 55-70 s.

1.3 CRC Location

Referring to the "National Health Commission of China CRC Diagnosis and Treatment Guidelines (2023 Edition)" [3], CRC was classified by tumor location into right-sided colon cancer (including cecum, ascending colon, and hepatic flexure), transverse colon cancer, left-sided colon cancer (including splenic flexure, descending colon, and sigmoid colon), and rectal cancer.

1.4 Manual Tumor Annotation

Arterial and venous phase images were saved in DICOM format and converted to NIfTI format using Python. Manual tumor annotation was performed by one annotator (junior title) and one reviewer (senior title). The reviewer trained the annotator to ensure quality, and reviewed, modified (if necessary), and finalized all annotations. The 3D-Slicer software (Version 4.10.2, http://www.slicer.org) was used for annotation, following the expert consensus on CRC CT and MRI annotation [11]. The final annotated dataset was used to train network models and evaluate segmentation performance.
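The paper states only that the DICOM series were converted to NIfTI with Python; the snippet below is an illustrative conversion using SimpleITK, assuming each case's series sits in its own folder, and is not necessarily the authors' exact procedure.

```python
# Illustrative DICOM-series to NIfTI conversion (the paper names no specific library).
import SimpleITK as sitk

def dicom_to_nifti(dicom_dir: str, out_path: str) -> None:
    """Read a single DICOM series and write it as a compressed NIfTI volume."""
    reader = sitk.ImageSeriesReader()
    series_files = reader.GetGDCMSeriesFileNames(dicom_dir)
    reader.SetFileNames(series_files)
    image = reader.Execute()
    sitk.WriteImage(image, out_path)  # e.g. "case_0001_arterial.nii.gz"
```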

1.5 Automatic Segmentation Using Fused 3D U-Net and Transformer DL Network

This study proposes a two-stage CRC segmentation algorithm; the specific workflow is shown in Figure 2 [FIGURE:2].

1.5.1 Image Preprocessing: This study employed the following standardized preprocessing pipeline for raw CT images. First, to eliminate spatial resolution differences caused by different CT equipment and scanning parameters, all raw images were resampled to a uniform slice thickness of 3 mm. Subsequently, a two-stage normalization method was applied for intensity standardization: Stage 1 involved truncated normalization, in which voxel intensity values were clipped to the 0.5%-99.5% percentile range of each image's intensity distribution to eliminate the potential influence of extreme outliers on model training. Stage 2 implemented Z-score normalization by subtracting the image mean and dividing by its standard deviation, transforming the intensity distribution to a standard normal distribution with mean 0 and standard deviation 1, thereby achieving consistent intensity distributions across CT images from different sources.
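A minimal sketch of this preprocessing pipeline, assuming SimpleITK and NumPy; the 3 mm resampling, 0.5%-99.5% percentile clipping, and Z-score steps follow the description above, while the linear interpolation choice is an assumption.

```python
# Preprocessing sketch: resample z-spacing to 3 mm, clip to 0.5-99.5 percentiles,
# then z-score normalize to zero mean and unit standard deviation.
import numpy as np
import SimpleITK as sitk

def preprocess(image: sitk.Image) -> np.ndarray:
    # 1) Resample so the slice thickness (z-spacing) is 3 mm; in-plane spacing unchanged.
    spacing = list(image.GetSpacing())
    new_spacing = [spacing[0], spacing[1], 3.0]
    size = image.GetSize()
    new_size = [int(round(size[i] * spacing[i] / new_spacing[i])) for i in range(3)]
    image = sitk.Resample(
        image, new_size, sitk.Transform(), sitk.sitkLinear,
        image.GetOrigin(), new_spacing, image.GetDirection(), 0.0, image.GetPixelID(),
    )

    arr = sitk.GetArrayFromImage(image).astype(np.float32)

    # 2) Truncated normalization: clip to the 0.5th and 99.5th intensity percentiles.
    lo, hi = np.percentile(arr, [0.5, 99.5])
    arr = np.clip(arr, lo, hi)

    # 3) Z-score normalization.
    arr = (arr - arr.mean()) / (arr.std() + 1e-8)
    return arr
```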

1.5.2 Fused 3D U-Net and Transformer DL Network Framework: This study designed a novel DL network framework—TransUNet-Cascade—based on 3D U-Net with integrated Swin Transformer modules to enhance segmentation accuracy through a multi-stage learning strategy (Figure 2). The Swin Transformer module consists of two basic blocks, each including normalization steps. Input features are first normalized through layer normalization, followed by 3D window-based multi-head self-attention for local feature extraction, which models features through interactions between Query, Key, and Value. Features are then fed into a multi-layer perceptron module for enhanced feature representation through nonlinear mapping. In the second block, the traditional window mechanism is replaced with 3D shifted window multi-head self-attention, which integrates global contextual information through window shifting strategies, significantly improving global feature modeling capability. Each step incorporates layer normalization and residual connections. TransUNet-Cascade adopts a cascaded network architecture and progressively optimizes segmentation accuracy through a multi-stage learning strategy. In Stage 1, the network processes input images at low resolution, combining CNN for local feature extraction with Transformer for global modeling to generate coarse regions of interest. These preliminary results serve as input for Stage 2, which crops and focuses on high-resolution regions of interest. In Stage 2, the network performs fine-grained segmentation on high-resolution regions of interest, further optimizing boundary delineation and structural detail preservation.
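To make the block structure concrete, the following is a simplified PyTorch sketch of one Swin-style 3D block (layer normalization, window-based multi-head self-attention, and an MLP, each with a residual connection). The shifted-window variant is approximated here by cyclically rolling the volume; the attention mask for wrapped voxels and the relative position bias are omitted, so this should be read as an illustration rather than the authors' exact implementation.

```python
# Simplified Swin-style 3D block: LN -> window MHSA -> residual, LN -> MLP -> residual.
import torch
import torch.nn as nn

class SwinBlock3D(nn.Module):
    def __init__(self, dim: int, heads: int = 4, window: int = 4, shift: int = 0):
        super().__init__()  # dim must be divisible by heads
        self.window, self.shift = window, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W, C) feature map; D, H, W are assumed divisible by the window size.
        B, D, H, W, C = x.shape
        w = self.window
        if self.shift:
            x = torch.roll(x, shifts=(-self.shift,) * 3, dims=(1, 2, 3))

        # Partition into non-overlapping w*w*w windows -> (B * num_windows, w^3, C).
        win = x.reshape(B, D // w, w, H // w, w, W // w, w, C)
        win = win.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, w ** 3, C)

        # Window-based multi-head self-attention with a residual connection.
        h = self.norm1(win)
        h, _ = self.attn(h, h, h)
        win = win + h

        # Feed-forward (MLP) with a residual connection.
        win = win + self.mlp(self.norm2(win))

        # Merge windows back to the original (B, D, H, W, C) layout.
        x = win.reshape(B, D // w, H // w, W // w, w, w, w, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, D, H, W, C)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift,) * 3, dims=(1, 2, 3))
        return x
```

In a Swin pair as described above, the first block would be instantiated with shift=0 (regular windows) and the second with shift=window//2 (shifted windows).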

To comprehensively evaluate TransUNet-Cascade's effectiveness, this study introduced the classical segmentation model 3D nnU-Net for comparison. 3D nnU-Net is a highly automated 3D medical image segmentation framework based on standard 3D U-Net with symmetric encoder-decoder design and skip connections for effective fusion of low-level features and high-level semantic information. The encoder extracts multi-scale features through multiple 3D convolution and downsampling layers, while the decoder gradually restores spatial resolution through upsampling and convolution for fine structural reconstruction. Therefore, this study selected 3D nnU-Net as the baseline model for systematic training and performance comparison with TransUNet-Cascade under unified dataset and evaluation criteria to comprehensively validate its effectiveness in CRC segmentation.

1.5.3 Model Training and Testing: During training, image patches with batch size 2 were input to the network, with randomly sampled patch dimensions set to 192×192×48 (length × width × slices). To mitigate overfitting, spatial and intensity augmentations were applied to image patches, including random flipping, rotation, scaling, and intensity transformations such as Gaussian noise and Gaussian blur. Stochastic gradient descent was used as the optimizer with an initial learning rate of 0.01, gradually decaying to 5.5×10⁻⁴. Based on the PyTorch framework, models were trained for 1,000 epochs. The network loss function combined binary cross-entropy loss and Dice loss with equal weights of 0.5. The model with the highest average Dice coefficient on the validation set was selected as the optimal network. During testing, no data augmentation was applied; images were cropped using overlapping sliding windows with patch size consistent with training (192×192×48). Network output probabilities were binarized at a threshold of 0.5, and the largest connected component was retained as the final segmentation result.
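Below is a sketch of the equally weighted BCE + Dice loss and the test-time post-processing (binarization at 0.5 followed by retention of the largest connected component). The tensor shapes and the SciPy-based connected-component step are illustrative assumptions, not taken from the paper.

```python
# Combined loss (0.5 * BCE + 0.5 * Dice) and post-processing for a binary segmentation task.
import numpy as np
import torch
import torch.nn.functional as F
from scipy import ndimage

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """logits, target: (B, 1, D, H, W); target is a float tensor with values in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    denom = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = (2 * inter + eps) / (denom + eps)
    return 0.5 * bce + 0.5 * (1 - dice.mean())

def postprocess(prob: np.ndarray) -> np.ndarray:
    """Binarize a (D, H, W) probability map at 0.5 and keep the largest connected component."""
    mask = (prob >= 0.5).astype(np.uint8)
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return (labels == (np.argmax(sizes) + 1)).astype(np.uint8)
```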

1.6 Evaluation of Segmentation Model Performance

Using manual annotations as the gold standard, TransUNet-Cascade and 3D nnU-Net model performance was evaluated using Dice similarity coefficient (DSC), F1 score, 95% Hausdorff distance (HD95), intersection over union (IoU), precision (PRE), and recall (REC). The formulas are as follows:

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

$$F1\text{-score} = \left[\frac{(PRE^{-1} + REC^{-1})}{2}\right]^{-1}$$

$$HD95 = \max\left[\underset{a \in A}{\mathrm{P}_{95}}\ \inf_{b \in B} d(a,b),\ \underset{b \in B}{\mathrm{P}_{95}}\ \inf_{a \in A} d(a,b)\right]$$

$$PRE = \frac{V_{TP}}{V_{TP} + V_{FP}}$$

$$IoU = \frac{V_{TP}}{V_{TP} + V_{FP} + V_{FN}}$$

$$REC = \frac{V_{TP}}{V_{TP} + V_{FN}}$$

Where X is the set of ground-truth (manually annotated) voxels and Y is the set of predicted voxels; VTP is the number of voxels correctly classified as tumor (agreement between model segmentation and manual annotation); VFN is the number of tumor voxels classified as background; and VFP is the number of background voxels classified as tumor. d(a,b) is the distance from point a to point b, inf denotes the infimum (greatest lower bound), and P95 denotes the 95th percentile taken over the boundary-point distances between sets A and B; HD95 is therefore the 95th-percentile boundary distance, which is less sensitive to outliers than the maximum Hausdorff distance.
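The overlap-based metrics can be computed directly from binary masks; the sketch below follows the formulas above (HD95 is omitted, since it requires boundary-distance computation, for which libraries such as MedPy are commonly used).

```python
# Voxel-wise overlap metrics from binary masks, following the formulas above.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: binary arrays of identical shape (1 = tumor, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-8
    pre = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    return {
        "DSC": 2 * tp / (pred.sum() + gt.sum() + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "PRE": pre,
        "REC": rec,
        "F1": 2 * pre * rec / (pre + rec + eps),  # harmonic mean of PRE and REC
    }
```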

1.7 Statistical Analysis

SPSS 19.0 software was used for statistical analysis. All measurement data in this study followed normal distribution and were expressed as (x̄±s). One-way ANOVA was used for multi-group comparisons. Count data were expressed as relative frequencies, with χ² test for inter-group comparisons. P<0.05 was considered statistically significant.
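For reference, equivalent comparisons can be run with SciPy; the function below is a hedged sketch with hypothetical inputs (per-set age arrays and a 3×2 gender contingency table), not the SPSS procedure actually used in the study.

```python
# One-way ANOVA across the three sets and a chi-square test on a contingency table.
import numpy as np
from scipy import stats

def compare_groups(age_by_set: list, gender_table: np.ndarray) -> dict:
    """age_by_set: list of three age arrays (training, validation, test);
    gender_table: 3x2 array of male/female counts per set."""
    f_stat, p_age = stats.f_oneway(*age_by_set)                     # one-way ANOVA
    chi2, p_gender, dof, _ = stats.chi2_contingency(gender_table)   # chi-square test
    return {"ANOVA_p": p_age, "chi2_p": p_gender}
```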

2. Results

2.1 Patient Clinical Data

No statistically significant difference in age was found among the training, validation, and external test sets (P>0.05). However, statistically significant differences were observed in gender and tumor location among the three sets (P<0.001). See Table 2 [TABLE:2].

2.2 Comparison of Overall Segmentation Performance Between Two Models

To validate TransUNet-Cascade's segmentation performance, it was compared with the 3D nnU-Net algorithm; experimental results are shown in Figure 3 [FIGURE:3]. In the external test set, TransUNet-Cascade achieved average DSC, F1, HD95, IoU, PRE, and REC values of 0.740, 0.839, 34.084, 0.656, 0.737, and 0.767, respectively, for automatic CRC segmentation based on arterial phase images. In contrast, 3D nnU-Net achieved average DSC, F1, HD95, IoU, PRE, and REC values of 0.724, 0.838, 35.954, 0.642, 0.730, and 0.744. For venous phase images, TransUNet-Cascade achieved average DSC, F1, HD95, IoU, PRE, and REC values of 0.688, 0.812, 32.364, 0.601, 0.666, and 0.743, respectively, while 3D nnU-Net achieved values of 0.660, 0.810, 35.412, 0.576, 0.625, and 0.730. Both DL models demonstrated generally superior automatic segmentation performance on arterial phase images compared with venous phase images. TransUNet-Cascade's segmentation efficacy was overall superior to 3D nnU-Net.

2.3 Comparison of Segmentation Performance by Tumor Location

Further analysis of TransUNet-Cascade's segmentation performance by tumor location revealed that the model achieved optimal segmentation for right-sided colon cancer, with DSC, F1, HD95, IoU, PRE, and REC values of 0.730, 0.833, 36.539, 0.650, 0.728, and 0.757, respectively. This was followed by left-sided colon cancer and transverse colon cancer. Rectal cancer segmentation was the least effective, with DSC, F1, HD95, IoU, PRE, and REC values of 0.622, 0.835, 43.832, 0.547, 0.611, and 0.659, respectively. See Table 3 [TABLE:3].

3. Discussion

Accurate segmentation of CRC lesions is a crucial prerequisite for building robust artificial intelligence algorithm models. The colon and rectum are flexible hollow organs, and CRC exhibits characteristics such as variable size, diverse morphology, and irregular margins, demonstrating significant heterogeneity on imaging. Furthermore, the colon and rectum are adjacent to multiple organs, and interference from these anatomical structures further increases the difficulty of automatic CRC lesion segmentation based on imaging data. Over the past decade, convolutional neural networks (CNNs) have made significant advances in medical image segmentation [12]. Among them, U-Net, through its symmetric encoder-decoder structure and skip connection mechanism, effectively addresses challenges of small sample data and high-precision localization in medical image segmentation, becoming a milestone in this field [13]. Our research group previously conducted preliminary exploration of automatic CRC segmentation in abdominal CT images based on 3D nnU-Net [8]. However, CNNs have limited receptive fields and typically use small convolution kernels to balance model accuracy and computational complexity, focusing more on local feature extraction and performing poorly in capturing long-range dependencies.

Transformer, as a deep learning model based on self-attention mechanisms, was first proposed by Vaswani et al. [14] in 2017. Transformer abandons traditional convolutional and recurrent structures, capturing long-range dependencies through parallel sequence data processing. Its core features include multi-head attention mechanisms and positional encoding, enabling excellent performance in natural language processing tasks [15]. Additionally, Transformer's encoder-decoder architecture and stackable multi-layer structure endow it with powerful feature extraction and representation capabilities, forming the foundational framework for advanced models such as BERT and GPT. In recent years, Transformer has been gradually introduced into medical image segmentation and has demonstrated significant advantages. For example, Ghazouani et al. [16] combined Transformer with CNN modules to build an encoder-decoder structure, achieving excellent performance with an average DSC of 0.8977 for brain tumor segmentation in MR images from 1,251 patients in the BraTS 2021 dataset. Sun et al. [17] proposed an algorithm fusing Vision Transformer with edge-guided encoder-decoder networks, achieving a DSC of 0.9015 for automatic spinal structure segmentation in 195 T2WI images. Liu et al. [18] designed a multi-scale edge optimization algorithm based on Swin Transformer for bladder cancer MRI segmentation, achieving an average DSC of 0.9373 in 100 patients' T2WI images. However, to our knowledge, no studies have applied Transformer to CRC lesion segmentation in abdominal CT images. This study aims to explore the potential and clinical value of Transformer in CRC segmentation.

Our dataset included contrast-enhanced abdominal CT images from 2,180 CRC patients acquired using different brands and models of CT scanners across three independent medical centers. With a large training set (n=1,159) and significant data heterogeneity, the dataset well reflects real clinical scenarios. This study proposed a novel CNN-Transformer combined TransUNet-Cascade model and compared its segmentation performance with 3D nnU-Net, with generalization capability validated through an independent external test set. Results showed that both TransUNet-Cascade and 3D nnU-Net achieved superior automatic tumor segmentation on arterial phase images compared with venous phase images. CRC contains abundant neovascularization with defective endothelial cell structure, incomplete smooth muscle and pericyte coverage, leading to significantly increased permeability. During the arterial phase, tumor tissue rapidly takes up contrast agent and reaches peak enhancement, creating optimal contrast between tumor and normal intestinal wall that facilitates model segmentation.

Further analysis of the two DL models' segmentation performance revealed that TransUNet-Cascade could effectively identify and segment CRC lesions, achieving average DSC, F1, HD95, IoU, PRE, and REC values of 0.740, 0.839, 34.084, 0.656, 0.737, and 0.767, respectively, on arterial phase images in the external test set, overall outperforming 3D nnU-Net. This advantage may be attributed to TransUNet-Cascade's architectural design and its ability to capture multi-scale features. CRC lesions are distributed across different abdominal locations with complex spatial relationships with surrounding tissues and organs. While 3D nnU-Net demonstrates excellent local feature extraction through adaptive hyperparameter and network structure optimization, it has insufficient global context modeling, potentially leading to inaccurate segmentation of lesion boundaries and spatial distribution, especially when contrast between lesion regions and surrounding tissue is low. In contrast, TransUNet-Cascade combines the respective strengths of CNN and Transformer: CNN effectively extracts local detail features, while Transformer modules capture long-range dependencies through self-attention mechanisms, better understanding spatial distributions between lesions and surrounding tissues/organs. This balances local details and global context, enabling better handling of CRC lesions' complex morphology and blurred boundaries, thereby further improving segmentation accuracy.

This study further analyzed TransUNet-Cascade's segmentation performance by tumor location, revealing optimal segmentation for right-sided colon cancer with DSC, F1, HD95, IoU, PRE, and REC values of 0.730, 0.833, 36.539, 0.650, 0.728, and 0.757, respectively, followed by left-sided colon cancer and transverse colon cancer. Rectal cancer segmentation was relatively poor. This finding is consistent with our previous 3D nnU-Net-based CRC segmentation study [8]. This phenomenon may be related to anatomical characteristics of CRC at different locations. First, the right colon is located on the right side of the abdominal cavity with relatively fixed anatomical structure and higher contrast with surrounding tissues. In contrast, the rectum is located in the pelvic cavity with complex anatomy, adjacent to multiple tissues and organs (such as bladder, prostate, uterus, etc.). Additionally, smaller density differences among fat, muscle, and bone tissues in the pelvic cavity result in lower contrast between rectal cancer and surrounding tissues on CT images, further increasing segmentation difficulty. Although TransUNet-Cascade's cascaded structure and multi-stage segmentation strategy progressively optimize results, the model may struggle to capture sufficient detail information when processing rectal cancer due to unclear lesion boundaries and complex spatial relationships with surrounding tissues, leading to decreased segmentation accuracy.

This study has several limitations: (1) Limited model diversity: Only two DL models (3D nnU-Net and TransUNet-Cascade) were used for CRC lesion segmentation, without comparison to other advanced DL models, potentially limiting the comprehensiveness and generalizability of results. (2) Poor rectal cancer segmentation: TransUNet-Cascade's relatively poor performance on rectal cancer may be related to complex anatomical structure, blurred boundaries, and low contrast in CT images. The model's capability in handling low-contrast and complex anatomical structures requires further optimization. (3) Dataset limitations: Although our dataset originated from three independent medical centers and covered different CT scanner brands and models, the sample size remains limited, particularly for certain tumor locations (e.g., transverse colon cancer), which may affect model segmentation performance for these sites.

In summary, the TransUNet-Cascade model, by combining CNN and Transformer advantages, adopting a cascaded structure, and enhancing global context modeling capability, outperforms the traditional 3D nnU-Net model in CRC segmentation tasks. These results provide new technical insights for precise CRC segmentation and hold promise for further clinical application.

Author Contributions: HUANG Mu, ZHANG Daochun, and CHEN Weicui conceived the research idea and designed the study protocol. HUANG Mu constructed and validated the deep learning model. YANG Wei and ZHONG Liming provided methodological support. YUAN Wenjing and JIA Ziqi collected and organized data. TAN Xiangliang and DUAN Xiaohui provided external validation data. LIU Xian performed statistical analysis. CHEN Weicui revised the final version and is responsible for the overall manuscript content.

ORCID IDs:
HUANG Mu: https://orcid.org/0009-0007-9289-2366
CHEN Weicui: https://orcid.org/0000-0002-1814-8295

Funding: National Natural Science Foundation of China (82202259); 13th Chaoyang Talent Project of Guangdong Provincial Traditional Chinese Medicine Hospital (ZY2022YL05)

References

[1] HAN B F, ZHENG R S, ZENG H M, et al. Cancer incidence and mortality in China, 2022[J]. J Natl Cancer Cent, 2024, 4(1): 47-53. DOI: 10.1016/j.jncc.2024.01.006.

[2] ZHENG R S, CHEN R, HAN B F, et al. Analysis of cancer incidence and mortality in China, 2022[J]. Chinese Journal of Oncology, 2024, 46(3): 221-231. DOI: 10.3760/cma.j.cn112152-20240119-00035.

[3] Medical Administration Department of National Health Commission, Chinese Society of Oncology, Chinese Medical Association. Chinese guidelines for diagnosis and treatment of colorectal cancer (2023 edition)[J]. Medical Journal of Peking Union Medical College Hospital, 2023, 14(4): 706-733. DOI: 10.12290/xhyxzz.2023-0315.

[4] LI S L, YUAN L, LU T, et al. Deep learning imaging reconstruction of reduced-dose 40 keV virtual monoenergetic imaging for early detection of colorectal cancer liver metastases[J]. Eur J Radiol, 2023, 168: 111128. DOI: 10.1016/j.ejrad.2023.111128.

[5] YAO L S, LI S Y, TAO Q, et al. Deep learning for colorectal cancer detection in contrast-enhanced CT without bowel preparation: a retrospective, multicentre study[J]. EBioMedicine, 2024, 104: 105183. DOI: 10.1016/j.ebiom.2024.105183.

[6] MIAO S D, SUN M Z, ZHANG B B, et al. Multimodal deep learning: tumor and visceral fat impact on colorectal cancer occult peritoneal metastasis[J]. Eur Radiol, 2025. DOI: 10.1007/s00330-025-11450-2.

[7] WANG X, SHI C, YUAN Z Y. Partition attention UNet model for segmenting knee cartilage in MRI[J]. Chinese Journal of Medical Imaging Technology, 2024, 40(5): 764-768. DOI: 10.13929/j.issn.1003-3289.2024.05.027.

[8] ZHENG K Y, WU H, YUAN W J, et al. Feasibility study of automatic colorectal cancer segmentation from abdominal CT images using 3D nnU-Net deep learning network[J]. Chinese Journal of Radiology, 2024, 58(8): 829-835.

[9] XIAO L, ZHANG L, TANG Y, et al. Global and contextual dual attention U-Net network for thoracic and lumbar spine segmentation in sagittal X-ray images[J]. Chinese Journal of Medical Imaging Technology, 2025, 41(1): 128-132. DOI: 10.13929/j.issn.1003-3289.2025.01.027.

[10] PU Q M, XI Z X, YIN S, et al. Advantages of transformer and its application for medical image segmentation: a survey[J]. Biomed Eng Online, 2024, 23(1): 14. DOI: 10.1186/s12938-024-01227-8.

[11] Medical Imaging Big Data and Artificial Intelligence Working Committee of Chinese Society of Radiology, Abdominal Group of Chinese Society of Radiology, Magnetic Resonance Group of Chinese Society of Radiology. Expert consensus on CT and MRI annotation of colorectal cancer (2020)[J]. Chinese Journal of Radiology, 2021, 55(2): 111-116. DOI: 10.3760/cma.j.cn112149-20200706-00894.

[12] CARIN L, PENCINA M J. On deep learning for medical image analysis[J]. JAMA, 2018, 320(11): 1192-1193. DOI: 10.1001/jama.2018.13316.

[13] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[M]//Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer International Publishing, 2015: 234-241. DOI: 10.1007/978-3-319-24574-4_28.

[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 5998-6008.

[15] SHAMSHAD F, KHAN S, ZAMIR S W, et al. Transformers in medical imaging: a survey[J]. Med Image Anal, 2023, 88: 102802. DOI: 10.1016/j.media.2023.102802.

[16] GHAZOUANI F, VERA P, RUAN S. Efficient brain tumor segmentation using Swin transformer and enhanced local self-attention[J]. Int J Comput Assist Radiol Surg, 2024, 19(2): 273-281. DOI: 10.1007/s11548-023-03024-8.

[17] SUN H, MO G P, XU G H, et al. RET-Net algorithm based on fused vision Transformer and edge-guided encoder-decoder network for spinal MRI segmentation[J]. Chinese Journal of Medical Imaging Technology, 2023, 39(4): 577-581. DOI: 10.13929/j.issn.1003-3289.2023.04.021.

[18] LIU L B, LI X, WEI B Z, et al. Multi-scale edge optimization algorithm for bladder cancer MRI segmentation based on Swin Transformer[J]. Journal of Biomedical Engineering Research, 2023, 42(1): 43-49. DOI: 10.19529/j.cnki.1672-6278.2023.01.07.

(Received: March 14, 2024; Revised: June 18, 2025)
(This article was edited by JIA Mengmeng)
