PAC-P2T: pyramid atrous convolution with pyramid pooling Transformer for polyp segmentation
Introduction
Colorectal cancer (CRC) commonly develops from polyps in the colon or rectum, a progression that generally occurs over many years (1,2). Globally, CRC ranks among the most prevalent and deadly cancers. Early detection and removal of polyps are crucial for CRC prevention. Colonoscopy is the gold standard for polyp screening, and plays a vital role in this process. However, the diverse morphology and subtle appearance of polyps render high-precision colonoscopic screening exceptionally challenging, even for experienced clinicians, making error-free detection a formidable task. Missed detections and incomplete removal significantly increase the risk of CRC (3). To enhance screening accuracy, computer-aided diagnosis (CAD) systems are widely used to assist medical professionals in polyp detection and localization (4). As shown in Figure 1, polyps exhibit significant intra-class variation and low inter-class contrast, characterized by diverse shapes, smooth transitions with the intestinal wall, and blurred boundaries. Additionally, complex intestinal environments, including residual food debris, feces, and blood, further complicate polyp detection, making accurate segmentation an extremely challenging task (1).
With the rapid advancement of deep learning, frameworks based on this technology have come to dominate the field of polyp segmentation. Traditional machine learning methods, constrained by limited data representation capabilities and the need for highly engineered feature construction, are increasingly being replaced by deep learning across many domains. In the era of deep learning, early approaches in polyp extraction apply conventional image segmentation algorithms and general medical image segmentation methods, such as Fully Convolutional Networks (FCN) (5), SegNet (6), UNet (7), and its variant UNet++ (8), META-UNet (9). Brandao et al. (10) pioneered the application of FCN for polyp segmentation using a VGG-based encoder-decoder architecture, which achieved superior results. They further improved segmentation performance by integrating depth estimation as an additional input channel, demonstrating its effectiveness through experimental validation. Bardhi et al. (11) adopted SegNet but found it susceptible to lighting variations, while Wang et al. (12) validated SegNet-based experiments with newly collected clinical data and integrated public datasets, confirming its effectiveness but also highlighting its limitations in handling boundary blurring and residual interference. Given the unique challenges of polyp segmentation, specialized Convolutional Neural Network (CNN)-based networks, such as SFA (13), Polyp-Net (14), PraNet (15), ABC-Net (16), SSN (17), ACSNet (18), MSEG (19), EU-Net (20), and MSNet (21), have been developed to address this task effectively. Some approaches (13,22,23) introduce polyp edge extraction branches to refine boundary information and improve region segmentation accuracy as well. Although CNN-based methods have significantly enhanced segmentation accuracy, their architectural constraints limit their ability to capture long-range dependencies, thereby restricting global feature awareness. 
Transformer models, renowned for their robust long-range dependency modeling capabilities, have rapidly risen to prominence in computer vision following their remarkable success in natural language processing. Frameworks such as Vision Transformer and other Transformer-based architectures have been developed, delivering exceptional performance in tasks like image recognition, object detection, and segmentation. Researchers soon explored Transformer-based methods to enhance the robustness of polyp segmentation. Dong et al. (24) were the first to introduce a framework based on the pyramid vision Transformer (PVT), proposing Polyp-PVT, which leverages PVT’s powerful feature representation capabilities for detection and segmentation, resulting in substantial performance improvements. Subsequently, other PVT-based polyp segmentation networks, including CGMA-Net (25) and DFINet (26), have been developed to further advance the field. Additionally, some studies proposed hybrid architectures (27) that combine Transformer and CNN structures in parallel, achieving certain successes at the cost of increased computational overhead. Existing polyp segmentation methods can thus be broadly categorized into CNN-based, Transformer-based, and hybrid architectures. Despite notable progress, these methods still suffer from several limitations, including insufficient coordination between global semantic modeling and local boundary refinement, limited flexibility in multi-scale receptive field control, and weak cross-layer feature interaction mechanisms, all of which jointly restrict their effectiveness in handling the complex morphology and low-contrast boundaries commonly observed in colonoscopic images.
As mentioned, edge information plays a crucial role in polyp region extraction, and some algorithms incorporate dedicated edge extraction branches to enhance this capability. Several prior studies have explored atrous convolutions for segmentation or detection tasks. Choi et al. (28) adopted atrous convolutions in a lightweight atrous spatial pyramid pooling module for real-time crack segmentation. Ali et al. (29) utilized atrous convolutions to enlarge the receptive field during anomaly detection on thermography data, combined with adversarial learning and attention mechanisms. Compared with standard convolution, atrous convolution expands the receptive field while preserving spatial resolution, allowing for better retention of fine details. This characteristic is particularly beneficial for handling challenges such as the blurred polyp boundaries commonly found in colonoscopy images. To take full advantage of both the Transformer and atrous convolution, we propose a novel polyp segmentation network termed PAC-P2T. PAC-P2T is designed as a dual-pyramid coupled architecture, in which the pyramid pooling Transformer (P2T) and pyramid atrous convolution (PAC) are tightly integrated to enable coordinated global-local feature modeling. Specifically, pyramid pooling in the backbone captures long-range dependencies and global semantics, while pyramid atrous convolution is systematically embedded across multiple network stages to progressively expand receptive fields and enhance boundary-aware feature representation. To enlarge the receptive fields layer by layer and enhance multi-scale feature fusion, we introduce the single-level atrous convolution feature fusion module into each side branch of the encoder and transmit the corresponding features to each lower-level side branch. Finally, we conduct experiments on five public colorectal polyp segmentation datasets, i.e., ETIS (30), CVC-ClinicDB (31), CVC-300 (32), CVC-ColonDB (15), and Kvasir-SEG (33).
Experimental results reveal the effectiveness of our model. For instance, the proposed PAC-P2T achieves a mean Dice of 0.815 and a mean intersection over union (IoU) of 0.744 on ETIS.
In summary, the key contributions of our work are as follows:
- We propose a dual-pyramid coupled architecture for polyp segmentation, which integrates a P2T backbone with a PAC design to jointly enable global semantic modeling and multi-scale receptive field encoding at the network level.
- We propose a hierarchical atrous-driven feature interaction framework that progressively expands receptive fields while enabling cross-layer global-to-local guidance, enhancing multi-scale and boundary-aware representation across the network.
The related work is reviewed and analyzed in the following subsections.
Specific networks for polyp segmentation
Due to challenges such as reflections, blurred edges, complex backgrounds, diverse morphologies, and the concealed nature of polyps, general-purpose methods struggle to effectively address the requirements of polyp segmentation. To achieve robust polyp segmentation, numerous specialized segmentation networks have been proposed. Fang et al. (13,16) simultaneously employed polyp boundary and region information as network constraints, constructing a symmetric dual-branch decoder based on UNet to separately enforce region and boundary constraints, while introducing selective feature aggregation to enhance feature transmission. Zhou et al. (22) proposed a strategy that applies polyp boundary supervision only to shallow encoder features and uses these features as deep feature guidance to enhance spatial recognition of polyp regions. Since explicitly adding an independent branch for boundary constraints may affect network convergence, Fan et al. (15) strengthened implicit edge features using reverse attention and established benchmark dataset splits that have significantly influenced subsequent research. Zhang et al. (18) generated global semantic information from deep encoder features and combined local feature enhancement at each level with deeper region estimation guidance to accomplish polyp region estimation. Wei et al. (34) recognized that dataset color distribution affects model generalization and proposed a color transfer strategy. Zhao et al. (21,35) introduced MSNet and M2SNet, emphasizing differential feature training between adjacent layers. Hu et al. (36) incorporated reflection restoration and enhancement into training while proposing a hierarchical short-link-based salient polyp detection network. In addition, Zhou et al. (37) applied a saliency detection network with a redundant and missing feature decoupling strategy to polyp segmentation, further validating the effectiveness of saliency-based methods. Lewis et al. (38) proposed a dual encoder-decoder architecture that effectively integrates both local and global information by leveraging CNN-based and Transformer-based modules. Zhang et al. (39) introduced dynamic kernels to address the issue of varying polyp scales. Liu et al. (40) proposed a coarse-to-fine segmentation path, using the deepest encoder features as a global guiding source to strengthen deep semantic guidance. Tomar et al. (41) employed dilated convolution to aggregate features across layers, using them as progressively transmitted inputs for the deepest decoder, optimizing the encoder’s comprehensive semantic feature extraction. Dai et al. (42) proposed a dual-path U-shaped network that focuses on historical feature reuse and extension, achieving certain improvements. Du et al. (43) incorporated uncertainty modeling, enhancing the performance of ICGNet in polyp segmentation. To leverage the Transformer’s advantages in modeling long-range dependencies, Dong et al. (24) were the first to introduce PVT as an encoder, using a shallow attention mechanism to process deep features, while applying channel and spatial enhancement to shallow features, ultimately fusing both feature streams for polyp region estimation. Their proposed Polyp-PVT significantly improves detection performance. Hu et al. (44) applied the P2T network to polyp segmentation, utilizing its feature extraction capabilities combined with multi-level pyramid pooling to optimize polyp feature extraction. Hu et al. (45) incorporated adjacent-differential feature fusion into the polyp segmentation network, thereby enhancing the network’s consistency in feature representation. Lin et al. (46) proposed Polyp-LVT, a lightweight polyp segmentation network based on the vision Transformer. Yue et al. (26) introduced a spatial and frequency domain fusion approach, combining the previous layer’s polyp region estimation with fused features to reinforce edge features, enhancing network performance in complex polyp detection scenarios. Zheng et al. (25) adopted PVT as the backbone and used a U-shaped network as the primary architecture, proposing a cross-level feature guidance strategy for feature transmission between encoder and decoder layers. The decoder further employs progressive multi-scale aggregation to refine the feature transmission path, improving polyp scale adaptability. Despite extensive optimization efforts in semantic understanding, feature aggregation, global information guidance, and edge feature enhancement, existing algorithms still fall short of the robust segmentation demands of clinical polyp screening, necessitating further improvements in network adaptability for polyp segmentation tasks.
Pyramid mechanism for vision analysis
Human visual perception shares similarities with the pyramid mechanism in terms of multi-level processing, coarse-to-fine analysis, information compression, and multi-scale feature extraction. The application of the pyramid mechanism in computer vision simulates and optimizes the human visual system, making it an essential component of visual analysis research. Zhao et al. (47) utilized pyramid parsing to obtain subdomain feature representations for supporting holistic scene semantic segmentation. The pyramid mechanism has also been gradually introduced into medical image analysis. Sarker et al. (48) integrated pyramid pooling into the decoder side of a network for dermatological lesion segmentation under dermoscopy, employing an encoder-decoder structure similar to SegNet. Feng et al. (49) proposed a scale-aware pyramid fusion mechanism, achieving progress in segmentation tasks involving thoracic organs, retinal edema lesions, and skin lesions. Qin et al. (50) introduced a parallel branches with cross-scale inter-query attention mechanism to support multi-scale feature learning, making advancements in semantic segmentation. Regarding frameworks, Wang et al. (51) proposed PVT, which overcomes the challenges of adapting Transformers to dense prediction tasks. Polyp-PVT (24), DFINet (26), and CGMA-Net (25) all adopt PVT and its enhanced versions as the backbone for feature encoding. Wu et al. (52) introduced P2T, incorporating pyramid pooling into the network backbone for the first time and demonstrating its effectiveness in object detection, instance segmentation, and other tasks. Hu et al. (44) applied P2T to polyp segmentation, constructing a memory-keeping pyramid fusion structure using pyramid pooling. Yue et al. (53) employed a pyramid structure for multi-layer feature aggregation, integrating attention mechanisms to enhance polyp segmentation. 
Given the advantages of the pyramid mechanism in visual analysis, particularly in segmentation, further exploration of its potential in polyp segmentation is necessary to enhance feature extraction and improve segmentation robustness.
Atrous convolution for vision analysis
Atrous convolution is considered a powerful tool for controlling the resolution of features computed by CNNs and for explicitly adjusting the filter’s field of view (54). Atrous convolution, in cascade or in parallel, has been adopted to capture multi-scale context for semantic image segmentation, and it has also been shown to preserve more detailed spatial information (54). Wu et al. (55) showed the effect of modifying atrous rates when extracting long-term features. Choi et al. (28) proposed a pyramid pooling module with atrous convolutions (dilation rates 1–4) to slightly expand the receptive field while prioritizing inference efficiency. Ali et al. (29) applied adversarial learning and attention mechanisms to thermography-based defect detection, employing atrous convolutions with different dilation rates to increase the receptive field. Kang et al. (56) proposed a lightweight segmentation framework based on depthwise separable convolutions and squeeze-and-excitation attention modules. Huang et al. (57) proposed kernel-sharing atrous convolution, which boosts the network’s generalization and representation abilities and improves performance on semantic segmentation. The atrous spatial pyramid pooling module was applied in DeepLabV3+ (58) for extracting features at an arbitrary resolution. Zhao et al. (59) utilized atrous convolutions with different rates to extract pyramid features with different receptive fields, a scheme that boosted the network’s performance on saliency detection. Guo et al. (60) proposed a cascaded re-aggregation atrous convolution multi-scale feature fusion scheme, embedding it into the dual-branch process and the aggregation module at the decoder end, achieving superior performance in holistic scene semantic segmentation. Das et al. (61) introduced an atrous spatial pyramid-based pooling method, demonstrating its effectiveness in sickle cell anemia segmentation tasks. Halder et al. (62) proposed a two-layer atrous pyramid approach with residual connections, applying it to lung nodule segmentation and classification. In addition to semantic segmentation, atrous convolution has also shown its superiority in areas such as single-image defocus deblurring (63), object detection (64), and body landmark estimation (65). Exploring the incorporation of atrous convolution into polyp segmentation is therefore meaningful, as it provides a direction for enhancing the extraction of polyp regions through multi-scale receptive field modeling and hierarchical feature aggregation.
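As a worked illustration of how the dilation rate adjusts the field of view, the standard relation between kernel size, dilation rate, and effective kernel size can be written out. The symbols $k$, $d$, and $k_{\mathrm{eff}}$ and the example rates are chosen here for illustration only and are not taken from the cited works.

```latex
% Effective size of a k x k kernel with dilation rate d
% (d - 1 zeros are inserted between adjacent kernel taps):
k_{\mathrm{eff}} = k + (k - 1)(d - 1)
% For a 3 x 3 kernel:
%   d = 1 -> k_eff = 3,   d = 2 -> k_eff = 5,   d = 4 -> k_eff = 9
% The number of weights stays at 3 x 3 = 9 in every case.
```

The field of view therefore grows linearly with the dilation rate while the parameter count and, with suitable padding, the output resolution remain unchanged.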
The remainder of this paper is organized as follows. The “Methods” section introduces our method in detail. The “Results” section presents comparative experiments, an ablation study, and related discussion to verify the effectiveness of our method. The “Discussion” section addresses limitations and future work. Finally, concluding remarks are given in the “Conclusions” section.
Methods
In this section, we introduce PAC-P2T as our proposed network for polyp segmentation. We first depict the overall architecture. Then we introduce the feature extraction module and the attention mechanism. Finally, we illustrate the loss function adopted in our study and describe the datasets used for training and evaluation.
Overall architecture
The overall architecture of PAC-P2T is illustrated in Figure 2. The model is a U-shaped framework with P2T used as the backbone for feature extraction. A colonoscopy image with a resolution of $H \times W \times 3$ is first fed into the encoder network P2T. The multi-level features extracted from P2T are denoted as $\{F_i \mid i = 1, 2, 3, 4\}$. The resolution of $F_i$ is $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$ in this case, where $C_i$ is the number of channels. The features F2, F3, and F4 from the last three layers of the backbone are collectively used as inputs to the multi-layer PAC feature extraction (MPAF) module. The MPAF output is then processed through the channel attention (CA) module and a sequence of convolution, batch normalization, and ReLU operations to extract global feature information. To prevent attenuation during information transmission, the global feature information is directly up-sampled and fed back to each layer to provide semantic guidance. As illustrated in Figure 2, the global feature feedback utilizes a dual-connection mechanism. It not only undergoes direct additive integration with the features of each P2T layer but also participates in guidance again after the fused features at each layer pass through the single-level atrous convolution feature fusion (SLAF) module. This approach maximizes the utilization of deep semantic features. To reinforce the progressive guidance of shallow features by deep features, in the decoder phase, the composite features obtained through global feature-guided deep network processing serve as an additional input for the shallow features. This design enables layer-by-layer semantic optimization and smooth transmission.
Feature extraction module
To adapt to the feature representation needs of different tasks and further enhance the representational capability of features at various levels, we propose the SLAF module. This module leverages atrous convolutions to improve the perception and representational capacity of features at each level, thereby increasing the receptive field while ensuring feature resolution. This approach boosts the adaptability of corresponding features to various scales. As shown in Figure 2, to strengthen deep semantic features, the deepest features of the backbone network undergo dual multi-scale optimization. These optimizations are performed both before integrating global features and after the first stage of global feature integration, using SLAF for feature enhancement. For other layers in the network, SLAF is applied for feature enhancement only after the first stage of global feature integration. The primary purpose of this design is to intensively strengthen deep semantic features, thereby creating a positive impact on the shallow-level features. As illustrated in Figure 3, the input features are processed by the SLAF module through four parallel pyramid-like atrous convolution operations, each with a different scale, and the results are ultimately merged to output the enhanced features. The corresponding process can be presented as follows:
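$$F_{out} = \mathrm{Cat}\big(\mathrm{AC}_{d_1}(F_{in}),\ \mathrm{AC}_{d_2}(F_{in}),\ \mathrm{AC}_{d_3}(F_{in}),\ \mathrm{AC}_{d_4}(F_{in})\big)$$
where $\mathrm{Cat}(\cdot)$ represents the concatenation operation applied to all features within the parentheses, followed by batch normalization and the ReLU function, $\mathrm{AC}_{d}(\cdot)$ denotes the atrous convolution operation with a dilation rate of $d$ applied to the features in the parentheses, $d_1$ to $d_4$ are the dilation rates of the four parallel branches, $F_{in}$ refers to the input features of the SLAF module, and $F_{out}$ refers to the output features.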
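The parallel atrous branches described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the dilation rates (1, 2, 3, 4), the fixed all-ones 3×3 kernels, the single-channel input, and the names `atrous_conv2d` and `slaf_sketch` are all hypothetical, and batch normalization is omitted.

```python
def atrous_conv2d(image, kernel, dilation):
    """k x k atrous convolution with zero padding, so the output keeps the input size."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # kernel is k x k with k odd
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spread `dilation` pixels apart.
                    yy = y + (i - r) * dilation
                    xx = x + (j - r) * dilation
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside the map
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def slaf_sketch(image, kernels, dilations=(1, 2, 3, 4)):
    """Run parallel atrous branches, stack them as channels, then apply ReLU.

    The dilation rates are an assumption; the source only states that the
    four branches use different scales. Batch normalization is left out.
    """
    branches = [atrous_conv2d(image, k, d) for k, d in zip(kernels, dilations)]
    return [[[max(v, 0.0) for v in row] for row in branch] for branch in branches]


# A single impulse in a 9 x 9 map shows how far each branch "sees".
size = 9
image = [[0.0] * size for _ in range(size)]
image[4][4] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
features = slaf_sketch(image, [ones] * 4)
```

Every branch returns a 9×9 map, so spatial resolution is preserved, while the impulse response of the branch with dilation rate $d$ reaches pixels $d$ steps away from the center. In a trained network the stacked channels would be merged by learned weights rather than returned as is.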
To fuse features from different levels within the network and simultaneously enlarge the receptive field of the features, the pyramid structure of the SLAF module is introduced as a fundamental functional unit within the MPAF module. Additionally, the pyramid characteristics of the MPAF module are reflected in the dimensions of the input features. As shown in Figure 3, the three inputs to the MPAF module form a pyramid in both dimensionality and semantic information. The features from different levels are processed individually by the SLAF module, after which they are concatenated and merged. The resulting MPAF features can be obtained as follows:
$$F_{\mathrm{MPAF}} = \mathrm{Cat}\big(\mathrm{SLAF}(I_1),\ \mathrm{SLAF}(I_2),\ \mathrm{SLAF}(I_3)\big)$$
where $\mathrm{SLAF}(\cdot)$ denotes the SLAF module, $\mathrm{Cat}(\cdot)$ denotes concatenation and merging, and $I_i$ represents the $i$-th input to the MPAF module. In this work, the three inputs are F2, F3, and F4, obtained from the last three layers of the backbone.
Attention mechanism
As shown in Figure 2, the output features of the MPAF module are merged features obtained from three different level features processed through the SLAF module, resulting in a high number of feature channels. Different channels may contain distinct semantic information, and directly using high-dimensional features for semantic guidance could dilute critical channel-specific information, potentially adversely affecting the target task. To further integrate channel feature representations, we employ a simple channel attention mechanism.
As illustrated in Figure 4, the process begins with global pooling to generate an initial channel descriptor, providing a rough estimation of channel importance. To effectively capture channel dependencies, the mechanism employs two fully connected (FC) layers with ReLU activation. Finally, a sigmoid function maps the channel-wise vector to a range of [0, 1], enabling precise weight assignment across channels:
$$s = \sigma\big(W_2\,\delta(W_1 z)\big)$$
where $z$ is generated by the input feature after global pooling, $\sigma$ and $\delta$ refer to the sigmoid and ReLU functions, respectively, and $W_1$ and $W_2$ are the weights corresponding to the two FC layers. The final output of this channel attention (CA) module is obtained by weighting the input feature with $s_c$:
$$\tilde{x}_c = s_c \cdot x_c$$
where $x_c$ is the $c$-th channel of the input feature $X$, and $s_c$ is the $c$-th weight corresponding to $x_c$. As seen in Figure 2, $X$ refers to the output of the MPAF module. After applying the CA module to enhance the channel features of the MPAF output, we proceed with a 3×3 convolution to reduce the number of channels. The processed features are then normalized using batch normalization and activated through ReLU. Finally, these refined features are utilized as global guidance for each level to perform polyp region extraction tasks.
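The channel attention steps above (global pooling, two FC layers with ReLU, sigmoid, channel-wise reweighting) can be sketched as follows. This is an illustrative sketch only: global average pooling is assumed because the text says only "global pooling", bias terms are omitted, and the function name, weight shapes, and toy values are hypothetical.

```python
import math


def channel_attention(feature, w1, w2):
    """feature: list of C channels, each an H x W list of lists.
    w1: hidden x C weight matrix, w2: C x hidden weight matrix (no biases)."""
    # Global average pooling gives one descriptor per channel (assumed pooling type).
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # First FC layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Second FC layer followed by sigmoid, mapping each weight into [0, 1].
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden)))) for row in w2]
    # Reweight every channel by its attention weight.
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, feature)]
    return s, out


# Toy input: two 2 x 2 channels and a single hidden unit.
feature = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[1.0, 1.0]]      # hidden = ReLU(1 + 3) = 4
w2 = [[0.0], [1.0]]    # s = [sigmoid(0), sigmoid(4)]
weights, reweighted = channel_attention(feature, w1, w2)
```

In this toy case the first channel receives a weight of 0.5 and the second a weight close to 1, so the second channel is passed through almost unchanged while the first is halved.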
Loss function
The weighted binary cross entropy (wBCE) loss and the weighted IoU (wIoU) loss are commonly applied in segmentation tasks. In this work, we leverage both losses to compute the overall training loss as follows:
$$L_{total} = \sum_{k=0}^{4}\left(L_{wBCE}^{k} + L_{wIoU}^{k}\right)$$
where $L_{wBCE}^{k}$ represents the weighted BCE loss of the $k$-th output and $L_{wIoU}^{k}$ corresponds to its wIoU loss. The calculations for both the wBCE loss and the wIoU loss are given in Eq. [6] and [7], where $p_i$ denotes the predicted probability for the $i$-th pixel, $y_i$ is the ground truth label, and $w_i$ denotes the weight assigned to the $i$-th pixel. The outputs OUT, OUT-1, OUT-2, OUT-3, and OUT-4 shown in Figure 2 correspond to the 0th, 1st, 2nd, 3rd, and 4th outputs, respectively.
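One common formulation of the two weighted losses and their sum over the five outputs is sketched below. It is an assumption that the paper's own equations match this form exactly: the normalization by the weight sum, the smoothing constant `smooth`, the clipping constant `eps`, and all function names are hypothetical, and the pixel weights are simply passed in because their definition is not given in the text.

```python
import math


def weighted_bce(p, y, w, eps=1e-7):
    """Weighted binary cross entropy over flattened pixels, normalized by the weight sum."""
    total = 0.0
    for p_i, y_i, w_i in zip(p, y, w):
        p_i = min(max(p_i, eps), 1.0 - eps)  # avoid log(0)
        total -= w_i * (y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i))
    return total / sum(w)


def weighted_iou(p, y, w, smooth=1.0):
    """Weighted soft IoU loss; `smooth` guards against empty masks (assumed value)."""
    inter = sum(w_i * p_i * y_i for p_i, y_i, w_i in zip(p, y, w))
    union = sum(w_i * (p_i + y_i - p_i * y_i) for p_i, y_i, w_i in zip(p, y, w))
    return 1.0 - (inter + smooth) / (union + smooth)


def total_loss(outputs, y, w):
    """Sum of wBCE and wIoU over all side outputs (OUT, OUT-1, ..., OUT-4)."""
    return sum(weighted_bce(p, y, w) + weighted_iou(p, y, w) for p in outputs)


# Toy example with three pixels and five identical outputs.
y = [1, 0, 1]
w = [1.0, 1.0, 1.0]
perfect = total_loss([[1.0, 0.0, 1.0]] * 5, y, w)
uncertain = total_loss([[0.5, 0.5, 0.5]] * 5, y, w)
```

A perfect prediction drives both terms toward zero, while the uninformative prediction of 0.5 everywhere is penalized by both the pixel-wise term and the region-level term.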
Datasets
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. We selected five widely used datasets to validate the algorithm’s performance; their main information is shown in Table 1. Approximately 90% of the images from CVC-ClinicDB (550 of 612) and Kvasir-SEG (900 of 1,000) were used for training, with the remaining images of each dataset serving as two separate test sets. This configuration is consistent with that in PraNet (15). Figure 5 presents some sample images from the testing datasets. As shown in Figure 5, the ETIS dataset exhibits notable differences compared with the other datasets, including variations in color distribution and the presence of more folds and bubbles on the intestinal wall, which significantly increase the difficulty of the segmentation task.
Table 1
| Dataset | Sample Num | Images involved in training set | Images involved in testing set | Resolution |
|---|---|---|---|---|
| ETIS | 192 | – | 192 | 1,225×996 |
| CVC-ClinicDB | 612 | 550 | 62 | 384×288 |
| CVC-300 | 60 | – | 60 | 574×500 |
| CVC-ColonDB | 380 | – | 380 | 574×500 |
| Kvasir-SEG | 1,000 | 900 | 100 | 332×487 to 1,920×1,072 |
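The split in Table 1 can be checked with a few lines of arithmetic. The counts below are copied from the table; the variable names are chosen for illustration.

```python
# (total, training, testing) image counts copied from Table 1
splits = {
    "ETIS": (192, 0, 192),
    "CVC-ClinicDB": (612, 550, 62),
    "CVC-300": (60, 0, 60),
    "CVC-ColonDB": (380, 0, 380),
    "Kvasir-SEG": (1000, 900, 100),
}

# Every dataset must be fully accounted for by its training and testing parts.
for name, (total, train, test) in splits.items():
    assert train + test == total, name

train_total = sum(train for _, train, _ in splits.values())
test_total = sum(test for _, _, test in splits.values())

# Share of each source dataset that goes into training.
train_fraction = {
    name: train / total for name, (total, train, _) in splits.items() if train > 0
}

# Datasets never seen during training, used to probe generalization.
unseen = [name for name, (_, train, _) in splits.items() if train == 0]
```

The training set thus contains 1,450 images in total, and three of the five test sets (ETIS, CVC-300, and CVC-ColonDB) come from sources that contribute no training images.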
Results
Experimental settings
In the training stage of PAC-P2T, data augmentation techniques such as random cropping, horizontal flipping, and rotation are applied. The inputs for each dataset are resized to 352×352, and a mini-batch size of 16 is employed. Training incorporates a multi-scale strategy alongside stochastic gradient descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 0.0005. The network is trained for a total of 50 epochs. The learning rate starts at a maximum value of 0.05 and is adjusted using a warm-up phase followed by linear decay. All models adopt the same dataset splitting strategy. Following common practice in polyp segmentation research, we report the performance of competing methods using their published results, which were obtained under the same benchmark settings.
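A warm-up phase followed by linear decay can be sketched as a simple per-epoch schedule. Only the maximum learning rate (0.05) and the epoch count (50) come from the text; the warm-up length of 5 epochs, the per-epoch granularity, and the decay toward zero are assumptions made for illustration, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, total_epochs=50, max_lr=0.05, warmup_epochs=5):
    """Linear warm-up to max_lr, then linear decay (warm-up length is assumed)."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        # Ramp up linearly and reach max_lr at the last warm-up epoch.
        return max_lr * (epoch + 1) / warmup_epochs
    # Decay linearly over the remaining epochs.
    return max_lr * (total_epochs - epoch) / (total_epochs - warmup_epochs)


schedule = [learning_rate(epoch) for epoch in range(50)]
```

The warm-up avoids applying the full step size to freshly initialized decoder weights, and the decay reduces the step size as training converges.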
Evaluation metrics
We employ six evaluation metrics to assess the performance of the models: the mean Dice coefficient (mDice), mean IoU (mIoU), weighted F-measure ($F_\beta^w$), S-measure ($S_\alpha$), E-measure ($E_\phi$), and mean absolute error (MAE). mDice and mIoU measure the degree of overlap between the segmentation results and the ground truth. They can be calculated by
$$\mathrm{mDice} = \frac{2TP}{2TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{TP}{TP + FP + FN}$$
where TP, FP, and FN denote true positive, false positive, and false negative, respectively. The weighted F-measure provides a comprehensive evaluation by considering both recall and precision, while taking into account the pixel differences using a weighting method. $F_\beta^w$ can be calculated by
$$F_\beta^w = \frac{(1 + \beta^2)\, P^w \cdot R^w}{\beta^2 \cdot P^w + R^w}$$
where $P^w$ and $R^w$ denote the weighted precision and weighted recall, and $\beta^2$ is set to 1. The S-measure focuses on the structural similarity between the segmentation results and the ground truth; it can be computed by
$$S_\alpha = \alpha S_o + (1 - \alpha) S_r$$
where $S_r$ and $S_o$ denote the region-based and object-level similarity measurements, which are calculated as described in (66), and $\alpha$, which balances $S_o$ and $S_r$, is set to 0.5. The E-measure considers the segmentation results at both pixel and image levels; it can be computed by
$$E_\phi = \frac{1}{w \times h}\sum_{x=1}^{w}\sum_{y=1}^{h}\phi(x, y)$$
where $w$ and $h$ are the width and height of the input, and the enhanced alignment matrix $\phi$ is used to capture pixel-level matching and image-level statistical characteristics (67). MAE quantifies the absolute difference between the inference results and the ground truth; it is defined as
$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|P_i - G_i\right|$$
where $G$ represents the pixel values of the ground truth, $P$ represents the pixel values of the prediction, and $M$ is the total number of pixels.
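The overlap metrics and MAE can be computed for a single image as sketched below. This is an illustrative sketch, not the evaluation toolbox used in the paper: masks are flattened lists, the 0.5 threshold and the convention of returning 1.0 when both masks are empty are assumptions, and the mean values reported in the tables would be obtained by averaging such per-image scores over a test set.

```python
def overlap_scores(pred, gt):
    """Return (Dice, IoU) for two flattened binary masks."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    if tp + fp + fn == 0:
        # Both masks are empty; treating this as a perfect match is an assumption.
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


def mean_absolute_error(prob, gt):
    """Average absolute difference between predicted probabilities and labels."""
    return sum(abs(p - g) for p, g in zip(prob, gt)) / len(gt)


# Toy example with six pixels.
prob = [0.9, 0.6, 0.1, 0.4, 0.8, 0.0]
gt = [1, 0, 0, 1, 1, 0]
pred = [1 if p >= 0.5 else 0 for p in prob]  # assumed threshold of 0.5
dice, iou = overlap_scores(pred, gt)
mae = mean_absolute_error(prob, gt)
```

Here TP = 2, FP = 1, and FN = 1, which gives a Dice of 4/6 and an IoU of 2/4, showing that Dice is always at least as large as IoU for the same prediction.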
Comparative experiments
Our algorithm was compared against 19 state-of-the-art methods, all trained on the same dataset. Except for Polyp-LVT, which lacks test results on the ETIS and CVC-ColonDB datasets, all methods were evaluated on all testing datasets. The compared algorithms are UNet (7), UNet++ (8), SFA (13), ACSNet (18), PraNet (15), EUNet (20), MSEG (19), SANet (34), MSNet (21), Polyp-PVT (24), DCRNet (68), M2SNet (35), PPNet (44), PSNet (38), CFANet (22), Polyp-LVT (46), CGMA-Net (25), DFINet (26), and ADSANet (45). Apart from UNet and UNet++, which are general-purpose segmentation models, these methods are specifically designed for polyp segmentation. UNet pioneered the use of a U-shaped network for image segmentation, delivering strong results across a range of tasks. UNet++ further enhanced performance by redesigning the skip pathways. These methods predominantly follow an encoder-decoder architecture, employing various optimization techniques to improve feature transmission and semantic information extraction.
The quantitative analysis results are presented in Tables 2-6 and Figures 6-10. As shown in Table 1, our training dataset primarily consists of most of the data from CVC-ClinicDB and Kvasir-SEG, while the remaining data from these datasets are used as separate testing datasets. The other three testing datasets are entirely unrelated to the training set. Therefore, the results in Tables 2-4 and Figures 6-8 are used to evaluate the generalization ability of the proposed method, whereas Tables 5,6 and Figures 9,10 assess its learning ability.
Table 2 (ETIS)
| Methods | Year | mDice | mIoU | Weighted F-measure | S-measure | E-measure | MAE |
|---|---|---|---|---|---|---|---|
| UNet | 2015 | 0.398 | 0.335 | 0.366 | 0.684 | 0.643 | 0.036 |
| UNet++ | 2018 | 0.401 | 0.344 | 0.390 | 0.683 | 0.629 | 0.035 |
| SFA | 2019 | 0.297 | 0.217 | 0.231 | 0.557 | 0.531 | 0.109 |
| ACSNet | 2020 | 0.556 | 0.496 | 0.506 | 0.736 | 0.742 | 0.096 |
| PraNet | 2020 | 0.628 | 0.567 | 0.600 | 0.794 | 0.808 | 0.031 |
| EUNet | 2021 | 0.687 | 0.609 | 0.636 | 0.793 | 0.807 | 0.067 |
| MSEG | 2021 | 0.578 | 0.509 | 0.530 | 0.754 | 0.737 | 0.059 |
| SANet | 2021 | 0.750 | 0.654 | 0.685 | 0.849 | 0.881 | 0.015 |
| MSNet | 2021 | 0.719 | 0.664 | 0.678 | 0.840 | 0.830 | 0.020 |
| Polyp-PVT | 2021 | 0.787 | 0.706 | 0.750 | 0.871 | 0.906 | 0.013 |
| DCRNet | 2022 | 0.700 | 0.630 | 0.671 | 0.828 | 0.854 | 0.015 |
| M2SNet | 2023 | 0.749 | 0.678 | 0.712 | 0.846 | 0.872 | 0.017 |
| PPNet | 2023 | 0.784 | 0.716 | 0.743 | 0.871 | 0.885 | 0.013 |
| PSNet | 2023 | 0.787 | 0.713 | – | – | – | – |
| CFANet | 2023 | 0.732 | 0.655 | 0.693 | 0.845 | 0.892 | 0.014 |
| CGMA-Net | 2024 | 0.718 | 0.649 | 0.682 | – | 0.854 | – |
| DFINet | 2024 | 0.791 | 0.717 | 0.755 | – | 0.896 | – |
| ADSANet | 2025 | 0.813 | 0.743 | 0.796 | 0.876 | 0.910 | 0.010 |
| PAC-P2T | – | 0.815† | 0.744† | 0.785 | 0.877 | 0.913 | 0.013 |
† indicates the best result. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer; PVT, pyramid vision Transformer.
Table 3 (CVC-300)
| Methods | Year | mDice | mIoU | Weighted F-measure | S-measure | E-measure | MAE |
|---|---|---|---|---|---|---|---|
| UNet | 2015 | 0.710 | 0.627 | 0.684 | 0.843 | 0.847 | 0.022 |
| UNet++ | 2018 | 0.707 | 0.624 | 0.687 | 0.839 | 0.834 | 0.018 |
| SFA | 2019 | 0.467 | 0.329 | 0.341 | 0.640 | 0.644 | 0.065 |
| ACSNet | 2020 | 0.856 | 0.788 | 0.830 | 0.921 | 0.943 | 0.010 |
| PraNet | 2020 | 0.871 | 0.797 | 0.843 | 0.925 | 0.950 | 0.010 |
| EUNet | 2021 | 0.837 | 0.765 | 0.805 | 0.904 | 0.919 | 0.015 |
| MSEG | 2021 | 0.874 | 0.804 | 0.852 | 0.924 | 0.948 | 0.009 |
| SANet | 2021 | 0.888 | 0.815 | 0.859 | 0.928 | 0.962 | 0.008 |
| MSNet | 2021 | 0.869 | 0.807 | 0.849 | 0.925 | 0.943 | 0.010 |
| Polyp-PVT | 2021 | 0.900 | 0.833 | 0.884 | 0.935 | 0.973 | 0.007 |
| DCRNet | 2022 | 0.863 | 0.787 | 0.825 | 0.923 | 0.939 | 0.013 |
| M2SNet | 2023 | 0.903 | 0.842 | 0.881 | 0.939 | 0.965 | 0.009 |
| PPNet | 2023 | 0.899 | 0.839 | 0.879 | 0.937 | 0.966 | 0.006 |
| PSNet | 2023 | 0.877 | 0.802 | – | – | – | – |
| CFANet | 2023 | 0.893 | 0.827 | 0.875 | 0.938 | 0.978 | 0.008 |
| Polyp-LVT | 2024 | 0.904 | 0.835 | – | – | 0.967 | 0.006 |
| CGMA-Net | 2024 | 0.865 | 0.794 | 0.833 | – | 0.934 | – |
| DFINet | 2024 | 0.886 | 0.818 | 0.860 | – | 0.953 | – |
| ADSANet | 2025 | 0.909† | 0.844 | 0.898 | 0.939 | 0.977 | 0.006 |
| PAC-P2T | – | 0.907 | 0.846† | 0.890 | 0.939 | 0.971 | 0.005 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer; PVT, pyramid vision Transformer.
Table 4
| Methods | Year | CVC-ColonDB | | | | | |
|---|---|---|---|---|---|---|---|
| | | mDice | mIoU | | | | MAE |
| UNet | 2015 | 0.512 | 0.444 | 0.498 | 0.712 | 0.696 | 0.061 |
| UNet++ | 2018 | 0.483 | 0.410 | 0.467 | 0.691 | 0.680 | 0.064 |
| SFA | 2019 | 0.469 | 0.347 | 0.379 | 0.634 | 0.675 | 0.094 |
| ACSNet | 2020 | 0.704 | 0.631 | 0.684 | 0.821 | 0.840 | 0.052 |
| PraNet | 2020 | 0.712 | 0.640 | 0.699 | 0.820 | 0.847 | 0.043 |
| EUNet | 2021 | 0.756 | 0.681 | 0.730 | 0.831 | 0.863 | 0.045 |
| MSEG | 2021 | 0.716 | 0.649 | 0.697 | 0.829 | 0.839 | 0.039 |
| SANet | 2021 | 0.753 | 0.670 | 0.726 | 0.837 | 0.869 | 0.043 |
| MSNet | 2021 | 0.755 | 0.678 | 0.737 | 0.836 | 0.883 | 0.041 |
| Polyp-PVT | 2021 | 0.808 | 0.727 | 0.795 | 0.865 | 0.913 | 0.031 |
| DCRNet | 2022 | 0.735 | 0.666 | 0.724 | 0.834 | 0.859 | 0.038 |
| M2SNet | 2023 | 0.758 | 0.685 | 0.737 | 0.842 | 0.869 | 0.038 |
| PPNet | 2023 | 0.791 | 0.726 | 0.776 | 0.865 | 0.905 | 0.028 |
| PSNet | 2023 | 0.795 | 0.715 | – | – | – | – |
| CFANet | 2023 | 0.743 | 0.665 | 0.728 | 0.835 | 0.898 | 0.039 |
| CGMA-Net | 2024 | 0.780 | 0.698 | 0.757 | – | 0.893 | – |
| DFINet | 2024 | 0.799 | 0.718 | 0.779 | – | 0.901 | – |
| ADSANet | 2025 | 0.752 | 0.677 | 0.745 | 0.832 | 0.880 | 0.039 |
| PAC-P2T | – | 0.812† | 0.736† | 0.800 | 0.867 | 0.915 | 0.028 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
Table 5
| Methods | Year | CVC-ClinicDB | | | | | |
|---|---|---|---|---|---|---|---|
| | | mDice | mIoU | | | | MAE |
| UNet | 2015 | 0.823 | 0.755 | 0.811 | 0.889 | 0.913 | 0.019 |
| UNet++ | 2018 | 0.794 | 0.729 | 0.785 | 0.873 | 0.891 | 0.022 |
| SFA | 2019 | 0.700 | 0.607 | 0.647 | 0.793 | 0.840 | 0.042 |
| ACSNet | 2020 | 0.882 | 0.826 | 0.873 | 0.927 | 0.947 | 0.011 |
| PraNet | 2020 | 0.899 | 0.849 | 0.896 | 0.936 | 0.963 | 0.009 |
| EUNet | 2021 | 0.902 | 0.846 | 0.891 | 0.936 | 0.959 | 0.011 |
| MSEG | 2021 | 0.909 | 0.864 | 0.907 | 0.938 | 0.961 | 0.007 |
| SANet | 2021 | 0.916 | 0.859 | 0.909 | 0.939 | 0.971 | 0.012 |
| MSNet | 2021 | 0.921 | 0.879 | 0.914 | 0.941 | 0.972 | 0.008 |
| Polyp-PVT | 2021 | 0.937 | 0.889 | 0.936 | 0.949 | 0.985 | 0.006 |
| DCRNet | 2022 | 0.896 | 0.844 | 0.890 | 0.933 | 0.964 | 0.010 |
| M2SNet | 2023 | 0.922 | 0.880 | 0.917 | 0.942 | 0.970 | 0.009 |
| PPNet | 2023 | 0.921 | 0.878 | 0.913 | 0.947 | 0.969 | 0.008 |
| PSNet | 2023 | 0.928 | 0.879 | – | – | – | – |
| CFANet | 2023 | 0.933 | 0.823 | 0.924 | 0.950 | 0.989 | 0.007 |
| Polyp-LVT | 2024 | 0.935 | 0.882 | – | – | 0.982 | 0.007 |
| CGMA-Net | 2024 | 0.927 | 0.880 | 0.922 | – | 0.976 | – |
| DFINet | 2024 | 0.937 | 0.893† | 0.932 | – | 0.980 | – |
| ADSANet | 2025 | 0.934 | 0.888 | 0.929 | 0.947 | 0.980 | 0.006 |
| PAC-P2T | – | 0.937† | 0.892 | 0.935 | 0.946 | 0.980 | 0.009 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer; PVT, pyramid vision Transformer.
Table 6
| Methods | Year | Kvasir-SEG | | | | | |
|---|---|---|---|---|---|---|---|
| | | mDice | mIoU | | | | MAE |
| UNet | 2015 | 0.818 | 0.746 | 0.794 | 0.858 | 0.881 | 0.055 |
| UNet++ | 2018 | 0.821 | 0.743 | 0.808 | 0.862 | 0.886 | 0.048 |
| SFA | 2019 | 0.723 | 0.611 | 0.670 | 0.782 | 0.834 | 0.075 |
| ACSNet | 2020 | 0.898 | 0.838 | 0.882 | 0.920 | 0.941 | 0.032 |
| PraNet | 2020 | 0.898 | 0.840 | 0.885 | 0.915 | 0.944 | 0.030 |
| EUNet | 2021 | 0.908 | 0.854 | 0.893 | 0.917 | 0.951 | 0.028 |
| MSEG | 2021 | 0.897 | 0.839 | 0.885 | 0.912 | 0.942 | 0.028 |
| SANet | 2021 | 0.904 | 0.847 | 0.892 | 0.915 | 0.949 | 0.028 |
| MSNet | 2021 | 0.907 | 0.862 | 0.893 | 0.922 | 0.944 | 0.028 |
| Polyp-PVT | 2021 | 0.917 | 0.864 | 0.911 | 0.925 | 0.956 | 0.023 |
| DCRNet | 2022 | 0.886 | 0.825 | 0.868 | 0.911 | 0.933 | 0.035 |
| M2SNet | 2023 | 0.912 | 0.861 | 0.901 | 0.922 | 0.953 | 0.025 |
| PPNet | 2023 | 0.920 | 0.874 | 0.911 | 0.927 | 0.949 | 0.024 |
| PSNet | 2023 | 0.929† | 0.879† | – | – | – | – |
| CFANet | 2023 | 0.915 | 0.861 | 0.903 | 0.924 | 0.962 | 0.023 |
| Polyp-LVT | 2024 | 0.909 | 0.851 | – | – | 0.941 | 0.024 |
| CGMA-Net | 2024 | 0.907 | 0.854 | 0.895 | – | 0.948 | – |
| DFINet | 2024 | 0.924 | 0.878 | 0.914 | – | 0.956 | – |
| ADSANet | 2025 | 0.915 | 0.865 | 0.910 | 0.924 | 0.949 | 0.023 |
| PAC-P2T | – | 0.923 | 0.873 | 0.916 | 0.923 | 0.962 | 0.023 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer; PVT, pyramid vision Transformer.
Across the ETIS, CVC-300, and CVC-ColonDB datasets, the proposed PAC-P2T delivers the strongest overall performance. As shown in Figures 6 and 8, PAC-P2T achieves the highest mDice and mIoU scores on ETIS and CVC-ColonDB. In Figure 7 (CVC-300), ADSANet slightly outperforms PAC-P2T in mDice (0.909 vs. 0.907) but is marginally inferior in mIoU (0.844 vs. 0.846). Notably, apart from our method, the rankings of the top five algorithms vary from dataset to dataset, which further supports the generalization ability of PAC-P2T. Tables 2-4 and Figures 6-8 show that UNet and UNet++ perform relatively poorly on ETIS and CVC-ColonDB. Taking their performance as a proxy for dataset difficulty, ETIS and CVC-ColonDB appear more challenging than CVC-300. Figure 6 highlights PAC-P2T’s substantial advantage on ETIS: it surpasses strong contenders such as DFINet, Polyp-PVT, PSNet, and PPNet by more than 2% in mDice, and outperforms SANet, which ranks directly behind PPNet, by 6.5%. Table 4 and Figure 8 further show that on the CVC-ColonDB dataset, PAC-P2T achieves near-leading results across all metrics, with only a minimal 0.2% gap from the best-performing method on metric. On the same dataset, PAC-P2T significantly outperforms ADSANet (mDice 0.812 vs. 0.752), demonstrating its superior generalization capability. Compared to the popular PraNet, PAC-P2T improves mDice, mIoU, and the three metrics that follow them in Table 4 by 10%, 9.6%, 10.1%, 4.7%, and 6.8%, respectively. Additionally, compared with the recently reported DFINet, PAC-P2T improves mDice and mIoU by 1.3% and 1.8%, respectively. On the relatively less challenging CVC-300 dataset, PAC-P2T achieves the best mIoU and MAE and remains close to the leading methods on the remaining metrics, further supporting the effectiveness of its architecture in polyp feature extraction and recognition.
Tables 5 and 6 and Figures 9 and 10 show that, because part of the CVC-ClinicDB and Kvasir-SEG datasets was used for model training, all methods achieve relatively strong performance on these datasets. The mDice gap between traditional U-Net architectures and the best-performing methods is around 10%, far smaller than the 40% gap observed on ETIS, indicating that these datasets are less challenging under the current training and testing setup. On CVC-ClinicDB, DFINet achieves the highest mDice and mIoU scores; PAC-P2T and Polyp-PVT attain the same top mDice but trail DFINet by 0.1% and 0.4% in mIoU, respectively. For the remaining four metrics, Polyp-PVT performs best, with PAC-P2T yielding comparable results. Overall, PAC-P2T demonstrates strong performance on CVC-ClinicDB, while CFANet and Polyp-LVT also achieve competitive results. As shown in Figure 10, the algorithms perform similarly on Kvasir-SEG, without significant stepwise differences. PSNet achieves the highest mDice and mIoU scores on this dataset. Both PAC-P2T and DFINet perform well, with PAC-P2T trailing DFINet by 0.1% and 0.5% in mDice and mIoU, respectively, but surpassing it by 0.4% in metric, while PPNet achieves the best performance in indicator. In summary, PAC-P2T exhibits strong learning ability, performing on par with recent high-performing algorithms while demonstrating superior generalization ability. Tables 2-4 and Figures 6-8 also show that traditional U-Net architectures lag behind advanced methods in capturing semantic and spatial information, leading to consistently weaker performance across datasets. Although SFA introduces edge information, its symmetric dual-branch structure for edge and region feature extraction fails to effectively leverage the advantages of both branches.
Our method enhances polyp boundary preservation through atrous convolution with varying receptive fields. The MPAF module generates global features that are integrated via a two-stage direct transmission mechanism, which preserves high-level semantic features and effectively guides multi-stage feature fusion. Together, these components help PAC-P2T achieve good polyp segmentation performance.
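The comparisons above rest on mDice, mIoU, and MAE. As a reading aid, the following minimal sketch shows how these three scores are conventionally computed per image and then averaged over a test set. It is an illustration under common conventions, not the evaluation code behind the tables: the 0.5 threshold, the smoothing constant `eps`, the function names, and the toy masks are all assumptions.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[float]]  # rows of per-pixel values in [0, 1]


def dice_iou_mae(pred: Mask, gt: Mask, threshold: float = 0.5,
                 eps: float = 1e-8) -> Tuple[float, float, float]:
    """Return (Dice, IoU, MAE) for one probability map and its binary ground truth."""
    inter = pred_area = gt_area = abs_err = 0.0
    n_pixels = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            b = 1.0 if p >= threshold else 0.0  # binarise the prediction (assumed threshold)
            inter += b * g
            pred_area += b
            gt_area += g
            abs_err += abs(p - g)  # MAE is taken on the un-thresholded map here
            n_pixels += 1
    union = pred_area + gt_area - inter
    dice = (2.0 * inter + eps) / (pred_area + gt_area + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou, abs_err / n_pixels


def mean_scores(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, ...]:
    """Average the per-image scores over a test set, giving (mDice, mIoU, MAE)."""
    per_image = [dice_iou_mae(p, g) for p, g in zip(preds, gts)]
    return tuple(sum(col) / len(per_image) for col in zip(*per_image))


# Toy 2x2 example with made-up values (not taken from any dataset)
pred = [[0.9, 0.8], [0.2, 0.1]]
gt = [[1, 1], [1, 0]]
dice, iou, mae = dice_iou_mae(pred, gt)
print(f"Dice={dice:.3f} IoU={iou:.3f} MAE={mae:.3f}")  # Dice=0.800 IoU=0.667 MAE=0.300
```

Because Dice weights the overlap twice, it is always at least as large as IoU for the same prediction, which is why the mDice column sits above the mIoU column throughout the tables.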
The qualitative comparison in Figure 11 illustrates the segmentation performance of different methods through predicted segmentation maps and difference maps with respect to the ground truth. Overall, PAC-P2T demonstrates superior performance, producing clearer polyp boundaries and better alignment with the ground truth. The five input images contain polyps with blurred boundaries, particularly the first, third, fourth, and fifth samples. In the first image, only PAC-P2T accurately extracts the polyp region, while Polyp-PVT detects the polyp but also misidentifies adjacent areas. Similar misidentifications are made by SANet, MSNet, DCRNet, and M2SNet. PraNet fails to recognize the polyp in the first image effectively, generating extensive false-positive predictions, while SANet, MSNet, and DCRNet also exhibit uncertainty, introducing varying levels of prediction noise. The third and fifth samples further reveal PraNet’s struggles with prediction noise, while DCRNet shows significant segmentation errors in the fourth and fifth images. In the second sample, although nearly half of the methods successfully extract the polyp region, most still produce large areas of misclassification, with only MSNet, Polyp-PVT, and PAC-P2T achieving reliable segmentation. The third image poses an additional challenge because lighting variations alter the polyp’s appearance compared to typical cases, yet PAC-P2T and MSEG handle this issue effectively. Performance improves across methods for the fourth image, although some misclassifications remain, and traditional U-Net architectures achieve relatively strong segmentation in this case. In the fifth image, the left boundary of the polyp is well-defined, whereas the right boundary is more ambiguous and difficult to delineate.
While SANet, PPNet, and CFANet handle the left boundary well, they are slightly less effective than PAC-P2T in segmenting the right boundary, which further demonstrates PAC-P2T’s ability to cope with challenging imaging conditions.
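To make the notion of a difference map concrete, the sketch below labels every pixel of a binary prediction against the ground truth. How the difference maps in Figure 11 were actually rendered is not stated, so this is only one plausible construction; the function names and the toy masks are hypothetical.

```python
from collections import Counter


def difference_map(pred_mask, gt_mask):
    """Label each pixel as TP, FP (false alarm), FN (missed polyp), or TN."""
    labels = {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}
    return [[labels[(int(p), int(g))] for p, g in zip(pred_row, gt_row)]
            for pred_row, gt_row in zip(pred_mask, gt_mask)]


def error_counts(diff):
    """Count how many pixels fall into each category."""
    return Counter(label for row in diff for label in row)


# Made-up 2x3 masks: 1 marks a polyp pixel, 0 marks background
pred_mask = [[1, 1, 0], [0, 1, 0]]
gt_mask = [[1, 0, 0], [1, 1, 0]]
diff = difference_map(pred_mask, gt_mask)
print(error_counts(diff))
```

In such a map, FP pixels correspond to the misidentified adjacent areas discussed above, and FN pixels correspond to the parts of a polyp that a method fails to recover.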
Ablation study
In our proposed method, SLAF and a composite loss with multiple terms are incorporated to enhance the model’s feature extraction capabilities. To evaluate their effectiveness, all ablation variants are trained and tested under the same settings. Experimental results are summarized in Tables 7-11, with the best outcomes marked with †.
Table 7
| Methods | ETIS | | | | | |
|---|---|---|---|---|---|---|
| | mDice | mIoU | | | | MAE |
| w/o , , , | 0.758 | 0.691 | 0.725 | 0.859 | 0.881 | 0.016 |
| w/o , , | 0.766 | 0.700 | 0.731 | 0.862 | 0.871 | 0.016 |
| w/o SLAF, , , , | 0.711 | 0.644 | 0.678 | 0.829 | 0.873 | 0.015 |
| w/o SLAF, , , | 0.774 | 0.704 | 0.734 | 0.866 | 0.876 | 0.015 |
| w/o SLAF | 0.757 | 0.685 | 0.714 | 0.851 | 0.860 | 0.020 |
| PAC-P2T | 0.815† | 0.744† | 0.785† | 0.877† | 0.913† | 0.013† |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
Table 8
| Methods | CVC-300 | | | | | |
|---|---|---|---|---|---|---|
| | mDice | mIoU | | | | MAE |
| w/o , , , | 0.901 | 0.842 | 0.881 | 0.941† | 0.957 | 0.005 |
| w/o , , | 0.886 | 0.825 | 0.866 | 0.930 | 0.957 | 0.006 |
| w/o SLAF, , , , | 0.896 | 0.830 | 0.870 | 0.933 | 0.961 | 0.006 |
| w/o SLAF, , , | 0.860 | 0.792 | 0.835 | 0.913 | 0.943 | 0.007 |
| w/o SLAF | 0.897 | 0.830 | 0.869 | 0.933 | 0.958 | 0.006 |
| PAC-P2T | 0.907† | 0.846† | 0.890† | 0.939 | 0.971† | 0.005† |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
Table 9
| Methods | CVC-ColonDB | | | | | |
|---|---|---|---|---|---|---|
| | mDice | mIoU | | | | MAE |
| w/o , , , | 0.781 | 0.712 | 0.765 | 0.859 | 0.888 | 0.029 |
| w/o , , | 0.802 | 0.726 | 0.784 | 0.867 | 0.911 | 0.026† |
| w/o SLAF, , , , | 0.769 | 0.697 | 0.749 | 0.847 | 0.881 | 0.033 |
| w/o SLAF, , , | 0.792 | 0.716 | 0.769 | 0.860 | 0.890 | 0.031 |
| w/o SLAF | 0.802 | 0.730 | 0.780 | 0.868† | 0.897 | 0.028 |
| PAC-P2T | 0.812† | 0.736† | 0.800† | 0.867 | 0.915† | 0.028 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
Table 10
| Methods | CVC-ClinicDB | | | | | |
|---|---|---|---|---|---|---|
| | mDice | mIoU | | | | MAE |
| w/o , , , | 0.916 | 0.872 | 0.907 | 0.944 | 0.966 | 0.009 |
| w/o , , | 0.904 | 0.862 | 0.897 | 0.937 | 0.954 | 0.010 |
| w/o SLAF, , , , | 0.900 | 0.855 | 0.889 | 0.936 | 0.953 | 0.009 |
| w/o SLAF, , , | 0.926 | 0.881 | 0.916 | 0.944 | 0.968 | 0.010 |
| w/o SLAF | 0.922 | 0.876 | 0.913 | 0.944 | 0.968 | 0.009 |
| PAC-P2T | 0.937† | 0.892† | 0.935† | 0.946† | 0.980† | 0.009† |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
Table 11
| Methods | Kvasir-SEG | | | | | |
|---|---|---|---|---|---|---|
| | mDice | mIoU | | | | MAE |
| w/o , , , | 0.912 | 0.865 | 0.899 | 0.926 | 0.945 | 0.024 |
| w/o , , | 0.913 | 0.871 | 0.902 | 0.925 | 0.945 | 0.025 |
| w/o SLAF, , , , | 0.914 | 0.868 | 0.900 | 0.925 | 0.943 | 0.022 |
| w/o SLAF, , , | 0.923 | 0.881† | 0.912 | 0.930† | 0.951 | 0.021† |
| w/o SLAF | 0.920 | 0.873 | 0.907 | 0.928 | 0.949 | 0.022 |
| PAC-P2T | 0.923† | 0.873 | 0.916† | 0.923 | 0.962† | 0.023 |
† indicates the best outcome. MAE, mean absolute error; mDice, mean Dice coefficient; mIoU, mean intersection over union; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
To evaluate the effectiveness of SLAF, we conducted three sets of comparative experiments on each testing dataset. In the first set, SLAF was retained while four of the five loss terms were removed; this was compared with removing the same losses and replacing SLAF with a 3×3 convolution. The corresponding results are presented in Rows 1 and 3 of Tables 7-11. After removing SLAF, the model’s performance declined on four datasets, with the most pronounced drop observed on the ETIS dataset (both mDice and mIoU decreased by 4.7%), though a slight improvement was noted on the Kvasir-SEG dataset. The second set, reflected in Rows 2 and 4 of Tables 7-11, retained one additional loss term while removing SLAF. This led to a performance drop on the CVC-300 dataset, with mDice and mIoU decreasing by 2.6% and 3.3%, respectively. Lastly, the difference between Rows 5 and 6 of Tables 7-11 highlights the impact of including SLAF: with SLAF integrated, the proposed PAC-P2T achieved a higher mDice on all five datasets. Collectively, these experiments confirm SLAF’s contribution to enhancing model performance.
To validate the effectiveness of the composite loss, we conducted three sets of comparative experiments. The results are grouped in Tables 7-11 as follows: the first two rows form one group, the middle two rows form another, and the final two rows constitute the third group. Comparing Row 2 with Row 1, where one loss term was added back in the former, improvements were observed on most datasets, although not on CVC-ClinicDB, where no positive impact was noted. Similarly, in the middle two rows, adding the same loss term generally still improved the model’s performance even after SLAF had been removed. Finally, the results of the last two rows demonstrate that retaining all five losses better supports the extraction of effective feature information, enabling the model to achieve superior performance across the testing datasets. These results underscore the contribution of the composite loss.
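The drops quoted for the SLAF ablation are plain differences between table rows, expressed in percentage points. The short sketch below reproduces two of them from the mDice and mIoU values in Tables 7 and 8; the dictionary layout and the helper name `drop_in_points` are illustrative choices rather than part of the original experiments.

```python
# (mDice, mIoU) copied from Table 7 (ETIS) and Table 8 (CVC-300)
etis = {"row1_with_slaf": (0.758, 0.691), "row3_without_slaf": (0.711, 0.644)}
cvc300 = {"row2_with_slaf": (0.886, 0.825), "row4_without_slaf": (0.860, 0.792)}


def drop_in_points(with_slaf, without_slaf):
    """Difference between two score tuples, in percentage points (one decimal)."""
    return tuple(round((a - b) * 100, 1) for a, b in zip(with_slaf, without_slaf))


print("ETIS    Row 1 vs Row 3:", drop_in_points(*etis.values()))    # (4.7, 4.7)
print("CVC-300 Row 2 vs Row 4:", drop_in_points(*cvc300.values()))  # (2.6, 3.3)
```

Reading the differences as percentage points rather than relative changes keeps them directly comparable across datasets with very different baseline scores.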
Table 12 reports the mean and standard deviation of mDice for several representative models across the five datasets. Overall, all methods demonstrate reasonable stability. PAC-P2T shows a low standard deviation throughout, with the smallest values on ETIS and CVC-300, indicating reliable performance. Notably, on the challenging ETIS dataset, PAC-P2T is markedly more stable than the competing methods (±0.004 vs. ±0.020 or more), further supporting its robustness.
Table 12
| Model | Dataset | | | | |
|---|---|---|---|---|---|
| | ETIS | CVC-300 | CVC-ColonDB | CVC-ClinicDB | Kvasir-SEG |
| PraNet | 0.628±0.036 | 0.871±0.051 | 0.712±0.038 | 0.899±0.048 | 0.898±0.041 |
| EUNet | 0.687±0.039 | 0.837±0.049 | 0.756±0.040 | 0.902±0.048 | 0.908±0.042 |
| CGMA-Net | 0.718±0.021 | 0.865±0.010 | 0.780±0.001 | 0.927±0.011 | 0.907±0.004 |
| ADSANet | 0.813±0.020 | 0.909±0.005 | 0.752±0.009 | 0.934±0.004 | 0.915±0.004 |
| PAC-P2T | 0.815±0.004 | 0.907±0.003 | 0.812±0.012 | 0.937±0.007 | 0.923±0.004 |
Data are presented as mDice ± SD. SD, standard deviation. mDice, mean Dice coefficient; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
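Entries of the form mDice ± SD can be produced from repeated runs as sketched below. The text does not state how many runs were used or whether the sample or population standard deviation was taken, so both options are exposed; the helper name and the three run scores are made up purely for illustration and do not correspond to any row of Table 12.

```python
import statistics


def mean_sd(scores, sample=True):
    """Return (mean, SD); sample=True uses the n-1 denominator, otherwise n."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return mean, sd


# Hypothetical mDice values from three repeated training runs
runs = [0.80, 0.82, 0.84]
mean, sd = mean_sd(runs)
print(f"{mean:.3f}±{sd:.3f}")  # 0.820±0.020
```

With only a handful of runs the two definitions differ noticeably (here 0.020 for the sample SD and about 0.016 for the population SD), so it matters which one a table reports.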
To evaluate the computational complexity and inference efficiency of PAC-P2T, we conduct a quantitative comparison with several methods, as summarized in Table 13, using three key metrics: inference speed, floating-point operations (FLOPs), and parameter count (Params). All experiments in Table 13 are performed on an NVIDIA RTX 4090 GPU with 20 GB memory, implemented in PyTorch 2.3.1, and evaluated at a fixed input resolution of 352×352. Inference speed is reported in frames per second (FPS), FLOPs are measured in giga FLOPs (G) to reflect computational cost, and the number of parameters is expressed in millions (M) to indicate model memory footprint. As shown in Table 13, although PAC-P2T has the largest number of parameters among the compared models (40.26 M), its computational cost remains moderate (38.87 G), particularly when compared with CFANet (55.36 G). In terms of inference speed, PAC-P2T (23.91 FPS) is faster than PraNet, EUNet, and CFANet, but considerably slower than ADSANet (61.10 FPS). Overall, while maintaining strong performance, PAC-P2T achieves near real-time inference under the current system configuration with manageable computational overhead, thereby satisfying practical deployment requirements.
Table 13
| Model | PraNet | EUNet | CFANet | ADSANet | PAC-P2T |
|---|---|---|---|---|---|
| Speed (FPS) | 19.35 | 15.34 | 21.58 | 61.10 | 23.91 |
| FLOPs (G) | 13.15 | 23.15 | 55.36 | 36.70 | 38.87 |
| Param (M) | 30.50 | 31.36 | 25.24 | 28.97 | 40.26 |
FLOPs, floating-point operations; FPS, frames per second; PAC-P2T, pyramid atrous convolution with pyramid pooling Transformer.
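For orientation, the sketch below shows one common way to obtain figures of the kind reported in Table 13: a warm-up phase followed by a timed loop for FPS, and closed-form counts for the parameters and multiply-accumulate operations of a single convolution layer. It is framework-agnostic and is not the measurement script used for the table; `measure_fps`, `conv2d_cost`, the dummy workload, and the layer sizes are assumptions.

```python
import time


def measure_fps(infer, frame, warmup=10, runs=100):
    """Time repeated calls to infer(frame) and return frames per second."""
    for _ in range(warmup):  # warm-up iterations are excluded from timing
        infer(frame)
    # On a GPU, asynchronous kernels must be synchronised before reading the
    # clock (in PyTorch, torch.cuda.synchronize()); omitted in this CPU-only sketch.
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    elapsed = max(time.perf_counter() - start, 1e-12)
    return runs / elapsed


def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameters and multiply-accumulates (MACs) of one k x k convolution."""
    params = c_in * c_out * k * k + (c_out if bias else 0)
    macs = c_in * c_out * k * k * h_out * w_out
    return params, macs


# Dummy workload standing in for a network forward pass
fps = measure_fps(lambda x: sum(x), list(range(1000)))
print(f"{fps:.1f} FPS")

# Hypothetical layer: 64 -> 64 channels, 3x3 kernel, 88x88 output map
print(conv2d_cost(64, 64, 3, 88, 88))  # (36928, 285474816)
```

Note that some tools report one MAC as two FLOPs while others report it as one, so FLOPs figures are only comparable when produced with the same counting convention.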
Discussion
Polyp region segmentation in colonoscopic images is a highly challenging task. Building on a U-shaped architecture, we introduce atrous convolution to construct multi-scale feature extraction and global feature fusion modules. By leveraging short connections between global and local features together with multi-stage fusion, PAC-P2T achieves a reasonable level of robustness in polyp region extraction. However, it still encounters misidentification issues in particularly challenging cases. Specifically, the global-local coupling, while effective, may become unreliable under severe distribution shifts such as illumination variations, where global semantic cues can be misleading and local features lack sufficient discriminative power, leading to polyp-like false positives. In addition, although atrous convolution enlarges the receptive field, it does not explicitly model structural continuity, making the network less robust when encountering smooth transitions between polyps and surrounding tissues, which often results in incomplete segmentation. Furthermore, the cross-layer feature interaction relies primarily on direct propagation and fusion, which may be insufficient to handle complex degradation factors such as motion blur, where both spatial details and high-level semantics are simultaneously corrupted.
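The statement that atrous convolution enlarges the receptive field can be made concrete with a little arithmetic. A kernel of size k with dilation rate d spans k + (k - 1)(d - 1) input positions while keeping the same k weights. The sketch below evaluates this for a 3-tap kernel and includes a tiny one-dimensional atrous convolution; the dilation rates 1, 2, and 4 are hypothetical and are not claimed to be those used in PAC-P2T, while the length 352 matches the input resolution reported above.

```python
def effective_kernel(k, dilation):
    """Number of input positions spanned by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def output_length(n, k, dilation=1, stride=1, padding=0):
    """Output size of a convolution along one axis of length n."""
    return (n + 2 * padding - effective_kernel(k, dilation)) // stride + 1


def atrous_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D atrous convolution in the cross-correlation form used by deep learning frameworks."""
    span = (len(kernel) - 1) * dilation
    return [sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(signal) - span)]


for d in (1, 2, 4):  # hypothetical dilation rates
    # padding = d keeps a 3-tap kernel "same"-sized at stride 1
    print(d, effective_kernel(3, d), output_length(352, 3, dilation=d, padding=d))

print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], dilation=2))  # [-4, -4, -4]
```

The loop prints an effective extent of 3, 5, and 9 with an unchanged output length of 352, which illustrates how dilation widens the context seen by each output without reducing resolution or adding weights. It also shows the limitation noted above: the sampled positions are spread further apart, and nothing in the operation itself enforces continuity between them.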
Figure 12 presents several failure cases. Due to lighting effects, some intestinal wall regions may exhibit polyp-like appearances, as seen in the first two images of Row 1 in Figure 12. Another impact of lighting variations is the formation of brightness boundaries within polyp regions, as shown in the last image of Row 3. When the transition between the polyp and the intestinal wall is overly smooth, the predicted polyp region may be incomplete, as illustrated in the last image of Row 1 and the middle image of Row 3. Additionally, endoscopic camera movement can cause image blurring, making motion blur another significant challenge in polyp segmentation, as demonstrated in the first image of Figure 12. Although PAC-P2T successfully detects the polyp region, motion blur results in unclear boundaries, leading to misidentification. Future work will focus on addressing the challenges posed by low-quality images, reflections, and smooth boundaries. We will explore integrating image enhancement networks, improving the preservation of image details, and advancing high-resolution image segmentation to further enhance algorithm performance.
Conclusions
In this paper, we investigate polyp segmentation and introduce PAC-P2T to enhance the robustness of polyp region extraction. To expand the receptive field while preserving spatial resolution and retaining fine-grained details, we integrate atrous convolution as a key feature extraction unit. The SLAF module captures multi-scale contextual information and facilitates feature transmission, thereby enabling a better understanding of polyp morphology. Meanwhile, the MPAF module extracts multi-scale features across multiple layers and integrates them into global representations to guide feature extraction across different branches. Comparative experiments with 19 state-of-the-art methods on five datasets, along with ablation studies, demonstrate the effectiveness of PAC-P2T in polyp segmentation.
Acknowledgments
None.
Footnote
Funding: This work was supported in part by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-409/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Mei J, Zhou T, Huang K, Zhang Y, Zhou Y, Wu Y, Fu H. A survey on deep learning for polyp segmentation: techniques, challenges and future trends. Vis Intell 2025;3:1.
- Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin 2023;73:17-48.
- le Clercq CM, Bouwens MW, Rondagh EJ, Bakker CM, Keulen ET, de Ridder RJ, Winkens B, Masclee AA, Sanduleanu S. Postcolonoscopy colorectal cancers are preventable: a population-based study. Gut 2014;63:957-63.
- Yao L, Zhang L, Liu J, Zhou W, He C, Zhang J, Wu L, Wang H, Xu Y, Gong D, Xu M, Li X, Bai Y, Gong R, Sharma P, Yu H. Effect of an artificial intelligence-based quality improvement system on efficacy of a computer-aided detection system in colonoscopy: a four-group parallel study. Endoscopy 2022;54:757-68.
- Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640-51.
- Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:2481-95.
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015; 5-9 October 2015; Munich, Germany. Springer; 2015:234-41.
- Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11.
- Wu H, Zhao Z, Wang Z. META-Unet: Multi-scale efficient transformer attention Unet for fast and high-accuracy polyp segmentation. IEEE Trans Autom Sci Eng 2023;21:4117-28.
- Brandao P, Zisimopoulos O, Mazomenos E, Ciuti G, Bernal J, Visentini-Scarzanella M, Menciassi A, Dario P, Koulaouzidis A, Arezzo A, Hawkes DJ, Stoyanov D. Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks. J Med Robot Res 2018;3:1840002.
- Bardhi O, Sierra Sosa D, Garcia Zapirain B, Elmaghraby A, editors. Automatic colon polyp detection using Convolutional encoder-decoder model. 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT); 18-20 December 2017; Bilbao, Spain. IEEE; 2017:445-8.
- Wang P, Xiao X, Brown JRG, Berzin TM, Tu M, Xiong F, Hu X, Liu P, Song Y, Zhang D, Yang X, Li L, He J, Yi X, Liu J, Liu X. Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy. Nat Biomed Eng 2018;2:741-8.
- Fang Y, Chen C, Yuan Y, Tong K, editors. Selective Feature Aggregation Network with Area-Boundary Constraints for Polyp Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019: 22nd International Conference; 13–17 October 2019; Shenzhen, China. Springer; 2019:302-10.
- Banik D, Roy K, Bhattacharjee D, Nasipuri M, Krejcar O. Polyp-Net: A multimodel fusion network for polyp segmentation. IEEE Trans Instrum Meas 2021;70:1-12.
- Fan DP, Ji GP, Zhou T, Chen G, Fu H, Shen J, Shao L. PraNet: Parallel Reverse Attention Network for Polyp Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020; 4–8 October 2020; Lima, Peru. Springer; 2020:263-73.
- Fang Y, Zhu D, Yao J, Yuan Y, Tong K-y. ABC-Net: Area-Boundary Constraint Network with Dynamical Feature Selection for Colorectal Polyp Segmentation. IEEE Sens J 2020;21:11799-809.
- Feng R, Lei B, Wang W, Chen T, Chen J, Chen DZ, Wu J. SSN: a stair-shape network for real-time polyp segmentation in colonoscopy images. 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI); 03-07 April 2020; Iowa City, IA, USA. IEEE; 2020:225-9.
- Zhang R, Li G, Li Z, Cui S, Qian D, Yu Y. Adaptive Context Selection for Polyp Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference; 4–8 October, 2020; Lima, Peru. Springer; 2020:253-62.
- Huang CH, Wu HY, Lin YL. Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv:2101.07172 [Preprint]. Available online: https://arxiv.org/abs/2101.07172
- Patel K, Bur AM, Wang G. Enhanced U-Net: A Feature Enhancement Network for Polyp Segmentation. Proc Int Robot Vis Conf 2021;2021:181-8.
- Zhao X, Zhang L, Lu H. Automatic polyp segmentation via multi-scale subtraction network. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference; 27 September–1 October 2021; Strasbourg, France. Springer; 2021:120-30.
- Zhou T, Zhou Y, He K, Gong C, Yang J, Fu H, Shen D. Cross-level Feature Aggregation Network for Polyp Segmentation. Pattern Recognit 2023;140:109555.
- Zhai C, Yang L, Liu Y, Bian G. DBG-Net: A Double-Branch Boundary Guidance Network for Polyp Segmentation. IEEE Trans Instrum Meas 2024;73:1-17.
- Dong B, Wang W, Fan DP, Li J, Fu H, Shao L. Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers. arXiv:2108.06932 [Preprint]. Available online: https://arxiv.org/abs/2108.06932
- Zheng J, Yan Y, Zhao L, Pan X. CGMA-Net: Cross-level Guidance and Multi-scale Aggregation Network for Polyp Segmentation. IEEE J Biomed Health Inform 2024;28:1424-35.
- Yue G, Li Y, Wu S, Jiang B, Zhou T, Yan W, Lin H, Wang T. Dual-Domain Feature Interaction Network for Automatic Colorectal Polyp Segmentation. IEEE Trans Instrum Meas 2024;73:1-12.
- Liu F, Hua Z, Li J, Fan L. DBMF: Dual Branch Multiscale Feature Fusion Network for polyp segmentation. Comput Biol Med 2022;151:106304.
- Choi W, Cha YJ. SDDNet: Real-time crack segmentation. IEEE Trans Ind Electron 2019;67:8016-25.
- Ali R, Cha YJ. Attention-based generative adversarial network with internal damage segmentation using thermography. Autom Constr 2022;141:104412.
- Silva J, Histace A, Romain O, Dray X, Granado B. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. Int J Comput Assist Radiol Surg 2014;9:283-93.
- Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput Med Imaging Graph 2015;43:99-111.
- Vázquez D, Bernal J, Sánchez FJ, Fernández-Esparrach G, López AM, Romero A, Drozdzal M, Courville A. A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images. J Healthc Eng 2017;2017:4037190.
- Jha D, Smedsrud PH, Riegler MA, Halvorsen P, de Lange T, Johansen D, Johansen HD. Kvasir-seg: A segmented polyp dataset. MultiMedia Modeling; 5–8 January 2020; Daejeon, South Korea. Springer; 2020:451-62.
- Wei J, Hu Y, Zhang R, Li Z, Zhou SK, Cui S. Shallow attention network for polyp segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference; 27 September–1 October 2021; Strasbourg, France. Springer; 2021:699-708.
- Zhao X, Jia H, Pang Y, Lv L, Tian F, Zhang L, Sun W, Lu H. M2SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation. arXiv:2303.10894 [Preprint]. Available online: https://arxiv.org/abs/2303.10894
- Hu K, Zhao L, Feng S, Zhang S, Zhou Q, Gao X, Guo Y. Colorectal polyp region extraction using saliency detection network with neutrosophic enhancement. Comput Biol Med 2022;147:105760.
- Zhou Q, Wang J, Li J, Zhou C, Hu H, Hu K. RMFDNet: Redundant and Missing Feature Decoupling Network for salient object detection. Eng Appl Artif Intell 2025;139:109459.
- Lewis J, Cha YJ, Kim J. Dual encoder–decoder-based deep polyp segmentation network for colonoscopy images. Sci Rep 2023;13:1183. [Crossref] [PubMed]
- Zhang R, Lai P, Wan X, Fan D-J, Gao F, Wu X-J, Li G. Lesion-Aware Dynamic Kernel for Polyp Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference; 18–22 September 2022; Singapore. Springer; 2022:99-109.
- Liu G, Jiang Y, Liu D, Chang B, Ru L, Li M. A coarse-to-fine segmentation frame for polyp segmentation via deep and classification features. Expert Syst Appl 2023;214:118975.
- Tomar NK, Jha D, Bagci U. DilatedSegNet: A deep dilated segmentation network for polyp segmentation. MultiMedia Modeling; 9–12 January 2023; Bergen, Norway. Springer; 2023:334-44.
- Dai D, Dong C, Yan Q, Sun Y, Zhang C, Li Z, Xu S. I²U-Net: A dual-path U-Net with rich information interaction for medical image segmentation. Med Image Anal 2024;97:103241. [Crossref] [PubMed]
- Du X, Xu X, Chen J, Zhang X, Li L, Liu H, Li S. UM-Net: Rethinking ICGNet for polyp segmentation with uncertainty modeling. Med Image Anal 2025;99:103347. [Crossref] [PubMed]
- Hu K, Chen W, Sun Y, Hu X, Zhou Q, Zheng Z. PPNet: Pyramid pooling based network for polyp segmentation. Comput Biol Med 2023;160:107028. [Crossref] [PubMed]
- Hu K, Wang C, Zhu H, Zhao L, Fu C, Yang W, Pan W. Adjacent-differential network with shallow attention for polyp segmentation in colonoscopy images. Sci Rep 2025;15:32777. [Crossref] [PubMed]
- Lin L, Lv G, Wang B, Xu C, Liu J. Polyp-LVT: Polyp segmentation with lightweight vision transformers. Knowl Based Syst 2024;300:112181.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 21-26 July 2017; Honolulu, HI, USA. IEEE; 2017:6230-9.
- Sarker MMK, Rashwan HA, Akram F, Banu SF, Saleh A, Singh VK, Chowdhury FU, Abdulwahab S, Romani S, Radeva P, Puig D. SLSDeep: Skin lesion segmentation based on dilated residual and pyramid pooling networks. Medical Image Computing and Computer Assisted Intervention – MICCAI 2018; 16-20 September 2018; Granada, Spain. Springer; 2018:21-9.
- Feng S, Zhao H, Shi F, Cheng X, Wang M, Ma Y, Xiang D, Zhu W, Chen X. CPFNet: Context pyramid fusion network for medical image segmentation. IEEE Trans Med Imaging 2020;39:3008-18.
- Qin Z, Liu J, Zhang X, Tian M, Zhou A, Yi S, Li H. Pyramid fusion transformer for semantic segmentation. IEEE Trans Multimedia 2024;26:9630-43.
- Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Lu T, Luo P, Shao L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 10-17 October 2021; Montreal, QC, Canada. IEEE; 2021:548-58.
- Wu YH, Liu Y, Zhan X, Cheng M-M. P2T: Pyramid pooling transformer for scene understanding. IEEE Trans Pattern Anal Mach Intell 2022;45:12760-71.
- Yue G, Li S, Cong R, Zhou T, Lei B, Wang T. Attention-guided pyramid context network for polyp segmentation in colonoscopy images. IEEE Trans Instrum Meas 2023;72:1-13.
- Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 [Preprint]. Available online: https://arxiv.org/abs/1706.05587
- Wu Z, Shen C, van den Hengel A. Bridging category-level and instance-level semantic image segmentation. arXiv:1605.06885 [Preprint]. Available online: https://arxiv.org/abs/1605.06885
- Kang DH, Cha YJ. Efficient attention-based deep encoder and decoder for automatic crack segmentation. Struct Health Monit 2022;21:2190-205. [Crossref] [PubMed]
- Huang Y, Wang Q, Jia W, Lu Y, Li Y, He X. See more than once: Kernel-sharing atrous convolution for semantic segmentation. Neurocomputing 2021;443:26-34.
- Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Computer Vision – ECCV 2018; 8–14 September 2018; Munich, Germany. Springer; 2018:833-51.
- Zhao T, Wu X. Pyramid Feature Attention Network for Saliency Detection. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 15-20 June 2019; Long Beach, CA, USA. IEEE; 2019:3080-9.
- Guo Z, Bian L, Wei H, Li J, Ni H, Huang X. DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation. IEEE Trans Circuits Syst Video Technol 2025;35:3679-92.
- Das PK, Dash A, Meher S. ACDSSNet: Atrous Convolution-Based Deep Semantic Segmentation Network for Efficient Detection of Sickle Cell Anemia. IEEE J Biomed Health Inform 2024;28:5676-84. [Crossref] [PubMed]
- Halder A, Dey D. Atrous convolution aided integrated framework for lung nodule segmentation and classification. Biomed Signal Process Control 2023;82:104527.
- Son H, Lee J, Cho S, Lee S. Single image defocus deblurring using kernel-sharing parallel atrous convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 10-17 October 2021; Montreal, QC, Canada. IEEE; 2021:2622-30.
- Qiao S, Chen L-C, Yuille A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 20-25 June 2021; Nashville, TN, USA. IEEE; 2021:10208-19.
- Tam AY, Mao YJ, Lai DK, Chan AC, Cheung DSK, Kearns W, Wong DW, Cheung JC. SaccpaNet: A Separable Atrous Convolution-Based Cascade Pyramid Attention Network to Estimate Body Landmarks Using Cross-Modal Knowledge Transfer for Under-Blanket Sleep Posture Classification. IEEE J Biomed Health Inform 2026;30:1593-604. [Crossref] [PubMed]
- Fan D-P, Cheng M-M, Liu Y, Li T, Borji A. Structure-measure: A new way to evaluate foreground maps. 2017 IEEE International Conference on Computer Vision (ICCV); 22-29 October 2017; Venice, Italy. IEEE; 2017:4638-47.
- Fan DP, Gong C, Cao Y, Ren B, Cheng MM, Borji A. Enhanced-alignment measure for binary foreground map evaluation. arXiv:1805.10421 [Preprint]. Available online: https://arxiv.org/abs/1805.10421
- Yin Z, Liang K, Ma Z, Guo J. Duplex contextual relation network for polyp segmentation. 2022 IEEE International Symposium on Biomedical Imaging (ISBI); 28-31 March 2022; Kolkata, India. IEEE; 2022:998-1002.



