DeepNeXt: a lightweight polyp segmentation algorithm based on multi-scale attention
Introduction
Colorectal cancer (CRC) is one of the most common cancers worldwide, with an estimated one to two million new cases and 700,000 fatalities annually. As such, CRC is the third most prevalent cancer and the fourth leading cause of cancer-induced mortality, following lung, liver, and stomach cancers. In terms of prevalence rates, CRC is the second most common cancer among women, accounting for 9.2%, and the third most common among men, with a rate of 10% (1). According to the cancer statistics released by China’s National Cancer Center in 2018, the incidence of CRC in China stands as the second highest globally, predominantly affecting individuals aged over 40 years.
Most colorectal cancer lesions initially emerge as adenomatous polyps within the intestinal mucosa. As the pathology advances, these polyps can evolve into malignant neoplasms and disseminate to extraintestinal sites. Therefore, the implementation of early screening for intestinal polyps is critical, with the potential to significantly elevate survival rates to approximately 90% (2). Clinical observations indicate that polyps exhibit considerable heterogeneity in size, color, and surface patterns during colonoscopic evaluations. This variability, coupled with the often minimal morphological differentiation from adjacent mucosal tissue, demands highly skilled clinicians and increases the likelihood of diagnostic inaccuracies. To address these diagnostic challenges, the adoption of advanced medical imaging technologies for adjunctive diagnosis has become standard practice. Precise delineation of polyp size and morphology is pivotal for the accurate diagnosis of gastrointestinal polyposis (3); accurate polyp segmentation therefore augments the objectivity and precision of clinical diagnostics (4) and holds significant implications for clinical practice.
Polyp segmentation methods can be broadly categorized into two types: segmentation methods based on manually designed traditional machine learning algorithms, and deep learning segmentation algorithms that have emerged with the development of deep learning techniques. Earlier traditional machine learning algorithms for polyp segmentation failed to cope with complex images and had a high misdiagnosis rate. Mamonov et al. (5) in 2014 proposed an automated polyp detection system for colon capsule endoscopy that uses feature extraction and machine learning to characterize and classify the preprocessed image and determine whether a polyp is present. To make the extracted feature information more comprehensive, Tajbakhsh et al. (6) in 2016 proposed a method for automated polyp detection in colonoscopy videos using shape and contextual information, in which shape feature descriptors extract features from candidate regions in each video frame and a support vector machine (SVM) classifier then decides whether each candidate is a polyp. In 2018, Misawa et al. (7) developed an original artificial intelligence (AI)-assisted computer-aided detection (CADe) system that alerts to the possible presence of polyps by turning the four corners of the endoscopic image red when the predicted probability exceeds a critical value. Other traditional methods include wavelet-domain probabilistic graphical modeling (8) and threshold-based segmentation (9).
Amidst the rapid advancements in deep neural networks, convolutional neural networks (CNNs) have been extensively applied in the domain of medical image segmentation. The U-net (10) architecture, in particular, has demonstrated considerable strength in this field, attributed to its distinct encoder-decoder structure that proficiently facilitates feature extraction and segmentation of medical images. Furthermore, this encoder-decoder framework has been widely adopted in polyp segmentation tasks, undergoing various modifications to its modules and overall structure to enhance segmentation efficacy. Taking inspiration from U-net, the U-net++ (11) architecture has introduced dense convolutional blocks to address the issue of feature disappearance and to amplify the capability for multi-scale feature extraction, thereby refining the original U-net design. Concurrently, the DeepLab (12-14) series of networks, which employ atrous convolution, have further advanced the development of segmentation networks. Their unique multi-scale feature extraction modules have significantly improved segmentation performance. In pursuit of superior outcomes in medical image segmentation, scholars in the field have proposed several network enhancements, maintaining a continuous effort towards optimizing segmentation accuracy. Zhang et al. (15) proposed a CNN called SSDGPNet for polyp detection, which reuses the data lost in pooling layers and connects this data as additional feature maps, achieving significant results in small polyp detection. In 2019, Choi and Cha (16) introduced the SDDNet segmentation model, composed of standard convolutions, densely connected separable convolution modules, improved Atrous Spatial Pyramid Pooling (ASPP) modules, and decoder modules, achieving remarkable results in crack segmentation. Kang and Cha (17) proposed a novel Semantic Transformer Representation Network (STRNet). The model consists of an encoder based on squeeze-and-excitation attention, a decoder based on multi-head attention, a coarse-tuned upsampling module, a Focal-Tversky loss function, and a learnable swish activation function for pixel-level real-time crack segmentation in complex scenes. Additionally, a method for evaluating the complexity of image scenes is also proposed. Furthermore, in 2020, Krenzer et al. (18) demonstrated the value of convolutional networks in endoscopic polyp recognition by comparing the YOLOv3 model with the Faster R-CNN and Cascade Mask R-CNN models. In 2020, Jha et al. introduced the Double U-net (19), an innovative approach that enhances semantic information capture by stacking two U-net architectures and interlinking them with skip connections. Concurrently, it employs ASPP spatial pyramid pooling to assimilate contextual information, thereby achieving superior segmentation outcomes. On another front, in 2021, Wan et al. (20) proposed a polyp object detection model based on the YOLOv5 framework, incorporating a self-attention mechanism. By integrating the attention mechanism during the feature extraction process, the contribution of information-rich feature channels was enhanced, resulting in significant improvements in both detection speed and effectiveness for small polyps and polyps with low contrast. Wei et al. advanced the SANet network (21), which further refines segmentation efficacy by diminishing color diversity and optimizing the utilization of shallow feature information through color exchange operations and a shallow attention mechanism.
Furthermore, the research community has devoted attention to the intricate capture of contextual relationships. Yin et al. unveiled the DCRNet (22), a network adept at seizing multidimensional information and achieving proficient segmentation by harmonizing internal and external contextual relationship modules. In a similar vein, Yue et al. proposed the BCNet (23), which effectively combines cross-layer feature interactions between attention layers with a global feature aggregation module, enabling nuanced fusion of contextual information across layers and amalgamating it with global insights for enhanced boundary segmentation precision. Zhang et al. (24) proposed the LDNet model, which combines a dynamic kernel generation and update scheme. The model enhances the contrast between polyps and background regions through the lesion-aware cross-attention (LCA) module and improves segmentation accuracy by capturing long-distance contextual relationships with the efficient self-attention (ESA) module.
In 2022, Brand et al. (25) used four different types of processors to test a total of 519,856 images from a dataset including 10 complete colonoscopy videos, achieving an overall accuracy of 98.59% in detecting visible instruments with a CNN, which demonstrated the reliability of using AI technology for polyp detection in colonoscopies. In the same year, Brand et al. (26) also performed a frame-by-frame analysis of 111 colonoscopy videos on a commercial CADe system, reporting performance parameters including per-polyp sensitivity, per-frame sensitivity, and time to first polyp detection; the system was found to work reliably, though false positives (FPs) remained a side effect. In 2022, Krenzer et al. (27) proposed a semi-automatic annotation framework in which experts review videos and annotate all frames containing lesions with the help of AI, and the results can be used to train AI models. Fitting et al. (28) introduced a new polyp detection system, ENDOMIND, which showed higher sensitivity and specificity compared to commercial CADe systems. In 2023 (29), the same group launched the first fully open-source automatic polyp detection system, ENDOMIND-Advanced, which utilizes post-processing techniques based on video detection to process image streams in real time. Additionally, they designed a few-shot learning algorithm based on deep metric learning to classify polyps in cropped regions using a Transformer network (30); this algorithm created an embedding space for polyps, achieving an accuracy of 89.35%.
In 2023, Lewis et al. (31) proposed a dual encoder-decoder network for colon polyp segmentation, which incorporates attention mechanisms at various levels and modules of the encoder in combination with convolutional operations. In the dual-decoder stage, a newly introduced enhanced extended transformer decoder enables improved extraction of global information, and a merging module compensates for the losses caused by the CNN and transformer branches, resulting in accurate segmentation maps; the model was validated on multiple datasets, achieving excellent results. Roy et al. (32) put forward a ConvNeXt-based 3D encoder-decoder network, which leverages the ConvNeXt module's upscaling and downscaling blocks to preserve semantic integrity across scales and ensure consistent segmentation performance. Subsequently, Liu et al. (33) developed an encoder-decoder network grounded in the SegNet framework, integrating dendritic neurons and a deep supervision mechanism; this network employs a sophisticated feature extractor to discern precise and richly informative features from medical images, culminating in the DDNet model. Moreover, He et al. (34) proposed a data augmentation framework that exploits spatial mapping relationships within datasets to meticulously analyze target structures, achieving noteworthy segmentation results. Eisenmann et al. from the German Cancer Research Center (35) designed an international survey to elucidate the current state of algorithm development in biomedical imaging analysis; the survey revealed that 94% of the solutions are based on deep learning, with most respondents indicating that the sheer volume of images prevents immediate processing. The methodologies discussed above primarily draw upon a diverse array of foundational network architectures, enhanced by integrating attention mechanisms, sophisticated feature extraction frameworks, or data augmentation schemes. While the layering of modules has indeed elevated the segmentation prowess of these models, it has concurrently escalated their parameter counts and computational demands to substantial magnitudes. Yet precise polyp segmentation is ultimately meant to augment the diagnostic efficacy of medical practitioners, and this need has spurred the development of a suite of lightweight models designed for seamless integration into clinical devices, thereby aligning with the operational demands of clinical settings.
To address the above problems, this paper starts from the basic CNN convolutional module and designs a lightweight polyp segmentation network that improves segmentation accuracy while further controlling the model's parameter count and computational cost, so that the lightweight network is well suited to deployment on medical equipment. Our main contributions are summarized below:
- A multi-segment lightweight convolutional encoder module is proposed, composed of multiple depthwise separable convolutional layers arranged linearly, facilitating accurate feature extraction while keeping the model lightweight.
- A jumping multi-stage feature fusion network framework is constructed, employing selective jump extraction to further process and integrate key stage features. This approach addresses the issues of feature loss and insufficient depthwise separable convolutional feature extraction while maintaining feature propagation between networks.
- A multiscale attention feature encoding module is proposed, which incorporates deep strip convolution and coordinate attention (CA) mechanisms in multiple branches, and this combination enables the module to extract information from different scales and dimensions to generate feature representations with high robustness.
Methods
Overall network model
This paper adopted the DeepLabV3+ structure as the base framework and improved and optimized its structure. In the initial feature extraction stage, we constructed a feature encoder consisting of multiple layers of dual depthwise separable convolutions, each followed by batch normalization and a ReLU activation function. The features then passed through the strip-based multi-scale feature extraction module, in which strip convolutions of different kernel sizes were applied in separate branches and the CA mechanism was introduced. This enhanced the extraction of spatial and positional information from feature maps at different scales, improved polyp localization, and helped mine polyp boundary information; the strengthened extraction of multi-scale and strip-shaped targets aided in capturing irregular polyp tissue. During the feature fusion stage, we merged the output of the multi-scale attention extraction module with the feature maps from stages 2 and 4 of the convolutional encoder. Subsequently, we fed the fused multi-information feature maps into the dual depthwise separable convolutional layers for another round of feature extraction, and then performed an upsampling operation to obtain the final output. This process restored the size of the feature maps, thereby yielding higher-resolution results. The specific structure is illustrated in Figure 1.
Similar encoder-decoder structures were widely used in medical image analysis. The encoder-decoder architecture of U-net efficiently captured both local and global information in the image. Additionally, depthwise separable convolution reduced the number of parameters while maintaining good expressive power, a combination validated in the literature (36,37). Moreover, the introduction of the multi-scale attention mechanism further strengthened the model's feature extraction capability and improved sensitivity to information at different scales. The combination of feature fusion and upsampling operations helped preserve and restore the details and resolution of the feature map, leading to more accurate output.
Lightweight encoder module
In the initial feature extraction stage, we used a lightweight encoder module with a U-net-like structure, which provided efficient feature extraction; the specific structure is shown in Figure 2. We constructed a feature encoder comprising five layers of dual depthwise separable convolutions, with each depthwise separable convolution immediately followed by batch normalization and a ReLU activation function. Max-pooling downsampling operations were employed to reduce the spatial dimensions of the feature maps. After four downsampling and five double-convolution operations, the original 224×224×3 input image was gradually transformed into a 14×14×512 feature map. This design helped reduce computation while preserving key information. Consequently, the feature map not only decreased in size at this stage but also became enriched with feature representation information in each channel through the multi-level convolution operations; of particular note is the increase in the number of output channels from the initial 3 to 512. This enhancement allowed the channels to carry rich feature information, laying a solid foundation for subsequent multi-scale and multi-dimensional feature extraction.
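To make this design concrete, the following PyTorch sketch builds a five-stage encoder of dual depthwise separable convolutions with four max-pooling steps, mapping a 224×224×3 input to a 14×14×512 map. The per-stage channel widths and the 3×3 depthwise kernel size are assumptions, as the text only fixes the input and output shapes.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 + pointwise 1x1,
    each followed by batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DoubleDSConv(nn.Module):
    """One encoder stage: two stacked depthwise separable convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(DSConv(in_ch, out_ch), DSConv(out_ch, out_ch))

    def forward(self, x):
        return self.conv(x)

class LightweightEncoder(nn.Module):
    """Five double-DSConv stages with four max-pool downsamplings;
    the channel progression below is an assumption."""
    def __init__(self, widths=(64, 128, 256, 512, 512)):
        super().__init__()
        chans = [3] + list(widths)
        self.stages = nn.ModuleList(
            [DoubleDSConv(chans[i], chans[i + 1]) for i in range(5)]
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []                       # stage outputs, reused later for fusion
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(x)
            if i < 4:                    # only four downsampling operations
                x = self.pool(x)
        return feats

feats = LightweightEncoder()(torch.rand(1, 3, 224, 224))
print(feats[-1].shape)                   # torch.Size([1, 512, 14, 14])
```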
Multi-scale attention feature encoding module
This paper proposed a multi-scale attention feature extraction module (MSAN) to extract information across different scales and dimensions, resulting in more robust feature representations. The module structure is shown in Figure 3. This module combined the advantages of multi-scale information extraction and convolutional attention mechanisms, enabling better extraction of spatial features and positional information at different scales.
The multi-scale attention module consisted of four parts. First, a depthwise separable convolution aggregated local information; by performing depthwise and pointwise convolutions separately, it significantly reduced the number of parameters and the computational complexity while enhancing the ability to capture local information. Second, multi-branch depthwise convolutions applied kernels of several different sizes to the feature maps, obtaining rich multi-scale contextual information; this multi-branch structure integrated features from different scales and improved the model's recognition and segmentation performance. Third, a CA attention module was added after the depthwise convolutions to extract multi-dimensional attention information; this mechanism combined channel attention and spatial attention, weighting important information in the input feature maps while suppressing irrelevant or redundant information, thereby enhancing the model's representation capability. Finally, a 1×1 convolution fused information across channels, further improving the feature representation and the model's generalization performance.
The multi-scale attention module achieved sufficient extraction and fusion of multi-scale, multi-dimensional feature information through the synergy of these four parts. The corresponding formula was as follows:
where i={0,1,2,3,4,5} represents different branches, F represents original input data, DW represents depthwise separable convolution, Scale represents depth-strip convolution, CA represents CA attention convolution.
This structure started with a 5×5 depthwise separable convolution to aggregate local feature information, and depth-strip convolutions were used in the multi-scale information extraction branches. In each branch, we used a pair of depth-strip convolutions to simulate a standard depthwise convolution with a large kernel; the kernel sizes were set to 7, 9, 11, 15, and 21, respectively. Depth-strip convolution is a lightweight approach: for example, a standard 2D depthwise convolution with a 7×7 kernel can be mimicked with only a pair of 7×1 and 1×7 convolution operations. This method better extracts elongated targets in the segmented scene and helps capture strip-like features. After the depth-strip convolutions in each branch, we added the CA attention module to compute spatial and positional attention on the resulting feature maps. The features extracted by each branch were then fused with the residual branch along the channel dimension to generate a feature map carrying multi-dimensional information. Next, the relationships between channels were modeled using a 1×1 convolution, whose output served as attention weights to reweight the input of the multi-scale attention module. With these enhancements, our method combined the benefits of multi-scale information extraction and convolutional attention, improving the efficacy of feature extraction, which holds significant application value for tasks such as target segmentation.
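A minimal PyTorch sketch of this branch layout is shown below. The element-wise summation used to fuse the branch and residual outputs is an assumption (the "channel fusion" described above could also mean concatenation), and nn.Identity stands in for the CA module, which is sketched in the next subsection, so that this block stays self-contained.

```python
import torch
import torch.nn as nn

class MSAN(nn.Module):
    """Multi-scale attention sketch: a 5x5 depthwise separable convolution,
    parallel depth-strip branches (k = 7, 9, 11, 15, 21), per-branch
    attention, residual fusion, and a 1x1 convolution whose output
    reweights the module input."""
    def __init__(self, dim, kernel_sizes=(7, 9, 11, 15, 21)):
        super().__init__()
        self.local = nn.Sequential(               # 5x5 depthwise separable conv
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.branches = nn.ModuleList()
        self.branch_attn = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(   # a 1xk and a kx1 depth-strip conv
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            ))
            # Placeholder for the CA attention applied after each branch
            # (see the CoordinateAttention sketch in the next subsection).
            self.branch_attn.append(nn.Identity())
        self.fuse = nn.Conv2d(dim, dim, 1)        # models cross-channel relations

    def forward(self, x):
        base = self.local(x)
        out = base                                 # residual branch
        for branch, attn in zip(self.branches, self.branch_attn):
            out = out + attn(branch(base))         # aggregate branch outputs
        weights = self.fuse(out)
        return weights * x                         # reweight the module input

x = torch.rand(1, 512, 14, 14)
print(MSAN(512)(x).shape)                          # torch.Size([1, 512, 14, 14])
```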
The CA mechanism within this module was used to model and weight features at different locations of the image, as depicted in Figure 4. CA encodes channel relationships and long-range dependencies with accurate positional information. The specific operation was as follows: first, in the information-embedding step, to overcome the difficulty of preserving positional information under global average pooling, the global average pooling was decomposed along the X and Y directions; the outputs were features aggregated along the two spatial directions, yielding a pair of direction-aware feature maps with accurate positional information, which helped to localize the target of interest. The specific formula was as follows:
Second, after transposing one of the two feature matrices along the W and H directions, the pair was concatenated, and a 1×1 convolution was used for feature extraction and dimensionality reduction. A split operation based on the height and width of the original input feature map F then separated the result into two feature maps, each of which was fed into a further 1×1 convolution for feature extraction and dimension restoration. After a Sigmoid activation function was applied for nonlinear processing, the C×H×1 and C×1×W feature matrices were finally obtained in the H and W dimensions, respectively. The formula for this step is as follows:
Finally, the feature matrices extracted in the two dimensions above became the attention weights in those dimensions; we multiplicatively weighted the original features by the outputs of the two dimensions to obtain the output feature maps of the multi-dimensional convolutional attention module. The formula for this step was as follows:
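The operations above follow the standard coordinate attention formulation, which can be sketched in PyTorch as follows; the channel reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention (CA): pool along H and W separately so positional
    information is preserved, then produce per-direction attention weights."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        mid = max(8, dim // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # -> N x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # -> N x C x 1 x W
        self.conv1 = nn.Conv2d(dim, mid, 1)             # shared reduction conv
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, dim, 1)            # restores channels (H path)
        self.conv_w = nn.Conv2d(mid, dim, 1)            # restores channels (W path)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                            # N x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # N x C x W x 1 (transposed)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)        # split back into H and W parts
        a_h = torch.sigmoid(self.conv_h(y_h))                       # N x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # N x C x 1 x W
        return x * a_h * a_w                            # multiplicative reweighting

x = torch.rand(1, 512, 14, 14)
print(CoordinateAttention(512)(x).shape)                # torch.Size([1, 512, 14, 14])
```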
Feature fusion module
Due to the high similarity between polyps and surrounding tissue, incomplete feature extraction could result in blurred polyp segmentation edges. To address this issue, we proposed a feature fusion module. This module concatenated and weighted feature maps from different extraction stages of the feature encoding phase to overcome the feature loss caused by multi-stage convolution operations, thus highlighting polyp edge information. The specific process is shown in Figure 5.
During the feature encoding phase, the convolution operations increased the number of feature channels; while this enriched the multi-channel feature information, it inevitably caused some loss of feature information within individual channels, potentially leading to less precise polyp segmentation results. To overcome the feature loss caused by repeated convolution operations in the encoder stage, we selectively concatenated feature maps from different stages of the encoder. To fuse coarse- and fine-grained features, we selected the output features from the second and fourth stages of the encoder and the output feature maps from the multi-scale attention feature encoding module for concatenation. Before concatenation, 1×1 convolution operations were applied to adjust the feature maps to the same size, enabling a concatenation operation that produced richer and more complete multi-scale fused feature maps. This approach ensured feature integrity while enhancing polyp edge information. A dual-layer depthwise separable convolution was then used to re-extract features from the fused maps, followed by a 1×1 convolution to adjust the target output size, resulting in a segmentation output with clearer edges.
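The fusion step can be sketched in PyTorch as below. The intermediate channel width, the choice to fuse at the stage-2 resolution, and the use of bilinear interpolation to align spatial sizes (the text only mentions 1×1 convolutions for size adjustment) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoder(nn.Module):
    """Fuse the stage-2 and stage-4 encoder maps with the MSAN output,
    re-encode with two depthwise separable convolutions, and upsample."""
    def __init__(self, c2, c4, c_msan, mid=128, num_classes=1):
        super().__init__()
        self.proj2 = nn.Conv2d(c2, mid, 1)        # 1x1 projections before concatenation
        self.proj4 = nn.Conv2d(c4, mid, 1)
        self.proj_m = nn.Conv2d(c_msan, mid, 1)
        self.refine = nn.Sequential(              # dual depthwise separable convolution
            nn.Conv2d(3 * mid, 3 * mid, 3, padding=1, groups=3 * mid),
            nn.Conv2d(3 * mid, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid),
            nn.Conv2d(mid, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(mid, num_classes, 1)  # adjust to the target output channels

    def forward(self, f2, f4, f_msan, out_size=(224, 224)):
        size = f2.shape[-2:]                        # fuse at the stage-2 resolution
        f4 = F.interpolate(self.proj4(f4), size, mode="bilinear", align_corners=False)
        f_msan = F.interpolate(self.proj_m(f_msan), size, mode="bilinear", align_corners=False)
        x = self.refine(torch.cat([self.proj2(f2), f4, f_msan], dim=1))
        return F.interpolate(self.head(x), out_size, mode="bilinear", align_corners=False)

f2, f4 = torch.rand(1, 128, 112, 112), torch.rand(1, 512, 28, 28)
f_msan = torch.rand(1, 512, 14, 14)
print(FusionDecoder(128, 512, 512)(f2, f4, f_msan).shape)   # torch.Size([1, 1, 224, 224])
```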
Results
Experimental environment and data processing
The hardware environment of this experiment was an NVIDIA GeForce RTX 3090 graphics card, the operating system was Windows 11, and the programming language was Python 3.7; all programs were implemented in the PyTorch framework. All configurations in the experiments were identical, with the specific network parameters set as follows: the input image size was 224×224, the number of iterations was 100, the batch size was 10, the initial learning rate was 0.0001, and the optimizer was Adam (Adaptive Moment Estimation).
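The training setup above corresponds to the following minimal PyTorch sketch; the model is a trivial stand-in (a single convolution, not the DeepNeXt network) and the data tensors are random placeholders for a real dataloader.

```python
import torch
import torch.nn as nn

# Hyperparameters reported in the paper.
EPOCHS, BATCH_SIZE, LR, IMG_SIZE = 100, 10, 1e-4, 224

model = nn.Conv2d(3, 1, 3, padding=1)           # placeholder segmentation model
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

for epoch in range(EPOCHS):
    # Stand-in for one mini-batch from a real dataloader.
    images = torch.rand(BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE)
    masks = torch.randint(0, 2, (BATCH_SIZE, 1, IMG_SIZE, IMG_SIZE)).float()

    pred = torch.sigmoid(model(images))
    inter = (pred * masks).sum()
    loss = 1 - (2 * inter + 1e-6) / (pred.sum() + masks.sum() + 1e-6)  # Dice loss (see below)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```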
This paper validated the proposed method using the Kvasir Segmentation Dataset (Kvasir-SEG dataset) and the Colorectal Cancer-Clinic Database (CVC-ClinicDB dataset). The Kvasir-SEG dataset, used for gastrointestinal endoscopy image segmentation tasks, was created in 2017 by a team of researchers at the University of Oslo, Norway, and contained images of a wide range of gastric lesions including ulcers, polyps, and bleeding. CVC-ClinicDB consisted of 612 static images extracted from colonoscopy videos. In terms of experimental setup, 90% of the images were used as a training set and the remaining 10% as a validation set for evaluating the performance of the model. The input image size was 224×224 pixels; using 224×224 helped reduce the number of parameters and the computation of the model, achieving a better lightweight design. The dataset information is shown in Table 1.
Table 1
Dataset | Data volume (image) | Training set (image) | Test set (image) | Size (pixel) |
---|---|---|---|---|
Kvasir-SEG | 1,000 | 880 | 120 | 224×224 |
CVC-ClinicDB | 612 | 550 | 62 | 224×224 |
Experimental validation on the Kvasir-SEG dataset and the CVC-ClinicDB dataset. Kvasir-SEG dataset, Kvasir Segmentation Dataset; CVC-ClinicDB dataset, Colorectal Cancer-Clinic Database.
Loss function
In this paper, we used Dice loss as the loss function, which relies on the Dice similarity coefficient. Dice was a function that evaluated the similarity of two regions and, here, represented the overlap between the region segmented by the model and the region labeled by the expert, with values in the range [0,1], where 1 represented complete overlap and 0 represented no overlap at all. For the target segmentation problem in polyp images, the final segmentation divided the target region from the background region; such a division is known as a binary classification problem. The number of pixels correctly predicted as polyp structures was called true positive (TP), the number of pixels correctly predicted as background regions was called true negative (TN), the number of background pixels incorrectly predicted as polyp structures was called false positive (FP), and the number of polyp pixels incorrectly predicted as background was called false negative (FN). Dice loss, used as the loss function in this paper, was minimized as much as possible during training. Dice loss was calculated as:
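Expressed with the pixel counts defined above, the Dice coefficient and Dice loss take the standard form below (a reconstruction of the omitted equation; any smoothing constant used in practice is not shown):

```latex
\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
\qquad
L_{\mathrm{Dice}} = 1 - \mathrm{Dice}
```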
Evaluation metrics
In this study, the Dice, mIOU, F1-score (balanced F score), and Recall metrics were employed as evaluative benchmarks for the efficacy of polyp image segmentation models, and the lightness of the models was assessed by quantifying their FLOPs and parameters (Params). Comparative analyses were conducted against architectures such as DeepLabV3+, U-Net, and U-Net++, with the precise definitions of these evaluative indices as follows:
- Recall: the proportion of ground-truth polyp pixels that were correctly predicted as polyp.
- Dice similarity coefficient (Dice): a function that evaluated the similarity of two regions; in this experiment, it indicated the overlap between the region segmented by the model and the region labeled by the expert, in the range [0,1].
- Mean intersection over union (mIOU): evaluated the segmentation performance of the model by the ratio of the intersection to the union of the true and predicted polyp pixels. As shown in equation (4), where Vseg denoted the predicted region and Vgt denoted the true region, mIOU ranged over [0,1].
- Precision: the proportion of pixels predicted to be polyp structures that were correctly predicted.
- Floating-point operations (FLOPs): the number of floating-point operations required to execute the model, commonly used to measure its computational complexity.
- Params: the number of parameters that needed to be learned in the model, another important metric of model complexity; training models with more parameters required more graphics processing unit (GPU) memory.
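For reference, the sketch below computes these pixel-level metrics from binary masks using their standard definitions; averaging the foreground and background IOU to obtain mIOU is an assumption about how the score is aggregated.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-6):
    """Pixel-level metrics for binary polyp masks (pred and gt are boolean
    arrays of the same shape)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    iou_fg = tp / (tp + fp + fn + eps)          # polyp (foreground) IOU
    iou_bg = tn / (tn + fp + fn + eps)          # background IOU
    return {"precision": precision, "recall": recall,
            "dice": dice, "miou": (iou_fg + iou_bg) / 2}

pred = np.zeros((224, 224), dtype=bool); pred[50:150, 50:150] = True
gt = np.zeros((224, 224), dtype=bool); gt[60:160, 60:160] = True
print(segmentation_metrics(pred, gt))
```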
Analysis of results
In this study, we performed experiments on the publicly available Kvasir-SEG and CVC-ClinicDB datasets and conducted comparative analyses with seven advanced deep neural network models: the compact U-Net (10), DeepLabV3+ (38), and ConvUNeXt (39), and the larger-scale U-Net++ (11), TransUnet (40), SwinUnet (41), and TGANet (42). Notably, to balance computational efficiency with accuracy, the DeepLabV3+ model employed the ResNet50 architecture as its backbone network, aiming to limit the number of parameters and the computational demand. For segmentation performance, we compared the mIOU, Precision, Dice, and Recall metrics of the different models; for lightweighting, we compared the FLOPs and Params metrics. The experimental results are shown in Tables 2-4.
Table 2
Model | mIOU (%) | Precision (%) | Dice (%) | Recall (%) |
---|---|---|---|---|
U-net | 66.22 | 76.79 | 77.72 | 78.77 |
DeepLabV3+ | 82.16 | 90.50 | 86.43 | 89.05 |
U-net++ | 79.08 | 84.68 | 87.63 | 91.54 |
ConvUNeXt | 71.54 | 87.31 | 83.02 | 80.56 |
TransUnet | 79.62 | 96.80 | 86.80 | 88.64 |
Swin-Unet | 74.87 | 95.30 | 83.38 | 87.15 |
TGANet | 82.72 | 92.59 | 88.62 | 89.61 |
DeepNeXt | 83.91 | 93.35 | 90.89 | 91.85 |
Experimental validation on the Kvasir-SEG dataset. Kvasir-SEG dataset, Kvasir Segmentation Dataset; mIOU, mean intersection over union.
Table 3
Model | mIOU (%) | Precision (%) | Dice (%) | Recall (%) |
---|---|---|---|---|
U-net | 86.22 | 93.68 | 92.51 | 91.60 |
DeepLabV3+ | 87.24 | 94.03 | 92.92 | 92.76 |
U-net++ | 86.71 | 93.60 | 92.80 | 92.18 |
ConvUNeXt | 84.07 | 93.14 | 91.17 | 89.59 |
TransUnet | 86.82 | 93.83 | 92.90 | 92.15 |
Swin-Unet | 54.61 | 74.11 | 70.28 | 67.90 |
TGANet | 86.49 | 93.41 | 92.46 | 91.45 |
DeepNeXt | 87.37 | 94.54 | 92.97 | 92.16 |
Experimental validation on the CVC-ClinicDB dataset. CVC-ClinicDB dataset, Colorectal Cancer-Clinic Database; mIOU, mean intersection over union.
Table 4
Model | FLOPs (G) | Params (M) |
---|---|---|
U-net | 10.53 | 7.77 |
DeepLabV3+ | 5.05 | 5.81 |
U-net++ | 28.77 | 15.96 |
ConvUNeXt | 7.25 | 3.50 |
TransUnet | 38.52 | 96.34 |
Swin-Unet | 42.68 | 105.32 |
TGANet | 44.64 | 108.81 |
DeepNeXt | 3.04 | 1.51 |
FLOPs, floating point operations; Params, parameters; G, giga; M, million.
Table 2 compares the segmentation performance of different polyp segmentation models on the Kvasir-SEG dataset. As shown in the table, the proposed DeepNeXt model surpassed classic lightweight models such as U-net, the lightweight version of DeepLabV3+, and ConvUNeXt in overall segmentation performance, achieving the highest scores among them in mIOU, Precision, Dice, and Recall. Compared with recent large-scale segmentation models such as TransUnet, SwinUnet, and TGANet, our model performed well in mIOU, Dice, and Recall, indicating that it provided more accurate polyp segmentation. However, because Transformer-based large models, with their complex architectures and large parameter counts, better capture detailed features in polyp images, our model still had room for improvement in segmentation Precision. The comparison data are illustrated in Figures 6,7.
Table 3 compares the segmentation performance of different polyp segmentation models on the CVC-ClinicDB dataset. Due to the limited size of the CVC-ClinicDB dataset, TransUnet could not fully leverage the advantages of large models. This indirectly verified that our model had an advantage in handling smaller datasets, achieving the best scores in mIOU, Precision, and Dice. The comparison data are illustrated in Figures 8,9.
It is noteworthy that while our model achieved superior segmentation performance compared to traditional lightweight models, it also performed better in terms of FLOPs and Params. Compared with the large models, the DeepNeXt model reduced the number of parameters by several tens of times, keeping the Params score stably below 2 M, indicating that our model had a sufficiently small number of parameters to be used flexibly. Additionally, compared with the larger models, DeepNeXt reduced the FLOPs score by more than ten times while ensuring accurate polyp detection, demonstrating that our model's computation is much faster and meets the requirements for rapid clinical diagnosis of polyp diseases. Experimental validation showed that our network achieves a remarkable balance between performance and lightweight design. The comparison of Params and FLOPs indicators is shown in Figure 10.
Discussion
This section presents ablation experiments on the open Kvasir-SEG polyp dataset, focusing on the design of the encoder-side and MSAN module structures to understand how the structure of each module contributed to the results.
Ablation experiments at the encoder
Table 5 presents the ablation studies conducted on the feature encoder, which, in this study, was composed of five sets of dual depthwise separable convolutions with four downsampling layers interspersed among them. To ascertain the optimal configuration for feature encoding in this architecture, the five-layer dual depthwise separable convolution at the encoder end was modified to four and six layers for comparative evaluation. The experimental results are shown in Table 5.
Table 5
Model | mIOU (%) | Precision (%) | Dice (%) | Recall (%) | FLOPs (G) | Params (M) |
---|---|---|---|---|---|---|
Four-layer | 81.03 | 87.42 | 88.99 | 90.81 | 2.91 | 1.249 |
Five-layer | 83.91 | 93.35 | 90.89 | 91.85 | 3.04 | 1.51 |
Six-layer | 77.92 | 86.03 | 86.86 | 87.77 | 3.25 | 3.63 |
mIOU, mean intersection over union; G, giga; M, million.
As indicated in Table 5, stacking four layers of dual depthwise separable convolutions gave the algorithm the lowest FLOPs and Params scores; reducing the purely convolutional layer structure inevitably lowered the model's parameter count and computation, but this extreme pursuit of lightness sacrificed the segmentation performance, so this configuration scored poorly on the segmentation metrics. When the dual depthwise separable convolutions were extended to six layers, the shrinking feature maps led to excessive convolution and a loss of features, causing a decline in segmentation performance while unnecessarily increasing the model's parameter count and computational load. In conclusion, five layers of dual depthwise separable convolutions enabled the model to achieve an optimal balance, being as lightweight as possible while ensuring accuracy.
Ablation experiments inside the MSAN module
Table 6 shows ablation experiments performed on the multi-scale attention convolution module, in which we combined a pair of strip convolutional layers with the CA attention mechanism into a single module. The strip convolution layers were key to feature extraction and also laid the foundation for the subsequent computation of CA attention, so the number of strip convolutions had a significant impact on feature extraction. To understand the contribution of the strip convolutions to the module design, the following ablation experiments were conducted. The standard group of 7×7, 11×11, and 21×21 kernels was used, and 9×9, 15×15, and 19×19 kernels were added in order, respectively: MSAN* used 7×7, 9×9, 11×11, and 21×21; MSAN** used 7×7, 9×9, 11×11, 15×15, 19×19, and 21×21; and MSAN was the module used in the algorithm of this paper. The experimental results are shown in Table 6.
Table 6
Model | mIOU (%) | Precision (%) | Dice (%) | Recall (%) | FLOPs (G) | Params (M) |
---|---|---|---|---|---|---|
MSAN* | 81.23 | 87.61 | 89.12 | 90.87 | 3.04 | 1.44 |
MSAN | 83.91 | 93.35 | 90.89 | 91.85 | 3.04 | 1.51 |
MSAN** | 80.94 | 88.01 | 88.94 | 89.95 | 3.05 | 1.58 |
MSAN* is 7×7, 9×9, 11×11, 21×21 strip convolution and MSAN** is 7×7, 9×9, 11×11, 15×15, 19×19 and 21×21 strip convolution, MSAN is the module used in the algorithm of this paper. MSAN, multi-scale attention feature extraction module; mIOU, mean intersection over union; G, giga; M, million.
Based on the data presented in the table above, it was evident that while reducing the number of convolutional layers may contribute to a further degree of model lightness, it significantly compromised the segmentation performance. Conversely, adding convolutional layers did not enhance the segmentation effectiveness and instead imposed additional burdens on the model. Therefore, the MSAN module stacking method utilized in this paper achieved the best results in terms of algorithmic segmentation performance, with the number of parameters and the volume of floating-point calculations involved being at moderate levels. This further validates the superiority of the module combination approach chosen for this study.
Presentation
Figure 11 shows the segmentation results produced by the different networks considered in this paper; from left to right are the original image, the ground-truth label, U-net, DeepLabV3+, U-net++, ConvUNeXt, SwinUnet, TransUnet, TGANet, and our DeepNeXt. As observed from the figure, the segmentation maps predicted by the original U-net exhibited pronounced over-segmentation, mistakenly segmenting areas outside of the polyp regions. Similarly, the segmentation maps generated by the relatively lightweight Mobilenetv2-based DeeplabV3+ and ConvUNeXt also displayed substantial over-segmentation. Among the large models, SwinUnet and TGANet did not segment very well, with severe under-segmentation and over-segmentation, whereas U-net++ and TransUnet produced good predictions with mostly smooth edges; although slight under-segmentation and over-segmentation remained, their results were already very close to the labels. The model proposed in this paper segmented the main polyp body accurately, and only its edge handling was less smooth compared with U-net++ and TransUnet; for an extremely lightweight network, it achieved a segmentation prediction effect that matched the large models.
Conclusions
In this study, we have proposed DeepNeXt, a novel lightweight multi-scale attention segmentation network designed to address the challenge of polyp segmentation on medical devices with limited computational resources. First, in the encoder stage, we proposed a multi-segment lightweight convolutional encoder module, which utilizes a stacked composition of multiple depthwise separable convolutional layers to achieve accurate feature extraction while keeping the model lightweight. Second, we proposed a multi-scale attention feature coding module, which incorporates depth-strip convolution and coordinate attention mechanisms in multiple branches, enabling the module to extract information from different scales and dimensions and generate feature representations with high robustness. Finally, we constructed a new network architecture that uses selective jump extraction to further process and integrate the key stage features, preserving feature transfer between networks while solving the problems of feature loss and inadequate depthwise separable convolutional feature extraction. Experimental results showed that the proposed DeepNeXt polyp segmentation model achieved excellent performance on the Kvasir-SEG dataset: it not only outperformed the lightweight models in segmentation accuracy, but also exhibited significant advantages in the number of model parameters and computational complexity. Compared with the large-scale models, the DeepNeXt model reduced computation by more than ten times and the number of network parameters by tens of times, and achieved good results on mIOU, Dice, Recall, and other indicators. We also verified the effectiveness of the modules through ablation experiments, which justified the internal design of our network. Nevertheless, our model still has significant room for improvement in segmentation accuracy scores, and it may face limitations on the challenging task of segmenting small polyps. Additionally, we did not seek professional medical opinions during the visualization phase of the segmentation results, which presents challenges for practical use. Our experiments were all conducted in a controlled experimental environment, whereas actual application conditions may not meet these standards; therefore, achieving ideal results in the presence of different polyp sizes and shapes or motion artifacts poses certain limitations and challenges.
Our proposed model not only leveraged key techniques such as depthwise separable convolution, multi-scale feature extraction, and attention mechanisms, but also achieved an efficient lightweight design for medical devices with limited resources. This study introduced an innovative lightweight multi-scale attention segmentation network tailored for medical devices with constrained computational power, offering robust support for achieving precise and efficient polyp segmentation in clinical applications.
Acknowledgments
Funding: This work was supported by
Footnote
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-985/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Stewart B, Wild CP. Editors. World Cancer Report 2014; International Agency for Research on Cancer (IARC): Lyon, France; 2014.
- Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin 2019;69:7-34. [Crossref] [PubMed]
- Ali S, Ghatwary N, Jha D, Isik-Polat E, Polat G, Yang C, et al. Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge. Sci Rep 2024;14:2032. [Crossref] [PubMed]
- Niyas S, Pawan SJ, Kumar MA, Rajan J. Medical image segmentation with 3D convolutional neural networks: A survey. Neurocomputing 2022;493:397-413.
- Mamonov AV, Figueiredo IN, Figueiredo PN, Tsai YH. Automated polyp detection in colon capsule endoscopy. IEEE Trans Med Imaging 2014;33:1488-502. [Crossref] [PubMed]
- Tajbakhsh N, Gurudu SR, Liang J. Automated Polyp Detection in Colonoscopy Videos Using Shape and Context Information. IEEE Trans Med Imaging 2016;35:630-44. [Crossref] [PubMed]
- Misawa M, Kudo SE, Mori Y, Cho T, Kataoka S, Yamauchi A, et al. Artificial Intelligence-Assisted Polyp Detection for Colonoscopy: Initial Experience. Gastroenterology 2018;154:2027-2029.e3. [Crossref] [PubMed]
- Chen H, Feng D, Cao S, Xu W, Xie Y, Zhu J, Zhang H. Slice-to-slice context transfer and uncertain region calibration network for shadow detection in remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing 2023;203:166-82.
- Fang L, Wang X, Wang L. Multi-modal medical image segmentation based on vector-valued active contour models. Information Sciences 2020;51:504-18.
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing; 2015:234-41.
- Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. Unet++: A nested u-net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing; 2018:3-11.
- Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv 2014:1412.7062.
- Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834-48. [Crossref] [PubMed]
- Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv 2017:1706.05587.
- Zhang X, Chen F, Yu T, An J, Huang Z, Liu J, Hu W, Wang L, Duan H, Si J. Real-time gastric polyp detection using convolutional neural networks. PLoS One 2019;14:e0214133. [Crossref] [PubMed]
- Choi W, Cha YJ. SDDNet: Real-time crack segmentation. IEEE Transactions on Industrial Electronics 2019;67:8016-25.
- Kang DH, Cha YJ. Efficient attention-based deep encoder and decoder for automatic crack segmentation. Struct Health Monit 2022;21:2190-205. [Crossref] [PubMed]
- Krenzer A, Hekalo A, Puppe F. Endoscopic detection and segmentation of gastroenterological diseases with deep convolutional neural networks. In: EndoCV@ ISBI; 2020:58-63.
- Jha D, Riegler MA, Johansen D, Halvorsen P, Johansen HD. Doubleu-net: A deep convolutional neural network for medical image segmentation. 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS). IEEE; 2020:558-64.
- Wan J, Chen B, Yu Y. Polyp Detection from Colorectum Images by Using Attentive YOLOv5. Diagnostics (Basel) 2021;11:2264. [Crossref] [PubMed]
- Wei J, Hu Y, Zhang R, Li Z, Zhou SK, Cui S. Shallow attention network for polyp segmentation. In: de Bruijne M, et al. editors. MICCAI 2021. LNCS, vol. 12901. Springer, Cham; 2021:699-708. Available online: https://doi.org/10.1007/978-3-030-87193-2_66
- Yin Z, Liang K, Ma Z, Guo J. Duplex contextual relation network for polyp segmentation. 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE; 2022:1-5.
- Yue G, Han W, Jiang B, Zhou T. IEEE J Biomed Health Inform 2022;26:4090-9. [Crossref] [PubMed]
- Zhang R, Lai P, Wan X, Fan DJ, Gao F, Wu XJ, Li G. Lesion-aware dynamic kernel for polyp segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2022:99-109.
- Brand M, Troya J, Krenzer A, Saßmannshausen Z, Zoller WG, Meining A, Lux TJ, Hann A. Development and evaluation of a deep learning model to improve the usability of polyp detection systems during interventions. United European Gastroenterol J 2022;10:477-84. [Crossref] [PubMed]
- Brand M, Troya J, Krenzer A, De Maria C, Mehlhase N, Götze S, Walter B, Meining A, Hann A. Frame-by-Frame Analysis of a Commercially Available Artificial Intelligence Polyp Detection System in Full-Length Colonoscopies. Digestion 2022;103:378-85. [Crossref] [PubMed]
- Krenzer A, Makowski K, Hekalo A, Fitting D, Troya J, Zoller WG, Hann A, Puppe F. Fast machine learning annotation in the medical domain: a semi-automated video annotation tool for gastroenterologists. Biomed Eng Online 2022;21:33. [Crossref] [PubMed]
- Fitting D, Krenzer A, Troya J, Banck M, Sudarevic B, Brand M, Böck W, Zoller WG, Rösch T, Puppe F, Meining A, Hann A. A video based benchmark data set (ENDOTEST) to evaluate computer-aided polyp detection systems. Scand J Gastroenterol 2022;57:1397-403. [Crossref] [PubMed]
- Krenzer A, Banck M, Makowski K, Hekalo A, Fitting D, Troya J, Sudarevic B, Zoller WG, Hann A, Puppe F. A Real-Time Polyp-Detection System with Clinical Application in Colonoscopy Using Deep Convolutional Neural Networks. J Imaging 2023;9:26. [Crossref] [PubMed]
- Krenzer A, Heil S, Fitting D, Matti S, Zoller WG, Hann A, Puppe F. Automated classification of polyps using deep learning architectures and few-shot learning. BMC Med Imaging 2023;23:59. [Crossref] [PubMed]
- Lewis J, Cha YJ, Kim J. Dual encoder-decoder-based deep polyp segmentation network for colonoscopy images. Sci Rep 2023;13:1183. [Crossref] [PubMed]
- Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jäger PF, Maier-Hein KH. Mednext: transformer-driven scaling of convnets for medical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2023:405-15.
- Liu Z, Zhang Z, Lei Z, Omura M, Wang RL, Gao S. Dendritic deep learning for medical segmentation. IEEE/CAA Journal of Automatica Sinica, 2024;11:803-5.
- He W, Zhang C, Dai J, Liu L, Wang T, Liu X, Jiang Y, Li N, Xiong J, Wang L, Xie Y, Liang X. A statistical deformation model-based data augmentation method for volumetric medical image segmentation. Med Image Anal 2024;91:102984. [Crossref] [PubMed]
- Eisenmann M, Reinke A, Weru V, Tizabi MD, Isensee F, Adler TJ, et al. Biomedical image analysis competitions: The state of current participation practice. arXiv preprint arXiv 2022:2212.08568.
- Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv 2017:1704.04861.
- Chollet F. Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition; 2017:1251-8.
- Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European conference on computer vision (ECCV); 2018:801-18.
- Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022:11976-86.
- Chen J, Lu Y, Yu Q, X Luo, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv 2021:2102.04306.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-unet: Unet-like pure transformer for medical image segmentation. European conference on computer vision. Cham: Springer Nature Switzerland; 2022:205-18.
- Tomar NK, Jha D, Bagci U, Ali S. TGANet: Text-guided attention for improved polyp segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2022:151-60.