Original Article

A streamlined U-Net convolution network for medical image processing

Ching-Hsue Cheng1, Jun-He Yang2, Yu-Chen Hsu1

1Department of Information Management, National Yunlin University of Science & Technology, Yunlin; 2Department of E-Sport Technology Management, Cheng Shiu University, Kaohsiung City

Contributions: (I) Conception and design: CH Cheng; (II) Administrative support: JH Yang, YC Hsu; (III) Provision of study materials or patients: All authors; (IV) Collection and assembly of data: JH Yang, YC Hsu; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Jun-He Yang, PhD. Department of E-Sport Technology Management, Cheng Shiu University, No. 840, Chengcing Rd., Niaosong Dist., Kaohsiung City. Email: blackwhale@gcloud.csu.edu.tw.

Background: Image segmentation is crucial in medical diagnosis, helping to identify diseased areas in images for more accurate diagnoses. The U-Net model, a convolutional neural network (CNN) widely used for medical image segmentation, has limitations in extracting global features and handling multi-scale pathological information. This study aims to address these challenges by proposing a novel model that enhances segmentation performance while reducing computational demands.

Methods: We introduce the LUNeXt model, which integrates Vision Transformers (ViT) with a redesigned convolution block structure. This model employs depthwise separable convolutions to capture global features with fewer parameters. Comprehensive experiments were conducted on four diverse medical image datasets to evaluate the model’s performance.

Results: The LUNeXt model demonstrated competitive segmentation performance with a significant reduction in parameters and floating-point operations (FLOPs) compared to traditional U-Net models. The application of explainable AI techniques provided clear visualization of segmentation results, highlighting the model’s efficacy in efficient medical image segmentation.

Conclusions: LUNeXt facilitates efficient medical image segmentation on standard hardware, reducing the learning curve and making advanced techniques more accessible to practitioners. This model balances the complexity and parameter count, offering a promising solution for enhancing the accuracy of pathological feature extraction in medical images.

Keywords: Medical image segmentation; convolutional neural network (CNN); Vision Transformers (ViT); lightweight model


Submitted Jul 13, 2024. Accepted for publication Nov 15, 2024. Published online Dec 20, 2024.

doi: 10.21037/qims-24-1429


Introduction

Image segmentation is a crucial step in medical image analysis, aimed at identifying and marking tissue lesions to aid physicians in diagnosing diseases (1). Successful segmentation cases include polyps, cells, blood vessels, and brain tumors. The rapid development of artificial intelligence, particularly in deep learning and convolutional neural networks (CNNs), has led to significant advancements in this field. Notable CNN models include VGGNet (2), Inception (3-6), ResNet (7), AlexNet (8), DenseNet (9), and MobileNet (10,11). As these CNN networks evolve, deep learning models can more accurately extract features for segmentation and classification.

Ronneberger et al. (12) introduced the U-Net model, incorporating a decoder architecture into the CNN framework, which is widely used in medical image segmentation. Many studies (13,14) have extended the U-Net model to provide more accurate segmentation results in various cases. However, the U-Net model has limitations in preserving global semantic information and multi-scale pathological information (such as cell arrangement and tumor boundaries) in medical images. Additionally, the accumulation of parameters from multiple convolution and pooling layer operations can impact the speed and performance of computer calculations, leading to overfitting and the loss of important pathological features, ultimately reducing the accuracy of image segmentation. Enhancing the accuracy of pathological feature information while balancing model complexity and parameter count remains a significant challenge.

Dosovitskiy et al. (15) designed the Vision Transformers (ViT) model, which applies self-attention mechanisms and multi-layer perceptrons (16) to computer vision. The ViT model and its extensions, such as Swin Transformer (17) and ShiftViT (18), have shown better segmentation results than traditional CNNs in image recognition tasks. Recently, researchers have combined the advantages of Transformer (16) global features and U-Net (12) local feature extraction, using parallel computing and position embedding technologies in conjunction with CNN model architectures. For example, the ConvNeXt (19) model combines the ResNet (7) architecture with the Swin Transformer (17) design, using depthwise separable convolution to extract global feature information more effectively than pure convolution operations. Additionally, Han et al. (20) designed an efficient CNN, ConvUNeXt, based on the U-Net (12) architecture and ConvNeXt (19) modules, which reduces parameters and improves medical image segmentation capabilities. These combined methods have become popular choices for target detection and image segmentation in recent years.

Although inspired by combining the above methods, the ConvUNeXt (20) module uses a 7×7 depthwise separable convolution layer, which increases the number of floating-point operations (FLOPs) and the computational complexity of the entire model architecture. This study introduces the LUNeXt model, integrating U-Net and ConvUNeXt with redesigned convolution block structures. The model adjusts the size of the depthwise separable convolution layer, removes the layers used to control the channel number, and adds a 1×1 convolution layer to reduce the computational load. To extract more accurate pathological features, the convolution block operation is performed twice before each upsampling and downsampling. The proposed LUNeXt model significantly reduces the number of parameters and FLOPs while maintaining segmentation performance. Finally, we compare the LUNeXt model with U-Net, U-Net++, U-Net3+, ConvUNeXt, and other representative medical image segmentation models. In summary, the contributions of this study are as follows:

  • Redesigned the architecture of ConvUNeXt and combined it with the encoder-decoder structure of U-Net to propose the LUNeXt model. By adjusting the size of the depthwise separable convolution and removing redundant convolution layers, the model significantly reduces the number of parameters and FLOPs.
  • Proposed LUNeXt to perform two convolution block operations before each upsampling and downsampling to extract more accurate pathological features. Additionally, its lightweight design speeds up model training, improves computational efficiency, and enables training on devices with limited hardware performance.
  • During the verification process, we collected four different medical image datasets for experiments. This study uses FLOPs and parameter count metrics to measure computational efficiency (algorithm/model complexity) and applies five performance metrics to evaluate performance, demonstrating the effectiveness of the proposed LUNeXt model.

Additionally, recent advancements in CNNs have led to the development of ConvUNeXt, a model that has shown significant promise in the field of biomedical image segmentation due to its lightweight architecture. The TriConvUNeXt model, for instance, is a pure CNN-based lightweight symmetrical network designed to efficiently handle biomedical image segmentation tasks (21). Furthermore, research presented at the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) highlights that even models with as few as one million parameters can achieve effective medical image segmentation (22). These studies underscore the potential of integrating advanced attention mechanisms with skip connections to enhance the accuracy and efficiency of segmentation results.

In traditional CNN models, convolution kernels convolve input feature maps, but excessive convolution increases the number of parameters significantly. Models like U-Net, U-Net++, and U-Net3+ have many parameters, slowing down training. In the U-Net model, two 3×3 convolution kernels perform local convolution on the input feature map, using rectified linear unit (ReLU) activation and normalization to adjust features.

In the convolution operations of most current neural network (NN) models, activation functions enable the model to learn complex features in the input data while alleviating the vanishing gradient problem. The ReLU function sets inputs that are less than zero to zero and otherwise keeps the original input. Although this method has a low computational cost, in certain tasks some convolutional layers can end up producing outputs that are always zero. To solve this problem, many alternative activation functions have been proposed, such as Leaky ReLU and the Gaussian error linear unit (GELU). GELU has been widely used in NNs and natural language processing in recent years. Compared with the piecewise linear ReLU, GELU applies a smooth, probabilistic weighting to each input, allowing small non-zero outputs for negative inputs and thus training features more accurately, albeit at a higher computational cost.
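
For reference, the two activation functions are defined as follows (standard definitions):

ReLU(x) = max(0, x)
GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution.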

In image segmentation, the attention gate mechanism usually adjusts the feature weights output by different layers so as to retain the more important features. In recent years, many models have combined the attention gate mechanism with the skip connections used in the U-Net model. Traditional skip connections directly add the decoder features to the corresponding encoder features, combining primary and high-level semantic features so that each layer can learn feature information from different levels. Skip connections that integrate attention gate mechanisms first divide the current features into different gates, use different activation functions or other operations to adjust the feature weights within each gate, add the result to the features of the previous layer, determine which features are more important based on the weights, and finally combine all the processed gates to obtain a more accurate segmentation result.


Methods

The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This section introduces the LUNeXt model and its evaluation metrics.

Proposed LUNeXt model

The proposed model is based on the encoder-decoder mechanism of the U-Net model, and we use the ConvUNeXt model, depthwise separable convolution, residual connections, and an attention mechanism to improve the efficiency of the medical image segmentation model. The improved model is named the lightweight LUNeXt model, as shown in Figure 1. The reasons for the improvements, organized by the methods used in this study, are explained as follows.

Figure 1 Proposed model. BN, batch normalization; ReLU, rectified linear unit.

Depthwise separable convolution blocks

In the U-Net model, two 3×3 convolution kernels perform local convolution on the input feature map, using ReLU (23) and normalization to adjust features, as shown in Figure 2. The ConvUNeXt model uses a 7×7 depthwise separable convolution layer and several 1×1 convolution layers to adjust channel numbers, replacing ReLU with GELU (24), which significantly reduces parameters compared to traditional operations, as shown in Figure 2. Our proposed model further optimizes this by using a 3×3 depthwise separable convolution and removing the 1×1 convolution layers, allowing direct normalization and residual connections, and employing GELU and ReLU activation functions to maintain performance while reducing parameters.

Figure 2 The difference of the proposed LUNeXt with different models. BN, batch normalization; ReLU, rectified linear unit; GELU, Gaussian error linear unit.
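
As a minimal sketch (not the authors' released code), the redesigned block described above could be written in PyTorch roughly as follows; the class name LUNeXtConvBlock, the exact ordering of normalization and activation, and keeping the channel count fixed inside the block are illustrative assumptions.

import torch
import torch.nn as nn

class LUNeXtConvBlock(nn.Module):
    # Illustrative sketch: 3x3 depthwise convolution (groups = channels), batch normalization,
    # GELU or ReLU activation, and a residual connection, as described in the text.
    def __init__(self, channels, use_gelu=True):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU() if use_gelu else nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.norm(self.dwconv(x)))
        return out + x  # residual connection

For example, LUNeXtConvBlock(32)(torch.randn(1, 32, 128, 128)) returns a tensor of the same shape, so the block can be stacked before each downsampling or upsampling step.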

Application of activation functions in the LUNeXt model

This study experiments with both ReLU and GELU activation functions to determine which is more suitable for medical datasets in the proposed LUNeXt model. The basic function graphs of ReLU and GELU are shown in Figure 3.

Figure 3 The ReLU and GELU function graphs. ReLU, rectified linear unit; GELU, Gaussian error linear unit.

Proposed model’s attention and skip connections

The proposed model uses the attention gate mechanism of ConvUNeXt and the transformer architecture. The attention gate mechanism takes two inputs: primary features (the corresponding encoder features) and high-level features (the current decoder features). First, the input high-level features are normalized and upsampled, and the resulting feature map is divided into three gates. The first gate is combined with the primary features and uses the sigmoid function to predict the pixel classification, and the result is then added to the primary features to obtain the result R_1 of the first gate. The second gate directly uses the sigmoid function to predict the pixel classification. The third gate uses the Tanh function to predict it, and the outputs of the second and third gates are multiplied to obtain the combined weight R_2. Finally, the sum of R_1 and R_2 is passed to a 1×1 convolution layer. This operation successfully combines feature information from different levels, as shown in Figure 4.

Figure 4 Proposed skip connections of integrating different attention mechanisms. “+” represents pixel-wise addition between two feature maps; and “X” represents pixel-wise multiplication between two feature maps. BN, batch normalization.
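
The following PyTorch sketch gives one literal reading of this three-gate skip connection. How the high-level and primary features are combined inside the first gate is not fully specified above, so the additive combination below, the module name AttentionSkip, and the fixed upsampling factor are assumptions made for illustration.

import torch
import torch.nn as nn

class AttentionSkip(nn.Module):
    # Illustrative sketch of the skip connection with three gates described in the text.
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, primary, high_level):
        # Normalize and upsample the high-level (decoder) features
        h = self.up(self.norm(high_level))
        # Gate 1: sigmoid prediction on the combined features, added back to the primary features (R_1)
        r1 = torch.sigmoid(h + primary) + primary
        # Gates 2 and 3: sigmoid and Tanh predictions, multiplied to form the combined weight R_2
        r2 = torch.sigmoid(h) * torch.tanh(h)
        # Fuse R_1 and R_2 and project with a 1x1 convolution
        return self.out_conv(r1 + r2)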

Proposed research procedure

The proposed research procedure includes six steps, as shown in Figure 5. The six steps are introduced as follows.

Figure 5 Proposed research procedure. FLOPs, floating-point operations; IoU, intersection over union.

Step 1: input training datasets

This step inputs the public medical image segmentation datasets, which include original images and label images of pathological features. First, the data are randomly split into training and test sets (the numbers of training and test images are shown in Table 1) and then loaded into the research environment.

Table 1

Summary description of four public medical image datasets

Datasets Train (images) Test (images) Size (pixels) Batch size Epochs (iterations) Learning rate
ISIC 2018 (25) 2,594 1,000 128×128 16 100 0.00015
DR HAGIS (26) 30 10 512×512 1 100 0.001
GlaS (27) 85 80 512×512 1 100 0.00015
CVC-ClinicDB (28) 500 112 384×288 8 100 0.00015

ISIC, International Skin Imaging Collaboration; DR HAGIS, Diabetic Retinopathy, Hypertension, Age-Related Macular Degeneration, and Glaucoma Images; GlaS, Gland Segmentation.

Step 2: image preprocessing

This step pre-processes the input images by resizing them to the same size to reduce the burden on the model, normalizing them so that each pixel value lies between 0 and 1, and adjusting the color depth to ensure the image dataset meets the conditions for model training.
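
A minimal preprocessing sketch is shown below, assuming the Pillow and torchvision packages listed later in the experimental environment; the target size is dataset-dependent (see Table 1), and the file name is a placeholder.

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),  # resize every image to the same size (per-dataset sizes in Table 1)
    transforms.ToTensor(),          # convert to a tensor with pixel values scaled to [0, 1]
])

image = Image.open("example_image.png").convert("RGB")  # unify the color depth to 3 channels
tensor = preprocess(image)                              # shape: (3, 512, 512)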

Step 3: data augmentation

After pre-processing the image dataset, this study applied data augmentation to all training images to enhance the model’s segmentation performance. Techniques included random horizontal and vertical flipping; adjustments to brightness, contrast, and saturation; Gaussian blur; random rotation; and cropping and zooming. These techniques increased the number of samples in the dataset, enabling the model to predict pathological characteristics more accurately.
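
The augmentation pipeline can be sketched with torchvision transforms as below. The probabilities, angles, and ranges are illustrative values rather than the exact settings of this study, and for segmentation the same geometric transforms must also be applied to the label masks (for example, by transforming image and mask jointly).

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # photometric changes
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=(512, 512), scale=(0.8, 1.0)),       # cropping and zooming
])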

Step 4: proposed LUNeXt model

Based on the encoder and decoder of the U-Net, this study uses ConvUNeXt, depthwise separable convolution, residual connections, and an attention mechanism to design the LUNeXt model for improving the efficiency of image segmentation. For generality, we provide pseudocode for the LUNeXt model algorithm, as shown in Table 2.

Table 2

LUNeXt model algorithm

Steps Description
1. Define parameters of the LUNeXt Such as Nchannels =3; C=32; Nlayer =2; Nclasses =2; B= true
2. Define convolution modules of the LUNeXt (I) Initial convolution block:
   • Using 3×3 depthwise separable convolution with padding
   • Define batch normalization and GELU (ReLU) activation
(II) Create four down modules, each consisting of:
   • Define batch normalization for input channels
   • Use 2×2 convolution to downsample the spatial dimensions
   • Stacked convolution layers (number = Nlayer)
(III) Create four up modules, each consisting of:
   • Define batch normalization for input channels
   • Use bilinear upsampling or transposed convolution to upsample the spatial dimensions
   • Use an attention gate to combine low-level and high-level features
   • Stacked convolution layers (number = Nlayer)
(IV) Output convolution: use a 1×1 convolution to produce the final segmentation results
3. Assemble the complete LUNeXt model with the defined architecture, and input I to start training
4. Output the predicted image label P

Input: I, input image; Nchannels, number of input image channels; Nclasses, number of output classes; Nlayer, number of layers in each down/up module; B, bilinear upsampling; C, base number of channels. Output: P, predicted image label. GELU, Gaussian error linear unit; ReLU, rectified linear unit.
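
The pseudocode in Table 2 can be turned into a PyTorch skeleton such as the one below. It reuses the LUNeXtConvBlock sketch given earlier; the stem convolution, the additive stand-in for the attention-gate fusion, and the channel doubling at each down module are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

def conv_stack(channels, n_layer):
    # n_layer stacked convolution blocks before each down/up sampling (Table 2)
    return nn.Sequential(*[LUNeXtConvBlock(channels) for _ in range(n_layer)])

class Down(nn.Module):
    def __init__(self, in_c, out_c, n_layer):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_c)
        self.down = nn.Conv2d(in_c, out_c, kernel_size=2, stride=2)  # 2x2 convolution halves the spatial size
        self.blocks = conv_stack(out_c, n_layer)

    def forward(self, x):
        return self.blocks(self.down(self.norm(x)))

class Up(nn.Module):
    def __init__(self, in_c, out_c, n_layer, bilinear=True):
        super().__init__()
        self.reduce = nn.Conv2d(in_c, out_c, kernel_size=1)
        self.up = (nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
                   if bilinear else nn.ConvTranspose2d(out_c, out_c, kernel_size=2, stride=2))
        self.blocks = conv_stack(out_c, n_layer)

    def forward(self, x, skip):
        x = self.up(self.reduce(x))
        x = x + skip  # simplified stand-in for the attention-gate fusion sketched earlier
        return self.blocks(x)

class LUNeXt(nn.Module):
    def __init__(self, n_channels=3, n_classes=2, n_layer=2, bilinear=True, base_c=32):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(n_channels, base_c, kernel_size=3, padding=1),
                                  nn.BatchNorm2d(base_c), nn.GELU())
        self.down1 = Down(base_c, base_c * 2, n_layer)
        self.down2 = Down(base_c * 2, base_c * 4, n_layer)
        self.down3 = Down(base_c * 4, base_c * 8, n_layer)
        self.down4 = Down(base_c * 8, base_c * 16, n_layer)
        self.up1 = Up(base_c * 16, base_c * 8, n_layer, bilinear)
        self.up2 = Up(base_c * 8, base_c * 4, n_layer, bilinear)
        self.up3 = Up(base_c * 4, base_c * 2, n_layer, bilinear)
        self.up4 = Up(base_c * 2, base_c, n_layer, bilinear)
        self.outc = nn.Conv2d(base_c, n_classes, kernel_size=1)  # 1x1 convolution for the final prediction

    def forward(self, x):
        x1 = self.stem(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        y = self.up1(x5, x4)
        y = self.up2(y, x3)
        y = self.up3(y, x2)
        y = self.up4(y, x1)
        return self.outc(y)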

Step 5: train model

We used the PyTorch suite in the Python environment to implement the models to be trained; the models in this study are U-Net, U-Net++, U-Net3+, ConvUNeXt, and the proposed LUNeXt model. First, the hyperparameters in the model environment are set, such as the optimizer, loss function, batch size, number of iterations, and learning rate. This study uses the root mean square propagation (RMSprop) optimizer for all models to automatically adjust the learning rate based on the gradient update of each parameter. The loss function is a combination of Dice loss and cross-entropy loss (29). In addition, the batch sizes, numbers of iterations, and learning rates vary by dataset.
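
A hedged sketch of this training configuration is given below, reusing the LUNeXt skeleton sketched above. The soft Dice formulation, the equal weighting of the two loss terms, and the specific learning rate shown are illustrative assumptions; the per-dataset batch sizes and learning rates are listed in Table 1.

import torch
import torch.nn as nn

def dice_loss(logits, targets, eps=1e-6):
    # Soft Dice loss computed on the foreground-class probability
    probs = torch.softmax(logits, dim=1)[:, 1]
    targets = targets.float()
    intersection = (probs * targets).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    return 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()

model = LUNeXt(n_channels=3, n_classes=2, n_layer=2, bilinear=True, base_c=32)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1.5e-4)  # RMSprop; learning rate per dataset (Table 1)
ce_loss = nn.CrossEntropyLoss()

def combined_loss(logits, masks):
    # Dice loss + cross-entropy loss, equally weighted here as an assumption
    return dice_loss(logits, masks) + ce_loss(logits, masks)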

Step 6: visual result and evaluation

After training is completed, the test data are fed into the trained model with its stored weights, and the predicted images are generated and stored to facilitate analysis and performance comparison with the label images. The visual segmentation results are then compared with the label images. For the evaluation metrics, this study uses the number of parameters, FLOPs, the Dice coefficient (DC), intersection over union (IoU), accuracy, precision, and recall to compare the listed models with the proposed model.

Evaluation metrics

This study employs the number of parameters and the number of FLOPs to measure computational efficiency (algorithm/model complexity), and five evaluation metrics to evaluate the image segmentation performance of the models listed in this study. These metrics comprise three common classification metrics, namely accuracy, recall, and precision (30), and two metrics widely used in image segmentation, namely the DC (31) and IoU (32). Next, we introduce the five performance metrics as follows.

Accuracy

Accuracy (denoted Acc in the equation) in image segmentation is the ratio of correctly classified pixels (both segmented and background) to the total number of pixels. First, we compute the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values; the calculation is then shown in Eq. [1].

Acc = (TP + TN) / (TP + TN + FP + FN)

Precision

The precision (shown in the equation as Prc) indicates how many of all pixels predicted as segmented areas are actually segmented areas, as shown in Eq. [2].

Prc = TP / (TP + FP)

Recall

Recall (shown in the equation as Rec), also called sensitivity or TP rate, is defined as the ratio of TPs to the total number of positive data points in a given dataset. In image segmentation, the recall indicates how many areas are segmented among all the pixels that are actually segmented areas, as shown in Eq. [3].

Rec = TP / (TP + FN)

DC

The DC (31) frequently appears in the evaluation of image segmentation and target detection. The DC calculates the similarity between the pixels of the predicted segmentation and the pixels of the actual segmented image, and its equation is equivalent to the F1-score, as shown in Eq. [4]. The DC value ranges between 0 and 1; the closer it is to 1, the closer the predicted segmentation map is to the actual segmentation map, that is, the more accurate the model’s segmentation results.

DC = 2 × Precision × Recall / (Precision + Recall) = 2TP / [(TP + FN) + (TP + FP)]

IoU

The IoU is an important indicator in the field of image segmentation used to measure the similarity between the predicted segmentation and the actual segmentation image (32). The IoU is also called the Jaccard coefficient, as shown in Eq. [5]. The larger the IoU value is, the larger the overlap area between the actual segmentation result and the label image is, indicating that the segmentation result is more accurate.

IoU = TP / (TP + FP + FN)
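
The five metrics (Eqs. [1]-[5]) can be computed from binary prediction and label masks as in the following sketch; the small epsilon guarding against empty masks is an implementation choice, not part of the formal definitions.

import numpy as np

def segmentation_metrics(pred, label, eps=1e-6):
    # pred, label: binary arrays of the same shape, where 1 marks a segmented (foreground) pixel
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    tn = np.logical_and(~pred, ~label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),  # Eq. [1]
        "precision": tp / (tp + fp + eps),                  # Eq. [2]
        "recall": tp / (tp + fn + eps),                     # Eq. [3]
        "DC": 2 * tp / ((tp + fn) + (tp + fp) + eps),       # Eq. [4]
        "IoU": tp / (tp + fp + fn + eps),                   # Eq. [5]
    }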

Dataset

This study collected four public medical image segmentation datasets to verify the proposed method and make comparisons. A brief description of the four datasets is shown in Table 1. Next, these datasets are detailed as follows.

International Skin Imaging Collaboration (ISIC) 2018

The public ISIC 2018 dataset (25) is collected from the ISIC and is mainly used for the diagnosis and research of skin diseases in medical images. This dataset includes a large number of skin disease images, covering various types of skin diseases and lesions, including melanoma, squamous cell carcinoma, basal cell carcinoma, and other skin conditions. In this study, the training and test sets of ISIC 2018 contain 2,594 and 1,000 images, respectively. Each image is resized to 128×128, the batch size is set to 16, the number of iterations is set to 100, and the learning rate is preset to 0.00015. In addition, this study uses the RMSprop optimizer to automatically adjust the learning rate based on the gradient update of each parameter.

Diabetic Retinopathy, Hypertension, Age-Related Macular Degeneration, and Glaucoma Images (DR HAGIS) dataset

The DR HAGIS dataset (26) contains 40 color fundus images of diabetic retinopathy patients in the UK, with varying image sizes and resolutions. In this study, the training and test sets of the DR HAGIS dataset contain 30 and 10 images, respectively; each image is resized to 512×512, the batch size is set to 1, the number of iterations is set to 100, and the learning rate is preset to 0.001. Further, we use the RMSprop optimizer to automatically adjust the learning rate based on the gradient update of each parameter. Because this dataset contains only a small number of samples, we expect to achieve higher performance on it with the larger image size and smaller batch size.

Gland segmentation (GlaS) dataset

The GlaS dataset (27) consists of hematoxylin-eosin (H&E) stained slides of T3 and T4 colorectal cancer, with each section representing a different patient. In this study, the training and test sets of the GlaS dataset contain 85 and 80 images, respectively; each image is resized to 512×512, the batch size is set to 1, the number of iterations is set to 100, and the learning rate is preset to 0.00015. Furthermore, we use the RMSprop optimizer to automatically adjust the learning rate based on the gradient update of each parameter.

CVC-ClinicDB

CVC-ClinicDB (28) is a database of frames extracted from colonoscopy videos. The dataset consists of 612 images cropped from 31 colonoscopy sequences, including the location of each polyp and the corresponding segmentation image. This study uses 500 training and 112 test images from CVC-ClinicDB. Each image is resized to 384×288, the batch size is set to 8, the number of iterations is set to 100, and the learning rate is preset to 0.00015. Then, we use the RMSprop optimizer to automatically adjust the learning rate based on the gradient update of each parameter.


Results

This section first introduces the experimental environment and parameter settings; the experimental results are then detailed.

Experimental environment and parameter settings

This study uses the Anaconda integrated development environment. The software versions are Python 3.9, Pillow 9.3.0, the PyTorch 2.0.0 framework, Nvidia CUDA 11.7, and cuDNN 8.4.0. The operating system is Windows 10, the CPU is an Intel i7-13700, and the GPU is an Nvidia RTX 3060. We used the Pillow package to pre-process the original images and label images of the medical image datasets and applied the PyTorch framework for data augmentation. Next, this experiment trained U-Net, U-Net++, U-Net3+, ConvUNeXt, and the proposed LUNeXt model on the collected datasets using the PyTorch framework. Finally, the parameter settings of each model are shown in Table 3.

Table 3

The parameter setting of the models used

Models Parameters References
U-Net n_channels =3, n_classes =2, bilinear = true, feature_scale =4, deconv = true, batchnorm = true, base_c =32 Ronneberger et al. (12)
U-Net++ n_channels =3, n_classes =2, bilinear = true, feature_scale =4, deconv = true, batchnorm = true, ds = true, base_c =32 Zhou et al. (33)
U-Net3+ n_channels =3, n_classes =1, bilinear = true, feature_scale =4, deconv = true, batchnorm = true, base_c =32 Huang et al. (34)
ConvUNeXt n_channels =3, n_classes =2, bilinear = true, base_c =32 Han et al. (20)
LUNeXt n_channels =3, n_classes =2, n_layer =2, bilinear = true, base_c =32 Proposed model

n_channels =3 (RGB image); n_classes =2 (the class is 2: background and target area); n_layer =2 is the number of convolutional layers before each downsampling (upsampling) in LUNeXt; feature_scale =4 is the reduction (enlargement) of the image by 4 times; base_c =32 is the feature map channel number of the first layer. deconv, deconvolution; batchnorm, batch normalization; ds, preset the model in U-Net++ for deep supervised training.

Experimental results

Based on the proposed six-step research procedure, we employ four public medical segmentation datasets to implement the experiments. Then, we compare the computational efficiency and performance of the proposed LUNeXt with the listed models as follows.

Computational efficiency

In this study, the DR HAGIS and GlaS datasets consist of high-resolution images with dimensions 3×512×512. We use these images to run U-Net (12), U-Net++ (33), U-Net3+ (34), ConvUNeXt (20), and the proposed LUNeXt (with GELU and ReLU) and determine the parameter numbers and average FLOPs per image, as shown in Table 4. The results show that the parameter numbers and FLOPs of the LUNeXt model are much smaller than those of the listed models, making the proposed model significantly lighter.

Table 4

Comparison results for parameter numbers and FLOPs

Models Parameters (k) FLOPs (G) Times
U-Net (12) 4,318.434 40.5 2.32
U-Net++ (33) 11,798.888 199.9 16.63
U-Net3+ (34) 6,750.434 202.5 10.24
ConvUNeXt (20) 3,505.730 29 1.89
LUNeXt (GELU) 987.170 9.7 0.66
LUNeXt (ReLU) 987.170 9.7 0.67

G denotes one billion (10^9) FLOPs. The marked values indicate the best performance in each metric among the six models. FLOPs, floating-point operations; GELU, Gaussian error linear unit; ReLU, rectified linear unit.
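
For reference, the parameter counts in Table 4 can be reproduced for any of the PyTorch models with a simple count, as sketched below using the LUNeXt skeleton from earlier; FLOPs counting typically relies on a third-party profiler (for example, the thop or fvcore packages), which is an assumption here rather than the tool actually used in this study.

import torch

model = LUNeXt(n_channels=3, n_classes=2, n_layer=2, bilinear=True, base_c=32)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e3:.3f} k trainable parameters")

# FLOPs for one 3x512x512 input can then be estimated with a profiler of choice, e.g.,
# thop.profile(model, inputs=(torch.randn(1, 3, 512, 512),))  # assumed tooling, not used in the paper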

Performance metrics

Similarly, we use four public medical datasets to run U-Net (12), U-Net++ (33), U-Net3+ (34), ConvUNeXt (20), and the proposed LUNeXt (GELU and ReLU), and this study compares the performance of the six models based on the five commonly used metrics. The experimental results are shown in Table 5. The results show that the proposed LUNeXt model has the best performance on the DR HAGIS and CVC-ClinicDB datasets, while its performance on the ISIC 2018 and GlaS datasets falls within an acceptable range.

Table 5

Comparative results of the four medical datasets for the five metrics

Datasets Models DC IoU Accuracy Precision Recall
ISIC 2018 U-Net 0.8406 0.7533 0.9112 0.8688 0.8096
U-Net++ 0.8039 0.7133 0.8900 0.8707 0.7433
U-Net3+ 0.8366 0.7526 0.9078 0.8880 0.7839
ConvUNeXt 0.7999 0.7083 0.8935 0.8624 0.7503
Proposed LUNeXt (GELU) 0.7994 0.7059 0.8928 0.8719 0.7402
Proposed LUNeXt (ReLU) 0.7981 0.7045 0.8924 0.8614 0.7479
DR HAGIS U-Net 0.4737 0.3181 0.9610 0.6411 0.4916
U-Net++ 0.4021 0.3080 0.9616 0.6012 0.4521
U-Net3+ 0.4357 0.3210 0.9617 0.6487 0.4320
ConvUNeXt 0.6163 0.4480 0.9747 0.7414 0.5511
Proposed LUNeXt (GELU) 0.6276 0.4595 0.9748 0.7290 0.5773
Proposed LUNeXt (ReLU) 0.6366 0.4690 0.9763 0.7630 0.5603
GlaS U-Net 0.7961 0.6812 0.7868 0.7632 0.8888
U-Net++ 0.7446 0.6137 0.7426 0.7222 0.8257
U-Net3+ 0.7848 0.6687 0.7905 0.7745 0.8599
ConvUNeXt 0.8386 0.7392 0.8409 0.8080 0.9133
Proposed LUNeXt (GELU) 0.8213 0.7072 0.8071 0.7724 0.9081
Proposed LUNeXt (ReLU) 0.8184 0.7085 0.8229 0.7967 0.8839
CVC-ClinicDB U-Net 0.4756 0.3651 0.9308 0.6099 0.5205
U-Net++ 0.4698 0.3256 0.9060 0.5856 0.5129
U-Net3+ 0.5276 0.4157 0.9267 0.5461 0.6543
ConvUNeXt 0.5308 0.4280 0.9367 0.6038 0.6135
Proposed LUNeXt (GELU) 0.5707 0.4592 0.9313 0.5993 0.6970
Proposed LUNeXt (ReLU) 0.6072 0.4901 0.9388 0.6597 0.6944

The marked values indicate that the model has the best predictive performance in the dataset. DC, Dice coefficient; IoU, intersection over union; ISIC, International Skin Imaging Collaboration; DR HAGIS, Diabetic Retinopathy, Hypertension, Age-Related Macular Degeneration, and Glaucoma Images; GlaS, Gland Segmentation; GELU, Gaussian error linear unit; ReLU, rectified linear unit.

In summary, this study introduces the LUNeXt model, which improves computational efficiency by reducing parameters and FLOPs while maintaining segmentation performance, as shown in Table 4. The experimental goal has been achieved: the lightweight LUNeXt model allows medical image segmentation to be performed on general computer hardware with an acceptable performance level, thereby lowering the learning threshold.


Discussion

After the experiments and comparisons, several findings are discussed as follows.

Pros and cons for U-Net family

While implementing the experiment, the authors encountered difficulties with different models. We summarize the advantages and disadvantages of these models based on previous studies and our experimental findings, as shown in Table 6.

Table 6

Pros & cons for U-Net family

Models Pros Cons
U-Net (I) Efficient image segmentation From the work of Ronneberger et al. (12):
(II) Symmetric encoder-decoder structure    (I) Limited contextual information
(III) Modularity and versatility    (II) Limited generalization
(IV) Powerful feature learning    (III) High memory usage
   (IV) Boundary artifacts: U-Net may suffer from boundary artifacts in segmentation results, especially when processing objects at image boundaries
U-Net++ (I) Improved segmentation performance: U-Net++ adds deep supervision to train different levels of feature information more accurately From the study of Zhou et al. (33):
(II) Effectively handle class imbalance: U-Net++ incorporates techniques such as weighted loss functions to help solve the common class imbalance problem in segmentation tasks    (I) Complexity and computational cost: U-Net++ uses a large number of intermediate layers and skip connections, which dramatically increases the number of parameters, training difficulty, model complexity, and computational cost
(III) Enhanced contextual information: U-Net++’s nested architecture allows the integration of multi-scale features to provide a broader context for segmentation    (II) Training difficulty: the increase in model complexity, while allowing the model to learn better, relatively makes the model more susceptible to learning noise, leading to overfitting
(IV) Reduced boundary artifacts: U-Net++ aims to mitigate boundary artifacts, which can be problematic in traditional U-Net models    (III) Uninterpretable architecture: the nested structure of U-Net++ can make it difficult to interpret compared to the original U-Net
U-Net3+ (I) Improved skip connections: U-Net3+ uses full-scale skip connections to combine low-level and high-level semantics to learn the correlation of different depth features Based on the experiment of this study:
(II) Compared with U-Net and U-Net++, U-Net3+ greatly reduces the number of parameters in combining multi-scale features, making it an efficient model at that time    (I) Dataset limitation: the performance is worse in datasets with smaller samples
(III) More accurate segmentation: U-Net3+ uses a CGM to avoid FPs    (II) Hardware limitation: computers still need to have certain performance and memory to be used
ConvUNeXt (I) More efficient convolution: it adopts a large convolution kernel, depthwise separable convolution, and normalization, which reduces the number of parameters significantly compared with traditional convolution on the intrinsic encoder-decoder architecture while retaining a certain degree of segmentation performance From the findings of the experiment in this study:
(II) Combination of self-attention mechanism and skip connection: the self-attention mechanism and skip connection are based on an LSTM-like architecture to extract more precise semantic features and filter out some primary semantic features and background noise    (I) The number of parameters brought by the large convolution: the 7×7 depthwise separable convolution layer still occupies a considerable number of parameters in the whole model, which keeps the model similar to some traditional CNNs used for image segmentation
   (II) Resource consumption: the larger depthwise separable convolutions and the insufficiently optimized MLP layer increase GPU memory usage
Proposed LUNeXt (I) Lightweight: it redesigns the ConvUNeXt architecture; by adjusting the depthwise separable convolution size and removing redundant convolution layers, the model is made lightweight (I) To increase computational efficiency, performance on datasets with too much noise is only maintained at an acceptable level
(II) Better segmentation accuracy: it performs multiple convolution block operations, and segmentation performance can be maintained without increasing the number of parameters (II) In the preprocessing stage, it depends on the preceding image preprocessing and data augmentation steps
(III) Enhanced computational efficiency: the lightweight design speeds up the model training and enhances the computational efficiency of the computer, which can be trained even on devices with limited hardware performance

CGM, classification-guide module; FP, false positive; LSTM, long short-term memory; CNN, convolutional neural network; MLP, multi-layer perceptron.

Model performance

In terms of parameter numbers and FLOPs, the LUNeXt model has the best computational efficiency among the listed models, and the proposed model achieves acceptable performance, as shown in Tables 4,5. The following explains the performance of the different models on each dataset and identifies the reasons.

  • On the DR HAGIS and CVC-ClinicDB datasets, the proposed LUNeXt model outperforms the other models on the five evaluation metrics, and the ReLU activation function is the better choice. Furthermore, the computational efficiency is significantly improved, with far fewer parameters and FLOPs in this experiment. We can therefore confirm that the proposed LUNeXt model is useful for image segmentation. However, U-Net++ and U-Net3+ (on the CVC-ClinicDB dataset, only U-Net3+) were unable to learn features from these datasets, and their test results were close to 0 for all metrics except accuracy. The reasons are the disadvantages of U-Net++ and U-Net3+ listed in Table 6: training difficulty, model complexity, too many skip connections, and poor performance on smaller samples. These cause the two models to overfit without learning any features; even data augmentation to increase the number of samples cannot compensate. This once again confirms the usability of the proposed LUNeXt model in medical image segmentation.
  • On the International Skin Imaging Collaboration 2018 dataset, the U-Net model is better on most evaluation indicators, and the segmentation results of ConvUNeXt and the proposed LUNeXt are both lower than those of the other models. The reason is that multiple ordinary convolution operations can learn the features better, and the advantages of U-Net are modularity, versatility, and powerful feature learning, as shown in Table 6. In contrast, ConvUNeXt and the proposed LUNeXt use depthwise separable convolution blocks with fewer parameters, which makes it harder to adapt to the complex segmentation pixels of these images, so the output feature maps do not fully capture the pathological features.
  • In the results of the GlaS dataset, the ConvUNeXt model is the best on the five evaluation metrics because its advantages are more efficient convolution and the combination of self-attention with skip connections. However, the proposed LUNeXt model produces better segmentation results than the traditional U-Net, U-Net++, and U-Net3+ models, and the results of the proposed LUNeXt model with the GELU activation function are better than those with ReLU. In sum, although the proposed model exhibits suboptimal performance on this dataset, the proposed lightweight model still achieves acceptable segmentation performance while significantly reducing the number of parameters and FLOPs.

Explainable artificial intelligence (XAI)

XAI has promoted the formation of new artificial intelligence research fields. XAI (35) refers to all methods and approaches that enable human users to understand artificial intelligence models. In essence, the goal of XAI is to provide contemporary artificial intelligence models with the ability to explain their predictions, decisions, and actions. This can be achieved in two main ways (36):

  • Design models that are inherently interpretable, i.e., their architecture allows the extraction of key insights about how decisions are made and value is calculated. One of the most popular examples in this category is decision trees for classification tasks.
  • Generate explanations “after the fact”, which has become very popular in computer vision and image analysis tasks.

The XAI in this study follows the second approach, based on the work of Ali et al. (36), to present the visual image segmentations of the four datasets under the original image, the mask, and the five models used, as shown in Figure 6. We observed that U-Net++ failed to learn features from the DR HAGIS and CVC-ClinicDB datasets, resulting in near-zero scores in all metrics except accuracy. Similarly, U-Net3+ performed poorly, with results close to zero on the CVC-ClinicDB dataset. Consequently, the visual images appear empty in (row 5, column 3), (row 5, column 5), and (row 6, column 3) of Figure 6. In summary, we found that the proposed LUNeXt model performs better when the original image and mask are clearer. Today’s medical images have reached at least 512×512 pixels, and more pixels result in clearer images. The proposed LUNeXt model aligns with modern needs and enhances computational efficiency.

Figure 6 Visual image segmentations for the four datasets under the five models. GlaS (row 1): H&E stain. ISIC, International Skin Imaging Collaboration; DR HAGIS, Diabetic Retinopathy, Hypertension, Age-Related Macular Degeneration, and Glaucoma Images; GlaS, Gland Segmentation; H&E, hematoxylin-eosin.
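
A minimal sketch of producing such side-by-side visual comparisons with matplotlib is shown below; the layout and titles are illustrative and not the exact procedure used to generate Figure 6.

import matplotlib.pyplot as plt

def show_segmentation(image, mask, prediction):
    # image: H x W x 3 array; mask and prediction: H x W binary arrays
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, data, title in zip(axes, [image, mask, prediction], ["Original", "Mask", "Prediction"]):
        ax.imshow(data, cmap=None if data.ndim == 3 else "gray")
        ax.set_title(title)
        ax.axis("off")
    plt.tight_layout()
    plt.show()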

Conclusions

This study presented the LUNeXt model, a lightweight member of the U-Net family designed to increase computational efficiency while maintaining acceptable performance metrics in image segmentation. In the experiments, we collected four medical image segmentation datasets to implement and compare the proposed LUNeXt model with the listed models. The experimental results show that the proposed LUNeXt model performs well and effectively reduces computing resources (decreasing the number of parameters and FLOPs). The proposed lightweight LUNeXt model is based on ConvUNeXt, depthwise separable convolution, residual connections, and an attention mechanism designed to improve computing efficiency in image segmentation. Based on our experimental results and discussion, this study presents several novelties and contributions:

  • Redesigned the architecture of ConvUNeXt and combined it with U-Net to propose the LUNeXt model, significantly reducing the number of parameters and FLOPs.
  • Implemented two convolution block operations before each upsampling and downsampling in the proposed LUNeXt model to extract more accurate pathological features, which accelerates model training, enhances computational efficiency, and allows for training on devices with limited hardware capabilities.
  • Described the advantages and disadvantages of the U-Net family based on previous studies and our experimental findings.

For future work, we plan to design a lighter model for the U-Net family by incorporating elements such as ConvUNeXt, depthwise separable convolution, residual connections, and attention mechanisms. Additionally, we aim to enhance the performance of the improved model to further innovate the U-Net family design.


Acknowledgments

Funding: None.


Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1429/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2019;38:2281-92. [Crossref] [PubMed]
  2. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 2014. Available online: https://arxiv.org/abs/1409.1556
  3. Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167. 2015. Available online: https://arxiv.org/abs/1502.03167
  4. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:1-9.
  5. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:2818-26.
  6. Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2017.
  7. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-8.
  8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM 2017;60:84-90. [Crossref]
  9. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:4700-8.
  10. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861. 2017. Available online: https://arxiv.org/abs/1704.04861
  11. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:4510-20.
  12. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Cham: Springer International Publishing; 2015:234-41.
  13. Zhang Z, Tian H, Xu Z, Bian Y, Wu J. Application of a pyramid pooling Unet model with integrated attention mechanism and Inception module in pancreatic tumor segmentation. J Appl Clin Med Phys 2023;24:e14204. [Crossref] [PubMed]
  14. Yang Y, Dasmahapatra S, Mahmoodi S. ADS_UNet: A nested UNet for histopathology image segmentation. Expert Syst Appl 2023;226:120128. [Crossref]
  15. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. 2020. Available online: https://arxiv.org/abs/2010.11929
  16. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. arXiv:1706.03762. 2017. Available online: https://arxiv.org/abs/1706.03762
  17. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:10012-22.
  18. Wang G, Zhao Y, Tang C, Luo C, Zeng W. When shift operation meets vision transformer: An extremely simple alternative to attention mechanism. Proceedings of the AAAI Conference on Artificial Intelligence. 2022;36:2423-30. [Crossref]
  19. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:11976-86.
  20. Han Z, Jian M, Wang GG. ConvUNeXt: An efficient convolution neural network for medical image segmentation. Knowledge-Based Systems 2022;253:109512. [Crossref]
  21. Ma C, Gu Y, Wang Z. TriConvUNeXt: A Pure CNN-Based Lightweight Symmetrical Network for Biomedical Image Segmentation. J Imaging Inform Med 2024;37:2311-23. [Crossref] [PubMed]
  22. Dinh BD, Nguyen TT, Tran TT, Pham VT. 1M parameters are enough? A lightweight CNN-based model for medical image segmentation. In: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE; 2023:1279-84.
  23. Andrearczyk V, Whelan PF. Convolutional neural network on three orthogonal planes for dynamic texture classification. Pattern Recognition 2018;76:36-49. [Crossref]
  24. Ni S, Jia P, Xu Y, Zeng L, Li X, Xu M. Prediction of CO concentration in different conditions based on Gaussian-TCN. Sensors and Actuators B: Chemical 2023;376:133010. [Crossref]
  25. Codella NC, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, Kalloo A, Liopyris K, Mishra N, Kittler H. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE; 2018:168-72.
  26. Holm S, Russell G, Nourrit V, McLoughlin N. DR HAGIS-a fundus image database for the automatic extraction of retinal surface vessels from diabetic patients. J Med Imaging (Bellingham) 2017;4:014503. [Crossref] [PubMed]
  27. Sirinukunwattana K, Pluim JPW, Chen H, Qi X, Heng PA, Guo YB, Wang LY, Matuszewski BJ, Bruni E, Sanchez U, Böhm A, Ronneberger O, Cheikh BB, Racoceanu D, Kainz P, Pfeiffer M, Urschler M, Snead DRJ, Rajpoot NM. Gland segmentation in colon histology images: The glas challenge contest. Med Image Anal 2017;35:489-502. [Crossref] [PubMed]
  28. Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput Med Imaging Graph 2015;43:99-111. [Crossref] [PubMed]
  29. Yeung M, Sala E, Schönlieb CB, Rundo L. Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput Med Imaging Graph 2022;95:102026. [Crossref] [PubMed]
  30. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management 2009;45:427-37. [Crossref]
  31. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297-302. [Crossref]
  32. Kosub S. A note on the triangle inequality for the Jaccard distance. Pattern Recognition Letters 2019;120:36-8. [Crossref]
  33. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
  34. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen YW, Wu J. Unet 3+: A full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020:1055-9.
  35. Dwivedi R, Dave D, Naik H, Singhal S, Omer R, Patel P, Qian B, Wen Z, Shah T, Morgan G. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Comput Surv 2023;55:1-33. [Crossref]
  36. Ali S, Abuhmed T, El-Sappagh S, Muhammad K, Alonso-Moral JM, Confalonieri R, Guidotti R, Del Ser J, Díaz-Rodríguez N, Herrera F. Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence. Inf Fusion 2023;99:101805. [Crossref]
Cite this article as: Cheng CH, Yang JH, Hsu YC. A streamlined U-Net convolution network for medical image processing. Quant Imaging Med Surg 2025;15(1):455-472. doi: 10.21037/qims-24-1429
