Original Article

Seg-SkiNet: adaptive deformable fusion convolutional network for skin lesion segmentation

Haiwang Nan1, Zhenhao Gao1, Limei Song2, Qiang Zheng1

1School of Computer and Control Engineering, Yantai University, Yantai, China; 2School of Medical Imaging, Shandong Second Medical University, Weifang, China

Contributions: (I) Conception and design: H Nan, Z Gao, Q Zheng; (II) Administrative support: L Song, Q Zheng; (III) Provision of study materials or patients: Z Gao; (IV) Collection and assembly of data: H Nan, Z Gao; (V) Data analysis and interpretation: H Nan, Z Gao; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Qiang Zheng, PhD. School of Computer and Control Engineering, Yantai University, No. 30, Qingquan Road, Laishan District, Yantai 264005, China. Email: zhengqiang@ytu.edu.cn.

Background: Skin lesion segmentation plays a significant role in skin cancer diagnosis. However, the complex shapes, varying sizes, and differing color depths of skin lesions make precise segmentation challenging. Therefore, the aim of this study was to design a customized deep learning (DL) model for the precise segmentation of skin lesions, particularly lesions with complex shapes and small target lesions.

Methods: In this study, an adaptive deformable fusion convolutional network (Seg-SkiNet) was proposed. Seg-SkiNet integrated a dual-channel convolution encoder (Dual-Conv encoder), a Multi-Scale-Multi-Receptive Field Extraction and Refinement (Multi2ER) module, and a local-global information interaction fusion decoder (LGI-FSN decoder). In the Dual-Conv encoder, a Dual-Conv module was proposed and cascaded with max pooling in each layer to capture the features of complex-shaped skin lesions. The design of the Dual-Conv module not only effectively captured the edge features of lesions but also learned their deep internal features. The Multi2ER module was composed of an Atrous Spatial Pyramid Pooling (ASPP) module and an Attention Refinement Module (ARM), and integrated multi-scale features of small target lesions by expanding the receptive field of the convolutional kernel, thereby improving the learning and accurate segmentation of small target lesions. In the LGI-FSN decoder, we integrated a convolution layer and a Local-Global Attention Fusion (LGAF) module in each layer to enable interactive fusion of local-global information in feature maps while eliminating redundant feature information. Additionally, we designed a densely connected architecture that fused the feature maps from a specific layer of the Dual-Conv encoder and all of its preceding layers into the corresponding layer of the LGI-FSN decoder, preventing information loss caused by pooling operations.

Results: We validated the performance of Seg-SkiNet for skin lesion segmentation on three public datasets: International Skin Imaging Collaboration (ISIC)-2016, ISIC-2017, and ISIC-2018. The experimental results demonstrated that Seg-SkiNet achieved a Dice coefficient (DICE) of 93.66%, 89.44%, and 92.29% on the three datasets, respectively.

Conclusions: The Seg-SkiNet model performed excellently in segmenting complex-shaped lesions and small target lesions.

Keywords: Skin lesion segmentation; deep learning (DL); U-Net; multi-scale feature


Submitted Jul 17, 2024. Accepted for publication Nov 15, 2024. Published online Dec 17, 2024.

doi: 10.21037/qims-24-1451


Introduction

The excessive proliferation of skin cells is the primary cause of skin cancer, of which melanoma is one of the most dangerous types (1), posing a significant risk to life. Early detection of skin lesions can reduce the risk of skin cancer. Dermoscopy, as a non-invasive imaging technique, can facilitate the early detection and diagnosis of skin lesions (2). Dermatologists typically visually examine and analyze skin lesions and surrounding tissues in dermoscopic images (3,4); however, this process is highly laborious and inherently subjective. Therefore, employing a computer-aided diagnosis (CAD) system for automatic skin lesion segmentation is imperative, as it significantly assists clinicians in improving the accuracy of their analyses.

Skin lesion segmentation methods include traditional methods and deep learning (DL) methods. Traditional segmentation methods mainly include active contour model-based segmentation (5), threshold-based segmentation (6,7), and support vector machines (8,9), which aim to extract basic information from the lesions. Nevertheless, these methods often depend on extensive pre- and post-processing, limiting their robustness. In contrast, DL approaches use label information during training to autonomously learn semantic features from images, thereby enhancing the accuracy of skin lesion segmentation. For example, Ronneberger et al. (10) proposed the U-Net network model, which has shown promising results in medical image segmentation. Building on this foundation, Attunet (11) further enhanced segmentation performance by incorporating an attention mechanism [the details of the attention mechanism can be found in the Supplementary file (Appendix 1)].

Although the above methods have demonstrated significant segmentation performance, the varying shapes and sizes of skin lesions present challenges to existing DL models in practice. Wu et al. (12) proposed a feature adaptive transformer network (FAT-Net) based on the classical encoder-decoder architecture for skin lesion segmentation. FAT-Net introduced a dual-encoder architecture integrating convolutional neural network (CNN) and transformer branches. The dual-encoder structure not only effectively captured local features but also integrated global contextual information for skin lesion segmentation. However, due to the fixed shape of convolutional kernels, CNNs may not accurately capture the complex local contour features of lesions with irregular boundaries. Liu et al. (13) proposed a local-global information interaction network (LGI Net) for medical image segmentation, incorporating the efficient channel attention (ECA) module to effectively capture interactions between channels. Although LGI Net achieved excellent segmentation results across datasets, the consecutive convolution operations in its encoder and bottleneck layers limited its ability to capture multi-scale contextual and complex lesion contour features, which was unfavorable for the segmentation of small lesions or lesions with complex shapes. Dong et al. (14) proposed an effective feedback attention network (FAC-Net), integrating the feedback fusion block (FFB) and the attention mechanism block (AMB) for skin lesion segmentation. Although FAC-Net captured multi-scale features while enhancing important feature representations, the down-sampling operations within the network posed a risk of feature information loss. Therefore, it is valuable to design a customized DL model for the fine segmentation of skin lesions, especially for complex-shaped lesions and small target lesions.

In this study, an adaptive deformable fusion convolutional network (Seg-SkiNet) was proposed for fine segmentation of skin lesions. The proposed Seg-SkiNet model consists of three parts: dual-channel convolution encoder (Dual-Conv encoder), Multi-Scale-Multi-Receptive Field Extraction and Refinement (Multi2ER) module, and local-global information interaction fusion decoder (LGI-FSN decoder). The main contributions of this study are summarized as follows:

  • In each layer of the Dual-Conv encoder, a dual-channel convolution (Dual-Conv) module was proposed to learn the features of lesions with complex boundary structures. Specifically, the Dual-Conv module was constructed by an Edge feature channel integrating a deformable convolution (15) and an SE_Block (16) to learn the boundary features, and a Deep feature channel integrating a standard convolution and an SE_Block (16) to extract deep feature information within the lesions. The Dual-Conv module enabled the Seg-SkiNet to better capture and adapt to lesions with complex geometries, thereby improving its ability to extract both edge features and deep internal features.
  • In the bottleneck layer, the Multi2ER module was proposed to enhance the segmentation accuracy and robustness of small target lesions, integrating an Atrous Spatial Pyramid Pooling (ASPP) module (17) and an Attention Refinement Module (ARM) (18). The ASPP module (17) captured multi-scale features using convolution layers with different dilation rates, whereas the ARM module (18) highlighted the differences between lesion areas and the background in the image. The structure of Multi2ER module enhanced the Seg-SkiNet model’s ability to detect and segment small target lesions.
  • To mitigate the loss of feature information caused by pooling operations, a dense connection structure was introduced between the Dual-Conv encoder and the LGI-FSN decoder. The dense connection structure fused the feature maps from a specific layer of the encoder and all its preceding layers into the corresponding layer of the decoder. Simultaneously, in each layer of the LGI-FSN decoder, we employed the Local-Global Attention Fusion (LGAF) (13) to facilitate interactive fusion of local-global contextual information in the feature maps.

We evaluated the effectiveness of the Seg-SkiNet model on three publicly available datasets: ISIC-2016 (19), ISIC-2017 (20), and ISIC-2018 (21). Quantitative results demonstrated that the Seg-SkiNet model achieved outstanding performance (the DICE reached 93.66%, 89.44% and 92.29%, respectively) in skin lesion segmentation. Furthermore, visualizations indicated that the Seg-SkiNet model performed excellently in capturing complex contour lesions and small target lesions.


Methods

Participants

In this study, we validated the effectiveness of the Seg-SkiNet model using three publicly available datasets from the International Skin Imaging Collaboration (ISIC): ISIC-2016 (19), ISIC-2017 (20), and ISIC-2018 (21). (I) The ISIC-2016 dataset contains 1,279 RGB skin lesion images, providing 900 training images and 379 testing images. We further divided the testing images into 100 validation images and 279 testing images. (II) The ISIC-2017 dataset contains 2,000 training images, 150 validation images, and 600 testing images. (III) The ISIC-2018 dataset provides a total of 2,594 images along with their corresponding labels; following previous studies (12,14), we split this dataset into training, validation, and testing sets in a 7:1:2 ratio. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
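As a concrete illustration, the 7:1:2 split of the ISIC-2018 images could be implemented as follows (a minimal sketch; the use of scikit-learn, the random seed, and the variable image_paths are our assumptions, not details reported in the original studies):

```python
from sklearn.model_selection import train_test_split

# image_paths: list of the 2,594 ISIC-2018 image paths (placeholder variable)
# 70% training, 30% held out
train_paths, rest = train_test_split(image_paths, test_size=0.3, random_state=42)
# split the held-out 30% into 10% validation and 20% testing
val_paths, test_paths = train_test_split(rest, test_size=2/3, random_state=42)
```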

Before being fed into the model, all images were resized to 256×256 and underwent Gaussian blurring and normalization.
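A minimal preprocessing sketch is given below; the Gaussian kernel size and the normalization scheme are assumptions, as the exact parameters are not specified in the text:

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Resize to 256x256, apply Gaussian blurring, and normalize to [0, 1]."""
    img = cv2.imread(image_path)              # BGR, uint8
    img = cv2.resize(img, (256, 256))
    img = cv2.GaussianBlur(img, (5, 5), 0)    # assumed 5x5 kernel, auto sigma
    return img.astype(np.float32) / 255.0     # assumed min-max normalization
```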

The proposed Seg-SkiNet model

In this section, a Seg-SkiNet model (Figure 1) is proposed for the fine segmentation of skin lesions. The flowchart of the Seg-SkiNet model is shown in Figure 1A, and its architecture is shown in Figure 1B. Specifically, the Seg-SkiNet was constructed by a Dual-Conv encoder that integrated a Dual-Conv module (Figure 1C) and max pooling in each layer, and an LGI-FSN decoder that integrated an LGAF module (13) (Figure 1D) and a convolution layer in each layer. Meanwhile, the Dual-Conv encoder and the LGI-FSN decoder were connected at the bottom through the Multi2ER (Figure 1E) bottleneck layer, which integrated an ASPP module (17) and an ARM module (18). Similar to the U-Net (10) structure, both the Dual-Conv encoder and the LGI-FSN decoder consisted of 5 layers, and a dense connection structure was designed between them to prevent feature information loss.

Figure 1 The overall framework diagram of the proposed model. (A) Flowchart of the Seg-SkiNet model. (B) Architecture of the proposed Seg-SkiNet model. (C) The Dual-Conv module. (D) The LGAF module. (E) The Multi2ER module. Seg-SkiNet, adaptive deformable fusion convolutional network; Dual-Conv, dual-channel convolution; Multi2ER, Multi-Scale-Multi-Receptive Field Extraction and Refinement; LGI-FSN, local-global information interaction fusion; SE, Squeeze-and-Excitation; LGAF, Local-Global Attention Fusion; MLP, Multilayer Perceptron.

Dual-Conv encoder

Standard convolution performs excellently in extracting local feature information; however, due to the fixed size of convolutional kernels and the limited receptive field, it fails to capture sufficient contour detail when handling lesions with complex shapes or boundaries, resulting in decreased segmentation accuracy. In contrast, deformable convolution (15) automatically adjusts the sampling points of the convolution kernel at each spatial position by introducing offsets. It can effectively handle the boundary features of variously shaped lesions by flexibly altering the positions of the convolution operations and the shape of the convolutional kernels. This process could be formulated by Eq. [1]:

Z(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad [1]

where w(p_n) represents the weight of the convolution kernel at position p_n, x represents the input feature map, \Delta p_n represents the learned offset, and R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\} defines a 3×3 convolution kernel with dilation 1 [the details of deformable convolution can be found in the Supplementary file (Appendix 2)]. To combine the advantages of both convolution types in feature extraction, the Dual-Conv encoder was designed.
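As an illustration of Eq. [1], the sketch below pairs torchvision's DeformConv2d with a small convolution that predicts the offsets \Delta p_n; the layer names are ours, and this is a sketch rather than the authors' released code:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # one (dy, dx) offset per kernel position p_n in R, at every location p_0
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)        # the learned offsets Δp_n in Eq. [1]
        return self.deform_conv(x, offsets)  # Σ w(p_n)·x(p_0 + p_n + Δp_n)

# shape check: a 3×256×256 input yields a 16×256×256 output
out = DeformConvBlock(3, 16)(torch.randn(1, 3, 256, 256))
```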

The Dual-Conv encoder consisted of five layers; the Dual-Conv module was proposed to learn the lesion features with complex boundary structures in each layer of the Dual-Conv encoder. The structure of the Dual-Conv module is shown in Figure 2. Specifically, the Dual-Conv module was constructed by an Edge feature channel integrating a deformable convolution (15) and an SE_Block (16) to learn the boundary features of lesions with complex shapes, and a Deep feature channel integrating a standard convolution and an SE_Block to extract local feature information within the lesions. Subsequently, the outputs of the two channels were fused, which maximized the integration of the Dual-Conv encoder’s ability to learn boundary features and deep internal features, thereby enhancing the performance of the Seg-SkiNet model in segmenting lesions with complex boundary structures.

Figure 2 Architecture of the proposed Dual-Conv module. The Dual-Conv module was constructed by an Edge feature channel integrating a deformable convolution and an SE_Block to learn the boundary features of lesions with complex shapes, and a Deep feature channel integrating a standard convolution and an SE_Block to extract local feature information within the lesions. SE, Squeeze-and-Excitation; Dual-Conv, dual-channel convolution.

Specifically, in the Edge feature channel, the input was first processed by a deformable convolution (15) to capture the contour boundary features of irregular lesions. Subsequently, the SE_Block (16) dynamically adjusted the channel feature responses by learning weights between feature maps, thereby enhancing the representation of features crucial for segmentation tasks. Meanwhile, skip connections were used to prevent the loss of feature information. This process could be formulated by Eq. [2]:

\begin{cases} X_D = \mathrm{DeformConv2D(Input)} \\ \widetilde{X}_D = \mathrm{SE\_Block}(X_D) \\ Y_1 = \delta(2X_D + \widetilde{X}_D) \end{cases} \qquad [2]

where \delta represents the activation function and Y_1 represents the feature map output by the Edge feature channel. For the Deep feature channel, we replaced the deformable convolution with a standard convolutional layer to better capture deep-level local detail features within the lesions. This process could be formulated by Eq. [3]:

\begin{cases} X_C = \mathrm{Conv2D(Input)} \\ \widetilde{X}_C = \mathrm{SE\_Block}(X_C) \\ Y_2 = \delta(2X_C + \widetilde{X}_C) \end{cases} \qquad [3]

where Y_2 represents the feature map output by the Deep feature channel. Finally, the outputs from both channels were fused to obtain the final output of the Dual-Conv module. The dual-channel parallel design of the Dual-Conv module effectively improved the accuracy of lesion boundary recognition and enhanced the capture of internal details.
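Putting Eqs. [2] and [3] together, a sketch of the Dual-Conv module might look as follows (assumptions: ReLU as the activation \delta, element-wise addition as the final fusion, and the DeformConvBlock from the earlier sketch):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (16): channel-wise feature reweighting."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class DualConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.edge_conv = DeformConvBlock(in_ch, out_ch)           # Edge feature channel
        self.deep_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # Deep feature channel
        self.edge_se, self.deep_se = SEBlock(out_ch), SEBlock(out_ch)
        self.act = nn.ReLU(inplace=True)                          # δ in Eqs. [2] and [3]

    def forward(self, x):
        xd = self.edge_conv(x)
        y1 = self.act(2 * xd + self.edge_se(xd))                  # Eq. [2]
        xc = self.deep_conv(x)
        y2 = self.act(2 * xc + self.deep_se(xc))                  # Eq. [3]
        return y1 + y2                                            # assumed additive fusion
```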

Multi2ER module

The limited receptive fields of convolution kernels in traditional convolution operations restrict their ability to capture the subtle features and contextual information of small target and low-contrast lesions. The ASPP module (17) expanded the receptive field of convolutional kernels by setting different dilation rates, thereby capturing multi-scale feature information in feature maps. The capability of ASPP module (17) was particularly effective for small target lesion segmentation, enabling better capture of subtle features and contextual information, thereby improving segmentation accuracy. The ARM module (18) highlighted differences between lesion areas and background by learning attention weights among channel feature maps. Therefore, to enhance the segmentation accuracy and robustness of small target lesions, the Multi2ER module was designed within the bottleneck of the Seg-SkiNet model, integrating the ASPP module (17) and the ARM module (18). The structure of the Multi2ER module is shown in Figure 3.

Figure 3 Architecture of the proposed Multi2ER module. Multi2ER module was proposed for the segmentation of small target skin lesions, integrating an ASPP module and an ARM. Conv, convolution; Multi2ER, Multi-Scale-Multi-Receptive Field Extraction and Refinement; ASPP, Atrous Spatial Pyramid Pooling; ARM, Attention Refinement Module.

Specifically, the ASPP module (17) consisted of five branches: the first branch adjusted the dimension of the feature map using a 1×1 convolution; the middle three branches employed different dilation rates to enlarge the convolutional kernel's receptive field; and the final branch enhanced the robustness of the module through a global pooling layer followed by an upsampling layer. This process could be formulated by Eqs. [4-6]:

\begin{cases} A_1 = \mathrm{Conv}_{1\times 1}(\mathrm{Input}) \\ A_{2,3,4} = \mathrm{Conv}_{3\times 3}(\mathrm{Input}, \mathrm{rate} \in \{6,12,18\}) \\ A_5 = \mathrm{Upsample}(\mathrm{Conv}_{1\times 1}(\mathrm{Pooling}(\mathrm{Input}))) \end{cases} \qquad [4]

\mathrm{Feature}_1 = \mathrm{Concat}(A_1, A_2, A_3, A_4, A_5) \qquad [5]

\mathrm{Feature}_2 = \mathrm{Conv}_{1\times 1}(\mathrm{Feature}_1) \qquad [6]

where A_i represents the output of the i-th branch and Concat(·) represents the concatenation function. Subsequently, the output of the ASPP module (17) was fed into the ARM module (18) to highlight texture differences between skin lesion areas and the background. Specifically, global average pooling was first applied to the feature maps to obtain a global feature pooling vector, which was then processed sequentially through a 1×1 convolution layer and a normalization layer, and finally mapped to attention scores by a Sigmoid function. The attention scores were then multiplied by the original feature map to emphasize the salient features of the lesion area. This process could be formulated by Eqs. [7,8]:

\mathrm{Att\_score} = \mathrm{Sigmoid}(\mathrm{BN}(\mathrm{Conv}(\mathrm{Pooling}(\mathrm{Feature}_2)))) \qquad [7]

\mathrm{Output} = \mathrm{Input} + \mathrm{Feature}_2 \otimes \mathrm{Att\_score} \qquad [8]

where Att_score represents the attention scores and \otimes represents element-wise multiplication.
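A compact sketch of the Multi2ER bottleneck following Eqs. [4-8] is shown below (the channel sizes and the use of bilinear upsampling are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Multi2ER(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.a1 = nn.Conv2d(ch, ch, 1)                                    # branch 1
        self.a234 = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in (6, 12, 18)]
        )                                                                 # branches 2-4
        self.a5 = nn.Conv2d(ch, ch, 1)                                    # branch 5
        self.project = nn.Conv2d(5 * ch, ch, 1)                           # Eq. [6]
        self.arm = nn.Sequential(                                         # Eq. [7]
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1),
            nn.BatchNorm2d(ch), nn.Sigmoid(),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.adaptive_avg_pool2d(x, 1)                              # global pooling
        a5 = F.interpolate(self.a5(pooled), size=(h, w),
                           mode="bilinear", align_corners=False)          # upsample branch
        feats = [self.a1(x)] + [conv(x) for conv in self.a234] + [a5]
        feature2 = self.project(torch.cat(feats, dim=1))                  # Eqs. [4-6]
        return x + feature2 * self.arm(feature2)                          # Eq. [8]
```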

LGI-FSN decoder

To mitigate the loss of feature information caused by pooling operations, a dense connection structure was introduced between the Dual-Conv encoder and the LGI-FSN decoder. Specifically, the output of the Dual-Conv module from a specific layer of the Dual-Conv encoder was added to the outputs of the Dual-Conv module from all preceding layers, and the summed result was concatenated with the input of the corresponding layer of the LGI-FSN decoder. To maintain dimension matching between the feature maps, the outputs of the Dual-Conv module from all preceding layers were processed through 1×1 convolution and pooling operations. Additionally, at each layer of the LGI-FSN decoder, we employed the LGAF (13) module to facilitate interactive fusion of the local-global information within the feature maps.
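The dense-connection rule for one decoder layer can be sketched as follows; the specific 1×1 convolutions and pooling used to match dimensions are our assumptions, consistent with the description above:

```python
import torch
import torch.nn.functional as F

def dense_skip(encoder_feats, layer_idx, match_convs):
    """Sum the current encoder output with all preceding encoder outputs,
    each projected by a 1x1 conv (match_convs[j], one per preceding layer)
    and pooled to the current layer's resolution; the result is later
    concatenated with the decoder input at `layer_idx`."""
    target = encoder_feats[layer_idx]
    fused = target
    for j in range(layer_idx):
        f = match_convs[j](encoder_feats[j])              # 1x1 conv: match channels
        f = F.adaptive_max_pool2d(f, target.shape[-2:])   # pooling: match resolution
        fused = fused + f
    return fused
```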

Implementation details and evaluation indicators

The Seg-SkiNet model was implemented using PyTorch (https://pytorch.org/) on an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory, and Adam was chosen as the optimizer. The maximum number of training epochs was 70, the initial learning rate was set to 0.001, and the training batch size was set to 2.
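These settings translate into a minimal training loop such as the following (SegSkiNet, dice_loss, and train_set are placeholders for components described elsewhere in this paper):

```python
import torch
from torch.utils.data import DataLoader

model = SegSkiNet().cuda()                                  # placeholder model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial learning rate 0.001
loader = DataLoader(train_set, batch_size=2, shuffle=True)  # batch size 2

for epoch in range(70):                                     # at most 70 epochs
    for images, masks in loader:
        preds = torch.sigmoid(model(images.cuda()))
        loss = dice_loss(preds, masks.cuda()).mean()        # DiceLoss, Eq. [13]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```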

In this study, four quantitative measures were used to evaluate the accuracy of segmentation: Dice coefficient (DICE), accuracy (ACC), sensitivity (SE), and specificity (SP). Each of them was defined as follows:

\mathrm{DICE} = \frac{2 \times TP}{2 \times TP + FP + FN} \qquad [9]

\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad [10]

\mathrm{SE} = \frac{TP}{TP + FN} \qquad [11]

\mathrm{SP} = \frac{TN}{TN + FP} \qquad [12]

where TP, TN, FP, and FN represented true positives, true negatives, false positives, and false negatives, respectively. TP and TN represented the number of pixels which were correctly classified as skin lesions and background, respectively. FP and FN represented the number of misclassified pixels of skin lesions and background, respectively. The loss function selected for the experiments was the Dice coefficient loss (DiceLoss) (22), which could be formulated by Eq. [13]:

\mathrm{DiceLoss} = 1 - \frac{2|X \cap Y|}{|X| + |Y|} \qquad [13]

where X and Y represent the predicted segmentation result and the real segmentation label, respectively.
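For reference, the four metrics and the Dice loss could be computed from binary masks as follows (a sketch with an assumed smoothing term eps to avoid division by zero):

```python
import torch

def confusion_counts(pred: torch.Tensor, target: torch.Tensor):
    """Pixel-wise TP/TN/FP/FN for binary masks of equal shape."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().float()
    tn = (~pred & ~target).sum().float()
    fp = (pred & ~target).sum().float()
    fn = (~pred & target).sum().float()
    return tp, tn, fp, fn

def metrics(pred, target, eps=1e-6):
    tp, tn, fp, fn = confusion_counts(pred, target)
    return {
        "DICE": 2 * tp / (2 * tp + fp + fn + eps),    # Eq. [9]
        "ACC": (tp + tn) / (tp + tn + fp + fn + eps), # Eq. [10]
        "SE": tp / (tp + fn + eps),                   # Eq. [11]
        "SP": tn / (tn + fp + eps),                   # Eq. [12]
    }

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Soft Dice loss of Eq. [13], computed per sample on probability maps."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - (2 * inter + eps) / (union + eps)
```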


Results

In this study, we evaluated the performance of the Seg-SkiNet model using three public datasets from the ISIC: ISIC-2016 (19), ISIC-2017 (20), and ISIC-2018 (21). We compared the proposed Seg-SkiNet model with 6 advanced medical image segmentation networks, including U-Net (10), Attunet (11), SLT-Net (23), LGI-Net (13), FAT-Net (12), and GFA-Net (24). The quantitative results are summarized in Tables 1-3 and the visual results are summarized in Figures 4-6. Additionally, we performed Wilcoxon signed rank tests on four evaluation metrics between the Seg-SkiNet model and all the compared methods.

Table 1

Quantitative comparison between the proposed Seg-SkiNet model and other segmentation approaches in the ISIC-2016 dataset

Method ACC (%) SE (%) SP (%) DICE (%)
U-Net 94.30* 81.10* 95.50 86.05*
Attunet 94.10* 82.60* 96.83* 88.02*
SLT-Net 95.30* 91.50* 95.30* 90.70*
LGI-Net 95.30* 92.40* 96.30* 91.20*
FAT-Net 96.04* 92.59* 96.02* 91.60*
GFA-Net 96.04* 92.95 97.25 89.92*
Seg-SkiNet 96.56 94.91 96.69 93.66

The best results are shown in bold. * indicates that the Seg-SkiNet model achieved a statistically significant difference compared with the corresponding approach in the Wilcoxon signed rank tests (P<0.05). Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration; ACC, accuracy; SE, sensitivity; SP, specificity; DICE, Dice coefficient.

Table 2

Quantitative comparison between the proposed Seg-SkiNet model and other segmentation approaches in the ISIC-2017 dataset

Method ACC (%) SE (%) SP (%) DICE (%)
U-Net 92.30* 70.30* 95.10* 72.80*
Attunet 92.50* 74.10* 96.40* 79.20*
SLT-Net 95.30 82.10* 95.30* 83.20*
LGI-Net 92.30* 82.40* 96.30* 81.00*
FAT-Net 93.26* 83.92 97.25* 85.00*
GFA-Net 93.97* 81.37* 97.87* 77.75*
Seg-SkiNet 94.22 83.91 97.33 89.44

The best results are shown in bold. * indicates that the Seg-SkiNet model achieved a statistically significant difference compared with the corresponding approach in the Wilcoxon signed rank tests (P<0.05). Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration; ACC, accuracy; SE, sensitivity; SP, specificity; DICE, Dice coefficient.

Table 3

Quantitative comparison between the proposed Seg-SkiNet model and other segmentation approaches in the ISIC-2018 dataset

Method ACC (%) SE (%) SP (%) DICE (%)
U-Net 95.70* 72.90* 97.70* 77.90*
Attunet 95.80* 90.20* 97.40 80.50*
SLT-Net 96.30* 82.10* 97.30* 83.20*
LGI-Net 95.66* 90.40* 95.30* 87.29*
FAT-Net 93.26* 89.92* 97.25* 85.00*
GFA-Net 96.29* 90.75* 97.79* 90.13*
Seg-SkiNet 96.82 91.86 97.33 92.29

The best results are shown in bold. * indicates that the Seg-SkiNet model achieved a statistically significant difference compared with the corresponding approach in the Wilcoxon signed rank tests (P<0.05). Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration; ACC, accuracy; SE, sensitivity; SP, specificity; DICE, Dice coefficient.

Figure 4 Visualization comparison of the proposed Seg-SkiNet model with other segmentation methods in the ISIC-2016 dataset. Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration.
Figure 5 Visualization comparison of the proposed Seg-SkiNet model with other segmentation methods in the ISIC-2017 dataset. Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration.
Figure 6 Visualization comparison of the proposed Seg-SkiNet model with other segmentation methods in the ISIC-2018 dataset. Seg-SkiNet, adaptive deformable fusion convolutional network; ISIC, International Skin Imaging Collaboration.

Results on the ISIC-2016

As shown in Table 1, the Seg-SkiNet model demonstrated excellent quantitative metrics on the ISIC-2016 dataset (ACC =96.56%, SE =94.91%, SP =96.69%, and DICE =93.66%). Compared to other methods, the Seg-SkiNet model exhibited improvements in ACC ranging from 0.52% to 2.46%, in SE ranging from 1.96% to 13.81%, and in DICE ranging from 2.06% to 7.61%. Furthermore, the Seg-SkiNet model demonstrated a statistically significant difference compared to other approaches on most quantitative metrics in the ISIC-2016.

We randomly selected six participants in the ISIC-2016 database to visualize the segmentation differences between the Seg-SkiNet model and other methods. As depicted in Figure 4, the continuous downsampling operations in the U-Net (10) and Attunet (11) resulted in a certain degree of feature information loss, whereas the standard convolutions limited the models’ ability to extract complex boundary features of lesions, leading to poor segmentation performance. SLT-Net (23) improved segmentation performance by introducing the CSwin Transformer (25) to capture global contextual information; however, it carried the risk of introducing additional redundant information. The consecutive convolution operations in LGI-Net (13) limited the model’s capacity to capture multi-scale contextual information, resulting in suboptimal segmentation performance for small target lesions. The dual-encoder architecture in FAT-Net (12) effectively captured local features and global contextual information in lesions, but the CNN restricted the model’s ability to capture complex boundary contour features and posed the risk of introducing extra redundant information. GFA-Net (24) partially alleviated these issues, but it still performed poorly in the segmentation of small target lesions and exhibited a certain degree of feature information loss. Compared to other methods, the Seg-SkiNet model demonstrated superior segmentation performance, particularly in capturing multi-scale features beneficial for the segmentation of small target lesions and contour features for segmenting lesions with complex boundaries.

Results on the ISIC-2017

As shown in Table 2, the Seg-SkiNet model demonstrated excellent quantitative metrics on the ISIC-2017 dataset (ACC =94.22%, SE =83.91%, SP =97.33%, and DICE =89.44%). Compared to other methods, the Seg-SkiNet model exhibited enhancements in DICE ranging from 4.44% to 16.64%. For the SE and SP metrics, the Seg-SkiNet model was slightly inferior to the FAT-Net and GFA-Net models, respectively, but outperformed the remaining comparative methods. Furthermore, the Seg-SkiNet model demonstrated a statistically significant difference compared to other approaches on most quantitative metrics in the ISIC-2017.

We randomly selected six participants in the ISIC-2017 database to visualize the segmentation differences between the Seg-SkiNet model and other methods. As depicted in Figure 5, U-Net (10), Attunet (11), and LGI-Net (13) exhibited poor performance in lesion segmentation, caused by the feature information loss. SLT-Net (23) introduced additional redundant information, leading to poor segmentation performance. FAT-Net (12) and GFA-Net (24) failed to effectively capture the complex contour features of lesions. In contrast, the Seg-SkiNet model demonstrated superior segmentation performance, particularly for small target lesions and lesions with complex boundaries.

Results on the ISIC-2018

As shown in Table 3, the Seg-SkiNet model demonstrated excellent performance on the ISIC-2018 dataset (ACC =96.82%, SE =91.86%, SP =97.33%, and DICE =92.29%). Compared to other methods, the Seg-SkiNet model exhibited enhancements in ACC ranging from 0.52% to 3.56%, in SE ranging from 1.11% to 18.96%, and in DICE ranging from 2.16% to 14.39%. Furthermore, the Seg-SkiNet model demonstrated a statistically significant difference compared to other approaches on most quantitative metrics in the ISIC-2018.

We randomly selected six participants in the ISIC-2018 database to visualize the segmentation differences between the Seg-SkiNet model and other methods. As depicted in Figure 6, U-Net (10) and Attunet (11) exhibited poor performance in lesion segmentation, caused by the feature information loss. The LGI-Net (13) model demonstrated poor performance in segmenting small lesions, whereas the SLT-Net (23), FAT-Net (12), and GFA-Net (24) models failed to effectively capture the complex contour features of lesions. In contrast, the proposed Seg-SkiNet model demonstrated superior performance compared to other methods, particularly in handling small lesions and complex lesion boundaries.

Ablation study

To analyze the impact of different components within the Seg-SkiNet model on segmentation performance, we conducted an ablation study using the ISIC-2016 dataset under the following settings: (I) baseline U-Net; (II) U-Net with skip connections replaced by dense connections; (III) U-Net + dense connections + LGAF; (IV) U-Net + dense connections + LGAF + Dual-Conv; (V) Seg-SkiNet. The quantitative and qualitative results are shown in Table 4 and Figure 7, respectively. As shown in Table 4, the segmentation performance of the model improved progressively as components were added, indicating that each component contributed positively to the model's segmentation efficacy.

Table 4

Ablation study results in the ISIC-2016 dataset

Method ACC (%) SE (%) SP (%) DICE (%)
U-Net 94.3* 81.1* 95.5 86.0*
U-Net + dense 94.7* 87.5* 95.8* 88.4*
U-Net + dense + LGAF 95.6* 91.3* 96.1 90.2*
U-Net + dense + LGAF + Dual-Conv 96.1* 93.1* 96.1* 92.2*
Seg-SkiNet 96.5 94.9 96.7 93.7

The best results are shown in bold. * indicates that the Seg-SkiNet model achieved a statistically significant difference compared with the corresponding ablation setting in the Wilcoxon signed rank tests (P<0.05). ISIC, International Skin Imaging Collaboration; ACC, accuracy; SE, sensitivity; SP, specificity; DICE, Dice coefficient; LGAF, Local-Global Attention Fusion; Dual-Conv, dual-channel convolution; Seg-SkiNet, adaptive deformable fusion convolutional network.

Figure 7 Qualitative results of ablation studies under different scenarios of (I) baseline U-Net, (II) U-Net + dense, (III) U-Net + dense + LGAF, (IV) U-Net + dense + LGAF + Dual-Conv and (V) Seg-SkiNet. LGAF, Local-Global Attention Fusion; Dual-Conv, dual-channel convolution; Seg-SkiNet, adaptive deformable fusion convolutional network.

As shown in Figure 7, the U-Net model had lower segmentation accuracy. The structure of U-Net with dense connections effectively prevented feature information loss caused by pooling operations but introduced additional redundancy. The U-Net with dense connections and LGAF integration eliminated redundancy through interactive fusion of local-global contextual information; however, the model performed poorly in learning lesions with complex boundaries. The inclusion of the Dual-Conv module enabled the model to more accurately learn boundary features of lesions. Ultimately, the Seg-SkiNet model achieved optimal quantitative and qualitative results, maximizing the proximity of segmentation results to the ground truth labels.


Discussion

Skin lesions can be caused by various factors, including infections, allergic reactions, environmental exposures, or underlying systemic diseases (26-28). Dermoscopy is the most common method for examining skin lesions (2). Accurate segmentation of skin lesions can enhance the precision and efficiency of clinical diagnosis and improve the prognosis and treatment outcomes of skin diseases. However, relying on clinicians to manually annotate lesions is extremely laborious and time-consuming. To alleviate this issue, computer-aided diagnosis (CAD) has gradually penetrated the medical segmentation field (29).

Traditional skin lesion segmentation methods, such as thresholding (6,7), region growing (30), morphological methods (31), and gradient vector flow methods (32), often rely on manual feature extraction and exhibit poor robustness. The continuous advancement of artificial intelligence (AI) (33) has facilitated the penetration of DL into the field of medical image segmentation (34). However, existing DL models have shown limited performance when learning small target lesions or lesions with complex boundaries. Therefore, it is imperative to design a reasonable DL model to achieve accurate segmentation of skin lesions.

CNNs perform excellently in extracting deep features; however, they struggle with irregular targets due to the fixed size of convolutional kernels and limited receptive fields. Deformable convolutions (15) can modify the positions of the convolutional kernels through learned offsets, allowing them to accommodate various deformations in the input feature maps. This characteristic is highly suitable for extracting features from irregularly shaped skin lesions. Therefore, combining high-performance network structures such as the classical U-Net (10) segmentation network and deformable convolutions (15) endows our research with robust technical support.

In this study, the Seg-SkiNet model was proposed for the precise segmentation of skin lesions. The Seg-SkiNet model adopts an overall U-shaped architecture, consisting of three parts: Dual-Conv encoder, Multi2ER module, and LGI-FSN decoder. In each layer of the Dual-Conv encoder, we integrated Dual-Conv module and max pooling layer.

Dual-Conv module employed a parallel design with two channels: Edge feature channel and Deep feature channel. The Edge feature channel utilized deformable convolution (15) to learn lesion boundary features, whereas the Deep feature channel employed standard convolution to extract deep local features within lesions. The dual-channel design of Dual-Conv module enhanced the Seg-SkiNet model’s ability to capture features of lesions with complex shapes or indistinct boundaries. Within the Multi2ER module, ASPP (17) and ARM modules (18) were integrated for multi-scale feature fusion to learn features of small target lesions. Meanwhile, the dense connections were employed between the Dual-Conv encoder and the LGI-FSN decoder to alleviate information loss caused by pooling operations, whereas the LGAF module (13) facilitated interactive fusion of the local-global information of the feature maps in the LGI-FSN decoder. As shown in Figure 7, each module contributed to improving the segmentation accuracy to some extent. Quantitative and qualitative results on three publicly available datasets demonstrated the excellent performance of the Seg-SkiNet model. Furthermore, the Seg-SkiNet model demonstrated a statistically significant difference compared to other approaches on most quantitative metrics.

To balance the Seg-SkiNet model's performance against its computational cost, we compared the testing times of the Seg-SkiNet model and the competing methods on the ISIC-2016 database. The time required for testing by the various DL models was as follows: U-Net (6.50 s), Attunet (8.79 s), SLT-Net (14.24 s), LGI-Net (10.09 s), FAT-Net (18.59 s), GFA-Net (17.66 s), and Seg-SkiNet (19.22 s). The testing times for these models were calculated on the same set of 279 skin lesion images. Although our proposed Seg-SkiNet model exhibited a slight increase in testing time compared to the other methods, it achieved a significant improvement in segmentation accuracy, with DICE gains over the compared methods ranging from 2.06% to 7.61% (Table 1).

The Seg-SkiNet model demonstrated a certain level of structural and methodological generalizability. Firstly, the U-Net (10) architecture had been widely applied in the field of medical image segmentation (11,13,14). Secondly, the introduction of deformable convolutions (15) and modules such as Multi2ER enhanced the Seg-SkiNet’s ability to segment small lesions and lesions with complex contours. These characteristics of skin lesions were similar to those of other types of lesions, thus the Seg-SkiNet model could provide strong support for segmentation tasks in other types of medical imaging. In practical segmentation tasks, the overall framework and loss functions of the Seg-SkiNet model can be directly applied. However, considering the differences in data, preprocessing steps can be adjusted and fine-tuned according to the specific requirements of each task.

This study has the following limitations. (I) We have validated the Seg-SkiNet’s performance using only publicly available datasets. To enhance the Seg-SkiNet’s generalizability and clinical applicability, future work will involve collaboration with hospitals and medical institutions to obtain diverse clinical data. (II) This study has focused solely on the segmentation of skin lesions. In the future, techniques such as transfer learning (35) can be employed to extend the Seg-SkiNet’s application to other types of lesion segmentation tasks, including those related to pulmonary and neurological diseases. (III) The Multi2ER module within the Seg-SkiNet model requires substantial computational resources. Future work will employ more efficient and lightweight networks to optimize the utilization of computational resources.


Conclusions

In this study, a Seg-SkiNet model was proposed for the precise segmentation of skin lesions. The Seg-SkiNet was constructed by a Dual-Conv encoder that integrated a Dual-Conv module and max pooling in each layer, and a LGI-FSN decoder that integrated a LGAF module and convolution in each layer. Meanwhile, the Dual-Conv encoder and the LGI-FSN decoder were connected at the bottom through the Multi2ER bottleneck layer, integrating an ASPP module and an ARM module. Both the Dual-Conv encoder and the LGI-FSN decoder consisted of five layers, and the dense connections structure was designed between them to prevent feature information loss. The Seg-SkiNet model performed outstandingly across three publicly available datasets, demonstrating superior quantitative and qualitative metrics compared to other segmentation methods. We believe that the Seg-SkiNet model has the potential to provide guidance and assistance to clinicians in the diagnosis of skin diseases. Furthermore, the trained model parameters can be integrated into commonly used diagnostic platforms to facilitate intelligent diagnosis.


Acknowledgments

Funding: This work was supported by the National Natural Science Foundation of China (Nos. 61802330 and 61802331), Yantai City Science and Technology Innovation Development Plan (No. 2023XDRH006), Shandong Provincial Natural Science Foundation (No. ZR2020QH048), Open Project of Key Laboratory of Medical Imaging and Artificial Intelligence of Hunan Province, Xiangnan University (No. YXZN2022002) and Natural Science Foundation of Shandong Province (No. ZR2024MH072).


Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1451/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Alahmadi MD. Multiscale attention U-Net for skin lesion segmentation. IEEE Access 2022;10:59145-54.
  2. Yélamos O, Braun RP, Liopyris K, Wolner ZJ, Kerl K, Gerami P, Marghoob AA. Usefulness of dermoscopy to improve the clinical and histopathologic diagnosis of skin cancers. J Am Acad Dermatol 2019;80:365-77. [Crossref] [PubMed]
  3. Haenssle HA, Fink C, Schneiderbauer R, Toberer F, Buhl T, Blum A, et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann Oncol 2018;29:1836-42. [Crossref] [PubMed]
  4. Kharazmi P, AlJasser MI, Lui H, Wang ZJ, Lee TK. Automated Detection and Segmentation of Vascular Structures of Skin Lesions Seen in Dermoscopy, With an Application to Basal Cell Carcinoma Classification. IEEE J Biomed Health Inform 2017;21:1675-84. [Crossref] [PubMed]
  5. Duan D, Zhang H, Qiu C, Xia S. A review of active contour model based image segmentation algorithms. Chinese Journal of Biomedical Engineering 2015;34:445-54.
  6. Celebi ME, Kingravi HA, Iyatomi H, Aslandogan YA, Stoecker WV, Moss RH, Malters JM, Grichnik JM, Marghoob AA, Rabinovitz HS, Menzies SW. Border detection in dermoscopy images using statistical region merging. Skin Res Technol 2008;14:347-53. [Crossref] [PubMed]
  7. Emre Celebi M, Wen Q, Hwang S, Iyatomi H, Schaefer G. Lesion border detection in dermoscopy images using ensembles of thresholding methods. Skin Res Technol 2013;19:e252-8. [Crossref] [PubMed]
  8. Basak H, Kundu R, Sarkar R. MFSNet: A multi focus segmentation network for skin lesion segmentation. Pattern Recognition 2022;128:108673. [Crossref]
  9. Tran TT, Pham VT. Fully convolutional neural network with attention gate and fuzzy active contour model for skin lesion segmentation. Multimedia Tools and Applications 2022;81:13979-99. [Crossref]
  10. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention-MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, part III 18; Springer; 2015.
  11. Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, et al. Attention U-Net: Learning where to look for the pancreas. 2018. Available online: http://dx.doi.org/10.48550/arXiv.1804.03999
  12. Wu H, Chen S, Chen G, Wang W, Lei B, Wen Z. FAT-Net: Feature adaptive transformers for automated skin lesion segmentation. Med Image Anal 2022;76:102327. [Crossref] [PubMed]
  13. Liu L, Li Y, Wu Y, Ren L, Wang G. LGI Net: Enhancing local-global information interaction for medical image segmentation. Comput Biol Med 2023; Epub ahead of print. [Crossref] [PubMed]
  14. Dong Y, Wang L, Cheng S, Li Y. FAC-Net: Feedback Attention Network Based on Context Encoder Network for Skin Lesion Segmentation. Sensors (Basel) 2021;21:5172. [Crossref] [PubMed]
  15. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. Deformable convolutional networks. Proceedings of the IEEE International Conference on Computer Vision; 2017. Available online: http://dx.doi.org/10.48550/arXiv.1703.06211
  16. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. Available online: http://dx.doi.org/10.48550/arXiv.1709.01507
  17. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV); 2018. Available online: http://dx.doi.org/10.48550/arXiv.1802.02611
  18. Yu C, Wang J, Peng C, Gao C, Yu G, Sang N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV); 2018. Available online: http://dx.doi.org/10.48550/arXiv.1808.00897
  19. Gutman D, Codella NC, Celebi E, Helba B, Marchetti M, Mishra N, Halpern A. Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). 2016. Available online: http://dx.doi.org/10.48550/arXiv.1605.01397
  20. Codella NC, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, Kalloo A, Liopyris K, Mishra N, Kittler H. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018); 2018. doi: http://dx.doi.org/10.1109/ISBI.2018.8363547.
  21. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data 2018;5:180161. [Crossref] [PubMed]
  22. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297-302. [Crossref]
  23. Feng K, Ren L, Wang G, Wang H, Li Y. SLT-Net: A codec network for skin lesion segmentation. Comput Biol Med 2022;148:105942. [Crossref] [PubMed]
  24. Qiu S, Li C, Feng Y, Zuo S, Liang H, Xu A. GFANet: Gated Fusion Attention Network for skin lesion segmentation. Comput Biol Med 2023;155:106462. [Crossref] [PubMed]
  25. Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. Available online: http://dx.doi.org/10.48550/arXiv.2107.00652
  26. Parker ER, Mo J, Goodman RS. The dermatological manifestations of extreme weather events: a comprehensive review of skin disease and vulnerability. The Journal of Climate Change and Health 2022;8:100162. Available online: http://dx.doi.org/10.1016/j.joclim.2022.100162
  27. Faraz K, Seely M, Marano AL. The role of the environment in allergic skin disease. Curr Allergy Asthma Rep 2024;24:323-30. [Crossref] [PubMed]
  28. Kantor R, Silverberg JI. Environmental risk factors and their role in the management of atopic dermatitis. Expert Rev Clin Immunol 2017;13:15-26. [Crossref] [PubMed]
  29. Singh SK. Diagnosis of skin cancer using novel computer vision and deep learning techniques: University of Essex; 2022. Available online: http://repository.essex.ac.uk/id/eprint/33026
  30. Ma Z, Tavares JM. A Novel Approach to Segment Skin Lesions in Dermoscopic Images Based on a Deformable Model. IEEE J Biomed Health Inform 2016;20:615-23. [Crossref] [PubMed]
  31. Schmid P. editor. Lesion detection in dermatoscopic images using anisotropic diffusion and morphological flooding. Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348); 1999. doi: 10.1109/ICIP.1999.817154.
  32. Erkol B, Moss RH, Stanley RJ, Stoecker WV, Hvatum E. Automatic lesion boundary detection in dermoscopy images using gradient vector flow snakes. Skin Res Technol 2005;11:17-26. [Crossref] [PubMed]
  33. Frey CB, Osborne M. Generative AI and the future of work: a reappraisal. Brown Journal of World Affairs 2024;30.
  34. Zhang Z, Li Y, Shin BS. C2-GAN: Content-consistent generative adversarial networks for unsupervised domain adaptation in medical image segmentation. Medical Physics 2022;49:6491-504. Available online: http://dx.doi.org/10.1002/mp.15944
  35. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q. A comprehensive survey on transfer learning. Proceedings of the IEEE 2021;109:43-76. Available online: http://dx.doi.org/10.1109/JPROC.2020.3004555
Cite this article as: Nan H, Gao Z, Song L, Zheng Q. Seg-SkiNet: adaptive deformable fusion convolutional network for skin lesion segmentation. Quant Imaging Med Surg 2025;15(1):867-881. doi: 10.21037/qims-24-1451
