Attention-guided context asymmetric fusion networks for the liver tumor segmentation of computed tomography images

Fang Wang; Xue-Li Cheng; Ning-Bin Luo; Dan-Ke Su

doi:10.21037/qims-23-1747

Original Article

Fang Wang^#, Xue-Li Cheng^#, Ning-Bin Luo, Dan-Ke Su

Department of Radiology, Guangxi Medical University Cancer Hospital, Nanning, China

Contributions: (I) Conception and design: NB Luo, DK Su; (II) Administrative support: NB Luo, DK Su; (III) Provision of study materials or patients: F Wang; (IV) Collection and assembly of data: F Wang, XL Cheng; (V) Data analysis and interpretation: F Wang, XL Cheng; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work.

Correspondence to: Ning-Bin Luo, PhD; Dan-Ke Su, PhD. Department of Radiology, Guangxi Medical University Cancer Hospital, 71 Hedi Road, Nanning 530021, China. Email: luoningbin2012@hotmail.com; sudanke33@sina.com.

Background: Liver tumor segmentation based on medical imaging is playing an increasingly important role in liver tumor research and individualized therapeutic decision-making. However, the accuracy of automatic liver tumor segmentation remains a challenge. Therefore, we aimed to develop a novel deep neural network to improve the automatic segmentation of liver tumors.

Methods: This paper proposes the attention-guided context asymmetric fusion network (AGCAF-Net), combining attention guidance and fusion context modules on the basis of a residual neural network for the automatic segmentation of liver tumors. According to the attention-guided context block (AGCB), the feature map is first divided into multiple small blocks, the local correlation between features is calculated, and then the global nonlocal fusion module (GNFM) is used to obtain the global information between pixels. Additionally, the context pyramid module (CPM) and asymmetric semantic fusion module (AFM) are used to obtain multiscale features and resolve the feature mismatch during feature fusion, respectively. Finally, we used the liver tumor segmentation benchmark (LiTS) dataset to verify the efficiency of our designed network.

Results: Our results showed that AGCAF-Net with AFM and CPM is effective in improving the accuracy of liver tumor segmentation, with the Dice coefficient increasing from 82.5% to 84.1%. The segmentation results of liver tumors by AGCAF-Net were superior to those of several state-of-the-art U-Net methods, with a Dice coefficient of 84.1%, a sensitivity of 91.7%, and an average symmetric surface distance of 3.52.

Conclusions: AGCAF-Net produces segmentations that match the reference labels more closely, thus effectively improving the accuracy of liver tumor segmentation.

Keywords: Liver tumor segmentation; deep learning; attentional mechanisms; context pyramid module (CPM); asymmetric fusion module

Submitted Dec 09, 2023. Accepted for publication May 14, 2024. Published online Jun 27, 2024.

Introduction

With advances in precision medicine technology, medical image segmentation is increasingly being employed in medical research and clinical therapeutic decision-making (1,2). It can provide more information for diagnosis and guide clinical treatment. Traditional manual segmentation methods depend on operators to identify the location and manually outline the boundary of the region of interest (ROI) on each slice of an image (3). This process is subjective and time-consuming and, more importantly, can result in significant interoperator variability, especially in the segmentation of tumors on computed tomography (CT) images, even among experienced radiologists. Therefore, the automated and precise segmentation of medical images is highly desirable for tumor-related research and treatment. In addition, because artificial intelligence applies the same criteria to every case, automatic segmentation can efficiently and robustly overcome the problems faced by traditional manual segmentation.

Traditional medical image segmentation methods, such as threshold segmentation, region growing (4,5), level set (6), and active contour (7-9), mostly depend on handcrafted features and energy function settings. These approaches are poorly suited to multicategory segmentation tasks because they rely on low-level features, cannot capture high-level semantic information, and are sensitive to noise. With the advent of the big data era, deep learning techniques have developed rapidly and are now widely used in image segmentation. Deep learning can achieve higher accuracy than traditional machine learning by learning high-level features through multilayer cascaded convolutions and by reducing the number of parameters that must be learned from the dataset (10).

Fully convolutional networks (FCNs), which can recover the category to which each pixel belongs from abstract features, have extended image-level classification to pixel-wise classification (11), representing a breakthrough in the field of semantic segmentation. However, the final segmentation output of an FCN is smaller than the original image, and the result is relatively coarse. Therefore, Chen et al. (12) proposed a new FCN-based model called DeepLab. This model uses a fully connected conditional random field to refine FCN segmentation results by combining it with the responses at the final deep convolutional neural network layer. It optimizes the coarse edges according to the pixel distribution of the target edge and the prediction result. In addition, to segment liver tumors on multiphase enhanced CT images, Sun et al. (13) designed a multichannel FCN (MC-FCN), which trains a separate network for each phase and joins their high-level features. Compared with previous models, this model achieved superior accuracy and robustness. Lei et al. (14) also developed a residual FCN for liver tumor segmentation, solving the problem of gradient vanishing that occurs with increasing network depth.

Although FCN has achieved excellent performance in image segmentation, its training efficiency still needs to be improved. Based on FCN, Ronneberger et al. (15) developed the U-shaped network (U-Net) for biological image segmentation, which has also been broadly applied for segmenting medical images. Although U-Net operates in a similar fashion to FCN, the two differ in feature fusion: U-Net concatenates (splices) feature maps along the channel dimension, while FCN employs point-by-point addition. Moreover, U-Net has a classic encoder-decoder structure, which improves the accuracy of the segmentation. Nevertheless, two-dimensional (2D) convolution cannot capture interslice information from three-dimensional (3D) medical imaging modalities, such as magnetic resonance imaging (MRI) and CT. Therefore, to achieve more accurate segmentation of the liver, Sun et al. (16) further developed a new deep FCN structure named 3D Unit-C2, which makes use of the 3D spatial information of CT images and effectively combines the shallow-layer and deep-layer features of the liver region. Moreover, Sun et al. further optimized the network by using a 3D conditional random field, such that the accuracy of segmentation at the boundary was further improved compared with that of previously developed methods. Gu et al. (17) also proposed a U-Net–based model using a context encoder to improve the efficiency of feature extraction and reported significantly superior segmentation of 2D medical images. In addition, Zhao et al. (18) developed a multiscale, supervised 3D U-Net for the segmentation of kidney and renal tumor CT images, combining deep supervision with an exponential logarithmic loss function to accelerate the training of 3D U-Net and optimize segmentation performance.

The attention mechanism has gradually been incorporated into image segmentation due to its advantages of high efficiency, strong versatility, and limited computation demand, especially in the detection and segmentation of small objects. Oktay et al. (19) applied the attention mechanism to the skip connections of U-Net, proposing Attention U-Net for medical image segmentation. The use of the attention mechanism within a limited model size effectively improves the segmentation accuracy. Alom et al. (20) added residual networks and recurrent convolutional neural networks to U-Net, which could provide better performance with the same number of network parameters. Li et al. (21) connected two U-Net networks in parallel, in which the first learns and records the feature labels of the segmented images, and this information is then used to guide the training of the second U-Net to ensure that the segmentation information of liver tumors is always retained. Duan et al. (22) integrated attention gates into standard convolutional neural network models, such as the Visual Geometry Group (VGG) network and U-Net, with the resulting model being able to learn how to suppress features of irrelevant regions in the input image while highlighting those features that may be valuable for specific tasks.

On the basis of previous work in this field, we propose a deep neural network called attention-guided context asymmetric fusion network (AGCAF-Net), combining attention guidance and fusion context modules in a residual neural network (ResNet) for the automatic segmentation of liver tumors. In this network, according to the attention-guided context block (AGCB), the feature map is first divided into multiple small blocks, the local correlation between features is calculated, and then the global nonlocal fusion module (GNFM) is used to obtain the global information between pixels. Moreover, the context pyramid module (CPM) is used to obtain multiscale features, and the multiscale attention-guided context modules and initial feature maps are fused to obtain more accurate feature representations. In addition, to solve the feature mismatch during feature fusion, we propose an asymmetric semantic fusion module (AFM), which uses asymmetric filtering after feature fusion. To evaluate the segmentation ability of the proposed AGCAF-Net for liver tumors, we conducted experiments comparing its performance with that of other currently used segmentation methods.

The main contributions of this work are summarized as follows:

Based on the ResNet network, a new deep neural network named AGCAF-Net is proposed, which combines attention guidance and fusion context modules to extract more detailed feature information while suppressing irrelevant information, thus improving the efficiency and robustness of the network training.
Our segmentation method replaces the convolutional module of the convolutional network with a residual module. Through the residual connection and batch normalization (BN) of the data, the segmentation ability of the model is improved.
We used the liver tumor segmentation benchmark (LiTS) dataset to test the proposed model, which verified the effectiveness of the CPM and AFM modules in improving the accuracy of the model, yielding a Dice coefficient of more than 84%.

Methods

This study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

In this work, we developed a novel network, AGCAF-Net, for automatic segmentation of liver tumors. The network is mainly divided into three modules (Figure 1): AGCB, AFM, and CPM. The backbone of our framework is ResNet with three downsampling layers and three upsampling layers. The main body of the network is composed of CPM and AFM. The CPM module in the network is composed of several AGCB modules of different scales, and the AGCB module includes the GNFM.

Figure 1 Attention-guided context asymmetric fusion network. AFM, asymmetric semantic fusion module; CPM, context pyramid module.

Overall network structure

For a given input image, the network classifies each pixel (23) as tumor or nontumor and outputs a segmentation result of the same size as the input image. The input image first passes through three downsampling layers to obtain a feature map $X \in \mathbb{R}^{W \times H}$, from which the CPM aggregates the feature information to generate a feature map C. Second, AFM is used in the upsampling stage to obtain a more precise target. The final output image, which has the same size as the original input image, is the segmentation result map of the liver tumor.

AGCB

AGCB is the basic module of the network, the structure of which is illustrated in Figure 2. The upper branch of this module represents the global semantic associations (24), and the lower branch represents the local semantic associations. The deepest layer in the encoder-decoder framework captures the semantic information together with the target location. However, the resolution of the deep features is low, leading to poor feature representation. Therefore, the AGCB with global and local semantic associations can be used in the deepest layer to adjust the receptive field and enhance the discriminative ability for the target location. The following sections detail the local association, the global association, and the fusion of the two.

Figure 2 Schematic diagram of the attention-guided context block. The upper branch represents global semantic associations, and the lower branch represents the local semantic associations. GNFM, global nonlocal fusion module; PA, pixel attention.

Local association

The module divides the input feature map $X'$ into s×s patches (fragments) of size w×h, where w and h are calculated as $w = \mathrm{ceil}(\frac{W}{s})$ and $h = \mathrm{ceil}(\frac{H}{s})$. The module then calculates the dependencies between pixels within each patch through a nonlocal block, which is easily embedded into neural networks, with all patches sharing weights during calculation (25). Finally, the output patches are regathered to form a new local correlation feature map P.

This module confines the receptive field of the network to a local range: the dependencies between pixels in each local area are used to gather pixels belonging to the same category, and the probability of the target appearing is then calculated. This method can not only obtain discriminative results for the local region but also eliminate the effect of structural noise within the patch on the tumor pixels. In addition, calculating correlations only locally conserves computing resources, thereby accelerating the training and inference of the network.
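The patch partition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map is a single-channel nested list, `split_into_patches` and `merge_patches` are hypothetical helper names, and the nonlocal block that would operate on each patch is omitted.

```python
import math

def split_into_patches(feature_map, s):
    """Split an H x W map (list of rows) into an s x s grid of patches.

    Each patch has at most h = ceil(H / s) rows and w = ceil(W / s) columns;
    patches on the bottom/right border are smaller when H or W is not
    divisible by s.
    """
    H, W = len(feature_map), len(feature_map[0])
    h, w = math.ceil(H / s), math.ceil(W / s)
    patches = []
    for i in range(s):
        row_of_patches = []
        for j in range(s):
            rows = feature_map[i * h:(i + 1) * h]
            row_of_patches.append([row[j * w:(j + 1) * w] for row in rows])
        patches.append(row_of_patches)
    return patches, (h, w)

def merge_patches(patches):
    """Regather the grid of patches into one map (inverse of the split)."""
    merged = []
    for row_of_patches in patches:
        for r in range(len(row_of_patches[0])):
            line = []
            for patch in row_of_patches:
                line.extend(patch[r])
            merged.append(line)
    return merged

# A 4 x 6 toy map split into a 2 x 2 grid of 2 x 3 patches.
fm = [[r * 6 + c for c in range(6)] for r in range(4)]
patches, (h, w) = split_into_patches(fm, 2)
```

In the network, a shared-weight nonlocal block would process each patch between the split and the merge; here the merge simply restores the original map.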

Global association

If only local association is used, Gaussian noise in a background patch and targets in a tumor patch may produce similar responses, which is not conducive to distinguishing the target position (26). To solve this problem, we propose a GNFM, as shown by the upper dotted line in Figure 2. Our assumption is that the network can use the global association information from this module to eliminate the influence of similar parts and noise on the result. This information is then combined with the local association information to accurately determine and distinguish the tumor pixels. The detailed process is as follows: (I) the input map is given; (II) adaptive pooling is used to extract the features of each patch and obtain pooled features of size s×s, in which each pixel represents the features of one patch; (III) the context information of each patch is obtained through the nonlocal block; and (IV) the information between channels is integrated through pixel attention to acquire accurate attention guidance and obtain the final guide map G.
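The pooling step of the global branch can be illustrated as follows. This is a sketch under assumptions: the text says only "adaptive pooling", so average pooling and the bin boundaries used here (floor for the start, ceiling for the end) are one common convention rather than a confirmed detail, and `adaptive_avg_pool` is a hypothetical helper. The nonlocal block and pixel attention that follow are not shown.

```python
import math

def adaptive_avg_pool(x, s):
    """Pool an H x W map (list of rows) down to s x s by averaging bins.

    Output pixel (i, j) summarizes one patch of the input, so each pixel of
    the pooled map stands for the features of one patch. Requires s <= H, W.
    """
    H, W = len(x), len(x[0])
    pooled = []
    for i in range(s):
        r0, r1 = (i * H) // s, math.ceil((i + 1) * H / s)
        row = []
        for j in range(s):
            c0, c1 = (j * W) // s, math.ceil((j + 1) * W / s)
            vals = [x[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(vals) / len(vals))
        pooled.append(row)
    return pooled

# A 4 x 4 toy map pooled to 2 x 2: one value per 2 x 2 patch.
x = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = adaptive_avg_pool(x, 2)
```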

Fusion of the feature map and guide map

We propose two methods for fusing the feature map P and the guide map G (27). The first method upsamples G to W×H using interpolation, denoted I(·), and then multiplies the feature map P element-wise with the upsampled guide map I(G), as shown in Eq. [1]. The second method does not employ interpolation upsampling but instead multiplies each patch in P directly by the pixel at the corresponding position in G, as shown in Eq. [2].

$A_{e} = \beta \cdot \delta (W P \otimes I (G)) + X'$ [1]

$A_{p} = \beta \cdot \delta (W [P_{1} G_{1}, P_{2} G_{2}, \dots, P_{s^{2}} G_{s^{2}}]) + X'$ [2]

where β is a learnable parameter, and δ is the activation function, for which the rectified linear unit (ReLU) is used. This process aims to smooth the edges and generate a natural guide map. Moreover, to achieve a more efficient representation, the output is scaled by the learnable parameter β and added to the input $X'$, which allows the network to adaptively select more effective segmentation features.
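The relationship between the two fusion variants can be checked numerically. The sketch below makes several simplifying assumptions that are not taken from the paper: the weight W is treated as the identity, the maps are single-channel nested lists, and I(·) is nearest-neighbour interpolation. Under these assumptions, and when H and W are divisible by s, Eq. [1] and Eq. [2] produce identical outputs; a smoother interpolation (e.g., bilinear) would make Eq. [1] differ near patch borders.

```python
import math

def relu(v):
    return max(0.0, v)

def upsample_nearest(G, H, W):
    """I(G): nearest-neighbour upsampling of an s x s guide map to H x W."""
    s = len(G)
    return [[G[r * s // H][c * s // W] for c in range(W)] for r in range(H)]

def fuse_interpolated(P, G, X, beta):
    """Eq. [1] with W = identity: A_e = beta * relu(P * I(G)) + X."""
    H, W = len(P), len(P[0])
    IG = upsample_nearest(G, H, W)
    return [[beta * relu(P[r][c] * IG[r][c]) + X[r][c] for c in range(W)]
            for r in range(H)]

def fuse_patchwise(P, G, X, beta):
    """Eq. [2] with W = identity: patch (i, j) of P is scaled by G[i][j]."""
    H, W = len(P), len(P[0])
    s = len(G)
    h, w = math.ceil(H / s), math.ceil(W / s)
    return [[beta * relu(P[r][c] * G[r // h][c // w]) + X[r][c]
             for c in range(W)] for r in range(H)]

P = [[1.0, -1.0, 2.0, 0.5],
     [0.0, 3.0, -2.0, 1.0],
     [1.5, 1.0, 1.0, -1.0],
     [2.0, 0.0, 0.5, 4.0]]
G = [[0.5, 1.0],
     [2.0, 0.0]]
X = [[0.0] * 4 for _ in range(4)]
A_e = fuse_interpolated(P, G, X, beta=0.5)
A_p = fuse_patchwise(P, G, X, beta=0.5)
```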

CPM

The structure of the CPM used for automatic extraction of liver tumors is shown in Figure 3. The input feature map of this module is X. After 1×1 convolution is applied for dimensionality reduction (28), AGCBs of different scales are applied in parallel, and the results can be represented as $A = \{A^{S_{1}}, A^{S_{2}}, \dots\}$, where S is the vector of scales. Finally, the set of aggregated feature maps A is fused with the original map, and the output is obtained through 1×1 convolution to integrate the channel information, thus forming a context pyramid consisting of AGCBs of varying scales.

Figure 3 Schematic diagram of the context pyramid module. AGCB, attention-guided context block; CPM, context pyramid module; Conv, convolution.

Asymmetric fusion module

Drawing on the convolutional block attention module (CBAM) (29) and asymmetric contextual modulation (ACM) (30), we use an AFM that fuses low-level and deep-level semantics, the structure of which is shown in Figure 4.

Figure 4 Schematic diagram of the asymmetric semantic fusion module. Conv, convolution; BN, batch normalization; ReLU, rectified linear unit; AFM, asymmetric fusion module; AvgPool, average pooling.

In AFM, the low-level semantics $X_{l}$ and the deep-level semantics $X_{d}$ are the inputs. BN is applied to $X_{d}$ to normalize the distribution of feature maps during training, which alleviates the gradient vanishing problem. Moreover, a ReLU is added for nonlinear transformation, which can improve the regularization of the model and provide an identity map to facilitate the optimization of the model. In addition, since the low-level semantics $X_{l}$ contain a large amount of target location information, we apply the pixel-attention mechanism expressed in Eq. [3]. For the deep-level semantics $X_{d}$, we first perform dimensionality reduction with 1×1 convolution to obtain more informative features and then adopt the channel attention mechanism to extract the crucial channels, as expressed in Eq. [4].

$g_{pa} (X) = \sigma (W_{2} \delta (W_{1} X))$ [3]

$g_{ca} (X) = \sigma (W_{2} \delta (W_{1} P (X)))$ [4]

$X_{afm} = (X_{l} + \delta (W X_{d})) \otimes g_{pa} (X_{l}) \odot g_{ca} (X_{d})$ [5]

After the features are summed, the pixel-attention gate $g_{pa}$ and the channel-attention gate $g_{ca}$ are applied as separate constraints through Eq. [5], which can be used to eliminate the feature mismatch that arises during fusion. In Eq. [5], $\otimes$ is element-wise multiplication, $\odot$ is vector-tensor multiplication, and σ is the sigmoid function; P(·) in Eq. [4] is the pooling operation (AvgPool in Figure 4).
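How the two gates in Eq. [5] broadcast over a feature tensor can be shown with a toy example. This is a heavily simplified sketch, not the module itself: tensors are nested lists of shape C×H×W, the 1×1 convolutions W, W_1, and W_2 are collapsed to scalar weights (with the pixel gate summing over channels), BN is omitted, P(·) is taken to be global average pooling, and the assumption that the pixel gate has one value per pixel and the channel gate one value per channel is illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def relu(v):
    return max(0.0, v)

def pixel_attention(X, w1=1.0, w2=1.0):
    """g_pa (Eq. [3]): one gate in (0, 1) per pixel, shape H x W."""
    C, H, W = len(X), len(X[0]), len(X[0][0])
    return [[sigmoid(w2 * relu(w1 * sum(X[c][r][k] for c in range(C))))
             for k in range(W)] for r in range(H)]

def channel_attention(X, w1=1.0, w2=1.0):
    """g_ca (Eq. [4]): one gate in (0, 1) per channel after average pooling."""
    gates = []
    for channel in X:
        pooled = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        gates.append(sigmoid(w2 * relu(w1 * pooled)))
    return gates

def afm(X_l, X_d, w=1.0):
    """Eq. [5]: (X_l + relu(w * X_d)), gated per pixel and per channel."""
    pa = pixel_attention(X_l)
    ca = channel_attention(X_d)
    C, H, W = len(X_l), len(X_l[0]), len(X_l[0][0])
    return [[[(X_l[c][r][k] + relu(w * X_d[c][r][k])) * pa[r][k] * ca[c]
              for k in range(W)] for r in range(H)] for c in range(C)]

# One channel, one pixel: low-level value 0.0, deep-level value 2.0.
out = afm([[[0.0]]], [[[2.0]]])
```

The point of the example is the asymmetry: the spatial gate is computed from the low-level input only, and the channel gate from the deep-level input only, while both act on the summed features.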

Loss function

To accurately predict the category of each pixel, the network needs to be trained with a suitable loss. Conventional loss functions, such as cross-entropy loss, cannot adequately deal with the class imbalance caused by the large number of small targets in the images. Therefore, we employ the Dice loss function (31) as the segmentation loss:

$L_{seg} = 1 - \frac{1}{K} \sum_{i = 1}^{K} \frac{2 | y_{i} \cap y_{i}^{p} |}{| y_{i} | + | y_{i}^{p} |}$ [6]

$L_{total} = \alpha L_{roi} + (1 - \alpha) L_{seg}$ [7]

where K denotes the number of categories, $y_{i} \in \mathbb{R}^{W \times H}$ is the segmentation label, and $y_{i}^{p} \in \mathbb{R}^{W \times H}$ represents the predicted value. $L_{roi}$ is the auxiliary loss, and the hyperparameter α controls the ratio of the auxiliary loss to the Dice loss $L_{seg}$; it is set so as to obtain the minimal loss value.
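Eqs. [6] and [7] translate directly into code. The sketch below is illustrative only: masks are flattened lists, the intersection is computed as an element-wise product (the usual soft relaxation, which is an assumption here), a small `eps` is added to avoid division by zero (not part of Eq. [6]), and the auxiliary loss is passed in as a number because its definition is not given in the text.

```python
def dice_loss(labels, preds, eps=1e-7):
    """Eq. [6]: one minus the mean Dice overlap across K categories.

    labels, preds: lists of K flattened masks with values in [0, 1].
    """
    K = len(labels)
    total = 0.0
    for y, p in zip(labels, preds):
        intersection = sum(a * b for a, b in zip(y, p))
        total += (2.0 * intersection + eps) / (sum(y) + sum(p) + eps)
    return 1.0 - total / K

def total_loss(l_roi, l_seg, alpha):
    """Eq. [7]: weighted sum of the auxiliary loss and the Dice loss."""
    return alpha * l_roi + (1.0 - alpha) * l_seg

# One category, four pixels: the label has two tumor pixels, the
# prediction finds one of them, so Dice = 2*1 / (2 + 1) = 2/3.
l_seg = dice_loss([[1, 1, 0, 0]], [[1, 0, 0, 0]])
```

Because the Dice term is a ratio of overlap to total foreground, a large background region does not dilute it, which is the motivation given above for preferring it over cross-entropy.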

Datasets and evaluation metrics

The LiTS is a dataset that was specifically created for the liver and liver tumor segmentation competition jointly organized by the Medical Image Computing and Computer Assisted Intervention (MICCAI) and IEEE International Symposium on Biomedical Imaging (ISBI) in 2017 (32). It includes contrast-enhanced abdominal CT scan data of 201 patients from 7 different hospitals. From this, we used 131 labeled data sets, which are not only comprehensively labeled but also contain the different numbers and sizes of tumors in each sample. This is suitable for evaluating the performance of AGCAF-Net in liver tumor segmentation. The data sample is shown in Figure 5: the red part is the patient’s liver, and the green part is the tumor.

Figure 5 Example of liver and tumor segmentation. (A) Sample from the LiTS dataset, (B) LiTS data label sample, (C) liver and tumor segmentation map, (D) liver segmentation map, and (E) tumor segmentation map. LiTS, liver tumor segmentation benchmark.

Before conducting the experiment, we first converted the file from nii format to png format with a resolution of 256×256 according to the Z axis and then performed liver and tumor label separation according to the dataset labels. This process was as follows: in the label (Figure 5B), we marked the liver as 1 and the tumor as 2. Then, after loading the image, we set the pixels that are not 1 in each image to 0 to obtain the label that only contains the liver (Figure 5D). Similarly, we set the other pixels that are not 2 to 0 to obtain the label that only contains the tumor (Figure 5E). The results after label separation are shown in Figure 5.
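The label separation step can be sketched as below, following the description literally (liver pixels keep the value 1, tumor pixels keep the value 2). The nested-list input and the function name `separate_labels` are illustrative assumptions; file loading and the nii-to-png conversion are not shown.

```python
def separate_labels(label):
    """Split a combined label slice (0 = background, 1 = liver, 2 = tumor)
    into a liver-only label and a tumor-only label."""
    liver = [[v if v == 1 else 0 for v in row] for row in label]
    tumor = [[v if v == 2 else 0 for v in row] for row in label]
    return liver, tumor

label = [[0, 1, 1],
         [1, 2, 1],
         [0, 2, 0]]
liver, tumor = separate_labels(label)
```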

We divided the 131 cases of the LiTS dataset into a training set, validation set, and test set in an 8:1:1 ratio. After model training, the validation set was used to verify the model and guide model optimization. Finally, the test set was used for the final evaluation of the model. The experiment was implemented in PyTorch with the Python 3.7 programming language, with an input size of 256×256×1. We applied stochastic gradient descent as the optimization method, with a momentum of 0.9 and a weight decay coefficient of 0.0004. The initial learning rate was 0.05, and we adopted the polynomial ("poly") learning rate decay strategy during training. The equation is as follows:

$lr = base\_lr \times \left(1 - \frac{epoch}{num\_epoch}\right)^{power}$ [8]

where lr is the new learning rate; base_lr is the initial learning rate; epoch and num_epoch are the current and maximum iteration (epoch) numbers, respectively; and power is set to 0.9.
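Eq. [8] is a one-line function. In the sketch below, the base learning rate of 0.05 and the power of 0.9 come from the text, whereas the total of 100 epochs is a hypothetical value chosen only for illustration, since the number of training epochs is not stated.

```python
def poly_lr(base_lr, epoch, num_epoch, power=0.9):
    """Eq. [8]: polynomial decay from base_lr at epoch 0 to 0 at num_epoch."""
    return base_lr * (1.0 - epoch / num_epoch) ** power

# Learning rate at a few points of a hypothetical 100-epoch run.
schedule = [poly_lr(0.05, e, 100) for e in (0, 25, 50, 75, 100)]
```

With a power below 1, the rate stays close to its initial value for most of training and falls off more steeply near the end.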

The evaluation metrics of this experiment were pixel-wise metrics based on the confusion matrix (33), including precision, recall, Dice similarity coefficient (Dice), and F_measure, as well as a geometric matching metric, the average symmetric surface distance (ASD). The formulae for these metrics are expressed as Eqs. [9-13].

$\mathrm{Precision} = \frac{TP}{TP + FP}$ [9]

$\mathrm{Recall} = \frac{TP}{TP + FN}$ [10]

$\mathrm{Dice} = \frac{2 | TP |}{2 | TP | + | FP | + | FN |}$ [11]

$F_{measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ [12]

In these equations, TP (true positive) is the number of tumor pixels that were correctly segmented; FP (false positive) is the number of background pixels missegmented as tumor; and FN (false negative) is the number of tumor pixels missegmented as background. Precision and recall influence each other and refer to the percentage of correctly classified tumor pixels among all predicted tumor pixels and among all labeled tumor pixels, respectively. The higher the precision and recall are, the closer the segmentation result of the algorithm is to the standard label and the better the segmentation performance of the algorithm.
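Eqs. [9-12] can be computed from a pair of binary masks as sketched below; the flattened-list representation and the function names are illustrative, and nonzero denominators are assumed. Note that, on a single set of counts, Eq. [11] and Eq. [12] are algebraically identical, so the differing F_measure and Dice values reported in the tables presumably reflect a different aggregation or thresholding procedure that is not specified in the text.

```python
def confusion_counts(label, pred):
    """Count TP, FP, FN for binary masks given as flat lists of 0/1."""
    tp = sum(1 for y, p in zip(label, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(label, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(label, pred) if y == 1 and p == 0)
    return tp, fp, fn

def pixel_metrics(tp, fp, fn):
    """Eqs. [9-12]; assumes tp + fp > 0 and tp + fn > 0."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, dice, f_measure

label = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
precision, recall, dice, f_measure = pixel_metrics(*confusion_counts(label, pred))
```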

Furthermore, we used F_measure to summarize the balance between precision and recall, treating the two as equally important. In medical image segmentation, the Dice coefficient is the most important evaluation metric. As it incorporates both precision and recall, numerous studies of medical image segmentation use it as the final evaluation metric. The Dice coefficient ranges from 0% to 100%, with 0% indicating that the label and the segmentation result of the algorithm have no overlap and 100% indicating that they overlap completely. The ASD is calculated as follows:

$\mathrm{ASD} = \frac{1}{| L_{G} | + | L_{P} |} \times \left(\sum_{p_{1} \in L_{G}} \min_{q_{1} \in L_{P}} \| p_{1} - q_{1} \| + \sum_{p_{2} \in L_{P}} \min_{q_{2} \in L_{G}} \| p_{2} - q_{2} \|\right)$ [13]

where L_G and L_P represent the contours of the algorithm segmentation result and the standard label, respectively; p1 and q2 are pixels on the contour L_G; and p2 and q1 are pixels on the contour L_P. ASD measures the average distance between L_G and L_P, with a smaller ASD value indicating a higher segmentation accuracy.
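Eq. [13] can be evaluated by brute force on two small contours. In this sketch the contours are assumed to be already extracted as lists of (row, column) pixel coordinates, distances are Euclidean in pixel units, and the function name `asd` is illustrative; contour extraction and any conversion to physical units are not shown.

```python
import math

def asd(contour_g, contour_p):
    """Eq. [13]: average symmetric surface distance between two contours."""
    def sum_of_min_distances(src, dst):
        # For every point in src, distance to its nearest point in dst.
        return sum(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in dst)
                   for p in src)
    total = (sum_of_min_distances(contour_g, contour_p)
             + sum_of_min_distances(contour_p, contour_g))
    return total / (len(contour_g) + len(contour_p))

# Two parallel two-pixel contours that are three pixels apart.
L_G = [(0, 0), (1, 0)]
L_P = [(0, 3), (1, 3)]
distance = asd(L_G, L_P)
```

The brute-force search is quadratic in the number of contour points, which is acceptable for a 256×256 slice but would normally be replaced by a distance transform for larger volumes.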

Results

The effect of ResNet depth

The residual mechanism used by ResNet can effectively mitigate the problems of vanishing and exploding gradients, but different depths of ResNet have different effects on the image segmentation results. In our study, we compared the performance of the ResNet-18 and ResNet-34 network models, with the results of the two models being shown in Table 1. Furthermore, we randomly selected six cases from the test set for evaluation, with Figure 6 showing the performance of the ResNet-18 and ResNet-34 models in segmenting liver tumors on these cases.

Table 1

Operational results of ResNet-18 and ResNet-34

Metric	ResNet-18	ResNet-34
Precision	0.858	0.878
Recall	0.826	0.784
Dice	0.841	0.828
AUC	0.941	0.921

AUC, area under the curve.

Figure 6 Tumor segmentation examples based on ResNet-18 and ResNet-34. (A) Original images of six different patients in the test set of the LiTS dataset, (B) standard calibrated label of liver tumor, (C) the image segmentation results of ResNet-18, and (D) the image segmentation results of ResNet-34. LiTS, liver tumor segmentation benchmark.

As shown in Table 1, as the network becomes deeper, precision increases slightly from 0.858 to 0.878, recall decreases from 0.826 to 0.784, and the Dice and area under the curve (AUC) values also decrease slightly. The Dice coefficient, which is the most important evaluation metric in medical image segmentation, drops from 0.841 for ResNet-18 to 0.828 for ResNet-34, suggesting that the segmentation result of ResNet-18 is better than that of ResNet-34. This may be because, although the complexity of the model is increased, the size of the LiTS dataset is limited, which may lead to network overfitting.

In addition, Figure 6 shows that the tumor segmentation maps produced by the two models are roughly in line with the given calibrated labels and, more importantly, that the models can also identify the approximate contour and location of small tumors, thus demonstrating good fit with the calibrated labels. Compared with that of ResNet-34, the segmentation result of ResNet-18 is more similar to the given calibrated label. Nonetheless, the ability of this network to process image edge information still needs to be improved, and the segmentation results differ in some respects from the given standard.

The effect of CPM and AFM

Based on the results of the above-described experiment, we selected ResNet-18 as the basic network structure for the ablation study to verify the rationality of AGCAF-Net and to evaluate whether adding CPM and AFM to the backbone is effective for tumor segmentation. The results of the ablation study for the different models are shown in Table 2, and the tumor segmentation maps are shown in Figure 7.

Table 2

Results of the ablation experiment among the different models

Backbone	CPM	AFM	F_measure	Dice
ResNet-18	–	–	0.7463	0.825
ResNet-18	√	–	0.7564	0.832
ResNet-18	–	√	0.7528	0.827
ResNet-18	√	√	0.7876	0.841

CPM, context pyramid module; AFM, asymmetric semantic fusion module.

Figure 7 The tumor segmentation map of ResNet-18. (A) Original input image and (B) standard calibrated label. (C) The segmentation map generated by ResNet-18 without AFM and CPM added to the backbone. (D) The segmentation map generated by ResNet-18 with CPM but without AFM added to the backbone. (E) The segmentation map generated by ResNet-18 with AFM but without CPM added to the backbone. (F) The segmentation map generated by ResNet-18 with AFM and CPM added to the backbone. AFM, asymmetric semantic fusion module; CPM, context pyramid module.

As shown in Table 2 and Figure 7, the Dice coefficient increased after the CPM module was added to the backbone in comparison to the baseline ResNet-18 network, which is mainly because the CPM module is able to acquire multiscale features, allowing for more adequate training of the model. Similarly, after AFM was added to the backbone, the Dice coefficient of the new model was improved compared to that of the baseline ResNet-18 network. These results indicate that both CPM and AFM can increase the focus of the model on the liver tumor region, especially for more accurate identification of small targets. After both the CPM module and the AFM module were added, both evaluation metrics, F_measure and Dice coefficient, reached their best values, which indicates that under the same network structure, adding the CPM and AFM modules can indeed improve the performance of the model. Therefore, we selected the ResNet-18 network containing the CPM and AFM modules, named AGCAF-Net, as the training model.

The effect of reduction ratios

Although dimensionality reduction can decrease redundant information and accelerate network training, it may also lead to information loss. In AGCAF-Net, dimensionality reduction is applied in the CPM module and in the GNFM, with the pair of reduction ratios being denoted as (r_c, r_n). To determine the best setting for segmenting small targets, we compared the F_measure and Dice coefficient at different reduction ratios, with the product r_c × r_n restricted to 64 or 128; the results are shown in Table 3. Different reduction ratios cause small fluctuations in the evaluation metrics, but the overall change is not significant.

Table 3

Accuracy of the different dimensionality reduction ratios

Reduction ratio	F_measure	Dice
(4, 16)	0.6203	0.827
(8, 8)	0.6207	0.829
(16, 4)	0.6497	0.841
(8, 16)	0.6375	0.840
(16, 8)	0.6198	0.818

Comparison to state-of-the-art methods

In order to further analyze the performance of AGCAF-Net for tumor segmentation, we compared its metrics with those of the U-Net, Trans U-Net, Attention U-Net, and Dense U-Net tumor segmentation algorithms. Furthermore, small tumors, large tumors, and sporadically distributed insular tumors were selected for in-depth analyses, and the output results are shown in Figure 8. In the figure, the red box indicates the part of the segmentation result that differs between methods, and the green box indicates the part of the segmentation result that is incorrect. Although other methods could segment liver tumors, their segmentation of tumor images demonstrated varying degrees of difference with the standard calibrated label. Small tumors could be misclassified in nonliver regions when tumor images were segmented via U-Net or the Trans U-Net method. The shape of the tumor within the circular red box suggests that the other methods miss more of the edges in the segmentation of large tumors as compared to AGCAF-Net. In the segmentation of distributed insular tumors, the other methods undersegmented or did not segment small tumors. However, the size, shape and contour, and location of the tumor images segmented by AGCAF-Net were extremely similar to those of standard calibrated labels. Overall, AGCAF-Net demonstrated superior accuracy and high precision in liver tumor segmentation.

Figure 8 The tumor segmentation map of ResNet-18. (A) Original input image, (B) the given calibrated label, (C) the segmentation map generated by U-Net, (D) the segmentation map generated by Trans U-Net, (E) the segmentation map generated by Attention U-Net, (F) the segmentation map generated by Dense U-Net, and (G) the segmentation map generated by AGCAF-Net. The red box indicates the part of the segmentation result that differs between the methods, and the green box indicates the part of the segmentation result that is incorrect. AGCAF-Net, attention-guided context asymmetric fusion network.

Furthermore, sensitivity, Dice coefficient, and ASD were selected to quantitatively analyze the segmentation accuracy of AGCAF-Net and the four other methods, with the results shown in Table 4. AGCAF-Net outperformed U-Net (15), Trans U-Net (34), Attention U-Net (19), and Dense U-Net (35) on all three metrics. The results show that AGCAF-Net can detect the location of the tumor more accurately than the other methods can and can provide segmentation with higher precision.

Table 4

Comparison of liver tumor segmentation performance on the LiTS dataset of the different models

Method	Dice	Sensitivity	ASD
U-Net	0.643	0.897	7.06
Trans U-Net	0.701	0.901	6.52
Attention U-Net	0.691	0.903	5.84
Dense U-Net	0.727	0.914	3.64
Proposed method (AGCAF-Net)	0.841	0.917	3.52

LiTS, liver tumor segmentation benchmark; ASD, average symmetric surface distance; AGCAF-Net, attention-guided context asymmetric fusion network.
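
For readers who want to reproduce overlap metrics of the kind reported in Table 4, the following is a minimal sketch of the Dice coefficient and sensitivity for binary masks, using their standard definitions (Dice = 2TP / (2TP + FP + FN), sensitivity = TP / (TP + FN)). The function names, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the sketch does not reproduce the evaluation protocol used for the LiTS experiments, and ASD is omitted because it requires surface extraction.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives.

    pred and truth are equally sized 2D masks (lists of rows) of 0/1 values.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def dice_coefficient(pred, truth):
    """Dice = 2TP / (2TP + FP + FN); 1.0 if both masks are empty (assumed)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def sensitivity(pred, truth):
    """Sensitivity = TP / (TP + FN); 1.0 if truth is empty (assumed)."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 1.0


# Toy example: a 4-pixel tumor, with one missed pixel and one false alarm.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

print(dice_coefficient(pred, truth))  # 3 TP, 1 FP, 1 FN
print(sensitivity(pred, truth))
```

In practice these counts would be computed with an array library over whole volumes, but the arithmetic is the same.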

Discussion

Liver tumor segmentation based on medical images is becoming increasingly relevant to quantitative analysis in radiology and surgery, as it can assist radiologists and surgeons in achieving the most accurate diagnosis and inform individualized therapeutic decision-making. However, the accuracy of tumor segmentation remains suboptimal. To improve the accuracy of the automatic segmentation of liver tumors, we propose a new method, AGCAF-Net, which is based on ResNet. In our new model, the convolutional module of the convolutional network is replaced by a residual module, with the residual connections and BN being responsible for processing the data. In addition, we include AGCB, GNFM, CPM, and AFM sequentially in the network, which increases the amount of feature information and suppresses irrelevant information to improve the efficiency and robustness of network training. The sensitivity and Dice coefficient of AGCAF-Net were 0.917 and 0.841, respectively, both of which were higher than those of the other methods. The segmentation experiments demonstrated that the model has high segmentation accuracy. Furthermore, the analysis of the shape (Figure 8) illustrated that our method can accurately segment single and multiple small tumors without misclassification problems. In addition, our method segmented large tumor contours that were highly similar to the gold standard contours. The results indicate that our method can provide superior performance in liver tumor segmentation compared to other existing methods (15,19,34,35).

However, the proposed model has certain limitations that should be noted. First, the segmentation accuracy of AGCAF-Net for small tumors needs to be further improved in medical images with low contrast and abundant noise. Additional postprocessing after the rough segmentation of the image could be considered to obtain more refined segmentation results. Second, although our model achieved good accuracy, converting the data from the 3D nii format to 2D png images to obtain the network input resulted in the loss of some interslice information. In future work, we will endeavor to use random cropping and random rotation to increase the data diversity.
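
As an illustration of the augmentation mentioned above, the sketch below applies one random crop and one random rotation jointly to a 2D slice and its label mask, so that the two stay aligned. It is a hypothetical example rather than the planned implementation: the function names are invented, images are plain nested lists, and the rotation is restricted to multiples of 90 degrees (an assumption made here to avoid interpolation).

```python
import random


def rotate90(image, k):
    """Rotate a 2D list counterclockwise by k quarter turns."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counterclockwise turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def augment_pair(image, mask, crop_h, crop_w, rng):
    """Apply the same random crop and rotation to an image and its mask."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)   # inclusive bounds
    left = rng.randint(0, width - crop_w)
    k = rng.randrange(4)                    # 0, 90, 180, or 270 degrees

    def transform(array):
        rows = array[top:top + crop_h]
        cropped = [row[left:left + crop_w] for row in rows]
        return rotate90(cropped, k)

    return transform(image), transform(mask)


# Toy 4x4 "slice" and a mask marking the pixels with intensity above 7.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = [[1 if value > 7 else 0 for value in row] for row in image]

rng = random.Random(0)  # seeded only to make the example repeatable
aug_image, aug_mask = augment_pair(image, mask, 2, 3, rng)
print(aug_image)
print(aug_mask)
```

Drawing the crop offset and rotation once and reusing them for both arrays is the key design point: augmenting the image and the label independently would break their pixel-wise correspondence.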

Conclusions

We have developed AGCAF-Net for the liver tumor segmentation of CT images. In this model, via the introduction of the AGCB module, the feature map is divided into multiple patches to calculate the local correlation of the features. Subsequently, the global correlation is calculated via the GNFM to obtain the global information between semantics, and finally the CPM fuses the context module and the initial feature map to obtain a more accurate feature representation. In addition, the AFM is used to resolve the feature mismatch that occurs during feature fusion. In the experiment on the LiTS dataset, the AGCAF-Net segmentation method yielded a Dice coefficient of 0.841, indicating that it performs better in liver tumor segmentation than other state-of-the-art methods. In the future, we will consider performing additional postprocessing to obtain more refined segmentation results and attempt to segment images in 3D through other networks, such as residual attention U-Net, to retain more image information and achieve better segmentation performance.

Acknowledgments

Funding: This study received funding from grants of the Guangxi Science and Technological Development Project (No. Guike AB18126031), the Guangxi Self-financing Scientific Research Subject (No. Z2016484), and the Guangxi Clinical Research Center for Medical Imaging Construction (No. Guike AD20238096).

Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-23-1747/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88. [Crossref] [PubMed]
Fang K, He B, Liu L, Hu H, Fang C, Huang X, Jia F. UMRFormer-net: a three-dimensional U-shaped pancreas segmentation method based on a double-layer bridged transformer network. Quant Imaging Med Surg 2023;13:1619-30. [Crossref] [PubMed]
Bellon E, Feron M, Maes F, Hoe LV, Delaere D, Haven F, Sunaert S, Baert AL, Marchal G, Suetens P. Evaluation of manual vs semi-automated delineation of liver lesions on CT images. Eur Radiol 1997;7:432-8. [Crossref] [PubMed]
Akram MU, Khanum A, Iqbal K. An automated System for Liver CT Enhancement and Segmentation. GVIP 2010;10:17-22.
Jayanthi M, Kanmani B. Extracting the Liver and Tumor from Abdominal CT Images. 2014 Fifth International Conference on Signal and Image Processing, Bangalore, India, 2014:122-5.
Li Y, Zhao YQ, Zhang F, Liao M, Yu LL, Chen BF, Wang YJ. Liver segmentation from abdominal CT volumes based on level set and sparse shape composition. Comput Methods Programs Biomed 2020;195:105533. [Crossref] [PubMed]
Li G, Chen X, Shi F, Zhu W, Tian J, Xiang D. Automatic Liver Segmentation Based on Shape Constraints and Deformable Graph Cut in CT Images. IEEE Trans Image Process 2015;24:5315-29. [Crossref] [PubMed]
Ciecholewski M. Automatic liver segmentation from 2D CT images using an approximate contour model. J Sign Process Syst 2014;74:151-74.
Heimann T, van Ginneken B, Styner MA, Arzhaeva Y, Aurich V, Bauer C, et al. Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Trans Med Imaging 2009;28:1251-65. [Crossref] [PubMed]
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations 2015. ArXiv: 1409.1556.
Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640-51. [Crossref] [PubMed]
Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille A. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv: 1412.7062, 2014.
Sun C, Guo S, Zhang H, Li J, Ma S, Li X. Liver Lesion Segmentation in CT Images with MK-FCN. 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 2017:1794-8.
Lei B, Kim J, Kumar A, Feng D. Automatic Liver Lesion Detection using Cascaded Deep Residual Networks. arXiv: 1704.02703, 2017.
Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, 2015:234-41.
Sun M, Xu J, Ma W, Zhang Y. A New Fully Convolutional Network for 3D Liver Region Segmentation on CT Images. Chinese Journal of Biomedical Engineering 2018;37:385-93.
Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2019;38:2281-92. [Crossref] [PubMed]
Zhao W, Zeng Z. Multi Scale Supervised 3D U-Net for Kidney and Tumor Segmentation. arXiv: 1908.03204, 2019.
Oktay O, Schlemper J, Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla N, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning Where to Look for the Pancreas. arXiv: 1804.03999, 2018.
Alom MZ, Yakopcic C, Nasrin MS, Taha TM, Asari VK. Breast Cancer Classification from Histopathological Images with Inception Recurrent Residual Convolutional Neural Network. J Digit Imaging 2019;32:605-17. [Crossref] [PubMed]
Li S, Tso GKF, He K. Bottleneck feature supervised U-Net for pixel-wise liver and tumor segmentation. Expert Syst Appl 2020;145:113131.
Duan J, Bello G, Schlemper J, Bai W, Dawes TJW, Biffi C, de Marvao A, Doumoud G, O'Regan DP, Rueckert D. Automatic 3D Bi-Ventricular Segmentation of Cardiac Images by a Shape-Refined Multi-Task Deep Learning Approach. IEEE Trans Med Imaging 2019;38:2151-64. [Crossref] [PubMed]
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:770-8.
Dong H, Song K, He Y, Xu J, Yan Y, Meng Q. PGA-Net: Pyramid feature fusion and global context attention network for automated surface defect detection. IEEE Transactions on Industrial Informatics 2020;16:7448-58.
Wang X, Girshick R, Gupta A, He K. Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018:7794-803.
Nie J, Qu S, Wei Y, Zhang L, Deng L. An infrared small target detection method based on multiscale local homogeneity measure. Infrared Phys Technol 2018;90:186-94.
Zhang D, Wei S, Li S, Wu H, Zhu Q, Zhou G. Multi-modal graph fusion for named entity recognition with targeted visual guidance. Proceedings of the AAAI Conference on Artificial Intelligence 2021;35:14347-55.
Zuo B, Lee F, Chen Q. An efficient U-shaped network combined with edge attention module and context pyramid fusion for skin lesion segmentation. Med Biol Eng Comput 2022;60:1987-2000. [Crossref] [PubMed]
Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision (ECCV), 2018:3-19.
Dai Y, Wu Y, Zhou F, Barnard K. Asymmetric contextual modulation for infrared small target detection. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021:950-9.
Milletari F, Navab N, Ahmadi SA. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 2016:565-71.
Bilic P, Christ P, Li HB, Vorontsov E, Ben-Cohen A, Kaissis G, et al. The Liver Tumor Segmentation Benchmark (LiTS). Med Image Anal 2023;84:102680. [Crossref] [PubMed]
Zhang F, Zhao Y, Luo B, Pan Y, Liao M. Cross-layer connected network with adaptive attention mechanism for 3D multi-organ and tumor segmentations from CT. Opt Laser Technol 2023;167:109662.
Ummadi V. U-Net and its variants for Medical Image Segmentation: A short review. arXiv: 2204.08470, 2022.
Dong R, Pan X, Li F. Dense U-net-based semantic segmentation of small objects in urban remote sensing images. IEEE Access 2019;7:65347-56.

Cite this article as: Wang F, Cheng XL, Luo NB, Su DK. Attention-guided context asymmetric fusion networks for the liver tumor segmentation of computed tomography images. Quant Imaging Med Surg 2024;14(7):4825-4839. doi: 10.21037/qims-23-1747
