Original Article

Gross-tumor-volume segment-anything model for medical 2D images integrating gross tumor volume-minimal feature integration technology for lung cancer segmentation

Chen Yi1, Yuxin Li1, Shuhang Cao1, Qiliang Xiong1, Shaofeng Jiang1, Huaiwen Zhang2

1Department of Biomedical Engineering, Key Laboratory of Non-destructive Testing of Education, Nanchang Hangkong University, Nanchang, China; 2Department of Radiation Oncology, Jiangxi Cancer Hospital & Institute, Jiangxi Clinical Research Center for Cancer, The Second Affiliated Hospital of Nanchang Medical College, Nanchang, China

Contributions: (I) Conception and design: S Jiang, H Zhang; (II) Administrative support: S Jiang; (III) Provision of study materials or patients: H Zhang; (IV) Collection and assembly of data: C Yi, Y Li, S Cao, Q Xiong, H Zhang; (V) Data analysis and interpretation: C Yi, Y Li, S Cao, Q Xiong, S Jiang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Dr. Shaofeng Jiang, PhD. Department of Biomedical Engineering, Key Laboratory of Non-destructive Testing of Education, Nanchang Hangkong University, No. 696, Fenghe South Avenue, Honggutan, Nanchang 330063, China. Email: jsphone@163.com; Dr. Huaiwen Zhang, MD. Department of Radiation Oncology, Jiangxi Cancer Hospital & Institute, Jiangxi Clinical Research Center for Cancer, The Second Affiliated Hospital of Nanchang Medical College, No. 519, Beijing East Road, Nanchang 330029, China. Email: 1761580890@qq.com.

Background: Accurate segmentation of lung cancer gross tumor volume (GTV) on computed tomography (CT) is critical for radiotherapy planning yet remains difficult due to low tumor-tissue contrast, small target size, and high intratumoral heterogeneity. This study aimed to develop and validate an automatic method—a GTV segment-anything model generative adversarial network (GTV-SAMGAN)—for accurate, robust, and clinically efficient GTV segmentation on CT, with particular emphasis on small, low-contrast, and heterogeneous lesions.

Methods: We propose GTV-SAMGAN, built upon the SAM for medical 2D images (SAM-Med2D), integrating a newly developed GTV-minimal feature integration technology (GTV-MFIT) module with a GAN-based training scheme. The performance of GTV-SAMGAN was evaluated on a local clinical dataset and the public non-small cell lung cancer-radiomics (NSCLC-Radiomics) dataset (https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics). We compared the proposed model against representative baselines (including SAM-Med2D and SwinU-Net) using the Dice coefficient, sensitivity, and specificity.

Results: On the local dataset, GTV-SAMGAN achieved a Dice coefficient of 83.74%, a sensitivity of 84.28%, and a specificity of 99.98%, outperforming the other models. Compared to SwinU-Net, GTV-SAMGAN increased the Dice coefficient and sensitivity by 10.71% and 10.15%, respectively; compared to SAM-Med2D, it increased the Dice coefficient and sensitivity by 7.69% and 7.75%, respectively. On the NSCLC-Radiomics dataset, GTV-SAMGAN achieved a Dice coefficient of 82.92% and a sensitivity of 82.25%, representing an improvement over SAM-Med2D of 6.68% and 9.61%, respectively.

Conclusions: By coupling SAM-Med2D with GTV-MFIT and GAN training, GTV-SAMGAN substantially improves lung cancer GTV segmentation, particularly for small and heterogeneous tumors, thereby enhancing the precision and efficiency of radiotherapy planning.

Keywords: Lung cancer; gross tumor volume (GTV); generative adversarial network (GAN); GTV segment-anything model generative adversarial network (GTV-SAMGAN); segmentation


Submitted Feb 14, 2025. Accepted for publication Oct 27, 2025. Published online Dec 11, 2025.

doi: 10.21037/qims-2025-370


Introduction

Lung cancer is one of the most prevalent malignant tumors in the world, with approximately 2.2 million new cases and 1.7 million related deaths annually, making it the second leading cause of cancer-related mortality (1). Recent data from the World Health Organization and China’s National Cancer Center indicate that lung cancer is the foremost cause of cancer mortality globally and in China (2,3). Despite substantial advances in multimodal therapy over the past decade, lung cancer remains one of the most common causes of cancer mortality, responsible for roughly one-quarter of all cancer deaths (4). The primary treatment options for lung cancer include surgery, chemotherapy, radiotherapy, and immunotherapy, with approximately 60–70% of patients requiring radiotherapy (5). Specifically, stereotactic body radiation therapy (SBRT)—primarily used for treating early-stage non-small cell lung cancer (NSCLC)—precisely delivers high doses of radiation to small tumor targets via multiple conformal coplanar and noncoplanar beams over several sessions (6,7). Accurate definition of the gross tumor volume (GTV) is essential for the precise delineation of the clinical target volume (CTV) and planning target volume (PTV) (8).

In conventional radiotherapy, tumor regions are usually contoured by oncologists on computed tomography (CT) scans (9,10). However, hand-drawn contours are prone to interobserver variation in GTV, which can compromise treatment planning and introduce both random and systematic errors, ultimately influencing the distribution of the delivered dose (11,12). With the rapid progress made in medical imaging and computing, automated segmentation methods have therefore become increasingly relevant for clinical SBRT.

Accurate contouring of targets and organs at risk (OARs) is fundamental to radiotherapy planning. Despite the development of consensus statements and standardized rules, manual target outlining still relies heavily on the clinician’s expertise, is labor-intensive, prone to variability, and can diminish treatment efficacy (13). The adoption of artificial intelligence (AI)—especially convolutional neural networks (CNNs)—has improved both the efficiency and reproducibility of contouring, owing to their resilience against image noise, blurring, and contrast fluctuations (14,15). In radiation oncology, CNN models executed on GPUs allow for the rapid segmentation of GTVs and normal structures. For instance, Rhee et al. (16) reported a Dice similarity coefficient (DSC) of 0.86 for pelvic CTV segmentation with a CNN-based automatic contouring system, Men et al. (17) reported a DSC of 0.809 for primary nasopharyngeal carcinoma using a deep deconvolutional network, and Wang et al. (18) employed a patient-specific adaptive CNN (A-net) trained on magnetic resonance imaging (MRI) and GTV labels, achieving a mean DSC of 0.82±0.10. For NSCLC on CT, Zhang et al. (19) used a modified residual network (ResNet) for GTV segmentation, yielding a mean DSC of 0.73. Collectively, these studies indicate the strong potential of deep learning to enhance radiotherapy accuracy and productivity.

Despite the rapid adoption of deep learning-based methods for lung cancer GTV contouring, their use in radiotherapy planning remains notably constrained (20). The major obstacles include poor tumor-background contrast in medical images, which hinders reliable localization and segmentation, and the small volumetric proportion of tumors in CT scans, which introduces class imbalance and degrades model performance. Additionally, the substantial internal heterogeneity of lung tumors, including cystic regions, calcifications, and necrosis, adds to the complexity of automatic segmentation. To address these challenges, we developed the GTV segment-anything model generative adversarial network (GTV-SAMGAN), which integrates a GTV-minimal feature integration technology (GTV-MFIT) preprocessing module with a GAN to enhance segmentation performance in complex medical imaging scenarios. By improving the model’s capacity to identify tumor regions, GTV-SAMGAN can offer notable improvements in GTV delineation for lung cancer radiotherapy.

Related work

Medical image segmentation has progressed markedly with U-Net and follow-on designs. Since its introduction in 2015 (21), U-Net’s simple encoder-decoder layout has served as a foundational paradigm. Subsequent research has primarily focused on enhancing U-Net’s segmentation performance. In 2018, UNet++ (22) introduced dense skip pathways and deep supervision within the encoder-decoder framework, reducing the semantic disparity between the two stages and markedly boosting segmentation accuracy beyond the original U-Net and several variants. In 2021, no-new U-Net (nnU-Net) (23) advanced a more systematic architectural optimization—deepening the network, replacing batch normalization with group normalization, and adding axial attention in the decoder—which delivered state-of-the-art performance at the Medical Image Computing and Computer Assisted Intervention (MICCAI) BrainLesion Workshop, with Dice scores of 88.35%, 88.78%, and 93.19% for the enhancing tumor, tumor core, and whole tumor, respectively.

That same year, SwinU-Net was proposed, which introduced Transformer blocks into the U-Net paradigm, yielding a fully Transformer-based variant. Exploiting the U-shaped encoder-decoder with skip links, SwinU-Net (24) surpassed conventional CNNs on multi-organ and cardiac segmentation benchmarks. In 2024, extended long short-term memory UNet (xLSTM-UNet) (25) combined the long-range dependency modeling of xLSTM with U-Net’s hierarchical structure, demonstrating superior segmentation accuracy and efficiency in both 2D and 3D biomedical image analysis. Structure-enhanced attention network (SEA-Net) (26) was subsequently introduced, incorporating squeeze-and-excitation and attention modules to form a spiral closed-loop pathway; it particularly excels in preserving small target features and outperforms traditional U-Net in tasks such as brain MRI and peripheral blood smear segmentation.

A significant breakthrough in the field of image segmentation occurred in 2023 with the introduction of the segment-anything model (SAM) (27). Trained on an extensive dataset of over 1 billion masks and 11 million images, SAM demonstrates exceptional generalization capabilities. SAM was further adapted for medical image segmentation in the form of SAM for medical 2D images (SAM-Med2D) (28). This was achieved through the use of a medical dataset containing approximately 4.6 million images and 19.7 million masks and the incorporation of interactive prompts such as bounding boxes, points, and masks. This model exhibited robust performance across nine datasets in the MICCAI 2023 challenge, significantly improving the precision and efficiency of medical image segmentation. Collectively, these advancements have not only solidified the technical foundation of medical image segmentation but also paved the way for future research and clinical applications.


Methods

GTV-SAMGAN

In this model (Figure 1), we adopt a GAN architecture for training. First, the input data are processed with the GTV-MFIT module, which adapts the original 512×512 pixel images and their corresponding labels to an appropriate size. Specifically, for the local dataset (obtained from Jiangxi Cancer Hospital), based on the spatial distribution characteristics of the tumor regions, the GTV-MFIT module determines a minimal cropping size of 256×256 pixels that fully covers all critical areas. For the NSCLC-Radiomics dataset [obtained from The Cancer Imaging Archive (TCIA)], the corresponding size is 342×342 pixels. This strategy effectively reduces the redundant background while ensuring the integrity of the segmentation targets. The preprocessed data are then fed into the generator for training. For the generator, we use the SAM-Med2D network (referred to as SAM-med), while the discriminator is based on a simple neural network. This structure is designed to enhance the efficiency of model training and improve the quality of the generated images.

Figure 1 The architecture of the GTV-SAMGAN model. GTV-SAMGAN, gross tumor volume segment-anything model generative adversarial network; MLP, multilayer perceptron.

GTV-MFIT

To optimize the input for our pretrained generator model while preserving critical tumor information, we developed GTV-MFIT for intelligent image preprocessing. The methodology proceeds as follows: initially, all training image annotations are analyzed to precisely localize the tumor regions. For each image, a tight rectangular bounding box is computed that encloses the annotated tumor area. Subsequently, through comprehensive cross-image analysis, the smallest common rectangular region that encompasses all critical tumor areas across the entire dataset is determined. This approach achieves two key objectives: (I) it focuses computational resources on regions of clinical relevance, thereby enhancing feature extraction efficiency; and (II) it significantly reduces background interference and model complexity while ensuring no loss of diagnostically important information. Notably, for the local dataset, the entire thoracic region fits comfortably within a 256×256 pixel window, guaranteeing that the resizing process does not compromise segmentation accuracy. The empirically determined bounding box coordinates are [194, 161, 450, 417] for the local dataset and [129, 94, 471, 436] for the NSCLC-Radiomics dataset. Figure 2 provides a comprehensive visualization of this optimized preprocessing pipeline and its outcomes.
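As a concrete illustration of this procedure, the following minimal Python sketch (illustrative naming; not the released implementation) derives the per-image tumor bounding boxes from binary label masks and the smallest common crop window across the dataset:

```python
import numpy as np

def minimal_common_crop(label_masks):
    """Given 2D binary GTV label masks (H x W numpy arrays), return the
    smallest rectangle [x_min, y_min, x_max, y_max] covering every
    annotated tumor region in the dataset."""
    x_min, y_min = np.inf, np.inf
    x_max, y_max = -np.inf, -np.inf
    for mask in label_masks:
        ys, xs = np.nonzero(mask)            # pixel coordinates of the tumor
        if xs.size == 0:                      # skip slices without a tumor
            continue
        x_min, x_max = min(x_min, xs.min()), max(x_max, xs.max())
        y_min, y_max = min(y_min, ys.min()), max(y_max, ys.max())
    return int(x_min), int(y_min), int(x_max), int(y_max)

def crop(image, box):
    """Crop a 2D image (or its label mask) to the common bounding box."""
    x0, y0, x1, y1 = box
    return image[y0:y1 + 1, x0:x1 + 1]
```

Applied to the two datasets, this procedure yields windows whose reported coordinates span exactly 256×256 pixels (local dataset) and 342×342 pixels (NSCLC-Radiomics).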

Figure 2 GTV-MFIT. GTV-MFIT, gross tumor volume-minimal feature integration technology.

Generator

In this study, we utilized the SAM-Med2D architecture, which is specifically optimized for medical imaging, as the generator. SAM-Med2D is an adaptation and fine-tuning of the standard SAM framework, configured to meet the unique requirements of medical image processing.

Adaptation and optimization of the image encoder

SAM-Med2D integrates a pretrained Vision Transformer (ViT) as the image encoder, which processes high-resolution input images and reduces the feature map to 1/16 of its original size. Given the complexity of medical image encoding and the associated high computational cost, we chose to freeze the parameters of the original image encoder during fine-tuning and embedded adapter modules within each transformer block. These adapters adjust the feature maps in both the channel and spatial dimensions to better accommodate the unique characteristics of medical images. In the channel dimension, global average pooling compresses the spatial resolution of the input feature map, followed by linear layers for compression and reconstruction with a compression ratio of 0.25; the input feature map is then reweighted using weights derived from a sigmoid activation function. In the spatial dimension, convolutional layers and transposed convolutional layers downsample and upsample the feature map while maintaining the same number of channels. Skip connections incorporated after each adapter layer ensure the integrity of information.
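The following PyTorch sketch illustrates an adapter block of this kind. It is a reconstruction from the description above (the activation functions, kernel sizes, and exact placement of the skip connections are our assumptions), not the SAM-Med2D source code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative channel + spatial adapter inserted after a frozen
    transformer block; layer details are assumptions."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        hidden = max(1, int(channels * ratio))
        # Channel branch: global pooling, compress/reconstruct, sigmoid gating.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )
        # Spatial branch: strided conv downsample, transposed conv upsample,
        # channel count unchanged.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.channel_fc(x.mean(dim=(2, 3)))    # global average pooling
        x = x + x * w.view(b, c, 1, 1)             # channel-wise skip connection
        x = x + self.spatial(x)                    # spatial skip connection
        return x
```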

Fine-tuning of the prompt encoder and mask decoder

The prompt encoder of SAM-Med2D supports various prompt types, including points, bounding boxes, and masks, and we specifically modified it to enhance its applicability to medical imaging tasks. Sparse prompts such as points and bounding boxes are each represented by positional encoding combined with learned embeddings. For dense prompts, low-resolution feature maps generated after the model’s initial iteration serve as mask prompts; these are downsampled via convolutional embeddings, and their channel dimension is subsequently mapped to 256 via 1×1 convolutions. The mask decoder receives the outputs of both the image encoder and the prompt encoder, fusing the information through a cross-attention mechanism. The structure of the mask decoder remains unchanged during training, while its parameters are continuously updated. Each prompt can generate multiple predicted masks; during backpropagation, the mask with the highest intersection over union (IoU) is selected to update the gradients, thereby improving the model’s accuracy and efficiency.
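As a brief illustration of this mask-selection rule, the following sketch (with assumed tensor shapes) picks the candidate mask whose IoU against the ground truth is highest:

```python
import torch

def select_best_mask(masks, gt, eps=1e-6):
    """masks: (K, H, W) candidate binary masks; gt: (H, W) ground truth.
    Returns the candidate with the highest IoU, i.e., the one used for
    the gradient update."""
    inter = (masks * gt).sum(dim=(1, 2))
    union = masks.sum(dim=(1, 2)) + gt.sum() - inter
    iou = inter / (union + eps)
    return masks[iou.argmax()], iou.max()
```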

Discriminator

In this study, the discriminator was implemented with a PyTorch-based neural network architecture specifically designed to process single-channel images of size 256×256 pixels. The network first converts the image into a 1D vector of 65,536 dimensions. Structurally, the network consists of three fully connected layers, which sequentially reduce the dimensionality to 512, then to 256, and finally to 9. To maintain gradient flow during training and prevent vanishing gradients, each fully connected layer is followed by a leaky rectified linear unit (LeakyReLU) activation function with a negative slope of 0.2. At the final stage of the network, a sigmoid activation function is used to output a vector containing 9 elements, which represents the probability of the image being real.
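A direct PyTorch rendering of this discriminator, under the dimensions stated above, could look like the following sketch (initialization and other details are assumptions):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the mask discriminator described in the text:
    flattened 256x256 input -> 512 -> 256 -> 9 with LeakyReLU(0.2)."""
    def __init__(self, in_pixels=256 * 256, out_dim=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (B, 1, 256, 256) -> (B, 65536)
            nn.Linear(in_pixels, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, out_dim), nn.Sigmoid(),      # element-wise "real" probabilities
        )

    def forward(self, x):
        return self.net(x)
```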

Training

The model was built with PyTorch and executed on a single 8 GB RTX 4060 GPU (Nvidia, Santa Clara, CA, USA). Our environment included CUDA 11.8 (Nvidia), PyTorch 2.3.1, Windows 11 (Microsoft Corp., Redmond, WA, USA), and VSCode 1.90.1.0. Training used the Adam optimizer [learning rate (LR) = 1×10−4] for 100 epochs with a batch size of 4. For the local dataset, GTV-SAMGAN completed training in about 8 hours.
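An alternating adversarial update consistent with these settings can be sketched as follows. The function arguments (`generator`, `discriminator`, the data loader, and the non-adversarial segmentation loss) are placeholders, and the assumption that the generator returns both a predicted mask and a predicted IoU follows the SAM-style decoder described above; the loss terms correspond to those defined in the Loss function section below.

```python
import torch

def train_gtv_samgan(generator, discriminator, loader, seg_loss_fn,
                     epochs=100, lr=1e-4, device="cuda"):
    """Illustrative alternating GAN training loop (hypothetical names);
    seg_loss_fn covers the non-adversarial generator terms of Eq. [5]."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(epochs):
        for image, gt_mask in loader:                      # batch size 4
            image, gt_mask = image.to(device), gt_mask.to(device)
            pred_mask, pred_iou = generator(image)

            # Discriminator step: real masks scored toward 1, generated toward 0.
            d_loss = (1 - discriminator(gt_mask).mean()) \
                     + discriminator(pred_mask.detach()).mean()
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Generator step: segmentation terms plus the adversarial Mean_Loss.
            mean_loss = 1 - discriminator(pred_mask).mean()
            g_loss = seg_loss_fn(pred_mask, pred_iou, gt_mask) + mean_loss
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```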

Patients

This retrospective study was approved by the Ethics Committee of Jiangxi Cancer Hospital & Institute (date: 2023-08-01; approval No. 2023KY154). The requirement for written informed consent was waived due to the use of deidentified data and minimal risk. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. For the public NSCLC-Radiomics dataset, data were de-identified and are publicly available, and thus additional consent was not required.

In this study, we used two datasets. The first was a clinical dataset from a local hospital (Jiangxi Cancer Hospital & Institute), which included 112 patients with lung cancer who underwent radiotherapy, with a median age of 64 years (ranging from 40 to 89 years). The data were collected from 2015 to 2023. All transverse 2D CT image data were divided into a training set and a validation set in a 4:1 ratio, with the training set consisting of 1,139 images and the validation set consisting of 302 images, each with a resolution of 512×512.

Another dataset, named the NSCLC-Radiomics dataset (29), consists of CT scans from 422 patients with NSCLC. The data are stored in Digital Imaging and Communications in Medicine Radiotherapy Structure Set (DICOM RTSTRUCT) files, and annotations of the GTV are provided in the form of mask labels in DICOM segmentation files, which manually delineate the 3D tumor volumes. This dataset includes 51,230 slices, of which only 7,438 contain lung tumors. The 422 samples were divided into training and validation sets in a 4:1 ratio, with slices containing lung tumors extracted as samples for both sets. The training set included 6,058 images, while the validation set included 1,380 images.

CT scanning parameters of the local dataset

The images used for analysis were obtained from CT scans performed for radiotherapy planning. The enhanced CT scans were conducted with a SOMATOM Definition AS 20-slice CT simulator system (Siemens Healthineers, Erlangen, Germany). The scanning parameters included a tube voltage of 120 kVp and a tube current of 165 mAs. The scanning range extended from the upper border of the midneck to the upper abdomen (covering the esophagus and entire lungs). The acquired images had a pixel size of 512×512, with a slice thickness of 5 mm and a field of view (FOV) of 400 mm. After scanning, the reconstructed CT images were transferred to the specialized radiotherapy 3D treatment planning system, Pinnacle3 version 9.16 (Philips, Amsterdam, the Netherlands). After wavelet transform filtering preprocessing, features for each tissue were extracted.

Loss function

Generator loss function

We designed a specialized hybrid loss function for the generator, G_Loss, which integrates focal loss [FL(pt)], Dice loss (DiceLoss), mask IoU loss (MaskIoULoss), and mean loss (Mean_Loss). This combination aims to optimize the generator’s performance in predicting segmentation masks, ensuring that the loss function handles segmentation tasks efficiently and accurately.

Focal loss [FL(pt)] primarily addresses class imbalance and is particularly useful in scenarios with a large number of easily classified samples, such as object detection. This loss function adjusts the cross-entropy loss by introducing a modulation factor that reduces the weight of easily classified samples, thereby enhancing the model’s focus on hard-to-classify samples. The formula for FL(pt) is as follows:

\[ FL(p_t) = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right) \tag{1} \]

where pt is the probability predicted by the model for the true class, αt is the weighting factor for the positive class samples, and γ is the focusing parameter. Adjusting the value of γ changes the contribution of easily classified samples to the loss function.

Dice loss is widely used in image segmentation, especially for datasets with class imbalance. It is derived from the Dice coefficient, which quantifies the spatial overlap between the predicted segmentation mask and the ground truth (GT) mask. Maximizing this overlap helps the Dice loss improve segmentation accuracy. The formula for Dice loss is as follows:

\[ DiceLoss = 1 - Dice \tag{2} \]

MaskIoULoss evaluates image segmentation performance and is specifically designed to optimize the IoU between the predicted mask and the GT mask. IoU is a metric that measures the degree of overlap between two sets and is widely used in image segmentation and object detection. MaskIoULoss directly optimizes the IoU, making it closely related to performance metrics in practical applications. The formula for MaskIoULoss is as follows:

\[ MaskIoULoss = \mathrm{mean}\left(\left(IoU - IoU_{pred}\right)^2\right) \tag{3} \]

where IoU represents the IoU between the predicted mask and the GT mask, and IoU_pred is the IoU value predicted by the model for that mask.

Mean_Loss calculates the average probability of positive and negative samples generated by the generator, and it is typically used in GANs. In this context, the generator’s goal is to deceive the discriminator into classifying the generated data (in this case, masks) as real data. The formula for Mean_Loss is as follows:

\[ Mean\_Loss = 1 - \frac{1}{m}\sum_{i=1}^{m} v_i \tag{4} \]

where m is the number of samples generated by the generator, and vi is the probability that the i-th generated sample is classified as real by the discriminator.

The total loss of the generator (G_Loss) is obtained as the weighted sum of all the aforementioned losses. The specific formula is as follows:

\[ G\_Loss = 20.0 \times FL(p_t) + DiceLoss + 1.0 \times MaskIoULoss + Mean\_Loss \tag{5} \]
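A minimal PyTorch sketch of this composite loss is given below; it assumes sigmoid-probability mask predictions, binary GT masks, and a per-sample predicted IoU, and the focal-loss parameters shown are illustrative defaults rather than the exact values used in our experiments:

```python
import torch

def g_loss(pred, gt, pred_iou, adv_scores, alpha=0.25, gamma=2.0, eps=1e-6):
    """Composite generator loss: 20*FocalLoss + DiceLoss + MaskIoULoss + Mean_Loss.
    pred, gt: (B, H, W) probabilities / binary masks; pred_iou: (B,) predicted IoU;
    adv_scores: discriminator outputs for the generated masks."""
    # Focal loss (alpha/gamma are illustrative defaults).
    pt = torch.where(gt > 0.5, pred, 1 - pred).clamp(eps, 1 - eps)
    focal = (-alpha * (1 - pt) ** gamma * pt.log()).mean()
    # Dice loss.
    inter = (pred * gt).sum(dim=(1, 2))
    dice = (2 * inter + eps) / (pred.sum(dim=(1, 2)) + gt.sum(dim=(1, 2)) + eps)
    dice_loss = (1 - dice).mean()
    # Mask IoU loss: squared error between the actual and predicted IoU.
    union = pred.sum(dim=(1, 2)) + gt.sum(dim=(1, 2)) - inter
    mask_iou_loss = ((inter / (union + eps) - pred_iou) ** 2).mean()
    # Adversarial term: push discriminator scores toward "real".
    mean_loss = 1 - adv_scores.mean()
    return 20.0 * focal + dice_loss + 1.0 * mask_iou_loss + mean_loss
```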

Discriminator loss function

In GANs, the loss function of the discriminator, denoted as D_Loss, is a critical component. It is built from Mean_Loss terms computed over real (labeled) samples and over samples produced by the generator. The core function of the discriminator’s loss function is to guide it to effectively differentiate between real data and fake data generated by the generator. In this process, the optimization objectives of the discriminator are twofold: first, to improve the accuracy of the discriminator in identifying real data; and second, to reduce its misclassification rate of fake data. The specific form of the loss function is as follows:

\[ D\_Loss = 1 - \frac{1}{m}\sum_{i=1}^{m} r_i + \frac{1}{n}\sum_{i=1}^{n} v_i \tag{6} \]

where m is the number of samples generated from labeled images, ri is the probability that the i-th sample from the labeled images is classified as real by the discriminator, n is the number of samples generated by the generator, and vi is the probability that the i-th generated sample is classified as real by the discriminator.
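The corresponding discriminator objective is only a few lines (same hedging as above; the arguments are the discriminator outputs for real and generated masks):

```python
def d_loss(real_scores, fake_scores):
    """real_scores: discriminator outputs for labeled (real) masks;
    fake_scores: discriminator outputs for generator-produced masks."""
    return (1 - real_scores.mean()) + fake_scores.mean()
```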

Evaluation indicators

We assessed and compared segmentation methods with the Dice coefficient (30), specificity, and sensitivity (31). These measures offer a comprehensive basis for quantifying segmentation accuracy and reliability.

The DSC evaluates how well the predicted segmentation matches the reference. It is defined as follows:

\[ Dice = \frac{2\left|X \cap Y\right|}{\left|X\right| + \left|Y\right|} \tag{7} \]

where X is the GT mask, Y is the predicted mask, |X ∩ Y| is their overlap, and |X| + |Y| denotes the sum of the volumes (voxel counts).

Sensitivity quantifies the share of true GTV pixels that the model correctly detects, indicating how well the algorithm identifies tumor tissue. It is computed as follows:

\[ Sensitivity = \frac{TP}{TP + FN} \tag{8} \]

where TP (true positive) is the accurate segmentation of GTV regions, and FN (false negative) is the misclassification of a GTV region as a non-GTV region.

Specificity quantifies how well the method rejects non-GTV tissue. It is defined as follows:

\[ Specificity = \frac{TN}{TN + FP} \tag{9} \]

where TN (true negative) is the correctly identified non-GTV pixels, and FP (false positive) is the non-GTV pixels incorrectly labeled as GTV.
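For completeness, all three metrics can be computed from binary masks with a short NumPy routine (a generic sketch, not the evaluation script used in this study):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: binary numpy arrays of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return dice, sensitivity, specificity
```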

Statistical analysis

For each model, Dice values across 100 training epochs are summarized as the mean ± standard deviation (SD) and were visualized with box-violin plots [center dot = mean; box = interquartile range (IQR); whiskers = 1.5× IQR]. Distributional assumptions were screened with the Shapiro-Wilk test for normality and the Levene test for homogeneity of variances. An omnibus one-way analysis of variance (ANOVA) was first performed across models. When the omnibus test was significant, pairwise two-sided Welch t-tests were conducted between models. Statistical significance was defined a priori as two-sided P<0.05.
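The statistical pipeline described above maps onto standard SciPy routines; the following sketch assumes a dictionary of per-epoch Dice arrays, one entry per model:

```python
from itertools import combinations
from scipy import stats

def compare_models(dice_by_model, alpha=0.05):
    """dice_by_model: {model_name: array of per-epoch Dice values}."""
    groups = list(dice_by_model.values())
    # Assumption checks: normality and homogeneity of variances.
    normality = {k: stats.shapiro(v).pvalue for k, v in dice_by_model.items()}
    levene_p = stats.levene(*groups).pvalue
    # Omnibus one-way ANOVA, then pairwise Welch t-tests if significant.
    anova_p = stats.f_oneway(*groups).pvalue
    pairwise = {}
    if anova_p < alpha:
        for a, b in combinations(dice_by_model, 2):
            pairwise[(a, b)] = stats.ttest_ind(
                dice_by_model[a], dice_by_model[b], equal_var=False
            ).pvalue
    return normality, levene_p, anova_p, pairwise
```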


Results

Comparison with other methods on the local dataset

Table 1 provides a comparative analysis of various image segmentation methods on the validation set of the local dataset. The evaluated methods included U-Net, its variants (UNet++, UNet++ Deep Supervision, and SEA-Net), nnU-Net, SwinU-Net, xLSTM-UNet, SAM-Med2D, and the proposed GTV-SAMGAN. GTV-SAMGAN achieved the highest performance, with a Dice coefficient of 83.74%, outperforming U-Net by 34.95%, nnU-Net by 23.11%, SwinU-Net by 10.71%, SEA-Net by 11.71%, and the second-best model, SAM-Med2D, by 7.69%. These results demonstrate the significant advantage of GTV-SAMGAN in accurately capturing detailed and global information.

Table 1

Performance of various models on the local dataset (validation set)

Method Dice (%) Specificity (%) Sensitivity (%)
U-Net [2015] 48.79§ 99.92 57.42§
UNet++ [2018] 27.23§ 99.98†,§ 23.42§
UNet++ Deep Supervision [2018] 50.94§ 99.89§ 64.05§
SwinU-Net [2021] 73.03§ 99.98†,§ 74.13§
nnU-Net [2021] 60.63§ 99.97 62.97§
SAM-Med2D [2023] 76.05 99.98 76.53
xLSTM-UNet [2024] 7.37 98.43 38.63
SEA-Net [2024] 72.03 99.96 78.26
Proposed 83.74 99.98 84.28

†, the best result; ‡, the second-best result; §, from our team’s prior work (32).

All networks showed specificity levels above 98%, indicating that the dataset mainly consisted of nontarget areas, with target areas being relatively small. This suggests that the models can effectively distinguish between the background and target regions, reducing the likelihood of false positives. Moreover, GTV-SAMGAN had an excellent sensitivity of 84.28%, surpassing U-Net by 26.86%, SwinU-Net by 10.15%, and the second-best model, SEA-Net, by 6.02%. This underscores GTV-SAMGAN’s exceptional ability to accurately identify and label GTV regions, particularly in minimizing false negatives and improving the precision of medical image analysis.

As seen in Figure 3, the segmentation results of GTV-SAMGAN closely matched the actual GTV regions, demonstrating accurate localization. In comparison, other segmentation methods exhibited weaknesses in localization accuracy, along with oversegmentation and undersegmentation. Specifically, as seen in Figure 3A, only GTV-SAMGAN achieved precise segmentation. In contrast, U-Net, UNet++ Deep Supervision, SwinU-Net, SEA-Net, and SAM-Med2D produced localization inaccuracies, with nnU-Net oversegmenting significantly and UNet++ failing to detect any target. Moreover, in Figure 3B, except for SwinU-Net, SEA-Net, and SAM-Med2D, all models correctly located the target but still exhibited oversegmentation or undersegmentation to varying extents, including GTV-SAMGAN, which produced slight oversegmentation. In Figure 3C, aside from SAM-Med2D and GTV-SAMGAN, all the networks generated localization inaccuracies, with most producing false positives; SAM-Med2D also demonstrated a certain degree of oversegmentation, and only GTV-SAMGAN accurately located the GTV and maintained excellent boundary delineation. Overall, these findings suggest that GTV-SAMGAN excels in handling intricate details and edge localization, particularly in accurately segmenting small regions.

Figure 3 Comparison of multiple model predictions with GT on the local validation set. (A-C) Three representative different patients with lung cancer GTVs. (Top panel left to right) GT, U-Net, UNet++, UNet++ (Deep Supervision), nnU-Net, and proposed (GTV-SAMGAN). (Bottom panel left to right) GT, SwinU-Net, xLSTM-UNet, SEA-UNet, SAM-Med2D, and proposed (GTV-SAMGAN). The segmented region is marked in green, and the red insets show the zoomed-in ROIs. GT, ground truth; GTV, gross tumor volume; GTV-SAMGAN, GTV segment-anything model generative adversarial network; nnU-Net, no-new U-Net; ROIs, regions of interest; SAM-Med2D, segment-anything model medical 2D images; SEA-UNet, structure-enhanced attention network.

Comparison with other methods on the NSCLC-Radiomics dataset

To validate the effectiveness and robustness of our method, we also conducted experiments on the NSCLC-Radiomics dataset, a publicly available resource that includes a wider variety of lesions and greater complexity. Table 2 presents a comparison of the performance of the different models on the lung tumor segmentation task for the NSCLC-Radiomics dataset. As shown in Table 2, our proposed method, GTV-SAMGAN, outperformed the other methods in lung tumor segmentation, achieving a Dice coefficient of 82.92% and a sensitivity of 82.25% on this dataset. Compared to those of the second-best model, SAM-Med2D, the Dice coefficient and sensitivity of GTV-SAMGAN were 6.68% and 9.61% greater, respectively, demonstrating the significant effectiveness of the improvements we made. Additionally, our model significantly outperformed CNN-based models (such as U-Net and nnU-Net), Transformer-based methods (such as SwinU-Net), and the ViT-CNN hybrid of Tyagi et al. (33), which amalgamates a ViT with a convolutional backbone for automatic lung-tumor segmentation, highlighting the stronger adaptability and higher accuracy of our approach in handling complex lung tumor structures, as well as its advantages in integrating global and local features.

Table 2

Performance of various models on the NSCLC-Radiomics dataset (validation set)

Method Dice (%) Specificity (%) Sensitivity (%)
U-Net 48.08 99.92 46.51
nnU-Net 59.11 99.90 62.24
SwinU-Net 29.52 99.89 28.66
SAM-Med2D 76.24 99.96 72.64
Tyagi et al. (33) 74.68 − −
Proposed (GTV-SAMGAN) 82.92 99.95 82.25

†, the best result; ‡, the second-best result; −, corresponding evaluation metric not reported in the referenced literature. nnU-Net, no-new U-Net; GTV-SAMGAN, GTV segment-anything model generative adversarial network; NSCLC, non-small cell lung cancer; SAM-Med2D, segment-anything model medical 2D images.

As shown in Figure 4, GTV-SAMGAN demonstrated superior performance in the lung tumor segmentation task, particularly in tumor localization and edge detail handling, significantly outperforming the other models. Specifically, the other models failed to detect the lesion areas, indicating their limited ability to localize small tumors compared to GTV-SAMGAN (Figure 4A). Moreover, while the other models could detect the lesion areas, their localization accuracy was suboptimal, whereas GTV-SAMGAN was able to precisely localize and accurately segment the tumor regions (Figure 4B). Furthermore, GTV-SAMGAN had superior performance in edge detail handling (Figure 4C). In conclusion, GTV-SAMGAN exhibits higher accuracy and finer edge-processing capabilities, making it more effective in detecting and localizing tumor regions as compared to other models.

Figure 4 Comparison of multiple model predictions with GT on the NSCLC-Radiomics validation set. (A-C) Three representative different patients with lung cancer GTVs. (Columns left to right) GT, U-Net, nnU-Net, SwinU-Net, SAM-Med2D, and proposed (GTV-SAMGAN). The segmented region is marked in green. GT, ground truth; GTV, gross tumor volume; GTV-SAMGAN, GTV segment-anything model generative adversarial network; NSCLC, non-small cell lung cancer; SAM-Med2D, segment-anything model medical 2D images.

Ablation experiments

To verify the effectiveness of the GTV-MFIT module and the GAN training mode, we designed a series of ablation experiments and evaluated the performance of different network configurations on the local dataset. The detailed results on the validation set are presented in Table 3. All models achieved optimal levels in the specificity metric; however, in terms of the two critical metrics, the Dice coefficient and sensitivity, performance differed significantly between the model configurations.

Table 3

Ablation experiments for GTV-SAMGAN on the local dataset (validation set)

Method Dice (%) Specificity (%) Sensitivity (%)
SAM-Med2D 76.05 99.98 76.53
SAM-Med2D + GTV-MFIT 81.40 99.98 82.51
SAM-Med2D + GAN 77.63 99.98 77.71
SAM-Med2D + GAN + GTV-MFIT 83.74 99.98 84.28

†, the best result; ‡, the second-best result. GAN, generative adversarial network; GTV-MFIT, gross tumor volume-minimal feature integration technology; GTV-SAMGAN, gross tumor volume segment-anything model generative adversarial network; SAM-Med2D, segment-anything model medical 2D images.

First, by comparing the experimental results of SAM-med (SAM-Med2D) and SAM-med + GTV-MFIT, we found that the introduction of the GTV-MFIT module improved the Dice coefficient and sensitivity of the original SAM-Med2D model by 5.35% and 5.98%, respectively. The results were further improved by the combinations of SAM-med + GAN and SAM-med + GAN + GTV-MFIT, with the addition of the GTV-MFIT module increasing the Dice coefficient and sensitivity by 6.11% and 6.57%, respectively. These results clearly demonstrate the significant effect of the GTV-MFIT module in enhancing global feature extraction capabilities and recognizing small target areas, thereby optimizing overall image segmentation performance.

Moreover, in comparing the experimental results of SAM-med and SAM-med + GAN, we observed that the addition of the GAN training mode improved the Dice coefficient and sensitivity by 1.58% and 1.18%, respectively. Similarly, in the comparison of the results for SAM-med + GTV-MFIT and SAM-med + GAN + GTV-MFIT, the model performance in terms of Dice coefficient and sensitivity improved by 2.34% and 1.77%, respectively. These findings suggest that the GAN training mode effectively enhances the model’s learning capacity, further improving its performance in complex medical image segmentation tasks.

Figure 5 shows the performance of various ablation models on the validation set. Figure 5A indicates that the models incorporating the GTV-MFIT module (SAM-med + GTV-MFIT and SAM-med + GAN + GTV-MFIT) are able to accurately locate the GTV target area, whereas models without this module (SAM-med and SAM-med + GAN) exhibit inaccurate localization and false positives. This suggests that the GTV-MFIT module plays a crucial role in enhancing the accuracy of the model’s localization of GTV target areas. Figure 5B further reveals the shortcomings of the original SAM-med model in segmenting GTV target areas, characterized by inaccurate localization and an increase in false-positive results, highlighting the model’s limitations in handling complex segmentation tasks. Figure 5C shows that incorporating the GAN training mode provides stability in segmenting very small GTV regions, indicating that the introduction of GAN significantly enhances the model’s robustness and accuracy in detecting and segmenting small regions.

Figure 5 Ablation study on the local validation set. (A-C) Three representative patients with lung cancer GTVs. (Columns left to right) GT, SAM-med, SAM-med + GTV-MFIT, SAM-med + GAN, and SAM-med + GAN + GTV-MFIT (proposed). The segmented region is marked in green, and red insets show the zoomed-in ROIs. GAN, generative adversarial network; GT, ground truth; GTV, gross tumor volume; GTV-MFIT, gross tumor volume-minimal feature integration technology; ROIs, regions of interest; SAM-med (SAM-Med2D), segment-anything model medical 2D images.

Although the selection of model architecture significantly impacts performance, the design of the loss function is equally crucial for dense prediction tasks. To validate the weighting scheme in our compound loss (Eq. [5]), we conducted an ablation study by adjusting the weights of individual components. As summarized in Table 4, four configurations were compared on the validation set: our original weights (20× focal loss + 1× DiceLoss + 1× MaskIoULoss + 1× Mean_Loss) and three variants with specific weights being reduced.

Table 4

Ablation experiments for loss weights evaluated on the validation set

Weight configuration Dice (%) ΔDice
10FL + 1D + 1I + 1M 83.40 −0.34
20FL + 0.5D + 0.5I + 1M 82.54 −1.20
20FL + 1D + 1I + 0.1M 83.60 −0.14
20FL + 1D + 1I + 1M (ours) 83.74

D, DiceLoss; FL, focal loss; I, MaskIoULoss; M, Mean_Loss.

Reducing the focal loss weight from 20× to 10× decreased the Dice coefficient by 0.34%, confirming that a high focal loss weight is essential for mitigating class imbalance. Halving the DiceLoss and MaskIoULoss weights caused the largest drop in the Dice coefficient (−1.20%). Although reducing the Mean_Loss weight minimally affected the Dice coefficient (−0.14%), it increased the convergence time and induced severe oscillation. Overall, these findings indicate that any deviation from our weighting scheme compromises either accuracy or stability, justifying our original configuration as an optimal solution for medical segmentation tasks.

To more accurately assess variations in model performance, we saved the model parameters after each training epoch and evaluated them on the validation set of the local dataset, recording the Dice coefficient each time (see Figure 6). From the line graph in Figure 6, it can be observed that the performance of the SAM-med + GTV-MFIT model began to decline after the 10th training epoch, although this decline was not as noticeable in the SAM-Med2D model. Among all the models, the original SAM-Med2D had the weakest performance, consistently ranking near the bottom of the line graph. Conversely, the SAM-med + GAN + GTV-MFIT model demonstrated the best training performance with stable results, and its Dice coefficient consistently remained at a high level.

Figure 6 Dice results of each ablation model over 100 training epochs on the validation local dataset. GAN, generative adversarial network; GTV-MFIT, gross tumor volume-minimal feature integration technology; SAM-med (SAM-Med2D), segment-anything model medical 2D images.

Figure 7 presents a comparison of intergroup performance among various ablation models after 100 epochs. According to the analysis, the data for all four models conformed to a normal distribution and exhibited significant intergroup differences. With the exception of the SAM-med + GTV-MFIT model, the other three models all generated outliers. In terms of mean level comparisons, the SAM-med + GAN + GTV-MFIT model had the highest mean, followed by the SAM-med + GTV-MFIT model, while the original SAM-med model had the lowest mean. From the box-violin plot (box with kernel-density overlay) in Figure 7, the IQR of the SAM-med + GAN + GTV-MFIT model is the smallest among all four models, indicating a more compact distribution; compared with the other three models, its density is more concentrated around the center.

Figure 7 Dice across 100 epochs for each ablation model (box-violin plots; center dot = mean; box = IQR; whiskers = 1.5× IQR; points = epoch-wise values). An omnibus one-way ANOVA was first performed across models; upon significance, pairwise two-sided Welch t tests were used for comparisons. Asterisks denote statistical significance (**, P<0.01; ***, P<0.001). All tests were two-sided. GAN, generative adversarial network; GTV-MFIT, gross tumor volume-minimal feature integration technology; IQR, interquartile range; SAM-med (SAM-Med2D), segment-anything model medical 2D images.

Discussion

The accurate segmentation of GTV in lung cancer is critical for medical image analysis but remains a significant challenge (34). The related difficulties arise from several key factors. First, the complex anatomical structure of the lungs and the irregular shape of tumors make it considerably difficult for automated algorithms to reliably differentiate between healthy and diseased tissue. Second, the GTV typically occupies a small portion of the image, resulting in specificity metrics consistently above 99% across different models, even when the Dice coefficient varies substantially (35). This discrepancy highlights the dominance of non-GTV regions and underscores the primary challenge in automatic segmentation. Finally, common issues such as noise, imaging artifacts, and low contrast between the tumor and surrounding tissues further complicate localization and segmentation. These factors combined make accurate GTV segmentation a task that demands continuous technological advancement and refinement.

To address the challenges of GTV segmentation in lung cancer, we initially applied the SAM-Med2D model. However, because the input images are 512×512 pixels and the pretrained model automatically resizes the input to 256×256, image precision was compromised. Additionally, since the GTV region is relatively small within the image, the segmentation results were suboptimal. To address these issues, we designed the GTV-MFIT module to preprocess the data by adaptively cropping the image to an optimal size, enabling the model to focus more accurately on the target region and improving segmentation performance. Additionally, we observed a decline in model performance around the 10th epoch during training, which manifested as a gradual degradation in training results, suggesting potential overfitting or instability in the training process. To mitigate this issue, we introduced a GAN training mode to re-optimize the training process and enhance the model’s generalization ability. To further enhance the model’s stability and discriminative capacity, we incorporated the Mean_Loss into both the generator and discriminator. This design choice, validated through ablation studies, improved the discriminator’s ability to distinguish between real and generated samples, thereby contributing to the overall robustness and segmentation accuracy of the model.

The experimental results validated our hypothesis that integrating the GTV-MFIT module into SAM-Med2D and adopting a GAN training mode would improve segmentation accuracy (Dice) and sensitivity, especially for small GTVs. First, the addition of the GTV-MFIT module improved the Dice coefficient and sensitivity of the original SAM-Med2D model by 5.35% and 5.98%, respectively, on the local dataset. This indicates that the GTV-MFIT module enhances the model’s ability to capture details and edges, thereby improving its performance in segmenting complex lung structures. Additionally, the results showed that models incorporating the GAN training mode exhibited higher stability and accuracy when handling very small GTV regions. In contrast, models not using GAN were prone to overfitting and localization inaccuracies, whereas those with the GAN training mode demonstrated significant improvements in Dice coefficient and sensitivity. This suggests that the GAN training mode plays a crucial role in enhancing the model’s generalization capability, particularly when tackling complex segmentation tasks. Additionally, on the NSCLC-Radiomics dataset, GTV-SAMGAN outperformed the original model, SAM-Med2D, with improvements of 6.68% in the Dice coefficient and 9.61% in sensitivity. These results further confirm that the model combining the GTV-MFIT module and GAN training mode provides strong generalization capability and superiority.

The differences in training performance among the models are visualized in the line graph in Figure 6. Notably, the SAM-med + GTV-MFIT model had a significant decrease in performance after the 10th epoch, highlighting the instability of the original SAM-Med2D model’s training. Moreover, the original SAM-Med2D model performed even worse, primarily due to undergoing a resolution adjustment from 512 to 256 and back to 512; this dual resizing led to substantial precision loss, further exacerbating the decline in performance. However, the introduction of the GAN component not only significantly enhanced stability but also maintained high levels of training efficiency and Dice coefficients in the SAM-med + GAN + GTV-MFIT model, thoroughly demonstrating the effectiveness of the GAN training regimen in improving model stability and accuracy.

Furthermore, the results (Figure 7) support the effectiveness of the GTV-MFIT module and the GAN training regimen. Specifically, the SAM-med + GTV-MFIT model showed no outliers, while the other three models had varying numbers of outliers, underscoring the GTV-MFIT module’s ability to enhance model stability. Additionally, the box plot of the SAM-med + GAN + GTV-MFIT model demonstrated a higher degree of data concentration, further confirming this model’s improvement in stability and precision. The substantial differences between all four models (P<0.01) emphasize the impact of incorporating the GTV-MFIT module and GAN on performance enhancement, fully validating the GTV-MFIT module’s effectiveness in improving data-processing quality and model prediction accuracy.

Although our model performed well in segmenting lung cancer GTV target areas, certain limitations should be addressed. First, the generalization ability of the model requires further validation. Despite significant results being achieved on the dataset used in this study, these findings may not be applicable to datasets from other sources or those with different imaging characteristics. Second, our research primarily focused on 2D image data; for handling more complex, 3D volumetric data, the model may require structural adjustments, which would result in higher computational demands. Therefore, in practical applications, further optimization and adaptation of the model may be necessary to meet the needs of different clinical environments.


Conclusions

By integrating task-oriented preprocessing of the GTV with an adversarial training paradigm, the proposed GTV-SAMGAN demonstrated a strong ability to identify and localize small and low-contrast lesions within the complex pulmonary environment. Compared with generic baseline methods, our framework provides more robust boundary delineation, improved voxel-level consistency, and enhanced resilience while maintaining favorable generalization and clinical transferability across datasets from different sources. Without an increase in annotation burden, the method streamlines the segmentation workflow and holds promise for delivering more reliable image evidence to support radiotherapy planning, treatment response evaluation, and patient follow-up. Future work will involve multicenter prospective validation, extend the approach to additional organs and lesion types, and incorporate uncertainty estimation and interpretability analysis to facilitate standardized deployment and continual optimization in real-world clinical settings.


Acknowledgments

None.


Footnote

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-370/dss

Funding: This study was supported by The Open Fund for Scientific Research of Jiangxi Cancer Hospital (Project No. 2021J15), The PhD fellowship of Nanchang Hangkong University (No. EA202008259) and National Natural Science Foundation of China (Grant No. 62261039).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-370/coif). C.Y. reports receiving payments from The PhD Fellowship of Nanchang Hangkong University (No. EA202008259). S.J. reports receiving payments from the National Natural Science Foundation of China (Grant No. 62261039). H.Z. reports receiving payments from The Open Fund for Scientific Research of Jiangxi Cancer Hospital & Institute (No. 2021J15). The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This retrospective study was approved by the Ethics Committee of Jiangxi Cancer Hospital & Institute (Date: 2023-08-01) (No. 2023KY154). The requirement for written informed consent was waived owing to the use of de-identified data and minimal risk. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. WHO. Lung cancer. Available online: https://www.who.int/news-room/fact-sheets/detail/lung-cancer. Accessed 26 June 2023.
  2. Kratzer TB, Bandi P, Freedman ND, Smith RA, Travis WD, Jemal A, Siegel RL. Lung cancer statistics, 2023. Cancer 2024;130:1330-48. [Crossref] [PubMed]
  3. International Agency for Research on Cancer. New report on global cancer burden in 2022 by world region and human development level. 2024. Accessed April 27, 2024. Available online: https://www.iarc.who.int/news-events/new-report-on-global-cancer-burden-in-2022-by-world-region-and-human-development-level/
  4. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, Jemal A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229-63. [Crossref] [PubMed]
  5. Yan T, Guo S, Zhang T, Zhang Z, Liu A, Zhang S, Xu Y, Qi Y, Zhao W, Wang Q, Shi L, Liu L. Ligustilide Prevents Radiation Enteritis by Targeting Gch1/BH(4)/eNOS to Improve Intestinal Ischemia. Front Pharmacol 2021;12:629125. [Crossref] [PubMed]
  6. Shah A, Hahn SM, Stetson RL, Friedberg JS, Pechet TT, Sher DJ. Cost-effectiveness of stereotactic body radiation therapy versus surgical resection for stage I non-small cell lung cancer. Cancer 2013;119:3123-32. [Crossref] [PubMed]
  7. Crabtree TD, Denlinger CE, Meyers BF, El Naqa I, Zoole J, Krupnick AS, Kreisel D, Patterson GA, Bradley JD. Stereotactic body radiation therapy versus surgical resection for stage I non-small cell lung cancer. J Thorac Cardiovasc Surg 2010;140:377-86. [Crossref] [PubMed]
  8. Guberina M, Santiago A, Pöttgen C, Indenkämpen F, Lübcke W, Qamhiyeh S, Gauler T, Hoffmann C, Guberina N, Stuschke M. Respiration-controlled radiotherapy in lung cancer: Systematic evaluation of the optimal application practice. Clin Transl Radiat Oncol 2023;40:100628. [Crossref] [PubMed]
  9. Caldwell CB, Mah K, Ung YC, Danjoux CE, Balogh JM, Ganguli SN, Ehrlich LE. Observer variation in contouring gross tumor volume in patients with poorly defined non-small-cell lung tumors on CT: the impact of 18FDG-hybrid PET fusion. Int J Radiat Oncol Biol Phys 2001;51:923-31. [Crossref] [PubMed]
  10. Van de Steene J, Linthout N, de Mey J, Vinh-Hung V, Claassens C, Noppen M, Bel A, Storme G. Definition of gross tumor volume in lung cancer: inter-observer variability. Radiother Oncol 2002;62:37-49. [Crossref] [PubMed]
  11. Persson GF, Nygaard DE, Hollensen C, Munck af Rosenschöld P, Mouritsen LS, Due AK, Berthelsen AK, Nyman J, Markova E, Roed AP, Roed H, Korreman S, Specht L. Interobserver delineation variation in lung tumour stereotactic body radiotherapy. Br J Radiol 2012;85:e654-60. [Crossref] [PubMed]
  12. Rios Velazquez E, Aerts HJ, Gu Y, Goldgof DB, De Ruysscher D, Dekker A, Korn R, Gillies RJ, Lambin P. A semiautomatic CT-based ensemble segmentation of lung tumors: comparison with oncologists’ delineations and with the surgical specimen. Radiother Oncol 2012;105:167-73. [Crossref] [PubMed]
  13. Vinod SK, Jameson MG, Min M, Holloway LC. Uncertainties in volume delineation in radiation oncology: A systematic review and recommendations for future studies. Radiother Oncol 2016;121:169-79. [Crossref] [PubMed]
  14. Zhou Z, Wang M, Zhao R, Shao Y, Xing L, Qiu Q, Yin Y. A multi-task deep learning model for EGFR genotyping prediction and GTV segmentation of brain metastasis. J Transl Med 2023;21:788. [Crossref] [PubMed]
  15. Wang Y, Wen Z, Su L, Deng H, Gong J, Xiang H, He Y, Zhang H, Zhou P, Pang H. Improved brain metastases segmentation using generative adversarial network and conditional random field optimization mask R-CNN. Med Phys 2024;51:5990-6001. [Crossref] [PubMed]
  16. Rhee DJ, Jhingran A, Rigaud B, Netherton T, Cardenas CE, Zhang L, Vedam S, Kry S, Brock KK, Shaw W, O’Reilly F, Parkes J, Burger H, Fakie N, Trauernicht C, Simonds H, Court LE. Automatic contouring system for cervical cancer using convolutional neural networks. Med Phys 2020;47:5648-58. [Crossref] [PubMed]
  17. Men K, Chen X, Zhang Y, Zhang T, Dai J, Yi J, Li Y. Deep Deconvolutional Neural Network for Target Segmentation of Nasopharyngeal Cancer in Planning Computed Tomography Images. Front Oncol 2017;7:315. [Crossref] [PubMed]
  18. Wang C, Tyagi N, Rimner A, Hu YC, Veeraraghavan H, Li G, Hunt M, Mageras G, Zhang P. Segmenting lung tumors on longitudinal imaging studies via a patient-specific adaptive convolutional neural network. Radiother Oncol 2019;131:101-7. [Crossref] [PubMed]
  19. Zhang F, Wang Q, Li H. Automatic Segmentation of the Gross Target Volume in Non-Small Cell Lung Cancer Using a Modified Version of ResNet. Technol Cancer Res Treat 2020;19:1533033820947484.
  20. Mohamed AA, Risse K, Schmitz L, Schlenter M, Chughtai A, Ivanciu M, Eble MJ. Clinical validation of a semi-automated segmentation algorithm for target volume definition on planning CT and CBCT in stereotactic body radiotherapy (SBRT) for peripheral lung lesions. J Med Radiat Sci 2023;70:37-47. [Crossref] [PubMed]
  21. Ronneberger O, Fischer P, Brox T, editors. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A. (eds). Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science; Springer, Cham.
  22. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
  23. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
  24. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M, editors. Swin-unet: Unet-like pure transformer for medical image segmentation. In: Karlinsky L, Michaeli T, Nishino K. (eds). Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science. Springer, Cham.
  25. Chen T, Ding C, Zhu L, Xu T, Wang Y, Ji D, Zang Y, Li Z. xLSTM-UNet can be an effective backbone for 2D & 3D biomedical image segmentation better than its Mamba counterparts. 2024 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI); Houston, TX, USA. 2024:1-8.
  26. Xiong L, Yi C, Xiong Q, Jiang S. SEA-NET: medical image segmentation network based on spiral squeeze-and-excitation and attention modules. BMC Med Imaging 2024;24:17. [Crossref] [PubMed]
  27. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo WY, editors. Segment anything. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023.
  28. Cheng J, Ye J, Deng Z, Chen J, Li T, Wang H, Su Y, Huang Z, Chen J, Jiang L. SAM-Med2D. arXiv preprint arXiv:2308.16184, 2023.
  29. Aerts HJ, Wee L, Rios Velazquez E, Leijenaar RT, Parmar C, Grossmann P, Carvalho S, Bussink J, Monshouwer R, Haibe-Kains B. Data from NSCLC-Radiomics. The Cancer Imaging Archive 2019. Available online: https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI
  30. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297-302.
  31. Chang HH, Zhuang AH, Valentino DJ, Chu WC. Performance measure characterization for evaluating neuroimage segmentation algorithms. Neuroimage 2009;47:122-35. [Crossref] [PubMed]
  32. Zhang HW, Hu B, Yang J, Xiong LL, Shi HH, Xiong QL, Yi C, Jiang SF. RAU-Net for precise lung cancer GTV segmentation in radiation therapy planning. Sci Rep 2025;15:15075. [Crossref] [PubMed]
  33. Tyagi S, Kushnure DT, Talbar SN. An amalgamation of vision transformer with convolutional neural network for automatic lung tumor segmentation. Comput Med Imaging Graph 2023;108:102258. [Crossref] [PubMed]
  34. Kulkarni C, Sherkhane U, Jaiswar V, Mithun S, Mysore Siddu D, Rangarajan V, Dekker A, Traverso A, Jha A, Wee L. Comparing the performance of a deep learning-based lung gross tumour volume segmentation algorithm before and after transfer learning in a new hospital. BJR Open 2024;6:tzad008. [Crossref] [PubMed]
  35. Lok E, Liang O, Malik T, Wong ET. Computational Analysis of Tumor Treating Fields for Non-Small Cell Lung Cancer in Full Thoracic Models. Adv Radiat Oncol 2023;8:101203. [Crossref] [PubMed]
Cite this article as: Yi C, Li Y, Cao S, Xiong Q, Jiang S, Zhang H. Gross-tumor-volume segment-anything model for medical 2D images integrating gross tumor volume-minimal feature integration technology for lung cancer segmentation. Quant Imaging Med Surg 2026;16(1):59. doi: 10.21037/qims-2025-370
