Original Article

Automated segmentation of the primary tumor in nasopharyngeal carcinoma using a deep learning framework in positron emission tomography imaging: a comparative study

Meina Liang1#, Chengmao Guo1#, Haiwen Li2#, Dong Wang1, Yang Jing3, Yaobin Pan4, Ziyi Huang4, Na Song4, Xuanyu Liu4, Jingxing Xiao1

1Department of Nuclear Medicine (PET-CT Center), Affiliated Hospital of Guangdong Medical University, Zhanjiang, China; 2Cancer Center, Affiliated Hospital of Guangdong Medical University, Zhanjiang, China; 3Huiying Medical Technology Co., Ltd., Beijing, China; 4Guangdong Medical University, Zhanjiang, China

Contributions: (I) Conception and design: J Xiao; (II) Administrative support: None; (III) Provision of study materials or patients: J Xiao; (IV) Collection and assembly of data: M Liang, C Guo, Y Pan, Z Huang, N Song, X Liu; (V) Data analysis and interpretation: J Xiao, M Liang, C Guo, H Li, Y Jing; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Jingxing Xiao, MD, PhD. Department of Nuclear Medicine (PET-CT Center), Affiliated Hospital of Guangdong Medical University, No. 57 Renmin South Road, Xiashan District, Zhanjiang 524001, China. Email: 13824828797@163.com.

Background: Nasopharyngeal carcinoma (NPC), an endemic malignancy in Southeast Asia and southern China, necessitates precise delineation of the gross tumor volume (GTV) for radiotherapy to optimize patient prognosis. Positron emission tomography (PET) imaging offers valuable metabolic insights to guide radiotherapy planning but is hampered by manual GTV segmentation, which is labor-intensive and prone to inter- and intra-observer variability. Conventional segmentation methods are inherently limited, and comparative evaluations of deep learning (DL) frameworks for PET-based primary NPC lesion segmentation remain scarce. Thus, this study aimed to optimize the delineation of primary NPC radiotherapy target volumes by leveraging PET-image-based DL segmentation technology, with the goal of enhancing both the precision and efficiency of lesion delineation.

Methods: We retrospectively collected PET imaging data from 212 NPC patients at the Affiliated Hospital of Guangdong Medical University. The patients were randomly divided into training (170 patients) and testing (42 patients) datasets. A radiation oncologist and a nuclear medicine physician collaboratively delineated the cancer lesion boundaries through consensus. After data preprocessing, three DL models (Res-Unet, Nn-Unet, and Nn-Former) were trained on the training dataset to automatically segment the lesions. Segmentation performance was comprehensively assessed using the Dice similarity coefficient (DSC) and Hausdorff distance (HD), and a visual analysis of the results was conducted to intuitively illustrate the models’ segmentation capabilities.

Results: In the training set, Nn-Unet achieved the highest DSC of 0.869, whereas in the testing set, its DSC was 0.833. The DSC for Nn-Former hovered around 0.8 in both the training and testing sets. In contrast, Res-Unet demonstrated the lowest DSC values among the three models, particularly in the testing set (DSC =0.794). Statistical analysis revealed that the DSC of Nn-Unet was significantly higher than that of both Res-Unet and Nn-Former in the training set (P<0.01 for both comparisons). Regarding the 95% HD and average surface distance (ASD) values, Nn-Unet outperformed the other two models in both the training and testing sets. However, Res-Unet exhibited a much higher HD value in the testing set compared to the training set. The Loss curves for all models gradually decreased as training progressed, indicating that the models were learning relevant features of NPC lesions. The final Loss value for Res-Unet was approximately 0.7, whereas Nn-Unet and Nn-Former had final Loss values below 0.8.

Conclusions: The PET-image-based automatic segmentation model for the primary tumor of NPC established in this study demonstrates the clinical potential of Nn-Unet for segmentation of the primary NPC lesion.

Keywords: Positron emission tomography imaging (PET imaging); nasopharyngeal carcinoma (NPC); lesion segmentation; Nn-Unet; Nn-Former


Submitted May 08, 2025. Accepted for publication Sep 28, 2025. Published online Oct 23, 2025.

doi: 10.21037/qims-2025-1090


Introduction

Nasopharyngeal carcinoma (NPC) is a prevalent malignancy of the head and neck region in Southeast Asia, and its early detection and accurate diagnosis are crucial for improving patient prognosis and extending survival. According to statistics from the World Health Organization (WHO), despite a relatively low global incidence of NPC, specific regions such as southern China and Southeast Asia exhibit significant endemic trends (1). NPC often originates in the nasopharyngeal recess, a relatively concealed location with non-specific early clinical symptoms, leading to missed or misdiagnosed cases. Additionally, NPC typically presents pathologically as poorly differentiated squamous cell carcinoma, is often accompanied by lymph node metastasis, and is sensitive to radiation, making radiotherapy the preferred treatment option (2,3). Positron emission tomography/computed tomography (PET/CT), a functional and metabolic imaging modality, has been extensively applied in the diagnosis, clinical staging, and prognostic prediction for NPC patients (4-6).

Although magnetic resonance imaging (MRI) remains the gold standard for anatomical delineation in NPC radiotherapy due to its superior soft-tissue contrast (5-7), this study intentionally focused on PET-based segmentation for two key reasons. First, PET imaging captures glucose metabolism through fluorodeoxyglucose (FDG) uptake, revealing tumor heterogeneity (8). Second, recent studies have confirmed that PET-derived metabolic tumor volume (MTV) independently predicts survival outcomes in NPC (5-8). Our PET-centric approach thus targets functional characterization to supplement, not replace, anatomical delineation.

In both clinical practice and research, a range of semiquantitative and quantitative PET metrics derived from images are employed to augment visual interpretation. These encompass straightforward indices such as the standardized uptake value (SUV), as well as sophisticated quantitative measures extracted from PET scans. The quantification of in vivo metabolic and physiological processes offers crucial insights for the clinical diagnosis and prognosis of NPC (9,10). In clinical settings, there is a strong demand for delineating the gross tumor volume (GTV) to compute the MTV, and subsequently the total lesion glycolysis (TLG), or for planning external beam radiation therapy (11). However, manual delineation of the GTV is not only time-consuming but also susceptible to inter- and intra-observer variability, and is heavily reliant on the physician’s experience (12,13). Furthermore, achieving precise delineation in PET images is challenging due to their inherent noise, limited spatial resolution, and the consequent partial volume effect, which can compromise the accuracy of the delineation (14-18). In the realm of PET image segmentation, machine learning algorithms such as K-nearest neighbor (19), decision tree (20), and support vector machine (21) have undergone extensive research. However, these techniques are constrained by the necessity for manual feature extraction and selection, thereby limiting their accuracy and robustness. In recent years, the advent of deep learning (DL) technology has brought new advances to the field of medical image segmentation (22,23). Notably, convolutional neural networks (CNNs) and their derivative models, including U-Net, Res-Unet, Nn-Unet, and Nn-Former, have demonstrated exceptional performance in medical image segmentation tasks (24-26). These models, through multi-layer convolutional and pooling operations, can extract multi-scale features from images, thereby enabling precise segmentation of complex structures. For instance, U-Net employs a symmetrical encoder-decoder structure to preserve image details and capture global information; Res-Unet introduces residual connections to alleviate the gradient vanishing problem in deep network training; and Nn-Unet and Nn-Former further enhance segmentation performance through adaptive network architectures and attention mechanisms.

This study aimed to establish a clinical pathway for automatic NPC lesion segmentation using PET imaging, through the development of DL models such as Res-Unet, Nn-Unet, and Nn-Former. We sought to offer valuable guidance to researchers and clinicians in NPC diagnosis and treatment. The study spanned model construction, training, and evaluation, aiming to provide technical insights for precise NPC lesion segmentation.


Methods

Patient data

The dataset for this study originated from the Affiliated Hospital of Guangdong Medical University, encompassing a total of 212 PET images of NPC patients with their corresponding annotated data. Prior to data collection, all patients provided informed consent, and the study received approval from the Ethics Committee of the Affiliated Hospital of Guangdong Medical University (Ethics Approval No. JP-2023-016). This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study used a non-interventional retrospective design, involved no human experimentation or use of human tissue samples, and all data were anonymized.

To maintain data consistency, patients with concurrent head and neck diseases or those who had undergone related treatments were excluded during the data collection process. The process of patient inclusion and exclusion is illustrated in Figure 1. The final patient cohort included 158 males and 54 females, aged 19 to 85 years (mean, 50.9±13.7 years). The detailed baseline characteristics of the 212 NPC cases are presented in Table 1. Annotations were performed collaboratively by a radiation oncologist (H.L.) with over 10 years of experience and a nuclear medicine physician (C.G.) with 5 years of experience in delineating primary NPC lesions, using a freely available semi-automated segmentation tool (27) implemented in 3D Slicer (28). Discrepancies exceeding 5% in contoured volume were resolved through consensus adjudication.

Figure 1 Flowchart of patient inclusion and exclusion criteria. 18F-FDG PET/CT, 18F-fluorodeoxyglucose positron emission tomography/computed tomography; NPC, nasopharyngeal carcinoma.

Table 1

Basic data characteristics of 212 cases of nasopharyngeal carcinoma

Characteristics Values
Gender
   Male 158
   Female 54
Age (years) 50.9±13.7
T staging
   T1 22
   T2 62
   T3 92
   T4 36
TNM staging
   I 5
   II 19
   III 74
   IVa 62
   IVb 52

Data are presented as n or mean ± standard deviation. TNM, tumor-node-metastasis.

PET/CT imaging protocol

PET/CT scans were conducted using a Discovery PET/CT Elite 690 scanner (GE Healthcare, Chicago, IL, USA). Prior to the examination, patients fasted for at least 6 hours, and blood glucose levels were confirmed to be below 150 mg/dL. The tracer 18F-fluorodeoxyglucose (18F-FDG) was intravenously administered at a dose of 3.7 MBq/kg. After a 45-minute uptake period, imaging was performed from the skull base to the mid-thigh. CT acquisition employed automatic dose modulation at 120 kV with a tube current range of 80–160 mA. PET data were collected in three-dimensional (3D) time-of-flight mode, covering 5–7 bed positions with 120 seconds per bed. Image reconstruction utilized an adaptive statistical iterative reconstruction (ASIR) algorithm with a blending factor of 40%, an axial field of view of 70 cm, and a slice thickness of 3.27 mm.

Data preprocessing

To assess the generalization capability of the model, we randomly divided the dataset into a training set (170 patients; 80% of the data) used for model training and a test set (42 patients; 20%) reserved for evaluating model performance.

To minimize inter-device and inter-population variability and enhance the model’s generalization capability, we systematically preprocessed all PET image data prior to model training, including gray-scale normalization, region of interest (ROI) cropping of the training area, and data augmentation. Gray-scale normalization was conducted using the Z-score method to eliminate gray-scale differences among images from different patients. ROI cropping extracted the area containing the lesion, thereby reducing unnecessary computational load. Data augmentation was implemented through various methods, including rotation, scaling, Gaussian noise addition, brightness variation, contrast variation, low-resolution simulation, elastic transformation, and Gamma correction, to increase the diversity and quantity of training samples and enhance the robustness of the model.
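To make these steps concrete, the following is a minimal sketch in NumPy/SciPy of the three operations described above. The function names, crop margin, and augmentation probabilities are illustrative assumptions, not the exact pipeline used in this study.

```python
# A minimal preprocessing sketch (assumed, not the study's exact pipeline).
import numpy as np
from scipy import ndimage

def zscore_normalize(volume: np.ndarray) -> np.ndarray:
    """Gray-scale normalization: zero mean, unit variance per patient volume."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def crop_roi(volume: np.ndarray, mask: np.ndarray, margin: int = 16):
    """Crop a bounding box around the annotated lesion plus a safety margin."""
    coords = np.argwhere(mask > 0)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + margin + 1, mask.shape)
    sl = tuple(slice(a, b) for a, b in zip(lo, hi))
    return volume[sl], mask[sl]

def augment(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly apply a subset of the augmentations listed above."""
    if rng.random() < 0.5:  # small in-plane rotation
        volume = ndimage.rotate(volume, rng.uniform(-15, 15),
                                axes=(1, 2), reshape=False, order=1)
    if rng.random() < 0.3:  # additive Gaussian noise
        volume = volume + rng.normal(0.0, 0.05, volume.shape)
    if rng.random() < 0.3:  # gamma correction on the 0-1 rescaled volume
        v = volume - volume.min()
        v = v / (v.max() + 1e-8)
        volume = v ** rng.uniform(0.7, 1.4)
    return volume
```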

Model construction

In this study, three DL models—Res-Unet (29), Nn-Unet (24), and Nn-Former (30)—were selected for NPC lesion segmentation. These models are all based on the U-Net architecture but incorporate distinct improvement mechanisms. The schematic diagram of the model is shown in Figure 2.

Figure 2 The schematic diagram of the model. PET, positron emission tomography.

Res-Unet

Res-Unet introduces residual connections into the traditional U-Net framework to address the issue of gradient vanishing in deep networks. By adding skip connections after each convolutional layer, residual connections allow the input to be transmitted directly to the output, thereby mitigating the gradient vanishing problem. Additionally, Res-Unet employs batch normalization, which helps to accelerate the training process and improve model stability.
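As an illustration of this mechanism, below is a minimal PyTorch sketch of a 3D residual convolutional block with batch normalization; the channel counts, kernel sizes, and 1×1×1 projection are illustrative assumptions rather than the exact Res-Unet configuration used here.

```python
# A sketch of one residual block (illustrative, not the study's exact layers).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(out_ch)
        # 1x1x1 projection so the skip path matches the output channels
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.skip(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # residual connection: gradients can flow directly through `identity`
        return self.relu(out + identity)
```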

Nn-Unet

Nn-Unet is a self-configuring DL segmentation framework that automatically handles tasks such as data preprocessing, network structure design, and training strategy selection. The core advantage of Nn-Unet lies in its adaptability and flexibility, enabling it to adjust the network architecture automatically according to different datasets and task requirements. Nn-Unet has achieved excellent results in multiple medical image segmentation competitions, demonstrating its powerful performance and broad applicability. By introducing multi-scale feature extraction and fusion strategies, Nn-Unet can better capture detailed information in images, thereby improving segmentation accuracy.

Nn-Former

Building upon Nn-Unet, Nn-Former further incorporates the Transformer structure, enhancing the model’s ability to understand global image information. The Transformer structure, through the self-attention mechanism, can effectively capture long-range dependencies in images, thereby better understanding the complex interactions between lesions and surrounding tissues. Additionally, Nn-Former adopts a multi-scale feature fusion strategy, improving the model’s segmentation accuracy for lesions of different scales by extracting and fusing features at different scales. In the self-attention module, Nn-Former uses scaled dot-product attention to calculate the similarity between each pixel and its neighboring regions, assigning higher weights to lesion-related features to suppress background interference. This design enables Nn-Former to balance the precision of lesion boundary segmentation and the integrity of tumor region detection, especially for irregularly shaped NPC lesions.
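For readers unfamiliar with the mechanism, the following is a minimal PyTorch sketch of single-head scaled dot-product self-attention as described above. It deliberately omits the multi-head projections, windowing, and positional encodings of the full Nn-Former architecture; shapes and names are illustrative.

```python
# Single-head scaled dot-product attention (mechanism sketch only).
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q: torch.Tensor,
                                 k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, tokens, dim). Returns attention-weighted values."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # pairwise similarity
    weights = F.softmax(scores, dim=-1)              # attention weights
    return weights @ v                               # weighted aggregation
```

In the full model, such attention layers are applied in multi-head form and combined with convolutional feature extraction, which is what allows long-range context to complement local boundary detail.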

Loss function

All three models utilize a loss function that combines Dice loss and cross-entropy loss. This loss function was chosen because it mitigates class imbalance while improving segmentation accuracy. Dice loss excels at handling class imbalance because it directly optimizes the overlap between the segmentation result and the ground truth, whereas cross-entropy loss optimizes at the pixel level, contributing to higher segmentation accuracy. Combining the two leverages the advantages of both.

The formula for the loss function that combines Dice loss and cross-entropy loss is expressed as:

\[ \mathrm{Loss} = \alpha \cdot \mathrm{Loss}_{\text{Dice}} + \beta \cdot \mathrm{Loss}_{\text{CE}} \]

In the formula, $\alpha$ and $\beta$ are weight coefficients that balance the influence of the Dice loss and the cross-entropy loss, respectively; both are set to 0.5.

The formula for Dice loss is:

\[ \mathrm{Loss}_{\text{Dice}} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i} \]

In the formula, $N$ represents the total number of pixels, $p_i$ denotes the probability value predicted by the model for the $i$-th pixel, and $g_i$ indicates the true label (0 or 1) of the $i$-th pixel.

The formula for cross-entropy loss is:

\[ \mathrm{Loss}_{\text{CE}} = -\sum_{i=1}^{N} g_i \log(p_i) \]

In the formula, $N$ represents the total number of pixels, $p_i$ signifies the probability value predicted by the model for the $i$-th pixel, and $g_i$ denotes the actual label (0 or 1) of the $i$-th pixel.
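The combined loss can be written compactly in PyTorch. The sketch below assumes binary (lesion vs. background) segmentation with raw network logits and $\alpha = \beta = 0.5$ as specified above; it illustrates the formulas rather than reproducing the frameworks’ internal implementations.

```python
# Combined Dice + cross-entropy loss (sketch of the formulas above).
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.5, beta: float = 0.5,
                 eps: float = 1e-6) -> torch.Tensor:
    probs = torch.sigmoid(logits)              # p_i: predicted probabilities
    p, g = probs.flatten(), target.float().flatten()
    dice = (2.0 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)
    dice_loss = 1.0 - dice                     # Loss_Dice
    ce_loss = F.binary_cross_entropy_with_logits(logits, target.float())
    return alpha * dice_loss + beta * ce_loss  # weighted combination
```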

Model training and evaluation

During the model training process, the specific training hyperparameters were as follows (a minimal scheduler sketch in PyTorch is given after the list):

  • Res-Unet: the initial learning rate was set to 0.001, with a momentum of 0.9 and a batch size of 16. A learning rate decay strategy was adopted during training, with the learning rate reduced by a factor of 0.1 every 10 epochs. The total number of training epochs was 250.
  • Nn-Unet: the initial learning rate was set to 0.001, with a momentum of 0.9 and a batch size of 8. The Adam optimizer was used, and the learning rate was dynamically adjusted during training. The total number of training epochs was 250.
  • Nn-Former: the initial learning rate was set to 0.001, with a momentum of 0.9 and a batch size of 4. The Adam optimizer was used, and a polynomial decay learning rate schedule was adopted during training. The total number of training epochs was 250.
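These three schedules map onto standard PyTorch schedulers as sketched below. The placeholder model, the plateau-based rule standing in for Nn-Unet’s “dynamic adjustment”, and the polynomial exponent are assumptions for illustration only.

```python
# Sketch of the three learning-rate schedules (placeholder model/optimizers).
import torch

model = torch.nn.Conv3d(1, 2, kernel_size=3)   # placeholder network
epochs = 250

# Res-Unet: SGD with momentum; step decay, LR x0.1 every 10 epochs
opt_res = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
sched_res = torch.optim.lr_scheduler.StepLR(opt_res, step_size=10, gamma=0.1)

# Nn-Unet: Adam with a dynamic adjustment (assumed here: plateau-based)
opt_nnu = torch.optim.Adam(model.parameters(), lr=1e-3)
sched_nnu = torch.optim.lr_scheduler.ReduceLROnPlateau(opt_nnu, factor=0.5)

# Nn-Former: Adam with polynomial decay to zero over all epochs
opt_nnf = torch.optim.Adam(model.parameters(), lr=1e-3)
sched_nnf = torch.optim.lr_scheduler.LambdaLR(
    opt_nnf, lambda epoch: (1 - epoch / epochs) ** 0.9)
```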

During post-processing of the models’ inference results, we customized the pipeline to the particularities of PET-predicted NPC lesions. Because some patients’ lesions are small, the predicted results showed scattered lesion areas that are, in reality, connected. To address this, we applied topological image-processing operations (connected-component labeling) to merge them, making the segmentation more suitable for this study’s application. To comprehensively evaluate the performance of the models, we adopted the Dice similarity coefficient (DSC) and Hausdorff distance (HD) as evaluation metrics for the segmentation results. The DSC measures the overlap between the segmentation result and the ground truth, with a value closer to 1 indicating more accurate segmentation. The HD assesses the maximum inconsistency between the segmentation boundary and the true boundary, with a smaller value indicating more precise segmentation.
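A minimal sketch of this post-processing and of the two metrics is given below, using SciPy. The morphological closing plus keep-largest-component rule is an assumed stand-in for the study’s merging procedure, and the voxel spacing reused here comes from the reconstruction parameters reported in the Results.

```python
# Post-processing and evaluation metrics (sketch under stated assumptions).
import numpy as np
from scipy import ndimage

def postprocess(pred: np.ndarray) -> np.ndarray:
    """Reconnect scattered fragments, then keep the largest component."""
    closed = ndimage.binary_closing(pred > 0, structure=np.ones((3, 3, 3)))
    labeled, n = ndimage.label(closed)
    if n == 0:
        return closed.astype(np.uint8)
    sizes = ndimage.sum(closed, labeled, index=range(1, n + 1))
    return (labeled == np.argmax(sizes) + 1).astype(np.uint8)

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    inter = np.logical_and(pred > 0, gt > 0).sum()
    return 2.0 * inter / ((pred > 0).sum() + (gt > 0).sum() + eps)

def hd95(pred: np.ndarray, gt: np.ndarray,
         spacing=(3.646, 3.646, 3.27)) -> float:
    """95th-percentile symmetric Hausdorff distance, in mm."""
    def surface(m):
        return np.logical_xor(m, ndimage.binary_erosion(m))
    sp, sg = surface(pred > 0), surface(gt > 0)
    # distance of each surface voxel to the other contour's surface
    dt_g = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dt_p = ndimage.distance_transform_edt(~sp, sampling=spacing)
    d = np.concatenate([dt_g[sp], dt_p[sg]])
    return float(np.percentile(d, 95))
```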

Differences in segmentation performance between the models, as measured by the DSC, were tested for statistical significance as described under Statistical analysis below. A P value less than 0.05 was considered to indicate a statistically significant difference. Our work was implemented using the PyTorch 1.13.1 (https://pytorch.org/) DL framework, and the models were trained on an NVIDIA GeForce GTX 1080 Ti GPU (NVIDIA, Santa Clara, CA, USA).

Statistical analysis

To rigorously assess the differences in segmentation performance between the three models (Res-Unet, Nn-Unet, Nn-Former), all quantitative indicators (DSC, HD, ASD) were first tested for normality using the Shapiro-Wilk test. Given that the data conformed to a normal distribution, parametric statistical methods were employed for subsequent analyses. A one-way analysis of variance (ANOVA) was first conducted to evaluate the overall differences in performance indicators (DSC, HD, ASD) among the three models. When a significant overall difference was detected (P<0.05), post-hoc pairwise comparisons between each pair of models (Nn-Unet vs. Res-Unet, Nn-Unet vs. Nn-Former, Res-Unet vs. Nn-Former) were performed using t-tests. To address the inflation of type I error due to multiple comparisons (three pairwise tests in total), the Holm correction was applied to adjust the critical P values. All statistical analyses were implemented using Python’s SciPy and scikit-learn libraries.
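This workflow can be reproduced with a short SciPy script, sketched below on placeholder per-case DSC arrays. The use of paired t-tests (each test case is segmented by all three models) is our assumption, and Holm’s step-down adjustment is written out explicitly, since neither SciPy nor scikit-learn exposes it directly.

```python
# Sketch of the statistical workflow (placeholder data, assumed paired tests).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dsc = {m: rng.normal(0.8, 0.02, 42)             # per-case DSC placeholders
       for m in ("Nn-Unet", "Res-Unet", "Nn-Former")}

# 1. Normality check per model (Shapiro-Wilk)
for name, vals in dsc.items():
    print(name, "Shapiro-Wilk P =", stats.shapiro(vals).pvalue)

# 2. Overall difference across the three models (one-way ANOVA)
print("ANOVA P =", stats.f_oneway(*dsc.values()).pvalue)

# 3. Pairwise paired t-tests (same cases under each model)
pairs = [("Nn-Unet", "Res-Unet"), ("Nn-Unet", "Nn-Former"),
         ("Res-Unet", "Nn-Former")]
p_raw = [stats.ttest_rel(dsc[a], dsc[b]).pvalue for a, b in pairs]

# 4. Holm step-down correction for the three comparisons
order = np.argsort(p_raw)
p_holm = np.empty(len(p_raw))
running_max = 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (len(p_raw) - rank) * p_raw[idx])
    p_holm[idx] = min(1.0, running_max)
print(dict(zip(pairs, p_holm)))
```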


Results

Segmentation model evaluation

Table 2 presents the DSC, 95% HD, and ASD values for the three models on both the training and testing sets. Nn-Unet achieved the highest DSC on the training set (DSC =0.869), which decreased slightly on the testing set (DSC =0.833). The DSC for Nn-Former remained around 0.8 on both sets. In contrast, Res-Unet had the lowest DSC values among the three, especially on the testing set (DSC =0.794). In terms of the 95% HD value, Nn-Unet performed better than the other two models on both the training and testing sets. However, the HD value for Res-Unet on the testing set was much higher than that on the training set. In terms of the ASD, the test-set results of the three models ranged from a minimum of 1.711 mm (Nn-Unet) to a maximum of 1.894 mm (Res-Unet). The edge-prediction errors were thus below the image resolution recorded in the Digital Imaging and Communications in Medicine (DICOM) headers (3.646 mm × 3.646 mm × 3.27 mm). Figure 3 illustrates the changes in the Loss values for the three models on the training set as training progressed. The Loss curves for all models gradually decreased, indicating that the models learned relevant features of NPC lesions. The final Loss value for Res-Unet was around 0.7, whereas those for Nn-Unet and Nn-Former were below 0.8.

Table 2

Segmentation results of all models

Model Dataset DSC 95% HD (mm) ASD (mm)
Nn-Unet Training 0.869 (0.008) 4.873 (1.392) 0.579 (0.069)
Nn-Unet Testing 0.833 (0.017) 7.249 (2.753) 1.711 (0.615)
Res-Unet Training 0.809 (0.011) 5.769 (1.549) 1.136 (0.164)
Res-Unet Testing 0.794 (0.019) 9.327 (2.853) 1.894 (0.666)
Nn-Former Training 0.823 (0.010) 4.934 (1.245) 0.932 (0.080)
Nn-Former Testing 0.808 (0.023) 7.367 (3.506) 1.769 (0.622)

Data are presented as mean (standard deviation). ASD, average surface distance; DSC, Dice similarity coefficient; HD, Hausdorff distance.

Figure 3 Loss curve on the training set.

Visualization of segmentation model results

Figure 4 presents the prediction results of the Nn-Unet model on the testing set. Comparing the three-view fused images of the original images, prediction outcomes, and annotated results shows that the Nn-Unet model offers highly accurate predictions for NPC lesions, closely approximating the gold-standard results. Figure 5 depicts typical cases of primary NPC lesion segmentation from the three networks. Note that the lesions in well-performing cases tend to be larger and have better contrast. Figure 6 presents the box plots of the Dice coefficients for all models on both the training and testing sets. From Figure 6A,6B, it can be seen that the median and interquartile range (IQR) values of the Dice coefficient for the Nn-Unet model were higher than those of the other two models on both sets. There were significant differences in the Dice coefficient between the Nn-Unet model and the other two models on the training set (Nn-Unet vs. Res-Unet, P<0.001; Nn-Unet vs. Nn-Former, P<0.001).

Figure 4 Nn-Unet model prediction results. (A) Original images in three views; (B) a merged visualization of Nn-Unet prediction results and the annotated ground truth in three views, with label =1 representing the annotated ground truth, label =2 representing the Nn-Unet prediction results, and label =3 representing the merged visualization.
Figure 5 Comparison of segmentation results of two typical primary lesions from three networks.
Figure 6 Box plots of evaluation metrics. (A,B) The box plots of the Dice coefficients for all models on the training set and test set, respectively.

Subgroup analysis of tumor T-stage

To explore model performance across clinical subgroups, we analyzed the 212 NPC cases stratified by T-stage (T1–2 and T3–4). Figure 7 displays the results of the subgroup statistical analysis of the Dice coefficient. In the T1–2 group, Nn-Unet significantly outperformed Res-Unet (P=0.016), whereas there was no significant difference between Nn-Unet and Nn-Former (P=0.141). In contrast, in the T3–4 group, Nn-Unet demonstrated significant superiority over both Res-Unet and Nn-Former (both P<0.001).

Figure 7 A comprehensive statistical comparison of the Dice coefficients for segmentation, showcasing the performance of three models using data from two subgroups categorized based on the tumor’s stage.

Discussion

NPC is a malignant tumor predominantly prevalent in East Asian countries, with radiotherapy serving as the primary treatment modality (31,32). PET/CT, a functional imaging technique, is widely used in NPC diagnosis, staging, prognosis prediction, and radiotherapy planning (33). Accurate delineation of the radiotherapy target zone on PET images is essential for formulating effective radiotherapy plans and predicting prognoses (33). However, manual delineation of the GTV is time-consuming, subjective, and heavily reliant on physician experience. PET images face challenges such as noise, limited spatial resolution, and partial volume effect, all of which can compromise delineation accuracy. NPC lesions also exhibit variable morphologies and blurry boundaries, making them particularly difficult to segment due to PET’s inherent low resolution and high noise levels. Traditional segmentation methods often struggle to achieve sufficient accuracy, and machine learning algorithms are limited by the need for manual feature extraction (19-21).

In this study, we focused on leveraging DL technology, specifically CNNs, to improve the segmentation of NPC lesions based on PET images. We employed three state-of-the-art models (Res-Unet, Nn-Unet, and Nn-Former) for comparative analysis, all built upon the U-Net architecture, which is known for its encoder-decoder symmetry and its effectiveness in capturing contextual information while preserving spatial details. Our results demonstrate that the Nn-Unet model exhibited superior performance, achieving a DSC of 0.869 on the training set and 0.833 on the test set, indicating high segmentation accuracy. Nn-Unet also outperformed the other two models in terms of the HD metric, particularly in accurately capturing lesion boundaries (P<0.05 for both comparisons in the training set). Furthermore, Nn-Unet exhibited the lowest ASD on the test set (1.711 mm), with edge-prediction errors below the image resolution recorded in the DICOM headers (3.646 mm × 3.646 mm × 3.27 mm), indicating higher accuracy than Res-Unet and Nn-Former in predicting lesion edges. This performance is closely related to its automated preprocessing, network architecture selection, and optimization of the training strategy. Additionally, Nn-Unet may have achieved a balance in model complexity that is neither too simplistic nor overly complex (24). These findings suggest that Nn-Unet is more robust than Res-Unet and Nn-Former in dealing with complex structures and variable lesion morphologies.

Zhao et al. (34) developed MMCA-Net based on PET/CT images, enhancing modal fusion with a cross-attention mechanism between CT and PET data. Their results were validated through five-fold cross-validation on data from two centers, where 10 algorithms were tested, and ultimately achieved a DSC of 0.815. Although their segmentation accuracy is comparable to ours, their model requires both the PET and CT modalities. The present study demonstrated that the Nn-Unet method achieves impressive segmentation results using PET images alone. As illustrated in Figure 4, the predicted segmentations align closely with the gold-standard results, suggesting potential value in guiding GTV delineation for NPC radiotherapy.

In contrast, although the Res-Unet model introduces residual connections to enhance the learning and transmission of deep features, it performed poorly in this study. Specifically, its 95% HD on the test set reached 9.327 mm, notably higher than the 7.249 mm of Nn-Unet, indicating limited performance in handling complex or uncommon lesions. This may reflect a need for further optimization of Res-Unet’s model structure or training strategy.

Transformer models have demonstrated excellent capabilities in capturing such long-range information in several semantic segmentation tasks performed on medical images (35). The Nn-Former model in this study introduced a Transformer structure on top of Nn-Unet to enhance its understanding of global image information and adopted a multi-scale feature fusion strategy. However, its DSC and HD values on the test set did not significantly outperform those of Nn-Unet. A study by Xiong et al. (36) similarly found that Transformer-based models did not demonstrate a significant advantage over CNN-based models in head and neck tumor segmentation. They speculated that this might be related to the shape of the lesions. Additionally, Transformers tend to converge more slowly during training and require larger datasets. Although Transformer-based models have the potential to achieve superior segmentation performance, realizing such improvements seems to necessitate a significantly larger training dataset, which is often a challenge in medical applications given the limited availability of training data and the effort required to generate annotations.

The tumor T-stage subgroup analysis yields key insights into the three models’ clinical application. Nn-Unet demonstrated superior and consistent performance in both early (T1–2) and locally advanced (T3–4) NPC, which is particularly notable given the greater segmentation challenges of locally advanced tumors (e.g., irregular boundaries, metabolic heterogeneity). Its adaptive network architecture and multi-scale feature fusion likely enable it to navigate such complexity effectively. Clinically, Nn-Unet’s consistent accuracy across T stages supports reliable radiotherapy target volume delineation, helping to reduce observer variability and improve treatment plan standardization, which is valuable for centers with limited NPC radiotherapy experience.

Despite the accomplishments of this study, there remain some limitations. Firstly, the dataset comprising only 212 PET images and annotation data is relatively small, potentially impacting the model’s generalization ability. Secondly, although these findings suggest Nn-Unet’s robustness, its clinical potential requires validation through external datasets and integration into clinical workflows. Thirdly, this study limited its analysis to three DL models. Future endeavors can explore more advanced segmentation algorithms and model architectures, such as those based on graph convolutional networks (GCNs) and generative adversarial networks (GANs). Additionally, the utilization of multimodal imaging data, such as PET-CT fusion images, can enhance lesion segmentation by leveraging complementary information from different modalities, thereby improving accuracy and robustness. Lastly, although this study primarily emphasized the accuracy and robustness of lesion segmentation, future research can delve deeper into the clinical utility of segmentation results in decision support, including applications in tumor volume measurement and radiotherapy planning based on segmentation outcomes.


Conclusions

This study focused on the application of automatic and rapid segmentation of PET images in delineating NPC radiotherapy target volumes. By comprehensively evaluating the performance of three DL models, it revealed the technical potential of Nn-Unet in primary tumor segmentation tasks within single-center datasets. This provides a methodological foundation for PET image-guided delineation, pending further validation of clinical utility through external multicenter trials and assessment by radiation oncology teams.


Acknowledgments

The authors thank the patients and clinical staff.


Footnote

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1090/dss

Funding: This study was funded by the Guangdong Basic and Applied Basic Research Fund (No. 2021A1515220052), the Enhanced Research Initiation Program for Elite Scientists at the Affiliated Hospital of Guangdong Medical University (No. GCC2023030), the 2024 Key Research Platforms and Projects of Ordinary Universities in Guangdong Province (No. 2024KTSCX163), Zhanjiang Science and Technology Projects (No. 2021A05054), and the Big Data Platform of Affiliated Hospital of Guangdong Medical University (No. 20250312).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1090/coif). Y.J. reports employment by Huiying Medical Technology Co., Ltd. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study received approval from the Ethics Committee of the Affiliated Hospital of Guangdong Medical University (Ethics Approval No. JP-2023-016). Informed consent was obtained from all individual participants.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Chen YP, Chan ATC, Le QT, Blanchard P, Sun Y, Ma J. Nasopharyngeal carcinoma. Lancet 2019;394:64-80. [Crossref] [PubMed]
  2. Qu S, Liang ZG, Zhu XD. Advances and challenges in intensity-modulated radiotherapy for nasopharyngeal carcinoma. Asian Pac J Cancer Prev 2015;16:1687-92. [Crossref] [PubMed]
  3. Liu Z, Chen Y, Su Y, Hu X, Peng X. Nasopharyngeal Carcinoma: Clinical Achievements and Considerations Among Treatment Options. Front Oncol 2021;11:635737. [Crossref] [PubMed]
  4. Chang MC, Chen JH, Liang JA, Yang KT, Cheng KY, Kao CH. Accuracy of whole-body FDG-PET and FDG-PET/CT in M staging of nasopharyngeal carcinoma: a systematic review and meta-analysis. Eur J Radiol 2013;82:366-73. [Crossref] [PubMed]
  5. Gihbid A, Cherkaoui Salhi G, El Alami I, Belgadir H, Tawfiq N, Bendahou K, El Mzibri M, Cadi R, El Benna N, Guensi A, Khyatti M. Pretreatment [18F]FDG PET/CT and MRI in the prognosis of nasopharyngeal carcinoma. Ann Nucl Med 2022;36:876-86.
  6. Mohandas A, Marcus C, Kang H, Truong MT, Subramaniam RM. FDG PET/CT in the management of nasopharyngeal carcinoma. AJR Am J Roentgenol 2014;203:W146-57. [Crossref] [PubMed]
  7. Lin L, Dou Q, Jin YM, Zhou GQ, Tang YQ, Chen WL, Su BA, Liu F, Tao CJ, Jiang N, Li JY, Tang LL, Xie CM, Huang SM, Ma J, Heng PA, Wee JTS, Chua MLK, Chen H, Sun Y. Deep Learning for Automated Contouring of Primary Tumor Volumes by MRI for Nasopharyngeal Carcinoma. Radiology 2019;291:677-86. [Crossref] [PubMed]
  8. Gu B, Zhang J, Ma G, Song S, Shi L, Zhang Y, Yang Z. Establishment and validation of a nomogram with intratumoral heterogeneity derived from (18)F-FDG PET/CT for predicting individual conditional risk of 5-year recurrence before initial treatment of nasopharyngeal carcinoma. BMC Cancer 2020;20:37. [Crossref] [PubMed]
  9. Yoon HI, Kim KH, Lee J, Roh YH, Yun M, Cho BC, Lee CG, Keum KC. The Clinical Usefulness of (18)F-Fluorodeoxyglucose Positron Emission Tomography (PET) to Predict Oncologic Outcomes and PET-Based Radiotherapeutic Considerations in Locally Advanced Nasopharyngeal Carcinoma. Cancer Res Treat 2016;48:928-41. [Crossref] [PubMed]
  10. Fei Z, Xu T, Hong H, Xu Y, Chen J, Qiu X, Ding J, Huang C, Li L, Liu J, Chen C. PET/CT standardized uptake value and EGFR expression predicts treatment failure in nasopharyngeal carcinoma. Radiat Oncol 2023;18:33. [Crossref] [PubMed]
  11. Zaidi H, Karakatsanis N. Towards enhanced PET quantification in clinical oncology. Br J Radiol 2018;91:20170508. [Crossref] [PubMed]
  12. Norouzi A, Rahim MSM, Altameem A, Saba T, Rad AE, Rehman A, Uddin M. Medical Image Segmentation Methods, Algorithms, and Applications. IETE Technical Review 2014;31:199-213.
  13. Patil DD, Deore SG. Medical image segmentation: a review. Int J Comput Sci Mob Comput 2013;2:22-7.
  14. Xu Z, Gao M, Papadakis GZ, Luna B, Jain S, Mollura DJ, Bagci U. Joint solution for PET image segmentation, denoising, and partial volume correction. Med Image Anal 2018;46:229-43. [Crossref] [PubMed]
  15. Foster B, Bagci U, Mansoor A, Xu Z, Mollura DJ. A review on segmentation of positron emission tomography images. Comput Biol Med 2014;50:76-96. [Crossref] [PubMed]
  16. Day E, Betler J, Parda D, Reitz B, Kirichenko A, Mohammadi S, Miften M. A region growing method for tumor volume segmentation on PET images for rectal and anal cancer patients. Med Phys 2009;36:4349-58. [Crossref] [PubMed]
  17. Hatt M, Lee JA, Schmidtlein CR, Naqa IE, Caldwell C, De Bernardi E, et al. Classification and evaluation strategies of auto-segmentation approaches for PET: Report of AAPM task group No. 211. Med Phys 2017;44:e1-e42. [Crossref] [PubMed]
  18. Zaidi H, El Naqa I. PET-guided delineation of radiation therapy treatment volumes: a survey of image segmentation techniques. Eur J Nucl Med Mol Imaging 2010;37:2165-87. [Crossref] [PubMed]
  19. Yu H, Caldwell C, Mah K, Poon I, Balogh J, MacKenzie R, Khaouam N, Tirona R. Automated radiation targeting in head-and-neck cancer using region-based texture analysis of PET and CT images. Int J Radiat Oncol Biol Phys 2009;75:618-25. [Crossref] [PubMed]
  20. Berthon B, Evans M, Marshall C, Palaniappan N, Cole N, Jayaprakasam V, Rackley T, Spezi E. Head and neck target delineation using a novel PET automatic segmentation algorithm. Radiother Oncol 2017;122:242-7. [Crossref] [PubMed]
  21. Kawata Y, Arimura H, Ikushima K, Jin Z, Morita K, Tokunaga C, Yabu-Uchi H, Shioyama Y, Sasaki T, Honda H, Sasaki M. Impact of pixel-based machine-learning techniques on automated frameworks for delineation of gross tumor volume regions for stereotactic body radiation therapy. Phys Med 2017;42:141-9. [Crossref] [PubMed]
  22. Bradshaw TJ, McMillan AB. Anatomy and Physiology of Artificial Intelligence in PET Imaging. PET Clin 2021;16:471-82. [Crossref] [PubMed]
  23. Yousefirizi F, Jha AK, Brosch-Lenz J, Saboury B, Rahmim A. Toward High-Throughput Artificial Intelligence-Based Segmentation in Oncological PET Imaging. PET Clin 2021;16:577-96. [Crossref] [PubMed]
  24. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
  25. Shaukat Z, Farooq QUA, Tu S, Xiao C, Ali S. A state-of-the-art technique to perform cloud-based semantic segmentation using deep learning 3D U-Net architecture. BMC Bioinformatics 2022;23:251. [Crossref] [PubMed]
  26. Bhandary S, Kuhn D, Babaiee Z, Fechter T, Benndorf M, Zamboglou C, Grosu AL, Grosu R. Investigation and benchmarking of U-Nets on prostate segmentation tasks. Comput Med Imaging Graph 2023;107:102241. [Crossref] [PubMed]
  27. Beichel RR, Van Tol M, Ulrich EJ, Bauer C, Chang T, Plichta KA, Smith BJ, Sunderland JJ, Graham MM, Sonka M, Buatti JM. Semiautomated segmentation of head and neck cancers in 18F-FDG PET scans: A just-enough-interaction approach. Med Phys 2016;43:2948-64. [Crossref] [PubMed]
  28. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin JC, Pujol S, Bauer C, Jennings D, Fennessy F, Sonka M, Buatti J, Aylward S, Miller JV, Pieper S, Kikinis R. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging 2012;30:1323-41. [Crossref] [PubMed]
  29. Diakogiannis FI, Waldner F, Caccetta P, Wu C. ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing 2020;162:94-114.
  30. Zhou HY, Guo J, Zhang Y, Han X, Yu L, Wang L, Yu Y. nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer. IEEE Trans Image Process 2023;32:4036-45. [Crossref] [PubMed]
  31. Tang LL, Chen WQ, Xue WQ, He YQ, Zheng RS, Zeng YX, Jia WH. Global trends in incidence and mortality of nasopharyngeal carcinoma. Cancer Lett 2016;374:22-30. [Crossref] [PubMed]
  32. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49. [Crossref] [PubMed]
  33. Li H, Kong Z, Xiang Y, Zheng R, Liu S. The role of PET/CT in radiotherapy for nasopharyngeal carcinoma. Front Oncol 2022;12:1017758. [Crossref] [PubMed]
  34. Zhao W, Huang Z, Tang S, Li W, Gao Y, Hu Y, Fan W, Cheng C, Yang Y, Zheng H, Liang D, Hu Z. MMCA-NET: A Multimodal Cross Attention Transformer Network for Nasopharyngeal Carcinoma Tumor Segmentation Based on a Total-Body PET/CT System. IEEE J Biomed Health Inform 2024;28:5447-58. [Crossref] [PubMed]
  35. Li GY, Chen J, Jang SI, Gong K, Li Q. SwinCross: Cross-modal Swin transformer for head-and-neck tumor segmentation in PET/CT images. Med Phys 2024;51:2096-107. [Crossref] [PubMed]
  36. Xiong X, Smith BJ, Graves SA, Graham MM, Buatti JM, Beichel RR. Head and Neck Cancer Segmentation in FDG PET Images: Performance Comparison of Convolutional Neural Networks and Vision Transformers. Tomography 2023;9:1933-48. [Crossref] [PubMed]
Cite this article as: Liang M, Guo C, Li H, Wang D, Jing Y, Pan Y, Huang Z, Song N, Liu X, Xiao J. Automated segmentation of the primary tumor in nasopharyngeal carcinoma using a deep learning framework in positron emission tomography imaging: a comparative study. Quant Imaging Med Surg 2025;15(11):11522-11533. doi: 10.21037/qims-2025-1090
