Comparing deep-learning, radiomics, and fusion models for parathyroid tumor classification using ultrasound: a multicenter retrospective study
Introduction
Primary hyperparathyroidism (pHPT) is a prevalent endocrine disorder characterized by long-lasting hypercalcemia resulting from excessive parathyroid hormone secretion (1). The majority (85%) of cases stem from benign solitary parathyroid adenomas (PA), which are manageable with minimally invasive surgery. In contrast, parathyroid carcinoma (PC), a malignant neoplasm, and atypical parathyroid tumor (APT), a neoplasm of uncertain malignant potential, both warrant careful surgical strategies. Radical en bloc resection is recommended for PC because of its invasiveness and high risk of recurrence, whereas APT requires appropriate surgery and prolonged follow-up (2). Recent research has reported that the incidence rate of APT/PC is 0.5% to 5% in Western countries, but can be as high as 6% to 11.5% in Asian populations (3-5). Due to the relative rarity of these tumors and the consequent lack of validated diagnostic biomarkers, accurate preoperative differentiation between PA and APT/PC remains a substantial clinical challenge (6,7).
Clinically, APT often presents with symptoms indistinguishable from those of adenomas. The widely adopted “<3 + <3 rule” (benign tumor diameter <3 cm and calcium <3 mmol/L) lacks specificity (SPE), as benign adenomas may exceed 3 cm (8,9). Although clinical markers (e.g., palpable neck mass, severe hypercalcemia, or osteoporosis) raise suspicion for PC (10), some patients with PC may present with normocalcemia or remain asymptomatic (11). Invasive procedures such as fine-needle aspiration are of limited diagnostic value due to their inaccuracy and potential risks (8). Consequently, noninvasive preoperative tools are urgently needed. While ultrasound is the primary imaging modality for localizing parathyroid lesions, conventional visual assessment relies heavily on the radiologist’s experience, and although certain features like irregular shape or infiltration are suggestive of APT/PC, macroscopic evaluation alone is insufficient for reliable diagnosis (6,12).
To address this, artificial intelligence (AI) approaches such as radiomics and deep learning (DL) have emerged. Radiomics quantifies tumor heterogeneity through handcrafted features, offering objectivity and interpretability (13-15), yet such features may inadequately capture high-dimensional patterns and are susceptible to manual extraction biases (16). DL autonomously learns complex features from raw imaging data, excelling in tasks such as parathyroid lesion segmentation and detection (17,18). However, DL demands large datasets, risks overfitting, and lacks interpretability, which constrains clinical adoption. Crucially, radiomics and DL offer complementary strengths: radiomics provides stability against spatial transformations, while DL captures abstract patterns. Multidomain fusion models leveraging both paradigms have outperformed single approaches in other malignancies (16,19). However, few studies have systematically explored the application of radiomics, DL, or fusion models in characterizing PA and APT/PC using ultrasound. To address this gap, we compared three types of models to develop a more accurate, objective tool for distinguishing PA from APT/PC, potentially assisting in preoperative risk stratification, improving early clinical decisions, and enhancing surgical outcomes in pHPT patients. We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2652/rc).
Methods
Study design
A total of 1,122 patients with parathyroid neoplasms who underwent surgical treatment at two Chinese hospitals from January 1, 2016, to April 30, 2025, were retrospectively included. Among them, 577 patients were consecutively enrolled from Hospital 1 (Nanjing Drum Tower Hospital), and 545 patients from Hospital 2 (The First Affiliated Hospital of Nanjing Medical University). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the ethics committee of Nanjing Drum Tower Hospital (No. 2024-611-01). Given the retrospective design using fully anonymized secondary data, the requirement for informed consent was formally waived by the ethics committee. The First Affiliated Hospital of Nanjing Medical University was also informed of and agreed to the study.
The inclusion criteria were as follows: (I) preoperative ultrasound examination within 7 calendar days prior to surgical intervention; (II) availability of complete clinical data, including demographic characteristics, preoperative laboratory results such as serum calcium, phosphorus, and intact parathyroid hormone (iPTH) levels, and surgical records; (III) pathologically confirmed diagnosis of PA, APT, or PC based on the 2022 World Health Organization (WHO) criteria, with all pathological findings evaluated by senior endocrine pathologists; and (IV) a follow-up period of at least 6 months. The exclusion criteria were as follows: (I) incomplete clinical or pathological diagnosis; (II) an ultrasound-invisible mass, defined as no identifiable abnormal parathyroid tissue on preoperative ultrasound, confirmed by two independent senior ultrasound physicians with ≥5 years of experience in neck endocrine imaging; (III) recurrent pHPT or known multiple endocrine neoplasia; (IV) unsatisfactory image quality for analysis, defined as images with insufficient resolution to distinguish the parathyroid from adjacent tissues, or incomplete coverage of the parathyroid region of interest (ROI); and (V) fine-needle aspiration (FNA) of the neck lesion at any time before surgery.
The flowchart with detailed patient selection is shown in Figure 1. The rates of APT/PC and PA were 6.6% (74/1,122) and 93.4% (1,048/1,122), respectively. Given the limited number of APT/PC cases (n=55 from Hospital 1; n=19 from Hospital 2), data from both centers were pooled to increase statistical power. To ensure balanced model training and evaluation, stratified sampling based on pathological diagnosis was applied to divide the cohort into a training set (70%, n=786), validation set (15%, n=168), and test set (15%, n=168). Specifically, the number of APT/PC cases in the training datasets, validation sets, and test sets were 53, 10, and 11, respectively. To enhance model robustness and avoid overfitting while addressing class imbalance due to the small number of APT/PC cases, data augmentation techniques (horizontal and vertical flipping, Gaussian noise addition) were applied to expand APT/PC images tenfold (resulting in 530 and 100 samples in the training and validation sets, respectively) without augmenting PA images.
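The tenfold expansion of APT/PC images can be illustrated with a minimal sketch. The function name `augment_tenfold`, the noise level `noise_sigma`, and the particular mix of flipped and noisy copies are illustrative assumptions; the exact recipe used in the study is not specified in this section.

```python
import numpy as np

def augment_tenfold(image, noise_sigma=0.02, seed=0):
    """Return 10 variants of a 2-D grayscale image scaled to [0, 1].

    Assumed recipe (not taken from the study): the original, three
    flipped copies, and six copies with added Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    flips = [
        image,                        # original
        np.fliplr(image),             # horizontal flip
        np.flipud(image),             # vertical flip
        np.flipud(np.fliplr(image)),  # both flips
    ]
    variants = list(flips)
    i = 0
    # Fill the remaining slots with noisy copies, cycling through the flips.
    while len(variants) < 10:
        base = flips[i % len(flips)]
        noisy = base + rng.normal(0.0, noise_sigma, size=base.shape)
        variants.append(np.clip(noisy, 0.0, 1.0))
        i += 1
    return variants

# Synthetic stand-in for one APT/PC ultrasound image.
image = np.random.default_rng(1).random((500, 500))
augmented = augment_tenfold(image)
print(len(augmented))  # 10
```

Applied to every minority-class image, such a scheme turns 53 training cases into 530 samples, which matches the counts reported above.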
Ultrasound imaging acquisition
A flowchart of this study is shown in Figure 2. All ultrasound examinations were performed by four board-certified radiologists with over 5 years of experience in superficial tissue ultrasound imaging. The equipment used included LOGIQ E9 (GE Healthcare, Milwaukee, WI, USA), EPIQ 5 (Philips, Amsterdam, Netherlands), and Resona 7 (Mindray, Shenzhen, China) with high-frequency probes. The image settings, including time-gain compensation, focal position, and dynamic range, adhered to the manufacturer’s recommendations. The patients were placed in the supine position with extended necks. Scanning commenced at the thoracic apex and proximal common carotid arteries, progressing to the carotid bifurcation and submental region, covering the bilateral neck areas. Ultrasound imaging was performed in the transverse and longitudinal planes, capturing the maximal cross-sectional and longitudinal views for analysis.
Hand-crafted radiomics workflow
Image preprocessing
The ROIs of parathyroid neoplasms were manually delineated by two readers experienced in parathyroid ultrasound imaging, L.C. (7 years of experience) and X.H. (10 years of experience). Working in a blinded manner, they independently used the ImageJ software to outline the tumor boundaries. To assess the reliability and consistency of the ROI delineations, 100 images were randomly selected and re-annotated by the same readers 2 months later. Intraclass correlation coefficient (ICC) analysis was conducted based on these repeated measurements. An ICC >0.80 was considered to indicate good reliability for tumor segmentation and feature extraction. The outlined ROIs were subsequently re-evaluated and validated by a senior radiologist (W.S.). In cases of disagreement, the radiologists discussed the case to reach a consensus, thereby ensuring consistency in the findings.
Based on the ROI of each lesion, the top, bottom, left, and right boundary points were automatically generated to create a bounding box. The rectangular bounding box was then cropped from the original image, resized to 500×500 pixels, and normalized.
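A minimal sketch of this cropping step is given here, assuming the ROI is available as a binary mask with the same shape as the image. The function name `crop_resize_normalize`, the nearest-neighbour resizing, and the min-max scaling are assumptions made for illustration; the interpolation and normalization schemes used in the study are not stated.

```python
import numpy as np

def crop_resize_normalize(image, mask, size=500):
    """Crop the ROI bounding box, resize to size x size, scale to [0, 1]."""
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:
        raise ValueError("mask contains no ROI pixels")
    # Top, bottom, left, and right boundary points of the ROI.
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    crop = image[top:bottom + 1, left:right + 1].astype(np.float64)

    # Nearest-neighbour resize via index lookup (assumed interpolation).
    row_idx = np.arange(size) * crop.shape[0] // size
    col_idx = np.arange(size) * crop.shape[1] // size
    resized = crop[np.ix_(row_idx, col_idx)]

    # Min-max normalization (assumed scheme).
    lo, hi = resized.min(), resized.max()
    if hi == lo:
        return np.zeros_like(resized)
    return (resized - lo) / (hi - lo)

# Synthetic image and rectangular ROI standing in for a real annotation.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800), dtype=np.uint8)
mask = np.zeros(image.shape, dtype=bool)
mask[100:300, 200:450] = True
patch = crop_resize_normalize(image, mask)
print(patch.shape)
```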
Radiomics feature extraction
PyRadiomics was used to extract radiomics features. A total of 567 radiomics features were obtained, including first-order, shape-based, and texture features. The specific parameters used for radiomics feature extraction are described in the official PyRadiomics documentation (https://pyradiomics.readthedocs.io/en/latest/).
Feature selection and model construction
Feature selection is detailed in Appendix 1. From an initial 567 ultrasound features, this process retained 77 key features for modeling (Figure S1A). All machine learning models were implemented in Python. The classifiers evaluated in this study include logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), random forest (RF), multi-layer perceptron (MLP), extreme gradient boosting algorithm (XGBoost), and light gradient boosting machine (LightGBM). We used the two-tailed DeLong test to select the model with the highest area under the receiver operating characteristic (ROC) curve (AUC) as the optimal model.
DL workflow
DL features were extracted using ResNet101 (20). Two ResNet101 variants were used. Specifically, the input to the one-channel ResNet101 (1ch_ResNet101) was the original parathyroid ultrasound image without any manual segmentation. The input to the two-channel ResNet101 (2ch_ResNet101) was the concatenation of the original image and the corresponding ROI.
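The difference between the two inputs can be sketched as follows, assuming the ROI is supplied as a binary mask and stacked with the image along the channel axis. The helper name `make_inputs` and the channels-first layout are illustrative assumptions; whether the study used a binary mask or a masked image as the second channel is not stated here.

```python
import numpy as np

def make_inputs(image, roi_mask):
    """Build channels-first inputs for the one- and two-channel networks."""
    # One channel: the ultrasound image alone.
    one_channel = image[np.newaxis, ...].astype(np.float32)
    # Two channels: image plus ROI mask (assumed channel-wise stacking).
    two_channel = np.stack([image, roi_mask], axis=0).astype(np.float32)
    return one_channel, two_channel

# Synthetic image and mask standing in for real data.
image = np.random.default_rng(0).random((500, 500))
roi_mask = np.zeros((500, 500), dtype=bool)
roi_mask[150:350, 120:400] = True
x1, x2 = make_inputs(image, roi_mask)
print(x1.shape, x2.shape)  # (1, 500, 500) (2, 500, 500)
```

Because the standard ResNet101 expects three input channels, its first convolution would need to be adapted to accept one or two channels in either variant.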
After training, 2,048 features were extracted from the second-to-last average-pooling layers of both the 1ch_ResNet101 and 2ch_ResNet101 models. To ensure consistency, all features underwent z-score normalization. After normalization, Spearman’s correlation, ICC, and least absolute shrinkage and selection operator (LASSO) analyses were performed to identify the most relevant features. Finally, 42 and 65 features for the 1ch_ResNet101 (Figure S1B) and 2ch_ResNet101 models (Figure S1C), respectively, were retained for subsequent analysis.
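The first two steps of this selection pipeline, z-score normalization and a Spearman redundancy filter, can be sketched as follows. The 0.9 correlation threshold and the greedy keep-first rule are assumptions for illustration, and the ICC and LASSO steps are omitted; the study's actual settings are given in its appendix.

```python
import numpy as np

def zscore(features):
    """Standardize each feature column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns map to zero instead of dividing by zero
    return (features - mean) / std

def spearman_matrix(features):
    """Spearman correlation as the Pearson correlation of column ranks."""
    # Ties are not averaged here, which is acceptable for continuous features.
    ranks = features.argsort(axis=0).argsort(axis=0).astype(np.float64)
    return np.corrcoef(ranks, rowvar=False)

def drop_redundant(features, threshold=0.9):
    """Greedily keep features whose |rho| with every kept feature is <= threshold."""
    rho = np.abs(spearman_matrix(features))
    keep = []
    for j in range(features.shape[1]):
        if all(rho[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic feature matrix: column 3 is a monotone transform of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] ** 3
Z = zscore(X)
kept = drop_redundant(Z)
print(kept)
```

Using ranks rather than raw values means that a monotone but nonlinear duplicate, such as the cubed column in the example, is still detected as redundant.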
Construction of the fusion model
The study employed an early fusion strategy, which involved concatenating all features from radiomics and DL into a single feature vector. Radiomics features were extracted using PyRadiomics, and DL features were obtained using 1ch_ResNet101 and 2ch_ResNet101, as previously described. Radiomics and 1ch_ResNet101 features were combined to create fusion model 1, referred to as Merged model 1. Radiomics and 2ch_ResNet101 features were combined to create fusion model 2, referred to as Merged model 2. Construction of the fusion model is detailed in Appendix 2.
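Early fusion amounts to a column-wise concatenation of the two feature blocks before a single classifier is trained, as in this minimal sketch. The function name `early_fusion` is hypothetical, and the dimensions (77 radiomics and 42 DL features) simply reuse the retained feature counts reported above for illustration; whether fusion was applied before or after feature selection is described in the appendix, not here.

```python
import numpy as np

def early_fusion(radiomics_features, dl_features):
    """Concatenate per-case feature blocks into one feature vector per case."""
    if radiomics_features.shape[0] != dl_features.shape[0]:
        raise ValueError("both feature blocks must describe the same cases")
    return np.concatenate([radiomics_features, dl_features], axis=1)

# Synthetic features for 8 cases (dimensions are illustrative).
rng = np.random.default_rng(0)
radiomics_features = rng.normal(size=(8, 77))
dl_features = rng.normal(size=(8, 42))
fused = early_fusion(radiomics_features, dl_features)
print(fused.shape)  # (8, 119)
```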
Statistical analysis
All statistical analyses were performed using R (version 3.4.3) and the scikit-learn package (version 0.18) in Python 3.8.0. For continuous variables, the normality of the data distribution was assessed using the Shapiro-Wilk test. Normally distributed variables were compared using the independent t-test and presented as mean ± standard deviation, while non-normally distributed variables were compared using the Mann-Whitney U test and presented as median with interquartile range. The performance of the radiomics, DL, and merged models was evaluated using AUC, with histopathological examination serving as the reference standard. We also calculated some metrics based on the confusion matrix, such as sensitivity (SEN), SPE, accuracy (ACC), positive predictive value (PPV), and negative predictive value (NPV). To compare model performance, the DeLong test was employed to evaluate the statistical significance of differences in AUCs. Statistical significance was set at P<0.05.
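For reference, the confusion-matrix metrics and the AUC can be computed as in the following sketch, which uses only the Python standard library. The toy labels and scores are invented for illustration, APT/PC is assumed to be the positive class (label 1), and the 0.5 decision threshold is an assumption; the AUC is computed through its rank (Mann-Whitney) interpretation rather than with the study's software.

```python
def confusion_metrics(y_true, y_pred):
    """SEN, SPE, ACC, PPV, and NPV from binary labels (1 = APT/PC, 0 = PA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Each ratio assumes its denominator is non-zero in the evaluated set.
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "ACC": (tp + tn) / len(pairs),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
metrics = confusion_metrics(y_true, y_pred)
toy_auc = auc(y_true, scores)
print(metrics, round(toy_auc, 3))
```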
Results
Patient characteristics
This study analyzed data from 1,122 patients (270 men and 852 women; mean age, 54.2±13.7 years) with parathyroid neoplasms. The detailed information is summarized in Table 1. Notably, statistically significant differences were observed in age at diagnosis between the training and validation sets, as well as between the validation and test sets (both P<0.001). Additionally, serum phosphate levels differed significantly between the training and test sets (P=0.022).
Table 1
| Characteristics | Total (n=1,122) | Training cohort (n=786) | Validation cohort (n=168) | Test cohort (n=168) | P† | P‡ | P§ |
|---|---|---|---|---|---|---|---|
| Sex | | | | | 0.859 | 0.756 | 0.798 |
| Male | 270 (24.0) | 183 (23.3) | 38 (22.6) | 41 (24.4) | |||
| Female | 852 (76.0) | 603 (76.7) | 130 (77.4) | 127 (75.6) | |||
| Age at diagnosis (years) | 54.2±13.7 | 54.1±14.0 | 59.4±12.1 | 53.2±13.0 | <0.001 | 0.423 | <0.001 |
| Serum iPTH (pmol/L) | 18.5 (10.9, 23.6) | 19.0 (8.0, 22.6) | 17.8 (9.6, 27.8) | 18.1 (9.6, 25.4) | 0.621 | 0.687 | 0.822 |
| Serum calcium (mmol/L) | 2.7 (2.5, 3.0) | 2.6 (2.5, 2.9) | 2.6 (2.4, 3.1) | 2.7 (2.5, 3.0) | 0.725 | 0.281 | 0.144 |
| Serum phosphate (mmol/L) | 0.5 (0.3, 0.9) | 0.5 (0.4, 0.9) | 0.5 (0.3, 0.9) | 0.5 (0.3, 0.9) | 0.245 | 0.022 | 0.322 |
| Pathological subtype | | | | | 0.708 | 0.986 | 0.822 |
| PA | 1,048 (93.4) | 733 (93.3) | 158 (94.0) | 157 (93.4) | |||
| APT/PC | 74 (6.6) | 53 (6.7) | 10 (6.0) | 11 (6.6) | |||
Data are presented as n (%), mean ± SD, or median (Q1, Q3). †, training set vs. validation set; ‡, training set vs. test set; §, validation set vs. test set. APT, atypical parathyroid tumor; iPTH, intact parathyroid hormone; PA, parathyroid adenoma; PC, parathyroid carcinoma; Q1, 1st quartile; Q3, 3rd quartile; SD, standard deviation.
Diagnostic performance based on radiomics
The AUCs of LR, SVM, KNN, RF, MLP, XGBoost, and LightGBM are presented in Figure 3. All models showed a higher AUC in the training set, with RF achieving the best (0.968). However, its test set performance decreased significantly (0.885), suggesting potential overfitting. LightGBM generalized best in validation (0.953) but performed worst in testing (0.858), indicating instability. LR was the top performer in the testing set (0.905). DeLong’s test revealed that LR achieved the highest test AUC with no significant difference from SVM (0.905 vs. 0.901, P=0.754) and XGBoost (0.905 vs. 0.883, P=0.067), but significantly outperformed LightGBM (0.905 vs. 0.858, P<0.001) in the testing set. These results emphasize the considerable capacity of the LR model to discern between PA and APT/PC, leveraging the selected radiomic features. On the test data, the radiomics-based LR exhibited a SEN of 0.736, SPE of 0.911, PPV of 0.853, NPV of 0.831, ACC of 0.839, and an F1-score of 0.790 (Table 2).
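As a consistency check, the reported F1-score follows directly from the test-set PPV (precision) and SEN (recall) of the radiomics-based LR in Table 2:

```latex
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{SEN}}{\mathrm{PPV} + \mathrm{SEN}}
    = \frac{2 \times 0.853 \times 0.736}{0.853 + 0.736}
    = \frac{1.2556}{1.589}
    \approx 0.790
```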
Figure 3 (A-C) ROC curves of the radiomics models based on different machine learning methods. (D-F) Heatmaps showing the statistical comparison of model performance via the two-tailed DeLong test. AUC, area under the receiver operating characteristic curve; CI, confidence interval; KNN, K-nearest neighbors; LightGBM, light gradient boosting machine; LR, logistic regression; MLP, multi-layer perceptron; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machine; XGBoost, extreme gradient boosting algorithm.
Table 2
| Model | AUC (95% CI) | ACC | SEN | SPE | PPV | NPV |
|---|---|---|---|---|---|---|
| Radiomics based LR | ||||||
| Training | 0.959 (0.949–0.968) | 0.895 | 0.868 | 0.914 | 0.880 | 0.905 |
| Validation | 0.925 (0.892–0.953) | 0.864 | 0.800 | 0.905 | 0.842 | 0.877 |
| Test | 0.905 (0.864–0.941) | 0.839 | 0.736† | 0.911 | 0.853 | 0.831† |
| 1ch_ResNet101 | ||||||
| Training | 0.998 (0.997–0.999) | 0.985 | 0.974 | 0.993 | 0.990 | 0.981 |
| Validation | 0.825 (0.773–0.876) | 0.729 | 0.410 | 0.930 | 0.789 | 0.714 |
| Test | 0.804 (0.748–0.863) | 0.753 | 0.464 | 0.956 | 0.879 | 0.718 |
| 2ch_ResNet101 | ||||||
| Training | 0.998 (0.997–0.999) | 0.986 | 0.985 | 0.986 | 0.981 | 0.989 |
| Validation | 0.964 (0.941–0.982) | 0.892 | 0.800 | 0.950 | 0.909 | 0.882 |
| Test | 0.874 (0.830–0.915) | 0.779 | 0.546 | 0.943 | 0.870 | 0.748 |
| Merged model 1 based XGBoost | ||||||
| Training | 0.993 (0.990–0.995) | 0.954 | 0.938 | 0.966 | 0.952 | 0.956 |
| Validation | 0.912 (0.874–0.945) | 0.833 | 0.610 | 0.975 | 0.939 | 0.798 |
| Test | 0.926 (0.893–0.957) | 0.805 | 0.582 | 0.962† | 0.914 | 0.767 |
| Merged model 2 based LightGBM | ||||||
| Training | 0.997 (0.995–0.998) | 0.967 | 0.951 | 0.978 | 0.969 | 0.965 |
| Validation | 0.980 (0.965–0.992) | 0.868 | 0.720 | 0.962 | 0.923 | 0.844 |
| Test | 0.933 (0.902–0.960) | 0.854† | 0.700 | 0.962† | 0.928† | 0.821 |
†, the highest metric among all three models. ACC, accuracy; AUC, area under the receiver operating characteristic curve; CI, confidence interval; LR, logistic regression; NPV, negative predictive value; PPV, positive predictive value; SEN, sensitivity; SPE, specificity; XGBoost, extreme gradient boosting algorithm.
Diagnostic performance based on DL
The 1ch_ResNet101 model achieved AUC values of 0.998, 0.825, and 0.804 on the training, validation, and test sets, respectively. In comparison, the 2ch_ResNet101 model attained AUC values of 0.998, 0.964, and 0.874 on the training, validation, and test sets, respectively. In the three sets, the 2ch_ResNet101 model achieved higher performance than the 1ch_ResNet101 model (P<0.001). On the test data, 2ch_ResNet101 exhibited a SEN of 0.546, SPE of 0.943, PPV of 0.870, and NPV of 0.748 (Table 2). Figure S2 presents the epoch-AUC curve for both the 1ch_ResNet101 and 2ch_ResNet101 models, which demonstrates that our network training process was effective and did not exhibit signs of overfitting.
Diagnostic performance based on fusion model
The AUC values of Merged model 1 and Merged model 2 based on six machine learning algorithms, along with their comparative analyses, are detailed in Appendix 3. On the test set, XGBoost achieved the highest AUC for Merged model 1 (0.926), and LightGBM achieved the highest AUC for Merged model 2 (0.933). The diagnostic performance is summarized in Table 2.
Comparison of diagnostic performance of radiomics, DL, and fusion models
Figure 4 presents the ROC curves, performance radar plots, and DeLong test results for all models in the training, validation, and test sets. Merged model 1 based on XGBoost and Merged model 2 based on LightGBM achieved AUCs of 0.912 and 0.980 on the validation set and 0.926 and 0.933 on the test set, respectively, the latter being the highest test AUCs among all models (Figure 4A,4D,4G). Based on the confusion matrix, we calculated the diagnostic performance of each model (Table 2, Figure 4B,4E,4H). On the test set, Merged model 2 exhibited the highest ACC (0.854), SPE (0.962), and PPV (0.928), while the radiomics model exhibited the highest SEN (0.736) and NPV (0.831). The DeLong test results are shown in Figure 4C,4F,4I. On the test set (Figure 4I), Merged models 1 and 2 performed comparably to radiomics-based LR (P=0.171 and 0.059, respectively) while significantly surpassing 1ch_ResNet101 (P<0.001) and 2ch_ResNet101 (P<0.001). Meanwhile, radiomics-based LR (AUC =0.905) significantly outperformed both 1ch_ResNet101 (0.804) and 2ch_ResNet101 (0.874) (P<0.001). The AUC of Merged model 2 was higher than that of Merged model 1, although the difference was not statistically significant (0.933 vs. 0.926, P=0.593). Figure 5 shows the application of the models in the test set. First, original ultrasound images of the parathyroid glands were obtained, and professional physicians manually delineated ROIs. These ROIs were then fed into different models to generate classification predictions, enabling a direct visual comparison of each model’s clinical performance.
Discussion
Early screening for APT/PC is of utmost importance in patients with pHPT, as initial radical surgery can significantly improve patient prognosis. Nonetheless, current diagnostic challenges persist, including the absence of specific criteria to differentiate APT/PC from PA and the fact that nearly one-third of cases can only be confirmed histopathologically after surgery (21). Therefore, there is an urgent need to develop noninvasive imaging biomarkers and address the inherent limitations of preoperative diagnosis of these lesions. In this study, we compared the performance of radiomics and DL models via ultrasound imaging, as well as two fusion models, to differentiate APT/PC from PA. The radiomics and fusion models displayed promising prediction results and indicated a potential advantage for predicting parathyroid tumors.
Accurate preoperative diagnosis of PC remains clinically challenging (22). While machine learning has been widely applied to ultrasound-based classification of various diseases, a systematic exploration of radiomics and DL specifically for characterizing parathyroid lesions remains lacking. Previous studies on the application of machine learning in the parathyroid glands have primarily focused on functional prediction (15), identification (18), and localization (23). However, only a few have investigated the classification of benign vs. malignant parathyroid lesions. For instance, Krupinova et al. (9) constructed a two-step CatBoost gradient boosting model by analyzing clinical, laboratory, and ultrasonographic data from 242 patients with pHPT. The first model distinguished PA from the combined PC/APT groups, whereas the second differentiated PC from APT. The key predictors included serum iPTH, calcium levels, neoplasm volume/diameter, and bone/kidney complications. However, this study was limited by its small sample size and did not incorporate radiomics-related content.
In our study, among the various radiomics-based models evaluated, the LR demonstrated the most robust and consistent performance across the test dataset. Additionally, LR maintained a robust performance across datasets, with AUC values declining by merely 5.6% from training (0.959) to testing (0.905), demonstrating exceptional generalizability. LR demonstrated superior computational efficiency, achieving 10- to 20-fold faster training times compared to SVM, a particularly advantageous characteristic for high-dimensional data applications such as medical image analysis (24). Moreover, the linear architecture of the model permits the direct interpretation of feature weights, providing clinically meaningful insights that enhance translational utility in medical decision-making. Accordingly, LR balances performance, generalization, and interpretability, making it an optimal radiomics model for identifying APT/PC from PA.
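The 5.6% figure is the relative drop in AUC from training to testing. A short sketch applying the same calculation to the training and test AUCs of every model in Table 2 is given here; the dictionary and function names are illustrative, and the comparison across models is a derived illustration rather than an analysis reported by the study.

```python
# Training and test AUCs copied from Table 2.
table2_auc = {
    "Radiomics-based LR": (0.959, 0.905),
    "1ch_ResNet101": (0.998, 0.804),
    "2ch_ResNet101": (0.998, 0.874),
    "Merged model 1 (XGBoost)": (0.993, 0.926),
    "Merged model 2 (LightGBM)": (0.997, 0.933),
}

def relative_drop(train_auc, test_auc):
    """Relative AUC decline from training to testing, in percent."""
    return 100.0 * (train_auc - test_auc) / train_auc

drops = {name: relative_drop(*aucs) for name, aucs in table2_auc.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}%")
```

By this measure, the radiomics-based LR shows the smallest train-to-test decline and 1ch_ResNet101 the largest.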
ResNet101, a seminal deep residual network, has become a cornerstone of medical image analysis (20). In this study, we implemented two ResNet101 variants: single-channel (1ch_ResNet101) and dual-channel (2ch_ResNet101) architectures. The dual-channel model demonstrated superior classification performance, which can be attributed to the integration of raw images with segmentation masks. This dual-input approach enhances contextual understanding through three synergistic mechanisms: (I) providing complementary discriminative features; (II) increasing data variability to mitigate overfitting; and (III) resolving image ambiguities through spatial correspondence optimization. In contrast, 1ch_ResNet101 offers practical advantages in clinical workflows by enabling autonomous feature learning and reducing the clinician workload. However, both architectures share the inherent limitations of pure DL approaches, such as limited interpretability. Both networks were trained in an end-to-end manner as complete systems; however, they can alternatively serve as hierarchical feature extractors. Specifically, intermediate-layer features can be fused with radiomics descriptors through carefully designed integration pipelines to create hybrid models.
Our study showed ResNet101 has poorer classification and generalizability than radiomics-based models with handcrafted features. Pathologically, malignant parathyroid lesions feature internal heterogeneity (e.g., irregular margins, intratumoral heterogeneity, distinct calcifications vs. benign adenomas), which radiomics captures more sensitively than DL (often focusing on external contours/global textures). Previous studies (25-27) noted that radiomics models have better calibration in specific clinical settings, possibly due to robustness with limited samples and standardized imaging; radiomics also offers superior interpretability (28). Yet DL has shown higher predictive performance in some medical imaging studies (19,29,30), as it autonomously extracts high-dimensional nonlinear patterns from raw data, excelling in complex feature discovery and large-scale multimodal analysis. However, this requires substantial computational resources and rigorously annotated datasets to ensure generalizability.
Fusion models, a dominant precision medicine paradigm (27), optimize performance via synergistic multimodal integration, combining radiomics’ quantifiable stability with DL’s contextual awareness for superior ACC (31). Our two fusion models (Merged models 1 and 2) outperformed the DL models (P<0.001) but matched radiomics (P>0.05). In prior research (19,30,31), integrated models generally demonstrated superior diagnostic efficacy relative to radiomics; the discrepancy with our findings is likely due to differing fusion methods. Our fusion models excelled over radiomics in the training and validation sets but not in the test set, possibly owing to greater noise in real-world test data and biological sample heterogeneity. A meta-analysis (32) comparing radiomics, DL, and multimodal fusion in biomedical studies found fusion models to be superior in 63% of studies, underperforming in 25%, and comparable to single-modality approaches in 13%. This variation reflects the differential feature complementarity and methodological disparities across the study designs.
Notably, fusion is not without disadvantages. The predictive capability of the fusion model is contingent on the global model, creating a hierarchical dependency in which the global model typically requires precise segmentation outputs as prerequisites. This cascading reliance fundamentally undermines the primary strength of deep neural networks, their capacity for automated feature learning through end-to-end optimization. Moreover, the fusion model necessitates the tuning of substantially more trainable parameters. This proliferation of parameters not only increases the computational overhead but also elevates the model’s susceptibility to overfitting, particularly when the training data is limited.
This study had several limitations. First, this retrospective analysis requires validation via prospective multicenter studies. Lack of long-term follow-up for all patients may have missed key outcomes; future research needs larger cohorts with longitudinal assessments. Second, there was a significant class imbalance in the dataset, with 74 cases of APT/PC vs. 1,048 cases of PA, corresponding to a PA-to-APT/PC ratio of 14.2:1. This imbalance reflects the real-world prevalence disparities of parathyroid lesions, but it also risks biasing the algorithms toward the predominant subtype. As a result, the SEN of detecting rare pathologies could be reduced, which constitutes a common challenge in the development of AI-based endocrine imaging systems. Despite balancing strategies (synthetic data, oversampling, feature transformations), residual biases may persist. Third, reliance solely on ultrasound imaging, without serology or molecular profiling, may have diminished prognostic ACC; future iterations should integrate these to enhance utility. Fourth, the study cohort exclusively included patients with pHPT and excluded recurrent cases. While this selection strategy is rational for optimizing the model’s training efficiency and SPE for primary lesions, it restricts the generalizability of our findings to clinical scenarios where recurrent pHPT is encountered. Fifth, semi-automated feature engineering and segmentation limit scalability, with model parameters tied to domain-specific patterns, reducing generalizability to novel datasets. Future work should develop robust feature extraction and explore transfer learning using larger, diverse datasets.
Conclusions
In conclusion, the developed system shows promise for aiding clinical management of parathyroid lesions. Our results confirm that ultrasound-based DL, radiomics, and fusion models can identify APT/PC with acceptable performance. Fusion models outperformed DL models and matched radiomics-based LR. Timely identification of potentially malignant parathyroid tumors and subsequent surgical intervention are clinically important. Future studies should integrate clinical and demographic predictors into the decision-making process.
Acknowledgments
The authors thank Dr. Mengjie Wu from The First Affiliated Hospital of Nanjing Medical University for her assistance in data collection at Hospital 2.
Footnote
Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2652/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2652/dss
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2652/coif). J.H. serves as an unpaid editorial board member of Quantitative Imaging in Medicine and Surgery. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the ethics committee of Nanjing Drum Tower Hospital (No. 2024-611-01). Given the retrospective design using fully anonymized secondary data, the requirement for informed consent was formally waived by the ethics committee. The First Affiliated Hospital of Nanjing Medical University was also informed of and agreed to the study.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Walker MD, Shane E. Hypercalcemia: A Review. JAMA 2022;328:1624-36.
- Erickson LA, Mete O, Juhlin CC, Perren A, Gill AJ. Overview of the 2022 WHO Classification of Parathyroid Tumors. Endocr Pathol 2022;33:64-89.
- Rodrigo JP, Hernandez-Prera JC, Randolph GW, Zafereo ME, Hartl DM, Silver CE, Suárez C, Owen RP, Bradford CR, Mäkitie AA, Shaha AR, Bishop JA, Rinaldo A, Ferlito A. Parathyroid cancer: An update. Cancer Treat Rev 2020;86:102012.
- Bilezikian JP, Cusano NE, Khan AA, Liu JM, Marcocci C, Bandeira F. Primary hyperparathyroidism. Nat Rev Dis Primers 2016;2:16033.
- Zhao L, Liu JM, He XY, Zhao HY, Sun LH, Tao B, Zhang MJ, Chen X, Wang WQ, Ning G. The changing clinical patterns of primary hyperparathyroidism in Chinese patients: data from 2000 to 2010 in a single clinical center. J Clin Endocrinol Metab 2013;98:721-8.
- Liu C, Li M, Li W, Xue H, Zhang Y, Wei S, He J, Yao J, Zhou Z. A retrospective study on a nomogram combining clinical and ultrasound parameters for differentiating solitary parathyroid adenoma from carcinoma or atypical tumors. Front Endocrinol (Lausanne) 2025;16:1538361.
- Marini F, Marcucci G, Giusti F, Arvat E, Benvenga S, Bondanelli M, et al. Parathyroid carcinoma and atypical parathyroid tumor: analysis of an Italian database. Eur J Endocrinol 2024;191:416-25.
- Schulte KM, Talat N. Diagnosis and management of parathyroid cancer. Nat Rev Endocrinol 2012;8:612-22.
- Krupinova JA, Elfimova AR, Rebrova OY, Voronkova IA, Eremkina AK, Kovaleva EV, Maganeva IS, Gorbacheva AM, Bibik EE, Deviatkin AA, Melnichenko GA, Mokrysheva NG. Mathematical model for preoperative differential diagnosis for the parathyroid neoplasms. J Pathol Inform 2022;13:100134.
- Wei CH, Harari A. Parathyroid carcinoma: update and guidelines for management. Curr Treat Options Oncol 2012;13:11-23.
- Campennì A, Ruggeri RM. Early diagnosis of parathyroid carcinoma: A challenging for physicians. Clin Endocrinol (Oxf) 2023;98:273-4.
- Liu R, Xia Y, Chen C, Ye T, Huang X, Ma L, Hu Y, Jiang Y. Ultrasound combined with biochemical parameters can predict parathyroid carcinoma in patients with primary hyperparathyroidism. Endocrine 2019;66:673-81.
- Yan D, Li Q, Lin CW, Shieh JY, Weng WC, Tsui PH. Hybrid QUS Radiomics: A Multimodal-Integrated Quantitative Ultrasound Radiomics for Assessing Ambulatory Function in Duchenne Muscular Dystrophy. IEEE J Biomed Health Inform 2024;28:835-45.
- Liu H, Zou L, Xu N, Shen H, Zhang Y, Wan P, Wen B, Zhang X, He Y, Gui L, Kong W. Deep learning radiomics based prediction of axillary lymph node metastasis in breast cancer. NPJ Breast Cancer 2024;10:22.
- Zhou W, Zhou Y, Zhang X, Huang T, Zhang R, Li D, Xie X, Wang Y, Xu M. Development and Validation of an Explainable Machine Learning Model for Identification of Hyper-Functioning Parathyroid Glands from High-Frequency Ultrasonographic Images. Ultrasound Med Biol 2024;50:1506-14.
- Li X, Yang L, Jiao X. Comparison of Traditional Radiomics, Deep Learning Radiomics and Fusion Methods for Axillary Lymph Node Metastasis Prediction in Breast Cancer. Acad Radiol 2023;30:1281-7.
- Bera K, Braman N, Gupta A, Velcheti V, Madabhushi A. Predicting cancer outcomes with radiomics and artificial intelligence in radiology. Nat Rev Clin Oncol 2022;19:132-46.
- Wang Y, Mao L, Yu MA, Wei Y, Hao C, Dong D. Automatic recognition of parathyroid nodules in ultrasound images based on fused prior pathological knowledge features. IEEE Access 2021;9:69626-34.
- Wang W, Liang H, Zhang Z, Xu C, Wei D, Li W, Qian Y, Zhang L, Liu J, Lei D. Comparing three-dimensional and two-dimensional deep-learning, radiomics, and fusion models for predicting occult lymph node metastasis in laryngeal squamous cell carcinoma based on CT imaging: a multicentre, retrospective, diagnostic study. EClinicalMedicine 2024;67:102385.
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE; 2016:770-8.
- Minisola S, Arnold A, Belaya Z, Brandi ML, Clarke BL, Hannan FM, Hofbauer LC, Insogna KL, Lacroix A, Liberman U, Palermo A, Pepe J, Rizzoli R, Wermers R, Thakker RV. Epidemiology, Pathophysiology, and Genetics of Primary Hyperparathyroidism. J Bone Miner Res 2022;37:2315-29.
- Liao H, Yuan J, Liu C, Zhang J, Yang Y, Liang H, Jiang S, Chen S, Li Y, Liu Y. Feasibility and effectiveness of automatic deep learning network and radiomics models for differentiating tumor stroma ratio in pancreatic ductal adenocarcinoma. Insights Imaging 2023;14:223.
- Apostolopoulos ID, Papandrianos NI, Papageorgiou EI, Apostolopoulos DJ. Artificial Intelligence methods for identifying and localizing abnormal Parathyroid Glands: A review study. Mach Learn Knowl Extr 2022;4:814-26.
- Dong X, Yang J, Zhang B, Li Y, Wang G, Chen J, Wei Y, Zhang H, Chen Q, Jin S, Wang L, He H, Gan M, Ji W. Deep Learning Radiomics Model of Dynamic Contrast-Enhanced MRI for Evaluating Vessels Encapsulating Tumor Clusters and Prognosis in Hepatocellular Carcinoma. J Magn Reson Imaging 2024;59:108-19.
- Sun K, Wang Y, Shi R, Wu S, Wang X. An ensemble machine learning model assists in the diagnosis of gastric ectopic pancreas and gastric stromal tumors. Insights Imaging 2024;15:225.
- Xia X, Gong J, Hao W, Yang T, Lin Y, Wang S, Peng W. Comparison and Fusion of Deep Learning and Radiomics Features of Ground-Glass Nodules to Predict the Invasiveness Risk of Stage-I Lung Adenocarcinomas in CT Scan. Front Oncol 2020;10:418.
- Gan Y, Hu Q, Shen Q, Lin P, Qian Q, Zhuo M, Xue E, Chen Z. Comparison of Intratumoral and Peritumoral Deep Learning, Radiomics, and Fusion Models for Predicting KRAS Gene Mutations in Rectal Cancer Based on Endorectal Ultrasound Imaging. Ann Surg Oncol 2025;32:3019-30.
- Xiang Y, Dong X, Zeng C, Liu J, Liu H, Hu X, Feng J, Du S, Wang J, Han Y, Luo Q, Chen S, Li Y. Clinical Variables, Deep Learning and Radiomics Features Help Predict the Prognosis of Adult Anti-N-methyl-D-aspartate Receptor Encephalitis Early: A Two-Center Study in Southwest China. Front Immunol 2022;13:913703.
- Yang Y, Han K, Xu Z, Cai Z, Zhao H, Hong J, Pan J, Guo L, Huang W, Hu Q, Xu Z. Development and Validation of Multiparametric MRI-based Interpretable Deep Learning Radiomics Fusion Model for Predicting Lymph Node Metastasis and Prognosis in Rectal Cancer: A Two-center Study. Acad Radiol 2025;32:2642-54.
- Li W, Li Y, Wang L, Yang M, Iikubo M, Huang N, Kojima I, Ye Y, Zhao R, Dong B, Chen J, Liu Y. Evaluating fusion models for predicting occult lymph node metastasis in tongue squamous cell carcinoma. Eur Radiol 2025;35:5228-38.
- Xia C, Zuo M, Lin Z, Deng L, Rao Y, Chen W, Chen J, Yao W, Hu M. Multimodal Deep Learning Fusing Clinical and Radiomics Scores for Prediction of Early-Stage Lung Adenocarcinoma Lymph Node Metastasis. Acad Radiol 2025;32:2977-89.
- Demircioğlu A. Are deep models in radiomics performing better than generic models? A systematic review. Eur Radiol Exp 2023;7:11.

