Diagnosis of thyroid nodules using ultrasound images based on deep learning features: online dynamic nomogram and gradient-weighted class activation mapping
Introduction
Thyroid nodules (TNs) are highly prevalent in clinical practice, affecting about 68% of the general population (1). The incidence of TNs, including malignant nodules, has been steadily increasing (2). Although most TNs are benign and progress slowly, some behave aggressively; for example, papillary thyroid carcinoma may already have lymph node metastasis at the time of diagnosis (3,4). Thyroid cancer accounts for about 5% of all TNs and is the most common endocrine malignancy, contributing approximately 2.1% of all cancer diagnoses worldwide (5,6). Once diagnosed, some thyroid cancers require surgical treatment (7). Therefore, accurate diagnosis of TNs is crucial.
The role of ultrasound (US) imaging as a non-invasive diagnostic tool in TN screening is well-established (8). The American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) guidelines emphasize the importance of US evaluation for TN diagnosis and risk stratification (9). However, recent studies have shown that although the guidelines provide useful diagnostic criteria, less experienced radiologists still face challenges in applying them effectively, with diagnostic accuracy often lower than that of senior radiologists (10,11). Fine-needle aspiration (FNA) biopsy is widely used to improve the accuracy of TN evaluation (12). However, this method has drawbacks: it is invasive and requires skilled operators (13,14). It is therefore necessary to explore whether more effective methods can improve diagnostic accuracy and help less experienced radiologists.
Artificial intelligence (AI) is a branch of computer science that includes machine learning (ML), deep learning (DL), transfer learning (TL), and convolutional neural networks (15). In recent years, AI technologies have contributed to earlier detection and diagnosis, reducing medical errors, healthcare costs, and both morbidity and mortality rates (16). AI techniques extract features from medical images that are related to clinical outcomes and biological endpoints (17). Specifically, DL technologies have made significant progress in disease diagnosis, prediction, and prognosis (18). US combined with DL (USDL) techniques can better identify potential disease risk factors, thus improving diagnostic accuracy and precision (15).
Training a deep neural network model and optimizing its parameters typically requires significant computational resources and time. TL is a machine learning technique aimed at addressing data scarcity by leveraging knowledge contained in related datasets (19). By transferring knowledge from a source task to a target task, TL improves model performance while reducing the training time and the amount of labeled data required. In recent years, numerous studies have combined TL with convolutional neural networks, achieving favorable outcomes in medical image analysis with models pre-trained on the non-medical ImageNet dataset (20). However, because features extracted by DL lack interpretability, it is often difficult to understand which areas the model focuses on in disease classification. Gradient-weighted class activation mapping (Grad-CAM) is employed to generate heatmaps, enabling the identification of which regions in the US images are emphasized by the DL model, thus highlighting the areas crucial for prediction. Additionally, to explore the prediction decision for a specific sample in the comprehensive model, SHapley Additive Explanations (SHAP) plots are used to explain the impact of each feature on the decision-making process.
Therefore, this study aimed to develop a DL model that analyzes US images, generates Grad-CAM-based heatmaps and malignancy probabilities, and integrates these outputs with ACR TI-RADS features to enhance diagnostic accuracy. We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-159/rc).
Methods
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This multicenter study received centralized ethical approval from The First Affiliated Hospital of Anhui Medical University (No. PJ 2023-07-11); the Affiliated Hospital of Integration Chinese and Western Medicine with Nanjing University of Traditional Chinese Medicine was also informed and agreed to the conduct of the study. The requirement for individual consent for this retrospective analysis was waived.
Patients and grouping
This study is a retrospective analysis, and the image data used to develop and validate the DL model were obtained from hospitals in Sétif, Algeria. A total of 1,501 US images of TNs were collected from the Kaggle platform (https://www.kaggle.com/datasets), including 796 benign and 705 malignant nodules. The Algerian dataset was randomly divided into a training set (1,051 cases) and a validation set (450 cases) using a 7:3 ratio. Additionally, retrospective data from two hospitals were collected as an external test set: 331 patients from Anhui Medical University First Affiliated Hospital between January 2022 and September 2023, and 210 patients from the Affiliated Hospital of Integration Chinese and Western Medicine with Nanjing University of Traditional Chinese Medicine between January 2021 and June 2022. The external test set consisted of 541 patients, including 269 with benign and 272 with malignant nodules. The inclusion criteria were as follows: (I) diagnosis of malignancy or benignity confirmed by FNA or surgery at affiliated hospitals; (II) complete US and pathological data; and (III) clear US image quality. The exclusion criteria were as follows: (I) patients who had undergone biopsy, microwave ablation, or surgery before US examination; (II) patients with other malignant tumors; (III) poor image quality; and (IV) duplicate images or normal thyroid images from public databases. The detailed inclusion and exclusion criteria for recruiting patients are provided in Figure S1.
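As a rough illustration of the 7:3 random division described above, the following sketch shuffles image indices and splits them. The function name `split_indices` and the seed value are hypothetical choices made for the example; the authors do not state how the random split was implemented.

```python
import random


def split_indices(n_items, train_fraction=0.7, seed=0):
    """Shuffle item indices and split them into training and validation lists."""
    indices = list(range(n_items))
    rng = random.Random(seed)  # the seed is an illustrative assumption
    rng.shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# 1,501 images in the public dataset, split 7:3
train_idx, val_idx = split_indices(1501)
print(len(train_idx), len(val_idx))  # prints "1051 450"
```

With 1,501 images, rounding 70% gives 1,051 training cases and leaves 450 validation cases, matching the set sizes reported in the text.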
Clinical data and US images
Clinical and pathological information such as gender, age, the maximum diameter of TNs, location, and pathological results were collected from hospital patients. US images of TNs were acquired using Mindray Resona 7S (Mindray, Shenzhen, China) and Samsung RS80A (Samsung, Suwon, South Korea) US devices. Video images and cross-sectional images of the nodule’s maximum diameter were retained during the examination and stored in the Picture Archiving and Communication System (PACS) for data retrieval by researchers and clinicians. The computing environment was configured with Windows 11 (Microsoft, Redmond, WA, USA), Intel (R) Core (TM) i5-12500H CPU, NVIDIA RTX 3060 GPU (NVIDIA, Santa Clara, CA, USA), and 6 GB RAM.
To minimize unnecessary information and protect patient privacy, the 541 images were cropped to remove background borders and saved as JPG files. Since the public dataset used does not contain clinical information and only includes images of benign and malignant nodules with classifications, the test set data were aligned with the platform’s format, including only the pathological classification, US images, and TI-RADS scores. The ACR TI-RADS scoring system includes features such as composition, echogenicity, shape, margin, and echogenic foci. All images were scored by radiologists with 3 years of experience. The shape scoring item is another expression of the aspect ratio, which was measured using the open-source software ITK-SNAP (http://www.itksnap.org). All images were independently evaluated by two radiologists blinded to pathological results, and discrepancies were resolved through consensus to ensure data reliability.
DL models construction and training
Five DL models were constructed: MobileNet-V2, DenseNet201, ResNet50, VGG19, and Xception. The models were developed in the Python (http://www.python.org) environment using TensorFlow version 2.4.0 and Keras version 2.4.3. ImageNet pre-trained models were utilized as the source model to enhance performance and generalization capability. ImageNet is a large-scale hierarchical image database that aids models in learning useful features and knowledge (21). The pre-trained weights of the source model were frozen, with only the final fully connected layer replaced by a new layer with random weights, which was the only layer trained. Figure 1 illustrates the TL process using MobileNet-V2 as an example.
For image preprocessing, training set images were resized to 224×224 and underwent data augmentation, including image scaling, translation, normalization, and contrast changes to prevent overfitting. The Adam optimizer was used, with sparse categorical cross-entropy as the loss function. A learning rate of 0.0001 and 100 iterations were implemented during training. The processed US images were input into the five models for training, and the TL model was saved. The same preprocessing method was applied to test set images, which were input into the trained models for prediction. The model outputs a malignancy score for each patient, which was used to evaluate the performance of the five models on the training, validation, and test sets, ultimately selecting the best DL model. To explore the interpretability of the DL model, Grad-CAM was used to visualize the feature heatmaps for each patient and observe the areas of interest that the model focused on.
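To make the loss function concrete, the sketch below computes a softmax over two class outputs and the sparse categorical cross-entropy for an integer label, using only the Python standard library. It is a minimal illustration, not the authors' training code: the logit values are hypothetical, and treating class index 1 as "malignant" and its softmax probability as the malignancy score is an assumption.

```python
import math


def softmax(logits):
    """Convert raw class outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(logits, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(softmax(logits)[label])


# Hypothetical outputs for one image, ordered as [benign, malignant]
logits = [0.2, 1.1]
probs = softmax(logits)
dl_score = probs[1]  # assumed: malignancy score = probability of class 1
loss = sparse_categorical_cross_entropy(logits, label=1)
print(f"malignancy score = {dl_score:.3f}, loss = {loss:.3f}")
```

The "sparse" variant takes the class index directly rather than a one-hot vector, which is convenient for a two-class benign/malignant task.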
Statistical analysis
Statistical analysis was performed using Python 3.6.2 (Python Software Foundation, Wilmington, DE, USA) and SPSS 24.0 (IBM Corp., Armonk, NY, USA), with receiver operating characteristic (ROC) curve plotting and analysis conducted using MedCalc (version 20.1; MedCalc Software, Ostend, Belgium). All statistical tests were two-sided, and a P value of less than 0.05 was considered statistically significant. Estimates are reported with 95% confidence intervals (CIs).
Model construction
Two radiologists reviewed all US images and used the TI-RADS guidelines to construct the US model. Additionally, to improve the application and interpretability of the DL model, Grad-CAM was used to generate heatmaps for the best DL model on each US image from the test set. The two radiologists, blinded to the pathological results, re-evaluated the test set images based on the heatmaps and DL prediction scores (DL-scores), reclassifying each patient’s image. The Net Reclassification Index (NRI) was used to assess whether reclassification improved diagnostic performance.
Univariate analysis was performed on all variables in the training set. Categorical variables were tested using Chi-squared tests, and continuous variables were tested using t-tests to identify risk factors associated with benign and malignant nodules. The best DL model was selected, and its DL-score output was used as a candidate variable. A multivariate logistic regression analysis was then conducted on the DL-score and the independent risk factors from TI-RADS to identify variables that are meaningful for distinguishing benign and malignant nodules. Based on the variables selected by logistic regression in the training set, the integrated USDL model was built. To observe the impact of each feature on the model’s average prediction, SHAP was used to generate feature importance plots. In addition, to help doctors calculate the probability of malignant TNs, we visualized the integrated model as an online dynamic nomogram (https://webnomogram.shinyapps.io/My_DynNom/).
Model evaluation
The calibration of the USDL model for the training, validation, and test sets was evaluated using calibration curves and the Hosmer-Lemeshow test. The area under the ROC curve (AUC) was used to estimate model performance, and DeLong’s test was used to compare AUC differences between different models. Decision curve analysis (DCA) was employed to evaluate the clinical utility of the USDL model at different thresholds.
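Two of the evaluation quantities named above can be written in a few lines. The sketch below computes the AUC as the fraction of malignant/benign pairs ranked correctly, and the net benefit used in DCA, defined as TP/n − (FP/n) × pt/(1 − pt) at threshold probability pt. The scores and labels are toy values invented for illustration, not study data, and the study itself used MedCalc rather than code like this.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def net_benefit(scores, labels, threshold):
    """Decision-curve net benefit when treating every case with score >= threshold."""
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy predicted probabilities and labels (1 = malignant, 0 = benign)
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_nb = net_benefit(toy_scores, toy_labels, threshold=0.5)
print(f"AUC = {toy_auc:.3f}, net benefit at 0.5 = {toy_nb:.3f}")
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing it with the "treat all" and "treat none" strategies.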
Results
Baseline characteristics
The information for the training, validation, and external test sets is summarized in Table 1. Malignant TNs accounted for 47.6% (500/1,051) of the training set and 45.6% (205/450) of the validation set. There were no significant statistical differences between the training and validation cohorts for any of the features. Among the five DL models, MobileNet-V2 was selected as the best model for outputting heatmaps and DL-scores because it demonstrated the best predictive performance in the training set, which was also confirmed in the validation and test sets (Figure 2). Detailed statistical indicators are presented in Table S1.
Table 1
| Characteristics | Training dataset (n=1,051) | Validation dataset (n=450) | P value | Test dataset (n=541) |
|---|---|---|---|---|
| Pathology | | | 0.47 | |
| Benign | 551 (52.4) | 245 (54.4) | | 269 (49.7) |
| Malignant | 500 (47.6) | 205 (45.6) | | 272 (50.3) |
| TI-RADS | | | 0.23 | |
| Level 1 | 47 (4.5) | 20 (4.4) | | 9 (1.7) |
| Level 2 | 54 (5.1) | 41 (9.1) | | 7 (1.3) |
| Level 3 | 173 (16.5) | 66 (14.7) | | 29 (5.4) |
| Level 4 | 362 (34.4) | 151 (33.6) | | 213 (39.4) |
| Level 5 | 415 (39.5) | 172 (38.2) | | 283 (52.3) |
| Composition | | | 0.32 | |
| Cystic or spongiform | 37 (3.5) | 17 (3.8) | | 13 (2.4) |
| Mixed cystic and solid | 124 (11.8) | 63 (14.0) | | 26 (4.8) |
| Solid | 890 (84.7) | 370 (82.2) | | 502 (92.8) |
| Echogenicity | | | 0.90 | |
| Anechoic | 69 (6.6) | 28 (6.2) | | 14 (2.6) |
| Hyperechoic or isoechoic | 378 (36.0) | 163 (36.2) | | 91 (16.8) |
| Hypoechoic | 604 (57.5) | 259 (57.6) | | 435 (80.4) |
| Very hypoechoic | 0 | 0 | | 1 (0.2) |
| Shape | | | 0.16 | |
| Wider-than-tall | 806 (76.7) | 360 (80.0) | | 423 (78.2) |
| Taller-than-wide | 245 (23.3) | 90 (20.0) | | 118 (21.8) |
| Margin | | | 0.12 | |
| Smooth or ill-defined | 621 (59.1) | 245 (54.4) | | 212 (39.2) |
| Lobulated or irregular | 364 (34.6) | 176 (39.1) | | 297 (54.9) |
| Extra-thyroidal extension | 66 (6.3) | 29 (6.4) | | 32 (5.9) |
| Echogenic foci | | | 0.08 | |
| None or large comet-tail artifacts | 642 (61.1) | 304 (67.6) | | 276 (51.0) |
| Macrocalcifications | 71 (6.8) | 16 (3.6) | | 55 (10.2) |
| Peripheral (rim) calcifications | 43 (4.1) | 18 (4.0) | | 21 (3.9) |
| Punctate echogenic foci | 295 (28.1) | 112 (24.9) | | 189 (34.9) |
Data are presented as n (%). P values correspond to comparisons between training and validation cohorts. TI-RADS, Thyroid Imaging, Reporting and Data System; TNs, thyroid nodules.
Six features with significant discriminative power between benign and malignant nodules were identified through univariate analysis in the training set: composition, echogenicity, shape, margin, echogenic foci, and DL-score (Table 2). Based on the risk features identified from the univariate analysis, a multivariate logistic regression analysis was performed to construct the integrated USDL model. Notably, the margin feature did not show significance in the regression analysis and was excluded (P=0.508).
Table 2
| Characteristics | Training dataset, malignant (n=500) | Training dataset, benign (n=551) | P value |
|---|---|---|---|
| Composition | | | <0.001* |
| Cystic or spongiform | 0 | 37 (6.7) | |
| Mixed cystic and solid | 12 (2.4) | 112 (20.3) | |
| Solid | 488 (97.6) | 402 (73.0) | |
| Echogenicity | | | <0.001* |
| Anechoic | 2 (0.4) | 67 (12.2) | |
| Hyperechoic or isoechoic | 96 (19.2) | 282 (51.2) | |
| Hypoechoic | 402 (80.4) | 202 (36.7) | |
| Very hypoechoic | 0 | 0 | |
| Shape | | | 0.034* |
| Wider-than-tall | 347 (69.4) | 459 (83.3) | |
| Taller-than-wide | 153 (30.6) | 92 (16.7) | |
| Margin | | | 0.029* |
| Smooth or ill-defined | 227 (45.4) | 394 (71.5) | |
| Lobulated or irregular | 225 (45.0) | 139 (25.2) | |
| Extra-thyroidal extension | 48 (9.6) | 18 (3.3) | |
| Echogenic foci | | | <0.001* |
| None or large comet-tail artifacts | 224 (44.8) | 418 (75.9) | |
| Macrocalcifications | 35 (7.0) | 36 (6.5) | |
| Peripheral (rim) calcifications | 23 (4.6) | 20 (3.6) | |
| Punctate echogenic foci | 218 (43.6) | 77 (14.0) | |
| DL-score | 0.56 (0.45–0.71) | 0.33 (0.25–0.43) | <0.001* |
Data are presented as n (%) or median (interquartile range). *, P<0.05. DL-score, deep learning prediction score; TN, thyroid nodule.
To evaluate whether the DL model can provide auxiliary diagnostic support, the radiologists performed a second diagnostic classification of the test set after reviewing the heatmaps and predicted probabilities output by the best DL model. Compared with the first classification, the NRI showed an improvement of 12.50% in the classification of malignant nodules and 20.45% in the classification of benign nodules, for an overall reclassification improvement of 32.95% (Table 3). This improvement may help patients avoid unnecessary FNA or surgery.
Table 3
| Diagnosis | First classification, malignant (n=272) | First classification, benign (n=269) | After using the DL model, malignant (n=272) | After using the DL model, benign (n=269) |
|---|---|---|---|---|
| TI-RADS | | | | |
| Level 1 | 0 | 7 (2.6) | 0 | 7 (2.6) |
| Level 2 | 0 | 9 (3.3) | 0 | 9 (3.3) |
| Level 3 | 1 (0.4) | 28 (10.4) | 0 | 57 (21.2) |
| Level 4 | 72 (26.5) | 141 (52.4) | 39 (14.3) | 138 (51.3) |
| Level 5 | 199 (73.2) | 84 (31.2) | 233 (85.7) | 58 (21.6) |
| NRI (%) | | | | |
| Malignant | 12.50 | | | |
| Benign | 20.45 | | | |
| Overall | 32.95 | | | |
Data are presented as n (%) or percentage improvements. The NRI values represent the improvement in diagnostic classification after incorporating the DL model compared to initial assessment. DL, deep learning; NRI, Net Reclassification Index; TI-RADS, Thyroid Imaging Reporting and Data System.
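The arithmetic behind the NRI can be sketched as follows. The formula is the standard categorical NRI: the net proportion of malignant cases moved to a higher risk category plus the net proportion of benign cases moved to a lower one. The counts passed in (34 net upward moves among 272 malignant nodules, 55 net downward moves among 269 benign nodules) are back-calculated so that they reproduce the reported percentages; they are an assumption, since the exact counting rule used by the authors is not stated.

```python
def net_reclassification_index(up_events, down_events, n_events,
                               up_nonevents, down_nonevents, n_nonevents):
    """Return (event NRI, non-event NRI, overall NRI) as proportions."""
    event_nri = (up_events - down_events) / n_events
    nonevent_nri = (down_nonevents - up_nonevents) / n_nonevents
    return event_nri, nonevent_nri, event_nri + nonevent_nri


# Assumed counts, chosen to be consistent with the percentages in Table 3
malignant_nri, benign_nri, overall_nri = net_reclassification_index(
    up_events=34, down_events=0, n_events=272,
    up_nonevents=0, down_nonevents=55, n_nonevents=269,
)
print(f"{malignant_nri:.2%} {benign_nri:.2%} {overall_nri:.2%}")
# prints "12.50% 20.45% 32.95%"
```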
Establishment and evaluation of the integrated model
Multivariate logistic regression analysis showed that five variables were significantly associated with the differentiation of TNs: composition, echogenicity, shape, echogenic foci, and DL-score (Table 4). The integrated USDL model was built using these five risk factors, and SHAP was used to visualize the model. The SHAP values reflect the influence of each feature on the model, as well as the direction (positive or negative) of their impact (Figure 3A). The calibration curves in Figure 3B show the calibration performance of the USDL model in predicting TNs for the training, validation, and independent test sets, with Hosmer-Lemeshow P values indicating good calibration (training set: 0.903, validation set: 0.814, test set: 0.210). Figure 3C demonstrates the interface of the online nomogram.
Table 4
| Intercept and variable | US model, β | US model, odds ratio (95% CI) | US model, P value | USDL model, β | USDL model, odds ratio (95% CI) | USDL model, P value |
|---|---|---|---|---|---|---|
| Intercept | −5.050 | | | −10.224 | | |
| Composition | 1.294 | 3.648 (1.907–6.978) | <0.001* | 1.414 | 4.056 (1.784–9.223) | <0.001* |
| Echogenicity | 1.191 | 3.291 (2.390–4.532) | <0.001* | 1.267 | 3.441 (2.285–5.180) | <0.001* |
| Shape | 1.493 | 1.130 (1.009–1.266) | 0.034* | 0.164 | 1.173 (1.018–1.351) | 0.022* |
| Echogenic foci | 0.420 | 1.522 (1.363–1.701) | <0.001* | 0.450 | 1.555 (1.351–1.791) | <0.001* |
| Margin | 0.154 | 1.166 (1.016–1.339) | 0.029* | 0.060 | 1.062 (0.889–1.268) | 0.508 |
| DL-score | NA | NA | NA | 10.828 | 2.947 (2.541–3.418) | <0.001* |
In the multivariate logistic regression for the USDL model, the P value of the margin variable was >0.05, so margin was excluded. *, P<0.05. CI, confidence interval; DL-score, deep learning prediction score; NA, not applicable; US, ultrasound; USDL, ultrasound combined with deep learning.
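A logistic regression model of this kind converts a weighted sum of predictors into a probability, p = 1 / (1 + exp(−(β0 + Σ βi·xi))), which is the calculation a dynamic nomogram performs. The sketch below illustrates it with the USDL β values copied from Table 4, leaving out margin. Several things here are assumptions: the coefficients of the final model may have been refit after margin was dropped, the predictor coding (ACR TI-RADS points per category) is not stated in the text, and the example nodule is hypothetical. The output should therefore not be read as the nomogram's actual result.

```python
import math

# β values copied from the USDL columns of Table 4 (margin omitted).
# Assumption: the deployed nomogram may use refit coefficients.
INTERCEPT = -10.224
COEFFICIENTS = {
    "composition": 1.414,
    "echogenicity": 1.267,
    "shape": 0.164,
    "echogenic_foci": 0.450,
    "dl_score": 10.828,
}


def malignancy_probability(features):
    """Logistic model: probability = 1 / (1 + exp(-(intercept + sum(beta * x))))."""
    logit = INTERCEPT + sum(beta * features[name] for name, beta in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical nodule; coding predictors as TI-RADS points is an assumption
nodule = {"composition": 2, "echogenicity": 2, "shape": 3,
          "echogenic_foci": 3, "dl_score": 0.56}
same_nodule_lower_dl = dict(nodule, dl_score=0.33)

p_high = malignancy_probability(nodule)
p_low = malignancy_probability(same_nodule_lower_dl)
print(f"{p_high:.3f} vs {p_low:.3f}")
```

Holding the TI-RADS features fixed and lowering only the DL-score lowers the predicted probability, which is the direction of effect implied by the positive DL-score coefficient.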
Table 5 shows the performance of US model and USDL model in predicting the malignancy of TNs. The AUC values for the USDL model in the training set, validation set, and test set were 0.922 (95% CI: 0.904–0.938), 0.947 (95% CI: 0.922–0.966), and 0.907 (95% CI: 0.880–0.931), respectively, indicating excellent discriminatory ability across all datasets. In the training set (Figure 4A), the comparison of AUC values for the USDL model vs. the US model was 0.922 vs. 0.802 (P<0.001), and the USDL model vs. the DL model was 0.922 vs. 0.868 (P<0.001). In the validation set (Figure 4B), the comparison of AUC values for the USDL model vs. the US model was 0.947 vs. 0.799 (P<0.001), and the USDL model vs. the DL model was 0.947 vs. 0.919 (P<0.001). In the test set (Figure 4C), the comparison of AUC values for the USDL model vs. the US model was 0.907 vs. 0.787 (P<0.001), and the USDL model vs. the DL model was 0.907 vs. 0.875 (P<0.001). These results demonstrate that the USDL diagnostic model outperforms both the US and DL models in terms of discriminatory power (Table S2). Figure 4D shows the DCA curves for all patients under different models. At various threshold values, the USDL model consistently outperformed the US and DL models for diagnosing TNs.
Table 5
| Variables | US model, training cohort | US model, validation cohort | US model, test cohort | USDL model, training cohort | USDL model, validation cohort | USDL model, test cohort |
|---|---|---|---|---|---|---|
| AUC (95% CI) | 0.802 (0.777–0.826) | 0.799 (0.759–0.835) | 0.787 (0.750–0.821) | 0.922 (0.904–0.938) | 0.947 (0.922–0.966) | 0.907 (0.880–0.931) |
| SEN (%) | 72.0 | 71.7 | 72.4 | 82.0 | 83.9 | 80.1 |
| SPE (%) | 71.5 | 69.0 | 69.9 | 85.8 | 88.2 | 82.2 |
| PPV (%) | 69.6 | 65.9 | 70.9 | 84.0 | 85.6 | 82.0 |
| NPV (%) | 73.8 | 74.4 | 71.5 | 84.0 | 86.7 | 80.4 |
| ACC (%) | 71.7 | 70.2 | 71.2 | 84.0 | 86.2 | 81.1 |
| F1-score | 0.708 | 0.687 | 0.716 | 0.830 | 0.847 | 0.810 |
ACC, accuracy; AUC, area under the receiver operating characteristic curve; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value; SEN, sensitivity; SPE, specificity; TN, thyroid nodule; US, ultrasound; USDL, ultrasound combined with deep learning.
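The threshold-based metrics in Table 5 all follow from a 2×2 confusion matrix. As a worked check, the sketch below uses counts back-calculated for the USDL model in the training cohort (500 malignant, 551 benign). The confusion matrix itself is not reported, so TP=410, FN=90, TN=473, and FP=78 are assumptions chosen to be consistent with the published percentages.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Assumed counts for the USDL model, training cohort (not reported in the paper)
metrics = diagnostic_metrics(tp=410, fn=90, tn=473, fp=78)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the rounded values agree with the training-cohort column for the USDL model (SEN 82.0%, SPE 85.8%, PPV 84.0%, NPV 84.0%, ACC 84.0%, F1-score 0.830).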
Interpretability of the DL model
To explore how the DL model classifies TNs, we visualized the model’s decision-making process using Grad-CAM. The visualized heatmaps indicate the areas of interest for the model when predicting malignant TNs. Each case produces four images: the original image, heatmap, heatmap with a color bar, and an overlay image. The color scale next to the heatmap indicates malignancy values, with areas closer to red signifying higher malignancy and areas closer to blue representing lower malignancy. Regions with no color represent areas not focused on by the model. Examples of US images and heatmaps for TNs are shown in Figure 5.
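The core Grad-CAM computation can be sketched without a deep learning framework. Each feature map of the last convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map; the weighted maps are summed and passed through a ReLU. The tiny 2×2 feature maps and gradients below are invented for illustration, and the final scaling to the range 0 to 1 is a common visualization step assumed here rather than taken from the text. In practice the activations and gradients would come from the trained network.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients (each H x W lists)."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(feature_maps, gradients):
        # Channel weight: global average of the gradient over all positions
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions that push the class score upward
    heatmap = [[max(0.0, value) for value in row] for row in heatmap]
    # Assumed visualization step: scale so the strongest response equals 1
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Two toy 2x2 feature maps with made-up gradients
toy_maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
toy_grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_maps, toy_grads)
print(toy_heatmap)  # prints "[[0.5, 0.0], [0.0, 1.0]]"
```

The second toy channel has a negative weight, so the positions where it is active are suppressed by the ReLU, which mirrors how uncolored regions arise in the heatmaps.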
Discussion
In this study, we retrospectively collected two sets of data: 1,501 thyroid US images from Algeria, which were divided into training and validation sets, and 541 US images from TN patients in two affiliated hospitals, which served as an external test set. In the training set, univariate analysis was performed on the TI-RADS scoring items and DL-score to identify important variables, followed by multivariate logistic regression to establish an integrated diagnostic USDL model. SHAP was used to visualize the model, providing insight into the influence of each variable on nodule classification.
Thyroid cancer diagnosis requires accurate assessment of TNs by clinicians. However, the assessment of TN features on US images is often subjective and experience-dependent (10). For instance, the “margin” feature in the ACR TI-RADS guideline presents difficulties in distinguishing blurred from irregular margins, particularly for less experienced clinicians. DL can be utilized as a method for quantitative feature extraction, facilitating the development of robust classifier models that enhance the effectiveness of cancer screening and early detection processes (17,22). Therefore, DL has advantages in objectivity, as the features learned from thyroid US images are not constrained by the guideline features used by radiologists (23). Specifically, DL can enhance the diagnostic accuracy for thyroid diseases and better inform clinical decision-making (24).
Moreover, the integration of multimodal DL and interpretability can uncover new insights. Radiomics has been expanded beyond the exclusive focus on imaging features to include the incorporation of diverse clinical factors, thereby enabling a more personalized and comprehensive approach to patient diagnosis and treatment (25). For the successful implementation of DL in routine patient care, the clinical validation of explainable DL methods is considered essential (26). In our study, when constructing the multivariate logistic regression model, the “margin” feature was excluded due to its lack of significance, likely because it was highly correlated with the DL-score variable. Grad-CAM heatmap analysis suggested that the DL model uses margin features as an important diagnostic factor when evaluating TNs. For example, the US image shown in Figure 5A displays a nodule with well-defined margins and slightly hypoechoic appearance, whereas the US image in Figure 5B depicts a nodule with irregular margins and extrathyroidal invasion. Figure 5C shows a nodule with a clear but lobulated margin that is isoechoic. The heatmap areas of the cases in Figure 5A-5C align with the margin features described in the TI-RADS guideline. This suggests that the USDL model alleviates, to some extent, the difficulty radiologists face in defining the nodule’s margin.
In this study, MobileNet-V2 was identified as the optimal pretrained model for image prediction, likely due to its adaptation to the dataset. The MobileNet-V2 model was originally developed to achieve higher accuracy with fewer parameters, which allows the AI system to be integrated into US equipment to assist doctors in speeding up the decision-making process (27). Pramanik et al. achieved an accuracy of 93.92% using the MobileNet-V2 model for cervical cancer detection (28). Zhao et al. also achieved the best performance using MobileNet-V2 in their study on differentiating subtypes of non-small cell lung cancer (29). Anilkumar et al. reported 100% accuracy in leukemia detection using various TL models, including MobileNet-V2 (30). The MobileNet-V2 model used in this study demonstrated reduced computational load and training time, achieving excellent diagnostic performance. If integrated into portable US devices, this system could provide flexible monitoring of disease development and progress, enhancing the ability of radiologists or primary care doctors to manage high-risk thyroid cancer populations. However, the applicability of such a system requires further prospective studies.
Currently, to reduce the risk of bias in DL system performance evaluation, it is recommended to assess the system using external cohorts (31). Therefore, in this study, an independent external test set was used to evaluate the applicability of the deep TL model. Previous studies have used traditional feature extraction methods or DL techniques for distinguishing benign and malignant TNs in US images. For instance, Xia et al. applied an extreme learning machine to features extracted from US images and achieved an accuracy of 87% in distinguishing benign and malignant nodules (32). ML modeling often requires more accurate image segmentation, which is typically performed manually. Du et al. utilized DL combined with radiomics methods to extract features from US images. The integrated model achieved AUCs of 0.947, 0.917, and 0.929 in differentiating benign and malignant TNs across different cohorts, respectively (11). Although these studies show promising results in distinguishing benign and malignant nodules, they are limited by small sample sizes and a lack of multicenter data.
This study also has certain limitations. First, because the cases from Algeria did not include data on gender, age, or specific pathological results, we did not perform statistical analysis on the clinical information of patients, in order to keep the datasets consistent. This may have led to the omission of some clinical factors. Second, in clinical practice, surgical resection for histological examination is not routinely performed on nodules with benign characteristics, except for larger benign nodules causing compression symptoms or hemorrhage that require intervention. Consequently, the pathological results for some TN cases in our external test set were obtained through FNA, which may have introduced selection bias and data imbalance. Future multicenter studies with expanded datasets are recommended to enhance model generalizability. Third, although all participating radiologists in our study had completed specialized training in thyroid imaging, their 3 years of clinical experience may not represent the diagnostic capabilities of more experienced practitioners. Therefore, the observed diagnostic performance improvements may not be generalizable to all clinicians. Finally, the TI-RADS scoring and image processing in this study were based on retrospectively collected static US images, without comprehensive evaluation of the entire lesion. This methodological approach may introduce bias in the interpretation of diagnostic outcomes.
In conclusion, this study has established a comprehensive model based on US imaging features and developed an interactive web-based platform for predicting the malignancy risk of TNs.
Conclusions
The online dynamic nomogram developed in this study, based on US imaging characteristics, enables radiologists to calculate the malignant probability of TNs in real time and provides visual decision support, offering a novel solution for the precise diagnosis of TNs. Furthermore, the malignant probability output by the explainable DL model, together with the heatmaps generated by Grad-CAM, demonstrated superior performance compared with junior radiologists using the ACR TI-RADS guidelines alone. This approach contributes to enhancing the diagnostic capability for differentiating between benign and malignant TNs.
Acknowledgments
We would like to thank Mi Ao and Kaggle for their help and data support in this study.
Footnote
Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-159/rc
Funding: This study was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-159/coif). W.L. reports receiving a grant from the Postgraduate Innovation Research and Practice Program of Anhui Medical University (No. YJS20240028) during the conduct of this study. C.Z. reports receiving grants from the Anhui Provincial Natural Science Foundation (No. 2308085MH278) and the Health Research Program of Anhui (No. AHWJ2023A10017) during the conduct of this study. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This multicenter study received centralized ethical approval from the First Affiliated Hospital of Anhui Medical University (No. PJ 2023-07-11), and the Affiliated Hospital of Integration Chinese and Western Medicine with Nanjing University of Traditional Chinese Medicine was also informed and agreed to the conduct of the study. The requirement for individual consent for this retrospective analysis was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Guth S, Theune U, Aberle J, Galach A, Bamberger CM. Very high prevalence of thyroid nodules detected by high frequency (13 MHz) ultrasound examination. Eur J Clin Invest 2009;39:699-706.
- Kitahara CM, Sosa JA. The changing incidence of thyroid cancer. Nat Rev Endocrinol 2016;12:646-53.
- Kilfoy BA, Zheng T, Holford TR, Han X, Ward MH, Sjodin A, Zhang Y, Bai Y, Zhu C, Guo GL, Rothman N, Zhang Y. International patterns and trends in thyroid cancer incidence, 1973-2002. Cancer Causes Control 2009;20:525-31.
- Kim SK, Chai YJ, Park I, Woo JW, Lee JH, Lee KE, Choe JH, Kim JH, Kim JS. Nomogram for predicting central node metastasis in papillary thyroid carcinoma. J Surg Oncol 2017;115:266-72.
- Hegedüs L. Clinical practice. The thyroid nodule. N Engl J Med 2004;351:1764-71.
- Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer 2015;136:E359-86.
- Coca-Pelaz A, Shah JP, Hernandez-Prera JC, Ghossein RA, Rodrigo JP, Hartl DM, et al. Papillary Thyroid Cancer-Aggressive Variants and Impact on Management: A Narrative Review. Adv Ther 2020;37:3112-28.
- Liu R, Zhang B. Role of Ultrasound in the Management of Thyroid Nodules and Thyroid Cancer. Zhongguo Yi Xue Ke Xue Yuan Xue Bao 2017;39:445-50.
- Tessler FN, Middleton WD, Grant EG, Hoang JK, Berland LL, Teefey SA, Cronan JJ, Beland MD, Desser TS, Frates MC, Hammers LW, Hamper UM, Langer JE, Reading CC, Scoutt LM, Stavros AT. ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee. J Am Coll Radiol 2017;14:587-95.
- Chen Y, Gao Z, He Y, Mai W, Li J, Zhou M, Li S, Yi W, Wu S, Bai T, Zhang N, Zeng W, Lu Y, Liu H. An Artificial Intelligence Model Based on ACR TI-RADS Characteristics for US Diagnosis of Thyroid Nodules. Radiology 2022;303:613-9.
- Du H, Chen F, Li H, Wang K, Zhang J, Meng J, Li H, Xu X, Qu J, Wu R, Li J, Zhang M, Zhang F, Zhu X. Deep-learning radiomics based on ultrasound can objectively evaluate thyroid nodules and assist in improving the diagnostic level of ultrasound physicians. Quant Imaging Med Surg 2024;14:5932-45.
- Sosa JA, Hanna JW, Robinson KA, Lanman RB. Increases in thyroid nodule fine-needle aspirations, operations, and diagnoses of thyroid cancer in the United States. Surgery 2013;154:1420-6; discussion 1426-7.
- Özel D, Özel BD, Özkan F. Potential causes for obtaining non-diagnostic results from fine needle aspiration biopsy of thyroid nodules. Radiol Med 2016;121:510-4.
- Cibas ES, Ali SZ. The 2017 Bethesda System for Reporting Thyroid Cytopathology. Thyroid 2017;27:1341-6.
- Shen YT, Chen L, Yue WW, Xu HX. Artificial intelligence in ultrasound. Eur J Radiol 2021;139:109717.
- Mintz Y, Brodie R. Introduction to artificial intelligence in medicine. Minim Invasive Ther Allied Technol 2019;28:73-81.
- Avanzo M, Wei L, Stancanello J, Vallières M, Rao A, Morin O, Mattonen SA, El Naqa I. Machine and deep learning methods for radiomics. Med Phys 2020;47:e185-202.
- Huang S, Yang J, Fong S, Zhao Q. Artificial intelligence in cancer diagnosis and prognosis: Opportunities and challenges. Cancer Lett 2020;471:61-71.
- Zhu Z, Lin K, Jain AK, Zhou J. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans Pattern Anal Mach Intell 2023;45:13344-62.
- Morid MA, Borjali A, Del Fiol G. A scoping review of transfer learning research on medical image analysis using ImageNet. Comput Biol Med 2021;128:104115.
- Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52.
- Gillies RJ, Schabath MB. Radiomics Improves Cancer Screening and Early Detection. Cancer Epidemiol Biomarkers Prev 2020;29:2556-67.
- Li X, Zhang S, Zhang Q, Wei X, Pan Y, Zhao J, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol 2019;20:193-201.
- Yang WT, Ma BY, Chen Y. A narrative review of deep learning in thyroid imaging: current progress and future prospects. Quant Imaging Med Surg 2024;14:2069-88.
- Mayerhoefer ME, Materka A, Langs G, Häggström I, Szczypiński P, Gibbs P, Cook G. Introduction to Radiomics. J Nucl Med 2020;61:488-95.
- Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med 2021;13:152.
- Yu CJ, Yeh HJ, Chang CC, Tang JH, Kao WY, Chen WC, Huang YJ, Li CH, Chang WH, Lin YT, Sufriyana H, Su EC. Lightweight deep neural networks for cholelithiasis and cholecystitis detection by point-of-care ultrasound. Comput Methods Programs Biomed 2021;211:106382.
- Pramanik R, Biswas M, Sen S, Souza Júnior LA, Papa JP, Sarkar R. A fuzzy distance-based ensemble of deep models for cervical cancer detection. Comput Methods Programs Biomed 2022;219:106776.
- Zhao H, Su Y, Lyu Z, Tian L, Xu P, Lin L, Han W, Fu P. Non-invasively Discriminating the Pathological Subtypes of Non-small Cell Lung Cancer with Pretreatment (18)F-FDG PET/CT Using Deep Learning. Acad Radiol 2024;31:35-45.
- Anilkumar KK, Manoj VJ, Sagi TM. Automated detection of leukemia by pretrained deep neural networks and transfer learning: A comparison. Med Eng Phys 2021;98:8-19.
- Kleppe A, Skrede OJ, De Raedt S, Liestøl K, Kerr DJ, Danielsen HE. Designing deep learning studies in cancer diagnostics. Nat Rev Cancer 2021;21:199-211.
- Xia J, Chen H, Li Q, Zhou M, Chen L, Cai Z, Fang Y, Zhou H. Ultrasound-based differentiation of malignant and benign thyroid Nodules: An extreme learning machine approach. Comput Methods Programs Biomed 2017;147:37-49.

