Multi-type classification of lung nodules based on CT radiomics and ensemble learning for diversity weighting

Guozhi Tang; Lingyan Du; Shihai Ling; Yue Che; Xin Chen

doi:10.21037/qims-24-1315

Original Article

Guozhi Tang¹,², Lingyan Du¹,², Shihai Ling¹,², Yue Che¹,², Xin Chen³

¹School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin, China; ²Artificial Intelligence Key Laboratory of Sichuan Province, Yibin, China; ³Department of Integrated Traditional Chinese and Western Medicine, Zigong First People’s Hospital, Zigong, China

Contributions: (I) Conception and design: G Tang, L Du; (II) Administrative support: L Du; (III) Provision of study materials or patients: Y Che, S Ling; (IV) Collection and assembly of data: S Ling, G Tang; (V) Data analysis and interpretation: Y Che, G Tang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Lingyan Du, PhD. School of Automation and Information Engineering, Sichuan University of Science and Engineering, No. 1 Baita Road, Yibin 644000, China; Artificial Intelligence Key Laboratory of Sichuan Province, Yibin, China. Email: dulingyan@suse.edu.cn.

Background: The accurate classification of lung nodules is critical to achieving personalized lung cancer treatment and prognosis prediction. The treatment options for lung cancer and the prognosis of patients are closely related to the type of lung nodules, but there are many types of lung nodules, and the distinctions between certain types are subtle, making accurate classification based on traditional medical imaging technology and doctor experience challenging. This study adopts a novel approach, using computed tomography (CT) radiomics to analyze the quantitative features in CT images to reveal the characteristics of lung nodules, and then employs diversity-weighted ensemble learning to enhance the accuracy of classification by integrating the predictive results of multiple models.

Methods: We extracted lung nodules from the Lung Image Database Consortium image collection (LIDC-IDRI) dataset and derived radiomics features from the nodules. For the classification tasks of seven types of lung nodules, each was split into binary classifications. Two model-building methods were employed: M1 (equal-weighted voting ensemble classifier) and M2 (diversity-weighted voting ensemble classifier). Models were evaluated using 10-fold cross-validation with metrics including the area under the receiver operating characteristic curve (AUC), accuracy, specificity, and sensitivity.

Results: Both methods effectively completed classification tasks. The M2 method outperformed M1, particularly in classifying texture, calcification, and the benign and malignant nature of lung nodules. The AUC values of the M2 method in the four subclassifications of texture types of lung nodules were 0.9913, 0.8838, 0.9525, and 0.8845, with the corresponding accuracies of 0.9651, 0.8116, 0.9000, and 0.8284, respectively. In the classification of the degree of calcification of lung nodules, the AUC value of the M2 method was 0.9775 with an accuracy of 0.9642. In the classification of the benign and malignant nature of lung nodules, the AUC value of the M2 method was 0.8953 with an accuracy of 0.8168. The combination of CT radiomics and diversity-weighted ensemble learning effectively identifies lung nodule types, providing a novel method for the precise classification of lung nodules and aiding personalized lung cancer treatment and prognosis prediction.

Conclusions: The combination of CT radiomics and diversity-weighted ensemble learning can effectively identify the type of lung nodules.

Keywords: Radiomics; lung nodule classification; ensemble classifier; medical images; machine learning

Submitted Jun 28, 2024. Accepted for publication Oct 09, 2024. Published online Nov 29, 2024.

Introduction

The global prevalence of lung cancer makes it one of the most lethal malignancies (1). It begins in the lungs and initially develops as a single or multiple nodules that can eventually spread to other organs and tissues in the body. Lung nodules can be categorized into different nodule types based on their shape, margins, and internal features, such as lobulated nodules, spiculated nodules, and ground-glass nodules (GGNs) (2). Determining the type of nodules is crucial for doctors to assess the risk of the nodule becoming cancerous and to choose the appropriate personalized treatment for the patient (3). In the past two decades, there has been a remarkable surge in artificial intelligence (AI) technology, leading to an increasing number of researchers focusing on investigating computer-assisted diagnostic (CAD) systems that integrate AI technology. For example, Ni et al. (4) developed an artificial neural network (ANN)-based model for the classification of eight types of lung nodules using computed tomography (CT) images. However, training deep learning models with CT images requires significant resources and costs, especially for three-dimensional (3D) medical images. Furthermore, the efficacy of deep learning methods heavily relies on large amounts of quality data, which are often limited by privacy and ethical concerns in medicine.

Radiomics is a technique for analyzing lesions by employing digital image processing methods to extract high-throughput features from medical images that are imperceptible to the human eye (5,6). In contrast to image data, radiomics feature data are commonly stored in a tabular format, which offers a more straightforward and organized data structure. This format aligns well with the mature utilization of traditional machine learning algorithms, ensuring enhanced speed and efficiency in data processing. Compared to deep learning, traditional machine learning algorithms can be effectively trained on smaller datasets. For example, Rundo et al. (7) employed a combination of radiomics and machine learning techniques, which demonstrated relatively low training data requirements, training time, and computational power costs. Their approach successfully enabled the classification of solid versus sub-solid lung nodules as well as non-solid versus partially solid lung nodules. The mean area under the receiver operating characteristic (ROC) curve (AUC) for these two classifications reached 0.89±0.02 and 0.80±0.18, respectively. That study confirms that radiomics combined with machine learning can achieve the classification of texture types of lung nodules. Based on this, we asked whether the combination of radiomics and machine learning can also achieve the classification of more lung nodule types, as in the study by Ni et al. (4). Recent advancements have also explored the broad applications of radiomics in lung cancer management. Although the primary focus of this study is the classification of nodule morphology categories, it is worth noting that radiomics and the morphological characteristics of lung nodules have also shown potential in other related areas, such as predicting lung tumor growth intervals (8,9) and predicting lung adenocarcinoma and its subtypes (10,11). Additionally, accurately predicting the type of lung nodule (e.g., solid or subsolid) further contributes to the advancement of such studies. These related studies highlight the versatility of radiomics in offering deeper insights into lung cancer behavior, thereby supporting early detection and personalized treatment approaches.

Accurate classification of different nodule types requires the construction of multitasking models with high generalisability and robustness. Ensemble learning is a powerful strategy in machine learning that improves model performance by building and combining multiple learners. It can effectively integrate the advantages of different models when dealing with multiple different tasks, thus demonstrating better generalization ability and robustness than a single model.

In summary, the objective of this study is to integrate CT radiomics and ensemble learning methods for precise classification of seven distinct types of lung nodules, thereby providing more efficient and accurate computer-aided decision support for lung cancer diagnosis. We present this article in accordance with the TRIPOD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1315/rc).

Methods

Dataset

The Lung Image Database Consortium image collection (LIDC-IDRI) (12) is an international web-based resource containing images and lesion annotations for lung diagnosis, lung cancer screening, and chest CT scans, specifically designed for the development, training, and evaluation of CAD techniques for lung cancer detection and diagnosis. The dataset includes 1,018 chest CT scans, with a peak voltage of 120 to 140 kV and a peak current of 40 to 624 mA during CT image acquisition. The CT scan images are in Digital Imaging and Communications in Medicine (DICOM) format, which is the standard image format in the medical field. The image size is 512×512 pixels, and the pixel values are expressed in Hounsfield units (HU). The CT scans that make up the dataset were acquired using the following slice thickness settings: 0.6, 0.75, 0.9, 1.0, 1.25, 1.5, 2.0, 2.5, 3.0, 4.0, and 5.0 mm. The lesions range from 3 to 30 mm in diameter, and each lesion was independently labeled by experienced radiologists. There is no shortage of multinodular cases in the CT data, and each nodule has been assessed in detail for the type of features (13), including malignant potential, sphericity, margin definition, spiculation, lobulation, texture, calcification, and internal structure. Considering the differences in nodule labeling by different physicians, we extracted nodules that obtained consensus from at least three physicians in the current study. After this screening, the final number of nodules identified was 1,426. Specific data on nodule types are shown in Table 1. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Table 1

Distribution of nodule types in the LIDC-IDRI dataset

Annotation of nodule type in the dataset	Description	Nodule type grading	Quantities
Sphericity	3D shape of the nodule	1: linear	1
		2: ovoid/linear	67
		3: ovoid	373
		4: ovoid/round	598
		5: round	387
Lobulation	Degree of lobulation	1: no lobulation	730
		2: nearly no lobulation	363
		3: medium lobulation	201
		4: near marked lobulation	90
		5: marked lobulation	42
Spiculation	Degree of spiculation	1: no spiculation	843
		2: nearly no spiculation	334
		3: medium spiculation	127
		4: near marked spiculation	69
		5: marked spiculation	53
Texture	Nodule texture traits (solid, ground glass, or mixed)	1: non-solid/GGO	205
		2: non-solid/mixed	65
		3: part solid/mixed	115
		4: solid/mixed	214
		5: solid	827
Margin	Description of how well-defined the nodule margin is	1: poorly defined	115
		2: near poorly defined	194
		3: medium margin	186
		4: near sharp	399
		5: sharp	532
Calcification	Pattern of calcification	1: popcorn	0
		2: laminated	0
		3: solid	100
		4: non-central	4
		5: central	10
		6: absent	1,312
Malignancy	Subjective assessment of the likelihood of malignancy	1: highly unlikely	127
		2: moderately unlikely	164
		3: indeterminate	685
		4: moderately suspicious	276
		5: highly suspicious	174
Internal structure	Internal composition of the nodule	1: soft tissue	1,417
		2: fluid	3
		3: fat	0
		4: air	6

LIDC-IDRI, the lung image database consortium image collection; 3D, three-dimensional; GGO, ground glass opacity.

To address the imbalance in the number of type categories among the same lung nodule types in the LIDC-IDRI dataset, we adopted a hierarchical stratification processing strategy. This approach first segments the type hierarchically based on the number of nodules. Then, it assigns binary classification labels based on the characteristics associated with the type itself. For example, based on the annotations provided in the original dataset for lobulated lung nodules, there are five lobulation grades: grade 1 (no lobulation), grade 2 (nearly no lobulation), grade 3 (medium lobulation), grade 4 (near marked lobulation), and grade 5 (marked lobulation). A hierarchical framework was constructed based on these grades, where grades 1 and 2 were categorized as “lowly lobulated nodules”, grades 4 and 5 were categorized as “highly lobulated nodules”, and grade 3 was categorized as “moderately lobulated nodules”. Furthermore, to achieve precise binary classification, we established three sets of binary classification labels based on the aforementioned hierarchical stratification outcomes. The label “3 [0]–12 [1]” represents the classification task between moderately lobulated and lowly lobulated lung nodules, “45 [0]–12 [1]” represents the classification task between highly lobulated and lowly lobulated lung nodules, and “45 [0]–3 [1]” represents the classification task between highly lobulated and moderately lobulated lung nodules. See Table 2 for further elaboration.

Table 2

Binary classification labels

Nodule type	Nodule type grading [binary label]	Quantities	Code name
Sphericity	123 [0]–45 [1]	441–985	Sph1
Lobulation	3 [0]–12 [1]	201–1,093	Lob1
	45 [0]–12 [1]	132–1,093	Lob2
	45 [0]–3 [1]	132–201	Lob3
Spiculation	3 [0]–12 [1]	127–1,177	Spi1
	45 [0]–12 [1]	122–1,177	Spi2
	45 [0]–3 [1]	122–127	Spi3
Texture	1 [0]–5 [1]	205–827	Tex1
	1 [0]–234 [1]	205–394	Tex2
	23 [0]–5 [1]	180–827	Tex3
	4 [0]–5 [1]	214–827	Tex4
Margin	3 [0]–12 [1]	186–309	Mar1
	12 [0]–45 [1]	309–931	Mar2
	3 [0]–45 [1]	186–931	Mar3
Calcification	345 [0]–6 [1]	114–1,312	Cal1
Malignancy	123 [0]–45 [1]	976–450	Mal1
Internal structure	–	–	–

A code name is a surrogate name for that classification in this article.
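
The labeling scheme in Table 2 can be illustrated with a minimal sketch for the three lobulation tasks. This is not the authors' code: the names `LOBULATION_TASKS`, `binary_label`, and `build_task_dataset` are hypothetical, and the grade groupings are simply transcribed from the Lob1 to Lob3 rows of the table.

```python
# Hypothetical task table: code name -> (grades labeled 0, grades labeled 1),
# transcribed from the lobulation rows of Table 2.
LOBULATION_TASKS = {
    "Lob1": ({3}, {1, 2}),
    "Lob2": ({4, 5}, {1, 2}),
    "Lob3": ({4, 5}, {3}),
}


def binary_label(grade, task):
    """Return 0 or 1 for a nodule grade, or None if the grade is not used by the task."""
    zeros, ones = LOBULATION_TASKS[task]
    if grade in zeros:
        return 0
    if grade in ones:
        return 1
    return None


def build_task_dataset(grades, task):
    """Keep only nodules whose grade belongs to the task; return (index, label) pairs."""
    pairs = []
    for idx, grade in enumerate(grades):
        label = binary_label(grade, task)
        if label is not None:
            pairs.append((idx, label))
    return pairs
```

Nodules whose grade is not part of a given task are dropped from that task rather than relabeled, which is why the quantities in Table 2 differ from task to task.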

However, even after stratification, the problem of imbalance in the number of classification classes among classification labels remains prominent. For this reason, we further subdivide the classes with larger sample sizes into multiple small subsets and train on these subsets separately. These small subsets are collectively called the ‘training subset’. For example, in the lung nodule sphericity classification task, the two classes of the binary classification contain 441 [0] and 985 [1] nodules, respectively, a ratio close to 1:2. Therefore, during training, we randomly divide the class labeled ‘1’ in this classification dataset into two equal parts, forming two small training subsets together with the class labeled ‘0’. The number of classification classes for the two small training subsets is 441:493 and 441:491, respectively. The classification performance is finalized by calculating the average of the training results of each subset. This approach aims to mitigate the impact of imbalance in classification label categories on classification performance.
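
The subset construction described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and the rule for choosing the number of subsets (the majority-to-minority ratio rounded to the nearest integer) is an assumption, since the text only gives the sphericity example.

```python
import random


def split_majority(minority_ids, majority_ids, seed=0):
    """Pair the minority class with near-equal random chunks of the majority class."""
    # Assumed rule: number of subsets = majority/minority ratio, rounded, at least 1.
    n_subsets = max(1, round(len(majority_ids) / len(minority_ids)))
    shuffled = list(majority_ids)
    random.Random(seed).shuffle(shuffled)
    # Interleaved slicing gives chunks whose sizes differ by at most one.
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(list(minority_ids), chunk) for chunk in chunks]


def average_metric(values):
    """Final score for a task: mean of the per-subset results."""
    return sum(values) / len(values)


# Sphericity task sizes from Table 2: 441 nodules labeled 0, 985 labeled 1.
subsets = split_majority(range(441), range(441, 1426))
```

Every minority sample appears in each subset, while each majority sample appears in exactly one, so every subset is roughly balanced.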

Feature processing

CT image pre-processing and region‑of‑interest segmentation

To eliminate batch effects in CT images from the LIDC-IDRI dataset due to differences in institutions and equipment, this study employed various standardization and correction techniques on the original CT images. First, intensity normalization (normalizing all CT image intensity values to the range of 0 to 1) and voxel resampling (resampling all slice thicknesses to 1 mm) were applied to ensure image uniformity. Next, the region of interest (ROI) masks required for radiomics feature extraction were obtained from the true nodule contour annotations in the dataset. Due to variations in contour annotations by different physicians, a 50% consistency criterion was used to extract the ROI masks. In practical applications, 3D Slicer (www.slicer.org) can be used for semi-automatic segmentation of ROI slices to achieve effective nodule ROI extraction.
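
Two of the steps above, intensity normalization to the range 0 to 1 and the 50% consistency criterion for ROI masks, can be sketched on flat lists of voxel values. The function names are hypothetical, and two details are assumptions not stated in the text: min-max scaling is used for the normalization, and a voxel is kept when at least (rather than strictly more than) 50% of the readers included it.

```python
def normalize_intensity(values, lo=None, hi=None):
    """Min-max scale a flat list of intensities to [0, 1]."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    # Clip so that values outside a user-supplied window still land in [0, 1].
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]


def consensus_mask(masks, level=0.5):
    """Voxel-wise vote over reader masks (flat 0/1 lists of equal length)."""
    n_readers = len(masks)
    # Assumption: a voxel is kept when the agreeing fraction is >= level.
    return [1 if sum(votes) / n_readers >= level else 0 for votes in zip(*masks)]
```

For a real 3D volume the same logic would be applied voxel by voxel after resampling; the flat-list form is used here only to keep the sketch dependency-free.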

Feature extraction

In this study, we used the PyRadiomics (5) tool to extract radiomics features from the lung CT images provided by the LIDC-IDRI dataset. This process involves using nodule masks and actual nodule information from the dataset. We extracted a total of 1,064 features from each nodule, which were categorized into three main groups based on the filtering status of the image: (I) unfiltered features of the original image, 124 in total; (II) features obtained by applying Laplacian of Gaussian filtering to the original image, 188 in total; and (III) features obtained by applying wavelet filtering to the original image, 752 in total. These features are further subdivided into first-order statistical features (222 in total), 3D shape features (17 in total), and higher-order texture features (825 in total). The higher-order texture features include gray level co-occurrence matrix (GLCM) features (14,15), gray level size zone matrix (GLSZM) features (16), gray level run length matrix (GLRLM) features (17), neighbouring gray tone difference matrix (NGTDM) features (18), and gray level dependence matrix (GLDM) features (19). More detailed information about the radiomics features and the reproducibility of their extraction can be found at the following link: https://pyradiomics.readthedocs.io/en/latest/.

For ease of presentation in this article, all feature names are abbreviated. For example, the feature name ‘wavelet-LLL_gldm_DependenceNonUniformity’ is abbreviated to ‘wav-LLL_gldm_DN’. The feature name consists of three parts: (I) filtering status, (II) feature type, and (III) name.

Image filtering includes: diagnostics (CT voxel statistics within the ROI), abbreviated as diag; original (unfiltered original data), abbreviated as org; wavelet-LLL (wavelet filtering, covering all possible combinations of a high-pass or low-pass filter applied in each of the three dimensions, such as LLL, HHH, and HHL), abbreviated as wav-LLL; log-sigma-3-mm-3D (Laplacian of Gaussian filtering with sigma set to 3 to enhance fine textures), abbreviated as log-sigma-3; and log-sigma-5-mm-3D (Laplacian of Gaussian filtering with sigma set to 5 to enhance rough textures), abbreviated as log-sigma-5.

Feature types include: image-original (CT image before resampling), mask-original (nodule mask before resampling), image-interpolated (CT image after resampling), mask-interpolated (nodule mask after resampling), first-order, shape, GLCM, GLSZM, GLRLM, NGTDM, and GLDM.

Names include the name of each feature, including Energy, Entropy, etc. The abbreviations of the names refer to the official PyRadiomics documentation.

Feature pre-processing

The features need to be preprocessed before model training (20).

Z-score normalization: after extracting radiomics features, Z-score normalization (subtracting the mean and dividing by the standard deviation) was performed to ensure that different features were on the same scale. This approach also effectively mitigates the differences introduced by varying equipment or imaging protocols.
Near-zero variance analysis: near-zero variance analysis aims to screen out features that do not contribute to the model’s prediction of the target category. The key to this approach lies in calculating the variances of features and eliminating those with minimal variances, thereby enhancing both the computational efficiency and predictive accuracy of the model.
Redundant feature analysis: the purpose of performing redundant feature analysis is to eliminate redundant features to simplify the model. First, we construct the Spearman correlation coefficient matrix between features and identify the feature pairs whose correlation exceeds 90%. Linear regression is then used to assess the predictive power of each feature in such a pair for the categorical variable. Finally, we retain the feature with the higher AUC value in each pair. This step helps to improve the performance and predictive accuracy of the model.
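
The three pre-processing steps can be sketched with the standard library alone. This is an illustrative simplification, not the authors' pipeline: all function names and the variance threshold are hypothetical, the variance check is applied to raw values (a standardized column always has unit variance), and the final step of choosing which feature of a correlated pair to keep by AUC is not implemented here.

```python
from statistics import mean, pstdev


def zscore(column):
    """Standardize one feature column: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in column]


def drop_near_zero_variance(features, threshold=1e-8):
    """Keep features whose raw variance exceeds a (hypothetical) threshold."""
    return {name: col for name, col in features.items() if pstdev(col) ** 2 > threshold}


def average_ranks(column):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    ranks = [0.0] * len(column)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and column[order[end + 1]] == column[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the rank vectors."""
    rx, ry = zscore(average_ranks(x)), zscore(average_ranks(y))
    return mean([a * b for a, b in zip(rx, ry)])


def redundant_pairs(features, cutoff=0.9):
    """List feature pairs whose absolute Spearman correlation exceeds the cutoff."""
    names = list(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(spearman(features[a], features[b])) > cutoff
    ]
```

Each pair returned by `redundant_pairs` is a candidate for the retention step described in the text, in which only the more predictive feature of the pair is kept.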

Feature selection

To reduce the complexity of the model, improve the computational efficiency, and enhance the prediction accuracy of the model, this study applied the least absolute shrinkage and selection operator (LASSO) algorithm (21) to perform the dimensionality reduction and selection of the features. The core of the LASSO algorithm lies in introducing the L1 penalty term in the loss function, which enables the constraint and compression of the variable weights in the model. In the implementation, we used the LassoCV tool provided by the scikit-learn (Sklearn) library (22). The tool calculates the weight coefficient of each feature through the LASSO model, with values ranging between −1 and 1. The absolute value of a feature weight directly reflects the importance of its contribution to the model; the larger the absolute value, the higher the contribution of the feature in the model. Finally, the dimensionality reduction and feature selection process is completed by retaining only the features with non-zero weights.
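
For reference, the L1-penalized objective can be written in the form minimized by the scikit-learn Lasso estimator. The notation is introduced here for illustration and does not appear in the original: X is the N×p matrix of pre-processed features, y the vector of binary labels, β the vector of feature weights, and α ≥ 0 the penalty strength, which LassoCV selects by cross-validation.

$\hat{\beta} = \arg\min_{\beta} \frac{1}{2N} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$

Because the L1 term penalizes the absolute value of each weight, increasing α drives more components of β to exactly zero, which is what allows features with non-zero weights to be read off directly as the selected subset.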

Classifier

The classification of the given data into specific classes necessitates the utilization of classifiers. Choosing the right classifier or the right combination of classifiers is an important part of building a classification model. The present study employed a weighted voting mechanism to integrate five distinct classifiers, thereby establishing an ensemble classifier. This ensemble approach aims to synthesize the unique advantages of each base classifier, and by weighing their prediction results, it improves the overall classification performance while allowing it to be better adapted to the seven different classification tasks in this paper.

Base classifiers

In the field of machine learning, support vector machine (SVM) (23) is a supervised learning algorithm for data classification that finds the optimal hyperplane. The core principle is to construct one or more hyperplanes to achieve linear separability of data in the feature space.

The K-nearest neighbors (KNN) algorithm (24), as an instance-based learning method, is based on the core principle of predicting the class of a sample by considering the classification information from its k nearest neighbors in the feature space.

Linear discriminant analysis (LDA) (25), also known as Fisher’s discriminant analysis, aims to achieve maximum separability between classes. This method introduces a projection surface. Samples are mapped onto this surface, and their category is determined based on their resulting projection points.

Random forest (RF) (26), an ensemble learning algorithm, predicts by building multiple decision trees from bootstrapped dataset samples. Each tree splits on a subset of features selected randomly at each node, with the final prediction derived by aggregating individual tree outcomes through voting or averaging. This method extends and optimizes the Bagging approach (27).

eXtreme gradient boosting tree (XGBoost) (28), an advanced form of gradient boosting decision tree (GBDT) (29,30), enhances the traditional framework by adding regularization to prevent overfitting. Unlike GBDT, which uses a greedy algorithm to explore all split points, XGBoost employs an approximate greedy algorithm, improving efficiency by sorting feature values and selecting candidate splits based on quantiles. With parallel processing for faster split finding, XGBoost excels in speed and performance on large datasets.

Table 3 shows the parameters commonly used for the five base classifiers and the values at which these parameters were set in this paper.

Table 3

Base classifier parameter settings

Classifiers	Parameter	Value
SVM	Regularization parameter: C	exp⁻³–exp³*
	Kernel	linear, rbf, sigmoid*
	Class_weight	balanced
	Gamma	scale, auto*
KNN	N_neighbors	1–21*
	Power parameter for the Minkowski metric: P	1, 2*
	Weight	uniform, distance*
LDA	Solver	lsqr, eigen*
	Shrinkage	auto
RF	N_estimators	50–1,000*
	Max_depth	1–20*
	Criterion	gini, entropy*
	Max_features	sqrt, log2*
XGBoost	Colsample_bytree	0.6–1*
	Subsample	0.6–1*
	Gamma	0–0.5*
	Learning_rate	0.01–0.15*
	Max_depth	1–13*
	Min_child_weight	1–10*
	N_estimators	50–1,000*
	Objective	binary:logistic

* the parameter was optimized using grid search in the experiment. SVM, support vector machine; KNN, k-nearest neighbors; LDA, linear discriminant analysis; RF, random forest; XGBoost, extreme gradient boosting tree.

Diversity-ensemble classifier

The fundamental concept of ensemble learning (31) is to construct and integrate multiple classifiers to accomplish diverse tasks. With ensemble learning, it is often possible to achieve superior performance over a single classifier. In ensemble learning, a good ensemble strategy is crucial, and different learning tasks may require different ensemble strategies. For numerical outputs, common ensemble methods include simple averaging and weighted averaging. On the other hand, for classification tasks, the voting method is typically employed for the ensemble. Voting methods can be further subdivided into absolute majority, relative majority, and weighted voting methods. Among them, the voting for class label classification is called hard voting, while the voting for class probability is called soft voting. In soft voting, the weighted voting method is often used. In this study, we adopted the weighted voting method as the ensemble strategy for the ensemble classifier.

In the weighted voting method, accurately determining the weighting factor is the most critical issue. In common ensemble learning algorithms, the confidence method and error rate weighting method are commonly employed to assign weights to the base classifiers, such as the RF algorithm, which is a representative example of an error rate weighting method. These methods can be collectively referred to as objective weighting methods. There is also a subjective weighting method, such as the hierarchical weighting method. Due to the differences in the mechanism and parameter settings of different base classifiers, it is more difficult for the subjective weighting method to assess the importance of each base classifier accurately. The objective weighting method mainly sets the weights based on the fluctuation of the performance index, but the uncertainty of the output of the base classifiers is significant, which leads to the lack of precision of the weights obtained by this type of method. Especially when the performance difference between base classifiers is slight, it is difficult for the confidence and error rate weighting methods to perform effective weighting. To compensate for these shortcomings, we used diversity weighting in this study. The diversity weighting method assigns weights by evaluating the uniqueness of individual base classifiers and their complementary roles in the overall decision. The diversity weighting method focuses on the performance differences between base classifiers and avoids simply favoring the best-performing base classifiers.

The effectiveness of the diversity weighting method in ensemble learning has been demonstrated with favorable outcomes. For example, Yang et al. (32) proposed an ensemble classification method based on accuracy and diversity. Experiments on the UC Irvine Machine Learning Repository show that the method can achieve good classification performance.

In this study, we chose five classifiers, namely SVM, KNN, LDA, RF, and XGBoost, as the base classifiers and used the weighted voting method to integrate them and build the ensemble classifiers. In the weighting process, we use the diversity expressed by the disagreement measure (33) as the weights of the weighted voting method. The weights are calculated (34) as follows:

$\omega(c_{i}) = \frac{\sum_{j=1}^{n} \mathit{dval}_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathit{dval}_{ij}}$ [1]

where c_i denotes the i-th base classifier, n denotes the number of base classifiers, and dval_ij denotes the number of samples that c_i classifies correctly and c_j classifies incorrectly.
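
The weight calculation can be illustrated with a short sketch that counts, for every ordered pair of classifiers, the samples the first gets right and the second gets wrong. The function name `diversity_weights` and the equal-weight fallback for the case in which no classifier ever disagrees with another are assumptions made for this example.

```python
def diversity_weights(predictions, y_true):
    """Weights from pairwise disagreement counts.

    predictions[i][k] is classifier i's predicted label for validation sample k.
    """
    n = len(predictions)
    correct = [[p == t for p, t in zip(preds, y_true)] for preds in predictions]
    # dval[i][j]: samples that classifier i gets right and classifier j gets wrong.
    dval = [
        [sum(ci and not cj for ci, cj in zip(correct[i], correct[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in dval)
    if total == 0:
        # Assumed fallback: no pairwise disagreement at all, so use equal weights.
        return [1.0 / n] * n
    return [sum(row) / total for row in dval]
```

A classifier earns a larger weight when it is often correct on samples that the other classifiers miss, so two classifiers with the same accuracy can still receive different weights.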

Model building and evaluation

Model building

This study aims to enhance the model’s performance in classifying different types of lung nodules by integrating radiomics features with a diversity-weighted voting ensemble classifier. Based on this, we used two different model construction methods to build the corresponding nodule-type classification models, and then analyzed and compared the modeling results of these models. The two modeling methods are denoted as M1: the selected features are inputted into the equal-weighted voting ensemble classifier for classification (this method is used as a baseline to compare with improved methods); M2: the selected features are inputted into the diversity-weighted voting ensemble classifier for classification.
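
The difference between the two methods can be sketched as a single voting function that takes optional weights. This assumes that voting is carried out on class probabilities (soft voting), which the text suggests but does not state explicitly; the function name `soft_vote` and the 0.5 decision threshold are likewise hypothetical.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-classifier positive-class probabilities into fused scores and labels.

    probabilities[i][k] is classifier i's P(label = 1) for sample k.
    weights=None gives the equal-weight baseline (M1-style); passing
    diversity weights gives the weighted variant (M2-style).
    """
    n = len(probabilities)
    if weights is None:
        weights = [1.0 / n] * n
    norm = sum(weights)
    # Weighted average of the classifiers' probabilities, one value per sample.
    fused = [
        sum(w * p for w, p in zip(weights, sample_probs)) / norm
        for sample_probs in zip(*probabilities)
    ]
    labels = [1 if score >= threshold else 0 for score in fused]
    return fused, labels
```

With equal weights every base classifier contributes the same amount to the fused score; with unequal weights a single heavily weighted classifier can overturn the unweighted majority.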

The overall flowchart of this study is shown in Figure 1. During the experimental process, we employed a 10-fold cross-validation method. Specifically, the training subset for each task was randomly divided into 10 equal subsets to ensure consistency in data distribution. In ten iterations, one subset was selected as the test set in each iteration, while the remaining nine subsets were combined to serve as the training set. Within the training set, 10% of the data was randomly chosen as the validation set, which was used to generate weighted voting weights. It should be noted that both the validation set and the test set were excluded from feature selection; their roles were confined to weight generation and model performance validation. The performance metrics of the model on the test set were recorded in each iteration. Finally, by calculating the average of the model performance metrics across all test sets, a comprehensive evaluation of the model performance was obtained.
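
The data partitioning described above can be sketched as an index generator. This is a simplified illustration under stated assumptions: the function name is hypothetical, the folds are formed by plain random shuffling without class stratification, and the validation set is taken as the first 10% of the reshuffled training indices.

```python
import random


def cross_validation_splits(n_samples, n_folds=10, val_fraction=0.1, seed=0):
    """Yield (train, validation, test) index lists, one triple per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        rng.shuffle(rest)
        # Hold out a fraction of the remaining data for generating voting weights.
        n_val = max(1, round(val_fraction * len(rest)))
        validation, train = rest[:n_val], rest[n_val:]
        yield train, validation, test
```

In each fold, feature selection and classifier fitting would use only the `train` indices, the voting weights would be derived from `validation`, and the reported metrics from `test`.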

Figure 1 Experimental process. SVM, support vector machine; LDA, linear discriminant analysis; KNN, K-nearest neighbors; RF, random forest; XGBoost, extreme gradient boosting tree; ROC, receiver operating characteristic; AUC, area under the ROC curve; PIS, Permutation Importance score; LC, LASSO coefficient; MIS, mutual information score; L, low; H, high; LASSO, least absolute shrinkage and selection operator.

Evaluation metrics

In this study, we used accuracy (ACC), AUC, specificity (SP), and sensitivity (SN) as evaluation metrics to assess the performance of the classification model quantitatively. Accuracy is the most intuitive performance metric, which is the ratio of the number of correctly classified samples to the total number of samples. The specificity metric evaluates the model’s capacity to correctly identify negative samples (non-target classes), while sensitivity measures its ability to accurately recognize positive samples (target classes). These metrics constitute a comprehensive evaluation system to ensure a comprehensive quantitative assessment of the accuracy and effectiveness of the classification model. According to the confusion matrix (35), they are calculated as follows:

$\mathrm{ACC} = \frac{tp + tn}{tp + fp + tn + fn}$ [2]

$\mathrm{Specificity} = \frac{tn}{tn + fp}$ [3]

$\mathrm{Sensitivity} = \frac{tp}{tp + fn}$ [4]

where tp (true positive) denotes an actual positive sample and a positive prediction; fp (false positive) denotes an actual negative sample but a positive prediction; fn (false negative) denotes an actual positive sample but a negative prediction; and tn (true negative) denotes an actual negative sample and a negative prediction as well.
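
Equations [2] to [4] translate directly into code. The sketch below uses hypothetical function names and assumes that labels are encoded as 0 (negative) and 1 (positive); an undefined ratio, such as specificity when there are no negative samples, is returned as NaN.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, specificity, and sensitivity as defined in Eqs. [2]-[4]."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    nan = float("nan")
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SP": tn / (tn + fp) if (tn + fp) else nan,
        "SN": tp / (tp + fn) if (tp + fn) else nan,
    }
```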

AUC measures the overall ability of a classifier to discriminate between positive and negative samples (36); for a classifier that performs no worse than chance, its value ranges from 0.5 to 1. The closer the AUC is to 0.5, the more limited the model’s capacity to discriminate between samples; conversely, the closer the AUC is to 1, the stronger that capacity. It is calculated by the following formula:

$\mathrm{AUC} = \frac{s_{p} - n_{p}(n_{p} + 1)/2}{n_{p} n_{n}}$ [5]

where s_p denotes the sum of the ranks of the positive samples when all samples are ranked in ascending order of predicted score, and n_p and n_n denote the numbers of positive and negative samples, respectively.
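Eq. [5] can be checked on a small example. The sketch below is a minimal pure-Python implementation of the rank-sum form, under the assumption that tied scores receive the average of the ranks they span; the helper names `average_ranks` and `rank_sum_auc` and the toy scores are hypothetical.

```python
def average_ranks(scores):
    """Return 1-based ascending ranks; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores equal to scores[order[i]].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_auc(y_true, scores):
    """AUC from the rank sum of the positive samples, as in Eq. [5]."""
    ranks = average_ranks(scores)
    n_p = sum(1 for y in y_true if y == 1)
    n_n = len(y_true) - n_p
    s_p = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)


# Toy example for illustration only (not data from this study).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = rank_sum_auc(y_true, scores)
print(auc)
```

Here the positive samples hold ranks 2 and 4, so s_p = 6, and Eq. [5] gives (6 − 3)/4 = 0.75.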

To evaluate the statistical significance of differences in AUC between the two models, we used the DeLong test (37), which is specifically designed to compare the AUCs of correlated ROC curves. To assess the significance of differences in accuracy, sensitivity, and specificity between the two models, we used the Wilcoxon signed-rank test (38), a non-parametric test for paired data that remains applicable when the normality assumption may not hold. A P value <0.05 from either test was interpreted as indicating a statistically significant difference between the two models in the corresponding performance metric.
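The paired comparison can be sketched with `scipy.stats.wilcoxon`. The example below assumes that each model yields one accuracy value per paired evaluation run (for instance, per cross-validation fold); the pairing scheme and the numbers are hypothetical and are not results from this study. The DeLong test is not covered in this sketch.

```python
from scipy.stats import wilcoxon

# Hypothetical paired accuracies, one value per evaluation run.
# Illustrative numbers only; they are not results from this study.
acc_m1 = [0.80, 0.79, 0.84, 0.81, 0.83, 0.78, 0.82, 0.85, 0.77, 0.76]
acc_m2 = [0.85, 0.83, 0.85, 0.88, 0.86, 0.84, 0.84, 0.93, 0.86, 0.86]

# Two-sided Wilcoxon signed-rank test on the paired differences.
result = wilcoxon(acc_m1, acc_m2)
statistic, p_value = result.statistic, result.pvalue

# Significance threshold adopted in the text.
significant = p_value < 0.05
print(f"W = {statistic}, P = {p_value:.4f}, significant: {significant}")
```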

To comprehensively assess the importance of input features for classifiers, we adopted three distinct evaluation methods: permutation importance, LASSO coefficient (LC), and mutual information approaches.

Permutation importance: this method involves randomly shuffling the values of each feature and then calculating the impact of this perturbation on model performance (such as a decrease in accuracy) to assess the importance of features. If the model’s performance significantly deteriorates after shuffling a particular feature, then that feature is considered important.
LC: in this study, we utilized the coefficients generated by the LASSO algorithm during the feature selection process. The larger the absolute value of the coefficient, the greater the contribution of the feature to the model; hence, the feature is deemed more important.
Mutual information: mutual information measures the mutual dependence between two variables. In evaluating feature importance, the importance of a feature is assessed by calculating the mutual information value between each feature and the target variable. A high mutual information value indicates a strong mutual dependence between the feature and the target variable, thus making the feature very important for predicting the target variable.

In this paper, the normalized importance scores obtained by these three methods are referred to as Permutation Importance score (PIS), LC, and mutual information score (MIS), respectively.
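The three scores can be sketched with Scikit-learn as follows. This is a minimal illustration under several assumptions that are not stated in the text: synthetic data from `make_classification` stand in for the radiomic features, a random forest serves as the classifier for permutation importance, the LASSO penalty `alpha=0.01` is arbitrary, and each score is normalized to [0, 1] by min-max scaling (the helper `min_max` is hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def min_max(values):
    """Scale scores to [0, 1]; the normalization scheme is an assumption."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return np.zeros_like(values) if span == 0 else (values - values.min()) / span


# Synthetic stand-in for a radiomic feature matrix (not data from this study).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# PIS: mean drop in accuracy after shuffling each feature
# (computed on the training data here for brevity).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
perm = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
pis = min_max(perm.importances_mean)

# LC: magnitude of the LASSO coefficients.
lasso = Lasso(alpha=0.01).fit(X, y)
lc = min_max(np.abs(lasso.coef_))

# MIS: mutual information between each feature and the class label.
mis = min_max(mutual_info_classif(X, y, random_state=0))

# Rank features by the average of the three normalized scores.
mean_score = (pis + lc + mis) / 3
top5 = np.argsort(mean_score)[::-1][:5]
high = mean_score > 0.75  # features that would be marked as highly important
print(top5, np.round(mean_score[top5], 3))
```

Averaging the three normalized scores mirrors the ranking used for Part B of Figures 2-8, where features with an average above 0.75 are marked with an asterisk.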

Feature extraction, model establishment, and statistical comparison were all implemented in Python (https://www.python.org/) using the Scikit-learn (https://scikit-learn.org/) and PyRadiomics (https://pyradiomics.readthedocs.io/en/latest/index.html) libraries on the Anaconda3 software platform (https://www.anaconda.com).

Results

Tables 4-10 present the evaluation results of the lung nodule type classification models constructed by the M1 and M2 methods. Additionally, we use the P value of the Hosmer-Lemeshow test (P_hl) to assess model calibration; P_hl >0.05 indicates no significant lack of fit, and the model is regarded as well-calibrated. Part A of Figures 2-8 displays the ROC curves of M1 and M2 for each classification task. Part B of Figures 2-8 displays the feature importance scores of the top five features in each classification task, ranked according to the average of the three values PIS, LC, and MIS. If the average value of a feature is greater than 0.75, we consider it to have high feature importance and mark it with an asterisk (*). Figure 8C shows the decision curve analysis for the benign and malignant classification of lung nodules.
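For reference, one common form of the Hosmer-Lemeshow test can be sketched as below. The text does not describe how P_hl was computed, so the grouping into ten equal-sized bins of predicted probability, the chi-square reference distribution with (groups − 2) degrees of freedom, the function name `hosmer_lemeshow`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2


def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Return the Hosmer-Lemeshow statistic and its P value (P_hl)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob)                 # sort by predicted probability
    groups = np.array_split(order, n_groups)   # roughly equal-sized bins
    stat = 0.0
    for idx in groups:
        n_g = len(idx)
        observed = y_true[idx].sum()           # observed positives in the bin
        expected = y_prob[idx].sum()           # expected positives in the bin
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    p_hl = chi2.sf(stat, df=n_groups - 2)
    return stat, p_hl


# Simulated data for illustration only (not data from this study).
rng = np.random.default_rng(0)
true_prob = rng.uniform(0.05, 0.95, size=1000)
labels = rng.binomial(1, true_prob)

stat_good, p_good = hosmer_lemeshow(labels, true_prob)       # calibrated scores
stat_bad, p_bad = hosmer_lemeshow(labels, true_prob * 0.5)   # underestimated risk
print(f"calibrated: P_hl = {p_good:.3f}; miscalibrated: P_hl = {p_bad:.3g}")
```

In this sketch, halving the predicted probabilities produces a systematic underestimate of risk, so the second call returns a very small P_hl, whereas probabilities equal to the true event rates would usually yield P_hl >0.05.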