Comparison of artificial intelligence (AI) services for Breast Imaging-Reporting and Data System (BI-RADS) classification on mammograms
Introduction
Breast cancer (BC), a type of cancer that develops from breast tissue, is a major medical and socioeconomic concern. It ranks as the most common oncological disease and the leading cause of cancer-related deaths among women (1). Early detection of BC through screening programs significantly reduces morbidity and mortality (2). It also improves patient quality of life, because early-stage cancers can be treated less invasively (3). Mammography remains the gold standard and the only established screening method for BC: it is widely available, has proven quality criteria, and is supported by strong evidence from numerous prospective randomized studies and meta-analyses (4,5), with mammography screening shown to lower BC mortality by 20–49% (6,7). However, traditional mammography screening faces challenges, including understaffed radiology departments, long work shifts, and high staff workload, partly due to a shortage of subspecialized breast radiologists and an elevated risk of burnout (8,9). Additionally, interpretation of mammographic images can be subjective and varies even among experienced radiologists (10). Furthermore, mammography sensitivity and specificity can be suboptimal, especially in women with dense breast tissue. Addressing these limitations and embracing innovative approaches are crucial to enhancing screening efficiency.
Ever-improving artificial intelligence (AI) applications have become valuable tools to support radiologists. Studies demonstrate that deep learning algorithms can interpret mammograms with accuracy and speed comparable to (11,12) or even exceeding those of human radiologists (13). These algorithms hold promise for supporting medical decision-making, reducing workload, and independently interpreting and classifying mammograms (14-16). Commercial-grade AI applications are already integrated into routine clinical practice (17). A noteworthy example is the Experiment on the use of innovative technologies in the field of computer vision for the analysis of medical images and further use in the healthcare system of Moscow (https://mosmed.ai/ai/). As of early 2024, the Experiment had endorsed 52 AI applications across 29 imaging modalities, including three dedicated specifically to mammography. This three-year project has demonstrated the high diagnostic accuracy and technical stability of the AI applications, leading to the expansion of the mandatory health insurance system with a new service titled “Description and interpretation of mammographic examination data using artificial intelligence”. However, the full potential of AI applications remains underexplored because of the wide variety of use-case scenarios and the uneven maturity of different AI solutions. The current research landscape often focuses on the ability of AI to detect malignant tumours without using the BI-RADS classification system (18) or relies on a simplified version of it (19). The Breast Imaging-Reporting and Data System (BI-RADS), developed by the American College of Radiology, is a standardized reporting system for breast imaging with defined malignancy risk percentages: BI-RADS 1 (0%), BI-RADS 2 (0%), BI-RADS 3 (0–2%), BI-RADS 4 (2–95%), and BI-RADS 5 (≥95%). In some cases, researchers limit themselves to analysing lesions classified as BI-RADS 4 (20-23) and 5 (24-28), because such changes are either suspicious (BI-RADS 4) or highly suggestive of malignancy (BI-RADS 5), both requiring biopsy. This approach centers on mammographic AI’s primary purpose as a tool for detecting malignant changes. However, it limits assessment of AI accuracy in other common clinical scenarios, including mammograms showing no abnormalities (BI-RADS 1), benign findings (BI-RADS 2), or probably benign changes (BI-RADS 3).
Goal
The goal of this study was to compare the diagnostic accuracy of three mammographic AI services in predicting individual BI-RADS categories and to define opportunities for integrating AI into routine clinical practice. We present this article in accordance with the STARD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1658/rc).
Methods
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments and was approved by the Independent Ethics Committee of the Moscow Regional Branch of the Russian Society of Radiologists and Radiographers (IEC MRO RORR) (approval number 2, dated 20.02.2020). The study is registered on ClinicalTrials.gov (NCT04489992). Informed consent was obtained during routine clinical care.
We assessed the diagnostic accuracy of three mammographic AI applications: Celsius (OOO Medicinskie Skrining Sistemy, celsus.ai/products-mammography), Trio DM (AO MTL, www.mtl.ru/products/mammography), and Third Opinion Mammography (OOO Platforma Tret’e Mnenie, thirdopinion.ai/mmg). The AI applications passed a calibration test and were then validated on the study data. Below, these applications are anonymized and presented in randomized order as AI-1, AI-2, and AI-3. The architecture of all AI models under study is a trade secret.
Validation testing consists of several stages, including assessment of documentation, technical quality, and diagnostic accuracy. Diagnostic accuracy is assessed using a dataset of 100 studies with binary markup (presence/absence of pathological signs) and a 50/50 representation of classes. The evaluation metrics included the area under the curve (AUC), accuracy, sensitivity, and specificity. The threshold value for passing the test is 0.81 (29). Table 1 shows the performance values of the services declared by their developers.
Table 1
| Parameter | AI-1 | AI-2 | AI-3 |
|---|---|---|---|
| Pathology | Breast cancer | Breast cancer | Breast cancer |
| ROC AUC | 0.93 | 0.88 | 0.94 |
| Sensitivity | 0.88 | 0.83 | 0.91 |
| Specificity | 0.94 | 0.83 | 0.91 |
| Overall accuracy | 0.91 | 0.83 | 0.91 |
AI, artificial intelligence; AUC, area under the curve; ROC, receiver operating characteristic.
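To make the calibration scoring concrete, the following minimal R sketch (our illustration, not the developers' code; the input vectors and the assumption that the 0.81 threshold applies to each metric are ours) computes the four metrics above for one service:

```r
# A minimal sketch of calibration-test scoring. `truth` holds the binary
# markup (1 = pathology present, 0 = absent); `prob` holds the probability
# of pathology returned by an AI service.
score_service <- function(truth, prob, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & truth == 1); tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0); fn <- sum(pred == 0 & truth == 1)
  # ROC AUC via the rank-based (Mann-Whitney) formulation
  n1 <- sum(truth == 1); n0 <- sum(truth == 0)
  auc <- (sum(rank(prob)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(AUC = auc,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(truth))
}

# Toy 100-study set with a 50/50 class balance, as in the test dataset
set.seed(1)
truth <- rep(c(1, 0), each = 50)
prob  <- c(rbeta(50, 4, 2), rbeta(50, 2, 4))   # simulated service outputs

# Passing rule assumed for illustration: every metric must reach 0.81 (29)
all(score_service(truth, prob) >= 0.81)
```

With only 50 studies per class, each metric estimate carries a wide confidence interval, which is consistent with the intervals reported for calibration testing in Table 4.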
Diagnostic accuracy was assessed using data from screening mammograms analyzed by the AI applications in clinical settings, together with the calibration test results. Screening mammograms were performed in outpatient clinics affiliated with the Moscow Health Care Department on digital mammography systems from various manufacturers (GE, Siemens, Hologic, etc.) that are certified for use in medical institutions in Moscow and comply with Moscow screening standards (30). Standard screening mammography in Moscow includes four projections (R/L CC/MLO) per study. Studies of breasts with implants were included, as they are part of routine screening. The images were interpreted in batch-reading mode by radiologists from Moscow polyclinics with an average of more than two years of experience in interpreting mammograms. BI-RADS categories were extracted at the exam level from radiology reports and aggregated as the highest category across both breasts. Anonymized studies were sent to the AI applications and then interpreted by radiologists (15). Figure 1 shows a simplified diagram of the life cycle of an AI service. The mammographic AI services entered the Experiment with the declared metrics (Table 1), which the developers calculated during clinic-based clinical and technical testing. We then performed calibration testing of these services on the same 100 mammograms, analysed the data obtained from the services in real clinical practice, and compared the diagnostic accuracy of the three mammographic AI services in determining individual BI-RADS categories.
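As an illustration of the exam-level aggregation rule (a sketch under our reading of the description above, with hypothetical per-breast categories):

```r
# Exam-level BI-RADS taken as the highest category across both breasts;
# the per-breast vectors below are hypothetical examples for three exams.
left  <- c(1, 2, 4)
right <- c(2, 2, 1)
exam_birads <- pmax(left, right)   # element-wise maximum: 2 2 4
```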
The real-world studies were extracted from the Unified Radiological Information Service of the Unified Medical Information and Analytical System of Moscow (ERIS EMIAS). Inclusion criteria: screening mammogram, radiology reports from an AI and a human radiologist, and patient age 40–75 years (Order of the Ministry of Health of Russia dated 04/27/2021 N 404n). Exclusion criteria: mammograms without BI-RADS categories (31); BI-RADS 0 (incomplete assessment requiring additional imaging or of unsuitable quality); and BI-RADS 6 (known biopsy-confirmed malignancy that does not require diagnostic evaluation by AI). Data collection took four months (July 2023–October 2023). Calibration testing utilized a closed dataset containing 100 studies: 50 with histologically confirmed BC and 50 without malignancy (confirmed by follow-up of at least one year and two or more imaging modalities) (29). Since the number of studies assigned a particular BI-RADS category may be too low for reliable assessment of diagnostic accuracy, the included mammographic studies were combined into two groups: “target pathology is absent” and “target pathology is present”. The target pathology was BC according to mammography data.
Since BI-RADS 3 changes are “probably benign” (0–2% risk of BC) (16), they cannot be definitively assigned to either group. Therefore, two allocation strategies were employed: first, BI-RADS 3 was considered as “target pathology is absent” (together with BI-RADS 1 and BI-RADS 2), and then as “target pathology is present” (together with BI-RADS 4 and BI-RADS 5). According to the Moscow guidelines for the use of BI-RADS (30), BI-RADS 3 changes are managed with a 6-month follow-up rather than an immediate biopsy. However, recognizing that the American College of Radiology (ACR) BI-RADS considers category 3 to be “negative”, we conducted the analysis using both approaches, treating BI-RADS 3 as “benign” (grouped with BI-RADS 1–2) and as “possibly malignant” (grouped with BI-RADS 4–5), to evaluate AI performance under different classification paradigms (this is a research abstraction). In addition, although BI-RADS 1 and 2 differ in the presence of findings, they were combined, since both indicate the absence of suspicious changes requiring intervention.
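In code, the two allocation strategies reduce to a threshold on the exam-level category; a minimal sketch (the `exam_birads` vector is hypothetical):

```r
exam_birads <- c(1, 2, 3, 4, 5, 2, 3)    # hypothetical exam-level categories
# Strategy A: BI-RADS 3 grouped with "target pathology is absent"
label_a <- as.integer(exam_birads >= 4)  # 1 only for BI-RADS 4-5
# Strategy B: BI-RADS 3 grouped with "target pathology is present"
label_b <- as.integer(exam_birads >= 3)  # 1 for BI-RADS 3-5
```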
Statistical analysis
Diagnostic accuracy values [i.e., sensitivity, specificity, overall accuracy, positive predictive value (PPV), and negative predictive value (NPV)] were calculated using the conclusions of human radiologists as the gold standard, as well as the calibration test data. To determine these, we built contingency tables containing the numbers of true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) results. The 95% confidence intervals (CI) for the operating characteristics were calculated using the Clopper-Pearson interval. The AI applications were compared using the Chi-square test. All calculations were performed in R (R Project, version 4.3.1).
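A minimal sketch of these calculations in base R (toy counts rather than our study data; `binom.test()` returns the exact Clopper-Pearson interval):

```r
# Contingency counts for one AI service (hypothetical toy numbers)
tp <- 930; fn <- 70; fp <- 20600; tn <- 36000

sens <- tp / (tp + fn)                    # sensitivity
spec <- tn / (tn + fp)                    # specificity
ppv  <- tp / (tp + fp)                    # positive predictive value
npv  <- tn / (tn + fn)                    # negative predictive value
acc  <- (tp + tn) / (tp + tn + fp + fn)   # overall accuracy

# Exact (Clopper-Pearson) 95% CI, e.g., for sensitivity
sens_ci <- binom.test(tp, tp + fn)$conf.int

# Chi-square comparison of the same metric between two services (toy counts)
m <- matrix(c(930, 70,     # AI-1: TP, FN
              770, 230),   # AI-2: TP, FN
            nrow = 2, byrow = TRUE)
chisq.test(m)$p.value
```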
Results
The study sample comprised 81,895 mammogram analyses: 81,598 from routine clinical practice and 297 from calibration testing, in which the same 100-study test dataset was processed by each of the three services (Figure 2). AI-1 did not process 3 of the test studies for technical reasons. In routine clinical practice, the AI services processed different numbers of mammograms because they were connected at different times; each mammogram was processed by a single service. The test dataset was created for calibration testing of the AI services. At least one hundred studies of a given type are sent from the ERIS EMIAS information system to an AI service for analysis; the service sequentially analyzes these studies and returns the results to ERIS EMIAS. Once the results are returned, a table is compiled containing the identification numbers of the studies and the probability of pathology, and radiologists evaluate the calibration test results from this table. The clinical-practice sample, being random, is representative of the target population in terms of the prevalence of the target pathology.
Table 2 presents the diagnostic accuracy values for three AI applications for a sample of examinations from routine clinical practice (based on radiologist conclusions).
Table 2
| Operational metrics | Application | BI-RADS 1 | BI-RADS 2 | BI-RADS 3 | BI-RADS 4 | BI-RADS 5 |
|---|---|---|---|---|---|---|
| Sensitivity (95% CI) | AI-1 (n=57,394) | 46.0% (45.1–47.0%) | 45.4% (45.2–45.6%) | 39.1% (36.9–41.3%) | 42.3% (39.3–45.3%) | 66.1% (56.9–74.3%) |
| | AI-2 (n=20,861) | 21.9% (20.7–23.2%) | 60.6% (60.2–60.9%) | – | 65.1% (60.4–69.6%) | 82.5% (67.1–92.0%) |
| | AI-3 (n=3,280) | 3.3% (2.0–5.3%) | 57.2% (56.3–58.1%) | 36.6% (28.7–45.1%) | 52.4% (39.7–64.8%) | 20.0% (1.1–69.7%) |
| Specificity (95% CI) | AI-1 (n=57,394) | 77.2% (77.0–77.3%) | 65.4% (64.6–66.2%) | 77.8% (77.7–77.8%) | 94.4% (94.4–94.5%) | 98.7% (98.6–98.7%) |
| | AI-2 (n=20,861) | 85.5% (85.2–85.7%) | 42.0% (40.7–43.3%) | – | 77.4% (77.3–77.5%) | 99.3% (99.3–99.4%) |
| | AI-3 (n=3,280) | 96.0% (95.8–96.4%) | 47.6% (44.2–51.1%) | 77.9% (77.6–78.3%) | 84.6% (84.4–84.9%) | 98.9% (98.8–98.9%) |
| Overall accuracy (95% CI) | AI-1 (n=57,394) | 72.4% (72.1–72.7%) | 49.5% (49.2–49.8%) | 76.5% (76.3–76.6%) | 93.5% (93.4–93.6%) | 98.6% (98.6–98.6%) |
| | AI-2 (n=20,861) | 74.9% (74.5–75.3%) | 56.4% (55.9–57.0%) | – | 77.2% (77.0–77.4%) | 99.3% (99.3–99.4%) |
| | AI-3 (n=3,280) | 83.3% (83.0–83.9%) | 55.3% (53.9–56.7%) | 76.2% (75.6–76.9%) | 84.0% (83.5–84.5%) | 98.8% (98.7–98.9%) |
| PPV (95% CI) | AI-1 (n=57,394) | 26.8% (26.2–27.3%) | 83.4% (83.1–83.8%) | 5.7% (5.4–6.0%) | 11.7% (10.9–12.5%) | 9.2% (7.9–10.4%) |
| | AI-2 (n=20,861) | 22.9% (21.6–24.2%) | 78.5% (78.0–79.0%) | – | 5.5% (5.1–5.9%) | 19.5% (15.9–21.8%) |
| | AI-3 (n=3,280) | 11.8% (7.0–18.8%) | 81.4% (80.2–82.7%) | 6.6% (5.2–8.1%) | 6.3% (4.7–7.7%) | 2.6% (0.1–9.2%) |
| NPV (95% CI) | AI-1 (n=57,394) | 88.7% (88.5–88.9%) | 23.8% (23.5–24.0%) | 97.4% (97.3–97.5%) | 98.9% (98.9–99.0%) | 99.9% (99.9–99.9%) |
| | AI-2 (n=20,861) | 84.7% (84.4–84.9%) | 23.4% (22.7–24.1%) | – | 99.1% (99.0–99.2%) | 100.0% (99.9–100.0%) |
| | AI-3 (n=3,280) | 86.2% (86.0–86.5%) | 21.7% (20.1–23.3%) | 96.7% (96.2–97.1%) | 98.9% (98.6–99.2%) | 99.9% (99.8–100.0%) |
AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value.
Table 3 presents the diagnostic metrics (DM) for the three AI applications for binary classification performed using two calculation methods: inclusion of BI-RADS 3 in the “target pathology is absent” group or in the “target pathology is present” group (for the sample of examinations from routine clinical practice, based on radiologist conclusions).
Table 3
| DM | Application | DM value (0 = BI-RADS 1, 2 or 3; 1 = BI-RADS 4 or 5) | DM value (0 = BI-RADS 1 or 2; 1 = BI-RADS 3, 4 or 5) |
|---|---|---|---|
| Sensitivity (95% CI) | AI-1 (n=57,433) | 93.5% (93.4–93.5%) | 71.8% (71.7–71.9%) |
| | AI-2 (n=20,882) | 77.0% (76.9–77.1%) | 97.2% (97.0–97.4%) |
| | AI-3 (n=3,283) | 83.8% (83.6–84.1%) | 62.2% (61.7–62.6%) |
| Specificity (95% CI) | AI-1 (n=57,433) | 63.3% (60.4–66.1%) | 71.6% (70.0–73.2%) |
| | AI-2 (n=20,882) | 77.5% (73.4–81.2%) | 91.3% (90.7–91.8%) |
| | AI-3 (n=3,283) | 67.6% (55.3–78.1%) | 71.8% (65.2–77.6%) |
| Overall accuracy (95% CI) | AI-1 (n=57,433) | 92.9% (92.8–93.0%) | 71.8% (71.7–72.0%) |
| | AI-2 (n=20,882) | 77.0% (76.8–77.2%) | 95.8% (95.5–96.1%) |
| | AI-3 (n=3,283) | 83.5% (83.0–83.9%) | 62.8% (62.0–63.5%) |
| PPV (95% CI) | AI-1 (n=57,433) | 99.2% (99.2–99.3%) | 97.9% (97.7–98.0%) |
| | AI-2 (n=20,882) | 99.4% (99.2–99.5%) | 97.2% (97.0–97.4%) |
| | AI-3 (n=3,283) | 99.2% (98.9–99.5%) | 97.1% (96.4–97.7%) |
| NPV (95% CI) | AI-1 (n=57,433) | 15.9% (15.2–16.6%) | 12.3% (12.1–12.6%) |
| | AI-2 (n=20,882) | 7.0% (6.6–7.3%) | 91.3% (90.7–91.8%) |
| | AI-3 (n=3,283) | 8.1% (6.7–9.4%) | 11.1% (10.1–12.0%) |
AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; CI, confidence interval; DM, diagnostic metrics; NPV, negative predictive value; PPV, positive predictive value.
Table 4 presents the DM for the three AI applications for binary classification performed using calibration testing.
Table 4
| DM | Application | DM value (0 = BI-RADS 1, 2 or 3; 1 = BI-RADS 4 or 5) | DM value (0 = BI-RADS 1 or 2; 1 = BI-RADS 3, 4 or 5) |
|---|---|---|---|
| Sensitivity (95% CI) | AI-1 (n=97) | 86.4% (78.3–92.5%) | 64.7% (54.8–72.1%) |
| | AI-2 (n=100) | 66.1% (58.3–71.0%) | 67.3% (57.9–74.5%) |
| | AI-3 (n=100) | 87.1% (79.5–92.5%) | 65.5% (57.2–69.4%) |
| Specificity (95% CI) | AI-1 (n=97) | 71.1% (58.4–80.4%) | 82.6% (71.6–90.8%) |
| | AI-2 (n=100) | 86.8% (74.1–94.8%) | 80.0% (68.6–88.8%) |
| | AI-3 (n=100) | 78.9% (66.6–87.7%) | 93.3% (83.2–98.2%) |
| Overall accuracy (95% CI) | AI-1 (n=97) | 80.4% (70.5–87.7%) | 73.2% (62.8–81.0%) |
| | AI-2 (n=100) | 74.0% (64.3–80.0%) | 73.0% (62.7–80.9%) |
| | AI-3 (n=100) | 84.0% (74.6–90.7%) | 78.0% (68.9–82.4%) |
| PPV (95% CI) | AI-1 (n=97) | 82.3% (74.5–88.0%) | 80.5% (68.2–89.7%) |
| | AI-2 (n=100) | 89.1% (78.6–95.7%) | 80.4% (69.2–89.1%) |
| | AI-3 (n=100) | 87.1% (79.5–92.5%) | 92.3% (80.6–97.9%) |
| NPV (95% CI) | AI-1 (n=97) | 77.1% (63.4–87.3%) | 67.9% (58.9–74.6%) |
| | AI-2 (n=100) | 61.1% (52.1–66.7%) | 66.7% (57.1–74.0%) |
| | AI-3 (n=100) | 78.9% (66.6–87.7%) | 68.9% (61.4–72.4%) |
AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; CI, confidence interval; DM, diagnostic metrics; NPV, negative predictive value; PPV, positive predictive value.
Table 5 presents a comparison of the diagnostic accuracy values for the three AI applications (based on radiologist conclusions).
Table 5
| DM | Application | P value (0 = BI-RADS 1, 2 or 3; 1 = BI-RADS 4 or 5) | P value (0 = BI-RADS 1 or 2; 1 = BI-RADS 3, 4 or 5) |
|---|---|---|---|
| Sensitivity | AI-1 and AI-2 | <0.001 | <0.001 |
| | AI-1 and AI-3 | 0.033 | 0.003 |
| | AI-2 and AI-3 | <0.001 | <0.001 |
| Specificity | AI-1 and AI-2 | 0.019 | <0.001 |
| | AI-1 and AI-3 | 0.738 | 0.986 |
| | AI-2 and AI-3 | 0.505 | 0.029 |
| Overall accuracy | AI-1 and AI-2 | <0.001 | <0.001 |
| | AI-1 and AI-3 | 0.042 | 0.007 |
| | AI-2 and AI-3 | 0.004 | <0.001 |
| PPV | AI-1 and AI-2 | 0.927 | 0.624 |
| | AI-1 and AI-3 | 0.986 | 0.814 |
| | AI-2 and AI-3 | 0.955 | 0.972 |
| NPV | AI-1 and AI-2 | <0.001 | <0.001 |
| | AI-1 and AI-3 | 0.011 | 0.234 |
| | AI-2 and AI-3 | 0.334 | <0.001 |
AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; DM, diagnostic metrics; NPV, negative predictive value; PPV, positive predictive value.
Table 6 presents the comparison of the AI applications based on calibration testing.
Table 6
| DM | Application | P value (0 = BI-RADS 1, 2 or 3; 1 = BI-RADS 4 or 5) | P value (0 = BI-RADS 1 or 2; 1 = BI-RADS 3, 4 or 5) |
|---|---|---|---|
| Sensitivity | AI-1 and AI-2 | 0.334 | 0.899 |
| | AI-1 and AI-3 | 0.978 | 0.970 |
| | AI-2 and AI-3 | 0.315 | 0.928 |
| Specificity | AI-1 and AI-2 | 0.562 | 0.918 |
| | AI-1 and AI-3 | 0.764 | 0.691 |
| | AI-2 and AI-3 | 0.779 | 0.619 |
| Overall accuracy | AI-1 and AI-2 | 0.700 | 0.990 |
| | AI-1 and AI-3 | 0.837 | 0.769 |
| | AI-2 and AI-3 | 0.552 | 0.759 |
| PPV | AI-1 and AI-2 | 0.779 | 0.998 |
| | AI-1 and AI-3 | 0.829 | 0.677 |
| | AI-2 and AI-3 | 0.935 | 0.666 |
| NPV | AI-1 and AI-2 | 0.491 | 0.953 |
| | AI-1 and AI-3 | 0.948 | 0.960 |
| | AI-2 and AI-3 | 0.436 | 0.913 |
AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; DM, diagnostic metrics; NPV, negative predictive value; PPV, positive predictive value.
Table 7 presents the comparison of AI performance in the clinical setting and in calibration testing. Differences in the values are related to the different sample sizes on which the DM were calculated.
Table 7
| DM | Application in clinical setting | Application on calibration testing | P value (0 = BI-RADS 1, 2 or 3; 1 = BI-RADS 4 or 5) | P value (0 = BI-RADS 1 or 2; 1 = BI-RADS 3, 4 or 5) |
|---|---|---|---|---|
| Sensitivity | AI-1 (n=57,433) | AI-1 (n=97) | 0.683 | 0.640 |
| | AI-2 (n=20,882) | AI-2 (n=100) | 0.449 | 0.082 |
| | AI-3 (n=3,283) | AI-3 (n=100) | 0.839 | 0.813 |
| Specificity | AI-1 (n=57,433) | AI-1 (n=97) | 0.654 | 0.519 |
| | AI-2 (n=20,882) | AI-2 (n=100) | 0.646 | 0.556 |
| | AI-3 (n=3,283) | AI-3 (n=100) | 0.618 | 0.274 |
| Overall accuracy | AI-1 (n=57,433) | AI-1 (n=97) | 0.343 | 0.904 |
| | AI-2 (n=20,882) | AI-2 (n=100) | 0.795 | 0.077 |
| | AI-3 (n=3,283) | AI-3 (n=100) | 0.969 | 0.157 |
| PPV | AI-1 (n=57,433) | AI-1 (n=97) | 0.320 | 0.403 |
| | AI-2 (n=20,882) | AI-2 (n=100) | 0.613 | 0.391 |
| | AI-3 (n=3,283) | AI-3 (n=100) | 0.489 | 0.828 |
| NPV | AI-1 (n=57,433) | AI-1 (n=97) | <0.001 | <0.001 |
| | AI-2 (n=20,882) | AI-2 (n=100) | <0.001 | 0.144* |
| | AI-3 (n=3,283) | AI-3 (n=100) | <0.001 | <0.001 |
*, AI-2 did not assign BI-RADS 3 to any study. AI, artificial intelligence; BI-RADS, Breast Imaging-Reporting and Data System; DM, diagnostic metrics; NPV, negative predictive value; PPV, positive predictive value.
The following examples illustrate the findings. Figure 3 demonstrates optimal AI model performance in a 51-year-old woman’s mammogram. The model correctly segmented and classified a cluster of suspicious microcalcifications in the upper inner quadrant of the left breast as BI-RADS 4. Simultaneously, isolated benign calcifications in the right breast were appropriately segmented and classified as BI-RADS 2. Axillary lymph nodes were also accurately segmented.
Figure 4 demonstrates AI model limitations in a 58-year-old woman’s mammogram. Postoperative fibrotic changes in the right breast were incorrectly classified as BI-RADS 4 by the AI but correctly classified as BI-RADS 2 by the radiologist. This discrepancy occurred because the radiologist, unlike the AI model, had access to the patient’s surgical history. Despite this error, the AI model correctly segmented skin thickening in the right breast. In addition, on the mediolateral oblique view, a mass in the posterior upper quadrant of the left breast was correctly segmented by the AI but misclassified as asymmetry (BI-RADS 3). The radiologist classified this finding as fibrotic change (BI-RADS 2) based on comparison with the prior mammogram (not shown); this discrepancy arose because the radiologist, unlike the AI model, had access to prior imaging studies for comparison.
Discussion
This study evaluates the diagnostic accuracy of three mammographic AI services. A strength of this work is the large volume of data analyzed (81,895 mammograms), which allows reliable statistical conclusions.
Most AI service metrics for individual BI-RADS categories (Table 2) were suboptimal for practical use (median accuracy: 76.9%). This may be attributed to the reliance on radiologist conclusions, as the known variability of BI-RADS interpretations among radiologists is 26–57% (32). Previous studies have shown that AI-radiologist agreement is 84.1%, with AI tending to assign higher BI-RADS categories (33). AI-2 achieved the highest accuracy for BI-RADS 5 (99.3%). A notable finding is the high PPV for BI-RADS 2 (78.5–83.4%), suggesting that all three AI applications can be recommended as a means of reliably confirming the corresponding changes, which are the most common in clinical practice. The median PPV was 11.75%, slightly lower than literature data (18.6% for screening mammography) (34). For BI-RADS 4, low PPV values (5.5–11.7%) were observed, differing from literature data (34–97%) (35). Another notable finding is the high NPV for BI-RADS 1, BI-RADS 3, BI-RADS 4 and BI-RADS 5 (over 84.7%), suggesting that all three AI applications can be recommended as a means of reliably ruling out the corresponding changes. Both NPV and PPV are particularly relevant to practising radiologists, as they reflect the probability of correctly classifying a study into the “without pathology” and “with pathology” groups, respectively (36).
To evaluate the binary classification results of the AI applications (presence/absence of malignant changes), individual BI-RADS categories were combined into two groups (Table 3). This approach is common in the literature (33,37). The median accuracy was slightly higher than for individual categories (80.5%), consistent with literature data (83.95%) (38). High PPV values were observed for all AI applications when classifying BI-RADS 3 as both normal and pathological (median value 98.6%), aligning with literature data (35). However, low NPV values were noted for most services when classifying BI-RADS 3 as “with pathology,” likely due to the absence of BI-RADS 3 classifications by AI-2.
Comparing the obtained metrics with the calibration test results (Table 4), the median accuracy in calibration testing (76.0%) was slightly lower than in clinical practice (80.5%). This might be attributed to the careful curation of the calibration datasets (29) and to the much larger number of mammograms acquired from clinical practice (81,598 studies) compared with the test dataset (297 analyses). The median PPV in calibration testing (84.7%) was also slightly lower.
On calibration testing, metrics from all three AI applications were broadly similar, with overlapping CIs and no statistically significant differences (P>0.05) (Table 6). In clinical practice, no significant differences in PPV were observed across the AI applications (Table 5), which means they can be recommended for confirming the absence of pathology. When comparing metrics from clinical practice data and calibration tests (Table 7), P values were low only for the NPV metric. Low NPV values can be typical for radiologists due to challenges in interpreting mammography, such as superimposition of normal breast tissue (39).
As an example of how a radiologist might apply these results, consider AI-1. Let group “0” contain BI-RADS 1, BI-RADS 2 or BI-RADS 3, and group “1” contain BI-RADS 4 or BI-RADS 5. If a new image is classified by AI-1 into group “0,” it truly belongs to a BI-RADS category of group “0” with a probability of 99.2% (95% CI: 99.2–99.3%) (Table 3).
Given that only about 1% of women who undergo screening mammography have BC (40), AI applications can be considered for pre-screening or “first reading” to rule out studies without pathological findings. This might alleviate the shortage of specialists and allow radiologists to focus on suspicious and complex cases.
Conclusions
Most AI service metrics for individual BI-RADS categories were suboptimal for practical use (median accuracy: 76.9%), which can be attributed to the limitations of this study. AI-2 achieved the highest accuracy for BI-RADS 5 (99.3%). Successful integration of AI into routine clinical practice requires consideration of various diagnostic accuracy assessment methods, tailored to specific use cases.
Study limitations
- The ground truth for the clinical practice dataset was radiologist conclusions, not pathomorphological findings.
- Only AI performance was analyzed, not the diagnostic accuracy of radiologists or physicians using AI.
- The AI application analyzed only mammographic images without clinical data.
- Developers provided no information on AI training datasets, potentially limiting the generalizability of AI results to other populations.
- The prevalence of the condition of interest in this study was assumed to reflect that in the population.
Study prospects
- This study demonstrates AI’s significant potential for widespread use in mammography analysis. Clinical implementation requires further evaluation of effectiveness and reliability, along with additional research, including large-scale prospective studies and addressing ethical, legal, and social concerns.
- Comparative analysis of AI applications that consider both radiological and clinical data is promising.
- Future studies with histopathology would provide more certain accuracy indicators.
- An AI analysis stratified by breast density categories (ACR A-D) would provide valuable information on the effect of tissue density on diagnostic accuracy.
- Pairwise testing on identical datasets would provide more reliable comparative data.
Acknowledgments
The authors thank A.A. Romanov and V.G. Klyashtorny for their valuable recommendations and advice during manuscript preparation.
Footnote
Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1658/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1658/dss
Funding: This work was supported by the autonomous non-profit organization “Moscow Center for Innovative Technologies in Healthcare” (No. USIS 123031400006-0).
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1658/coif). All authors report that this work was supported by the autonomous non-profit organization “Moscow Center for Innovative Technologies in Healthcare” (No. USIS 123031400006-0). The authors have no other conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and was approved by the Independent Ethics Committee of the Moscow Regional Branch of the Russian Society of Radiologists and Radiographers (approval number 2, dated 20.02.2020). Informed consent was obtained during routine clinical care.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Xu Y, Gong M, Wang Y, Yang Y, Liu S, Zeng Q. Global trends and forecasts of breast cancer incidence and deaths. Sci Data 2023;10:334. [Crossref] [PubMed]
- Engel JM, Stankowski-Drengler TJ, Stankowski RV, Liang H, Doi SA, Onitilo AA. All-cause mortality is decreased in women undergoing annual mammography before breast cancer diagnosis. AJR Am J Roentgenol 2015;204:898-902. [Crossref] [PubMed]
- James TA, Wade JE, Sprague BL. The impact of mammographic screening on the surgical management of breast cancer. J Surg Oncol 2016;113:496-500. [Crossref] [PubMed]
- Duffy SW, Tabár L, Yen AM, Dean PB, Smith RA, Jonsson H, et al. Mammography screening reduces rates of advanced and fatal breast cancers: Results in 549,091 women. Cancer 2020;126:2971-9. [Crossref] [PubMed]
- Nickson C, Mason KE, English DR, Kavanagh AM. Mammographic screening and breast cancer mortality: a case-control study and meta-analysis. Cancer Epidemiol Biomarkers Prev 2012;21:1479-88. [Crossref] [PubMed]
- Myers ER, Moorman P, Gierisch JM, Havrilesky LJ, Grimm LJ, Ghate S, Davidson B, Mongtomery RC, Crowley MJ, McCrory DC, Kendrick A, Sanders GD. Benefits and Harms of Breast Cancer Screening: A Systematic Review. JAMA 2015;314:1615-34. [Crossref] [PubMed]
- Duffy SW, Tabár L, Yen AM, Dean PB, Smith RA, Jonsson H, Törnberg S, Chiu SY, Chen SL, Jen GH, Ku MM, Hsu CY, Ahlgren J, Maroni R, Holmberg L, Chen TH. Beneficial Effect of Consecutive Screening Mammography Examinations on Mortality from Breast Cancer: A Prospective Study. Radiology 2021;299:541-7. [Crossref] [PubMed]
- Pesapane F, Abbate F, Bozzini A, Dominelli V, Farina M, Ferrari F, Latronico A, Marinucci I, Meneghetti L, Meroni S, Nicosia L, Penco S, Pizzamiglio M, Rotili A, Trentin C, Cassano E. What breast radiologists have learned from the COVID-19 pandemic. J Public Health Emerg 2022;6:7.
- Parikh JR, Sun J, Mainiero MB. Prevalence of Burnout in Breast Imaging Radiologists. J Breast Imaging 2020;2:112-8. [Crossref] [PubMed]
- Lee AY, Wisner DJ, Aminololama-Shakeri S, Arasu VA, Feig SA, Hargreaves J, Ojeda-Fournier H, Bassett LW, Wells CJ, De Guzman J, Flowers CI, Campbell JE, Elson SL, Retallack H, Joe BN. Inter-reader Variability in the Use of BI-RADS Descriptors for Suspicious Findings on Diagnostic Mammography: A Multi-institution Study of 10 Academic Radiologists. Acad Radiol 2017;24:60-6. [Crossref] [PubMed]
- Salim M, Wåhlin E, Dembrower K, Azavedo E, Foukakis T, Liu Y, Smith K, Eklund M, Strand F. External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncol 2020;6:1581-8. [Crossref] [PubMed]
- Rodriguez-Ruiz A, Lång K, Gubern-Merida A, Teuwen J, Broeders M, Gennaro G, Clauser P, Helbich TH, Chevalier M, Mertelmeier T, Wallis MG, Andersson I, Zackrisson S, Sechopoulos I, Mann RM. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol 2019;29:4825-32. [Crossref] [PubMed]
- Kim HE, Kim HH, Han BK, Kim KH, Han K, Nam H, Lee EH, Kim EK. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health 2020;2:e138-48. [Crossref] [PubMed]
- Watanabe AT, Lim V, Vu HX, Chim R, Weise E, Liu J, Bradley WG, Comstock CE. Improved Cancer Detection Using Artificial Intelligence: a Retrospective Evaluation of Missed Cancers on Mammography. J Digit Imaging 2019;32:625-37. [Crossref] [PubMed]
- Vasilev YA, Tyrov IA, Vladzymyrskyy AV, Arzamasov KM, Shulkin IM, Kozhikhina DD, Pestrenin LD. Double-reading mammograms using artificial intelligence technologies: A new model of mass preventive examination organization. Digit Diagnostics 2023;4:93-104.
- Arzamasov KM, Vasilev YA, Vladzymyrskyy AV, Omelyanskaya OV, Bobrovskaya TM, Semenov SS, Chetverikov SF, Kirpichev YS, Pavlov NA, Andreychenko AE. The use of computer vision for the mammography preventive research. Russ J Prev Med 2023;26:117-23.
- Dembrower K, Wåhlin E, Liu Y, Salim M, Smith K, Lindholm P, Eklund M, Strand F. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit Health 2020;2:e468-74. [Crossref] [PubMed]
- Dembrower K, Crippa A, Colón E, Eklund M, Strand F. ScreenTrustCAD Trial Consortium. Artificial intelligence for breast cancer detection in screening mammography in Sweden: a prospective, population-based, paired-reader, non-inferiority study. Lancet Digit Health 2023;5:e703-11. [Crossref] [PubMed]
- Sasaki M, Tozaki M, Rodríguez-Ruiz A, Yotsumoto D, Ichiki Y, Terawaki A, Oosako S, Sagara Y, Sagara Y. Artificial intelligence for breast cancer detection in mammography: experience of use of the ScreenPoint Medical Transpara system in 310 Japanese women. Breast Cancer 2020;27:642-51. [Crossref] [PubMed]
- Ezeana CF, He T, Patel TA, Kaklamani V, Elmi M, Brigmon E, Otto PM, Kist KA, Speck H, Wang L, Ensor J, Shih YT, Kim B, Pan IW, Cohen AL, Kelley K, Spak D, Yang WT, Chang JC, Wong STC. A Deep Learning Decision Support Tool to Improve Risk Stratification and Reduce Unnecessary Biopsies in BI-RADS 4 Mammograms. Radiol Artif Intell 2023;5:e220259. [Crossref] [PubMed]
- Lyu SY, Zhang Y, Zhang MW, Zhang BS, Gao LB, Bai LT, Wang J. Diagnostic value of artificial intelligence automatic detection systems for breast BI-RADS 4 nodules. World J Clin Cases 2022;10:518-27. [Crossref] [PubMed]
- Ghunaim HA, Alatawi RE, Borhan WM, Daqqaq TS, Alhasan AS, Aboualkheir MM, Elkady RM. Accuracy of imaging of BI-RADS 4 subcategorizations in breast lesion diagnosis: Radiologic-pathologic correlation. Saudi Med J 2024;45:1228-37. [Crossref] [PubMed]
- Yang L, Zhang N, Jia J, Ma Z. Deep learning radiomics on grayscale ultrasound images assists in diagnosing benign and malignant of BI-RADS 4 lesions. Sci Rep 2024;14:31479. [Crossref] [PubMed]
- Ma S, Li Y, Yin J, Niu Q, An Z, Du L, Li F, Gu J. Prospective study of AI-assisted prediction of breast malignancies in physical health examinations: role of off-the-shelf AI software and comparison to radiologist performance. Front Oncol 2024;14:1374278. [Crossref] [PubMed]
- Badawy E, Shalaby FS, Saif-El-nasr SI, Elyamany AM, Hegazy RMA. The synergy between AI and radiologist in advancing digital mammography: comparative study between stand-alone radiologist and concurrent use of artificial intelligence in BIRADS 4 and 5 female patients. Egypt J Radiol Nucl Med 2023; [Crossref]
- Luo WQ, Huang QX, Huang XW, Hu HT, Zeng FQ, Wang W. Predicting Breast Cancer in Breast Imaging Reporting and Data System (BI-RADS) Ultrasound Category 4 or 5 Lesions: A Nomogram Combining Radiomics and BI-RADS. Sci Rep 2019;9:11921. [Crossref] [PubMed]
- Wang G, Shi D, Guo Q, Zhang H, Wang S, Ren K. Radiomics Based on Digital Mammography Helps to Identify Mammographic Masses Suspicious for Cancer. Front Oncol 2022;12:843436. [Crossref] [PubMed]
- Li G, Huang Z, Luo H, Tian H, Ding Z, Deng Y, Xu J, Wu H, Dong F. Photoacoustic Imaging Radiomics to Identify Breast Cancer in BI-RADS 4 or 5 Lesions. Clin Breast Cancer 2024;24:e379-88.e1. [Crossref] [PubMed]
- Vasilev YA, Vladzymyrskyy AV, Omelyanskaya OV, Arzamasov KM, Chetverikov SF, Rumyantsev DA, Zelenova MA. Methodology for testing and monitoring artificial intelligence-based software for medical diagnostics. Digit Diagnostics 2023;4:252-67.
- Guidelines on the use of the BI-RADS system for mammographic examination. Edited by A.Y. Vasiliev. Moscow. Available online: https://telemedai.ru/biblioteka-dokumentov/metodicheskie-rekomendatsii-po-ispolzovaniyu-mezhdunarodnoy-sistemy-bi-rads-pri-mammograficheskom-obsledovanii
- Spak DA, Plaxco JS, Santiago L, Dryden MJ, Dogan BE. BI-RADS® fifth edition: A summary of changes. Diagn Interv Imaging 2017;98:179-90.
- Dang LA, Chazard E, Poncelet E, Serb T, Rusu A, Pauwels X, Parsy C, Poclet T, Cauliez H, Engelaere C, Ramette G, Brienne C, Dujardin S, Laurent N. Impact of artificial intelligence in breast cancer screening with mammography. Breast Cancer 2022;29:967-77. [Crossref] [PubMed]
- Vasiliev YA, Vladzimirskyy A, Arzamasov K, Shulkin IM, Aksenova L, Pestrenin L, Semenov S, Bondarchuk DV, Smirnov IV. The first 10,000 mammography exams performed as part of the “description and interpretation of mammography data using artificial intelligence” service. Manag Zdr 2023;8:54-67.
- Lee SE, Hong H, Kim EK. Positive Predictive Values of Abnormality Scores From a Commercial Artificial Intelligence-Based Computer-Aided Diagnosis for Mammography. Korean J Radiol 2024;25:343-50. [Crossref] [PubMed]
- Mohapatra SK, Mishra A, Sahoo TK, Nayak RB, Das PK, Nayak B. The Positive Predictive Values of the Breast Imaging Reporting and Data System (BI-RADS) 4 Lesions and its Mammographic Morphological Features. Indian J Surg Oncol 2021;12:182-9. [Crossref] [PubMed]
- Monaghan TF, Rahman SN, Agudelo CW, Wein AJ, Lazar JM, Everaert K, Dmochowski RR. Foundational Statistical Principles in Medical Research: Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value. Medicina (Kaunas) 2021;57:503. [Crossref] [PubMed]
- Coolen AMP, Lameijer JRC, Voogd AC, Louwman MWJ, Strobbe LJ, Tjan-Heijnen VCG, Duijm LEM. Characteristics of screen-detected cancers following concordant or discordant recalls at blinded double reading in biennial digital screening mammography. Eur Radiol 2019;29:337-44. [Crossref] [PubMed]
- Turk F, Akkur E, Eroğul O. BI-RADS categories and breast lesions classification of mammographic images using artificial intelligence diagnostic models. Neural Netw World 2023;33:413-32.
- Santos Aragon LN, Soto-Trujillo D. Effectiveness of Tomosynthesis Versus Digital Mammography in the Diagnosis of Suspicious Lesions for Breast Cancer in an Asymptomatic Population. Cureus 2021;13:e13838. [Crossref] [PubMed]
- Lehman CD, Arao RF, Sprague BL, Lee JM, Buist DS, Kerlikowske K, Henderson LM, Onega T, Tosteson AN, Rauscher GH, Miglioretti DL. National Performance Benchmarks for Modern Screening Digital Mammography: Update from the Breast Cancer Surveillance Consortium. Radiology 2017;283:49-58. [Crossref] [PubMed]


