Approach to a preparation of dataset combining digital mammographic images and patient clinical data from electronic medical records

Veronika Kazarinova; Yuriy Vasilev; Anton Vladzymyrskyy; Olga Omelyanskaya; Kirill Arzamasov; Ekaterina Savkina; Tatiana Bobrovskaya

doi:10.21037/qims-24-1689

Brief Report

Veronika Kazarinova¹, Yuriy Vasilev¹, Anton Vladzymyrskyy¹,², Olga Omelyanskaya¹, Kirill Arzamasov¹,³, Ekaterina Savkina¹, Tatiana Bobrovskaya¹

¹State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department”, Moscow, Russian Federation; ²Department of Information Technologies and Medical Data Processing, I.M. Sechenov First Moscow State Medical University of the Ministry of Health of the Russian Federation (Sechenov University), Moscow, Russian Federation; ³Department of Artificial Intelligence Technologies, MIREA – Russian Technological University, Moscow, Russian Federation

Correspondence to: Veronika Kazarinova, MD. State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department”, Petrovka str., 24, 127051 Moscow, Russian Federation. Email: KazarinovaVE@zdrav.mos.ru.

Abstract: Generating datasets is complex, expensive, and labor-intensive. However, the process can be optimized by modifying existing datasets for reuse, which also complies with the FAIR principles. In this work, we developed a method to enrich a dataset with patients’ clinical information from electronic medical records. The proposed approach includes the following stages: selection of studies with and without signs of the chosen pathology, formation of a list of clinical signs based on a literature review, extraction of clinical information, and data processing. The method enriches a dataset of radiological studies with clinical parameters, which saves resources and supports further use of the dataset. A limitation of the method is its dependence on the completeness of the clinical information entered into electronic medical records. In the course of this work, a dataset was generated and registered; it includes mammographic images of 200 patients and the following clinical information: the patient’s age at the time of the study, the age at menopause, and the number of births. A statistical analysis of the dataset was carried out. Despite a very weak correlation between the studied parameters and the presence of pathology, statistically significant differences were found between the groups of patients with and without pathology for age at the time of the study, age at menopause, and late menopause. The prepared dataset can be used for scientific research, as well as for training and testing artificial intelligence (AI)-based software that evaluates not only mammographic images but also clinical information.

Keywords: Dataset; artificial intelligence (AI); mammography; malignant breast tumors

Submitted Aug 14, 2024. Accepted for publication Jan 28, 2025. Published online Mar 18, 2025.

Introduction

The breast is the most common site of malignant tumors in women in the Russian Federation (1). Breast cancer ranks first (16.2%) in the structure of mortality in the female population. The number of patients diagnosed with stage I and II breast cancer has been increasing in recent years, which indicates a growing proportion of patients diagnosed at the earlier stages of the disease (2).

Early-stage breast cancer can be detected, even in the absence of symptoms, with mammography, a radiological imaging method that is the gold standard in the diagnosis of breast pathologies. Mammography has high sensitivity and specificity and is therefore included in screening programs for regular monitoring of patients’ health and timely detection of breast cancer (3). Currently in the Russian Federation, mammography of both breasts in two projections with double reading of images is performed every other year for women from 40 to 75 years old [except where the study cannot be conducted for medical reasons due to mastectomy; mammography is not performed if mammography or a breast computed tomography (CT) scan was carried out during the previous 12 months], according to Order No. 404n of the Ministry of Health of the Russian Federation dated April 27, 2021, “On approval of a procedure for preventive health check-ups for certain groups of the adult population” (4).

In standard clinical practice, a radiologist evaluates mammograms and describes the detected changes, classifying the result according to the Breast Imaging Reporting and Data System (BI-RADS). Depending on the severity of the detected changes, the specialist assigns a category from 0 to 6; the structure of the breast tissue is also recorded using the Latin letters A, B, C, or D. Image analysis is a difficult task because of the subtle differences between a pathology and the background fibroglandular tissue and the variety of lesion types, which, combined with a large flow of patients and limited time, may lead to incorrect conclusions (5). Breast cancer screening requires double reading, i.e., each study is evaluated by two radiologists, which increases their workload. Research shows that one reading can be performed by artificial intelligence (AI) models (6), which are now being actively implemented in practical healthcare (7) without compromising screening quality (8). In this way, the workload of radiologists is reduced (9).

Today, AI-based software demonstrates good results, in particular for mammography. In our study (10), the following diagnostic accuracy was achieved: one of five AI services showed sensitivity, specificity, and area under the curve (AUC) of 0.833 [95% confidence interval (CI): 0.728–0.939], 0.960 (95% CI: 0.906–1.000), and 0.958 (95% CI: 0.923–0.994), respectively. The “average” radiologist demonstrated the following minimum values of sensitivity, specificity, and AUC: 0.792 (95% CI: 0.677–0.907), 0.940 (95% CI: 0.874–1.000), and 0.928 (95% CI: 0.883–0.976), respectively. Liu et al. (11) conducted a meta-analysis of 32 studies on diagnosing breast cancer using machine learning methods; the summary estimates of sensitivity, specificity, and AUC were 0.914 (95% CI: 0.868–0.945), 0.916 (95% CI: 0.873–0.945), and 0.945, respectively.

However, existing AI-based software does not take into account patients’ clinical data, which could further improve diagnostic accuracy. Automated extraction of clinical information from a health information system (HIS) is a challenging task. One of the existing solutions is presented in the work of Zhang et al. (12). The problem is that electronic medical records are unstructured and differ in data presentation formats, and are therefore not suitable for subsequent machine processing. For this reason, it is almost impossible to find datasets containing images and related clinical information in the public domain. Thus, for the successful development and application of AI in the diagnosis of breast diseases, it is necessary to generate high-quality datasets combining digital mammographic images and clinical data of patients from electronic medical records.

The objective of this work was to develop an approach to generating a dataset augmented with clinical parameters of patients taken from electronic medical records.

Methods

Stage 1: selection of studies with/without signs of the chosen pathology

In this work, we used two datasets that include mammographic images with and without signs of breast cancer. Each dataset includes anonymized studies of 100 female patients from Moscow outpatient medical facilities who underwent screening mammography [i.e., a regular examination of asymptomatic patients at risk (women over 40 years old)] (13). The minimum patient age is 40 years, the maximum is 70 years, the mean is 61 years, and the median is 62 years. The studies were collected from March 28, 2018 through February 18, 2020. All information from state medical facilities is stored in the Unified Medical Information and Analytical System of Moscow (UMIAS), a modern information platform integrating all stages of medical care in Moscow. Screening mammography studies of women over 40 years old were randomly selected from this system, which allowed us to obtain the most representative sample of the general population of Moscow.

The datasets were used for external validation of AI-based software to confirm the declared values of diagnostic accuracy, as well as for fine-tuning of AI services (14). This work was part of the “Experiment on the use of innovative computer vision technologies for the analysis of medical images and further application in the Moscow healthcare system” (15), registered at ClinicalTrials.gov (NCT04489992). The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and approved by the Independent Ethics Committee of Moscow Regional Office of the Russian Society of Radiologists and Radiographists (protocol #2/2020, dated February 20, 2020). An informed consent form was signed during the clinical examination. The 50/50 ratio of classes with and without signs of malignancy follows from the minimum sample size required for the receiver operating characteristic (ROC) analysis that was applied in the external validation (16-18). When the validation dataset was created, the X-ray images were labeled by binary class, where “1” designated the presence of pathology and “0” its absence (Table 1), upon consensus between two radiologists with more than 3 years of experience in mammography. In case of discrepancy, the decision was made by an expert, a doctor with more than 5 years of experience in mammography.

Table 1

Fragment of the labeling table of the original dataset

Anonymized study number	Pathology
1.2.643.5.1.13.13.12.2.77.8252.0803	1
1.2.643.5.1.13.13.12.2.77.8252.0906	0
1.2.643.5.1.13.13.12.2.77.8252.1007	1
1.2.643.5.1.13.13.12.2.77.8252.1107	0
1.2.643.5.1.13.13.12.2.77.8252.0715	0
1.2.643.5.1.13.13.12.2.77.8252.0911	0
1.2.643.5.1.13.13.12.2.77.8252.1404	0
1.2.643.5.1.13.13.12.2.77.8252.0802	0
1.2.643.5.1.13.13.12.2.77.8252.0503	0
1.2.643.5.1.13.13.12.2.77.8252.0413	1

Radiologists reviewed the studies and applied binary labeling: “1” was assigned when radiological signs specific to BI-RADS 3 and higher were identified (Figure 1), and “0” in the case of BI-RADS 1–2 (Figure 2). Studies labeled as pathology (“1”) were additionally verified by biopsy (data from the Information and Analytical System “Cancer Registry”). A study was classified as normal (“0”) in case of consensus between two radiologists. In case of disagreement, the study was not included in the dataset.
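The labeling rule described above can be illustrated with a minimal Python sketch. The function names (`birads_to_label`, `consensus_label`) are hypothetical and are not part of the published dataset tooling. How expert arbitration and exclusion combine is an assumption made here for illustration: the sketch returns the expert decision when one is supplied and otherwise marks the study as excluded.

```python
from typing import Optional


def birads_to_label(birads: int) -> int:
    """Map a reader's BI-RADS category to the binary label from the text.

    BI-RADS 1-2 -> 0 (no pathology), BI-RADS 3 and higher -> 1 (pathology).
    Other values (including 0) are rejected, since the text gives no rule for them.
    """
    if birads in (1, 2):
        return 0
    if 3 <= birads <= 6:
        return 1
    raise ValueError(f"no labeling rule for BI-RADS {birads}")


def consensus_label(reader_a: int, reader_b: int,
                    expert: Optional[int] = None) -> Optional[int]:
    """Combine two readers' binary labels.

    Returns the shared label on agreement. On disagreement, returns the
    expert's label if one is given (assumption), otherwise None, meaning
    the study is left out of the dataset.
    """
    if reader_a == reader_b:
        return reader_a
    return expert


if __name__ == "__main__":
    # Hypothetical readings for one study: both readers see BI-RADS 4.
    a, b = birads_to_label(4), birads_to_label(4)
    print(consensus_label(a, b))
```

Returning `None` instead of raising an error keeps excluded studies visible to the caller, which makes it easy to count how many studies were dropped at this stage.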

Figure 1 Example of mammography with a pathology: (A) an irregular-shaped lesion in the lower outer quadrant of left breast; (B) a lesion with the radial contour at the border of upper quadrants of the right breast.

Figure 2 Example of mammography without a pathology—breasts are symmetrical, without lesions, architectural distortion, and suspicious calcifications.

Stage 2: creating a list of clinical signs based on the literature review

At this stage, a literature review was carried out. Based on its results, the following additional clinical indicators were selected: the patient’s age at the time of the study, age at menarche, age at menopause, number of births, age at first birth, presence and duration of lactation, use of hormone replacement therapy, body mass index (BMI), cancer history, close relatives with breast cancer, mutations in the BRCA1 and BRCA2 genes, smoking, and alcohol abuse. These are risk factors for malignant breast neoplasms (19), and their link with the disease has been demonstrated in studies (20-23). Therefore, these clinical data should be considered in the diagnosis of breast cancer as signs that increase alertness for this pathology.

Stage 3: extracting clinical information

For each patient, the consultations with a primary care physician, gynecologist, and oncologist closest in date to the mammography were located in the examination section of the electronic medical record. The obtained information was used to fill out the table.

Stage 4: data processing

While working on the dataset, we identified a significant limitation: data were missing because doctors had not completely filled in patients’ medical histories and examination results in the electronic medical records in UMIAS. Hospital discharge summaries turned out to be the most informative source when a patient was hospitalized after the examination. The completeness of electronic medical records is important when clinical data are to be used in a study (24). Alwhaibi et al. (25) also paid attention to the completeness of electronic medical records in their paper. The analysis of electronic medical records showed that the data were 91.0% complete for outpatient visits and 93.2% complete for inpatient visits.

Sparsely populated parameters reduce the quality of a dataset and limit its further use in scientific research or in AI training and testing. Therefore, the dataset should be checked for omissions by calculating the percentage of filled-in values for each feature. The permissible proportion of omissions is specified in the technical assignment for a study and is determined by the study objectives, disease prevalence, and uniqueness of the clinical parameters. In medical practice, data gaps are unfortunately inevitable, and it is up to the researchers to decide how to deal with insufficiently filled parameters. It is possible to remove parameters with many omissions or to increase the sample size at stage 1 in order to eventually collect the required amount of data (17,26).

Thus, the incompleteness of the information in UMIAS required adjustments to the dataset. In our case, clinical parameters that were less than 80% complete were removed, in accordance with the technical assignment for the study. After data processing, three parameters remained in the partition table: the patient’s age at the time of the study, the age at menopause, and the number of births. Of these, UMIAS contained the age at the time of the study for 100% of patients, the age at menopause for 89%, and the number of births for 84%.
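The completeness check described in this stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the study: the row layout (one dictionary per patient, `None` for a missing value), the column names, and the toy values are all hypothetical; only the 80% threshold comes from the text.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def completeness(rows: List[Row], features: List[str]) -> Dict[str, float]:
    """Return the percentage of non-missing values for each feature."""
    total = len(rows)
    if total == 0:
        raise ValueError("the table is empty")
    return {
        feature: 100.0 * sum(1 for row in rows if row.get(feature) is not None) / total
        for feature in features
    }


def keep_features(shares: Dict[str, float], threshold: float = 80.0) -> List[str]:
    """Keep features filled in for at least `threshold` percent of patients."""
    return [feature for feature, share in shares.items() if share >= threshold]


# Toy table with hypothetical values; None marks a value absent from the record.
rows: List[Row] = [
    {"age": 52, "age_at_menopause": 50, "age_at_menarche": None},
    {"age": 61, "age_at_menopause": 55, "age_at_menarche": 13},
    {"age": 67, "age_at_menopause": None, "age_at_menarche": None},
    {"age": 66, "age_at_menopause": 48, "age_at_menarche": None},
    {"age": 60, "age_at_menopause": 55, "age_at_menarche": None},
]
shares = completeness(rows, ["age", "age_at_menopause", "age_at_menarche"])
kept = keep_features(shares)

if __name__ == "__main__":
    for feature, share in shares.items():
        print(f"{feature}: {share:.0f}% filled")
    print("kept:", kept)
```

The threshold is a parameter rather than a constant because, as noted above, the permissible proportion of omissions is set by the technical assignment of each study.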

The data presented in this study are available on request from: https://mosmed.ai/datasets/mosmeddata-mmg-s-nalichiem-i-otsutstviem-priznakov-zlokachestvennih-novoobrazovanii-molochnoi-zhelezi-obogaschennii-klinicheskoi-informatsiei/ (accessed on January 23, 2025).

Results

As a result of this work, the dataset was enriched with three clinical parameters: age at the time of the study, age at menopause, and number of births. In addition, columns with International Classification of Diseases (ICD) diagnosis codes and the BI-RADS category assigned by a radiologist based on the screening mammography results (1, 2, or 0) were added. According to the guidelines for screening mammography (13), if a malignant neoplasm is suspected (BI-RADS 3 and above), a BI-RADS category of 0 is assigned and the patient is referred for further examination and consultation with a mammologist. Therefore, our dataset contains only BI-RADS categories 1, 2, and 0.

A table with the clinical data of 200 patients was generated. It contains the following columns: anonymized study number, presence/absence of pathology, BI-RADS score, diagnosis (ICD code), age at the time of the study, age at menopause, and number of births (Table 2). The resulting dataset was registered and published (27) in the public domain.

Table 2

Fragment of the resulting table

Anonymized study number	Pathology	BI-RADS score	Diagnosis (ICD code)	Age at the time of the study (years)	Age at menopause (years)	Number of births
1.2.643.5.1.13.13.12.2.77.8252.0502	0	2	N60.1	52	37	1
1.2.643.5.1.13.13.12.2.77.8252.1211	1	0	C50.8	61	55	2
1.2.643.5.1.13.13.12.2.77.8252.1004	1	0	C50.4	67	56	2
1.2.643.5.1.13.13.12.2.77.8252.1401	1	0	C50.4	66	48	1
1.2.643.5.1.13.13.12.2.77.8252.0215	1	0	C50.4	60	55	2
1.2.643.5.1.13.13.12.2.77.8252.1206	1	0	C50.8	65	51	1
1.2.643.5.1.13.13.12.2.77.8252.1403	1	0	C50.4	65	56	0
1.2.643.5.1.13.13.12.2.77.8252.0413	1	0	C50.4	51	50	1
1.2.643.5.1.13.13.12.2.77.8252.0806	1	0	C50.4	68	52	2
1.2.643.5.1.13.13.12.2.77.8252.0110	1	0	C50.4	69	50	0

BI-RADS, Breast Imaging Reporting and Data System; ICD, International Classification of Diseases.

A statistical analysis of the dataset was carried out. A Spearman correlation coefficient matrix was calculated (Table 3), since this method can be applied to both the quantitative and the categorical data in the resulting dataset.

Table 3

Spearman’s correlation coefficient matrix

Variable	Pathology	Age at the time of study	Age at menopause	Late menopause	Number of births
Pathology	1.0000	0.1521	0.1509	0.1679	–
Age at the time of study	0.1521	1.0000	0.1849	0.2264	–
Age at menopause	0.1509	0.1849	1.0000	0.7062	–
Late menopause	0.1679	0.2264	0.7062	1.0000	–
Number of births	–	–	–	–	1.0000

The values of the Spearman correlation coefficient presented in Table 3 indicate a statistically significant relationship between the features (P<0.05). Correlation strength was interpreted according to the Evans scale (28).
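For readers who wish to reproduce this kind of analysis, the following is a minimal standard-library sketch of the Spearman coefficient, computed as the Pearson correlation of average ranks. It is an illustration only: the software actually used in the study is not stated, the toy vectors are hypothetical, and no P value is computed here. The cut-offs in `evans_strength` follow the Evans scale as it is commonly quoted (very weak below 0.20, weak below 0.40, moderate below 0.60, strong below 0.80, very strong otherwise).

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def evans_strength(r: float) -> str:
    """Verbal interpretation of |r| on the Evans scale."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    # Hypothetical toy vectors: binary pathology label and age at study.
    pathology = [0, 0, 0, 1, 1, 1]
    age = [45, 52, 60, 58, 63, 67]
    r = spearman(pathology, age)
    print(f"r = {r:.4f} ({evans_strength(r)})")
```

Rows with a missing value in either variable would need to be dropped before calling `spearman`; the sketch leaves that step to the caller.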

Age

Our dataset revealed a very weak correlation (r=0.1521, P<0.05) between age and the presence of pathology, although the incidence of breast cancer increases with age among 30–80-year-old women (21). We assume this discrepancy resulted from the predominance of older patients in the dataset (age range: 40–70 years; mean age: 61 years). The absence of younger patients affected the representativeness of the dataset and the results of the statistical analysis. However, the dataset was based on screening mammography, which according to Russian and international recommendations (29) is performed for women over 40 years old, so younger patients could not be included.

The age at menopause

A very weak but statistically significant correlation with the presence of pathology was found (r=0.1509, P<0.05). A previous study (22) showed that late-onset menopause increases the risk of developing breast cancer, which is associated with the duration of estrogen exposure. Therefore, we added a “late menopause” column with a binary assessment of the age at menopause: “1” if the age at menopause is over 55 years, “0” otherwise. The threshold of 55 years was used because it corresponds to the clinical recommendations (30). The correlation between late menopause and pathology was also very weak but statistically significant (r=0.1679, P<0.05).

Number of childbirths

There was no correlation between the number of births and the presence of pathology in our dataset. However, a previous study (23) demonstrated that each childbirth reduces the relative risk of developing breast cancer by 7%.

A comparative analysis was also conducted using the Mann-Whitney test for the quantitative characteristics: age at the time of the study, age at menopause, and number of births (Table 4). Statistically significant differences were found between the groups with and without pathology for age at the time of the study and age at menopause (P<0.05).

Table 4

Comparative analysis: Mann-Whitney test

Variable	P value	Group 1 (n)	Group 2 (n)
Age at the time of study	0.0320	100	100
Age at menopause	0.0448	96	82
Number of births	0.1229	93	75

Group 1, with a pathology; Group 2, without a pathology.
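A comparison of this kind can be sketched with the Python standard library as follows. This is an illustration under assumptions, not the procedure used in the study: the sketch uses the two-sided normal approximation with tie correction and without continuity correction, whereas statistical packages may apply an exact method or a continuity correction and therefore return slightly different P values. The sample values are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided P) using the normal approximation with tie correction."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    # U for the first group from its rank sum; report the smaller of the two U values.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # all observations identical
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


if __name__ == "__main__":
    # Hypothetical ages at the time of study in the two groups.
    with_pathology = [61, 67, 66, 60, 65, 69]
    without_pathology = [52, 58, 49, 63, 55, 57]
    u_stat, p = mann_whitney_u(with_pathology, without_pathology)
    print(f"U = {u_stat}, P = {p:.4f}")
```

Missing values should be removed from each group before the call, which is why the group sizes in Table 4 differ between features.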

A comparative analysis was also conducted using the chi-squared test for the categorical feature, late menopause. Statistically significant differences were found between the groups with and without pathology (P=0.0251).
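The derivation of the binary late menopause feature and a chi-squared test on the resulting 2×2 table can be sketched as follows. The threshold (age at menopause over 55 years) is taken from the text; everything else is an assumption for illustration: the function names and toy ages are hypothetical, and the test is Pearson's chi-squared without Yates' continuity correction, since the text does not state whether a correction was applied.

```python
import math
from typing import List, Sequence, Tuple


def late_menopause(age_at_menopause: float) -> int:
    """Binary flag from the text: 1 if menopause occurred after age 55, else 0."""
    return 1 if age_at_menopause > 55 else 0


def contingency(flags_group1: Sequence[int], flags_group2: Sequence[int]) -> List[List[int]]:
    """Build a 2x2 table: rows are groups, columns are flag = 1 and flag = 0."""
    return [[sum(flags), len(flags) - sum(flags)] for flags in (flags_group1, flags_group2)]


def chi2_2x2(table: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pearson chi-squared statistic and P value (1 degree of freedom, no correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("the test is undefined when a row or column total is zero")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical ages at menopause in the groups with and without pathology.
    with_pathology = [55, 56, 48, 57, 51, 56, 58, 52]
    without_pathology = [37, 50, 52, 49, 56, 51, 47, 53]
    table = contingency(
        [late_menopause(age) for age in with_pathology],
        [late_menopause(age) for age in without_pathology],
    )
    statistic, p = chi2_2x2(table)
    print(table, f"chi2 = {statistic:.3f}, P = {p:.4f}")
```

With small expected cell counts, as in the toy data above, Fisher's exact test would usually be preferred; the sketch only shows the mechanics of the calculation.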

Despite a very weak correlation between the studied parameters and the presence of pathology, statistically significant differences were revealed between the groups of patients with and without pathology for the features of age at the time of study, age at menopause, and late menopause.

Discussion

Current studies demonstrate that AI-based software using datasets augmented with clinical parameters shows higher diagnostic accuracy (31) than algorithms using only mammographic images, and that such data can also improve prognostic models (32). However, it is almost impossible to find ready-made datasets enriched with clinical information in the public domain; they either contain insufficient clinical data or are not openly accessible.

Reference datasets should have the structure and characteristics necessary to enable the application of machine learning methods; they should also meet the objective from both a computer science and a medical perspective (33). Dataset preparation is an important aspect of the development and implementation of AI in clinical practice (26), which is emphasized in the Decree of the President of the Russian Federation of October 10, 2019 No. 490 “On the development of artificial intelligence in the Russian Federation” (34).

Therefore, we recommend using the approach consisting of the following steps when preparing similar datasets (Figure 3):

Selection of studies with/without signs of the chosen pathology. A result of the stage is a partition table based on the radiology study data.
Formation of a list of clinical signs based on the literature review results. A result is a list of additional clinical signs.
Extraction of clinical information. A result is a partition table augmented with clinical parameters.
Data processing. A result is a generated dataset.

Figure 3 Steps of a dataset preparation combining the patient’s radiological studies and clinical data from electronic medical records.

Using the dataset previously generated for validation of AI-based software as an example, we developed a methodology for enriching a dataset with clinical information for further use in other tasks. Since generating datasets is complex, expensive, and labor-intensive (33), the process can be optimized by modifying (enriching) existing datasets for reuse, which also complies with the FAIR principles (35). The stages and the clear algorithm for generating a dataset described in this paper save resources and contribute to the automatic generation of datasets that meet quality criteria.

The dataset generated in this study has a number of limitations for training and testing AI-based software or for addressing scientific questions: the sample size of 200 studies; the lack of ethnic and socio-economic information (not specified in electronic medical records); the age limit (screening mammography is performed on women over 40 years old); and the fact that the studies, although conducted in different clinics in Moscow, were performed on devices from the same manufacturer.

However, the purpose of this study was to create a methodology for preparing enriched datasets, which has been achieved. We assume that the developed methodology can be extrapolated to other modalities. A limitation of the method is its dependence on the completeness of the clinical information entered into the HIS.

Conclusions

We have developed an approach to generating a dataset with additional clinical parameters of patients from electronic medical records. The presented method enriches a dataset of radiological studies with clinical parameters, which saves resources and supports further use of the dataset. Using the developed methodology, a dataset containing mammographic images of 200 patients enriched with additional clinical information has been generated, registered, and published. The prepared dataset can be used for scientific research, as well as for training and testing AI-based software that evaluates not only mammographic images but also clinical information.

Acknowledgments

We would like to thank our translators Eugenia Lipkina and Andrei Romanov from State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department”, who helped correct the language of the article.

Footnote

Funding: This article was prepared by a group of authors as a part of the research and development effort titled “Development of a platform to generate datasets containing diagnostic imaging studies” (USIS No. 123031500003-8) in accordance with the Order No. 1196 dated December 21, 2022 “On approval of state assignments funded by means of allocations from the budget of the city of Moscow to the state budgetary (autonomous) institutions subordinate to the Moscow Healthcare Department for 2023 and the planned period of 2024 and 2025” issued by the Moscow Healthcare Department.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1689/coif). All authors report that this article was prepared by a group of authors as a part of the research and development effort titled “Development of a platform to generate datasets containing diagnostic imaging studies” (USIS No. 123031500003-8) in accordance with the Order No. 1196 dated December 21, 2022 “On approval of state assignments funded by means of allocations from the budget of the city of Moscow to the state budgetary (autonomous) institutions subordinate to the Moscow Healthcare Department for 2023 and the planned period of 2024 and 2025” issued by the Moscow Healthcare Department. The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and approved by the Independent Ethics Committee of Moscow Regional Office of the Russian Society of Radiologists and Radiographists (protocol #2/2020, dated February 20, 2020). Informed consent form was signed during the clinical examination.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Kaprin VV, Starinsky AO, Shakhzadova M. Malignant neoplasms in Russia in 2021 (morbidity and mortality). Publishing Solutions: Moscow Research Oncological Institute named after P.A. Herzen - a branch of the Federal State Budgetary Institution “National Medical Research Radiological Centre of the Ministry of Health of the Russian Federation”. Moscow, Russia; 2022:252. Available online: https://oncology-association.ru/wp-content/uploads/2022/11/zlokachestvennye-novoobrazovaniya-v-rossii-v-2021-g_zabolevaemost-i-smertnost.pdf
Breast Cancer Clinical Guidelines 2021. Available online: https://diseases.medelement.com/disease/paк-мoлoчнoй-жeлeзы-кp-pф-2021/16979 (accessed on January 10, 2025).
Zhang T, Tan T, Samperna R, Li Z, Gao Y, Wang X, Han L. YU Q, Beets-Tan RG, Mann RM. Radiomics and artificial intelligence in breast imaging: a survey. Artif Intell Rev 2023;56:857-92. [Crossref]
Order No. 404n of the Ministry of Health of the Russian Federation dated April 27, 2021 “On approval of a procedure for preventive medical check-ups for certain groups of the adult population”. Available online: http://publication.pravo.gov.ru/Document/View/0001202106300043 (accessed on January 10, 2025).
Martiniussen MA, Sagstad S, Larsen M, Larsen ASF, Hovda T, Lee CI, Hofvind S. Screen-detected and interval breast cancer after concordant and discordant interpretations in a population based screening program using independent double reading. Eur Radiol 2022;32:5974-85. [Crossref] [PubMed]
Vasilev YA, Tyrov IA, Vladzymyrskyy AV, Arzamasov KM, Shulkin IM, Kozhikhina DD, Pestrenin LD. Double-reading mammograms using artificial intelligence technologies: A new model of mass preventive examination organization. Digital Diagnostics 2023;4:93-104. [Crossref]
Yoon JH, Strand F, Baltzer PAT, Conant EF, Gilbert FJ, Lehman CD, Morris EA, Mullen LA, Nishikawa RM, Sharma N, Vejborg I, Moy L, Mann RM. Standalone AI for Breast Cancer Detection at Screening Digital Mammography and Digital Breast Tomosynthesis: A Systematic Review and Meta-Analysis. Radiology 2023;307:e222639. [Crossref] [PubMed]
Lauritzen AD, Rodríguez-Ruiz A, von Euler-Chelpin MC, Lynge E, Vejborg I, Nielsen M, Karssemeijer N, Lillholm M. An Artificial Intelligence-based Mammography Screening Protocol for Breast Cancer: Outcome and Radiologist Workload. Radiology 2022;304:41-9. [Crossref] [PubMed]
McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89-94. [Crossref] [PubMed]
Arzamasov KM, Vasilev YA, Vladzymyrskyy AV, Omelyanskaya OV, Bobrovskaya TM, Semenov SS, Chetverikov SF, Kirpichev YS, Pavlov NA, Andreychenko AE. The use of computer vision for the mammography preventive research. Russian Journal of Preventive Medicine 2023;26:117-23. [Crossref]
Liu J, Lei J, Ou Y, Zhao Y, Tuo X, Zhang B, Shen M. Mammography diagnosis of breast cancer screening through machine learning: a systematic review and meta-analysis. Clin Exp Med 2023;23:2341-56. [Crossref] [PubMed]
Zhang T, Tan T, Wang X, Gao Y, Han L, Balkenende L, D'Angelo A, Bao L, Horlings HM, Teuwen J, Beets-Tan RGH, Mann RM. RadioLOGIC, a healthcare model for processing electronic health records and decision-making in breast disease. Cell Rep Med 2023;4:101131. [Crossref] [PubMed]
Manuylova OO, Pavlova TV, Didenko VV, Smirnov IV, Abduraimov AB, Vasilev AY. Methodological recommendations on the use of the international BI-RADS system in mammographic examination. Moscow; 2017. Available online: https://telemedai.ru/documents/metod_rekom_po_ispolz_mezhdunar_sist_bi-rads_pri_mammograf_issled_ot_2017
Vasilev YA, Vladzymyrskyy AV, Omelyanskaya OV, Arzamasov KM, Chetverikov SF, Rumyantsev DA, Zelenova MA. Methodology for testing and monitoring artificial intelligence-based software for medical diagnostics. Digital Diagnostics 2023;4:252-67. [Crossref]
Vasiliev YA, Vladzymyrskyy AV, Arzamasov KM, Andreichenko AE, Gombolevskyy VA, Kulberg NS, Omelyanskaya OV, Pavlov NA, Reshetnikov RV, Sergunova KA, Sharova DE, Shulkin IM. Computer Vision in Diagnostic Radiology: The First Stage of the Moscow Experiment: Monograph, 2nd ed.; revised and expanded; Publishing Solutions: Moscow, Russia; 2023:376. Available online: https://telemedai.ru/biblioteka-dokumentov/kompyuternoe-zrenie-v-luchevoj-diagnostike-pervyj-etap-moskovskogo-eksperimenta
Bobrovskaya TM, Vasilev YA, Nikitin NY, Vladzimirskyy AV, Omelyanskaya OV, Chetverikov SF, Arzamasov KM. Sample size for assessing a diagnostic accuracy of AI-based software in radiology. Siberian Journal of Clinical and Experimental Medicine 2024;39:188-98.
Chetverikov SF, Arzamasov KM, Andreichenko AE, Novik VP, Bobrovskaya TM, Vladzimirsky AV. Approaches to Sampling for Quality Control of Artificial Intelligence in Biomedical Research. Sovrem Tekhnologii Med 2023;15:19-25.
Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475-83.
Xu H, Xu B. Breast cancer: Epidemiology, risk factors and screening. Chin J Cancer Res 2023;35:565-83.
Mao X, Omeogu C, Karanth S, Joshi A, Meernik C, Wilson L, Clark A, Deveaux A, He C, Johnson T, Barton K, Kaplan S, Akinyemiju T. Association of reproductive risk factors and breast cancer molecular subtypes: a systematic review and meta-analysis. BMC Cancer 2023;23:644.
Singletary SE. Rating the risk factors for breast cancer. Ann Surg 2003;237:474-82.
Collaborative Group on Hormonal Factors in Breast Cancer. Menarche, menopause, and breast cancer risk: individual participant meta-analysis, including 118 964 women with breast cancer from 117 epidemiological studies. Lancet Oncol 2012;13:1141-51.
Collaborative Group on Hormonal Factors in Breast Cancer. Breast cancer and breastfeeding: collaborative reanalysis of individual data from 47 epidemiological studies in 30 countries, including 50302 women with breast cancer and 96973 women without the disease. Lancet 2002;360:187-95.
Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 2013;46:830-6.
Alwhaibi M, Balkhi B, Alshammari TM, AlQahtani N, Mahmoud MA, Almetwazi M, Ata S, Basyoni M, Alhawassi T. Measuring the quality and completeness of medication-related information derived from hospital electronic health records database. Saudi Pharm J 2019;27:502-6.
Vasilev YA, Arzamasov KM, Vladzymyrskyy AV. Generation of a dataset for training and testing software based on artificial intelligence technology: Educational and methodological manual. Moscow: Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department; 2023:108. Available online: https://telemedai.ru/biblioteka-dokumentov/podgotovka-nabora-dannyh-dlya-obucheniya-i-testirovaniya-programmnogo-obespecheniya-na-osnove-tehnologii-iskusstvennogo-intellekta
MosMedData: MMG with a presence and absence of signs of breast malignant tumors, enriched with clinical information. Available online: https://mosmed.ai/datasets/mosmeddata-mmg-s-nalichiem-i-otsutstviem-priznakov-zlokachestvennih-novoobrazovanii-molochnoi-zhelezi-obogaschennii-klinicheskoi-informatsiei/ (accessed on January 10, 2025).
Koterov AN, Ushenkova L, Zubenkova E, Kalinina M, Biryukov A, Lastochkina E, Molodtsova D, Vaynson A. Strength of Association. Report 2. Graduations of Correlation Size. Medical Radiology and Radiation Safety 2019;64:12-24.
Ren W, Chen M, Qiao Y, Zhao F. Global guidelines for breast cancer screening: A systematic review. Breast 2022;64:85-99.
Menopause and menopausal state in a woman. Clinical Guidelines 2021. Available online: https://diseases.medelement.com/disease/мeнoпayзa-и-климaктepичecкoe-cocтoяниe-y-жeнщины-кп-pф-2021/16957 (accessed on January 10, 2025).
Trang NTH, Long KQ, An PL, Dang TN. Development of an Artificial Intelligence-Based Breast Cancer Detection Model by Combining Mammograms and Medical Health Records. Diagnostics (Basel) 2023;13:346.
Yala A, Mikhael PG, Strand F, Lin G, Smith K, Wan YL, Lamb L, Hughes K, Lehman C, Barzilay R. Toward robust mammography-based models for breast cancer risk. Sci Transl Med 2021;13:eaba4373.
Regulations for the preparation of datasets describing approaches to the formation of a representative data sample. Moscow: Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Healthcare Department; 2022:40. Available online: https://telemedai.ru/biblioteka-dokumentov/reglament-podgotovki-naborov-dannyh-s-opisaniem-podhodov-k-formirovaniyu-reprezentativnoj-vyborki-dannyh-chast-1-1
Decree of the President of the Russian Federation of October 10, 2019 No. 490 “On the development of artificial intelligence in the Russian Federation”. Available online: http://www.kremlin.ru/acts/bank/44731 (accessed on January 10, 2025).
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 2016;3:160018.

Cite this article as: Kazarinova V, Vasilev Y, Vladzymyrskyy A, Omelyanskaya O, Arzamasov K, Savkina E, Bobrovskaya T. Approach to a preparation of dataset combining digital mammographic images and patient clinical data from electronic medical records. Quant Imaging Med Surg 2025;15(4):3631-3640. doi: 10.21037/qims-24-1689
