Fine-tuned multimodal GPT-4o for generating diagnostic impressions in breast magnetic resonance imaging: insights into non-mass enhancement lesions
Original Article

Jiahuan Tang1, Xiaoqing Yu2, Di Kang2, Shengqin Luo3, Meihong Sheng2

1Department of Radiology, Affiliated Hospital of Jiaxing University, The First Hospital of Jiaxing, Jiaxing, China; 2Department of Radiology, Nantong First People’s Hospital, Southeast University, Affiliated Nantong Clinical College of Nantong University, Nantong, China; 3Nanhu Laboratory, Jiaxing, China

Contributions: (I) Conception and design: All authors; (II) Administrative support: M Sheng; (III) Provision of study materials or patients: J Tang, M Sheng; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: J Tang, S Luo; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Meihong Sheng, MD. Department of Radiology, Nantong First People’s Hospital, Southeast University, Affiliated Nantong Clinical College of Nantong University, No. 666, Shengli Road, Nantong 226000, China. Email: 5300198@ntu.edu.cn.

Background: Previous studies have reported limited performance of publicly available large language models for Breast Imaging Reporting and Data System (BI-RADS) categorization of breast lesions. However, their application to breast non-mass enhancement (NME) lesions has not been specifically investigated. This study evaluated the diagnostic capability of multimodally fine-tuned ChatGPT-4 omni [generative pre-trained transformer 4 omni (GPT-4o)] for breast NME lesions, focusing on the accuracy of generated diagnostic impressions.

Methods: In this retrospective study, magnetic resonance imaging (MRI) contrast-enhanced images, radiology reports, and clinical histories of 229 patients with breast NME lesions were collected. Six models were developed across three settings: zero-shot without fine-tuning (A1 and A2), single-modality fine-tuning (B and C), and multimodal fine-tuning (D and E). The models were evaluated for their ability to generate summarized imaging reports. Their diagnostic performance was compared against that of three radiologists with varying experience levels. Finally, the model outputs were comprehensively assessed by three additional experts using a five-point Likert scale.

Results: The zero-shot models achieved area under the receiver operating characteristic curve (AUC) of 0.67–0.68, slightly outperforming junior radiologists but remaining inferior to experienced radiologists. Among the fine-tuned models, the image-only model (Model B) showed the poorest performance [AUC, 0.56; 95% confidence interval (CI): 0.45–0.66], significantly inferior to the other models (all P≤0.047). In contrast, the text-only and multimodal fine-tuned models showed substantial performance improvements. Model E achieved the best performance, with an AUC of 0.81 (95% CI: 0.74–0.89), significantly outperforming the zero-shot, image-only, and text-only models (all P≤0.039). Model E also outperformed junior radiologists (P<0.001), while showing no significant difference compared with intermediate and senior radiologists (P=0.060 and 0.367, respectively). The average expert evaluation score for Model E was 4.94/5. Compared with senior radiologists, specificity increased by 14.0%, whereas sensitivity decreased by 5.3%. Agreement for model-generated BI-RADS assessments was moderate both within the model outputs (κ=0.47) and between the model and senior radiologists (κ=0.49).

Conclusions: Multimodally fine-tuned GPT-4o can generate clinically accurate diagnostic impressions for complex breast NME lesions with performance approaching that of experienced radiologists. The model can improve diagnostic specificity and may serve as a useful second reader for breast MRI interpretation, particularly for less experienced radiologists, although expert supervision remains necessary because sensitivity was slightly lower than that of senior radiologists.

Keywords: Multimodal generative pre-trained transformer 4 omni (multimodal GPT-4o); fine-tuning; breast; magnetic resonance imaging (MRI); non-mass enhancement (NME)


Submitted Nov 23, 2025. Accepted for publication May 15, 2026. Published online Jun 10, 2026.

doi: 10.21037/qims-2025-1-2523


Introduction

Breast cancer is the leading cause of cancer-related mortality among women (1). Non-mass enhancement (NME) represents a lesion category with complex pathology, often causing overlap in imaging features between benign and malignant conditions (2). Breast magnetic resonance imaging (MRI) is the most sensitive modality for detecting NME, providing detailed evaluation of lesions and surrounding structures (3). However, current Breast Imaging Reporting and Data System (BI-RADS) descriptors for morphological and hemodynamic changes remain insufficient to reliably distinguish benign from malignant lesions, making NME a major source of false-positive findings and an ongoing diagnostic challenge (4-6).

Recently, artificial intelligence, especially large language models (LLMs), has advanced rapidly, presenting significant opportunities for clinical diagnostics and breast imaging workflows (7-10). While text-based models cannot directly interpret images, multimodal LLMs (MLLMs) integrate natural language processing with imaging information, enabling automated medical report generation (11,12). This may improve NME lesion diagnosis and management. However, current MLLMs, which are predominantly trained on general-domain datasets, exhibit suboptimal performance in medical image analysis and, thus, require further optimization (13,14). Few-shot learning and domain-specific fine-tuning have demonstrated efficacy in enhancing model performance, enabling interpretations that approach expert-level accuracy even with limited annotated data (15-17). Nevertheless, complex diagnostic tasks, such as the evaluation of breast NME, which requires integrated analysis of morphological characteristics, kinetic parameters, and relevant clinical context, remain insufficiently investigated (18,19). Rigorous clinical validation studies are warranted to establish the diagnostic accuracy, reliability, and clinical applicability of MLLMs for breast imaging (20).

Here, a multimodal breast imaging dataset was used for multi-stage supervised fine-tuning (SFT) of generative pre-trained transformer 4 omni (GPT-4o). The performance of the fine-tuned models was compared with that of the zero-shot model to evaluate their ability to generate diagnostic impressions for breast NME lesions. We also explored the potential value and limitations of MLLMs in breast image interpretation and radiological language generation. By comparing model outputs with interpretations from radiologists of varying expertise, the study assessed the clinical feasibility and collaborative potential of MLLMs in diagnostic decision-making. We present this article in accordance with the STARD-AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2523/rc).


Methods

Patient data

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics board of Nantong First People’s Hospital (No. 2021KT167). Due to the retrospective nature of the study, the requirement for informed consent was waived. All consecutive patients who underwent breast MRI for breast lesion evaluation at Nantong First People’s Hospital between January 2014 and July 2024 were included, if their diagnosis was confirmed by surgery or biopsy. Indications for MRI included high-risk breast cancer screening, evaluation of suspicious nipple discharge or skin changes with negative conventional imaging, and adjunctive assessment for equivocal mammography or ultrasound findings.

The exclusion criteria were: (I) poor image quality or inconsistent imaging parameters (n=19); (II) incomplete clinical data (n=14); (III) known malignant lesions classified as BI-RADS category 6 (n=33); (IV) an interval of more than one month between imaging and histological examination or surgery (n=13); and (V) lesions classified as mass lesions (n=698). Ultimately, 229 patients with NME lesions were included, comprising a total of 233 lesions (4 patients had 2 lesions each, one in each breast). These included 106 benign and 127 malignant lesions. Clinical histories, including symptoms such as breast redness, pain, nipple discharge, nipple or skin changes, and their duration, were recorded for all patients. Figure 1 illustrates the patient inclusion and exclusion process. Table 1 summarizes the patients’ pathological results.

Figure 1 Patient inclusion and exclusion flowchart. BI-RADS, Breast Imaging-Reporting and Data System; MRI, magnetic resonance imaging; NME, non-mass enhancement.

Table 1

The pathological types of NME lesions

Group Value (%)
Malignant (n=127)
   Ductal carcinoma in situ 26 (20.47)
   Lobular carcinoma in situ 1 (0.79)
   Invasive ductal carcinoma 62 (48.82)
   Invasive carcinoma with ductal carcinoma in situ 15 (11.81)
   Ductal carcinoma in situ with invasive components 18 (14.17)
   Invasive lobular carcinoma 2 (1.57)
   Invasive special-type carcinoma 3 (2.36)
Benign (n=106)
   Adenosis 35 (33.02)
   Adenosis + atypical hyperplasia 5 (4.72)
   Adenosis + fibroadenoma 9 (8.49)
   Adenosis + inflammation 7 (6.60)
   Inflammation 44 (41.51)
   Intraductal papilloma (or intraductal papillomatosis) 6 (5.66)

NME, non-mass enhancement.

Breast MRI acquisitions

All breast MRI examinations were performed using a Siemens 3.0-T MRI scanner (Verio; Siemens) equipped with a 16-channel phased-array breast coil. Patients were positioned prone, head first into the scanner bore. The scanning protocol included routine MRI sequences, diffusion-weighted imaging (DWI), and dynamic contrast-enhanced (DCE) imaging. Sequences were as follows: (I) T2-weighted imaging, axial inversion recovery sequence; (II) non-fat-saturated T1-weighted imaging (T1WI), axial non-fat-saturated three-dimensional (3D) spoiled gradient echo sequence; (III) DWI, axial echo-planar imaging sequence with b-values of 50, 400, and 800 s/mm²; and (IV) DCE-MRI, axial fat-saturated 3D spoiled gradient echo sequence. For DCE-MRI, a high-pressure injector was used to administer the contrast agent [gadolinium-diethylenetriamine pentaacetic acid (Gd-DTPA)] at a flow rate of 2 mL/s (15–20 mL), followed by an equal volume of saline. The injection was completed within 25 s, triggering a scan performed at six time points, with 1-min acquisition intervals. These included one pre-contrast scan and five post-contrast scans.

Image analysis

Two radiologists (Readers 1 and 2, with 5 and 10 years of breast MRI experience, >500 cases each) independently reviewed original and processed MRI images, including DCE T1 subtraction, maximum intensity projections, and sagittal reconstructions, blinded to pathological outcomes. Regions of interest were manually delineated on apparent diffusion coefficient (ADC) maps within the solid portion of each lesion, avoiding cystic and necrotic areas, and mean ADC values were calculated. A time-signal intensity curve (TIC) was generated from the region of maximal enhancement on DCE-MRI. Lesion type and radiological features were determined per the 2013 fifth edition of BI-RADS (21) and classified as masses, non-mass, or foci. NME was defined as enhanced areas distinct from surrounding tissue, lacking a space-occupying effect and often intermingled with fat or normal fibroglandular tissue. Radiological assessments included distribution pattern, internal enhancement, TIC type, mean ADC, and other pertinent findings. Disagreements were adjudicated by a third radiologist (Reader 3, 15 years of experience, >1,000 breast MRI cases).

Multi-stage fine-tuning model construction

Data preparation and preprocessing

Patients who underwent breast MRI between January 2014 and December 2021 were first screened. Those with concordant benign or malignant diagnoses and pathological results were assigned to Dataset 1. Patients who underwent breast MRI between January 2022 and July 2024 were subsequently assigned to Dataset 2. Dataset 1 comprised the corresponding imaging dataset Itr and textual dataset Ttr, which included radiology descriptions and diagnostic conclusions provided by three readers. Dataset 2 included the corresponding imaging dataset Its, a textual dataset Tts-dsc (radiology descriptions) provided by three readers, and a textual dataset Tts-cnc (diagnostic conclusions) provided by Reader 3. Cases from Dataset 1 were included in the training set, comprising 119 patients with 119 lesions (49 benign and 70 malignant). Subsequently, Dataset 2 was used as the test set, including 110 patients with 114 lesions (57 benign and 57 malignant, 4 patients had bilateral lesions). In addition, all patients’ chief complaints and previous imaging results were available to all radiologists and the MLLM.
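
The temporal split and the reported case counts can be cross-checked with a short sketch. This is an illustration only, not the authors' code: the function name `assign_dataset`, the day-level boundaries (the text gives months only), and the `COUNTS` layout are assumptions, and the additional concordance criterion for Dataset 1 is not modeled.

```python
from datetime import date

# Assumed day-level boundaries; the text states months only.
STUDY_START = date(2014, 1, 1)
TRAIN_END = date(2021, 12, 31)
STUDY_END = date(2024, 7, 31)


def assign_dataset(exam_date: date) -> str:
    """Assign an examination to Dataset 1 (training) or Dataset 2 (test) by date."""
    if not (STUDY_START <= exam_date <= STUDY_END):
        raise ValueError("examination date lies outside the study period")
    return "Dataset 1" if exam_date <= TRAIN_END else "Dataset 2"


# Patient and lesion counts as reported in the text.
COUNTS = {
    "Dataset 1": {"patients": 119, "benign": 49, "malignant": 70},
    "Dataset 2": {"patients": 110, "benign": 57, "malignant": 57},
}


def totals(counts: dict) -> dict:
    """Sum patients and lesions over both datasets."""
    patients = sum(c["patients"] for c in counts.values())
    benign = sum(c["benign"] for c in counts.values())
    malignant = sum(c["malignant"] for c in counts.values())
    return {
        "patients": patients,
        "benign": benign,
        "malignant": malignant,
        "lesions": benign + malignant,
    }
```

Summing the two datasets reproduces the cohort totals given earlier (229 patients; 233 lesions, of which 106 were benign and 127 malignant).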

A standardized image input protocol was established to ensure consistency and efficiency. Because GPT-4o accepts only a limited number of image inputs and the most prominent NME slice is difficult to identify in 3D data, five to ten consecutive axial slices depicting the lesion were annotated. NumPy was used to count the pixels within the delineated regions and to identify the two adjacent slices with the largest lesion area. All images were cropped to 128×128 pixels (JPG format) centered on the lesion’s geometric center, and ten images covering all five contrast-enhanced phases were selected for each lesion.
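
A minimal NumPy sketch of this selection and cropping step is given below. It is not the authors' script: the function names, the array layouts, the reading of "two adjacent slices with the largest lesion area" as the adjacent pair with the largest combined area, the use of the combined mask centroid as the geometric center, and the zero padding at image borders are all assumptions made for illustration.

```python
import numpy as np


def pick_adjacent_slices(masks: np.ndarray) -> tuple:
    """masks: (n_slices, H, W) binary lesion annotations.

    Returns the indices of the adjacent slice pair whose combined
    annotated pixel count is largest (assumed interpretation).
    """
    areas = masks.reshape(masks.shape[0], -1).sum(axis=1)
    pair_area = areas[:-1] + areas[1:]
    i = int(np.argmax(pair_area))
    return i, i + 1


def crop_centered(image: np.ndarray, mask: np.ndarray, size: int = 128) -> np.ndarray:
    """Crop a size x size window centered on the mask centroid."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(round(float(ys.mean()))), int(round(float(xs.mean())))
    half = size // 2
    # Zero-pad so that windows near the image border keep the full size.
    padded = np.pad(image, half, mode="constant")
    # After padding, original pixel (cy, cx) sits at (cy + half, cx + half),
    # so the window starting at (cy, cx) is centered on it.
    return padded[cy:cy + size, cx:cx + size]


def build_model_inputs(dce: np.ndarray, masks: np.ndarray, size: int = 128) -> np.ndarray:
    """dce: (n_phases, n_slices, H, W) post-contrast volumes.

    Returns (n_phases * 2, size, size) crops: two slices per phase.
    """
    i, j = pick_adjacent_slices(masks)
    combined = masks[i].astype(bool) | masks[j].astype(bool)
    crops = [
        crop_centered(dce[p, s], combined, size)
        for p in range(dce.shape[0])
        for s in (i, j)
    ]
    return np.stack(crops)
```

With five post-contrast phases and two slices per phase, `build_model_inputs` yields the ten images per lesion described above.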

Multi-stage fine-tuning

All data were uploaded to the OpenAI cloud server (22), with the model required to output diagnostic conclusions and management recommendations in accordance with the specified BI-RADS categories. All text and image data were de-identified before upload, and no direct patient identifiers were included in the submitted data. The study protocol, including third-party cloud-based model processing, was reviewed and approved by the institutional ethics committee. Access to the processed data and model workflow was restricted to authorized study personnel. Data processing was conducted under applicable data-processing arrangements. The training set was used for model parameter optimization, while the test set was used for model performance evaluation. Prior to fine-tuning, the tasks were performed directly using the native pretrained GPT-4o, with either radiology reports alone (Model A1) or combined reports and imaging inputs (Model A2). The multi-stage fine-tuning approach was applied to different modalities and datasets in sequence. Models B, C, D, and E underwent SFT for five epochs, with a batch size of 5 and a learning-rate multiplier of 2; all other optimization settings were automatically managed by the training platform. Specifically, Models B and C were fine-tuned using a single modality only, with Model B trained and evaluated on radiologic images and Model C trained and evaluated on textual radiology reports. Model D was fine-tuned sequentially, first on text data and then on paired text-image data, and evaluated using multimodal inputs. Model E was jointly fine-tuned on paired text and image data and evaluated using multimodal inputs. In detail, Model B was fine-tuned using Itr, Model C was fine-tuned using Ttr, and Models D and E were fine-tuned using both Itr and Ttr. Models A1 and C were tested using Tts-dsc; Model B was tested using Its; and Models A2, D, and E were tested using Its and Tts-dsc.
An overview of the six models, including their training strategies and input formats, is provided in Figure 2 and Table 2. We also initially explored a zero-shot, image-only setting using the original GPT-4o model. However, because the model consistently refused to generate diagnostic responses under this setting, it was not included in the final comparative analysis.
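
The six configurations can be written down as a small data structure, which makes the training stages and test inputs easy to compare. This is a sketch for orientation only: the class `ModelConfig`, its field names, and the helper functions are hypothetical, and the hyperparameter key names are assumed labels for the three settings reported in the text rather than a record of the actual training calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    stages: tuple       # ordered fine-tuning stages; each stage lists its datasets
    test_inputs: tuple  # datasets supplied at evaluation


# Settings reported for Models B-E; key names are illustrative assumptions.
SFT_HYPERPARAMS = {"n_epochs": 5, "batch_size": 5, "learning_rate_multiplier": 2}

MODELS = [
    ModelConfig("A1", (), ("Tts-dsc",)),
    ModelConfig("A2", (), ("Its", "Tts-dsc")),
    ModelConfig("B", (("Itr",),), ("Its",)),
    ModelConfig("C", (("Ttr",),), ("Tts-dsc",)),
    ModelConfig("D", (("Ttr",), ("Itr", "Ttr")), ("Its", "Tts-dsc")),
    ModelConfig("E", (("Itr", "Ttr"),), ("Its", "Tts-dsc")),
]


def is_zero_shot(model: ModelConfig) -> bool:
    """A model with no fine-tuning stage is evaluated zero-shot."""
    return len(model.stages) == 0


def uses_images_at_test(model: ModelConfig) -> bool:
    """True when the test input includes the imaging dataset Its."""
    return "Its" in model.test_inputs
```

In this representation, the only difference between Models D and E is the number of stages: D has a text-only stage followed by a paired stage, whereas E has a single paired stage.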

Figure 2 Overview of model configurations and evaluation workflow. Models A1 and A2 are original (non-fine-tuned) GPT-4o models evaluated in a zero-shot setting, using either text-only radiology reports (Model A1) or combined text–image inputs (Model A2). Models B and C were fine-tuned using a single modality only, with Model B trained and evaluated on images and Model C trained and evaluated on text-only radiology reports. Model D is sequentially fine-tuned on text, then text–image pairs, and tested with multimodal inputs. Model E is jointly fine-tuned on paired text–image data and evaluated with multimodal inputs. Each model’s final output reflects the majority vote from three independent inferences, conducted with cleared context. AUC, area under the receiver operating characteristic curve; BI-RADS, Breast Imaging-Reporting and Data System; DCE-MRI, dynamic contrast-enhanced-magnetic resonance imaging; MR, magnetic resonance; NME, non-mass enhancement.

Table 2

Overview of model design and input formats

Model | Design | Training input | Test input
A1 | Zero-shot | None | Text-only
A2 | Zero-shot | None | Text-image
B | Single-modality fine-tuning (image-only) | Breast MRI images | Image-only
C | Single-modality fine-tuning (text-only) | Radiology reports | Text-only
D | Multimodal fine-tuning (sequential) | Text, then text-image pairs | Text-image
E | Multimodal fine-tuning (joint) | Text-image pairs | Text-image

MRI, magnetic resonance imaging.

Multi-reader testing and evaluation

For comparison with the diagnostic conclusions of Reader 3 (Tts-cnc), two additional readers (Readers 4 and 5), representing intermediate and junior radiologists with 8 and 3 years of experience and 600 and 200 breast MRI interpretations, respectively, independently reviewed all MRI images in the test set. They were informed only of the lesion location (affected side and breast region) before providing radiological conclusions. To assess the language quality of model outputs, three readers uninvolved in the original dataset independently evaluated nine items, grouped into four domains, for all six models using a five-point Likert scale (23) (1= strongly disagree; 5= strongly agree). They evaluated only the quality of the outputs, regardless of the producing model.

Statistical analyses

Each model independently generated diagnostic results for the test set three times to assess stability. To eliminate bias from interactions during multiple rounds of conversation, an automated script was implemented to initiate a new chat session for each query and clear all previous session data. The final BI-RADS category was determined by majority voting; if all three outputs differed, the median category was used. Final BI-RADS assignments were used to calculate sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). Diagnostic performance was also evaluated for Readers 3, 4, and 5.
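
The voting rule and the derived metrics can be sketched as follows. This is an illustrative implementation, not the study's analysis script: the function names are hypothetical, and the binarization threshold (BI-RADS 4 or higher counted as a positive call) is an assumption, since the text does not state the cut-off. The AUC is computed here as the rank-based (Mann-Whitney) probability that a malignant lesion receives a higher category than a benign one, with ties counted as one half.

```python
from collections import Counter
from statistics import median


def final_birads(runs: list) -> int:
    """Combine three BI-RADS outputs: majority vote, median if all differ."""
    category, count = Counter(runs).most_common(1)[0]
    if count >= 2:
        return category
    return int(median(runs))


def binary_metrics(birads: list, malignant: list, threshold: int = 4) -> dict:
    """Sensitivity, specificity, and accuracy after thresholding.

    threshold=4 is an assumed cut-off, not taken from the text.
    """
    pred = [b >= threshold for b in birads]
    truth = [bool(t) for t in malignant]
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))
    fp = sum(p and (not t) for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(truth),
    }


def auc(birads: list, malignant: list) -> float:
    """Rank-based AUC using the ordinal BI-RADS category as the score."""
    pos = [b for b, t in zip(birads, malignant) if t]
    neg = [b for b, t in zip(birads, malignant) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `final_birads([4, 4, 5])` returns 4 by majority, and `final_birads([3, 4, 5])` returns 4 as the median of three differing outputs.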

Statistical analyses were performed using IBM SPSS Statistics (version 27.0; IBM). Continuous variables are expressed as mean ± standard deviation and categorical variables as frequencies and percentages. Group comparisons were made using the χ2 test or Fisher exact test. Intraobserver (intra-model) agreement across the three outputs of each model was assessed with the Fleiss κ coefficient, and interobserver agreement between each model and the senior radiologist (Reader 3) was assessed with the Cohen κ coefficient, both with 95% confidence intervals (CIs). κ values were interpreted as follows: ≤0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good; and 0.81–1.00, excellent. Differences in quality scores among the six models were analyzed using the Kruskal-Wallis H test; when significant, pairwise comparisons were performed using the Dunn test with Bonferroni correction. The DeLong test was used to compare AUCs between each model and the radiologists. A two-sided P<0.05 indicated statistical significance.
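
For readers less familiar with the two agreement statistics, the sketch below shows how they are computed from raw ratings and how a κ value maps onto the interpretation bands. The analyses in this study were run in SPSS; this plain-Python version is only an illustration, the function names are hypothetical, confidence intervals are omitted, and the first band is assumed to include 0.20.

```python
from collections import Counter


def cohen_kappa(a: list, b: list) -> float:
    """Cohen kappa for two raters (e.g., one model vs. Reader 3).

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings: list) -> float:
    """Fleiss kappa; ratings[i] holds the m labels given to case i
    (e.g., the three outputs of one model for that lesion)."""
    n_cases, m = len(ratings), len(ratings[0])
    per_case = []
    totals = Counter()
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - m
        per_case.append(agree / (m * (m - 1)))
    p_bar = sum(per_case) / n_cases
    p_exp = sum((c / (n_cases * m)) ** 2 for c in totals.values())
    return (p_bar - p_exp) / (1 - p_exp)


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the bands used in this article."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"
```

Under these bands, a κ of 0.47 is read as moderate and a κ of 0.86 as excellent, while negative values fall in the poor band.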


Results

Patient baseline characteristics and imaging features

A total of 229 female patients were included in this study (mean age, 46.7±13.9 years; range, 25–81 years). Univariate logistic regression revealed significant differences in tumor distribution, internal enhancement, and TIC type between benign and malignant lesions. Detailed lesion characteristics are summarized in Table 3.

Table 3

Comparison of MRI features between benign and malignant NME lesions

MRI features Malignant (n=127) Benign (n=106) P value
Distribution 0.032
   Focal 27 (21.26) 30 (28.30)
   Linear 5 (3.94) 5 (4.72)
   Segmental 63 (49.61) 38 (35.85)
   Regional 19 (14.96) 12 (11.32)
   Multiple regions 8 (6.30) 19 (17.92)
   Diffuse 5 (3.94) 2 (1.89)
Internal enhancement patterns <0.001
   Homogeneous 8 (6.30) 7 (6.60)
   Heterogeneous 66 (51.97) 45 (42.45)
   Clumped 36 (28.35) 13 (12.26)
   Clustered ring 17 (13.39) 41 (38.68)
TIC <0.001
   Persistent 9 (7.09) 45 (42.45)
   Plateau 45 (35.43) 36 (33.96)
   Washout 73 (57.48) 25 (23.58)

Data are presented as n (%). MRI, magnetic resonance imaging; NME, non-mass enhancement; TIC, time-signal intensity curve.

Comparison of diagnostic performance

  • Comparison among radiologists: diagnostic performance varied with experience level among radiologists. Intermediate and senior radiologists outperformed junior radiologists (AUC 0.75 vs. 0.60, P<0.001; AUC 0.78 vs. 0.60, P<0.001), while no significant difference was observed between intermediate and senior radiologists (P=0.160).
  • Comparison among models: among the six models, Model B showed the poorest performance and was significantly inferior to the other five models. Excluding Model B, the diagnostic performance of the remaining models (A1, A2, and C–E) improved progressively, although differences between adjacent models were not statistically significant. Model C outperformed Model A1, Model D outperformed both Models A1 and A2, and Model E outperformed Models A1, A2, and C, with these differences being statistically significant (Figure 3). Among all models, Model E achieved the highest accuracy and specificity (78.1% and 61.4%), whereas Models A1 and C exhibited the highest sensitivity (96.5%) (Figure 4).
    Figure 3 Results of DeLong’s test for comparing AUCs among different models and radiologists. Models: A1, zero-shot original GPT-4o using text-only reports; A2, zero-shot original GPT-4o using text–image inputs; B, image-only fine-tuned model; C, text-only fine-tuned model; D, sequentially fine-tuned model using text followed by text–image pairs; E, jointly fine-tuned model using paired text–image data. Readers: R3, senior radiologist (15 years, >1,000 breast MRI interpretations); R4, intermediate radiologist (8 years, 600 interpretations); R5, junior radiologist (3 years, 200 interpretations). AUC, area under the receiver operating characteristic curve.
    Figure 4 Radar chart of diagnostic performance comparison. The radial range for each performance metric was defined according to the observed minimum and maximum values across all observers for that metric, with consistent tick intervals within each axis; each metric should therefore be interpreted within its own scale. AUC, area under the receiver operating characteristic curve; FT, fine-tuning.
  • Cross-comparison between radiologists and models: zero-shot models (A1 and A2) performed slightly better than junior radiologists (AUC 0.67 and 0.68 vs. 0.60, P>0.05), but were inferior to intermediate and senior radiologists (P<0.05). Fine-tuned Model B showed performance comparable to that of junior radiologists (AUC 0.56 vs. 0.60; P>0.05), but inferior to that of intermediate- and senior-level radiologists (P<0.05). Fine-tuned Models C, D, and E significantly outperformed junior radiologists (P<0.05) and showed no significant difference compared with intermediate and senior radiologists (P>0.05). Model E attained the highest performance (AUC 0.81) (Figure 3). To visually demonstrate the case-level prediction consistency among different models and their differences compared with radiologists, we further plotted a case-level heatmap (Figure 5).
    Figure 5 Case-level heatmaps showing BI-RADS classifications generated by the zero-shot and multi-stage fine-tuned models based on GPT-4o, benchmarked against histopathologically confirmed ground truth (n=114). The top row (GT) represents reference diagnoses confirmed by surgical pathology or histopathologic biopsy. The bottom row shows BI-RADS scores independently assessed by the senior breast radiologist after full review of clinical reports and imaging studies. Color intensity reflects the predicted BI-RADS score, ranging from 2 to 5. Models: A1, zero-shot original GPT-4o using text-only reports; A2, zero-shot original GPT-4o using text–image inputs; B, image-only fine-tuned model; C, text-only fine-tuned model; D, sequentially fine-tuned model using text followed by text–image pairs; E, jointly fine-tuned model using paired text–image data. BI-RADS, Breast Imaging-Reporting and Data System; GPT-4o, generative pre-trained transformer 4 omni.

Table S1 summarizes comparative diagnostic performance among radiologists with different experience levels, among all models, and between radiologists and models.

Evaluation of radiological impressions generated by models

Three independent radiologists evaluated the radiological impressions generated by each model using a 5-point Likert scale. Across three independent assessments, Model B received the lowest scores, which were significantly lower than those of Models A1, C, D, and E (P<0.05), but did not differ significantly from those of Model A2 (P>0.05). Model E achieved the highest scores in the first two assessments and the second-highest score in the third (4.93, 4.95, and 4.94), which were significantly higher than those of Models A1, A2, and D (P<0.05). Model C also achieved high scores (4.88, 4.92, and 4.97), with no significant differences compared to Model E (P>0.05) (Table 4). Figure 6 presents a comprehensive evaluation of different models across various domains. Figure 7 shows confusion matrices for the six models and the senior radiologist, complementing the case-level heatmap by summarizing the overall classification results in the test set.

Table 4

Likert scale assessment results

Model | Output 1: score | Output 1: P value | Output 2: score | Output 2: P value | Output 3: score | Output 3: P value
A1 | 4.72 (4.69–4.75) | | 4.75 (4.72–4.78) | | 4.74 (4.71–4.77) |
A2 | 4.57 (4.56–4.59) | vs. A1: * | 4.57 (4.56–4.59) | vs. A1: * | 4.57 (4.56–4.59) | vs. A1: *
B | 4.53 (4.50–4.56) | vs. A1: *; vs. A2: 0.267 | 4.48 (4.45–4.51) | vs. A1: *; vs. A2: 0.056 | 4.48 (4.45–4.51) | vs. A1: *; vs. A2: 0.084
C | 4.88 (4.85–4.92) | vs. A1: *; vs. A2: *; vs. B: * | 4.92 (4.89–4.95) | vs. A1: *; vs. A2: *; vs. B: * | 4.97 (4.96–4.98) | vs. A1: *; vs. A2: *; vs. B: *
D | 4.86 (4.83–4.89) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: 0.074 | 4.89 (4.86–4.92) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: 0.052 | 4.86 (4.83–4.89) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: *
E | 4.93 (4.91–4.95) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: 0.125; vs. D: * | 4.95 (4.93–4.97) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: 0.152; vs. D: * | 4.94 (4.92–4.96) | vs. A1: *; vs. A2: *; vs. B: *; vs. C: 0.224; vs. D: *

Scores are presented as mean (95% CI). * denotes a statistically significant difference (P<0.05). A1, zero-shot GPT-4o using text-only reports; A2, zero-shot GPT-4o using text–image inputs; B, image-only fine-tuned model; C, text-only fine-tuned model; D, sequentially fine-tuned multimodal model; E, jointly fine-tuned multimodal model. CI, confidence interval.

Figure 6 Radar plots showing the performance profiles across four report-quality domains. The first row presents models A1–B using original scores with a radial range of 3.0–5.0, allowing preservation of the broader score distribution. The second row presents a zoomed-in view of the higher-performing models C–E using a restricted radial range of 4.5–5.0 to better visualize subtle inter-model differences. Because the two rows use different radial scales, the polygon sizes should be interpreted within, rather than across, rows. The domain of professional judgment encompasses specific diagnosis and differential diagnosis; the domain of knowledge accuracy includes scientific terminology and correctness; the domain of report structure consists of coherence, comprehensiveness, and management recommendations; and the domain of safety and ethics covers harmlessness and lack of bias. Models C and E achieved the highest scores, with no statistically significant difference between them. Although Models A1 and A2 received very similar scores, their performance characteristics differed qualitatively; for instance, A1 showed notable deficiencies in professional judgment, whereas A2 demonstrated clear shortcomings in report structure. Although the score of Model B was comparable to that of A2, it exhibited substantial deficiencies in knowledge accuracy. FT, fine-tuning.
Figure 7 Confusion matrices of benign-versus-malignant classification for the six models and the senior radiologist. Each matrix shows the numbers of true-positive, true-negative, false-positive, and false-negative cases in the test set. Models: A1, zero-shot original GPT-4o using text-only reports; A2, zero-shot original GPT-4o using text–image inputs; B, image-only fine-tuned model; C, text-only fine-tuned model; D, sequentially fine-tuned model using text followed by text–image pairs; E, jointly fine-tuned model using paired text–image data. LLM, large language model.

Comparison of diagnostic consistency

Diagnostic consistency was assessed using intra- and interobserver agreement analyses (Table 5). Model B showed poor intra-model agreement and poor agreement with senior radiologists. For benign-versus-malignant classification of NME lesions, Model C showed moderate intra-model agreement (κ=0.43), whereas the remaining models showed good to excellent intra-model agreement (κ=0.63–0.86). For BI-RADS classification, Model C showed fair intra-model agreement (κ=0.36), whereas the other models showed moderate to good intra-model agreement (κ=0.46–0.70). Across all tasks, all models other than Model B demonstrated fair to moderate agreement with senior radiologists (κ=0.39–0.60).

Table 5

Agreement analysis among LLMs and radiologists

Comparison BI-RADS classification [0–6], κ (95% CI) Benign-malignant classification [0, 1], κ (95% CI)
Intraobserver
   Model A1 0.70 (0.60–0.79) 0.76 (0.65–0.86)
   Model A2 0.67 (0.60–0.75) 0.86 (0.75–0.96)
   Model B −0.03 (−0.10 to 0.04) −0.07 (−0.18 to 0.03)
   Model C 0.36 (0.29–0.44) 0.43 (0.33–0.54)
   Model D 0.46 (0.39–0.52) 0.63 (0.53–0.74)
   Model E 0.47 (0.40–0.53) 0.72 (0.62–0.83)
Interobserver
   Model A1 vs. radiologists 0.42 (0.25–0.59) 0.42 (0.22–0.62)
   Model A2 vs. radiologists 0.42 (0.27–0.57) 0.41 (0.21–0.61)
   Model B vs. radiologists 0.02 (−0.11 to 0.14) 0.03 (−0.16 to 0.20)
   Model C vs. radiologists 0.39 (0.24–0.53) 0.47 (0.27–0.67)
   Model D vs. radiologists 0.42 (0.30–0.55) 0.56 (0.39–0.74)
   Model E vs. radiologists 0.49 (0.36–0.61) 0.60 (0.44–0.76)

A1, zero-shot GPT-4o using text-only reports; A2, zero-shot GPT-4o using text–image inputs; B, image-only fine-tuned model; C, text-only fine-tuned model; D, sequentially fine-tuned multimodal model; E, jointly fine-tuned multimodal model. BI-RADS, Breast Imaging-Reporting and Data System; CI, confidence interval; LLM, large language model; κ, kappa coefficient.


Discussion

We developed fine-tuned MLLMs to generate accurate diagnostic impressions for NME lesions and to explore the respective roles of imaging and textual information in this task. Our results indicate that the general-purpose model GPT-4o remains limited in pure breast MRI applications and relies heavily on textual information for diagnostic performance. First, Model B, which was fine-tuned on imaging data alone, still performed worse than the zero-shot models A1 and A2, both of which incorporated textual information. Second, the zero-shot models showed relatively limited performance with either standalone textual reports (Model A1) or combined text-and-image inputs (Model A2), with no significant difference between the two; by contrast, performance improved substantially once text was incorporated during fine-tuning. In addition to this text dependence, these findings may also reflect the foundation model’s limited domain-specific knowledge in breast imaging (18,24-26).

Given the previously observed failure of the foundation model in the image-only zero-shot diagnostic setting, we were unable to directly determine the independent value of image-only fine-tuning. We therefore developed a sequential multimodal model (Model D) and a jointly trained multimodal model (Model E) to further investigate the contributions of different modalities during fine-tuning. Compared with the text-only fine-tuned Model C, Model D did not show a statistically significant improvement, whereas Model E performed significantly better than Model C. These findings suggest that sequential fine-tuning may still be driven largely by textual information, whereas joint fine-tuning may better promote cross-modal feature integration and training efficiency, thereby allowing the incremental value of image-based fine-tuning to be more fully realized. The absence of a significant difference between Models D and E may reflect the limited sample size, or alternatively suggest that imaging information provides additional value only in a subset of cases. Further studies are warranted to clarify the effect of sample size on model performance (27).

Importantly, once text was incorporated into fine-tuning, all fine-tuned models outperformed the non-fine-tuned models, regardless of the specific fine-tuning strategy, and achieved performance comparable to that of intermediate- and senior-level radiologists. Specifically, under matched modality settings, Model C outperformed Model A1, whereas Models D and E both outperformed Model A2. Together, these findings suggest that task-specific fine-tuning on relatively small but reasonably well-annotated datasets can still meaningfully improve model performance in targeted clinical applications, consistent with previous reports (16,28-30).

Fine-tuned Model E achieved higher AUC, accuracy, and specificity than all other models and all three radiologists. Its improved specificity suggests potential value in reducing unnecessary biopsy recommendations or overcalling in selected cases, particularly when used as a second reader for less experienced radiologists. However, its sensitivity remained 5.3% lower than that of senior radiologists (Figure 5). These false-negative cases were primarily attributable to the misclassification of malignant lesions as benign adenosis, including 3 cases of ductal carcinoma in situ (DCIS), 1 with microinvasion, which showed either regional/multiregional distribution or less pronounced restricted diffusion on DWI. This pattern indicates that, despite its favorable specificity, the model should not be considered a standalone diagnostic tool, especially in clinically suspicious cases or when imaging findings are subtle, because missed malignancies remain a concern. Accordingly, its most appropriate role may be as an assistive tool within a human-artificial intelligence (AI) collaborative framework.
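The trade-off described above (higher specificity at the cost of some sensitivity) can be made concrete with a minimal sketch that derives sensitivity, specificity, and accuracy from binary benign/malignant calls. The helper name `diagnostic_metrics` and the labels below are hypothetical and purely illustrative; they are not the study's data.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = malignant, 0 = benign)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # malignant, called malignant
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # malignant, missed (false negative)
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, called benign
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, overcalled (false positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical reference pathology and model calls for ten lesions (not study data).
pathology = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_calls = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = diagnostic_metrics(pathology, model_calls)
print(metrics)
```

Each false negative lowers sensitivity only, and each false positive lowers specificity only, which is why a reader can gain specificity while still missing malignant lesions.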

In terms of model stability and quality scores, Model B, which was fine-tuned on imaging data alone, showed clear limitations, further suggesting that this strategy has limited practical value. Overall, the zero-shot models demonstrated relatively high intra-model agreement, but their mean quality scores were lower. In particular, Model A2 showed a marked decline in report structure performance (Figure 6). Apart from Model B, Model C showed the lowest intra-model agreement, despite achieving relatively high quality scores that even exceeded those of the multimodal fine-tuned Model D. Among all fine-tuned models, Model E achieved the highest agreement and quality scores. These findings suggest that foundation models pretrained on general knowledge can generate relatively safe and linguistically plausible statements (31), which may contribute to greater output consistency. However, their lower language quality also reflects insufficient domain-specific expertise in breast imaging. Fine-tuning on radiology report text alone may help the model learn relatively fixed reporting patterns, but it primarily establishes a text-to-summary mapping and does not explicitly model the correspondence between textual content and imaging evidence. This tendency may amplify biases in the source reports, thereby increasing the risk of hallucinations and inconsistencies (32,33). By contrast, joint fine-tuning on both text and images enables the model to directly integrate imaging findings with report content, making the training objective more closely aligned with the clinical task and allowing more robust feature learning through cross-modal verification. As a result, the summaries generated by this strategy were not only more consistent with the original imaging findings but also more similar to reports written by experienced radiologists (Figure 8).

Nevertheless, even the best-performing model achieved only moderate agreement with senior radiologists in BI-RADS assignment of NME lesions (κ=0.49). This level of agreement is similar to that reported in a previous study using ChatGPT to interpret contrast-enhanced breast MRI narrative reports, in which agreement with radiologists’ consensus was also moderate (AC1=0.49–0.52) (19). This finding should be interpreted in the context of breast MRI itself, where inter-reader agreement is often imperfect, particularly for complex NME characterization. Indeed, prior work has shown that inter-reader agreement for BI-RADS assignment of NME lesions is only moderate (α=0.38) (34), further highlighting the difficulty of this task and the remaining limitations of the model in complex breast MRI reasoning.
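Because the comparison above places Cohen's κ alongside Gwet's AC1, it may help to recall how the two differ. The following shows the standard two-rater, unweighted forms only, as a sketch; the cited studies may have used multi-rater or weighted variants, and the notation here (observed agreement $p_o$, marginal proportions $p_k^{(1)}$ and $p_k^{(2)}$ for raters 1 and 2, and $q$ categories) is introduced for illustration.

```latex
% Both coefficients share the same chance-corrected form and differ only in p_e.
\kappa = \frac{p_o - p_e^{\kappa}}{1 - p_e^{\kappa}},
\qquad
p_e^{\kappa} = \sum_{k=1}^{q} p_k^{(1)}\, p_k^{(2)}

\mathrm{AC}_1 = \frac{p_o - p_e^{\mathrm{AC}_1}}{1 - p_e^{\mathrm{AC}_1}},
\qquad
p_e^{\mathrm{AC}_1} = \frac{1}{q-1} \sum_{k=1}^{q} \pi_k \,(1 - \pi_k),
\qquad
\pi_k = \frac{p_k^{(1)} + p_k^{(2)}}{2}
```

Since the chance-agreement terms differ, numerically similar κ and AC1 values from different studies indicate broadly comparable, but not strictly identical, levels of agreement.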

Figure 8 Demonstration of GPT-4o-generated diagnostic impressions. (A) Schematic of the prompt strategy used for GPT-4o-based generation of diagnostic impressions. (B) Diagnostic impressions generated across different models. (C) Reference diagnostic impression provided by a senior radiologist. Summary statements are highlighted. ADC, apparent diffusion coefficient; BI-RADS, Breast Imaging-Reporting and Data System; DCE-MRI, dynamic contrast-enhanced-magnetic resonance imaging; GPT-4o, generative pre-trained transformer 4 omni; T1WI, T1-weighted imaging.

Limitations

Several limitations warrant consideration. First, image input was limited: owing to GPT-4o’s constraints and the invisibility of many NME lesions on ultrasound or mammography, this study relied solely on MRI contrast-enhanced images. The use of 2D images restricted lesion localization and access to 3D volumetric data, particularly for spatially complex NME lesions, as reflected by the clear deficiencies of the image-only fine-tuned model (Model B) in lesion localization and morphologic description. Future work should explore volumetric, sequence-level, multi-planar, and multimodality imaging inputs to better capture the full spectrum of features of breast NME lesions. Second, clinical data were incomplete: key variables such as hormonal status, menopausal status, family history, genetic risk, and prior breast cancer history were not available to the models. Their absence may have weakened model reasoning and also affected comparison with radiologists, who usually integrate such information in routine practice. Future studies should incorporate more comprehensive clinical data to better reflect real-world breast MRI interpretation. Third, sample selection bias was possible: GPT-4o analyzed only NME cases identified by radiologists and confirmed pathologically; however, some cases likely remained undetected due to masking by background parenchymal enhancement, absence of contrast in certain cancers, or technical issues such as motion artifacts (35). Their omission may have led to a slight overestimation of radiologists’ sensitivity. A further limitation is that model inference and fine-tuning involved cloud-based processing of de-identified clinical imaging data, which may still raise governance concerns in regulated clinical settings. Future work should explore locally deployable models to address this issue.

Finally, the retrospective single-center design, together with a relatively small sample size and lack of external validation, limits the generalizability of the findings and supports interpreting this study as a proof of concept. Moreover, the retrospective nature of the study obviated the need for a formal a priori power analysis and limited the interpretability of a post hoc power analysis; therefore, no power analysis was performed. Model performance may vary across scanners, institutions, acquisition protocols, and patient populations, because these factors can influence image appearance, lesion conspicuity, reporting style, and case mix. Further research is required to assess the model across diverse populations.


Conclusions

In conclusion, multimodal fine-tuning substantially improved the performance of GPT-4o for generating diagnostic impressions of breast NME lesions, with the jointly fine-tuned model showing the best overall balance of diagnostic accuracy, specificity, stability, and report quality. Our findings indicate that the general-purpose foundation model remains limited in pure breast MRI applications and relies heavily on textual information, whereas effective integration of imaging and text through joint fine-tuning can better unlock the clinical value of multimodal learning. Although the best-performing model achieved performance comparable to, and in some aspects exceeding, that of experienced radiologists, its residual false-negative risk indicates that it should be considered an assistive second reader rather than a replacement for physician expertise.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the STARD-AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2523/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2523/dss

Funding: This work was supported by Affiliated Hospital of Jiaxing University for Scientific Research (No. 2022-QMX-017), Jiangsu Health Commission for Scientific Research (No. F202037), and Nantong University for Scientific Research (No. 2024LY007).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2523/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics board of Nantong First People’s Hospital (No. 2021KT167). Individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, Jemal A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229-63. [Crossref] [PubMed]
  2. Chadashvili T, Ghosh E, Fein-Zachary V, Mehta TS, Venkataraman S, Dialani V, Slanetz PJ. Nonmass enhancement on breast MRI: review of patterns with radiologic-pathologic correlation and discussion of management. AJR Am J Roentgenol 2015;204:219-27. [Crossref] [PubMed]
  3. Xie Y, Zhang X. A risk prediction stratification for non-mass breast lesions, combining clinical characteristics and imaging features on ultrasound, mammography, and MRI. Front Oncol 2024;14:1337265. [Crossref] [PubMed]
  4. Clauser P, Krug B, Bickel H, Dietzel M, Pinker K, Neuhaus VF, Marino MA, Moschetta M, Troiano N, Helbich TH, Baltzer PAT. Diffusion-weighted Imaging Allows for Downgrading MR BI-RADS 4 Lesions in Contrast-enhanced MRI of the Breast to Avoid Unnecessary Biopsy. Clin Cancer Res 2021;27:1941-8. [Crossref] [PubMed]
  5. Baltzer PAT, Kaiser WA, Dietzel M. Lesion type and reader experience affect the diagnostic accuracy of breast MRI: a multiple reader ROC study. Eur J Radiol 2015;84:86-91. [Crossref] [PubMed]
  6. Liu G, Li Y, Chen SL, Chen Q. Non-mass enhancement breast lesions: MRI findings and associations with malignancy. Ann Transl Med 2022;10:357. [Crossref] [PubMed]
  7. Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2024;1:
  8. Alam S, Sohail SS. Enhancing Breast Imaging Strategies: The Role of ChatGPT in Optimizing Screening Pathways. Clin Breast Cancer 2024;24:e772. [Crossref] [PubMed]
  9. Maroncelli R, Rizzo V, Pasculli M, Cicciarelli F, Macera M, Galati F, Catalano C, Pediconi F. Probing clarity: AI-generated simplified breast imaging reports for enhanced patient comprehension powered by ChatGPT-4o. Eur Radiol Exp 2024;8:124. [Crossref] [PubMed]
  10. Rao A, Kim J, Kamineni M, Pang M, Lie W, Dreyer KJ, Succi MD. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol 2023;20:990-7. [Crossref] [PubMed]
  11. Shahriar S, Lund BD, Mannuru NR, Arshad MA, Hayawi K, Bevara RVK, Mannuru A, Batool L. Putting gpt-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency. Applied Sciences 2024;14:7782.
  12. Tanno R, Barrett DGT, Sellergren A, Ghaisas S, Dathathri S, See A, et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat Med 2025;31:599-608. [Crossref] [PubMed]
  13. Nguyen D, Rao A, Mazumder A, Succi MD. Exploring the accuracy of embedded ChatGPT-4 and ChatGPT-4o in generating BI-RADS scores: a pilot study in radiologic clinical support. Clin Imaging 2025;117:110335. [Crossref] [PubMed]
  14. Brin D, Sorin V, Barash Y, Konen E, Glicksberg BS, Nadkarni GN, Klang E. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol 2025;35:1959-65. [Crossref] [PubMed]
  15. Voinea ȘV, Mămuleanu M, Teică RV, Florescu LM, Selișteanu D, Gheonea IA. GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3. Bioengineering (Basel) 2024;11:1043. [Crossref] [PubMed]
  16. Lim YH, Yeoh PSQ, Lai KW. Evaluating fine-tuned GPT models on different datasets in the healthcare domain. Innov Emerg Technol 2025;12:2550012.
  17. Lee S, Youn J, Kim H, Kim M, Yoon SH. CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. Eur Radiol 2025;35:4374-86. [Crossref] [PubMed]
  18. Haver HL, Yi PH, Jeudy J, Bahl M. Use of ChatGPT to Assign BI-RADS Assessment Categories to Breast Imaging Reports. AJR Am J Roentgenol 2024;223:e2431093. [Crossref] [PubMed]
  19. Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, Christianson B, Curti M, Rizzo S, Del Grande F, Mann RM, Schiaffino S. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024;311:e232133. [Crossref] [PubMed]
  20. Soni N, Ora M, Agarwal A, Yang T, Bathla G. A Review of the Opportunities and Challenges with Large Language Models in Radiology: The Road Ahead. AJNR Am J Neuroradiol 2025;46:1292-9. [Crossref] [PubMed]
  21. D’Orsi CJ, Sickles EA, Mendelson EB, Morris EA. ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. 2013. [Accessed July 18, 2025]. Available online: https://www.acr.org/Clinical-Resources/Clinical-Tools-and-Reference/Reporting-and-Data-Systems/BI-RADS
  22. OpenAI website. [Accessed December 20, 2024]. Available online: openai.com/research/gpt-4
  23. Zhang L, Liu M, Wang L, Zhang Y, Xu X, Pan Z, Feng Y, Zhao J, Zhang L, Yao G, Chen X, Xie X. Constructing a Large Language Model to Generate Impressions from Findings in Radiology Reports. Radiology 2024;312:e240885. [Crossref] [PubMed]
  24. Mukherjee P, Hou B, Suri A, Zhuang Y, Parnell C, Lee N, Stroie O, Jain R, Wang KC, Sharma K, Summers RM. Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions. Radiology 2024;313:e240609. [Crossref] [PubMed]
  25. Suh PS, Shim WH, Suh CH, Heo H, Park KJ, Kim PH, Choi SJ, Ahn Y, Park S, Park HY, Oh NE, Han MW, Cho ST, Woo CY, Park H. Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs. Radiology 2024;313:e241668. [Crossref] [PubMed]
  26. Yao JQ, Zhang R, Yang ZB, Zhang B, Jiang SQ, Jiang L, Zhang XE, Xie XY, Huang TY, Xu M. Multimodal large language models in ultrasound diagnosis of breast masses: a multicenter comparative analysis based on GPT-4o, radiologists, and convolutional neural network (CNN). Quant Imaging Med Surg 2025;15:9453-65. [Crossref] [PubMed]
  27. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589-96. [Crossref] [PubMed]
  28. Sievert M, Aubreville M, Mueller SK, Eckstein M, Breininger K, Iro H, Goncalves M. Diagnosis of malignancy in oropharyngeal confocal laser endomicroscopy using GPT 4.0 with vision. Eur Arch Otorhinolaryngol 2024;281:2115-22. [Crossref] [PubMed]
  29. Liu G, Tang X, He J, Li P, Chen Z, Zhong S. PeFoMed: Parameter efficient fine-tuning of multimodal large language models for medical CXR. Sci Rep. 2026; [Crossref]
  30. Ji J, Li G, Fu B, Zhao H, Wu Y, Liang H, Wu Y. Comparison of online radiologists and large language model chatbots in responding to common radiology-related questions in Chinese: a cross-sectional comparative analysis. Quant Imaging Med Surg 2026;16:129. [Crossref] [PubMed]
  31. Yang Q, Chen J, Sun Y, Wang Y, Tan T. Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise. Quant Imaging Med Surg 2025;15:5450-62. [Crossref] [PubMed]
  32. Schiaffino S, Zhang T, Mann RM, Pinker K. The Role of Large Language Models (LLMs) in Breast Imaging Today and in the Near Future. J Magn Reson Imaging 2025;62:1296-304. [Crossref] [PubMed]
  33. Kim K, Cho K, Jang R, Kyung S, Lee S, Ham S, Choi E, Hong GS, Kim N. Updated Primer on Generative Artificial Intelligence and Large Language Models in Medical Imaging for Medical Professionals. Korean J Radiol 2024;25:224-42. [Crossref] [PubMed]
  34. El Khoury M, Lalonde L, David J, Labelle M, Mesurolle B, Trop I. Breast imaging reporting and data system (BI-RADS) lexicon for breast MRI: interobserver variability in the description and assignment of BI-RADS category. Eur J Radiol 2015;84:71-6. [Crossref] [PubMed]
  35. Korhonen KE, Zuckerman SP, Weinstein SP, Tobey J, Birnbaum JA, McDonald ES, Conant EF. Breast MRI: False-Negative Results and Missed Opportunities. Radiographics 2021;41:645-64. [Crossref] [PubMed]
Cite this article as: Tang J, Yu X, Kang D, Luo S, Sheng M. Fine-tuned multimodal GPT-4o for generating diagnostic impressions in breast magnetic resonance imaging: insights into non-mass enhancement lesions. Quant Imaging Med Surg 2026;16(7):536. doi: 10.21037/qims-2025-1-2523
