Original Article
Fine-tuned multimodal GPT-4o for generating diagnostic impressions in breast magnetic resonance imaging: insights into non-mass enhancement lesions
Abstract
Background: Previous studies have reported limited performance of publicly available large language models for Breast Imaging Reporting and Data System (BI-RADS) categorization of breast lesions. However, their application to breast non-mass enhancement (NME) lesions has not been specifically investigated. This study evaluated the diagnostic capability of multimodally fine-tuned ChatGPT-4 omni [generative pre-trained transformer 4 omni (GPT-4o)] for breast NME lesions, focusing on the accuracy of generated diagnostic impressions.
Methods: In this retrospective study, magnetic resonance imaging (MRI) contrast-enhanced images, radiology reports, and clinical histories of 229 patients with breast NME lesions were collected. Six models were developed across three settings: zero-shot without fine-tuning (A1 and A2), single-modality fine-tuning (B and C), and multimodal fine-tuning (D and E). The models were evaluated for their ability to generate summarized imaging reports. Their diagnostic performance was compared against that of three radiologists with varying experience levels. Finally, the models were comprehensively assessed by three additional experts using a five-point Likert scale.
Results: The zero-shot models achieved areas under the receiver operating characteristic curve (AUCs) of 0.67–0.68, slightly outperforming junior radiologists but remaining inferior to experienced radiologists. Among the fine-tuned models, the image-only model (Model B) showed the poorest performance [AUC, 0.56; 95% confidence interval (CI): 0.45–0.66], significantly inferior to the other models (all P≤0.047). In contrast, the text-only and multimodal fine-tuned models showed substantial performance improvements. Model E achieved the best performance, with an AUC of 0.81 (95% CI: 0.74–0.89), significantly outperforming the zero-shot, image-only, and text-only models (all P≤0.039). Model E also outperformed junior radiologists (P<0.001), while showing no significant difference compared with intermediate and senior radiologists (P=0.060 and 0.367, respectively). The average expert evaluation score for Model E was 4.94/5. Compared with senior radiologists, specificity increased by 14.0%, whereas sensitivity decreased by 5.3%. Agreement for model-generated BI-RADS assessments was moderate both within the model outputs (κ=0.47) and between the model and senior radiologists (κ=0.49).
Conclusions: Multimodally fine-tuned GPT-4o can generate clinically accurate diagnostic impressions for complex breast NME lesions with performance approaching that of experienced radiologists. The model can improve diagnostic specificity and may serve as a useful second reader for breast MRI interpretation, particularly for less experienced radiologists, although expert supervision remains necessary because sensitivity was slightly lower than that of senior radiologists.
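
As a reading aid for the sensitivity, specificity, and κ values reported above, the following is a minimal sketch of how such metrics are commonly computed from paired labels. It is not the authors' analysis code: the function names and the toy labels are hypothetical, and it assumes binary benign/malignant labels and unweighted Cohen's kappa, which the abstract does not specify.

```python
from collections import Counter


def sensitivity_specificity(truth, pred):
    """Return (sensitivity, specificity) for binary labels (1 = malignant, 0 = benign)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    # Assumes both classes are present; otherwise a denominator would be zero.
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters over the same cases."""
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels for illustration only (not study data).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
reader_a = [1, 1, 1, 0, 0, 0, 0, 1]
reader_b = [1, 1, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(truth, reader_a)
kappa = cohen_kappa(reader_a, reader_b)
```

On a commonly used interpretation scale, κ values between 0.41 and 0.60 are described as moderate agreement, which matches the wording used for κ=0.47 and κ=0.49 above. `cohen_kappa` accepts any hashable labels, so it could equally be applied to BI-RADS categories rather than binary labels.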

