Comparison of online radiologists and large language model chatbots in responding to common radiology-related questions in Chinese: a cross-sectional comparative analysis
Introduction
In several healthcare fields, artificial intelligence (AI)-based large language model chatbots (LLM-chatbots) have demonstrated the ability to respond to patient queries, with their performance in accurately understanding questions and providing standardized medical responses approximating that of human professionals (1). LLM-chatbots are capable of drafting or automatically replying to a portion of these medical queries, which allows clinical staff to focus on more complex tasks, thus reducing their workload and preventing burnout (2). The production of seemingly credible but incorrect responses, a common phenomenon known as the “hallucination effect”, however, complicates the use of LLMs in clinical healthcare practice. Thus, evaluating the reliability of LLM-generated responses and ensuring their validity remain key challenges in advancing healthcare practice.
Radiology is a complex field that encompasses clinical medicine, imaging technology, and diagnostic imaging. Radiological services may seem somewhat intimidating, mysterious, or complex to patients and may even violate patient privacy. Generally, patients with numerous healthcare-related questions are unable to meet with their diagnostic radiologists, which reduces the effectiveness of patient-physician communication. The lack of resources for radiological healthcare providers has led patients to turn to Internet telemedicine consultations for guidance. However, this Internet-based counseling has certain drawbacks, including the fragmentary nature of the available information and the uneven professional level of online radiologists. Patients also must pay for extra medical consultations, and they may receive meaningless responses. Meanwhile, physicians must devote considerable time and attention to deciding how to respond to their patients, which contributes substantially to their burden and may not achieve the desired results (3). Indeed, patient-LLM interaction may occasionally be an obstacle to effective patient-online physician communication (4-6).
Previous studies have examined the value of LLM-chatbots in generating radiology reports, streamlining the reports, and aiding in diagnosis (7-9). Given that LLMs only receive radiation-related knowledge piecemeal for training purposes, their ability to produce high-quality, empathetic, and reliable responses has not been validated. In addition, due to the differences between Chinese- and English-language training databases, it has not been clearly determined to what extent LLMs produce reliable responses in this field.
This study thus aimed to evaluate the performance of LLM-chatbots in responding to radiology-related questions posed on online platforms. Specifically, the questions and answers produced in first-time consultations with patients were compared between online radiologists and LLM-chatbots through qualitative, subjective evaluations. Moreover, two mainstream LLM-chatbots, DeepSeek-R1 and ChatGPT-4o, were compared quantitatively in terms of textual (Chinese language) features, response time, and self-improvement capability. This combined expert-led and data-led assessment method facilitated the holistic assessment of LLM-chatbot performance in responding to common radiology-related patient inquiries.
Methods
Data source
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This cross-sectional study was exempted from review by the Ethics Committee of The First Affiliated Hospital of Henan Medical University because the data were publicly available, and it followed the STROBE reporting guideline. From May 2024 to January 2025, we collected publicly available patient questions and online radiologists’ responses posted on an online social media forum, HaoDF (www.haodf.com). HaoDF is one of China’s largest online health consultation platforms, through which patients can pay to receive medical consultations (10). This platform includes 900,000 registered physicians from 10,000 hospitals in China. To ensure medical service quality, more than 70% of the registered physicians are from comprehensive medical centers that integrate education, healthcare, and academic research. All analyses in this study adhered to HaoDF’s terms and conditions.
Study design
The sample analyzed in this study contained 106 queries from 106 randomly selected patient consultation records of interactions occurring from May 2024 to January 2025. When an online radiologist replied more than once, we only considered the first response, as most of the subsequent responses were restatements of the first response. Additionally, data were collected from two free and publicly accessible online LLMs, DeepSeek-R1 (https://www.DeepSeek.com/) and ChatGPT-4o (https://chat.openai.com). From February 24 to 30 2025, each patient inquiry was input into a new chatbot session with the following prompt: “You are a Chinese radiologist. Please answer the following question according to Chinese medical consensus and knowledge”.
Three radiologists (Yue Wu, Y.F., and Z.Z.) with 15 years of clinical experience from different medical centers were recruited as reviewers to independently and blindly evaluate all responses originating from three sources (online radiologists, DeepSeek-R1, and ChatGPT-4o). The following three subjective dimensions were graded on a 5-point Likert scale: quality, empathy, and potential harm. To preserve blinding, easily identifiable expressive features were removed. For example, typical phrases from online radiologists could include “Auntie, this is definitely fine”, “Try this medicine first”, “More likely (disease)”, and “Monitor for now”. Those from chatbots could include “I understand your concern” and “As an AI assistant, I recommend consulting a professional”, while markers of structured responses could include “First”, “Second”, “Finally”, and “In summary”. Each inquiry was presented together with the responses from all three sources.
Further comparisons between LLM-chatbots and online radiologists were conducted with three objective evaluation metrics: textual features, response time, and self-improvement capacity. A set of linguistic features was measured with the Chinese Readability Index Explorer (CRIE; http://www.chinesereadability.net), which provided six metrics across the lexical, syntactic, and semantic domains. Linguistic features can, to a certain extent, serve as valid indicators of textual complexity (11). The specific interpretations of each linguistic metric and their correlations with textual complexity are summarized in Table 1.
Table 1
| Dimension | Metric name | Description | Relevance to text complexity |
|---|---|---|---|
| Lexis | WC | Nontext characters are removed from the text before word segmentation, and the number of independent words is counted. Higher values correspond to a greater volume of textual content, which tends to increase reading difficulty | Positive correlation |
| | NDW | The number of vocabulary items beyond the scope of high-frequency words specified in the International Chinese Language Education Chinese Proficiency Standards. Higher values correspond to increased reading difficulty | Positive correlation |
| Syntax | ASL | The average length of all sentences. Longer sentences with more complex structures tend to exhibit higher reading difficulty | Positive correlation |
| | NS | Total sentence count provides a measure of textual length and the amount of information | Positive correlation |
| Semantics | RDW | The ratio of semantically distinct word occurrences in the text. A higher value indicates greater semantic diversity and consequently increased text comprehension complexity | Positive correlation |
| | NSCM | The total number of sentences containing complex semantic categories. The higher the ratio of polysemous words to semantic categories is, the greater the comprehension difficulty of the text | Positive correlation |
ASL, average sentence length; NDW, number of difficult words; NSCM, number of sentences with complex meaning; NS, number of sentences; RDW, ratio of dissimilar words; WC, word count.
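As a rough illustration of the surface metrics in Table 1, the following Python sketch counts words and sentences from text that has already been segmented into words. It is not CRIE's implementation: the function name `surface_metrics`, the punctuation set used to end sentences, the choice to measure ASL in words per sentence, and the type-token ratio (`TTR`, a crude stand-in for RDW) are all assumptions made for illustration. NDW and NSCM are omitted because they depend on word lists and semantic categories that are not given here.

```python
import re

# Sentence-ending punctuation (Chinese and ASCII). This set is an assumption,
# not the rule CRIE applies.
SENTENCE_END = re.compile(r"[。！？!?；;]+")


def surface_metrics(tokens):
    """Compute WC, NS, ASL, and a type-token ratio from pre-segmented tokens.

    tokens: list of strings produced by a Chinese word segmenter, with
    punctuation marks as separate tokens (hypothetical input format).
    """
    sentences, current = [], []
    for tok in tokens:
        if SENTENCE_END.fullmatch(tok):
            # A sentence boundary closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
        elif re.search(r"\w", tok):
            # Keep tokens containing a word character; drop commas and the like.
            current.append(tok)
    if current:
        sentences.append(current)

    words = [w for sentence in sentences for w in sentence]
    wc = len(words)
    ns = len(sentences)
    return {
        "WC": wc,
        "NS": ns,
        "ASL": wc / ns if ns else 0.0,             # words per sentence (assumed unit)
        "TTR": len(set(words)) / wc if wc else 0.0,  # proxy only, not CRIE's RDW
    }


example = ["核磁", "检查", "没有", "辐射", "。",
           "检查", "前", "，", "请", "取下", "金属", "物品", "。"]
print(surface_metrics(example))
```

The example contains two sentences and ten words, one of which is repeated, so the sketch reports an ASL of 5 words and a type-token ratio of 0.9.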
To evaluate the response time, we recorded the time from when the question was submitted to when the reply was completed. For the online radiologists, the response time was calculated from the forum, which displayed the time at which the questioner submitted the query and the time at which the online physician completed the response; the response time was recorded as the interval between these two time points. The LLM-chatbots’ response time was measured manually with a stopwatch application. To minimize the potential influence of prompt sequence on chatbot response times and output content (12), all inquiries were randomly assigned to 15 student participants, who independently submitted the questions and compiled the responses while recording the corresponding times. The unit of measurement was seconds. Additional experiments were conducted to ascertain whether the subjective evaluation of the LLM-chatbot-generated responses (in terms of quality, empathy, and potential harm) could be enhanced through deliberately crafted prompts. To evaluate self-improvement capacity, we entered the following prompt into the LLM-chatbots after each response was generated: “As a senior radiologist, I think the above response is inaccurate and incomplete, lacks empathy, and is potentially harmful. Please self-check and adjust before answering again”. Any inquiry whose response received at least two bad ratings (<3) was included in the self-improvement program. Regenerated responses were re-evaluated by the three evaluators, who were not informed that the responses had been modified.
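The interval calculation for forum responses can be sketched as follows. This is a minimal illustration, not the authors' procedure: the timestamp format, the function names `response_seconds` and `median_iqr`, and the use of inclusive quartiles are assumptions, since the text does not specify how the forum displays times or how quartiles were computed.

```python
from datetime import datetime
from statistics import median, quantiles


def response_seconds(submitted, answered, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between question submission and completed reply.

    The timestamp format is a hypothetical example of what a forum might show.
    """
    delta = (datetime.strptime(answered, fmt)
             - datetime.strptime(submitted, fmt)).total_seconds()
    if delta < 0:
        raise ValueError("answer timestamp precedes submission timestamp")
    return delta


def median_iqr(values):
    """Return (median, Q1, Q3) using inclusive quartiles (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# A question submitted at 08:00:00 and answered at 09:48:07 the same day.
print(response_seconds("2024-05-01 08:00:00", "2024-05-01 09:48:07"))
print(median_iqr([10, 12, 14, 16, 18]))
```

Summarizing with the median and interquartile range suits response times like these, where a few replies arrive many hours after the rest.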
Evaluation criteria
A 5-point Likert scale was used for subjective evaluation of quality (ratings: very poor, poor, acceptable, good, and very good), empathy (ratings: not empathetic, slightly empathetic, moderately empathetic, adequately empathetic, and highly empathetic), and potential harm (ratings: minimal risk, low risk, moderate risk, high risk, and critical risk). The subjective metrics are described in detail in Table S1. Responses were rated on a scale of 1 to 5, with higher scores representing better quality, greater empathy, or greater potential harm. For each evaluation, the final score was the average of the three evaluators’ assessments.
For linguistic features, the analysis was restricted to the initial responses from LLM-chatbots. All responses were required to be comprehensive and serve as an effective answer to the original medical inquiry. Responses such as “It’s okay. Don’t worry about it” or “As an AI, I cannot provide medical advice” from LLM-chatbots were excluded as being noneffective. Each valid textual response was computationally analyzed to yield six unitless metrics, with higher values indicating greater textual complexity. Response time was defined as the period of time from the start of the question submission to the completion of the first response. All measurements were recorded in seconds. After “self-improvement” was completed, subjective evaluation was again carried out via the Likert scale. The evaluation level was determined by the lowest score. Significant self-improvement was considered achieved if at least two reviewers increased the original score by two or more points.
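The scoring rules described in the Methods can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and the eligibility and improvement rules are applied as stated for quality and empathy, where a lower score is worse (for potential harm the direction would be reversed).

```python
from statistics import mean


def final_score(ratings):
    """Average of the three reviewers' 1-5 Likert ratings for one response."""
    if len(ratings) != 3 or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("expected three integer ratings between 1 and 5")
    return mean(ratings)


def eligible_for_self_improvement(ratings, threshold=3):
    """True when at least two reviewers rated the response below the threshold."""
    return sum(r < threshold for r in ratings) >= 2


def significant_self_improvement(before, after, min_gain=2):
    """True when at least two reviewers raised their score by min_gain or more.

    before and after hold the same reviewers' ratings in the same order.
    """
    gains = [a - b for b, a in zip(before, after)]
    return sum(g >= min_gain for g in gains) >= 2


print(final_score([4, 5, 3]))
print(eligible_for_self_improvement([2, 2, 4]))
print(significant_self_improvement([2, 2, 3], [4, 4, 3]))
```

Under these rules, a response moving from ratings of 2, 2, 3 to 3, 4, 3 would count as improved but not as significantly improved, because only one reviewer raised the score by two points.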
Statistical analysis
Statistical analyses were performed with R version 4.3.0. The distributions of scores from the three sources of responses are provided. The Kruskal-Wallis test and the Dunn multiple comparison post hoc test were used to assess and compare the subjective metrics of responses and the response times of the online radiologists and the two LLM-chatbots. We compared the proportion of responses scoring below the threshold (a score of 3) between the different response sources. Two-sample t-tests were used to compare the differences in objective metrics between the models.
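For readers unfamiliar with the Kruskal-Wallis test, the following Python sketch computes the H statistic with the standard tie correction. The study's analyses were run in R; this standalone version is only an illustration, the function names are hypothetical, and the Dunn post hoc test is not shown. For exactly three groups the reference chi-square distribution has 2 degrees of freedom, so the asymptotic P value reduces to exp(−H/2).

```python
from itertools import chain
from math import exp


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)

    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_three_groups(h):
    """Asymptotic P value for three groups (chi-square with 2 df)."""
    return exp(-h / 2)


h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h, p_value_three_groups(h))
```

Because Likert ratings produce many ties, the tie correction matters for data of this kind.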
Results
A total of 318 responses to 106 inquiries were collected from three sources, and 954 ratings were received from three reviewers. There were 38 inquiries and 342 reviews related to examination and safety, 35 inquiries and 315 reviews related to diagnosis and explanation, and 33 inquiries and 297 reviews related to policy and practice.
Statistically significant differences in the subjective evaluation across the three different topics were observed in the responses from the three sources (all P values <0.001). However, no statistically significant differences were observed between DeepSeek-R1 and ChatGPT-4o in terms of the empathy evaluation (P=0.096; effect size 0.15) or assessment of potentially harmful responses (P=0.69; effect size 0.10). Further data are provided in Table 2.
Table 2
| Category | Quality mean | Empathy mean | Potential harm mean |
|---|---|---|---|
| Online radiologists (a) | 3.43±0.65 (3.31–3.55) | 2.86±0.82 (2.73–2.99) | 1.53±0.59 (1.42–1.65) |
| DeepSeek-R1 (b) | 4.40±0.57 (4.29–4.52) | 4.11±0.62 (3.99–4.23) | 1.11±0.28 (1.06–1.17) |
| ChatGPT-4o (c) | 3.73±0.64 (3.59–3.86) | 3.94±0.49 (3.84–4.03) | 1.18±0.38 (1.11–1.26) |
| Difference | | | |
| a vs. b‡ | −1.05 (−1.21 to −0.88) | −0.80 (−0.99 to −0.60) | 0.41 (0.28 to 0.54) |
| a vs. c‡ | −0.41 (−0.59 to −0.24) | −0.63 (−0.81 to −0.44) | 0.34 (0.21 to 0.48) |
| b vs. c‡ | −0.63 (−0.79 to −0.46) | −0.16 (−0.32 to −0.01) | 0.06 (−0.02 to 0.15) |
| P value | | | |
| All† | <0.001 | <0.001 | <0.001 |
| a vs. b‡ | <0.001 | <0.001 | <0.001 |
| a vs. c‡ | <0.001 | <0.001 | <0.001 |
| b vs. c‡ | <0.001 | 0.096 | 0.691 |
| Effect size | | | |
| a vs. b‡ | 0.62 | 0.65 | 0.4 |
| a vs. c‡ | 0.22 | 0.62 | 0.46 |
| b vs. c‡ | 0.48 | 0.15 | 0.10 |
Data are presented as the mean ± SD (95% CI). †, Kruskal-Wallis test; ‡, the Dunn test. CI, confidence interval; SD, standard deviation.
Comparison of online radiologists and LLM-chatbots
Quality evaluation
The distribution of the quality scores was significantly different across the three sources (H=101.63; P<0.001) (Figure 1A). The mean quality scores for the online radiologists, DeepSeek-R1, and ChatGPT-4o were 3.43 [standard deviation (SD) 0.65], 4.40 (SD 0.57), and 3.73 (SD 0.64), respectively. The quality scores for DeepSeek-R1 and ChatGPT-4o were significantly higher than those for the online radiologists (all P values <0.001) (Figure 1B). The proportion of responses rated less than acceptable quality (<3) was highest for online radiologists [33/318, 10%; 95% confidence interval (CI): 7–14%] as compared to DeepSeek-R1 (2/318, 0.6%; 95% CI: 0.1–2%) and ChatGPT-4o (8/318, 2%; 95% CI: 1–4%). Online radiologists had the highest proportion of responses rated with less than acceptable quality (<3) for the topics of examination and safety (18/114, 15.79%; 95% CI: 10–23%) and diagnosis and explanation (12/105, 11.43%; 95% CI: 6–18%), while online radiologists and ChatGPT-4o had an equally high proportion for policy and practice (3/99, 3%; 95% CI: 1–8%).
Empathy evaluation
The distribution of the empathy scores was significantly different across the three sources (H=59.90; P<0.001) (Figure 2A). The mean empathy scores for online radiologists, DeepSeek-R1, and ChatGPT-4o were 2.86 (SD 0.82), 4.11 (SD 0.62), and 3.94 (SD 0.49), respectively. The LLM-chatbots’ responses were significantly more empathetic than were those of the online radiologists (all P values <0.001) (Figure 2B). In addition, the total proportion of responses rated less than moderately empathetic (<3) was highest for online radiologists (60/318, 18%; 95% CI: 14–23%) compared to DeepSeek-R1 (3/318, 0.9%; 95% CI: 0.3–2%) and ChatGPT-4o (7/318, 2%; 95% CI: 1–4%). The online radiologists had the highest proportion of responses rated less than moderately empathetic (<3) for examination and safety (30/114, 26.32%; 95% CI: 19.10–35.08%), diagnosis and explanation (22/105, 20.95%; 95% CI: 14.26–29.69%), and policy and practice (9/99, 9.09%; 95% CI: 4.86–16.38%).
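The text does not state which method was used for the 95% CIs of proportions. The reported bounds are consistent with the Wilson score interval, which the sketch below assumes; the function name `wilson_ci` is hypothetical. Applied to 30 of 114 ratings, it reproduces the 19.10–35.08% interval reported above for examination and safety.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_ci(30, 114)
print(f"{30 / 114:.2%} (95% CI: {low:.2%} to {high:.2%})")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 to 100% and behaves sensibly for small counts such as 2/114.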
Potential harm
The distribution of the potential harm scores was significantly different across the three sources (H=48.34; P<0.001) (Figure 3A). The mean potential harm scores for online radiologists, DeepSeek-R1, and ChatGPT-4o were 1.53 (SD 0.59), 1.11 (SD 0.28), and 1.18 (SD 0.38), respectively. Scores for potentially harmful responses of online radiologists were significantly higher than those of the LLM-chatbots (all P values <0.001) (Figure 3B). The proportion of responses rated as more than a moderate risk (>3) was highest for online radiologists (2/318, 0.63%; 95% CI: 0.17–2.27%) as compared to DeepSeek-R1 (0/318, 0%) and ChatGPT-4o (0/318, 0%). The online radiologists had the highest proportion of responses rated as more than a moderate risk of harm (>3) for examination and safety (2/114, 1.75%; 95% CI: 0.48–6.17%), while DeepSeek-R1 and ChatGPT-4o had no scores indicating more than a moderate risk (>3) of potential harm for any topic (Figure 4).
Response time
The distribution of the response time was significantly different across the three sources (H=281.47; P<0.001). The median response time was 6,487.90 s [interquartile range (IQR), 3,530.50–29,061.70 s] for online radiologists, 56.00 s (IQR, 47–67 s) for DeepSeek-R1, and 12.17 s (IQR, 10.91–15.85 s) for ChatGPT-4o. The response time of online radiologists was significantly longer than that of the LLM-chatbots (all P values <0.001). On the topic of examination and safety, there were significant differences in response time across the three sources (H=100.32; P<0.001). The median response time for this topic among online radiologists was 5,922.10 s (IQR, 4,264.30–7,947.40 s), which was longer than that of DeepSeek-R1 (53.00 s; IQR, 49.00–70.00 s) and ChatGPT-4o (11.34 s; IQR, 7.57–15.65 s) (all P values <0.001). The response time for the topic of diagnosis and explanation also differed significantly across the three sources (H=92.45; P<0.001). The median response time was 5,432.65 s (IQR, 3,256.70–31,256.80 s) for online radiologists, which was longer than that of DeepSeek-R1 (55.00 s; IQR, 39.00–70.00 s) and ChatGPT-4o (13.27 s; IQR, 11.09–15.39 s) (all P values <0.001). Additionally, the response time differed significantly across the three sources for the topic of policy and practice (H=87.14; P<0.001). The median response time for online radiologists on this topic was 7,221.75 s (IQR, 3,373.20–42,666.55 s), longer than that of DeepSeek-R1 (60.00 s; IQR, 53.00–64.50 s) and ChatGPT-4o (16.90 s; IQR, 11.41–17.02 s) (all P values <0.001).
Comparison of DeepSeek-R1 and ChatGPT-4o
In the quality evaluation, the quality scores for DeepSeek-R1’s responses were significantly higher than those for ChatGPT-4o’s (P<0.001; effect size 0.48). The proportion of responses rated as less than acceptable quality (<3) was higher for ChatGPT-4o (8/318, 2%; 95% CI: 1–4%) as compared to DeepSeek-R1 (2/318, 0.6%; 95% CI: 0.1–2%). In terms of the three topics, ChatGPT-4o had a higher proportion of responses rated as less than acceptable quality (<3) for examination and safety (5/114, 4.39%; 95% CI: 1.89–9.86%), diagnosis and explanation (1/105, 0.95%; 95% CI: 0.17–5.19%), and policy and practice (3/99, 3%; 95% CI: 1–8%).
In the empathy evaluation, there was no significant difference between DeepSeek-R1 and ChatGPT-4o (P=0.096). However, ChatGPT-4o had a higher proportion of responses rated less than moderately empathetic (<3) (7/318, 2%; 95% CI: 1–4%) than did DeepSeek-R1 (3/318, 0.9%; 95% CI: 0.3–2%). ChatGPT-4o also had a higher proportion of responses rated less than moderately empathetic (<3) for examination and safety (ChatGPT-4o: 5/114, 4.39%, 95% CI: 1.89–9.86%; DeepSeek-R1: 4/114, 2.63%, 95% CI: 0.9–7.41%) and for policy and practice (ChatGPT-4o: 3/99, 3%, 95% CI: 1–8%; DeepSeek-R1: 1/99, 1.01%, 95% CI: 0.18–5.5%). Neither LLM-chatbot had any responses rated less than moderately empathetic (0%) for the topic of diagnosis and explanation.
Regarding potential harm, DeepSeek-R1 and ChatGPT-4o also did not differ significantly (P=0.69). However, the proportion of responses rated as less than moderate risk (<3) was higher for DeepSeek-R1 (112/318, 98%; 95% CI: 93–95%) than for ChatGPT-4o (111/318, 97%; 95% CI: 92–99%). Additionally, the proportion of responses rated less than moderate risk (<3) for the topic of examination and safety was higher for DeepSeek-R1 (109/114, 95.61%; 95% CI: 90.14–98.11%) than for ChatGPT-4o (107/114, 93.86%; 95% CI: 87.87–96.99%). For both LLM-chatbots, the proportion of responses rated as less than moderate risk (<3) was 100% for the diagnosis and explanation and the policy and practice topics.
Assessment of textual features and self-improvement
Figure 5 presents base-bean plots of common Chinese text metrics, including word count (WC), number of difficult words (NDW), average sentence length (ASL), ratio of dissimilar words (RDW), number of sentences (NS), and number of sentences with complex meaning (NSCM) for the two LLM-chatbots across the different topics. An analysis comparing DeepSeek-R1 and ChatGPT-4o on these metrics was conducted. For all 106 questions, DeepSeek-R1 and ChatGPT-4o, respectively, had a median WC of 577.50 (IQR 294.25) and 303.50 (IQR 165.00) (P<0.001), a median NDW of 157.50 (IQR 81.75) and 57.00 (IQR 32.50) (P<0.001), a median ASL of 11.49 (IQR 2.76) and 10.45 (IQR 1.45) (P<0.001), a median NS of 28.00 (IQR 16.50) and 17.17 (IQR 9.00) (P<0.001), a median RDW of 0.57 (IQR 0.11) and 0.53 (IQR 0.10) (P=0.09), and a median NSCM of 13.00 (IQR 12.25) and 8.00 (IQR 6.00) (P<0.001). Table S2 offers a comprehensive summary of the text metrics and response times for the topics of examination and safety, diagnosis and explanation, and policy and practice.
Based on the reviewers’ ratings, DeepSeek-R1 had a total of two questions subjected to the self-improvement program (one for quality evaluation and one for empathy evaluation), while ChatGPT-4o had three (two for quality evaluation and one for empathy evaluation). Notably, the majority of the improvements were mild, including improvements from slightly empathetic to moderately empathetic and from poor quality to acceptable quality. Only one response’s rating remained unchanged after regeneration. Table S3 shows an overview of the self-improvement performance of the two LLM-chatbots, with partial improvement in the subjective evaluation of the regenerated responses. A detailed example of self-improvement for the two LLM-chatbots is provided at the bottom of Appendix 1.
Discussion
This cross-sectional study evaluated LLM-chatbots’ responses in terms of subjective and objective metrics by simulating a series of clinical scenarios in which Chinese patients sought medical help from online radiologists. Overall, the LLM-chatbots performed better than did the online radiologists in delivering quality, empathetic, and valid first responses to medical inquiries. Objective evaluations demonstrated that the responses generated by DeepSeek-R1 achieved superior or comparable performance to ChatGPT-4o in textual features and readability across a variety of topics, albeit at the cost of higher inference latency. Notably, DeepSeek-R1 and ChatGPT-4o exhibited minimal self-improvement, suggesting that further optimization is necessary to refine this capability. These findings highlight the potential advantages and limitations of LLM-chatbots in delivering radiology-related medical expertise and advice.
Comparison of online radiologists and LLM-chatbots
We found that the LLM-chatbots, DeepSeek-R1 and ChatGPT-4o, had comparable expertise in responding to common queries from Chinese online customers on the online healthcare platform. The LLM-chatbots achieved higher average quality scores and garnered significantly greater proportions of “good” or “very good” ratings and a lower proportion of “very poor” or “poor” ratings as compared to online radiologists. These findings differ from those of He et al. (13), who reported that physicians’ responses were superior to those of LLM-chatbots in Chinese-language online consultation. This discrepancy may be a result of several factors. First, the quality dimensions of our study were assessed in terms of both accuracy and completeness, rather than a single relevance or accuracy criterion, which reduced the subjective bias of reviewers. Second, disease-targeted medical training corpora are considerably scarcer than domain-general query-response repositories in radiological medicine. This corpus imbalance causes the accuracy of LLM-chatbots’ responses to vary across diagnostic contexts (14). Third, we used a uniform prompt for queries to standardize the response format and enhance the depth of the responses (15).
Our study further identified disparities in the performance of chatbots and online radiologists when addressing different topics, with LLM-chatbots performing better in examination and safety and in diagnosis and explanation and online radiologists performing better than ChatGPT-4o in policy and practice. The former two topics entail more radiation-related public and health education knowledge, which is readily accessible online. Previous studies have shown that publicly available information on the Internet can enhance the reliability of chatbot outputs (16). The topic of policy and practice includes healthcare policies, legal regulations, and management agencies, which vary across different languages, cultures, and regions. Moreover, although the two LLM-chatbots demonstrated comparable performance on policy and practice topics, slight differences emerged when addressing certain high-level questions from this topical area.
The high degree of empathy in LLM-chatbots’ responses has been established in several studies in radiology and other medical fields (17-20). Consistent with a previous study, we found that LLM-chatbots’ responses exhibited greater empathy compared with those provided by online radiologists regardless of the topic discussed. The LLM-chatbots’ training database was derived primarily from Internet content such as news, popular science articles, and forums, while the physicians’ responses were based on medical guidelines developed by medical societies or medical textbooks (21). We thus speculate that the differences in empathy may be related to biased information in their training data set. In addition, the online radiologists’ responses likely lacked adequate empathy partly due to their limited work time and energy, emotional disturbance, and decreased concentration (22). LLM-chatbots can provide a 24-hour, impersonal, stable service, which can both help patients acquire a priori knowledge from which they can formulate further inquiries and offer physicians more time to answer complex questions.
Unexpectedly, our findings indicated that online radiologists’ responses were more potentially harmful than were those of the LLM-chatbots. Previous studies have shown that due to the algorithm’s reliance on historical data, LLM-chatbots may perpetuate or amplify biases, resulting in potentially harmful responses (23). Our results challenge this predominant understanding of LLM-chatbots. We speculate that this result may be explained as follows. First, according to reviewer feedback, an overreliance on subjective judgment, a lack of comprehensive recommendations, and insufficiently detailed explanations were the main contributors to the high potential harm score of online radiologists’ responses. Given that most of the questions concerned common medical knowledge, online radiologists, possibly owing to professional burnout, typically provided direct, effective, and concise responses, even though these were grounded in extensive clinical practice. In the forums, we found that physicians provided more detailed explanations and medical advice when faced with further inquiries from patients. In addition, despite the platform’s assertion of universal senior credential verification among its physicians, substantive oversight frameworks to validate online providers’ qualifications and ensure sustained clinical proficiency remain lacking. Finally, the LLM-chatbots positioned themselves as nonspecialty physicians, providing only enumerated, predictable outcomes and generalized medical recommendations, which, despite precluding significant medical harm, limited their usefulness. It is worth noting that both DeepSeek-R1 and ChatGPT-4o consistently advised users to “Seek help from a doctor”, demonstrating a level of caution in their responses. Our findings do not diminish the vital role of online radiologists in providing medical consultations, and further studies are needed to verify the value of their online contributions to radiation-related science education and health services.
Comparison of DeepSeek-R1 and ChatGPT-4o
DeepSeek-R1 demonstrated superior performance as compared to ChatGPT-4o in quality evaluation metrics, a finding consistent with a study by Luo et al. (24), indicating its robust capabilities in delivering medical information in the Chinese-language context. Furthermore, there were no significant differences in the other subjective evaluations between DeepSeek-R1 and ChatGPT-4o, and so further objective assessment was necessary. DeepSeek-R1 generated more complex responses to 106 common radiation-related questions over a longer response time, with greater WC, NDW, ASL, NS, RDW, and NSCM values regardless of topic type. Although these metrics indicate that DeepSeek-R1 generates greater textual complexity in responses, the two models did not differ significantly in RDW scores, regardless of topic type, suggesting that their responses were similar in terms of the increased semantic difficulty caused by lexical diversity. Compared to ChatGPT-4o, DeepSeek-R1 generated longer sentences, which increase the difficulty of text comprehension; this not only poses a challenge to readers but also potentially elevates the risk of reading fatigue. Moreover, ChatGPT-4o required a shorter response time in generating responses than did DeepSeek-R1, which would enhance the user experience in human-AI interaction and stimulate patients’ reading interest.
Consistent with previous studies, we found that the two LLM-chatbots exhibited a slight improvement on some lower-scoring questions after simple prompt modifications. This emphasizes that both LLM-chatbots possess some degree of self-correction capability while requiring manual feedback to activate deeper knowledge reserves. ChatGPT-4o exhibited the ability to deeply understand questions and provide answers that aligned with real-world practices, but not consistently so. For example, when asked, “What can I eat after a radiological examination to minimize radiation damage?”, instead of providing a positive answer based on previous knowledge, ChatGPT-4o gave a human-like response for medical practice: “Given that the radiation damage is minimal, there is no necessary reduction of risk through food”. After self-correction, a positive response was given and received good ratings from the reviewers. Additionally, ChatGPT-4o mistakenly interpreted the issue of “the impact of magnetic resonance imaging (MRI) examinations on women during menstruation” as precautions for conducting MRI exams on women during their menstrual period. The regenerated response did not correct the original misunderstanding, and we postulated that this may be because English is ChatGPT-4o’s primary training language, with the Chinese language presenting a greater challenge in terms of comprehension. Interestingly, both DeepSeek-R1 and ChatGPT-4o failed to deliver satisfactory responses to the question “What is the purpose of establishing specialist radiology outpatient clinics?” even after receiving reviewer feedback. This indicates that LLM-chatbots require more training in addressing native or regional health policy and practice-related topics. Overall, our findings suggest that chatbots can self-correct their generated responses to a slight degree, but corrected answers still need to be treated with caution.
Limitations
This study involved certain limitations that should be addressed. First, we focused on the first response given by online radiologists to patient inquiries in forums, which may not fully simulate real-world patient-physician interactions. We observed that when online radiologists replied more than twice, it was mainly to emphasize their earlier comments and explain medical terminology. In addition, multiple replies may drift away from the question and contribute to patient confusion. Second, to realistically simulate clinical scenarios, we entered patients’ original questions, usually with writing errors and colloquial expressions, directly into the LLM-chatbots’ dialog box, which might have contributed to biased responses. Although structured prompts could increase the accuracy of responses (25), lay patients are generally unable to frame their questions in specialized medical terminology. Third, only three radiology experts were involved as reviewers, which might have led to variations in scoring influenced by their clinical experience, personal knowledge, and preferences formed after comparing responses from different sources. Future studies should include a larger pool of reviewers, including nonspecialists, to enhance the validity of the scoring system. Fourth, the findings, derived from a single online platform (HaoDF), might have limited generalizability to broader Chinese populations or international contexts. Future studies with multiple platforms, larger reviewer pools (including nonspecialists and patients), and cross-cultural validation might broaden the scope of the assessment and mitigate data source bias, resulting in more universally relevant conclusions. Fifth, prompt engineering significantly influences the performance of LLM-chatbots in clinical tasks (26). This study employed only basic prompt engineering with concise language, and inquiries were partially standardized before being input to the models.
The prompt structure consisted of a role definition (“You are a professional radiologist”), a brief problem description, and structured output (a request for responses aligned with real-world clinical practice). It intentionally excluded excessive subjective opinions or predetermined conclusions to avoid leading the models. All queries were submitted to the LLMs by 15 students independently, a design that minimized nontargeted performance inflation from sample-specific training or machine memory. Sixth, the performance of online radiologists might have been underestimated due to factors such as job fatigue, potential financial motivations, and the absence of face-to-face interaction. In contrast, LLM-chatbots could consistently provide a broad range of general medical information at any time, which might introduce a potential source of bias into the evaluators’ assessments. The weakened capabilities of online radiologists could be improved upon by strengthening the platform’s verification mechanism for registered healthcare professionals, strengthening the rigor of review for the quality of responses, and improving the platform’s reward and punishment mechanism for members. Meanwhile, patients, particularly those with chronic conditions, should be encouraged to provide a complete medical record when consulting online physicians. Furthermore, in conjunction with a previous study (27), this study highlights the potential utility of LLMs in handling both workplace management and professional expertise questions in radiology, but the applicability of these findings to other clinical disciplines or clinical integration issues remains uncertain. There is a need for future research to examine the performance of LLM-chatbots by integrating questions from multiple disciplines in order to confirm their utility outside the radiological field. 
Finally, the robustness of responses generated by LLM-chatbots, applicability of different languages, and the potential for performance drift over time must be taken into account: current reports on LLM-chatbots’ performance may be contradicted by novel findings in the near future.
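As an illustration of the basic three-part prompt described in the fifth limitation (role definition, brief problem description, and a request for output aligned with real-world clinical practice), a minimal sketch is given below. Only the role sentence is quoted from the text. The wording of `OUTPUT_REQUEST`, the "Patient question:" label, and the helper name `build_prompt` are hypothetical and are not the exact prompt submitted to the models in this study.

```python
# Role definition quoted in the text.
ROLE = "You are a professional radiologist."

# Hypothetical wording: the text only states that the prompt requested
# responses aligned with real-world clinical practice.
OUTPUT_REQUEST = "Please answer in a way that is consistent with real-world clinical practice."


def build_prompt(question: str) -> str:
    """Assemble role definition, problem description, and output request."""
    # Collapse stray whitespace but otherwise keep the patient's own wording,
    # including colloquial expressions and writing errors.
    cleaned = " ".join(question.split())
    if not cleaned:
        raise ValueError("question must not be empty")
    return "\n".join([ROLE, f"Patient question: {cleaned}", OUTPUT_REQUEST])


if __name__ == "__main__":
    print(build_prompt("What can I eat after a radiological examination to minimize radiation damage?"))
```

The sketch deliberately adds no opinions or expected conclusions to the prompt, in keeping with the stated aim of not leading the models.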
Conclusions
This study revealed that for common radiology-related questions posed on online medical platforms, responses from the LLM-chatbots outperformed those from online radiologists in terms of quality, empathy scores, and potential harm scores. Comparing the two models, we found that DeepSeek-R1, relative to ChatGPT-4o, generated richer and textually more complex content with a longer response time in the Chinese-language environment, whereas ChatGPT-4o responded faster and more concisely. Additionally, both LLM-chatbots exhibited slight self-improvement capabilities. These findings indicate LLM-chatbots’ potential as a valuable resource for addressing common patient queries related to radiology.
Acknowledgments
The data for this study come from the HaoDF (www.haodf.com) healthcare forum, and we would like to thank this forum for providing the data support that helped us to complete this study. We are also thankful to DeepSeek and OpenAI for providing the chatbot tools (DeepSeek-R1 and ChatGPT-4o) used in data collection and processing. In addition, we would like to thank the medical staff for their invaluable assistance in collecting and processing data.
Footnote
Funding: This study was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1716/coif). Changhua Liang reports funding from the Henan Zhongyuan Medical Science and Technology Innovation and Development Foundation (No. ZYYC2024MB). The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study conformed to the provisions of the Declaration of Helsinki and its subsequent amendments. This cross-sectional study utilized publicly available data and did not involve human or animal subjects. It was exempt from ethical review by the Ethics Committee of The First Affiliated Hospital of Henan Medical University.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Mishra V, Sarraju A, Kalwani NM, Dexter JP. Evaluation of Prompts to Simplify Cardiovascular Disease Information Generated Using a Large Language Model: Cross-Sectional Study. J Med Internet Res 2024;26:e55388. [Crossref] [PubMed]
- Xue Z, Zhang Y, Gan W, Wang H, She G, Zheng X. Quality and Dependability of ChatGPT and DingXiangYuan Forums for Remote Orthopedic Consultations: Comparative Analysis. J Med Internet Res 2024;26:e50882. [Crossref] [PubMed]
- Zulman DM, Verghese A. Virtual Care, Telemedicine Visits, and Real Connection in the Era of COVID-19: Unforeseen Opportunity in the Face of Adversity. JAMA 2021;325:437-8. [Crossref] [PubMed]
- Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024;310:e232756. [Crossref] [PubMed]
- Soni N, Ora M, Agarwal A, Yang T, Bathla G. A Review of the Opportunities and Challenges with Large Language Models in Radiology: The Road Ahead. AJNR Am J Neuroradiol 2025;46:1292-9. [Crossref] [PubMed]
- Bhayana R, Biswas S, Cook TS, Kim W, Kitamura FC, Gichoya J, Yi PH. From Bench to Bedside With Large Language Models: AJR Expert Panel Narrative Review. AJR Am J Roentgenol 2024;223:e2430928. [Crossref] [PubMed]
- Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology 2023;307:e230582. [Crossref] [PubMed]
- Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, Wang G, Whitlow CT. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Vis Comput Ind Biomed Art 2023;6:9. [Crossref] [PubMed]
- Haver HL, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT. Radiology 2023;307:e230424. [Crossref] [PubMed]
- Fan J, Geng H, Liu X, Wang J. The Effects of Online Text Comments on Patients' Choices: The Mediating Roles of Comment Sentiment and Comment Content. Front Psychol 2022;13:886077. [Crossref] [PubMed]
- Wang S, Kim S, Binder JR, Pylkkänen L. Unlocking the complexity of phrasal composition: An interplay between semantic features and linguistic relations. Cognition 2025;254:105986. [Crossref] [PubMed]
- Yang H, Li H, Yang M, Chen X, Gong M. Estimating the Effects of Sample Training Orders for Large Language Models without Retraining. arXiv:2505.22042.
- He W, Zhang W, Jin Y, Zhou Q, Zhang H, Xia Q. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res 2024;26:e54706. [Crossref] [PubMed]
- Şahin MF, Ateş H, Keleş A, Özcan R, Doğan Ç, Akgül M, Yazıcı CM. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J Med Syst 2024;48:38. [Crossref] [PubMed]
- Doshi R, Amin KS, Khosla P, Bajaj SS, Chheang S, Forman HP. Quantitative Evaluation of Large Language Models to Streamline Radiology Report Impressions: A Multimodal Retrospective Analysis. Radiology 2024;310:e231593. [Crossref] [PubMed]
- Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, Staubli SM. Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J Med Internet Res 2023;25:e47479. [Crossref] [PubMed]
- Xu D, Zhao J, Liu R, Dai Y, Sun K, Wong P, Ming SLS, Wearn KL, Wang J, Xie S, Zeng L, Mu R, Xu C. ChatGPT4's proficiency in addressing patients' questions on systemic lupus erythematosus: a blinded comparative study with specialists. Rheumatology (Oxford) 2024;63:2450-6. [Crossref] [PubMed]
- Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Ross R, Lee M. Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT. Aesthetic Plast Surg 2023;47:1985-93. [Crossref] [PubMed]
- Bernstein IA, Zhang YV, Govil D, Majid I, Chang RT, Sun Y, Shue A, Chou JC, Schehlein E, Christopher KL, Groth SL, Ludwig C, Wang SY. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open 2023;6:e2330320. [Crossref] [PubMed]
- Li H, Moon JT, Iyer D, Balthazar P, Krupinski EA, Bercu ZL, Newsome JM, Banerjee I, Gichoya JW, Trivedi HM. Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports. Clin Imaging 2023;101:137-41. [Crossref] [PubMed]
- Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard. Radiology 2023;307:e230922. [Crossref] [PubMed]
- Guo S, Li R, Li G, Chen W, Huang J, He L, Ma Y, Wang L, Zheng H, Tian C, Zhao Y, Pan X, Wan H, Liu D, Li Z, Lei J. Comparing ChatGPT's and Surgeon's Responses to Thyroid-related Questions From Patients. J Clin Endocrinol Metab 2025;110:e841-50. [Crossref] [PubMed]
- Blease C, Torous J. ChatGPT and mental healthcare: balancing benefits with risks of harms. BMJ Ment Health 2023;26:e300884. [Crossref] [PubMed]
- Luo PW, Liu JW, Xie X, Jiang JW, Huo XY, Chen ZL, Huang ZC, Jiang SQ, Li MQ. DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages. Am J Clin Exp Urol 2025;13:176-85. [Crossref] [PubMed]
- Wu Q, Wu Q, Li H, Wang Y, Bai Y, Wu Y, Yu X, Li X, Dong P, Xue J, Shen D, Wang M. Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study. JMIR Med Inform 2024;12:e55799. [Crossref] [PubMed]
- Hu Y, Chen Q, Du J, Peng X, Keloth VK, Zuo X, Zhou Y, Li Z, Jiang X, Lu Z, Roberts K, Xu H. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc 2024;31:1812-20. [Crossref] [PubMed]
- Leutz-Schmidt P, Palm V, Mathy RM, Grözinger M, Kauczor HU, Jang H, Sedaghat S. Performance of Large Language Models ChatGPT and Gemini on Workplace Management Questions in Radiology. Diagnostics (Basel) 2025;15:497. [Crossref] [PubMed]

