Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise
Introduction
Large language models (LLMs) now exhibit powerful general conversational capabilities. In mainstream models, the huge corpus used in the pre-training phase injects knowledge into the transformer-based (1) model, enabling it to complete human-computer conversations fluently and, in most cases, correctly. The emergence of LLMs marks an important milestone in natural language processing (NLP): these models not only excel in traditional applications such as natural language text generation, automatic translation, and text summarization (2), but also show great potential in emerging areas such as dialogue systems, knowledge-based question answering, and sentiment analysis. To adapt to the needs of various professional fields, these models are usually fine-tuned on domain-specific question-answering datasets after pre-training, which significantly improves their professional knowledge and question-answering capabilities. For example, medical language models have demonstrated strong disease diagnosis capabilities after fine-tuning on doctor-patient Q&A data (3,4). This transformation not only broadens the range of applications of such models, but also improves their practicality in specific scenarios.
In the medical field, LLMs can support physicians in clinical decision-making. By analyzing large amounts of medical literature, electronic health records (EHRs), and the latest research data, LLMs can quickly provide diagnostic advice, treatment plans, and drug recommendations, helping doctors make more accurate decisions. BenTsao (5) introduced a method that integrates structured medical knowledge bases to fine-tune existing LLMs, allowing the model to refer to authoritative medical information when generating responses and thereby improving the quality and accuracy of its answers. IvyGPT (6) improved the fine-tuning training data by mixing real question-answering data with generated data, thereby expanding the effective amount of training data. To better exploit both data extracted from ChatGPT and real data, HuatuoGPT (7) trained a reward model that aligns the language model with the strengths of the two data types, following the approach of reinforcement learning from AI feedback (RLAIF). Its second-generation model, HuatuoGPT-II (8), proposed a unified domain adaptation protocol that merges continued pre-training and fine-tuning into a single stage. Meanwhile, in doctor-patient communication, LLMs can be used to build intelligent Q&A systems and chatbots that help patients answer common health questions, book outpatient appointments, and manage personal health information (9).
However, despite significant progress in domain-specific knowledge, these models tend to suffer a decline in long-context understanding and instruction-following abilities. This degradation hurts overall performance in practical applications and limits the models in scenarios that require comprehensive understanding and multi-tasking. For example, an LLM fine-tuned with professional data may provide precise medical advice, yet underperform when a conversation spans multiple rounds or when contextual information is needed to understand and answer a question. A normal medical diagnostic inquiry should consist of multiple rounds of coherent, logical dialogue. Existing medical diagnosis models, however, often lose track of previously described symptoms after 2–3 rounds of dialogue, or overlook key symptoms mentioned in lengthy descriptions, thereby degrading the diagnostic results.
Therefore, how to improve, or at least preserve, a model's long-context understanding while maintaining its professional knowledge has become an important research topic. At the heart of this study is finding a balance that allows models to capture deep expertise without losing their ability to understand and process a wide range of texts. Achieving this goal will not only improve the effectiveness of LLMs in professional fields, but also broaden their applicability in practical scenarios. We experimentally evaluate the performance of medical LLMs fine-tuned with different data ratios and quantities, analyze the impact of data composition on model performance, and optimize the fine-tuning data combination strategy to balance medical knowledge against broad language comprehension.
Methods
We designed an evaluation method to test the models' contextual and long-context comprehension abilities (Figure 1). We collected a set of traditional Chinese medicine (TCM) examinations, including physician, nursing, pharmacist, medical technology, professional knowledge, and medical graduate examinations, and used them to test the models. These examinations are in single-choice or multiple-choice form. Unlike studies that test the model directly, we provide the knowledge required to answer the question as a prompt together with the question, and require the model to answer only based on the given information and to directly output the correct option. There are two reasons for this design. First, because we are testing contextual understanding in the medical field, we did not use the traditional method of testing the model with literary articles; the professional knowledge possessed by a medical model helps it understand the medical prompt to a certain extent, so the resulting score can be approximately regarded as a weighted sum of the model's contextual understanding ability and its professional knowledge. Second, this setup better matches the actual usage scenarios of these models and therefore yields a more accurate evaluation.
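A minimal sketch of how such an open-book prompt could be assembled, following the <Known information> and <Question> format shown in Table 4; the function and field names are hypothetical illustrations, not the actual evaluation code:

```python
def build_open_book_prompt(knowledge: str, question: str, options: dict) -> str:
    """Assemble an open-book exam prompt: reference knowledge first, then the
    question, with an instruction to output only the option letters."""
    option_text = " ".join(f"{k}. {v}" for k, v in options.items())
    return (
        f"<Known information> {knowledge} </Known information>\n"
        "The following is a multiple-choice question from the Chinese Medical "
        "Doctor Examination. Please answer based on the above known information. "
        "No analysis or explanation is required. Directly output the answer options:\n"
        f"<Question> {question} {option_text} </Question>"
    )

# Hypothetical usage with the example from Table 4
prompt = build_open_book_prompt(
    knowledge="HIV can infect CD4+ T cells, macrophages, and dendritic cells. ...",
    question="The fundus manifestations of tuberculous uveitis include",
    options={"A": "Retinal vasculitis", "B": "Macular edema", "C": "Choroiditis",
             "D": "Snowball opacity of the lower vitreous",
             "E": "Granulomatous anterior uveitis"},
)
```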

Fine-tuning data
We collected a large number of question-answering datasets for fine-tuning, including the public Alpaca Chinese dataset (10), which mainly serves as the general portion of the data, and a large body of professional data that we collected and curated for fine-tuning professional knowledge. The professional data were classified into traditional Chinese medicine and modern medicine, and all experiments maintained a 1:1 ratio between the two. The private datasets include collected books, cleaned open-source data, question-answering data from forum websites, and data provided by cooperating hospitals.
General model examination
As shown in Figure 2, we first selected several publicly available general large language models (GLLMs) used for daily online question answering. Because these models usually have large parameter counts, open-source medical models are at a disadvantage in a direct comparison, so we also included a small-parameter version of DeepSeek (11). Specifically, GPT-3.5-Turbo and GPT-4 (12) are both from OpenAI and are known for excellent language generation, especially in English conversation, owing largely to the composition of their training data. However, since we test medical question answering in Chinese, we also introduced two long-text multimodal models with excellent Chinese processing capabilities, GLM-4 and Qwen-max. Notably, GLM-4 and Qwen-max even outperform GPT-4 in Chinese contexts, demonstrating strong language understanding and generation. This diverse selection makes our comparative test more comprehensive and better reflects the practical performance of different types of models in the Chinese medical field.

Medical model examination
The Comprehensive Medical Benchmark in Chinese (CMB) (13) is a large-scale benchmark specifically for the Chinese medical field, designed to comprehensively evaluate the performance of language models in medical scenarios. We selected several representative medical large language models (Med-LLMs) from the benchmark's model list for unified testing, to verify how they perform in CMB's professional medical ability evaluation. These models have achieved excellent results in this benchmark, demonstrating their potential in medical question answering, diagnostic reasoning, clinical assistance, and related tasks.
First, the tested models include the earlier HuatuoGPT. This model has achieved solid results in the Chinese medical field; it can handle common medical questions and provide reasonably accurate diagnostic suggestions. Its successor, HuatuoGPT-II, is the second-generation version that further expands the medical training data. HuatuoGPT-II not only greatly increases the data scale, but also introduces more diverse and professional medical knowledge and clinical cases, significantly enhancing its understanding and reasoning in medical scenarios. In the CMB rankings, HuatuoGPT-II's professional performance far exceeds its predecessor's, especially on complex cases and rare diseases, demonstrating its leading position among medical models.
In addition, although the remaining models we tested differ in architecture and training method, these differences are not decisive. Their main difference lies in the composition of the datasets used: some models draw on a wider range of medical literature, clinical data, and diagnostic records, while others focus more on specialized medical exam question banks and case analyses. It is this diversity of training data that leads to differences in performance on specific tasks. Nevertheless, they all performed well in the CMB benchmark, further demonstrating the key role of high-quality medical training data in improving model performance.
Fine-tuning with general data
As shown in Figure 3, to explore whether general data can improve the long-context understanding of medical LLMs, we first fine-tuned the Med-LLMs tested in the previous section using the general data described above. After fine-tuning, we gave these models the same open-book exam as before to evaluate the improvement in their long-context understanding and instruction-following abilities.

During the test, we noticed that even though we explicitly instructed the models to output only the correct answer, some models did not fully follow this instruction. They often analyzed the question or each option in detail before providing the answer, and output the analysis along with it. Although this behavior may demonstrate logical reasoning ability in some situations, for this test the unnecessary analysis violated our instructions and complicated the extraction of results.
We believe this phenomenon not only affects the extraction and subsequent processing of test data, but also reflects the inadequacy of these models in long-context understanding and instruction following. Although the models demonstrate some analytical and reasoning capability, they fail to accurately understand and execute the instruction to answer concisely. This deviation indicates that the models have limitations when facing complex tasks with explicit requirements. Therefore, during evaluation, we counted answers that did not strictly follow the instructions as incorrect.
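A minimal sketch of the scoring rule described above, under the assumption (ours, not the paper's) that any response containing text beyond the option letters is treated as non-compliant and therefore incorrect; the helper name is hypothetical:

```python
import re

def score_response(response: str, correct: str) -> bool:
    """Return True only if the model output consists solely of option letters
    (e.g., 'ABCDE' or 'A, C') and matches the correct answer set."""
    stripped = response.strip()
    # Reject answers that add analysis or explanation beyond option letters.
    if re.fullmatch(r"[A-E](\s*[,、]?\s*[A-E])*", stripped) is None:
        return False
    chosen = frozenset(re.findall(r"[A-E]", stripped))
    return chosen == frozenset(correct)

# Hypothetical usage: a verbose, analysis-style answer is scored as incorrect.
print(score_response("ABCDE", "ABCDE"))                              # True
print(score_response("The fundus manifestations mainly include...", "E"))  # False
```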
Fine-tuning with different data composition
To further study how different mixing ratios of general and professional data affect fine-tuning performance, we designed a second set of experiments. Their core goal is to explore how different data combinations affect the model's contextual understanding and instruction-following capabilities in the medical field during fine-tuning.
We chose multiple parameter versions of the Qwen1.5 model as the experimental subjects. Qwen1.5 was chosen because it provides versions across a range of parameter sizes, from small to large. This diversity allows us to demonstrate the generality of mixed-ratio training at different parameter scales and to observe how model capacity affects the effectiveness of the data mixing strategy.
At the same time, we prepared training datasets with different mixing ratios, each consisting of a certain proportion of general data and medical data. The general data uses Alpaca's public Chinese dataset, which covers a wide range of non-medical question-answer pairs and provides general language understanding. The professional data refers to the medical question-answer pairs collected from the channels mentioned above, which strengthen the model's medical expertise and understanding of terminology.
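A minimal sketch of how such a mixed fine-tuning set could be assembled from Alpaca-style instruction/input/output records (as illustrated in Table 1); the file names and function are hypothetical:

```python
import json
import random

def build_mixture(general_path: str, medical_path: str,
                  ratio_general: int, ratio_medical: int,
                  total: int, seed: int = 42) -> list:
    """Sample records so that general:medical = ratio_general:ratio_medical."""
    rng = random.Random(seed)
    with open(general_path, encoding="utf-8") as f:
        general = json.load(f)   # list of {"instruction", "input", "output"}
    with open(medical_path, encoding="utf-8") as f:
        medical = json.load(f)
    n_general = total * ratio_general // (ratio_general + ratio_medical)
    n_medical = total - n_general
    mixture = rng.sample(general, n_general) + rng.sample(medical, n_medical)
    rng.shuffle(mixture)
    return mixture

# Hypothetical usage: a 1:1 mixture of 100k examples, the setting recommended
# in the Conclusions.
mixed = build_mixture("alpaca_zh.json", "medical_qa.json",
                      ratio_general=1, ratio_medical=1, total=100_000)
```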
Fine-tuning with different data quantity
To explore the effect of data volume on a specific model in the medical field, we focused on the Qwen1.5-7B model, particularly under a purely medical-data setting. Qwen1.5-7B was chosen as the base model mainly because its parameter scale is moderate, which makes it convenient to observe the direct impact of data volume changes on performance without overfitting caused by insufficient model parameters or excessive data. To build this analysis, we prepared a series of professional datasets, gradually expanding from 10k to 200k question-answer pairs, aiming to reveal how the volume of professional data affects the model's contextual reading ability. Considering the breadth and diversity of the medical field, the datasets include traditional Chinese medicine and Western medicine question-answer pairs in a balanced 1:1 ratio to cover different types of exam questions. We then fine-tuned Qwen1.5-7B with these datasets of different scales and administered the same open-book exam as before to evaluate the impact of data volume on performance.
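A minimal sketch of how the scale series could be generated, assuming hypothetical file names and keeping the stated 1:1 split between TCM and Western medicine question-answer pairs; the target sizes follow Table 7:

```python
import json
import random

SIZES = [10_000, 20_000, 50_000, 70_000, 80_000, 90_000, 100_000, 200_000]

def make_scale_series(tcm_path: str, western_path: str, seed: int = 42) -> None:
    """Write one fine-tuning file per target size, keeping TCM:Western at 1:1."""
    rng = random.Random(seed)
    with open(tcm_path, encoding="utf-8") as f:
        tcm = json.load(f)
    with open(western_path, encoding="utf-8") as f:
        western = json.load(f)
    for size in SIZES:
        half = size // 2
        subset = rng.sample(tcm, half) + rng.sample(western, half)
        rng.shuffle(subset)
        with open(f"medical_{size // 1000}k.json", "w", encoding="utf-8") as f:
            json.dump(subset, f, ensure_ascii=False, indent=2)

# Hypothetical usage
make_scale_series("tcm_qa.json", "western_qa.json")
```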
Results
Tables 1-5 present the setup and results of our main experiments: we tested five GLLMs and six Med-LLMs, and finally performed additional experiments using several versions of Qwen1.5. The accuracy rates listed in the tables are the average accuracy over the six examinations we administered.
Table 1
Alpaca general data | Our professional data |
---|---|
“instruction”: “List four examples of natural language processing tasks” | “instruction”: “My cough has not improved recently and there are no other symptoms. What should I do?” |
“input”: “” | “input”: “” |
“output”: “1. Text classification: Classify text into predefined categories, such as sentiment classification or topic classification. 2. Text summarization: Generate a text summary containing key information. 3. Language translation: Translate text in one language into another language, such as English into Spanish. 4. Entity extraction: Extract entities from text, such as names or places, using natural language processing.” | “output”: “If your cough lasts for a long time, it is recommended that you see a doctor for examination to determine the cause of the cough. Coughing may be caused by a variety of reasons, such as colds, allergies, tracheitis, bronchitis, pneumonia, etc. The doctor can conduct relevant examinations and diagnoses based on the symptoms and signs, and give corresponding treatment plans. At the same time, maintaining good living habits, such as drinking more water, getting more rest, avoiding smoking, and reducing exposure to air pollution, can help relieve cough symptoms.” |
“instruction”: “Generate a creative title that summarizes the following article.” | “instruction”: “” |
“input”: “A new report shows that rising carbon dioxide levels are leading to higher temperatures, more extreme weather events, and a rapid decline in biodiversity.” | “input”: “For the following stomach symptoms, please give suggestions on the dosage and dosage of the drug. Stomach pain, nausea, vomiting.” |
“output”: “The dangers of increasing carbon: a rapid climate and biodiversity crisis.” | “output”: “For stomach pain, nausea, and vomiting, the drug selection and dosage recommendations vary slightly depending on the symptoms and causes. Do you have any special information about the time, place, food, or other triggers for the onset of these symptoms? This information will help to more accurately determine the cause and drug selection. Generally speaking, for the above symptoms caused by excessive stomach acid, you can consider taking drugs such as proton pump inhibitors or H2 receptor antagonists. Choose the appropriate drug and dosage according to your personal situation, and it is recommended to take the drug under the guidance of a doctor. In addition, maintaining good eating habits and a regular lifestyle can also help relieve symptoms.” |
Table 2
Model | Average accuracy (all examinations) |
---|---|
GLM-4 | 68.20% |
GPT-3.5-Turbo | 41.28% |
GPT-4 | 57.72% |
Qwen-max | 75.76% |
DeepSeek | 37.52% |
General large language models (GLLMs): advanced artificial intelligence systems trained on extensive text data to understand and generate human-like language. GLLMs can perform various tasks such as text generation, translation, and question answering without task-specific fine-tuning. Examples include GLM-4, GPT-4, etc.
Table 3
Model | Average accuracy (all examinations) |
---|---|
PULSE | 34.65% |
DISC-MedLLM | 33.36% |
WiNGPT2 | 32.71% |
IvyGPT | 32.33% |
HuatuoGPT | 11.22% |
HuatuoGPT-II | 4.73% |
Medical large language models (Med-LLMs): specialized artificial intelligence systems designed to understand and generate human-like language within the medical domain. Med-LLMs are trained on vast amounts of medical text data, enabling them to perform tasks such as clinical note summarization, diagnosis assistance, and treatment recommendation. Examples include PULSE, IvyGPT, etc.
Table 4
Question | Answer (ABCDE) | |||||
---|---|---|---|---|---|---|
PULSE | DISC-MedLLM | WiNGPT2 | IvyGPT | HuatuoGPT | HuatuoGPT-II | |
(<Known information> HIV can infect CD4+T cells, macrophages, and dendritic cells. The correct statement about increased pigmentation is that it is common in melasma. The characteristic of a non-spindle cell nevus is that the nevus cells contain melanin. The most common cause of paraplegia, a serious complication of epidural block anesthesia, is epidural hematoma. Chalazion is common in adolescents. The patient is a 15-year-old female student. She was hospitalized for two weeks due to sudden excitement and disordered behavior. She caught a cold two weeks ago, with a body temperature of 38 ℃. Three days later, her body temperature returned to normal. The patient became excited and talked a lot. She kept calling many classmates, even some classmates she didn’t contact on a regular basis. She liked to go out and gave money to beggars when she saw them. The most likely diagnosis for this patient is a manic episode. Patients with confirmed Behcet’s disease sometimes need to be hospitalized. The following consideration is inappropriate: hoping to be completely cured by hospitalization. The fundus manifestations of tuberculous uveitis include retinal vasculitis, macular edema, choroiditis, snowball opacities of the underlying vitreous, and granulomatous anterior uveitis. Female, 28 years old. Progressive back pain, lower limb weakness, and loss of appetite occurred in the late pregnancy. Physical examination showed mild kyphosis of T7 with percussion pain. X-rays showed narrowing of the T6–7 gap, bulging paravertebral soft tissue shadows, and an erythrocyte sedimentation rate of 60 mm/h. The diagnoses that can be excluded are thoracic metastatic cancer, thoracic hemangioma, suppurative spondylitis, and thoracic disc herniation. Each mental symptom has a clear definition and has the following characteristics: the occurrence of symptoms is not controlled by the patient’s consciousness, the content of the symptoms is not commensurate with the surrounding environment, and the symptoms cause varying degrees of functional impairment to the patient. </Known information> | D | D | ABCDE | CDE | D | The fundus manifestations of tuberculous uveitis mainly include retinal vasculitis, macular edema, choroiditis, and snowball opacities of the underlying vitreous. In addition, another common manifestation is granulomatous anterior uveitis. These fundus manifestations are one of the typical pathological changes of tuberculous uveitis. |
The following is a multiple-choice question from the Chinese Medical Doctor Examination. Please answer the following questions based on the above known information. No analysis or explanation is required. The answer options are directly output: | ||||||
<Question> The fundus manifestations of tuberculous uveitis include A. Retinal vasculitis B. Macular edema C. Choroiditis D. Snowball opacity of the lower vitreous E. Granulomatous anterior uveitis</Question>) |
Table 5
Model | Average accuracy (all examinations) | Improvement |
---|---|---|
HuatuoGPT-II | 18.23% | +13.5% |
IvyGPT | 40.48% | +8.15% |
WiNGPT2 | 38.94% | +6.23% |
HuatuoGPT | 16.20% | +4.98% |
PULSE | 34.50% | −0.15% |
DISC-MedLLM | 28.54% | −4.82% |
Medical large language models (Med-LLMs): specialized artificial intelligence systems designed to understand and generate human-like language within the medical domain. Med-LLMs are trained on vast amounts of medical text data, enabling them to perform tasks such as clinical note summarization, diagnosis assistance, and treatment recommendation. Examples include PULSE, IvyGPT, etc.
Discussion
In this study, we proposed a knowledge-tuning approach leveraging structured Chinese medical knowledge bases to enhance the performance of natural language models for processing medical data. The results demonstrated promising improvements in model accuracy and the generation of contextually appropriate responses. However, there are several limitations and considerations that need to be addressed for future development and broader application.
Limitations of knowledge-tuning approach
One significant limitation of our approach is its heavy reliance on structured Chinese medical knowledge bases. This focus on Chinese-specific resources limits the approach’s applicability to other languages and medical systems. While the method was designed with Chinese medical contexts in mind, its generalizability to non-Chinese medical systems or other linguistic frameworks remains uncertain. This could affect the robustness of the model when applied to different language families, potentially resulting in suboptimal performance or misunderstandings of medical concepts unique to other regions or languages (14). Future research could explore adapting the model to support multilingual and multicultural contexts, using broader medical knowledge bases that span various languages.
Dependence on knowledge base quality
The effectiveness of the method is intrinsically tied to the quality and currency of the knowledge base used. An outdated or incomplete knowledge base could lead to inaccurate medical advice or responses, which may compromise the model’s reliability. We acknowledge that the medical field evolves rapidly, and knowledge bases need continuous updates to reflect the latest findings and clinical guidelines. To mitigate this issue, it would be beneficial to incorporate mechanisms for regularly updating the knowledge base and ensuring that new medical conditions, treatments, and diagnostic tools are included in the training data. This ongoing maintenance would help maintain the accuracy and relevance of the model’s output, reducing the risk of errors arising from outdated information (15).
Over-reliance on specific knowledge base
Another challenge is the potential over-reliance on the specific knowledge base used for training the model. While this approach ensures high accuracy for conditions and treatments covered in the knowledge base, it could limit the model’s ability to generalize to rare or emerging medical conditions that were not included in the training data. For instance, newly discovered diseases or medical conditions may not be represented in the current knowledge base, which could lead to gaps in the model’s understanding. To address this issue, future iterations of the model could benefit from integrating more diverse and comprehensive medical data sources. Including real-world medical case studies, unstructured medical literature, and international guidelines could help the model expand its ability to handle diverse scenarios and improve its generalization capabilities (16).
Risk of misinterpretation
Despite the clear disclaimer within the system’s design that it is not intended for providing direct medical advice, there remains a risk that users may misinterpret the generated responses as authoritative guidance (17). This could lead to unsafe or inappropriate medical decisions. It is crucial to implement more explicit warnings and guidelines that emphasize the tool’s intended role as a support resource rather than a substitute for professional medical consultation. These precautions could include integrating a stronger disclaimer in the interface, providing clearer guidance on the limitations of the system, and encouraging users to seek advice from qualified healthcare professionals for any medical decisions. Such measures would help mitigate the risk of misinterpretation and promote responsible usage of the tool.
Broader implications of the observed trade-offs
We believe that intensive training on specialized domain data causes model parameters to overfit to a specific distribution, crowding out part of the general semantic space. For example, high-frequency terms in medical literature (such as "pathology" and "prognosis") may crowd out the semantic representations of general vocabulary (18). At the same time, the self-attention layers of the Transformer architecture redistribute their weights, and local patterns in specialized domains (such as the structured expressions common in medicine) may weaken global attention to open-domain text. Consider the following case in a multi-round diagnosis scenario:
- Patient: I have blurred vision recently and numbness in my hands and feet.
- Model: It is recommended to check the glycosylated hemoglobin (professionally correct).
- Patient: But the examination last month was normal.
- Model: That may be presbyopia; pay attention to eye hygiene (misled by general knowledge).
Why GLLMs perform better than Med-LLMs in long-context tasks and how to bridge the gap
Our analysis shows that GLLMs outperform Med-LLMs on long-context tasks, mainly because of differences in the distribution of the training and fine-tuning data and in the training objectives. (I) Data diversity and complexity: general data encompasses a wide range of topics and domains, including everyday conversations, news articles, literary works, and scientific papers. This diversity exposes the model to varied linguistic expressions and contextual structures, incorporating rich rhetorical devices, polysemy, metaphors, and more, which helps the model learn deeper language patterns. In contrast, specialized data (e.g., medical data) is typically confined to a specific domain, with language that is relatively uniform, standardized, direct, and concise. (II) Context length and structure: general data often includes longer contexts, such as lengthy articles or continuous dialogues, which helps the model learn to process and understand long-range dependencies. These contexts are also structurally diverse, encompassing narratives, arguments, descriptions, and more, enabling the model to learn different context-processing strategies. Specialized data, on the other hand, usually features shorter contexts, such as medical records or experimental reports, offering few opportunities for long-context training; moreover, its contextual structures are relatively fixed and standardized, so once downstream tasks deviate from this standardized distribution, the model's long-context comprehension deteriorates catastrophically. (III) Training objectives and tasks: general data is typically used for multi-task learning, requiring the model to perform well across language generation, translation, summarization, and more, which enhances generalization and fosters a more comprehensive understanding of context. In contrast, specialized data is often used for narrow tasks, such as disease diagnosis or drug recommendation, so the model's learning scope is limited and it lacks exposure to broader language patterns.
To address these limitations, this paper introduces more characteristics of general data into the professional data and adopts a hybrid training strategy that combines the advantages of general and professional data to improve the model's long-context understanding. Additional methods can also help bridge the gap between GLLMs and Med-LLMs, such as memory-augmented fine-tuning, which introduces episodic memory architectures that allow models to recall prior exchanges within multi-turn consultations, and retrieval-augmented generation (RAG), which embeds medical literature databases directly into the inference pipeline to reinforce contextual retention, as sketched below.
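A minimal sketch of how RAG could be attached to a Med-LLM's inference pipeline; it is an illustration under our own assumptions, not part of the paper's method. The retriever here uses simple TF-IDF over a hypothetical literature corpus, and the final prompt reuses the <Known information> format from the open-book exam:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical medical literature snippets; a real system would index full databases.
corpus = [
    "Glycosylated hemoglobin (HbA1c) reflects average blood glucose over 2-3 months.",
    "Peripheral neuropathy with blurred vision can indicate poorly controlled diabetes.",
    "Presbyopia is an age-related loss of near focusing ability, typically after 40.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list:
    """Return the k corpus snippets most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def rag_prompt(dialogue_history: str, user_turn: str) -> str:
    """Prepend retrieved evidence and the full dialogue so the model keeps context."""
    evidence = "\n".join(retrieve(user_turn))
    return (f"<Known information> {evidence} </Known information>\n"
            f"{dialogue_history}\nPatient: {user_turn}\nDoctor:")

# Hypothetical usage based on the multi-round example above: the retrieved
# evidence discourages the misleading "presbyopia" reply when the history
# mentions numbness in the hands and feet.
print(rag_prompt("Patient: I have blurred vision and numb hands and feet.\n"
                 "Doctor: It is recommended to check glycosylated hemoglobin.",
                 "But my examination last month was normal."))
```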
Conclusions
We tested some publicly available GLLMs designed for everyday online question answering (Table 2). The results show that the contextual and instruction-following abilities of GLLMs are good, with most accuracy rates above or close to 50%. Their performance depends not only on the parameter count and architecture of the model itself, but also on its training methods and data; models trained with multi-round or long-context dialogues achieve excellent results in this test. To keep the comparison with the medical models fair, we also selected a low-parameter version of DeepSeek, whose score dropped to 37.52%.
The same method was used to test the six selected open-source Med-LLMs, and their accuracy was unsatisfactory compared with the GLLMs (Table 3). It is worth noting that HuatuoGPT-II has outstanding medical capabilities among these models, yet its average accuracy on the open-book test is only 4.73%, much lower than the others. The 7B-parameter HuatuoGPT-II has been evaluated on professional medical benchmarks, where its responses in the Chinese medical context surpass GPT-4 and others. Examining its answers, we found that it often ignores the instruction to "directly output the correct option", analyzes each option, and only then outputs the likely correct option (Table 4). It evidently has excellent diagnostic capability in the medical context, thanks to high-quality training data. Meanwhile, in using PULSE, we found that it often relies on multiple rounds of dialogue to gather more information rather than diagnosing the described symptoms from a single inquiry, which helps explain why it obtained the best score among the six models in this test.
To explore whether general data can improve the long-context understanding of Med-LLMs, we fine-tuned the Med-LLMs tested in the previous section using the general data mentioned above, and then gave these models the same open-book test as before to evaluate the change in their long-context understanding and instruction-following ability (Table 5). The results show that fine-tuning with general data does bring improvement, which means general data can improve the contextual understanding of Med-LLMs. This is most obvious in HuatuoGPT-II, the most medically specialized of these models, which improved by 13.5%. Its fine-tuning on large amounts of medical question-answering data left its long-context understanding insufficient; once fine-tuned again with general data, its long-context reading ability improved greatly. This also suggests a trade-off between context reading ability and professional ability, and more attention should be paid to general data when fine-tuning such models. Conversely, models that achieved excellent results in the earlier tests may decline: these models emphasized general data during their first fine-tuning and balanced the two well, so they are naturally good at reading long contexts, and the second fine-tuning destroyed this well-balanced data ratio and produced a negative effect.
Based on the above understanding, we believe that general data plays a vital role in improving long-context understanding when fine-tuning medical professional models. In the experiment on the ratio of general to professional data in the fine-tuning set, the results (Table 6) show a basic trend: the higher the proportion of general data, the better each model performs in the open-book exam. This finding underlines the importance of general data for long-context understanding and instruction following, and suggests retaining a certain proportion of general data when fine-tuning professional models to maintain comprehensive language understanding. We attribute this phenomenon to the fact that medical data is usually more specialized, covering relatively narrow content and forms, whereas general question-answering data covers a wider range of topics and language forms, including the more complex structures and diversity of natural language communication, which allows the model to learn broader context-understanding capabilities. Across all models, the accuracy difference between the 9:1 and 1:1 ratios is basically no more than 2%; therefore, to preserve long-context understanding while also enriching professional knowledge, 1:1 may be the better fine-tuning data ratio.
Table 6
Model | 9:1 data ratio accuracy (all examinations) | 4:1 data ratio accuracy (all examinations) | 1:1 data ratio accuracy (all examinations) | 1:4 data ratio accuracy (all examinations) | 1:9 data ratio accuracy (all examinations) |
---|---|---|---|---|---|
Qwen1.5-1.8B | 39.02% | 38.64% | 37.58% | 33.38% | 33.73% |
Qwen1.5-4B | 52.53% | 52.63% | 51.14% | 49.96% | 49.51% |
Qwen1.5-7B | 62.96% | 62.27% | 60.42% | 59.96% | 59.65% |
Qwen1.5-14B | 68.81% | 68.53% | 68.10% | 65.12% | 64.85%
The ratio is general data to professional data.
Following the results of the data-ratio experiments, we set the ratio of traditional Chinese medicine to modern medicine to 1:1 and the ratio of general data to medical data to 1:1 for this test. The experimental results (Table 7) reveal an important phenomenon: in the early stage, when the amount of data is limited, increases or decreases in data volume have a significant impact on the contextual understanding and instruction-following performance of the medical LLMs. In other words, small changes in data volume during fine-tuning produce noticeable fluctuations in performance, and adding data may even be counterproductive, highlighting the importance of data volume at this stage. As the training data continues to expand, the overall performance of the model trends steadily upward, indicating that more samples help the model learn features more comprehensively and thereby improve its handling of complex medical information. Notably, however, once the data volume reaches a certain threshold, the rate of improvement slows and eventually plateaus, indicating that the model reaches a "data saturation point". Blindly increasing the data volume therefore does not always bring the expected performance leap. While medium data volumes introduce some uncertainty, performance at low and high data volumes does not differ greatly, which suggests that the conclusions drawn above about the fine-tuning data ratio are robust.
Table 7
Model | Data quantity | Accuracy (all examinations) |
---|---|---|
Qwen1.5-7B | 10 k | 57.60%
 | 20 k | 55.08%
 | 50 k | 53.64%
 | 70 k | 54.83%
 | 80 k | 51.71%
 | 90 k | 53.41%
 | 100 k | 58.71%
 | 200 k | 56.12%
In conclusion, this series of studies explored the gap between Med-LLMs and GLLMs in long-context understanding, and examined through experiments how the ratio and quantity of general versus medical data affect the context understanding and instruction-following capabilities of Med-LLMs. We found that GLLMs perform well in long-context understanding and instruction following, while medical LLMs do not, although they can be improved to a certain extent through further fine-tuning; the performance of Qwen1.5 under different data mixing ratios showed that the higher the proportion of general data, the stronger the model's context understanding; and when only a small amount of medical data is used, different data amounts have a significant impact on model capability, making data volume control crucial to optimizing the fine-tuning process. Fine-tuning with a 1:1 ratio of general to professional data and a data volume of about 100k can alleviate the limitations of Med-LLMs in long-context scenarios. This provides guidance for the subsequent fine-tuning of LLMs.
Acknowledgments
None
Footnote
Funding: This work was supported by the Science and Technology Development Fund of Macao (No. 0004/2024/E1B1) and the Macao Polytechnic University Grant (No. RP/FCA-08/2024).
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2655/coif). All authors report that this work was supported by Science and Technology Development Fund of Macao (No. 0004/2024/E1B1) and Macao Polytechnic University Grant (No. RP/FCA-08/2024). The payments were made to the author’s institution (Macao Polytechnic University). The authors have no other conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017:6000-10.
- Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI 2019;1:9.
- Tian D, Jiang S, Zhang L, Lu X, Xu Y. The role of large language models in medical image processing: a narrative review. Quant Imaging Med Surg 2024;14:1108-21. [Crossref] [PubMed]
- Liu F, Liu M, Li M, Xin Y, Gao D, Wu J, Zhu J. Automatic knowledge extraction from Chinese electronic medical records and rheumatoid arthritis knowledge graph construction. Quant Imaging Med Surg 2023;13:3873-90. [Crossref] [PubMed]
- Wang H, Zhao S, Qiang Z, Li Z, Xi N, Du Y, Cai M, Guo H, Chen Y, Xu H, Qin B. Knowledge-tuning large language models with structured medical knowledge bases for reliable response generation in Chinese. arXiv preprint arXiv:2309.04175. 2023 Sep 8. doi: 10.1145/3686807.
- Wang R, Duan Y, Lam C, Chen J, Xu J, Chen H, Liu X, Pang PC, Tan T. IvyGPT: Interactive Chinese pathway language model in medical domain. In: Fang L, Pei J, Zhai G, Wang R. editors. Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science. Singapore: Springer; 2023:378-82.
- Zhang H, Chen J, Jiang F, Yu F, Chen Z, Li J, Chen G, Wu X, Zhang Z, Xiao Q, Wan X. HuatuoGPT, towards taming language model to be a doctor. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023:10859-85.
- Chen J, Wang X, Ji K, Gao A, Jiang F, Chen S, Zhang H, Song D, Xie W, Kong C, Li J. HuatuoGPT-II, one-stage training for medical adaption of LLMs. arXiv preprint arXiv:2311.09774. 2023 Nov 16. doi: 10.48550/arXiv.2311.09774.
- Bai J, Bai S, Chu Y, Cui Z, Dang K, Deng X, Fan Y, Ge W, Han Y, Huang F, Hui B. Qwen technical report. arXiv preprint arXiv:2309.16609. 2023 Sep 28. doi: 10.48550/arXiv.2309.16609.
- Alpaca Chinese Dataset. 2023. Retrieved October 10, 2023. Available online: https://github.com/hiyouga/LLaMA-Factory
- Bi X, Chen D, Chen G, Chen S, Dai D, Deng C, Ding H, Dong K, Du Q, Fu Z, Gao H. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954. 2024 Jan 5. doi: 10.48550/arXiv.2401.02954.
- Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15. doi: 10.48550/arXiv.2303.08774.
- Wang X, Chen GH, Song D, Zhang Z, Chen Z, Xiao Q, Jiang F, Li J, Wan X, Wang B, Li H. CMB: A comprehensive medical benchmark in Chinese. arXiv preprint arXiv:2308.08833. 2023 Aug 17. doi: 10.48550/arXiv.2308.08833.
- Qin L, Chen Q, Zhou Y, Chen Z, Li Y, Liao L, Li M, Che W, Yu PS. A survey of multilingual large language models. Patterns (N Y) 2025;6:101118. [Crossref] [PubMed]
- Mousavi SM, Alghisi S, Riccardi G. DyKnow: Dynamically verifying time-sensitive factual knowledge in LLMs. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics; 2024:8014-29. doi: 10.18653/v1/2024.findings-emnlp.471.
- Wu J, Dong H, Li Z, Wang H, Li R, Patra A, Dai C, Ali W, Scordis P, Wu H. A hybrid framework with large language models for rare disease phenotyping. BMC Med Inform Decis Mak 2024;24:289. [Crossref] [PubMed]
- Goodman RS, Patrinely JR, Stone CA Jr, Zimmerman E, Donald RR, Chang SS, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023;6:e2336483. [Crossref] [PubMed]
- Sultan MD, Sil A, Florian R. Not to overfit or underfit the source domains? An empirical study of domain generalization in question answering. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022:3752-61. doi: 10.18653/v1/2022.emnlp-main.247.