
Evaluation of a large language model to simplify discharge summaries and provide cardiological lifestyle recommendations
The LLM significantly enhanced readability, reducing the required reading level from a college level to a 10th-grade level. Most simplifications generated by the LLM in our study were evaluated as correct, complete, harmless, and comprehensible, but we identified several instances where the model produced potentially harmful inaccuracies: prior diagnoses were mistakenly presented as current ones, and medication dosage information was omitted, which poses a serious risk if patients rely solely on these simplified summaries without consulting their healthcare providers. We also found that the LLM can generate relevant lifestyle recommendations directly from discharge summaries without requiring specific queries, which could complement pharmacological treatment and support prevention.
Notably, the LLM-generated outputs exhibited significantly improved readability and were highly rated by experts for their comprehensibility to medical laypeople, despite their higher word count. The experts in our study also highlighted the importance of providing explanations rather than simply translating medical jargon, as this approach is more likely to improve patient understanding. Effective explanations require an understanding of the context, and LLMs appear to have a distinct advantage in this regard, as they can dynamically capture complex contexts to guide their responses, unlike deterministic rule-based systems that, e.g., only provide layman translations of ICD-10 codes.
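For context, grade-level readability is commonly estimated with surface metrics such as the Flesch-Kincaid grade level, which maps sentence and word length onto a school grade. The sketch below is a minimal, hypothetical illustration of how such a score is computed; it is not the study's actual evaluation pipeline, which assessed German-language texts with its own metrics.

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; real readability tools use dictionaries or hyphenation rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

original = "The patient exhibited paroxysmal atrial fibrillation with rapid ventricular response."
simplified = "The patient's heart sometimes beat fast and out of rhythm."
print(round(flesch_kincaid_grade(original), 1), round(flesch_kincaid_grade(simplified), 1))
```

On this metric, the jargon-heavy sentence scores well above college level, while the plain-language version lands several grade levels lower, mirroring the kind of shift reported in the study.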
The adaptability and flexibility of LLMs do not come without risks. Although most simplifications generated by the LLM in our study were evaluated as correct, complete, harmless, and comprehensible, we identified several instances where the model produced potentially harmful inaccuracies. One of the primary concerns identified in our study is the generation of inaccurate information, known as “hallucinations” in the context of LLMs. Hallucinations occur when the LLM produces plausible-sounding but factually incorrect content41. For example, we found that prior diagnoses were mistakenly presented as current ones. This issue is likely due to the anonymization process used in our study: the exclusion of diagnosis dates may have caused the model to confuse past and present medical conditions. Yet such a misrepresentation of a patient’s medical history, along with other instances of incorrect information generated by the LLM, can have serious consequences in real-world settings, such as inappropriate treatment decisions and unnecessary anxiety for patients.
Another problem we identified was insensitive communication, which can lead to patient confusion and psychological distress. This underscores the need for communication that is both accurate and empathetic in patient-centered summaries, and it presents an additional challenge: the LLM must shift from the typically neutral style of provider-to-provider communication to a more compassionate tone suitable for patients.
Perhaps the most critical issue was the recurring omission of medication dosage information. The absence of this information poses a great risk, especially if patients were to rely solely on these simplified summaries without consulting their healthcare providers. Interestingly, in our study, the omission of dosage information occurred only when using a full-text prompt. Similarly, transitioning from the full-text to the segment-wise prompt improved the output in terms of harmlessness, completeness, and comprehensibility. Although our study lacked the statistical power to detect significant differences, these improvements suggest that refining prompt design could mitigate some of the existing limitations of LLMs without modifying the underlying model architecture15,35. In addition, incorporating further mitigation strategies, such as a human-in-the-loop approach, appears indispensable for balancing the inherent strengths of LLMs with the potential errors stemming from their probabilistic nature—two sides of the same coin.
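To make the prompting distinction concrete, the sketch below contrasts a full-text prompt with a segment-wise prompt. It assumes the OpenAI Python client and GPT-4o; the instruction wording, function names, and segment labels are hypothetical rather than the prompts actually used in the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = "You simplify clinical discharge summaries into plain language for patients."

def simplify_full_text(summary: str) -> str:
    # Full-text prompt: the entire discharge summary is simplified in a single call.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Simplify this discharge summary:\n\n{summary}"},
        ],
    )
    return resp.choices[0].message.content

def simplify_segment_wise(segments: dict[str, str]) -> dict[str, str]:
    # Segment-wise prompt: each section (e.g., diagnoses, medications) is simplified
    # separately, which may help keep details such as dosages in the model's focus.
    simplified = {}
    for name, text in segments.items():
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": f"Simplify the '{name}' section:\n\n{text}"},
            ],
        )
        simplified[name] = resp.choices[0].message.content
    return simplified
```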
Overall, our findings regarding our first research objective align with previous research, demonstrating the potential of using LLMs in medical text simplification34. Our results also resonate with the assertion of Jeblick et al. that the primary goal of simplification should be to enhance clarity and comprehension rather than merely reducing text length14. We also concur with the growing consensus that implementing complementary safeguards and refraining from using LLMs as standalone solutions are essential to mitigate the significant errors they may produce14,15,42.
In addressing our second research objective, we evaluated the LLM’s capability to automatically generate lifestyle recommendations from discharge summaries. Our findings indicate that the LLM produced a substantial number of diverse lifestyle recommendations that medical experts generally considered relevant, evidence-based, complete, consistent, and harmless. This supports and extends previous research by demonstrating that LLMs can produce pertinent recommendations directly from real-world discharge summaries, without requiring specific queries18.
The integration of these recommendations could potentially complement pharmacological treatments and support primary, secondary and tertiary prevention, with minimal additional burden on the treating physician. However, a notable limitation is the generic nature of the recommendations produced. For instance, while advising cardiovascular exercise may be appropriate in many contexts, a patient with a foot injury would require a more tailored recommendation, such as low-impact exercises like swimming, to accommodate their clinical limitations. While some lack of personalization in our study may be partially attributable to the segmented and anonymized nature of the original discharge summaries, our observation of the generic nature of the LLM-generated recommendations aligns with previous research. A prior study demonstrated that, although LLMs can generate relevant and accurate treatment advice and lifestyle recommendations based on MRI reports, they still lack personalization43.
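As one concrete illustration of how such recommendations might be elicited, and how a prompt could push back against the generic quality noted above, the sketch below asks the model to tailor each recommendation to the patient's documented limitations. The prompt text is hypothetical, not the study's prompt, and it again assumes the OpenAI Python client and GPT-4o.

```python
from openai import OpenAI

client = OpenAI()

def lifestyle_recommendations(discharge_summary: str) -> str:
    # Hypothetical prompt: requests recommendations tailored to documented conditions
    # (e.g., suggesting swimming rather than running for a patient with a foot injury).
    prompt = (
        "Based on the following discharge summary, list lifestyle recommendations "
        "for primary, secondary, and tertiary cardiovascular prevention. Tailor each "
        "recommendation to the patient's documented conditions and physical "
        "limitations, and state briefly why it applies to this patient.\n\n"
        f"{discharge_summary}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```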
Several limitations of our study warrant acknowledgment. First, the relatively small sample size of 20 discharge summaries may limit the generalizability of our findings. Second, the exclusion of rare disease diagnoses may further restrict generalizability, as these conditions often require specialized discharge documentation. Third, the non-deterministic nature of LLM responses, characterized by variable outputs even with identical inputs, may affect the reproducibility of our results, as each prompt was applied exactly once per discharge summary. Fourth, our analysis was confined to a single LLM and did not include comparisons with domain-specific or fine-tuned models, which could potentially produce different outcomes. Fifth, while we employed standardized metrics to assess readability and gathered expert evaluations of comprehensibility from a patient’s perspective, these measures do not conclusively demonstrate that the texts are indeed easier for patients to understand. Sixth, due to the emerging nature of research on LLMs for medical text simplification, there is a lack of validated instruments designed to measure the quality of their outputs. Although we based the quality dimensions and Likert-scale assessments on previous studies to ensure comparability, there remains an urgent need for the development and validation of specific scales tailored to this application. Furthermore, the scope of this research precluded a systematic expert grading of the lifestyle recommendations or their comparison against clinical guidelines, which could have provided an additional dimension for quality assessment. Despite these limitations, this study is, to our knowledge, the first to apply the GPT-4o model to real-world cardiological discharge summaries to generate simple language explanations and lifestyle recommendations for German-speaking populations.
Future research should aim to address these limitations by incorporating a larger and more diverse sample of discharge summaries to enhance generalizability. Comparative analyses that explore different prompting techniques and assess both open-source and proprietary models would provide further insight into the performance and reliability of LLMs in clinical settings. It is also essential to directly engage patients in evaluating both objective comprehension and subjective satisfaction with the simplified texts, as well as to determine whether these modifications lead to improved health outcomes post-discharge. Furthermore, future studies should systematically evaluate accountability, equity, security, fairness, and transparency in the design and deployment of these models, given the documented biases present in training data44. Unchecked biases risk producing unequal outcomes for different patient groups and may exacerbate existing health disparities. Finally, the development and validation of specialized instruments for assessing the quality of LLM outputs remain urgent priorities to ensure that such tools meet clinical and ethical standards in real-world healthcare environments.
Despite these efforts, broader concerns may persist around integrating LLMs into standard clinical workflows. As our study demonstrates, LLMs can generate misleading or harmful information, potentially compromising patient safety and raising ethical and legal questions about liability and malpractice. Additionally, the absence of a consensus on acceptable quality benchmarks and the limited research on real-world clinical accuracy complicate their acceptance by healthcare professionals. Concerns over data privacy pose another hurdle, as entering personal data into LLMs without prior anonymization puts patient confidentiality at risk45. While manual anonymization may be feasible in a research setting, it is impractical in a clinical environment and would negate the efficiency gains offered by LLMs’ automatic text generation. Potential solutions, such as employing anonymization algorithms, could be explored to ensure data privacy is maintained without compromising the functionality and utility of LLMs. Finally, LLMs, including GPT-4o, are not currently approved as medical devices and therefore cannot be used in clinical practice. However, the rapid and unregulated use of these models suggests that regulatory bodies will soon need to evaluate them. Such evaluations will present their own set of LLM-specific challenges46,47. Historically, the introduction of machine learning-based medical devices also faced regulatory hurdles. Nonetheless, as of December 2024, the U.S. Food and Drug Administration has authorized 1016 artificial intelligence-enabled medical devices48. We anticipate that, with further research and technological advancements, LLMs will eventually reach a risk/benefit threshold that allows them to obtain regulatory approval.
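To illustrate what an automated anonymization step might look like in such a pipeline, the sketch below masks obvious identifiers with simple regular-expression rules before a summary would be sent to an LLM. The patterns and placeholder tags are hypothetical; production de-identification would require validated, locale-aware tools rather than ad-hoc regexes.

```python
import re

# Hypothetical rule set; a real de-identification system needs far broader coverage.
PATTERNS = {
    "DATE": r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b",
    "PHONE": r"\+?\d[\d\s/-]{7,}\d\b",
    "NAME_TITLE": r"\b(?:Dr|Prof|Mr|Mrs|Ms)\.?\s+[A-Z][a-z]+\b",
}

def anonymize(text: str) -> str:
    # Replace each match with a typed placeholder so the LLM still sees the structure.
    for tag, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{tag}]", text)
    return text

print(anonymize("Seen by Dr. Meier on 12.03.2024, follow-up call under +49 30 1234567."))
```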
A promising development in this area is the emergence of open-source models. As highlighted by Riedemann, Labonne, and Gilbert, open-source models offer the most viable path to regulatory approval as medical devices49: compared to closed-source models, open-source LLMs enable greater control over the model architecture, the source of training data, and update processes. Additionally, open-source models can help address data privacy concerns by allowing more stringent control over data flows and access rights and by enabling on-premise deployment. While there still appears to be a performance gap between open-source and closed-source LLMs, research has shown that fine-tuning open-source models for specific tasks can effectively close this gap50. These advantages make open-source models a compelling option for advancing the use of LLMs in medical applications.
In conclusion, this study provides preliminary evidence that, with further development, LLMs could support the automated generation of patient-centered discharge summaries by improving readability while maintaining a reasonable, though imperfect, level of quality. While the LLM-generated lifestyle recommendations were generally of high quality, they lacked personalization, which may limit their clinical utility. Significant challenges remain, particularly concerning quality assurance, regulatory compliance, and data privacy. Further research is necessary to evaluate the real-world applicability, effectiveness, and safety of LLMs before they can be adopted in routine clinical practice.
Source: https://www.nature.com/articles/s43856-025-00927-2