Utrecht University

Medical Text Simplification using Transformer-Based Models

Fine tuning the T5 encoder-decoder architecture

Mischa Brinkhuis - 6574580

Michail Mollov - 6291643

Ruben Swarts - 6945333

Chutian You - 9353992

wordcount: XXXX
4-13-2025

Introduction:

Experts and governments alike have been confronted with the following challenge for decades: how to communicate to ‘the public’ in a way that is understandable for all, while the experts agree with the content? Experts tend to use technical jargon and terminology specific to their fields, which can hinder public understanding.

In recent years, this issue has gathered more attention. In 2023, research showed that 30% of Dutch citizens struggle to understand government communication (Universiteit Utrecht, 2023), illustrating that crucial information does not reach a large part of the general public.

Text simplification could help bridge this gap. A 2015 study at the University of Amsterdam showed that text simplification improves ‘recall and attitudes, among low health literacy people’ (Meppelink, et al., 2015).

The discussion on simplified text is especially prominent in the medical field, where the stakes of good communication about health issues are high. Miscommunication about medical conditions may cause significant harm to patients and the general public. Over 15 years ago, articles were already being published about simplifying medical communication (Carmona, 2009). In recent years, technology has been given a bigger role in tackling this problem. For example, apps were introduced that function as a dictionary for medical terms: a user enters a word they do not understand and receives an explanation of it (Hendawi, Alian, & Li, 2022). Although this was a major improvement, researchers wanted to improve these assistance programs further, ideally by ‘translating’ entire texts written by medical experts into more easily understandable text.

This is where artificial intelligence can play a role. Machine learning provides the tools to build models that could simplify medical texts (University of Arizona, 2023). A big role is now predicted for Large Language Models, as they excel in translation tasks (Gaper, 2023). The aim of this project is to fine-tune existing models to simplify medical texts.
The following research question will be addressed:

RQ:
How effectively can transformer-based models be fine-tuned to simplify complex medical texts while preserving critical information?

This research question will be answered by identifying the major challenges in medical text simplification, reviewing existing transformer-based models used for this task, and evaluating their effectiveness. In particular, we will fine-tune the T5 encoder-decoder model on the unfiltered MultiCochrane corpus. Additionally, the implementation of our own transformer-based solution for medical text simplification will be discussed and evaluated using external benchmarking to assess its strengths and weaknesses.

Literature review

Transformer-based models provide a promising solution to the challenges of medical text simplification due to their ability to process and generate contextually accurate, fluent, and coherent text. Firstly, they excel at terminology preservation by leveraging pre-trained embeddings and fine-tuning on domain-specific corpora, ensuring that medical terms are either retained or accurately paraphrased (Raffel et al., 2020). Models like BioBERT and SciBERT, trained on biomedical literature, improve understanding of specialized vocabulary and mitigate the risk of incorrect substitutions (Lee et al., 2020). Secondly, they handle contextual ambiguity effectively through self-attention mechanisms, allowing them to capture long-range dependencies and infer meaning based on surrounding text, reducing the likelihood of misinterpretation (Vaswani et al., 2017). Thirdly, transformers improve sentence structure and readability by learning from large-scale parallel corpora of complex and simplified texts, enabling models like T5 and BART to generate clearer, more concise sentences without losing essential details (Lewis et al., 2020). Fourthly, they address dataset limitations by enabling transfer learning, where pre-trained models adapt to smaller domain-specific datasets with relatively low supervision, reducing the need for large annotated corpora (Brown et al., 2020). Lastly, trust and factual accuracy are maintained through reinforcement learning and controlled text generation techniques, which guide models to prioritize factual consistency while simplifying content (Phatak et al., 2023). Given these strengths, transformer-based models offer a scalable, automated approach to medical text simplification, balancing clarity with medical precision.

T5 (Text-to-Text Transfer Transformer) has been widely adopted for medical text simplification due to its ability to handle complex text-to-text transformations. Fine-tuned T5 models, such as SciFive and Biomedical-T5, have demonstrated strong performance in simplifying medical literature by leveraging domain-specific training on biomedical corpora (Gao et al., 2022). Studies show that T5 excels at sentence restructuring and readability improvement, as it generates fluent, well-formed sentences while maintaining the integrity of medical information (Li et al., 2023). However, one of the key challenges with T5-based models is terminology preservation, as some studies report that fine-tuned versions may oversimplify or rephrase key medical terms inaccurately (Zhu et al., 2023). Despite this, T5’s flexibility and effectiveness in abstractive simplification make it one of the most promising transformer-based models for medical NLP applications.

BART (Bidirectional and Auto-Regressive Transformer) has gained attention for its denoising autoencoding approach, which enables it to reconstruct and simplify complex medical texts effectively (Lewis et al., 2020). Research indicates that BART performs particularly well in handling contextual ambiguity, as its bidirectional encoding allows it to infer meaning from surrounding text more effectively than standard left-to-right generative models (Gao et al., 2022). Studies using BART for summarization tasks in clinical settings have shown that it preserves factual accuracy better than other generative models, making it a reliable choice for simplifying medical texts without distorting critical information (Phatak et al., 2023). However, while BART produces more readable and accurate summaries, it sometimes lacks domain-specific knowledge, requiring additional fine-tuning on medical datasets for optimal performance (Zhu et al., 2023).

Unlike T5 and BART, which are primarily used for abstractive simplification, BERT variants like BioBERT, SciBERT, and MedBERT are more commonly applied to lexical simplification—replacing complex medical terms with simpler synonyms while maintaining meaning (Lee et al., 2020). These models excel in terminology preservation, as they leverage pre-trained embeddings from large biomedical corpora, improving their ability to identify and replace difficult words accurately (Shardlow et al., 2023). However, BERT-based approaches generally struggle with sentence restructuring and overall readability improvements, as they are not inherently designed for text generation but rather for classification and extraction tasks (Li et al., 2023). Despite this, BERT models remain useful in hybrid systems where they assist in identifying difficult terms before a generative model like T5 or BART rewrites the full text (Gao et al., 2022).

Data & Methodology:

In our study, the unfiltered MultiCochrane corpus is used for training and testing different transformer models on medical text simplification. The MultiCochrane corpus was sourced from the Cochrane Library, the leading database for systematic reviews in health care, and has been used in existing literature with a similar objective of simplifying medical messages (Joseph et al., 2023). The simplified sentences in MultiCochrane were written and peer-reviewed by professionals. The corpus comprises three main columns: “input_text” contains the original medical sentences, “target_text” contains their simplified counterparts, and the third column gives the source of the original text.

The training, validation and test sets of the MultiCochrane corpus were downloaded from GitHub. The training set consists of 61194 sentence pairs, the test set of 395 sentence pairs and the validation set of 83 sentence pairs. For consistency and comparability with previous literature, we loaded these three sets directly for our analysis. An initial exploratory analysis of the training set found that the original sentences have an average length of about 26 words and the simplified sentences about 20 words.
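The length comparison can be illustrated with a minimal sketch. It assumes that words are counted by splitting on whitespace, which may differ from the counting used in our exploratory analysis, and it uses two toy pairs in place of the real “input_text” and “target_text” columns; the function name is hypothetical.

```python
from statistics import mean

def average_word_counts(pairs):
    """Return (mean source length, mean target length) in whitespace-separated words."""
    source_lengths = [len(source.split()) for source, _ in pairs]
    target_lengths = [len(target.split()) for _, target in pairs]
    return mean(source_lengths), mean(target_lengths)

# Toy (input_text, target_text) pairs; the real corpus has 61194 training pairs.
toy_pairs = [
    ("Seven RCTs including 960 participants were identified.",
     "We identified seven studies including 960 participants."),
    ("Anti-D can be administered by IM or IV injection.",
     "Anti-D can be administered by intramuscular or intravenous injection"),
]

src_avg, tgt_avg = average_word_counts(toy_pairs)
print(f"average source length: {src_avg}, average target length: {tgt_avg}")
```

On the full training set, the same computation would be applied to every row of the two text columns.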

Initial testing on T5 and BART models highlighted the challenge of ‘parroting’ back the prompt rather than simplifying the text. To address this challenge, a sequential fine-tuning pipeline was applied: before the simplification task on the MultiCochrane dataset, the model was first fine-tuned on WikiLarge, a dataset of aligned sentences extracted from ordinary English Wikipedia and Simple English Wikipedia that is widely used as a benchmark for text simplification (He et al., 2024). In this pipeline, the selected model was first run on WikiLarge to obtain the best trial based on the training score. Afterwards, we ran the trained model on MultiCochrane and obtained the new best trial based on the new training score. We expected sequential fine-tuning to increase simplification performance compared to running the model directly on MultiCochrane, as the model first learns the general task of simplification on more abundant data and then specializes in the medical domain on scarcer data.
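The control flow of this pipeline can be sketched as follows. This is only an outline of the staging, not the training code used in the study: `fine_tune` is a hypothetical placeholder for an actual training run, and `fake_fine_tune` merely records which dataset each stage has seen.

```python
from typing import Callable, Sequence

def sequential_fine_tune(checkpoint: str,
                         datasets: Sequence[str],
                         fine_tune: Callable[[str, str], str]) -> list[str]:
    """Fine-tune on each dataset in order, starting every stage from the
    checkpoint produced by the previous stage. Returns all checkpoints."""
    history = [checkpoint]
    for dataset in datasets:
        # Hypothetical training call: takes a checkpoint and a dataset name,
        # returns the identifier of the best resulting checkpoint.
        checkpoint = fine_tune(checkpoint, dataset)
        history.append(checkpoint)
    return history

def fake_fine_tune(checkpoint: str, dataset: str) -> str:
    # Stand-in for a real training run; it trains nothing.
    return f"{checkpoint}+{dataset}"

stages = sequential_fine_tune("t5-small", ["wikilarge", "multicochrane"], fake_fine_tune)
print(stages)
```

The point of the design is that the second stage never starts from the pre-trained weights: it always continues from the best general-domain checkpoint.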

As for the transformer models used in this study, we focused on the T5 family. T5 models were chosen because of the numerous text-to-text tasks they were pre-trained on, including summarization, translation, classification and similarity testing (Raffel et al., 2020). Given the structural similarity of these tasks to simplification, such comprehensive pre-training is likely to improve simplification quality and to yield better scores than transformer models without it. Moreover, T5 models have already been used in previous research on a similar simplification task (Joseph et al., 2023), so choosing T5 also keeps our results comparable with that work. In the analysis, we first applied the T5-large and T5-small models in a quick sweep on the MultiCochrane dataset. We expected T5-large to perform best in the sweep due to its larger number of parameters. However, only the T5-small model received the subsequent fine-tuning on WikiLarge and the final training, validation and testing on the MultiCochrane dataset for medical text simplification. The reason was the limited computing power of our machines, and we believed that the fine-tuning process could compensate for T5-small’s lower number of parameters. Moreover, T5-small allowed us to perform hyperparameter sweeps that optimized the combination of learning rate, learning rate scheduler type, warmup steps, weight decay and label smoothing factor for some epochs in terms of task performance.
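A sweep over these five hyperparameters could look like the sketch below. It assumes plain random search; the value ranges, the number of trials and the toy objective are illustrative assumptions rather than the settings of our experiments, and in practice the objective would be a short training run that returns the task score.

```python
import random

# Assumed search space: only the five parameter names come from the study.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -3),  # log-uniform draw
    "lr_scheduler_type": lambda rng: rng.choice(["linear", "cosine", "constant"]),
    "warmup_steps": lambda rng: rng.choice([0, 100, 500, 1000]),
    "weight_decay": lambda rng: rng.uniform(0.0, 0.1),
    "label_smoothing_factor": lambda rng: rng.uniform(0.0, 0.2),
}

def sample_trial(rng):
    """Draw one configuration from the search space."""
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}

def run_sweep(objective, n_trials, seed=0):
    """Sample n_trials configurations and return the one with the highest score."""
    rng = random.Random(seed)
    trials = [sample_trial(rng) for _ in range(n_trials)]
    return max(trials, key=objective)

def toy_objective(config):
    # Placeholder score that prefers learning rates near 3e-4.
    return -abs(config["learning_rate"] - 3e-4)

best = run_sweep(toy_objective, n_trials=20)
print(best)
```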

Due to initially disappointing results, specifically model parroting, and the computational resources required for proper training, we were unable to include experiments with BART and BERT models in this report. However, the same strategy can be applied to these models. Moreover, an ‘anti-parroting’ setup was tried, which consisted of fine-tuning a model on a metric that combines the SARI score and the Jaccard similarity, rewarding simplification whilst punishing copying.
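One way such a combined metric could be formed is sketched below. The linear penalty and the weight of 0.5 are assumptions made purely for illustration and should not be read as the exact formula behind the ‘anti-parroting’ model; the function name is hypothetical.

```python
def anti_parroting_score(sari: float, jaccard: float, copy_weight: float = 0.5) -> float:
    """Reward simplification (SARI, 0-100) and penalize copying (Jaccard, 0-1).

    Assumed form: SARI minus a linear penalty on the overlap with the source.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie between 0 and 1")
    return sari - copy_weight * 100.0 * jaccard

# Same SARI, different amounts of copying from the source sentence.
copied = anti_parroting_score(sari=40.0, jaccard=1.0)
rewritten = anti_parroting_score(sari=40.0, jaccard=0.5)
print(copied, rewritten)
```

With this form, two outputs that earn the same SARI are ranked by how much they depart from the source, which is the behavior the setup was meant to encourage.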

To evaluate our simplification results, we focus on the SARI metric, in line with existing literature (Joseph et al., 2023; Lyu & Pergola, 2024). “SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system.” (Von Werra, 2023).
In addition, we included the Jaccard similarity to compare the generated simplified sentences with the original ones. Under successful testing, we expected a high SARI score (range 0-100, thus closer to 100) and a Jaccard similarity neither too close to 1 (completely copying the source text) nor too close to 0 (completely different from the source text).
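For reference, the Jaccard similarity between a source sentence and a generated sentence can be computed as in the following sketch, which returns values on a 0 to 1 scale. Lowercasing and extracting word characters with a regular expression are tokenization assumptions and may differ from the exact preprocessing in our evaluation.

```python
import re

def jaccard_similarity(source: str, output: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    source_tokens = set(re.findall(r"\w+", source.lower()))
    output_tokens = set(re.findall(r"\w+", output.lower()))
    if not source_tokens and not output_tokens:
        return 1.0  # two empty texts are treated as identical
    return len(source_tokens & output_tokens) / len(source_tokens | output_tokens)

source = "Seven RCTs including 960 participants were identified."
output = "We identified seven studies including 960 participants."
score = jaccard_similarity(source, output)
print(round(score, 2))
```

A sentence copied verbatim scores 1, and a sentence sharing no words with the source scores 0.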

Ethical considerations of the study

Doing research in the medical field requires high awareness of the ethical side of the research, as medical data could be very personal and are to be handled carefully. Also, mistakes in input data or the output of a model could lead to severe problems for patients' health. Therefore, it is important to take a moment to dwell upon some ethical considerations of this research.

An important point to note straight away is that the data used in this research is anonymous: no person-specific data is found in the dataset. The texts in the dataset were written in a general way that does not target or mention an identifiable person. Therefore, no extra steps seem to be necessary to ensure that everyone’s privacy is respected.

Another major ethical consideration in this research is that the meaning of the texts should not change, and that no one should end up missing information that is important in their situation. Changing a text, which simplification does, carries the risk of altering the meaning or content so that it no longer tells the same story as the original. Factual accuracy should therefore be prioritized, for instance by having experts assess the accuracy of the model before it is deployed.

Another threat to the transfer of information is that people with high medical literacy may feel patronized and prefer to read the texts as written by the experts. Removing difficult terms from medical texts can also reduce how seriously people take them. Especially in authoritative positions or systems, people sometimes ‘want’ texts to be written in a way that shows the author has more knowledge of the matter. Therefore, the original text should always be available alongside the simplified one.

A last potential issue is that of bias. If the training data is biased, for instance because it has a strong focus on western medicine and the western population, the model will display biased results as well. Therefore, it should be made sure that the dataset does not show any representation issues, and the input should be diversified if possible.

Results

To systematically evaluate the effectiveness of various model configurations for the medical text simplification task, we compared several T5 variants using standard automatic metrics: BLEU, ROUGE-1, ROUGE-2, ROUGE-L, SARI, and Jaccard similarity, with SARI being the most important given the simplification goal. These metrics assess different aspects of simplification quality: BLEU and ROUGE focus on n-gram overlap with reference texts, SARI evaluates simplification-specific operations (add, keep, delete), and Jaccard similarity provides a basic lexical overlap measure.

Model	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	SARI	Jaccard Similarity
T5 small	0.13	0.30	0.16	0.26	38.63	0.71
T5 base	0.14	0.30	0.17	0.26	39.68	0.70
T5 large	0.06	0.22	0.12	0.20	41.35	0.61
T5 small seq-fine-tune	0.21	0.39	0.23	0.34	43.95	0.72
T5 small ‘anti parrot’	0.16	0.38	0.21	0.33	38.08	0.76
T5 small first-step seq fine-tune	0.20	0.40	0.23	0.35	44.78	0.80
T5 large optimal sweep	0.20	0.41	0.24	0.36	44.14	0.81

The first three models (T5 small, base, and large) were not fine-tuned for the simplification task and serve as baseline references to contextualize the gains achieved through task-specific training.

Among the fine-tuned models, T5 small with sequential fine-tuning (i.e., trained first on general-domain data and then on medical-domain data) achieved strong performance across the board. In particular, the first-step model, fine-tuned on the WikiLarge dataset before any medical-domain adaptation, showed the highest SARI score (44.78) together with a high Jaccard similarity (0.8019). The ‘T5 large optimal sweep’ did not follow the same strategy, but a simple hyperparameter sweep did increase its performance (SARI 44.14 versus 41.35 for the T5 large baseline). Notably, this is still below the first-step T5 small model (44.78), although slightly above the fully sequentially fine-tuned T5 small (43.95), which suggests that the sequential strategy lets a much smaller model match the simplification ability that a larger model reaches through a hyperparameter sweep alone. Strikingly, the Jaccard similarity of T5 large rose from 0.61 to 0.81 after the sweep, indicating that the ‘T5 large optimal sweep’ stays closer to the source text than the baseline models and most of the other fine-tuned models.
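The comparison can be reproduced from the SARI and Jaccard columns of the results table with a few lines of arithmetic. The sketch below ranks the models by SARI and reports each model's SARI gain over the untuned T5 small baseline; the variable names are illustrative only.

```python
# (SARI, Jaccard similarity) per model, copied from the results table.
results = {
    "T5 small": (38.63, 0.71),
    "T5 base": (39.68, 0.70),
    "T5 large": (41.35, 0.61),
    "T5 small seq-fine-tune": (43.95, 0.72),
    "T5 small 'anti parrot'": (38.08, 0.76),
    "T5 small first-step seq fine-tune": (44.78, 0.80),
    "T5 large optimal sweep": (44.14, 0.81),
}

baseline_sari = results["T5 small"][0]

# Models ordered from highest to lowest SARI.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)

# SARI gain of every model relative to the untuned T5 small baseline.
gains = {name: round(sari - baseline_sari, 2) for name, (sari, _) in results.items()}

for name in ranking:
    sari, jaccard = results[name]
    print(f"{name:36s} SARI={sari:5.2f} gain={gains[name]:+5.2f} Jaccard={jaccard:.2f}")
```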

The difference between ‘T5 small seq-fine-tune’ and ‘T5 small first-step seq fine-tune’ in SARI and Jaccard similarity highlights a potential trade-off between domain-specific alignment and a ‘higher abstract level’ of simplification. The model that was additionally fine-tuned on medical text tends to explain medical terminology in different wording, which lowers its Jaccard similarity to the source (0.72 versus 0.80) and is accompanied by a slightly lower SARI score (43.95 versus 44.78). The ‘T5 small first-step seq fine-tune’ model, trained only on WikiLarge, tends to stay close to the source: it keeps the terminology as it is but rewrites general formal words and sentence structures into less formal alternatives.

The ‘T5 small anti parrot model’, trained with a specific metric to reduce parroting behavior, decreased the performance of the T5 architecture on this task. This suggests that overemphasizing diversity or anti-parroting strategies may hinder the model's ability to generalize good simplification behaviors consistently.

Overall, these results highlight the effectiveness of sequential fine-tuning, especially when starting from a strong general-domain base. Additionally, they illustrate the importance of balancing simplification with domain specificity, particularly in medical applications where clarity must coexist with accuracy and familiarity.

Generation Behavior
Inspecting the generated texts from the models, we found the most promising results for the ‘T5 small seq-fine-tune’ model, and we report its generations here. To further understand model behavior, we generated simplified outputs using different decoding strategies. In particular, we compared generations under low-temperature sampling and high-beam search configurations.

Low Temperature Generation (Temperature = 0.01)
At a very low temperature of 0.01 (effectively enforcing deterministic outputs), the model exhibited strong parroting behavior. It tended to retain much of the original phrasing and sentence structure, with simplification occurring primarily at the word level. Simplifications were often surface-level substitutions of general terms:

“to pool” → “to combine”
“controlled trials” → “studies”
“RCT” → “studies”

There was also evidence of structural rephrasing, particularly in converting passive voice to active:

Original: “Seven RCTs including 960 participants were identified.”
Simplified: “We identified seven studies including 960 participants.”

These examples demonstrate the model’s basic ability to improve clarity and accessibility, though the changes remained relatively conservative.
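Why a temperature of 0.01 is effectively deterministic can be seen from how temperature rescales the next-token distribution. The sketch below applies a temperature-scaled softmax to three made-up logits (assumed values, not taken from our model): at 0.01 nearly all probability mass falls on the top token, whereas at 1.0 the alternatives keep a substantial share.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.01)
warm = softmax_with_temperature(logits, 1.0)

print("temperature 0.01:", [round(p, 4) for p in cold])
print("temperature 1.0: ", [round(p, 4) for p in warm])
```

Sampling from the cold distribution almost always returns the single most likely token, which helps explain why the outputs stay close to the most probable continuation, often the source wording itself.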

High Beam Search (e.g., num_beams = 10+)
Using high beam search (e.g., num_beams=10 or higher), the model still tended toward parroting but occasionally produced highly informative and well-simplified outputs. One notable example was:

Original: “Seven RCTs including 960 participants were identified.”
Simplified: “We identified seven randomized controlled trials (clinical studies where people are randomly put into one of two or more treatment groups) with a total of 960 participants.”

This output illustrates multiple desired behaviors:

Expansion of abbreviations (RCT → randomized controlled trials)
Insertion of parenthetical explanations to clarify jargon
Conversion from passive to active voice
Improved fluency and specificity ("including" → "with a total of")

However, such comprehensive simplifications were rare. Most generations under high beam settings resembled those under low temperature, indicating that while the model is capable of deeper simplification, it lacks consistency. This suggests the model is likely underfitted, perhaps due to insufficient training complexity, limited model complexity, limited domain coverage in the medical corpus, or a combination of the above.
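How a wider beam can surface an output that greedy decoding misses is illustrated by the toy beam search below. It is a simplified sketch and not the implementation of any particular library: the next-token probabilities in `TOY_MODEL` are invented, and refinements such as length penalties are left out.

```python
import math

# Invented next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"Seven": 0.6, "We": 0.4},
    ("Seven",): {"RCTs": 0.55, "studies": 0.45},
    ("We",): {"identified": 0.9, "found": 0.1},
}

def toy_step(prefix):
    """Log-probabilities of the next token; an empty dict means the prefix is finished."""
    return {token: math.log(p) for token, p in TOY_MODEL.get(prefix, {}).items()}

def beam_search(step_log_probs, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    while True:
        candidates = []
        expanded = False
        for prefix, score in beams:
            options = step_log_probs(prefix)
            if not options:
                candidates.append((prefix, score))  # finished beam is carried over
                continue
            expanded = True
            for token, log_p in options.items():
                candidates.append((prefix + (token,), score + log_p))
        if not expanded:
            return beams  # every beam is finished
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]

greedy_best = beam_search(toy_step, num_beams=1)[0][0]
beam_best = beam_search(toy_step, num_beams=2)[0][0]
print("greedy:", " ".join(greedy_best))
print("beam 2:", " ".join(beam_best))
```

In this toy setting, greedy decoding commits to the locally most likely first token and ends on the source-like continuation, while the wider beam keeps the alternative opening alive long enough for its higher overall probability to win.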

Generalization and Memorization

To assess whether the model had simply memorized a specific pattern or example, we tested it with variations of the same sentence involving RCTs and different numbers or attributes. The model retained its ability to expand the acronym and insert the correct explanation consistently, even across different phrasings:

Original: “I believe Nine RCTs including 1000 patients were included in the analysis.”
Simplified: “Nine randomized controlled trials (clinical studies where people are randomly put into one of two or more treatment groups) including 1000 patients were included in the analysis.”
Original: “I believe Nine RCTs with some patients were included in the analysis.”
Simplified: “I believe Nine randomized controlled trials (clinical studies where people are randomly put into one of two or more treatment groups) with some patients were included in the analysis.”
Original: “I believe Nine RCTs with some patients with a history of stroke were picked.”
Simplified: “I believe Nine randomized controlled trials (clinical studies where people are randomly put into one of two or more treatment groups) with some patients with a history of stroke were picked.”

The model’s consistent simplification of "RCTs" across multiple phrasings and prompt types suggests that it did not merely memorize a single example but learned a more generalizable pattern for handling this domain-specific acronym. Notably, the explanation of "randomised controlled trials" appeared in full in each instance, indicating abstraction beyond surface-level imitation.
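Such a probe can be scored automatically by checking how often the full explanation appears in the outputs. The sketch below does this for the three generations listed above; the helper name `expansion_rate` is hypothetical, and a plain substring match is only a rough proxy for a correct explanation.

```python
EXPLANATION = ("randomized controlled trials (clinical studies where people are "
               "randomly put into one of two or more treatment groups)")

def expansion_rate(outputs, required=EXPLANATION):
    """Fraction of model outputs that contain the required expansion string."""
    hits = sum(required.lower() in text.lower() for text in outputs)
    return hits / len(outputs)

# The three generations reported above, rebuilt around the shared explanation.
outputs = [
    f"Nine {EXPLANATION} including 1000 patients were included in the analysis.",
    f"I believe Nine {EXPLANATION} with some patients were included in the analysis.",
    f"I believe Nine {EXPLANATION} with some patients with a history of stroke were picked.",
]

rate = expansion_rate(outputs)
print(f"expansion rate: {rate:.2f}")
```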

The ability to explain abbreviations was not limited to spelling out the letters: it also included conceptually explaining what the abbreviation stands for, though this occurred only rarely:

Original: “simplify: We applied the GRADE methodology to rate the certainty of evidence.”
Simplified: “We rated the certainty of evidence from studies using four levels: very low, low, moderate, or high.”

The model is also able to write out the full term where abbreviated medical jargon is used, as shown in the example below:

Original: “simplify: Anti-D can be administered by IM or IV injection.”
Simplified: “Anti-D can be administered by intramuscular or intravenous injection”

Limitations included the model's tendency to repeat the prompted sentences (despite notable improvement), to remove important information, the inability to (yet) explain most technical words and the occasional logical inconsistency. These limitations point to areas where further training or architectural modifications may be necessary to improve the depth and scope of simplification.

More interesting results, both positive and negative, can be found in Appendix A.

Discussion

This study set out to investigate how effectively transformer-based models, particularly the T5 architecture, can be fine-tuned to simplify complex medical texts while preserving essential information. To do so, we applied a structured training pipeline consisting of a general-domain fine-tuning phase on the WikiLarge dataset, followed by domain-specific fine-tuning on the MultiCochrane corpus. This sequential fine-tuning approach enabled the model to learn general simplification patterns before adapting to the specific linguistic and semantic characteristics of medical writing. This method has only been applied to the T5 small model, but our results show potential to apply this technique to bigger models.

Our results indicate that model size plays a role in performance. Among the baselines, SARI, which we used as our primary simplification measure, increased with model size (38.63 for T5-small, 39.68 for T5-base and 41.35 for T5-large), although T5-large scored lower on BLEU and ROUGE. However, due to computational constraints, the full fine-tuning pipeline was only applied to the T5-small model. Despite its size, T5-small benefited considerably from the sequential fine-tuning approach, achieving a best SARI score of 44.78 after the WikiLarge step, the highest among the models in this study.

The analysis of decoding strategies showed that models often exhibited parroting behavior, particularly under deterministic settings (e.g., low temperature). High beam search settings occasionally produced deeper simplifications, including abbreviation expansions and improved fluency, but these results were not consistent. Interestingly, models with the highest Jaccard similarity often showed less simplification, suggesting a trade-off between preservation of original structure and meaningful rewriting. This further highlights the complexity of balancing clarity, fidelity, and simplicity in the medical domain.

A key observation was that sequential fine-tuning substantially improved performance compared to training solely on medical data. This supports the idea that exposing models first to a broad range of simplification patterns before domain adaptation leads to more generalizable and nuanced simplification strategies. However, while models performed well in tasks such as expanding abbreviations and simplifying terminology, they sometimes failed to restructure more complex sentence forms or maintain logical coherence, underscoring the need for more targeted training or hybrid approaches in future work.

In doing so, this research contributes to a growing but still limited body of work on domain-specific simplification, particularly in the medical field where maintaining both clarity and content accuracy is critical. Prior work such as Multilingual Simplification of Medical Texts by Joseph et al. has highlighted the potential of large-scale models like GPT-3 and Flan-T5 across multiple languages, reporting SARI scores in the range of 39–43 and relatively high BLEU and BS scores. However, those approaches do not explicitly leverage sequential domain adaptation or conduct fine-grained analyses of simplification fidelity versus parroting behavior. In comparison, our best-performing model not only surpasses their SARI scores (44.78) but also provides insight into how simplification depth can be modulated through targeted training steps.

Our study also adds empirical depth to the discussion on parroting behavior. While high Jaccard similarity (values above 0.80) might suggest improved content preservation, our results indicate that it can also signal reduced simplification effort—essentially the model echoing the source text. In contrast, models with moderately lower similarity scores (around 0.70) often performed more meaningful rewrites, as reflected in SARI gains. These findings reinforce the idea that high lexical overlap is not always indicative of better simplification quality, and that simplification models in the medical domain must be carefully tuned to avoid excessive conservatism.

Conclusion

This research demonstrates that transformer-based models, particularly from the T5 family, can be effectively fine-tuned for the task of medical text simplification. Sequential fine-tuning on general and domain-specific corpora significantly boosts performance, especially for smaller models such as T5-small, making it a viable option under limited computational resources.

Key findings include:

Larger models tend to score higher on SARI, with T5-large achieving strong results even without the full fine-tuning process that T5-small underwent.
Sequential fine-tuning is highly effective, enabling smaller models to rival or even outperform larger models in SARI scores when properly adapted to domain-specific language.
There is a delicate trade-off between simplification and content preservation; overly conservative models preserve structure but under-simplify, while more aggressive strategies risk distorting meaning.

For future research, we recommend:

Applying the full sequential fine-tuning pipeline to larger T5 models (base and large), which could further improve performance.
Exploring other transformer architectures.
Exploring hybrid approaches that combine BERT-style terminology replacement with T5-style sentence restructuring.
Including human evaluation, particularly with medical professionals and laypersons, to better assess real-world usability and trustworthiness of simplified outputs.

Overall, our findings suggest that, with appropriate fine-tuning strategies, transformer models can serve as powerful tools to make medical information more accessible to broader audiences without sacrificing critical meaning. However, more compute and data are required to achieve the high performance needed in the medical field.

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Shinn, J., Askell, A., Sastry, G., & Hesse, C. (2020). Language models are few-shot learners. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 33, 1877-1901. https://arxiv.org/abs/2005.14165

Carmona, J. (2009). Simplifying Medical Terminology in Interpreted. Fort Worth: University of North Texas Health Science Center.

Gao, Y., Dligach, D., Miller, T., Xu, D., Churpek, M. M., & Afshar, M. (2022). Summarizing patients' problems from hospital progress notes using pre-trained sequence-to-sequence models. arXiv preprint. https://arxiv.org/abs/2208.08408

Gaper. (2023). Decoding Medical Jargon: The Role of LLMs in Simplifying HealthTech Information. Retrieved from Gaper.io: https://gaper.io/decoding-medical-jargon/

He, W., Farrahi, K., Chen, B., Peng, B., & Villavicencio, A. (2024). Representation transfer and data cleaning in multi-views for text simplification. Pattern Recognition Letters, 177, 40–46. https://doi.org/10.1016/j.patrec.2023.11.011

Hendawi, R., Alian, S., & Li, J. (2022). A Smart Mobile App to Simplify Medical Documents and Improve Health Literacy: System Design and Feasibility Validation. JMIR Formative Research.

Joseph, S., Kazanas, K., Reina, K., Ramanathan, V. J., Xu, W., Wallace, B. C., & Li, J. J. (2023). Multilingual Simplification of Medical Texts. https://doi.org/10.48550/arxiv.2305.1253

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. https://doi.org/10.1093/bioinformatics/btz682

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703

Li, Z., Belkadi, S., Micheletti, N., Han, L., Shardlow, M., & Nenadic, G. (2023). Large language models for biomedical text simplification: Promising but not there yet. arXiv preprint. https://arxiv.org/abs/2408.03871

Lyu, C., & Pergola, G. (2024). Society of Medical Simplifiers. https://doi.org/10.48550/arxiv.2410.09631

Meppelink, C., Smit, E., Buurman, B., & Weert, J. v. (2015). Should we be afraid of simple messages? The effects of text difficulty and illustrations in people with low or high health literacy. Health Communication, 1181-1189.

Phatak, K., Tang, Y., Zhang, Y., & Wang, W. (2023). Controlling factual correctness in text simplification using reinforcement learning. Findings of the Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.229

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1-67. https://jmlr.org/papers/v21/20-074.html

Shardlow, M., Belkadi, S., Li, Z., & Nenadic, G. (2023). Lexical simplification for medical text using masked language models and biomedical word embeddings. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.18653/v1/2023.emnlp-main.482

Universiteit Utrecht. (2023, 03 27). 30 procent van de Nederlanders vindt overheidsteksten te moeilijk. Retrieved from uu.nl: https://www.uu.nl/nieuws/30-procent-van-de-nederlanders-vindt-overheidsteksten-te-moeilijk

University of Arizona. (2023, 07 10). How Machine Learning Can Simplify Medical Jargon. Retrieved from eller.arizona.edu: https://eller.arizona.edu/news/2023/07/how-machine-learning-can-simplify-medical-jargon

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30. https://arxiv.org/abs/1706.03762

Wallace, B. C., Giorgi, S., Bertagnolli, D., & Fox, K. (2022). Context-aware medical text simplification using transformer-based models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-main.235

Werra, L. v. (2023). Metric: sari. Retrieved from Huggingface: https://huggingface.co/spaces/evaluate-metric/sari

Zhang, X., & Lapata, M. (2017). Sentence simplification with deep reinforcement learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 584–594.

Zhu, Y., Yang, X., Wu, Y., & Zhang, W. (2023). Leveraging summary guidance on medical report summarization. arXiv preprint. https://arxiv.org/abs/2302.04001

Appendix A

Original	Simplified	Observation
Seven RCTs including 960 participants were identified.	Simplified: We identified seven randomised controlled trials (clinical studies where people are randomly put into one of two or more treatment groups) including 960 participants.	Explanation of abbreviation and jargon
The intrabiliary application of corticosteroids via nasobiliary tube seems to induce severe adverse effects.	The application of corticosteroids via nasobiliary tube seems to cause severe adverse effects.	General simplification: induce -> cause Removal of difficult word (expert should judge whether this would be reasonable)
Anti-D can be administered by IM or IV injection.	Anti-D can be administered by intramuscular or intravenous injection.	abbreviation written out (but not explained)
Preterm infants with respiratory distress syndrome are at increased risk of adverse neonatal and developmental outcomes.	Babies born before 37 weeks of pregnancy are at risk of respiratory distress syndrome.	Clear simplification: 'Preterm infants' -> 'Babies born before 37 weeks of pregnancy'. Issue: not logically coherent. The characteristic 'infants with respiratory distress syndrome' was changed into a risk ('are at risk of respiratory distress syndrome'), so the meaning of the sentence changed.
To compare the efficacy and effectiveness of IM versus IV anti-D IgG in preventing RhD alloimmunization in RhD-negative pregnant women.	This review compared the efficacy and effectiveness of intramuscular versus intravenous anti-D in preventing rhesus alloimmunization in RhD-negative pregnant women.	Introduced 'this review' to clarify who/what. Again abbreviations written out
Twelve RCTs (1023 participants) reporting 14 comparisons were included in this review.	We included 12 randomised controlled trials (RCTs; clinical studies where people are randomly put into one of two or more treatment groups) with 1023 participants reporting 14 comparisons.	A good result: jargon and abbreviation explained, active form instead of passive form.
We applied the GRADE methodology to rate the certainty of evidence.	We rated the certainty of evidence from studies using four levels: very low, low, moderate, or high.	A good result: GRADE methodology is jargon clearly detailed in simple words (and accurately). Also 'applied' -> 'using'.
The primary objective of this review was to determine the diagnostic accuracy of the Mini-Cog for detecting Alzheimer's disease dementia and related dementias in a community setting.	The aim of this Cochrane Review was to find out how accurate the Mini-Cog is for detecting dementia in community settings.	- Generalized 'Alzheimer's disease dementia and related dementias' -> 'dementia'. Less specific, so loss of information, but much easier to read. - The primary objective - the aim (synonym, not necessarily easier?) - to determine - to find out (less formal, easier)
The evidence that shorter questionnaires are better than longer questionnaires was unclear (RR 1.04; 95% CI 1.00 to 1.08, P value = 0.07) and the evidence for questionnaires relevant to the disease/condition was also unclear (RR 1.07; 95% CI 1.01 to 1.14).	We are uncertain whether shorter questionnaires are better than longer questionnaires because the quality of the evidence is very low.	- Important, but difficult to understand, information about the results of the study left out - For the purpose of simplifying it worked, but not suitable for scientific reporting.
Excess sodium intake is associated with high blood pressure, which is common and costly and accounts for significant burden of disease.	This is a problem because eating too much salt can cause high blood pressure.	- Simplifies "Excess sodium intake" -> "eating too much salt". - 'this is a problem' seems to refer more to the behaviour, rather than 'common and costly and accounts for significant burden of disease'.
Nine initiatives permitted quantitative analysis of differential impact by sex (men and women separately).	Nine of these initiatives (including men and women separately) analysed men and women separately.	- Incoherent repetition.
To examine the effectiveness of interventions in the treatment of chronic blepharitis.	What is chronic blepharitis and what are treatments?	- Meaning and formulation of the motive changed into the formulation of an unrelated question to the original text.

Table 2: Simplification details

Appendix B

This training pipeline outlines the procedures and configurations used to train the various T5-based models evaluated in our study. It details the datasets, hardware, training schedules, and optimization strategies that enabled effective model fine-tuning for medical text simplification.

Training Pipeline

Training data

We trained on the English unfiltered MultiCochrane corpus, which consists of 61,194 sentence pairs. Sentences were encoded with the default T5 tokenizer, which is based on SentencePiece and has a vocabulary of 32,128 subword tokens. We prepended the prefix ‘simplify: ’ to each input text. For sequential fine-tuning, we additionally trained on the WikiLarge dataset (Zhang & Lapata, 2017).
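The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function names, the `max_length` of 128, and the `t5-small` checkpoint name are our own choices, and the tokenization step assumes the Hugging Face `transformers` library is installed.

```python
PREFIX = "simplify: "  # task prefix prepended to every input sentence


def add_prefix(sentences):
    """Prepend the task prefix to each (whitespace-stripped) source sentence."""
    return [PREFIX + s.strip() for s in sentences]


def tokenize_pairs(sources, targets, model_name="t5-small", max_length=128):
    """Encode (complex, simple) sentence pairs for a T5 model.

    model_name and max_length are illustrative assumptions.
    """
    # Imported lazily so that add_prefix works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoded = tokenizer(add_prefix(sources), max_length=max_length, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_length, truncation=True)
    encoded["labels"] = labels["input_ids"]  # target token ids used as labels
    return encoded
```

Only the source side receives the prefix; the simplified target sentences are tokenized unchanged and passed to the model as labels.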

Hardware and Schedule

Our models were trained on a consumer-grade NVIDIA RTX 4090 GPU. Training the T5-small model takes about 10 minutes for one epoch, whereas training the T5-large model takes over an hour.

Metric for best model

We experimented with several metrics for selecting the best model, including a composite metric that rewards simplification (SARI) and penalizes similarity to the input text (using measures such as Jaccard similarity). After experimentation and inspection of the results, we settled on SARI as the most suitable metric.
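To illustrate what such a composite metric could look like, the sketch below subtracts a Jaccard-based similarity penalty from a SARI score. The exact formula we experimented with is not reproduced here: the linear combination, the `weight` of 0.5, and the whitespace tokenization are assumptions made for illustration only. The SARI value is assumed to be on a 0-100 scale and computed elsewhere.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_score(sari, source, output, weight=0.5):
    """Reward simplification (SARI) and penalize copying the input.

    sari is assumed to be on a 0-100 scale; weight is a hypothetical
    trade-off parameter, not a value taken from the experiments.
    """
    return sari - weight * 100.0 * jaccard_similarity(source, output)
```

Under this formulation, an output that copies the input verbatim receives the full penalty, whereas an output sharing no words with the input keeps its SARI score unchanged.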

Hyperparameter sweeps

We used Optuna to run hyperparameter sweeps over the learning rate, learning rate scheduler type, warmup steps, weight decay, and label smoothing factor, training each trial for 5 epochs with the Adam optimizer. The T5-small model underwent close to 100 trials; for T5-large we ran 6 trials. These sweeps were preceded by several manual runs to test initial performance.
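A sweep over these five hyperparameters could be set up roughly as shown below. The search ranges, the scheduler choices, and the `train_and_evaluate` callback are hypothetical placeholders rather than the values used in our experiments; only the list of swept hyperparameters comes from the description above. The sketch assumes the `optuna` package is installed.

```python
def suggest_hyperparameters(trial):
    """Sample one configuration; all ranges below are illustrative assumptions."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "lr_scheduler_type": trial.suggest_categorical(
            "lr_scheduler_type", ["linear", "cosine", "constant"]
        ),
        "warmup_steps": trial.suggest_int("warmup_steps", 0, 1000),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "label_smoothing_factor": trial.suggest_float(
            "label_smoothing_factor", 0.0, 0.2
        ),
    }


def run_sweep(train_and_evaluate, n_trials=6):
    """Run an Optuna study that maximizes the validation score.

    train_and_evaluate is a hypothetical callback: it takes a dict of
    hyperparameters, fine-tunes the model, and returns a score such as SARI.
    """
    import optuna  # imported lazily; only needed when a sweep is started

    def objective(trial):
        return train_and_evaluate(suggest_hyperparameters(trial))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Keeping the sampling logic in its own function makes it possible to reuse the same search space for different model sizes while varying only `n_trials`.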

Sequential fine-tuning

We also applied sequential fine-tuning to the T5-small model: we first ran a sweep and trained on the WikiLarge dataset for simplification, then used the resulting checkpoint to continue training on the English unfiltered MultiCochrane corpus.
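The control flow of sequential fine-tuning can be summarized in a few lines. The sketch below is schematic: `train_fn` is a hypothetical callback standing in for a full training run (including any sweep), and the checkpoint and dataset identifiers are illustrative strings, not actual paths.

```python
def sequential_finetune(start_checkpoint, stages, train_fn):
    """Fine-tune on each dataset in order, starting every stage from the
    checkpoint produced by the previous one.

    train_fn(checkpoint, dataset_name) is a hypothetical callback that
    trains a model and returns an identifier for the new checkpoint.
    """
    checkpoint = start_checkpoint
    history = []
    for dataset_name in stages:
        checkpoint = train_fn(checkpoint, dataset_name)
        history.append((dataset_name, checkpoint))
    return checkpoint, history


def fake_train(checkpoint, dataset_name):
    """Stand-in for real training: only records which data was seen."""
    return f"{checkpoint}+{dataset_name}"


final_checkpoint, history = sequential_finetune(
    "t5-small", ["wikilarge", "multicochrane"], fake_train
)
```

The order of the stages matters: the general-domain simplification data comes first, so that the medical corpus is the last data the model sees before evaluation.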

