Researchers Experiment With Machine Translation to Simplify Medical Language


Adequate access to healthcare is a major problem in the US. A US government initiative called “Healthy People 2030” aims to tackle the problem from several angles, including language access and, specifically, health literacy. A team of natural language processing (NLP) researchers from the University of Pittsburgh, among other institutions, conducted an experiment inspired by this initiative and summarized their findings in a September 2022 paper.

When medical materials are written at a higher reading level than most of the patient population can fully comprehend (a capacity the US government initiative dubs “personal health literacy,” or PHL), the result is “worse health outcomes and serious health disparities as patients are not able to manage their own health properly,” according to the paper.

A highlight of the experiment is its approach to language access: “machine-translating” English into English, with the goal of lowering the literacy level required to understand medical content.

Using MT to Simplify Language

Using machine translation (MT) to simplify language is not a new concept. Previous research, cited in the paper’s reference section, focused on terminology substitution and changes to grammatical structures. This was done using rule-based and statistical machine translation, with a fair amount of post-editing added to improve the output.

The researchers described these approaches as “tedious and need[ing] lots of human manual effort.” For their experiment, they instead tried automating terminology and grammar simplification with neural machine translation (NMT), adding a grammar correction tool to the mix.


A critical component of the experiment was having a representative dataset. The researchers created their own from scratch by mining data from several medical sources. The source dataset was called the “illiterate” set (original content) and the target set, the “literate” set (simplified terms and grammatical structure).

MT training was done using 245,335 sentence pairs and validation was done with 40,000 sentence pairs. This combined dataset was considered “the silver standard.” 

Complementing the training data was what the researchers called the “gold standard” dataset: 497 sentence samples simplified by humans, used for testing.
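The corpus layout described above can be sketched as a set of parallel sentence pairs divided into splits. The split sizes come from the article; the field names and the example pair are illustrative assumptions, not taken from the paper’s data.

```python
from dataclasses import dataclass

@dataclass
class SentencePair:
    illiterate: str   # original, medical-register sentence (source side)
    literate: str     # human- or machine-simplified equivalent (target side)

# Split sizes as reported: silver-standard pairs for training/validation,
# gold-standard (human-simplified) pairs reserved for testing.
SPLITS = {
    "train": 245_335,
    "validation": 40_000,
    "test": 497,
}

# Hypothetical example of one parallel pair.
pair = SentencePair(
    illiterate="The patient presented with acute myocardial infarction.",
    literate="The patient came in with a sudden heart attack.",
)
```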

The MT training process involved, among other things, randomly substituting one complex word in each sentence with a simplified equivalent. This was done even if more than one complex word or term occurred in a single sentence.
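The one-term-per-sentence substitution step might look like the sketch below. The lexicon and function name are invented for illustration; the paper does not publish its substitution code.

```python
import random

# Made-up example lexicon mapping complex medical terms to lay equivalents.
LAY_LEXICON = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "edema": "swelling",
}

def simplify_one_term(sentence, lexicon=LAY_LEXICON, rng=random):
    """Replace exactly one randomly chosen complex term in the sentence.

    Mirrors the described training step: even if several complex terms
    occur, only one is substituted per sentence.
    """
    hits = [term for term in lexicon if term in sentence]
    if not hits:
        return sentence  # nothing to simplify
    term = rng.choice(hits)
    return sentence.replace(term, lexicon[term])
```

For a sentence containing a single known term, the output is deterministic; with multiple terms, each call simplifies one of them at random.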

Two architectures were compared: a Bidirectional Long Short-Term Memory (BiLSTM) MT model and Bidirectional Encoder Representations from Transformers (BERT)-based MT models. An additional analysis compared the ratio of unsimplified language remaining in output sentences.
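One plausible reading of the “ratio of unsimplified language” metric is the fraction of known complex terms that survive in a model’s output. The article does not give a formula, so the definition below is an assumption for illustration only.

```python
def unsimplified_ratio(output_sentence, complex_terms):
    """Fraction of known complex terms still present in the MT output.

    ASSUMED metric definition; the paper reports a 'ratio of
    unsimplified language' without specifying the computation here.
    """
    if not complex_terms:
        return 0.0
    sent = output_sentence.lower()
    remaining = [t for t in complex_terms if t.lower() in sent]
    return len(remaining) / len(complex_terms)
```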

Mixed Results

According to the researchers, “The proposed NMT models were able to identify the correct complicated words and simplify into layman language.” They also reported, however, that the models had issues with sentence completeness, fluency, readability, and terminology.

Some medical terms were adapted unnecessarily, including commonly used medical terms (e.g., treatment, heart attack, allergic reaction). Other more complex medical terms (e.g., eosinophilic esophagitis, hemostasis, axillary lymphadenitis) were not included in the silver standard data, and thus unavailable to the model. The latter would require manual adaptation. 

The researchers acknowledge that human translation achieves higher fluency and readability than the trained models, whose output contained grammatical issues. These issues were particularly common in long sentences and included errors in verb tense, singular and plural forms, collocation, and sentence structure.

What to Try Next

The researchers recommended adding systematic language checks for accuracy and quality to the machine learning pipeline. They also suggested leveraging “existing language assistance models to counter check the results automatically.”

An expert-in-the-loop model is also echoed in their recommendations, which state that “all the final translation should be approved by healthcare experts in the field to ensure translation quality before delivering to patients.”