Natural Language Processing (NLP) is a field at the intersection of computational linguistics, computer science and, more specifically, artificial intelligence (AI) that uses statistics, machine learning (ML) and deep learning (DL) techniques to understand and, more recently, generate written and spoken human language. This gives computers the ability to build language models that capture the grammar, structure and semantics of human language. NLP research has seen great success in recent years, especially after the advent of deep learning architectures like BERT in 2018, which helped build superior language models that better capture syntactic and semantic information.
There is a huge surge of interest in applying these state-of-the-art models across many domains. Healthcare is one such domain, and clinical applications are a subfield of it. The most accessible and reliable data source for textual clinical applications is the Electronic Health Record (EHR), which consists of both structured and unstructured data. Structured data include tabulated or annotated information such as lab tests and diagnostic reports, whereas clinical notes are free text and hence fall under unstructured data. Information in an unstructured format is hard to analyze and leads to missed opportunities in understanding overall patient health and well-being. It is reported that around 80% of important clinical data is trapped in unstructured form [1]. Clinical notes are a rich source of text for understanding patient health and detecting chronological events such as past, current and future procedures, problems, medications, treatments, diagnostics, allergies, family history, etc. NLP-based approaches can transform unstructured clinical notes into a structured format and extract hidden clinical insights.
NLP tasks can be divided into upstream tasks and downstream tasks. The upstream tasks mostly consist of tasks that are important for understanding the specific language at hand; in our context, that language is the language of clinical notes. Basic language tasks like tokenization, stemming, lemmatization, part-of-speech tagging, dependency parsing, etc. fall under the upstream tasks. Downstream tasks are built on top of the models from the upstream tasks and mainly include the specific supervised tasks a researcher wishes to perform, like entity recognition, patient classification, relation extraction, de-identification, summarization, sentiment analysis, information extraction, etc. While most NLP upstream tasks are part of the application pipeline in a clinical setting, only a few selected downstream tasks, like named entity recognition and relation extraction, are applicable. This limited application scope can mostly be attributed to the necessity of the task itself and also to the availability of labelled clinical datasets for the downstream tasks.
Traditionally, clinical experts would manually review the notes to extract the relevant information. Later, rule-based and regular-expression-based methods were used in automatic clinical text processing tasks for extracting structured data. These methods have limited scope, are labor-intensive and cumbersome, and do not scale well to huge data. The latest state-of-the-art methods in NLP have shown promising results by leveraging the benefits of big data and Artificial Intelligence (AI) architectures that produce higher performance and need little to no human intervention in building models. This encouraged researchers and industry professionals alike to explore and apply these algorithms on clinical notes. The most relevant downstream applications of NLP on clinical notes are de-identification of the records, extraction of medical entities from radiology, diagnostic, medication and lab records, identification of relations between the extracted medical entities, and extraction of patient clinical history for events.
Let’s start with the upstream tasks in the clinical setting.
In NLP text processing, tokenization is the first step and dictates the success of downstream tasks like entity extraction, relation extraction and others. It is a non-trivial task given the variation in vocabulary and grammar in the clinical domain compared to plain English [3]. Tokenization is mainly done at two levels: word and sentence.
Word Tokenization
Word tokenization is the task of identifying word boundaries in the given text. In plain English, a blank space is considered the separator between words. Clinical words can differ from general English words as they might contain codes, abbreviations and special characters (like hyphens, slashes, apostrophes, etc.), which are to be treated as single words. Example: "Patient is ordered a normal chest x-ray." Here "x-ray" should be tokenized as a single word.
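As a minimal sketch (not a production tokenizer), a regular expression can keep hyphenated clinical terms like "x-ray" intact while still splitting off trailing punctuation. The pattern below is an illustrative assumption, not the rule any particular library uses:

```python
import re

def clinical_word_tokenize(text):
    """Tokenize text, keeping terms with internal hyphens, slashes, or
    apostrophes (e.g. 'x-ray') as single tokens."""
    # A word is a run of alphanumerics, optionally joined by -, /, or '
    # to further alphanumeric runs; anything else (punctuation) is its
    # own single-character token.
    pattern = r"[A-Za-z0-9]+(?:[-/'][A-Za-z0-9]+)*|\S"
    return re.findall(pattern, text)

tokens = clinical_word_tokenize("Patient is ordered a normal chest x-ray.")
print(tokens)
# ['Patient', 'is', 'ordered', 'a', 'normal', 'chest', 'x-ray', '.']
```

A naive `text.split(" ")` would instead produce the token "x-ray." with the period attached, and a naive punctuation splitter would break "x-ray" into three tokens.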
Sentence Tokenization
The task of identifying sentence boundaries is called sentence tokenization. It is different in clinical text compared to general English sentences, which can usually be identified by a period at the end.
English statement: “Jack and Jill went up the hill. They want to fetch some water.”
Sentence tokenization output:
Sentence 1: “Jack and Jill went up the hill.”,
Sentence 2: ”They want to fetch some water.”
Clinical statement: “Dr. Gomez identified a cyst in the descending colon. ICD code with ICD1.2.3.xx.”
Here, sentence boundaries cannot be identified by periods alone: the periods in “Dr.” and inside the ICD code must not trigger sentence splits.
Desired sentence tokenization output:
Sentence 1: “Dr. Gomez identified a cyst in the descending colon.”,
Sentence 2: “ICD code with ICD1.2.3.xx.”
It should also be noted that different institutions might have different styles of clinical notes and sentence structures. Hence, there is a need to have this diversified data while training NLP models to capture such nuances.
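A rough sketch of abbreviation-aware sentence splitting is shown below. The abbreviation list and the code-token heuristic are illustrative assumptions; real tokenizers (e.g. the Punkt models shipped with NLTK, or clinical tokenizers trained on note corpora) learn these cues from data:

```python
import re

# A small, illustrative (not exhaustive) list of abbreviations whose
# trailing period should not end a sentence.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "vs.", "e.g.", "i.e."}

def sentence_tokenize(text):
    """Split on period-final tokens, unless the token is a known
    abbreviation or a code-like token with internal periods (ICD1.2.3.xx)."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_code = re.search(r"\.\w", token)  # period followed by a character
        if token.endswith(".") and token not in ABBREVIATIONS and not is_code:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Gomez identified a cyst in the descending colon. ICD code with ICD1.2.3.xx."
print(sentence_tokenize(text))
# ['Dr. Gomez identified a cyst in the descending colon.',
#  'ICD code with ICD1.2.3.xx.']
```

A plain split on periods would incorrectly break the text after "Dr." and inside the ICD code.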
Stemming and Lemmatization
In NLP applications, the text is tokenized into words, which are later translated into numerical vectors for further processing. The complexity of a language model depends on the size of its vocabulary: a smaller vocabulary reduces the complexity of the model and improves its performance. Hence, many downstream tasks benefit if words are converted to their canonical forms. Stemming and lemmatization are the tasks of reducing words to their canonical forms. Example: words like “treating”, “treated” and “treatment” are all derived from the base word “treat”. This reduces the amount of data being processed and increases homogeneity and coherence.
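The idea can be illustrated with a toy suffix-stripping stemmer. This is a deliberately simplified sketch; real pipelines use the Porter/Snowball stemmers or a dictionary-backed lemmatizer (such as NLTK's WordNetLemmatizer), which handle far more morphology than the few rules below:

```python
# Suffixes checked longest-first so 'treatment' loses 'ment', not just 's'.
SUFFIXES = ["ment", "ing", "ed", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 chars."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["treating", "treated", "treatment"]])
# ['treat', 'treat', 'treat']
```

All three surface forms collapse to one vocabulary entry, which is exactly the reduction in vocabulary size that benefits the downstream model.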
Part-of-Speech (POS) Tagging
POS tagging is the task of determining the parts of speech in a sentence, like nouns, verbs, adjectives, prepositions, etc. POS tagging is a well-advanced area and generally reaches an accuracy of around 0.97 on general English. But when the same POS taggers are applied to clinical and medical notes, there is a drastic drop in accuracy. Hence, new taggers are trained on labelled clinical/medical datasets for better performance.
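One reason for the accuracy drop is out-of-vocabulary clinical terms. The sketch below uses a tiny, hypothetical general-English lexicon with a default-tag fallback; it is only meant to show the mechanism, since real taggers are statistical sequence models, but the failure mode is the same: words the model never saw in training get guessed tags.

```python
# A tiny, hypothetical general-English lexicon; clinical terms like
# 'colonoscopy' are out-of-vocabulary and fall back to the default tag.
LEXICON = {"the": "DET", "patient": "NOUN", "was": "VERB",
           "given": "VERB", "a": "DET", "normal": "ADJ"}

def pos_tag(tokens, default="NOUN"):
    """Look each token up in the lexicon; unknown words get the default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(pos_tag(["The", "patient", "was", "given", "a", "colonoscopy"]))
```

Here "colonoscopy" happens to be guessed correctly as a noun, but a clinical note dense with such unseen terms (drug names, procedures, abbreviations) gives the fallback far more chances to be wrong.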
Let us now dive into the downstream tasks of NLP that are most relevant in the context of clinical notes.
Named Entity Recognition (NER)
NER is the most relevant and crucial task in the clinical NLP domain. It is the process of identifying and extracting clinical entities like conditions, diseases, procedures, treatments, anatomies, genes and drug prescriptions. State-of-the-art named entity recognition models can also identify entity modifiers, like dosages and frequencies for drugs, or locations within an anatomy like “upper” arm and “descending” colon.
Consider the text “There was a tumor in the ascending colon. A hot forceps biopsy was performed. A single medium-sized polyp was found in the descending colon.”
Figure 1 shows the output of the open-source package Stanza using a Radiology model that is trained on MIMIC-III clinical notes [5] and the general English web treebank.
![Figure 1: The named entity recognition output of Stanza[4] using Radiology model.](https://www.thinkgenetic.com/wp-content/uploads/2021/10/Figure-1-1024x210.png)
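To make the task concrete, here is a toy gazetteer-based tagger over the example text. This is only an illustration of what NER produces, not how Stanza works (Stanza uses neural sequence models); the phrase list and the label names are assumptions loosely modelled on the radiology labels above:

```python
import re

# A tiny, hand-built gazetteer (illustrative only). Real clinical NER
# models learn entity boundaries and types from annotated corpora.
GAZETTEER = {
    "tumor": "OBSERVATION",
    "polyp": "OBSERVATION",
    "ascending colon": "ANATOMY",
    "descending colon": "ANATOMY",
    "hot forceps biopsy": "PROCEDURE",
}

def tag_entities(text):
    """Return (surface, label, char_offset) for each gazetteer match."""
    entities = []
    for phrase, label in GAZETTEER.items():
        for match in re.finditer(re.escape(phrase), text, re.IGNORECASE):
            entities.append((match.group(), label, match.start()))
    return sorted(entities, key=lambda e: e[2])

text = ("There was a tumor in the ascending colon. A hot forceps biopsy was "
        "performed. A single medium-sized polyp was found in the descending colon.")
for surface, label, _ in tag_entities(text):
    print(f"{surface} -> {label}")
```

Dictionary lookups like this break down quickly on misspellings, abbreviations and unseen entities, which is why learned models dominate in practice.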
Negation Detection
Consider the same example as above but with a negation in the statement indicating that there is no tumor. “There was no tumor in the ascending colon. A hot forceps biopsy was performed. A single medium-sized polyp was found in the descending colon.”
It is very important for NLP models to detect the certainty status of the extracted entity, or OBSERVATION (the tumor in this example). The sentence indicates that there is no tumor, and negation detection is the process of identifying such negations in the text.
Figure 2 shows the output of Stanza run with the Radiology model. Note the UNCERTAINTY tag.
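The classic approach here is the NegEx algorithm: look for negation trigger phrases within a window before (or after) the entity. A stripped-down sketch of the pre-negation case follows; the trigger list and window size are illustrative assumptions:

```python
# Simplified NegEx-style check: an entity is treated as negated when a
# trigger word appears within a few tokens before it in the same sentence.
NEGATION_TRIGGERS = {"no", "not", "without", "denies", "negative"}

def is_negated(sentence, entity, window=5):
    """Return True when a negation trigger precedes `entity` within `window` tokens."""
    tokens = sentence.lower().replace(".", "").split()
    if entity not in tokens:
        return False
    idx = tokens.index(entity)
    return any(tok in NEGATION_TRIGGERS for tok in tokens[max(0, idx - window):idx])

print(is_negated("There was no tumor in the ascending colon", "tumor"))  # True
print(is_negated("There was a tumor in the ascending colon", "tumor"))   # False
```

Real systems also handle post-negation ("tumor was ruled out"), scope termination ("but"), and uncertainty cues, which this sketch ignores.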

Relation Extraction (RE)
While named entity recognition is the task of identifying entities in the text, relation extraction is the task of identifying relations or semantic connections among the entities.
Consider the same example of “There was a tumor in the ascending colon. A hot forceps biopsy was performed. A single medium-sized polyp was found in the descending colon.” A relation extraction task attempts to identify that the test “hot forceps biopsy” reveals a problem “polyp”.
There are various types of relations a clinical researcher would be interested in: Test-Problem, Problem-Treatment, temporal events, Drug-Drug interactions, Drug-Effects, Drug-Protein interactions, and Drug-Dosage-Frequency-Channel.
Consider Figure 3, which shows the output of the Spark NLP [6] healthcare model with Drug-Dosage-Frequency-Channel relationships.
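A rule-based sketch of the Test-Problem case is shown below: given entities already tagged by NER, pair each TEST with the nearest PROBLEM that follows it. The pairing heuristic and the REVEALS label are illustrative assumptions; production systems learn relation classifiers from annotated corpora such as the i2b2 2010 relation dataset:

```python
# Toy relation extractor: link each TEST entity to the nearest following
# PROBLEM entity. Entities are (surface, label, char_offset) tuples,
# sorted by offset, e.g. as produced by an upstream NER step.
def extract_relations(entities):
    relations = []
    for i, (surface, label, _offset) in enumerate(entities):
        if label == "TEST":
            for other, other_label, _ in entities[i + 1:]:
                if other_label == "PROBLEM":
                    relations.append((surface, "REVEALS", other))
                    break
    return relations

entities = [("tumor", "PROBLEM", 12),
            ("hot forceps biopsy", "TEST", 45),
            ("polyp", "PROBLEM", 98)]
print(extract_relations(entities))
# [('hot forceps biopsy', 'REVEALS', 'polyp')]
```

Note the heuristic correctly skips the earlier "tumor" mention: proximity and direction stand in (crudely) for the semantic link a trained model would infer.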

De-identification of records
The task of de-identification is crucial for any research involving patient records, especially clinical notes, to maintain patient confidentiality and privacy. In a clinical setting, the general identifiers that need to be protected are patient name, address, social security number and date of birth.
NLP can perform this task with little to no supervision from the data providers. These models use entity extraction to identify names, birth dates and other private data, and mask them in the clinical notes before passing the notes to the clinical researcher. NLP also provides provisions to re-identify the masked entities.
Example: “A female patient Shillong Emily who is 49 years DOB: 6/24/1970 Associated Diagnoses: Small bowel obstruction; Periumbilical hernia”
Deidentified output: “A female patient <LAST> <FIRST> who is 49 years DOB: <DATE> Associated Diagnoses: Small bowel obstruction; Periumbilical hernia”
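A bare-bones sketch of the masking step is below. The date regex and the known-name list are illustrative assumptions; production de-identification systems use trained NER models to cover the full set of HIPAA identifiers and keep a secure mapping for re-identification:

```python
import re

# Hypothetical lookup of names to placeholder tags. A real system would
# detect names with an NER model rather than a fixed list.
KNOWN_NAMES = {"Shillong": "<LAST>", "Emily": "<FIRST>"}

def deidentify(text):
    """Mask dates like 6/24/1970 and any names from the known-name list."""
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "<DATE>", text)
    for name, placeholder in KNOWN_NAMES.items():
        text = re.sub(rf"\b{name}\b", placeholder, text)
    return text

note = ("A female patient Shillong Emily who is 49 years DOB: 6/24/1970 "
        "Associated Diagnoses: Small bowel obstruction; Periumbilical hernia")
print(deidentify(note))
```

To support re-identification, the substitutions would be recorded (securely, separate from the shared notes) so placeholders can be mapped back to the original values.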
There are two kinds of learning when it comes to AI-NLP methods: supervised and unsupervised. Supervised learning methods build models from labelled datasets and use this model knowledge to predict the behavior of unknown data samples. An example of supervised learning is a classification task that predicts whether new patients have a disease, based on a model built from a known dataset of patient records labelled with disease “present” or “absent”. Unsupervised learning methods do not need any labelled data and mostly rely on the statistical and mathematical properties of the dataset itself to produce results. Patient clustering (identifying similar patients), topic modelling, etc. fall under unsupervised learning.
These automatic text extraction methods using NLP techniques come in handy when there is little to no structured data available about the patients and when most of the information is recorded as nurse or care-team documentation or other free-text formats, especially in EHR applications.
However, clinical NLP research is limited and slow, mainly due to the lack of labelled clinical or medical data sources. In general English text processing scenarios, it is easy to build datasets by crowdsourcing. But this does not apply in the clinical domain, because labelling requires clinical experts, who are generally very busy and scarce. Moreover, legal and regulatory restrictions on using and exposing medical data (such as HIPAA in the United States) further limit how datasets can be created and shared. Also, there are few freely available labelled datasets.
Active learning, data augmentation, transfer learning, weak supervision and unsupervised learning help mitigate the issue of working with little labelled data, and in a few cases help in creating new labelled datasets [2].
This area is still young and has a lot of scope for developing more accurate, hands-free and reliable algorithms and architectures for better clinical data processing, extraction and understanding.
References
1. Martin-Sanchez, Fernando, and Karin Verspoor. “Big data in medicine is driving big changes.” Yearbook of Medical Informatics 23.01 (2014): 14–20.
2. Spasic, Irena, and Goran Nenadic. “Clinical text data in machine learning: systematic review.” JMIR Medical Informatics 8.3 (2020): e17984.
3. Campbell, David A., and Stephen B. Johnson. “Comparing syntactic complexity in medical and non-medical corpora.” Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001.
4. Zhang, Yuhao, et al. “Biomedical and clinical English model packages for the Stanza Python NLP library.” Journal of the American Medical Informatics Association 28.9 (2021): 1892–1899.
5. MIMIC-III Clinical Database, https://mimic.mit.edu/
6. Kocaman, Veysel, and David Talby. “Spark NLP: Natural language understanding at scale.” Software Impacts 8 (2021): 100058.

About Sarika Kondra: Sarika is an Applied Researcher at ThinkGenetic specializing in advancing the state of the art in natural language processing in the healthcare domain.