There are over 7,000 rare genetic disorders (RGDs), and only 5% of them have FDA-approved treatments. Although the number of RGDs is high, they are heterogeneous and geographically dispersed. An estimated 350 million people are living with a rare disorder.1 The true number may be higher because much of the epidemiological data for RGDs is unknown. The prevalence of a rare disease is usually an estimate and may change over time. RGDs are also underrepresented in research due to factors such as small patient populations, lack of awareness of diagnostic criteria, delays in clinical trial enrollment, and insufficient research funding.
One way to increase the representation of RGDs in research, and so improve disease understanding, is to take advantage of recent breakthroughs in artificial intelligence (AI), a subfield of computer science. AI is an umbrella term for methodologies that aim to give machines intelligence on par with humans. AI models can learn complex underlying patterns from data and make predictions on unseen samples, which can substantially reduce the need for human intervention. AI and its branches, including machine learning (ML), deep learning (DL), and natural language processing (NLP), are making a significant impact in healthcare.2 This has encouraged AI researchers to turn to the RGD domain.
This post mainly discusses the following:
- What is the current situation of AI in the rare genetic disease space?
- What specific tasks are being solved by AI in the realm of RGDs?
- What are the popular RGDs that leverage AI capabilities?
- What kind of RGD related data is relevant for AI research?
Trends in Research
Figure 1 (below) shows the number of AI-based publications in the RGD domain from 2010 to 2019. Publications have risen steeply since 2015, which can be attributed to the rise of deep learning around that time. DL leverages big data and high-performance computing to make high-quality predictions. DL research is dominated by computer vision, which works with image data, and this has particularly benefited healthcare and the RGD domain, where large volumes of imaging data are available from radiology. For example, the DeepGestalt framework diagnoses rare syndromes such as Cornelia de Lange syndrome, Emanuel syndrome, and Pallister–Killian syndrome using deep learning facial analysis.8
In general, AI or ML methods used for non-RGD healthcare tasks, such as patient classification, patient clustering, NLP-based named entity recognition, relation extraction, and de-identification of patient records, can be applied to RGD tasks as well. The difference lies in understanding and extracting RGD-specific features from the available medical data indicated in Figure 2, for instance identifying the affected gene and its disease-causing variants from genetic reports. AI methods, especially ensemble models, have proved successful in variant detection, prediction, and anomaly classification.4

Because RGD diagnostic criteria are hard to identify, recent NLP models have shown great success in extracting rare and complex disease names from clinical notes in Electronic Health Record (EHR) applications. DABLC uses deep attention neural networks to enhance RGD coding and tagging from clinical notes.10
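To make the task of extracting disease names from clinical notes concrete, here is a minimal sketch of a plain dictionary lookup. This is a far simpler baseline than an attention-based neural model and is not how DABLC works. The dictionary entries, the function name `find_disease_mentions`, and the sample note are all made up for illustration.

```python
import re

# Hypothetical mini-dictionary mapping surface forms to a canonical name.
# A real system would draw on a curated rare disease vocabulary instead.
DISEASE_DICTIONARY = {
    "hunter syndrome": "MPS II",
    "mucopolysaccharidosis type ii": "MPS II",
    "cornelia de lange syndrome": "Cornelia de Lange syndrome",
    "fragile x syndrome": "Fragile X syndrome",
}


def find_disease_mentions(note):
    """Return (start, end, matched_text, canonical_name) tuples, sorted by position."""
    mentions = []
    for surface, canonical in DISEASE_DICTIONARY.items():
        # Whole-phrase, case-insensitive match of each dictionary entry.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, note, flags=re.IGNORECASE):
            mentions.append((match.start(), match.end(), match.group(0), canonical))
    return sorted(mentions)


# Invented example note, not real patient data.
note = "Family history of Fragile X syndrome; patient evaluated for Hunter syndrome."
for start, end, text, canonical in find_disease_mentions(note):
    print(f"{text!r} at {start}-{end} -> {canonical}")
```

Exact matching like this misses misspellings, abbreviations, and descriptions that never name the disease, which is part of why learned models are attractive for this task.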
AI Applications in Rare Genetic Diseases Domain
Below are examples of where AI algorithms have shown promise in the field of RGD.
- Disease diagnosis focuses on improving early diagnosis, identifying biomarkers, deep phenotyping, and identifying previously undiagnosed patients. In one study, a Naive Bayes classifier trained on data from the literature and clinical notes was used to predict diagnoses of mucopolysaccharidosis type II (also known as MPS II or Hunter syndrome).9 It identified 125 of 505,526 patients as having MPS II, with 99% accuracy.
- Disease prognosis covers disease progression, survival rates, and risk estimation. One study built a simple neural network model to estimate survival rates in patients with synovial sarcoma.5 In another study, the Biosigner algorithm identified sphingomyelins as the most relevant disease progression marker for amyotrophic lateral sclerosis (ALS).6
- Disease characterization: A stochastic gradient boosting classifier identified 15 influential predictors in narcolepsy type 1 and type 2 using data from the European Narcolepsy Network.
- Drug repurposing identifies new therapeutic uses for existing approved drugs. AI can be applied successfully in such scenarios because it can identify similar patterns in drug targets. In the context of RGDs, AI-based methods predicted cisplatin and resveratrol as potential candidates for treating refractory anemia with excess blasts and sideroblastic anemia.11
- Disease associations and clustering: Disease–disease similarity cluster networks based on phenotypic features were built employing the parameter-free clustering algorithm FLAME. Interestingly, phenotypic relationships joined several lysosomal storage diseases (LSDs) and, in their vicinity, identified two forms of spinal muscular atrophy, a phenotype not commonly associated with LSD.
- Clinical trials: Because RGD patient populations are small, patient identification and recruitment for clinical trials (CT) are challenging. AI-based methods aided in developing an in silico clinical trial to test bone morphogenetic protein treatment in congenital pseudarthrosis of the tibia associated with neurofibromatosis type 1.12
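The Naive Bayes diagnosis example in the list above can be illustrated with a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. This is not the cited study's pipeline. The notes, the labels ("suspect" and "other"), and the function names are invented for illustration and carry no clinical meaning.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; real notes would need more care."""
    return text.lower().split()


def train_naive_bayes(notes, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for note, label in zip(notes, labels):
        tokens = tokenize(note)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, note):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(note):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log likelihood.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
notes = [
    "coarse facial features hepatosplenomegaly joint stiffness",
    "recurrent ear infections joint stiffness enlarged liver",
    "seasonal cough mild fever",
    "ankle sprain after exercise",
    "mild fever sore throat",
]
labels = ["suspect", "suspect", "other", "other", "other"]

model = train_naive_bayes(notes, labels)
print(predict(model, "joint stiffness and enlarged liver"))  # prints "suspect"
print(predict(model, "mild fever and cough"))                # prints "other"
```

With a condition this rare, almost every patient belongs to the negative class, so a real evaluation would likely report sensitivity and positive predictive value alongside accuracy.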
So far, the RGDs that appear most frequently in AI-related research are ALS, systemic lupus erythematosus (SLE), moderate and severe traumatic brain injury, cystic fibrosis, Huntington’s disease, Down syndrome, preeclampsia, acquired aneurysmal subarachnoid hemorrhage, systemic sclerosis, fragile X syndrome, and retinopathy of prematurity (RoP).3 Less frequently studied groups include rare developmental defects during embryogenesis, rare inborn errors of metabolism, rare skin diseases, and rare endocrine diseases (Figure 2).3
Conclusion
A crucial step toward better medical care for RGD patients is to leverage the potential of technologies like artificial intelligence and to commercialize the research outcomes. The use of AI across the RGD spectrum is catching up and is proving impactful in disease diagnosis, prognosis, disease patterns, treatment, and drug discovery. As of now, only well-known RGDs like ALS and SLE are widely studied using AI techniques. High-performing AI-based techniques like ML, DL, computer vision, and NLP can identify low-frequency latent information and improve the representation of underrepresented RGDs in databases. This also benefits information sharing and new research initiatives for RGDs. At ThinkGenetic, we build algorithms for early diagnosis of RGDs in patients and explore AI-mediated approaches to reduce diagnostic errors.

References
1. Wakap, Stéphanie Nguengang, et al. “Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database.” European Journal of Human Genetics 28.2 (2020): 165–173.
2. Topol, Eric. Deep medicine: how artificial intelligence can make healthcare human again. Hachette UK, 2019.
3. Schaefer, Julia, et al. “The use of machine learning in rare diseases: a scoping review.” Orphanet Journal of Rare Diseases 15.1 (2020): 1–10.
4. Brasil, Sandra, et al. “Artificial intelligence (AI) in rare diseases: is the future brighter?” Genes 10.12 (2019): 978.
5. Han, Ilkyu, et al. “Deep learning approach for survival prediction for patients with synovial sarcoma.” Tumor Biology 40.9 (2018): 1010428318799264.
6. Blasco, Hélène, et al. “A pharmaco-metabolomics approach in a clinical trial of ALS: Identification of predictive markers of progression.” PLoS One 13.6 (2018): e0198116.
7. Hoehndorf, Robert, Paul N. Schofield, and Georgios V. Gkoutos. “Analysis of the human diseasome using phenotype similarity between common, genetic and infectious diseases.” Scientific Reports 5.1 (2015): 1–14.
8. Gurovich, Yaron, et al. “Identifying facial phenotypes of genetic disorders using deep learning.” Nature Medicine 25.1 (2019): 60–64.
9. Ehsani-Moghaddam, Behrouz, et al. “Mucopolysaccharidosis type II detection by Naïve Bayes Classifier: An example of patient classification for a rare disease using electronic medical records from the Canadian Primary Care Sentinel Surveillance Network.” PLoS One 13.12 (2018): e0209018.
10. Xu, Kai, et al. “Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition.” Computers in Biology and Medicine 108 (2019): 122–132.
11. Lee, Young-suk, et al. “A computational framework for genome-wide characterization of the human disease landscape.” Cell Systems 8.2 (2019): 152–162.
12. Carlier, Aurélie, et al. “In silico clinical trials for pediatric orphan diseases.” Scientific Reports 8.1 (2018): 1–9.