Challenges of diagnosis using EMR data
An integrated electronic medical record system, which can collect, store, display, transmit and reproduce patient information, is becoming an essential part of the fabric of modern healthcare [1, 2]. Current studies show that the medical information provided by Electronic Medical Records (EMRs) is more complete and faster to retrieve than that in traditional paper records. Nowadays, EMRs are becoming the main source of medical information about patients. The degree of health information sharing has become one indicator of the progress of hospital informatization in various countries, and research on and applications of EMRs have reached a considerable scale worldwide. How to use the rapidly growing volume of EMR data to support biomedical and clinical research is therefore an important research topic.
Because of their semi-structured and unstructured form, the study of EMRs belongs to the domain of Natural Language Processing (NLP). Notably, recent years have witnessed a surge of interest in analyzing patient EMRs with NLP. Ananthakrishnan et al. developed a robust electronic medical record-based model for the classification of inflammatory bowel disease, leveraging a combination of codified data and information extracted from clinical text notes using NLP. Katherine et al. assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) classifies subjects with rheumatoid arthritis (RA) more accurately than an algorithm using codified EMR data alone. The work by Ruben et al. studied a real-time electronic predictive model that identifies hospitalized heart failure (HF) patients at high risk for readmission or death, which may be valuable to the clinicians and hospitals that care for these patients. Although some effective NLP methods have been proposed for EMRs, many challenges remain; among the most relevant are the following:
(1) Low quality. Owing to the constraints of EMR templates, EMR data are highly similar on a large scale, especially in content. Moreover, medical record writing is not standardized, which sometimes leads to inconsistencies between the records and the doctor's diagnosis.
(2) Huge quantity. With the increasing popularity of medical informatization, EMR data have been growing rapidly in both scale and variety. The EMR databases hold a wealth of knowledge waiting to be mined.
(3) Imbalance. Because of the wide variety of diseases in EMR data (e.g., there are more than 14,000 distinct diagnosis codes in the International Classification of Diseases, Ninth Revision (ICD-9)), the sample distribution is expected to be rather imbalanced.
(4) Semi-structured and unstructured data. EMR data include front sheets, progress notes, test results, medical orders, surgical records, nursing records and so on. These documents contain structured information, unstructured text and graphic image information.
Beyond the above challenges, one must address the additional difficulties posed by the high density of the Chinese language compared with other languages. Chinese text is written without explicit word delimiters, so most words in a Chinese corpus cannot be identified in isolation. Word segmentation is therefore a necessary preprocessing step, and its quality directly affects all subsequent NLP operations on EMRs.
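As a minimal sketch of dictionary-based Chinese word segmentation, the routine below implements forward maximum matching: at each position it greedily takes the longest dictionary entry, falling back to a single character. The dictionary entries are toy examples; this illustrates the general idea only, not the specific segmenter used in this work.

```python
def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy medical dictionary (illustrative entries only).
vocab = {"急性", "上呼吸道", "感染", "上呼吸道感染"}
print(fmm_segment("急性上呼吸道感染", vocab))
# ['急性', '上呼吸道感染']
```

Because the matched words partition the input, concatenating the output always reproduces the original string, which makes the step easy to sanity-check in a preprocessing pipeline.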
Intelligent diagnosis using EMR data
In practice, a great deal of information is used to determine the disease, such as the patient's chief complaint, history of present illness, past history and relevant examinations. However, diagnostic accuracy depends not only on individual medical knowledge but also on clinical experience: different doctors may reach different diagnoses for the same patient. In particular, doctors with limited training or those in remote areas tend to have lower diagnostic accuracy. It is therefore both important and practical to establish an intelligent diagnosis model for EMRs.
Chen et al. applied machine learning methods, including support vector machines (SVM), decision forests and a novel summed-similarity measure, to automatically classify breast cancer texts based on their Semantic Space models. Ekong et al. proposed a fuzzy clustering algorithm for a clinical study of liver dysfunction symptoms. Xu et al. designed and implemented a medical text classification system based on k-nearest neighbors (KNN). Researchers worldwide who use EMRs for disease prediction usually focus on a particular department or a specific disease, and the algorithms employed are mostly classical machine learning methods such as KNN, SVM and decision trees (DT). Owing to the particularity of the medical field and the key role of professional medical knowledge, common text classification methods often fail to achieve good classification performance and cannot meet the requirements of clinical practice.
Benefiting from big data, powerful computation and new algorithmic techniques, we have been witnessing a renaissance of deep learning, especially the combination of natural language processing and deep neural networks. Dong et al. presented a CNN-based multiclass classification method for mining named entities from EMRs. A transfer bi-directional recurrent neural network was proposed for named entity recognition (NER) in Chinese EMRs, aiming to automatically extract medical knowledge such as phrases describing diseases and treatments. SA framed the prediction of heart disease as a multi-level problem over different features or signs and constructed an Intelligent Heart Disease Prediction System (IHDPS) based on neural networks.
However, to the best of our knowledge, few deep learning models have been employed for intelligent diagnosis with Chinese EMRs. Rajkomar et al. demonstrated that deep learning methods outperformed state-of-the-art traditional predictive models in all cases on electronic health record (EHR) data, which is probably the first study to apply deep learning methods to EHR model analysis.
Deep learning for natural language processing
NLP is a theory-motivated range of computational techniques for the automatic analysis and representation of human language, enabling computers to perform a variety of natural-language tasks at all levels, from parsing and part-of-speech (POS) tagging to dialog systems and machine translation. In recent years, deep learning algorithms and architectures have won numerous contests in fields such as computer vision and pattern recognition. Following this trend, recent NLP research is increasingly focused on deep learning methods.
In a deep learning model for NLP, word embedding is usually used as the first data preprocessing layer. Because the learnt word vectors capture general semantic and syntactic information, word embeddings produce state-of-the-art results on various NLP tasks [20–22]. Following the success of word embedding [23, 24], CNNs turned out to be a natural choice in view of their effectiveness in computer vision and pattern recognition tasks [25–27]. In 2014, Kim explored the use of CNNs for various sentence classification tasks, and CNNs were quickly adopted by other researchers thanks to their simple and effective architecture. Poria et al. proposed a multi-level deep CNN to tag each word in a sentence, which, coupled with a set of linguistic patterns, performed well in aspect detection.
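As a toy illustration of why dense word vectors help, the snippet below compares embeddings with cosine similarity. The words and vector values here are invented purely for illustration; in a real system the vectors would be learned (e.g., by word2vec) from the corpus.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-crafted 4-d vectors standing in for trained embeddings
# (illustrative values only; real vectors are learned from data).
emb = {
    "fever":   np.array([0.9, 0.1, 0.0, 0.2]),
    "cough":   np.array([0.8, 0.2, 0.1, 0.3]),
    "invoice": np.array([0.0, 0.9, 0.8, 0.1]),
}

print(cosine(emb["fever"], emb["cough"]))    # high: related symptom terms
print(cosine(emb["fever"], emb["invoice"]))  # low: unrelated terms
```

One-hot representations make every pair of distinct words equally dissimilar; dense embeddings, by contrast, let downstream layers exploit this kind of graded similarity.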
Beyond text classification, CNN models also suit other NLP tasks. For example, Denil et al. applied a DCNN to map the meanings of the words that constitute a sentence to that of documents for summarization, providing insights into automatic text summarization and the learning process. In question answering (QA), the work by Yih et al. presented a CNN architecture to measure the semantic similarity between a question and entries in a knowledge base (KB), determining which supporting fact in the KB to look for when answering a question. In information retrieval (IR), Chen et al. proposed a dynamic multi-pooling CNN (DMCNN) to overcome the loss of information in multiple-event modeling. In speech recognition, Palaz et al. performed extensive analysis of CNN-based speech recognition systems and built a robust automatic speech recognition system. In general, CNNs are extremely effective at mining semantic clues in contextual windows.
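The core operation behind these text CNNs, convolution over word embeddings followed by max-over-time pooling in the style of Kim's sentence classifier, can be sketched in a few lines of NumPy. The filters and embeddings below are random stand-ins (a trained model would learn the filter weights); the point is that pooling yields a fixed-size feature vector regardless of sentence length.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_max_pool(sent_emb, filters):
    """Kim-style feature extraction: slide each filter of width w over
    the word-embedding matrix, apply ReLU, then take the max over time
    so any sentence length yields exactly one feature per filter."""
    n_words, dim = sent_emb.shape
    features = []
    for W, b in filters:                    # W: (w, dim), b: scalar
        w = W.shape[0]
        scores = [np.maximum(0.0, np.sum(W * sent_emb[i:i + w]) + b)
                  for i in range(n_words - w + 1)]
        features.append(max(scores))        # max-over-time pooling
    return np.array(features)

dim = 8
sentence = rng.standard_normal((10, dim))           # 10 words, 8-d embeddings
filters = [(rng.standard_normal((w, dim)), 0.0)     # filter widths 2, 3, 4
           for w in (2, 3, 4)]
feats = conv_max_pool(sentence, filters)
print(feats.shape)   # (3,) -- one feature per filter, length-independent
```

The varying filter widths act like n-gram detectors over contextual windows, which is exactly the "mining semantic clues" behavior noted above.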
It is well known that pediatric patients cover a wide developmental range, from newborns to adolescents. Correspondingly, their treatments and drug dosages differ from those given to adult patients. It is therefore a great challenge to build a prediction model for pediatric diagnosis that is trained to “learn” expert medical knowledge and simulate a doctor’s thinking and diagnostic reasoning.
In this research, we propose a deep learning framework for intelligent diagnosis with Chinese EMRs, which incorporates a convolutional neural network (CNN) into an EMR classification application. The framework involves a series of operations including word segmentation, word embedding and model training. In real pediatric Chinese EMR diagnosis applications, the proposed model achieves high accuracy and a high F1-score. The novelty of this paper is reflected in the following:
(1) We construct a pediatric medical dictionary based on Chinese EMRs.
(2) Word2vec is used as the word embedding method to capture the semantics of the content of Chinese EMRs.
(3) A fine-tuned CNN model is constructed to perform pediatric diagnosis with Chinese EMR data.
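To make the model-training step concrete, the sketch below fits a softmax classification head, of the kind that would sit on top of pooled CNN features, to a single toy example by gradient descent on the cross-entropy loss. The feature vector and label are random stand-ins, not real EMR data; in the actual framework the CNN and this head are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy stand-ins for one EMR's pooled CNN features and its diagnosis label.
n_features, n_classes = 16, 4
x = rng.standard_normal(n_features)
y = 2                                    # index of the true diagnosis class

W = np.zeros((n_classes, n_features))    # softmax classification head
b = np.zeros(n_classes)
lr = 0.5

for _ in range(100):                     # plain gradient descent on cross-entropy
    p = softmax(W @ x + b)
    grad = p.copy()
    grad[y] -= 1.0                       # d(cross-entropy)/d(logits) = p - onehot(y)
    W -= lr * np.outer(grad, x)
    b -= lr * grad

print(int(np.argmax(softmax(W @ x + b))))   # 2 -- the head fits the toy example
```

The gradient `p - onehot(y)` is the standard softmax/cross-entropy identity; in a full framework the same gradient is backpropagated further into the CNN filters and, when embeddings are fine-tuned, into the word vectors themselves.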