BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies

Bioinformatics

We evaluated the performance of our BO-LSTM model on the SemEval 2013: Task 9 DDI extraction corpus [49]. This gold standard corpus consists of 792 texts from DrugBank [50], describing chemical compounds, and 233 abstracts from the Medline database [51]. DrugBank is a cheminformatics database containing detailed drug and drug target information, while Medline is a database of bibliographic information for scientific articles in the Life and Health Sciences. Each document was annotated with pharmacological substances and sentence-level DDIs. We refer to each combination of entities mentioned in the same sentence as a candidate pair, which is positive if the text describes a DDI between the two entities and negative otherwise. Each positive DDI was assigned one of four possible classes: mechanism, effect, advice, or int (used when none of the others were applicable).
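
To make the notion of candidate pairs concrete, the following sketch enumerates every pair of drug entities annotated in a sentence and assigns each pair its annotated DDI type, or a negative label when no interaction is annotated. The entity identifiers and the `annotated` dictionary are hypothetical stand-ins for the corpus annotations:

```python
from itertools import combinations

# Hypothetical entities annotated in one sentence, and the DDI types
# annotated for specific pairs (mechanism, effect, advice, or int).
entities = ["e0_aspirin", "e1_warfarin", "e2_ibuprofen"]
annotated = {("e0_aspirin", "e1_warfarin"): "effect"}

candidates = []
for e1, e2 in combinations(entities, 2):
    # a candidate pair is positive if it was annotated with a DDI type,
    # and negative ("none") otherwise
    label = annotated.get((e1, e2)) or annotated.get((e2, e1)) or "none"
    candidates.append((e1, e2, label))

print(candidates)
# [('e0_aspirin', 'e1_warfarin', 'effect'),
#  ('e0_aspirin', 'e2_ibuprofen', 'none'),
#  ('e1_warfarin', 'e2_ibuprofen', 'none')]
```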

In the context of the competition, the corpus was separated into training and test sets, each containing both DrugBank and Medline documents. We maintained the test set partition and evaluated on it, as is the standard procedure for this gold standard. After shuffling, we used 80% of the training set to train the model and the remaining 20% as a validation set. This way, the validation set contained both DrugBank and Medline documents, avoiding overfitting to a specific document type. It has been shown that the DDIs in the Medline documents are more difficult to detect and classify, with the best systems scoring almost 30 F1-score points lower on Medline than on the DrugBank documents [52].
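
A minimal sketch of this split (the paper does not specify the exact mechanism; scikit-learn's `train_test_split` with shuffling is one straightforward way to obtain it, and the instances and labels below are placeholders):

```python
from sklearn.model_selection import train_test_split

# placeholder candidate pairs and labels standing in for the real training set
instances = [f"pair_{i}" for i in range(100)]
labels = ["none"] * 80 + ["effect"] * 20

# shuffle so that DrugBank and Medline instances are mixed, then hold out 20%
train_x, val_x, train_y, val_y = train_test_split(
    instances, labels, test_size=0.2, shuffle=True, random_state=42)
```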

We implemented the BO-LSTM model in Keras, a Python-based deep learning library, using the TensorFlow backend. The overall architecture of the BO-LSTM model is presented in Fig. 2, and more details about each layer can be found in the “Methods” section. We focused on the effect of using different sources of information to train the model. As such, we tuned the hyperparameters only to obtain reasonable results, using as reference the values reported by other authors who have applied LSTMs to this gold standard [18, 19]. We first trained the model using only the word embeddings of the shortest dependency path (SDP) of each candidate pair (Fig. 2a). Then we tested the effect of adding the WordNet classes as a separate embedding and LSTM layer (Fig. 2b). Finally, we tested two variations of the ChEBI channel: first using the concatenation of the sequences of ancestors of each entity (Fig. 2c), and second using the sequence of common ancestors of both entities (Fig. 2d).
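
The following is a minimal sketch of this multi-channel architecture in Keras. The layer sizes, vocabulary sizes, and dropout are placeholders, not the tuned hyperparameters used in the paper:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, Dropout
from tensorflow.keras.models import Model

MAX_LEN = 30    # maximum SDP length (placeholder)
N_CLASSES = 5   # negative + mechanism, effect, advice, int

def channel(name, vocab_size, emb_dim):
    """One input channel: embedding of a sequence followed by an LSTM."""
    inp = Input(shape=(MAX_LEN,), name=name)
    emb = Embedding(vocab_size, emb_dim, mask_zero=True)(inp)
    return inp, LSTM(100)(emb)

words_in, words_out = channel("sdp_words", 10000, 200)      # word embeddings of the SDP
wordnet_in, wn_out = channel("wordnet_classes", 50, 50)     # WordNet hypernym classes
chebi_in, chebi_out = channel("chebi_ancestors", 1000, 50)  # ChEBI ancestor sequences

# the outputs of all channels are concatenated and fed to a softmax classifier
merged = Concatenate()([words_out, wn_out, chebi_out])
merged = Dropout(0.5)(merged)
output = Dense(N_CLASSES, activation="softmax")(merged)

model = Model(inputs=[words_in, wordnet_in, chebi_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```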

Table 1 shows the DDI detection results obtained with each configuration using the evaluation tool provided by the SemEval 2013: Task 9 organizers on the gold standard, while Table 2 shows the DDI classification results, using the same evaluation tool and gold standard. The difference between these two tasks is that while detection ignores the type of interaction, the classification task requires identifying the positive pairs as well as their correct interaction type. We compare the performance on the whole gold standard and on each document type (DrugBank and Medline). The first row of each table shows the results obtained using an LSTM network trained solely on the word embeddings of the SDP of each candidate pair. Then, we studied the impact of adding each information channel on the performance of the model, and the effect of using all information channels, as shown in Fig. 2.
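
The distinction between the two tasks can be illustrated with a small sketch: detection counts a prediction as correct if it matches a gold pair regardless of type, while classification also requires the type to match. The scoring below is a simplified stand-in for the official SemEval evaluation tool, with hypothetical pair identifiers:

```python
def f1(gold, pred):
    """Micro precision/recall/F1 over sets of gold and predicted items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# hypothetical (pair_id, ddi_type) gold annotations and predictions
gold = {("p1", "effect"), ("p2", "mechanism"), ("p3", "advice")}
pred = {("p1", "effect"), ("p2", "advice"), ("p4", "int")}

classification_f1 = f1(gold, pred)  # the interaction type must match
detection_f1 = f1({g[0] for g in gold}, {p[0] for p in pred})  # type ignored
print(detection_f1, classification_f1)  # detection is always >= classification
```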

Table 1

Evaluation scores obtained for the DDI detection task on the DDI corpus and on each type of document, comparing different configurations of the model (P: precision, R: recall, F1: F1-score)

                         All                       DrugBank                  Medline
Configuration            P       R       F1        P       R       F1        P       R       F1
Word embeddings          0.7551  0.6865  0.7192    0.7620  0.7158  0.7382    0.6389  0.3770  0.4742
+ WordNet                0.7160  0.6936  0.7046    0.7267  0.7143  0.7204    0.5800  0.4754  0.5225
+ Common Ancestors       0.7661  0.6738  0.7170    0.7723  0.7003  0.7345    0.6667  0.3607  0.4681
+ Concat. Ancestors      0.7078  0.7489  0.7278    0.7166  0.7578  0.7366    0.6032  0.6230  0.6129
+ WordNet + Ancestors    0.6572  0.8184  0.7290    0.6601  0.8385  0.7387    0.5574  0.5574  0.5574

Table 2

Evaluation scores obtained for the DDI classification task on the DDI corpus and on each type of document, comparing different configurations of the model (P: precision, R: recall, F1: F1-score)

                         All                       DrugBank                  Medline
Configuration            P       R       F1        P       R       F1        P       R       F1
Word embeddings          0.5819  0.5291  0.5542    0.5868  0.5512  0.5685    0.5000  0.2951  0.3711
+ WordNet                0.5754  0.5574  0.5663    0.5845  0.5745  0.5795    0.4600  0.3770  0.4144
+ Common Anc.            0.5968  0.5248  0.5585    0.6045  0.5481  0.5749    0.5152  0.2787  0.3617
+ Concat. Anc.           0.5282  0.5589  0.5431    0.5286  0.5590  0.5434    0.4921  0.5082  0.5000
+ WordNet + Anc.         0.5182  0.6454  0.5749    0.5171  0.6568  0.5787    0.4590  0.4590  0.4590

For the detection task, using the concatenation of ancestors improved the F1-score on the Medline dataset, contributing to an overall improvement of the F1-score on the full test set. The most notable improvement was in the recall on the Medline dataset, which the concatenation of ancestors increased by 0.246. The use of ontology ancestors did not improve the F1-score for DDI detection on the DrugBank dataset. In every test set, the concatenation of ancestors results in a higher recall, while considering only the common ancestors is more beneficial to precision. Combining both approaches with the WordNet channel results in a higher F1-score.

Regarding the classification task (Table 2), the F1-score on each dataset was improved by the use of the ontology channel. Considering only the common ancestors improved the F1-score on the DrugBank dataset and on the full corpus, while the concatenation improved the Medline F1-score, similar to the detection results.

To better understand the contribution of each channel, we studied the relations detected by each configuration, and which of those were also present in the gold standard. Figures 4 and 5 show the intersection of the results of each configuration on the full, DrugBank, and Medline test sets. We compare only the results of the detection task, as it is simpler to analyze and shows the differences between the configurations. In Fig. 4, the false negatives are the relations unique to the gold standard, and the false positives of each configuration are the relations that do not intersect with the gold standard. The difference between the values in this figure and the sum of their respective values in Fig. 5 is due to the system being executed once for each dataset. Overall, 369 of the 979 relations in the gold standard test set were not detected by any configuration of our system. We can observe that 60 relations were detected only when adding the ontology channels.
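
This kind of intersection analysis reduces to set operations once each configuration's output and the gold standard are represented as sets of relation identifiers; a small sketch with hypothetical identifiers:

```python
# hypothetical sets of detected relation IDs per configuration
gold = {"r1", "r2", "r3", "r4", "r5"}
configs = {
    "word_embeddings": {"r1", "r2", "r6"},
    "wordnet": {"r1", "r3", "r7"},
    "ancestors": {"r2", "r3", "r4"},
}

detected_by_any = set().union(*configs.values())
false_negatives = gold - detected_by_any  # missed by every configuration
# correct relations found only when the ontology channel is added
only_with_ontology = (configs["ancestors"] & gold) \
    - configs["word_embeddings"] - configs["wordnet"]
print(len(false_negatives), sorted(only_with_ontology))
```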

Fig. 4

Venn diagram demonstrating the contribution of each configuration of the model to the results of the full test set. The intersection of each channel with the gold standard represents the number of true positives of that channel, while the remaining correspond to false negatives and false positives

Fig. 5

Venn diagram demonstrating the contribution of each configuration of the model to the DrugBank (a) and Medline (b) test set results. The intersection of each channel with the gold standard represents the number of true positives of that channel, while the remaining correspond to false negatives and false positives

In the Medline test set, the ontology channel identified 7 relations that were not identified by any other configuration (Fig. 5b). One of these relations was the effect of quinpirole treatment on amphetamine sensitization. Quinpirole has 27 ancestors in the ChEBI ontology, while amphetamine has 17, and they share 10 of these ancestors, the most informative being “organonitrogen compound”. While this information is not described in the original text, but only encoded in the ontology, it is relevant to understanding whether the two entities can participate in a relation. However, this comes at a cost in precision, since 10 incorrect DDIs were also classified as positive by this configuration.
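
A sketch of how common ancestors can be extracted from an ontology represented as a directed graph, using a toy is-a fragment in place of the full ChEBI ontology. Depth below the root is used here as a simple stand-in for informativeness; the model itself consumes the ancestor sequences directly:

```python
import networkx as nx

# toy is-a fragment: edges point from child to parent
g = nx.DiGraph([
    ("quinpirole", "organonitrogen compound"),
    ("amphetamine", "organonitrogen compound"),
    ("organonitrogen compound", "nitrogen molecular entity"),
    ("nitrogen molecular entity", "chemical entity"),
])

def ancestors(term):
    # with child->parent edges, nx.descendants follows the is-a chain upwards
    return nx.descendants(g, term)

common = ancestors("quinpirole") & ancestors("amphetamine")
# approximate "most informative" as the common ancestor furthest from the root
most_informative = max(
    common, key=lambda t: nx.shortest_path_length(g, t, "chemical entity"))
print(sorted(common), most_informative)  # -> "organonitrogen compound"
```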

To empirically compare our results with the state of the art in DDI extraction, we compiled the most relevant works on this task in Table 3. The first line refers to the system that obtained the best results in the original SemEval task [38, 53]. Since then, other authors have presented approaches to this task, most recently using deep learning algorithms. In Table 3 we compare the machine learning architecture used by each system and the results reported by its authors; since some authors addressed only the DDI classification task, the table reports classification F1-scores. We were only able to replicate the results of Zhang et al. [48]. Since this system follows an architecture similar to ours, we adapted it with our ontology-based channel, as described in the “Methods” section. This modification improved the F1-score by 0.022. Our version of this model is also available on our page along with the BO-LSTM model.

Table 3

Comparison of DDI extraction systems

System                        Architecture   F1 (classification)
FBK-irst [38]                 SVM            0.651
SCNN [18]                     CNN            0.686
Joint AB-LSTM [19]            LSTM           0.6939
Att-BLSTM [22]                LSTM           0.773
DLSTM [20]                    LSTM           0.6839
BR-LSTM [21]                  LSTM           0.7115
Zhang et al. 2018 [48]        LSTM           0.729
Zhang et al. 2018 + BO-LSTM   LSTM           0.751

We used the HP corpus to demonstrate the generalizability of our method. This case study served only as a proof of concept; it was not our intent to measure the performance of the model, given the limited number of annotations and the dependence on the quality of exact string matching to identify the genes. For example, we may have missed correct relations in the corpus because they were not in the reference file or because the gene name was not correctly identified.
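
For reference, exact string matching of the kind described above can be as simple as a word-boundary regular expression over a gene name list, which also illustrates why variant names are missed. The gene list here is a hypothetical excerpt:

```python
import re

gene_names = ["MEN1", "NF2"]  # hypothetical excerpt of a reference gene list
pattern = re.compile(r"\b(" + "|".join(map(re.escape, gene_names)) + r")\b")

sentence = "Angiofibromas are a known feature of MEN1-associated disease."
# matches the exact symbol "MEN1", but would miss a synonym such as "menin"
print(pattern.findall(sentence))
```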

Therefore, we used 60% (137 documents) of the corpus to train the model and 40% (91 documents) to manually evaluate the relations predicted with that model. For example, in one of the evaluated sentences, the model identified the relation between the phenotype “angiofibromas” and the gene “MEN1”. One relation recurrently identified by our model despite not being described in the text itself was between the phenotype “neurofibromatosis” and the gene “NF2”.

Despite this relation not being described in the sentences where it was identified, it is predicted due to its presence in the phenotype-gene associations file. With a larger number of annotations in the training corpus, we expect this type of error to disappear.
