Adding Linguistic Information to Transformer Models Improves Biomedical Event Detection?

Biomedical event detection is an essential subtask of event extraction that identifies and classifies event triggers, which indicate the possible construction of events. In this work we compare BERT and four of its variants for the detection of biomedical events in order to evaluate and analyze the differences in their performance. The models are trained on seven manually annotated corpora from different biomedical subdomains and fine-tuned by adding either a linear layer or a Bi-LSTM layer on top of the models. The evaluation compares the behavior of the original models with and without a lexical feature and a syntactic feature. SciBERT emerged as the highest-performing model when fine-tuned with a Bi-LSTM layer, without the need for extra features. This result suggests that a transformer model pretrained from scratch on both biomedical and general data can detect event triggers in the biomedical domain across different subdomains.


I. INTRODUCTION
BIOMEDICAL event extraction is a complex information extraction task that identifies key information in large sets of textual data for further applications, such as the study of biomolecular mechanisms or epigenetic changes. A biomedical event is constructed from an event trigger and one or more arguments that orbit around the trigger. Event triggers are generally nouns or verbs that express an action, circumstance or eventuality, while the arguments are either biomedical entities or other events, called nested events. Fig. 1 shows an example of a sentence annotated with two biomedical events, '-Reg' (which stands for 'Negative regulation') and 'Locl' (which stands for 'Localization'). The event 'Locl' (an event is given the same type as its trigger), constructed from the trigger word 'excretion', has as its argument a biomedical entity of the type 'D/C' (which stands for 'Drug or compound'), which plays the role 'Th' (which stands for 'Theme'). This role answers the question 'What is excreted?'. The event '-Reg', constructed from the trigger word 'reduces', has two arguments. The first is a biomedical entity of the type 'Drug or compound', which plays the role 'Cause' and answers the question 'What causes the reduction?'. The second is the nested event 'Locl' described above, which plays the role 'Theme' and answers the question 'What is reduced?'.
Event extraction is usually divided into three main subtasks: event detection, argument identification and event construction. Event detection identifies the trigger words and classifies them into a set of predefined trigger types, while argument identification identifies and classifies the corresponding event arguments and their respective roles [1]. Event construction merges the relations that correspond to the same event. This work focuses on event detection, which plays a fundamental role in the construction of events, since the triggers are the targets that signal that an event may exist [2]. The difficulty of trigger detection comes from its sensitivity to the domain or subdomain (text can contain specialized language), the variety of linguistic forms (triggers can be single words, multi-word expressions or discontinuous markers) and the ambiguity of the trigger class (the same trigger word can be assigned different trigger classes) [3]. According to different works, such as [1], these issues can be addressed with additional features that provide lexical, syntactic and semantic information about the text, which have proven useful for detecting event triggers.
Transformer models have been adopted for event detection because of their strong performance on different Natural Language Processing (NLP) tasks [4], [5]. BERT [6], which stands for Bidirectional Encoder Representations from Transformers, is pretrained to generate bidirectional representations of words, capturing their semantics by considering both the left and the right context. After this pretraining, BERT can be fine-tuned by adding layers on top of the model to solve new specific tasks. Furthermore, a series of BERT variants have been developed for specific domains, such as the biomedical domain, by training on large corpora from that domain.
In this work we compare BERT and four of its variants pretrained in the biomedical domain for the detection of biomedical event triggers, in order to analyze their performance and identify which model is the most appropriate for this task. For this purpose, BERT, BioBERT, SciBERT, PubMedBERT, and BioMedRoBERTa are fine-tuned using two different classifiers, a linear layer and a Bidirectional Long Short-Term Memory (Bi-LSTM) layer, to detect biomedical event triggers. These BERT variants were chosen for comparison because they share the same BERT architecture but were pretrained on different data from the biomedical and/or general domain [7]-[9]. The models are trained on seven manually annotated datasets merged together. These corpora were originally developed for the event extraction task in different biomedical subdomains. In addition to these data, two features are given to the models as extra lexical and syntactic information: the stems and the part-of-speech (POS) tags, respectively. SciBERT achieved the highest performance when fine-tuned with a Bi-LSTM classifier and without any extra features. This result suggests that a transformer model pretrained from scratch on biomedical and general-domain data can detect biomedical event triggers across different biomedical subdomains.
Our main contributions are (1) a comparison of the capability of different pretrained transformer models to detect biomedical events, (2) an evaluation of the performance of two different classifiers for the fine-tuning of event detection, (3) an analysis of the impact of manually annotated corpora from different biomedical subdomains on the detection of event triggers, and (4) an assessment of whether adding lexical and syntactic information improves biomedical event detection.

II. RELATED WORK
Current state-of-the-art (SOTA) systems for event detection use neural network models because of their robust event extraction capabilities. P. V. Rahul et al. [10] used Recurrent Neural Networks (RNNs) to extract higher-level features through the hidden state of the network to identify biomedical event triggers. They also used word and entity type embeddings as features, demonstrating positive results on the MLEE [11] corpus. S. Duan et al. [12] and Y. Zhao et al. [13] explored augmenting the semantic information by integrating the full document representation. Both proposed the use of RNNs to extract cross-sentence features without external resources. T. H. Nguyen and R. Grishman [14] presented a Graph Convolution Network (GCN) model that exploits syntactic dependency relations. They used dependency trees to link words to their informative context for event detection. H. Yan et al. [15] also proposed a GCN model, integrating aggregative attention to model and aggregate multi-order syntactic representations of the sentences, while S. Cui et al. [2] extended the GCN with a relation-aware concept, which exploits the syntactic relation labels and models the relation between words.
DeepEventMine [16] is an end-to-end system for event extraction that consists of four main modules: a BERT model, trigger and entity detection and classification, relation extraction, and event identification. For each of the modules, BERT is used as the base model and a linear layer is added. One of the main objectives of this system is to improve the extraction of nested events, where it achieved a new SOTA performance on seven biomedical nested event extraction tasks. B. Portelli et al. [17] compared BERT and five of its variants for the identification of Adverse Drug Events (ADEs). They showed that span-based pretraining, from SpanBERT, improves the recognition of ADEs, and that pretraining the models in the specific domain is particularly useful compared with training the models from scratch. A. Ramponi et al. [18] developed BEESL, a neural network model based on a sequence labeling system for the extraction of events. The system converts the event structures into a sequence labeling format and uses BERT as the language model. Y. Chen [19] proposed the Multi-Source Transfer Learning-based Trigger Recognizer system, which extends transfer learning to multiple source domains. All the datasets from the different domains are used to jointly train the neural network, achieving a higher recognition performance in the biomedical domain with a wide coverage of events.
According to these works, transformer architectures have achieved positive results in detecting event triggers, and the use of pretrained language models has improved performance on this task. However, these works were developed for a specific biomedical subdomain or for the general domain, which does not allow generalization to different biomedical subdomains. This may limit the detection of biomedical triggers, because the language in biomedical texts is usually specialized and very specific. In addition, these works do not describe how the pretrained language models they used were selected over the other existing models. Moreover, according to A. Ramponi et al. [18], the detection of triggers continues to be the most important source of errors in event extraction, where around 31 % of the errors correspond to non-detection of triggers and 28 % to overdetection of triggers.

III. MATERIALS AND METHODS
Fig. 2 shows the approach followed in this work. The annotated data are given as input to the pretrained transformer models to compute the embeddings. The models used are BERT and four of its variants, which have achieved state-of-the-art performance in different NLP tasks without requiring major task-specific architectural modifications. In addition, the embeddings of a lexical feature and a syntactic feature are computed. Then, a classification layer is added on top of the models for fine-tuning to detect event triggers.
Fig. 2. Overview of the proposed approach to detect event triggers.

A. Transformer Model: BERT
BERT [6] is a contextualized word representation model based on a masked language model pretrained with bidirectional transformers [7]. In BERT, each token (word or sub-word) in the input sequence is represented by an initial vector that combines, through element-wise summation, the token embedding, the (token) position embedding and the segment embedding (the text segment to which the token belongs). The embeddings of extra features can be computed and included in this summation, such as POS embeddings (the function of the token in meaning and grammar within the sentence), which have been shown to help in detecting event triggers [1]. The embeddings are then passed to a stack of transformer layers. Each transformer layer generates a contextual representation of every token by summing the non-linear transformation of the tokens' representations from the previous layer. This representation is weighted by the attentions calculated using the representations of the previous layer as query. The last layer generates the contextual representations for all the tokens, in which the information of the whole text span is combined [20]. Following the BERT principle, other transformer models have been developed by pretraining on data from specific domains, e.g. biomedical data, and adapt better to in-domain tasks. BioBERT [7] and BioMedRoBERTa [21] are examples of BERT variants pretrained in the biomedical domain.
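
The element-wise summation of input embeddings described above can be illustrated with a minimal sketch. The vectors below are toy four-dimensional values chosen only for illustration (the models in this work use a hidden size of 768), and the helper name `sum_embeddings` is hypothetical rather than part of any library.

```python
from typing import List

Vector = List[float]


def sum_embeddings(*parts: Vector) -> Vector:
    """Element-wise sum of equally sized embedding vectors."""
    size = len(parts[0])
    if any(len(p) != size for p in parts):
        raise ValueError("all embeddings must share the same dimension")
    return [sum(values) for values in zip(*parts)]


# Toy embeddings for a single token (illustrative values, not learned ones).
token_emb = [0.5, -1.0, 0.25, 0.0]
position_emb = [0.5, 0.5, 0.5, 0.5]
segment_emb = [0.0, 0.0, 0.0, 0.0]
pos_tag_emb = [1.0, 0.0, -0.25, 2.0]  # hypothetical embedding of the POS tag "NOUN"

# The extra feature is simply one more term in the sum.
input_vector = sum_embeddings(token_emb, position_emb, segment_emb, pos_tag_emb)
print(input_vector)  # [2.0, -0.5, 0.5, 2.5]
```

Because the extra feature enters as an additional summand, it must have the same dimension as the other embeddings, and the transformer layers themselves need no modification.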

B. Fine-Tuning Transformer Models for Event Detection
Various downstream text mining tasks can be performed by making minimal modifications to the BERT architecture through fine-tuning. Here, the transformer models are fine-tuned following the Named Entity Recognition (NER) task. NER is one of the main tasks of biomedical text mining and aims to recognize domain-specific nouns in a biomedical corpus by assigning each word $s_i$ in a sentence $S = s_1, s_2, \ldots, s_n$ (where $n$ is the number of words in the sentence) a predefined class $l \in L$ (where $L$ is the predefined collection of entity types, including the no-entity class). In this work, NER is adapted to identify triggers, which implies identifying not only nouns but also verbs and, in some cases, adjectives. Two different classification layers, a linear layer and a Bi-LSTM layer, are used separately for comparison. The output labels follow the IOB (Inside-Outside-Beginning) tagging scheme, which identifies the triggers and, through the I and B tags, classifies them into the predefined trigger categories.
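
As an illustration of IOB tagging for triggers, the following sketch converts trigger spans into one label per word. The span format (start index, exclusive end index, trigger type), the example sentence and the label names are assumptions made for this example; only the trigger words 'reduces' and 'excretion' come from the example in Fig. 1.

```python
from typing import List, Tuple

# (start, end, trigger_type) over word indices, end exclusive (assumed format).
Span = Tuple[int, int, str]


def spans_to_iob(n_words: int, spans: List[Span]) -> List[str]:
    """Assign B-/I- labels to words inside trigger spans and O elsewhere."""
    labels = ["O"] * n_words
    for start, end, trigger_type in spans:
        labels[start] = f"B-{trigger_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{trigger_type}"
    return labels


# Hypothetical sentence built around the two triggers of the running example.
words = ["Drug", "X", "reduces", "the", "excretion", "of", "compound", "Y"]
spans = [(2, 3, "Negative_regulation"), (4, 5, "Localization")]
labels = spans_to_iob(len(words), spans)

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```

A multi-word trigger receives one B tag followed by I tags, for example `spans_to_iob(4, [(1, 3, "Binding")])` gives `["O", "B-Binding", "I-Binding", "O"]`.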

A. Corpus
Table I presents the seven datasets (all publicly available) used for fine-tuning the transformer models: Cancer Genetics (CG) 2013 [22], Epigenetics and Post-translational Modifications (EPI) 2011 [23], GENIA 2011 [24], GENIA 2013 [25], Infectious Diseases (ID) 2011 [26], Pathway Curation (PC) 2013 [22], and Multi-Level Event Extraction (MLEE) [11]. These corpora were manually or semi-manually annotated by experts and released for the development and improvement of event extraction models. For the experiments, the training and development sets of all the corpora are first merged into a single dataset and split into sentences, giving a total of 24,819 sentences. The original test sets are not used because their annotations have not been released. Then, a random 80/20 partition is applied to obtain the training and testing sets, containing 19,855 and 4,964 sentences, respectively. Each sentence is further split into words on spaces, and each word is then split into sub-words (tokens) following the BERT tokenization. These tokens are given as input to the transformer model. All the trigger classes from each corpus are considered for the final trigger classification, giving a final set of 58 classes (some classes overlap among the different corpora).
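
The 80/20 partition can be sketched as follows. The shuffling procedure and the seed are assumptions, since the text only states that the partition is random; sentence identifiers stand in for the actual sentences.

```python
import random
from typing import List, Sequence, Tuple


def split_80_20(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the items and cut it at 80 % (rounded down)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # the seed value is an arbitrary choice
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


sentence_ids = range(24819)  # one identifier per sentence in the merged dataset
train, test = split_80_20(sentence_ids)
print(len(train), len(test))  # 19855 4964
```

With 24,819 sentences, rounding the training share down reproduces the reported sizes of 19,855 and 4,964 sentences.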

B. Pretrained Transformer Models
The transformer model BERT [6] and four BERT variants pretrained in the biomedical domain, BioBERT [7], SciBERT [8], PubMedBERT [20], and BioMedRoBERTa [21], are used and compared for the detection of event triggers. These models differ from each other in the corpora on which they were pretrained (all in English), the type of pretraining and the size of the vocabulary. SciBERT and PubMedBERT were pretrained from scratch, meaning that they use a vocabulary built from their own pretraining corpus and include embeddings that are specific to in-domain words. BioBERT and BioMedRoBERTa were pretrained starting from the BERT checkpoints, which means that both their vocabularies and the initialization of their embeddings come from general-domain texts (as in BERT).

C. Lexical and Syntactic Features
The embeddings of stems and POS tags are also computed and added as extra features. Stems provide lexical information: they are words reduced to their roots, which need not be existing dictionary words. Stems are obtained by applying a set of rules that remove attached suffixes and prefixes (affixes) from terms without considering the POS or the context in which the word occurs [27]. POS tags provide syntactic information: they capture the categorical differences between words according to their function in meaning and grammar within the sentence. POS tagging consists of automatically assigning each word one of the POS categories according to its syntactic role [28]. For this work, the stems of the words are obtained using the 'Snowball Stemmer' module from NLTK 3.4.5, while the POS tags are obtained using spaCy 3.0.0 with 'en_core_web_sm', spaCy's small general-purpose English pipeline. The embeddings of the stems and POS tags are summed with the rest of the embeddings (token, position and segment) computed by the transformer models.
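
Since stems and POS tags are obtained per word while the transformer operates on sub-word tokens, the word-level features have to be aligned with the tokens before their embeddings can be summed. The sketch below shows one possible alignment, in which every sub-word inherits the feature of its word. This strategy is an assumption, as the text does not specify how the alignment is done, and the sub-word pieces shown are hypothetical rather than the output of an actual BERT tokenizer.

```python
from typing import List


def expand_to_subwords(word_features: List[str], subword_pieces: List[List[str]]) -> List[str]:
    """Repeat each word-level feature once per sub-word piece of that word."""
    if len(word_features) != len(subword_pieces):
        raise ValueError("one feature per word is required")
    expanded: List[str] = []
    for feature, pieces in zip(word_features, subword_pieces):
        expanded.extend([feature] * len(pieces))
    return expanded


# Hypothetical sub-word segmentation of three words.
pieces = [["ex", "##cre", "##tion"], ["is"], ["reduced"]]
pos_tags = ["NOUN", "AUX", "VERB"]

token_level_tags = expand_to_subwords(pos_tags, pieces)
print(token_level_tags)  # ['NOUN', 'NOUN', 'NOUN', 'AUX', 'VERB']
```

The same function applies unchanged to stems, since it only depends on how many pieces each word is split into.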

D. Parameter Settings
All the experiments are run with PyTorch using the Transformers library, and the models were taken from Hugging Face. The transformer models are trained using the original parameters from BERT: a dropout probability of 0.1 for the attention heads and hidden layers, a hidden size of 768, an initializer range of 0.02, 512 maximum position embeddings and an intermediate size of 3,072. The number of attention heads and the number of hidden layers were both 12. 'Adam' was used as the optimizer and 'gelu' as the activation function. The training parameters of the classification layers, both linear and Bi-LSTM, were set as follows: a batch size of 16 for the training and testing sets, a learning rate of 1e-05 and, since gradient clipping was applied, a maximum gradient norm of 10. The maximum sentence length was set to 256. All the models were trained for 100 epochs on the training set without early stopping, and evaluated by measuring precision (P), recall (R) and F1-score.
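
For reference, precision, recall and F1-score over IOB labels can be computed as in the sketch below. It assumes token-level, micro-averaged scoring that ignores the 'O' label; the text does not state which averaging or matching criterion was used, so this is only one plausible reading, and the label sequences are toy examples.

```python
from typing import Dict, List


def trigger_prf(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Micro-averaged token-level P, R and F1, ignoring the 'O' label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")  # predicted trigger tokens
    n_gold = sum(1 for g in gold if g != "O")  # annotated trigger tokens
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy sequences: two of the four gold trigger tokens are found, none is spurious.
gold = ["B-Loc", "O", "B-Reg", "I-Reg", "B-Loc"]
pred = ["B-Loc", "O", "B-Reg", "O", "O"]
scores = trigger_prf(gold, pred)
print(scores)
```

In this toy case precision is 1.0 and recall is 0.5, which shows how missed triggers lower recall without affecting precision.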

V. RESULTS AND DISCUSSION
The evaluation results of fine-tuning the models for event detection are shown in Table II. The approximate fine-tuning time of each model, in hours, is given in the last column of the table. The highest results obtained at epochs 10, 30 and 100 are shown in bold, and the highest overall results across all epochs are shown in bold and underlined.
First, we observe that SciBERT, which was pretrained from scratch using biomedical and general data, obtained the best results for each number of epochs and overall, in P, R and F1. It reached higher values when Bi-LSTM was used as the classifier, especially when no extra features were added or, in the case of training for 10 epochs, when the lexical feature was added. When training ran for more than 10 epochs, the performance of SciBERT+POS (syntactic feature) and SciBERT+stem (lexical feature) was very similar. When fine-tuning used a linear classifier, SciBERT+POS achieved the best results, with a difference of around 10 % relative to adding the lexical feature (SciBERT+stem).
PubMedBERT, a model pretrained from scratch using biomedical data, achieved the second-best performance, falling below SciBERT by 4 % when trained for 30 epochs with Bi-LSTM as the classifier and no extra features (the setting that gave the best overall result for SciBERT). When PubMedBERT used Bi-LSTM as the classifier, the results with the syntactic or lexical features were very similar to the results without them. These results were also similar to those obtained with a linear classifier and the extra features, whereas the linear classifier without features performed worse. BERT, which was trained from scratch using data from the general domain, scored around 5 % lower than PubMedBERT. The best results of BERT were obtained using a linear classifier without extra features, while the results of BERT+POS and BERT+stem were slightly lower and very similar to each other. The same behavior was observed when Bi-LSTM was used as the classifier. These three transformer models, SciBERT, PubMedBERT and BERT, are similar in that they were trained from scratch, used comparable text sizes for their pretraining and have similar vocabulary sizes.
The two models with the lowest performance are BioBERT and BioMedRoBERTa, both pretrained from the BERT weights, using biomedical data and a combination of biomedical and general data, respectively, and with the largest pretraining text sizes of all the models. BioBERT used the smallest vocabulary for its pretraining, while BioMedRoBERTa used the largest of all the models. For both models, adding the extra features produced no significant change, although using a Bi-LSTM classifier improved the results by around 7 % compared with a linear classifier. In general, across all the models, adding the syntactic and lexical features does not improve the performance for detecting biomedical events.
Fig. 3 shows the performance of fine-tuning SciBERT for 30 epochs using a Bi-LSTM classifier on each of the seven datasets separately. The F1-scores obtained using EPI, CG, ID, GE'13 and PC were similar to each other, with values between 0.70 and 0.80. With GE'11, the F1-score reached around 0.65, and with MLEE the model completely failed to detect triggers. Fig. 4 shows the effect of fine-tuning SciBERT for 30 epochs using a Bi-LSTM classifier without extra features while cumulatively adding the corpora one by one. The total number of classes after adding each corpus is shown below that corpus. Recall improved when CG and EPI were used together, and then decreased as the rest of the corpora were added. Precision was affected when EPI and GE'11 were added. Recall and precision behaved differently depending on the added corpus, although both values were comparable when GE'13 was added and, as might be expected from Fig. 3, both were negatively affected when MLEE was added. This behavior may arise because, when a new corpus is added for fine-tuning, some of its classes overlap with those of the other corpora while others do not, which probably leaves fewer samples in the new classes and therefore affects the balance of the data. In addition, the context of the different biomedical subdomains may also affect the performance, since BERT and its variants compute embeddings that take semantics into account.
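
The class totals reported when corpora are added cumulatively correspond to the size of the running union of their class sets. The sketch below illustrates this with made-up corpus names and class sets; they are placeholders and not the actual class inventories of the seven corpora.

```python
from typing import List, Set, Tuple


def cumulative_class_counts(corpora: List[Tuple[str, Set[str]]]) -> List[Tuple[str, int]]:
    """Number of distinct classes after adding each corpus in order."""
    seen: Set[str] = set()
    counts: List[Tuple[str, int]] = []
    for name, classes in corpora:
        seen |= classes  # overlapping classes are only counted once
        counts.append((name, len(seen)))
    return counts


# Placeholder corpora with partially overlapping trigger classes.
toy_corpora = [
    ("corpus_A", {"Binding", "Localization"}),
    ("corpus_B", {"Localization", "Phosphorylation"}),
    ("corpus_C", {"Binding"}),
]
counts = cumulative_class_counts(toy_corpora)
print(counts)  # [('corpus_A', 2), ('corpus_B', 3), ('corpus_C', 3)]
```

In the toy example, the third corpus adds samples to an existing class but no new class, whereas the second introduces a class that only it contains, which is the kind of situation that can unbalance the merged data.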

VI. CONCLUSIONS AND LIMITATIONS
In this work, we analyze BERT and four of its variants for biomedical event detection using corpora from different biomedical subdomains. By comparing the performance of the models with and without a lexical feature and a syntactic feature, we found that fine-tuning SciBERT for 30 epochs using a Bi-LSTM classifier is the best strategy to detect biomedical events, especially when the additional features are not included. Furthermore, the results show that fine-tuning the models for 10 to 30 epochs achieves most of the model learning, while training for more epochs yields only a slightly better result. One limitation of this work is the imbalance of the data. Since some classes overlap among the different corpora, the number of samples for those classes increases, while the classes unique to each corpus have fewer samples. This can negatively affect the behavior of the models across the different subdomains. Also, using external tools to obtain POS tags and stems can introduce errors that are learned by the models, which may be one of the reasons why the models perform better without the additional features.

TABLE II. RESULTS OF THE MODELS' FINE-TUNING FOR EVENT DETECTION