Query Specific Focused Summarization of Biomedical Journal Articles

During COVID-19, a large repository of relevant literature, termed "CORD-19", was released by the Allen Institute for AI. Because the repository is very large and growing rapidly, users struggle to retrieve only the information they need from the documents. In this paper, we present a framework for generating focused summaries of journal articles. The summary is generated using a novel optimization mechanism to ensure that it contains all essential scientific content found in the article. The parameters for summarization are drawn from the variables that are used for reporting scientific studies. We have evaluated our results on the CORD-19 dataset. The approach, however, is generic.


I. INTRODUCTION
WITH the rapid rise of scholarly articles in the biomedical domain, there has been a growing urgency to explore Natural Language Processing (NLP) techniques that can process vast volumes of content to generate intelligent insights, which can then be selectively explored by experts. This was demonstrated once again during the current COVID-19 pandemic, which has seen a steep rise in the number of related biomedical articles published. While this undoubtedly helped medical practitioners, virologists, immunologists, policy makers, public health planners, drug manufacturers and many others associated with healthcare services, it also highlighted the need for efficient mechanisms to enable intelligent navigation through this sea of content. The needs of end-users can be quite varied. For example, in the current scenario, while medical professionals need insights about drugs and procedures, a virologist would be interested in studying the nature of the virus and hence look for literature reporting the virus's transmission, incubation, susceptibility to external factors, etc. Public policy makers, on the other hand, need information to design effective policies and guidelines to keep the spread under control. Since time is at a premium for every user, a mechanism that enables the user to grasp the key aspects of an article, covering its objectives, methodology and findings or outcomes, if any, is an important ask from the NLP community.
In May 2020, the Allen Institute shared a large repository called "CORD-19" (https://allenai.org/data/cord-19), which contained bio-medical articles related to coronaviruses. Kaggle further announced a challenge, in which some of the key questions asked by the end-users were put up for the natural language processing community to find efficient methods to answer them. A two-way communication ensued on the platform between the end-users and the NLP researchers, wherein the focus was to understand the requirements clearly. The discussion led to clearer elicitation of information components from different categories of users. As it turned out, while the information components were different for different categories of users, all users wanted to view the relevant findings about the components in a contextual way that would make it easy for them to interpret the significance of the results. For example, epidemiologists specifically wanted to know the "incubation period" of the virus, in order to design policies for prevention and control. However, as different values for the incubation period were reported by scientists from different corners of the world, the epidemiologists wanted the result to be presented along with its context, which included the sample type, the sample size and, most importantly, the statistical outcomes of these results. The contextual presentation was clearly needed to help them decide whether to accept or reject the results. Similarly, a doctor may want to know about the drugs that were found to be effective, but also the details about patient condition and treatment course, to help in decision making. It has to be further remembered that a single article may contain information that could be of interest to multiple categories of users, though all of it may not be of interest to any one category. Though the requirements were first published on Kaggle, TREC subsequently also posted similar requirements for the CORD-19 collection.
For a large number of short queries, it posted additional narratives stating stricter requirements for a retrieved article to qualify as relevant. It was observed that the narratives were similar to the user requirements mentioned in the Kaggle platform.
Motivated by the above requirements, in this paper, we present a mechanism that can create a query-specific contextually focused summary of an article for the end-user. The rationale of the proposed mechanism comes from commonly followed reporting style for bio-medical articles, especially for reporting experimental studies and case studies. The target of our work is to generate a uniformly-structured summary that contains all relevant information for a specific end-user. Thus, two end-users, based on their requirements, may see two different summaries of the same document, though both the summaries will be structured in a similar fashion. Section II presents more details about the structure of an ideal summary. This is achieved in three stages.
• We first provide a query representation mechanism that can accommodate the user requirements in terms of 5 parameters that comprise key aspects of a scientific study: study type, sample size, sample type, measures/results, evidence of measure. The rationale for selecting these five parameters is explained in detail in section III.
• Next, an optimization-driven method is proposed to select a minimal set of sentences that can satisfy the requirements of a query. This is done by scoring the sentences based on their information content with respect to the above-mentioned parameters, with additional constraints imposed on their proximity. The proximity constraints have been designed based on commonly followed practices for reporting outcomes in bio-medical scientific publications. These sentences form a "snippet", which can provide the key outcomes at a glance. This is explained in section IV-C.
• Finally, a contextual summary creation method is proposed. The contextual summary is created by rearranging the set of sentences selected by the optimizer and augmenting them with additional content, if necessary, to create a cohesive and comprehensive summary. This is explained in section IV-D.
The proposed approach ensures that the necessary information components found in the documents are always contained in the summary. In the absence of any gold-standard dataset for evaluating the contextually focused summaries created by the proposed method, we have evaluated the summaries by comparing them with the abstracts provided along with the articles. We show that, for journals that insist on a structured summary from authors, the generated summaries are very similar to the author-provided summaries. However, such journals are very few: only 25% of the articles in the repository were found to have high-quality author-generated structured summaries. The focused summary generation method can thus be used to generate high-quality summaries for a larger collection of bio-medical articles. This, by itself, is a significant contribution to the domain of bio-medical literature analysis. The results and observations are discussed in detail in section V.
It may be noted that the proposed mechanism is not an alternative to online document search systems, which pull documents from an indexed collection in response to a query. Rather, our work is intended to augment the search results by generating a query-specific summary for all articles retrieved by the search engine in response to a query. Subsequently, documents are re-ranked based on the quality of the summary. The contextual summary can be shown as a snippet to the end-user for faster comprehension.
A summary of related work in the allied area is presented in section VI.

II. STRUCTURE OF AN IDEAL SUMMARY OF
A well-structured summary is expected to contain all required information in a compact, cohesive and comprehensible fashion. Though scientific documents usually contain abstracts that present a short and concise summary of the document, our analysis of the CORD-19 collection revealed that abstracts vary widely in size and nature, depending on the journal in which the article is published. We observed that bio-medical documents contain two types of abstracts, i.e. 'Structured abstract' and 'Unstructured abstract'. Structured abstracts usually present a well-defined and detailed summary of the document. Figure 1 shows an example of a structured abstract [1], where the Background, Method, Results and Conclusion of the experiment are separately presented in the abstract itself. Unstructured abstracts, as shown in Figure 2 [2], on the other hand, are generally short and may not convey all the important elements included in the introduction, method, or findings sections. Both these abstracts were created by the respective authors, who selected which information goes into the abstract and which does not. In the absence of a strict requirement, the author-created abstract may or may not contain the information that is required by a user, even though it may be contained in the article.
The proposed work intends to cover this gap by providing a mechanism to create focused, well-structured summaries on the fly, which will contain the user-required information if it is present in the document. These summaries should be similar in form to the structured abstract shown in Figure 1. In order to do that, we exploit the inherent structure that is observed in published articles. Bio-medical articles usually follow a specific format for reporting their findings. The findings are usually reported along with additional details about (a) the type of the study, or the way the experiment or study was conducted, (b) the subject of the experiment, i.e. the sample types, categorization of the samples, sample size, etc., (c) the results of experiments or observations, (d) the evidence of measure for different sample categories, and (e) the significance of the results. There is also a discipline that is maintained while reporting these items. For example, the significance of a result is explained along with the evidence of measure.
In the next section, we first present a few sample queries published on the Kaggle site along with the requirements of each. Subsequently, we discuss how these requirements can be mapped to the scientific parameters and converted to a slot-value format, which is used later to construct optimization constraints. The optimizer then uses these constraints to select an optimal set of sentences that can satisfy the user requirements. Table I shows four types of questions, posted under different task categories in the CORD-19 challenge by various groups of users. Each question is accompanied by a narrative that specifies what kind of information is required from the documents to answer the queries. These four questions represent four broad categories, which cover most of the user queries posed to the collection. We now present a mapping of these queries to the parameter requirements mentioned earlier. The mapping is done to five different slots that can be associated with specific types of values.

CONSTRUCTION
1) Study Type: describes a broad category for the type of work reported in the document. It could be a systematic review, a case study or case series, a simulation study or an experiment. This covers almost all kinds of documents, but more may be added.
2) Sample Size: defines the size of the study population, the samples studied or the papers reviewed to compute the result. For example, 50 patients, 120 case reports, etc.
3) Sample Type: describes the sub-sample of the population addressed or the type of samples that were considered for the study. For example, the population addressed can be pregnant women, children, elderly, smokers, etc.
4) Measures/Results: These are the quantitative outcomes or findings presented in a document after analysis of the data. They can be statistical findings like odds ratio, hazard risk, etc. on potential risks, or other outcomes like drug effectiveness, prevalence, etc.
5) Evidence of Measure: These are additional qualifiers or filters that are applied on the measures/results to quantify the level of evidence. Evidence can be expressed in terms of sets of sub-samples generated from the population. For example, the risk posed by COVID-19 to smokers can vary depending on their age and other comorbidities present. The impact of a policy or guideline depends on the country in which it is implemented. Thus, these elements can be used to present the evidence of measure for various queries.
Table II presents a few sample user queries from the Kaggle site, along with their mapping to the question types presented in Table I, further slotted according to the type of information required. The slot-value requirements for each question type are derived from the narratives. This is further validated using the target requirements mentioned for these queries on the Kaggle site.
Slot items are associated with factor-specific constraints that are designed to ensure that only meaningful information components are picked up. For example, odds ratio is usually specified in a paper as "OR <INTEGER>, 95% CI <RANGE>", incubation period is presented as "number of days", country names can only be from a set of known entities, drug names can be recognized using Biological entity taggers. Each slot is also associated to an encapsulated information extraction procedure which hunts for feasible values for that slot. Table II also gives some examples of accepted study design types for the bio-medical domain. A list of such constraints has been curated from available literature and data on the challenge sites. This list can be extended.
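The factor-specific constraints described above can be illustrated with a small pattern-based sketch. The two regular expressions below are hypothetical stand-ins written for this example (they are not the curated constraint list referred to in the text), and the helper names `extract_odds_ratio` and `extract_incubation_days` are assumptions introduced only for illustration.

```python
import re

# Hypothetical patterns, written for illustration only; the curated
# constraint list used by the authors is not reproduced here.
NUMBER = r"\d+(?:\.\d+)?"
ODDS_RATIO = re.compile(
    rf"\bOR\s*=?\s*(?P<value>{NUMBER})"
    rf"(?:\s*[,;(]\s*95%\s*CI:?\s*(?P<low>{NUMBER})"
    rf"\s*(?:-|\u2013|to)\s*(?P<high>{NUMBER}))?"
)
INCUBATION_DAYS = re.compile(
    rf"incubation period\D{{0,40}}?(?P<days>{NUMBER})\s*days",
    re.IGNORECASE,
)


def extract_odds_ratio(sentence):
    """Return (value, ci_low, ci_high) for the first odds ratio, or None."""
    match = ODDS_RATIO.search(sentence)
    if match is None:
        return None
    low, high = match.group("low"), match.group("high")
    return (
        float(match.group("value")),
        float(low) if low else None,
        float(high) if high else None,
    )


def extract_incubation_days(sentence):
    """Return the incubation period in days, or None if it is not stated."""
    match = INCUBATION_DAYS.search(sentence)
    return float(match.group("days")) if match else None
```

A pattern of this kind accepts a value only when it appears in the expected reporting form, which is the role the slot constraints play; values that merely match a keyword are rejected.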
In order to ensure the coverage of queries using these categories of questions and slot types, we have additionally considered the queries presented by the TREC challenge makers to be addressed from the CORD-19 collection. We were able to map approximately 67% queries to these 4 broad categories mentioned in Table I.

Table I (excerpt)
What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control? Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery. Prevalence of asymptomatic shedding and transmission. Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood). Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).

Treatment/Diagnostics Efficacy
What do we know about vaccines and therapeutics?
Effectiveness of drugs being developed and tried to treat COVID-19 patients. Clinical and bench trials to investigate less common viral inhibitors against COVID-19. Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents. Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions.

A. Ensuring consistency of information
After identifying the required slots, they are further bucketed together to ensure meaningful information extraction. The buckets represent groups of slot items that are inter-dependent on each other with respect to the given query. The inter-dependence of these items is either expressed as a linguistic constraint or a proximity constraint. These constraints are also parsed from the narrative. For example, for a query "what is the range of incubation period for different age groups?" the slot value pairs are filled up as <Measure/Results, incubation period>, <Evidence of measure, age group> and <Sample size, #patients>. This in turn implies that the statement "The average incubation period was 4 days," found in a document wouldn't be complete. It needs additional information for the result to be accepted. A sentence found in close proximity to the above one was "We considered 157 confirmed cases, aged 44-60 years, 74 female (47.1%) and 38 imported cases (24.2%)." A complete snippet would have to contain both the sentences. By adding Measures/Results along with the Evidence of Measure in a single bucket, we can generate a more comprehensive and coherent snippet for the user query. Additionally, information about whether it was a simulation experiment or a systematic review, i.e. the study type of the document is also presented in the snippet. This is independent of the final result being reported and is therefore added in a separate bucket. Thus, there can be two buckets in the ar-  The buckets can be interpreted as context provider for the information components to ensure that randomly occurring strings or values of a certain type are not accepted just because of a keyword match.
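The slot-value representation with buckets can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class name `Query`, the helper `missing_slots`, the placement of Sample Size in the same bucket as Measures/Results, and the placeholder value "any" for Study Type are all choices made for the example, not details given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """A user query expressed as slot-value requirements grouped in buckets."""
    text: str
    # Each bucket maps a slot name to the kind of value required for it.
    buckets: list = field(default_factory=list)

    def required_slots(self):
        return [slot for bucket in self.buckets for slot in bucket]


def missing_slots(query, snippet_slots):
    """Slots required by the query that a candidate snippet does not fill."""
    return [s for s in query.required_slots() if s not in snippet_slots]


# The incubation-period example from the text. Bucket membership of
# Sample Size and the "any" placeholder are assumptions of this sketch.
incubation_query = Query(
    text="what is the range of incubation period for different age groups?",
    buckets=[
        {
            "Measures/Results": "incubation period",
            "Evidence of Measure": "age group",
            "Sample Size": "#patients",
        },
        {"Study Type": "any"},
    ],
)

# "The average incubation period was 4 days," fills only one slot.
first_sentence = {"Measures/Results"}
# Adding the sentence about the 157 confirmed cases fills two more.
both_sentences = first_sentence | {"Evidence of Measure", "Sample Size"}
```

Checking `missing_slots(incubation_query, first_sentence)` shows why the first sentence alone is incomplete, while the pair of neighbouring sentences leaves only the independently bucketed Study Type to be found elsewhere in the document.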

IV. QUERY FOCUSED SUMMARY GENERATION
The task of generating query-specific focused summaries is carried out in a phased manner. Initially, the slot-value pairs are searched for within the document collection. This is done by locating the entities within each document, which is subjected to a pre-processing phase for the purpose. After that, a minimal set of sentences that satisfies the slot requirements is selected using integer linear programming, with additional constraints imposed on the proximity of slots within a single bucket. A subsequent phase of enhanced document summarization is carried out to present the information in a coherent and comprehensible form.

A. Document Pre-processing
Like all document processing tasks, the search for information is preceded by a one-time activity that comprises document pre-processing and information extraction. Each document in the set is passed through a pre-processing pipeline that cleans it and tokenizes it into sentences using SciSpaCy [3]. Each sentence is then indexed according to its unique document id and the label of the section to which it belongs. Each document is then subjected to the following processes:
• Entity extraction: Entities are extracted using a BERT-based sequence labelling approach described in [4]. Additionally, biomedical entities like DNA, Cell Type, Protein, Chemical, Organ names, Drug, etc. are also extracted using SciSpaCy.
• Sentence embedding generation: Sentence embeddings are also generated using Facebook's Infersent pre-trained encoder [6] to create a 300-dimensional vector for a sentence. It uses Bidirectional LSTM with max pooling to capture the context and generic information available for a variety of tasks. These embeddings capture the semantics of a sentence better by embedding the context in the encoding.

B. Mechanism for sentence scoring
In this section, we present how the specific information components required for a query are located within the documents and scored to generate a snippet. First, the sentences are checked for the presence of any of the required slot values. Slot-specific search methods are deployed for this purpose. The extraction methods commonly used for the different slots are as follows:
1) Measures/Results: As observed from the summary tables provided by the CORD-19 challenge makers, values fitting this slot (like OR, p-value, HR, etc.) follow a set pattern, which can be expressed using a regular expression such as "<MeasureName>= <INTEGER>(, 95% CI <RANGE>)?". At first, we used a regular expression matching algorithm to extract instances of this type. But the pattern-matching approach resulted in noisy extractions and also missed certain instances that varied slightly from this pattern. Therefore, we moved on to a BiLSTM-CRF sequence tagger [7] to identify the measures/results in sentences, which showed an accuracy of 97%. We used the results of the pattern-matching approach, along with certain hand-tagged instances that it had not detected, to create the annotated training data for the sequence tagger, and excluded the noisy pattern-matching extractions from the training data. Since the task is to identify a set of literals/tokens following a pattern, we did not use any sequence tagging algorithm requiring semantic context.
2) Study Type: These are pre-specified strings and keywords found in text. A comprehensive design dictionary curated by a team of epidemiologists has been provided to help the CORD-19 research community with effective retrieval.

3) Sample Size: This is extracted by tagging 'Participant Sample Size' instances in text using the biomedical entity extractor described in the previous section.
4) Sample Type: Values are extracted using the biomedical entity extraction module. For any given query, findings like patient condition, patients undergoing any surgical intervention, patients having any drug administered, etc. can be selected for this slot depending on the requirement. For example, for the query 'risk to cancer patients due to COVID-19', <Patient Condition, 'Cancer'> is added to the slot. For 'effectiveness of hydroxychloroquine in treatment of COVID-19 patients', the slot-value pair <Drug, 'Hydroxychloroquine'> is added.
5) Evidence of Measure: Values are extracted using the biomedical and named entity extraction modules explained in the previous section. Extractions like patient age, gender, country, etc. are included in this slot.
Any sentence that contains at least one value is retained for scoring, while the remaining ones are assigned a score of 0. The final score assigned to a sentence depends on three factors, which are explained below.
Confidence score from sentence type - The section headers of the document are also taken into account while scoring sentences. Thus, sentences from a "review" section score less than those coming from other sections of the document, since the latter are considered to be fundamental contributions of the document under consideration. Since section headers are not always unambiguous, special checks are put in place to look for reference and citation patterns as well as linguistic constructs to identify such sentences. For computing the confidence value, sentences from "review" sections are penalized by a value of ρ, and the findings fundamental to the document are rewarded with ρ, such that 0 < ρ < 1.
Intra-bucket score - Sentences containing values for certain slots also gain from being in proximity to other sentences containing values in the same bucket. As a corollary, between two sentences that contain values for the same slot, the one that contains additional values for other slots belonging to the same bucket will score higher. This is referred to as the intra-bucket score of a sentence.
Inter-bucket score - Sentences also gain some reward from being in proximity to other sentences that contain values for slots from other buckets. The inter-bucket proximity ensures that the overall context of all the findings remains consistent.
We now present the scoring equations. Proximity between two sentences $S_i$ and $S_j$ is computed as an inverse function of the distance between the sentences in the document, and also takes into account their corresponding section headers. Here, $\mathrm{distance}(S_i, S_j) = |\mathrm{position}(S_i) - \mathrm{position}(S_j)|$, where $\mathrm{position}(S_i)$ indicates the original sentence number of $S_i$, and $\mathrm{Section\_reward}(i, j) = 1$ if the two sentences have the same section header, and 0 otherwise.
Let $V = \{v_1, v_2, v_3, \ldots, v_m\}$ be the set of values required by the query. The scores for a sentence $S_i$ containing a value $v_k$ are defined over the other required values. The Intra_Bucket score considers all $v_k, v_p \in V$ such that $\mathrm{bucket}(v_p) = \mathrm{bucket}(v_k)$, where $S_j$ is the closest sentence (possibly $S_i$ itself) that contains a value for a slot $v_p$ belonging to the same bucket.
The Inter_Bucket score is defined analogously, where $S_j$ is the closest sentence (possibly $S_i$ itself) that contains a value for a slot $v_p$ belonging to a different bucket.
$\mathrm{Score}(S_i)$ is then computed from these components using a weight $\alpha$. We take $\alpha > 0.5$ to give more weight to the Intra_Bucket scores than to the Inter_Bucket scores. The sentence score is then normalized such that $\mathrm{Score}(S_i) \in [0, 1]$.
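The scoring scheme can be sketched in code, with the caveat that the exact functional forms are assumptions of this sketch: proximity is taken as 1/(1 + distance) plus the section reward, the intra- and inter-bucket scores are sums of proximities to the closest sentence holding each other required slot, the confidence factor multiplies the result by (1 - ρ) or (1 + ρ), and the final value is normalized by the maximum. The names `Sentence`, `proximity`, `closest_proximity` and `score_sentences`, and the default values of `alpha` and `rho`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    position: int      # original sentence number in the document
    section: str       # section header the sentence belongs to
    slots: frozenset   # slots for which the sentence contains a value
    is_review: bool = False


def proximity(a, b):
    # Assumed form: inverse of (1 + distance), plus the section reward.
    distance = abs(a.position - b.position)
    section_reward = 1.0 if a.section == b.section else 0.0
    return 1.0 / (1.0 + distance) + section_reward


def closest_proximity(sentence, sentences, slot):
    """Proximity to the closest sentence (possibly itself) holding `slot`."""
    holders = [s for s in sentences if slot in s.slots]
    if not holders:
        return 0.0
    nearest = min(holders, key=lambda s: abs(s.position - sentence.position))
    return proximity(sentence, nearest)


def score_sentences(sentences, buckets, alpha=0.7, rho=0.2):
    """Return scores in [0, 1]; sentences without any slot value get 0."""
    all_slots = {slot for bucket in buckets for slot in bucket}
    raw = []
    for s in sentences:
        if not s.slots & all_slots:
            raw.append(0.0)
            continue
        # Slots of every bucket in which this sentence holds a value.
        own = {slot for b in buckets if s.slots & b for slot in b}
        intra = sum(closest_proximity(s, sentences, v) for v in own)
        inter = sum(closest_proximity(s, sentences, v) for v in all_slots - own)
        confidence = 1.0 - rho if s.is_review else 1.0 + rho
        raw.append(confidence * (alpha * intra + (1.0 - alpha) * inter))
    top = max(raw, default=0.0)
    return [r / top if top > 0 else 0.0 for r in raw]
```

With `alpha` above 0.5, a results sentence that sits next to the sentence reporting its sample size and age group outranks an isolated mention of the same measure in a review-like passage, which is the behaviour the three factors are meant to produce.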

C. Optimal snippet generation
Our goal is now to use the above scores to identify the minimal set of sentences that can form a snippet.
Let us suppose that query Q has m slot values divided into different buckets. Let $S = \{S_1, S_2, \ldots, S_n\}$ be the set of sentences that have a non-zero score after scoring. The following optimization algorithm finds the minimal set of sentences that contain all the m values, if present.
Let $VS(i, j) = 1$ if value $v_j$ is found in $S_i$, and 0 otherwise. Let $x(i) = 1$ if $S_i$ is selected in the optimal snippet, and 0 otherwise. The optimization problem is then expressed as an objective function subject to constraints. The value (-1) is added to ensure that a minimum number of sentences are finally selected. The constraint in equation 6 ensures that at least one sentence is picked to cover each value, provided that value is reported by the document D. Finally, equations 7 and 8 enforce that at least one sentence is selected from the document and that the number of sentences selected is no more than the number of value types required to address the user query. This is solved using integer linear programming.
Figure 3 shows the snippets generated using the above optimization approach for two documents [8,9], along with the slot values for the queries 'Risk to Diabetes Patient' and 'Incubation period with respect to age'. It can be seen from these examples that the individual sentences by themselves are not enough. Reporting 'Fatality rate was 11.1%' does not convey the confidence of the finding. By additionally reporting <Patient Condition, 'Diabetes'>, <Sample Size, '258 Patients'> and <Study Type, 'Retrospective'>, a much better picture can be presented. The second example also highlights how the proximity constraint helps provide maximum information in a minimum number of sentences, making the snippet much more comprehensible.
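The selection problem can be sketched as a small 0/1 program. Two things in this sketch are assumptions: the objective is taken to be the sum of x(i) · (Score(S_i) − 1), which is one reading of "the value (-1) is added", and an exhaustive search over subsets is used in place of an integer linear programming solver so that the example stays self-contained. The function name `select_snippet` is hypothetical, and the search is practical only for the small candidate sets left after scoring.

```python
from itertools import combinations


def select_snippet(scores, value_matrix):
    """Pick the sentence subset maximizing sum(x_i * (score_i - 1)).

    scores[i] is the normalized score of sentence i (in [0, 1]).
    value_matrix[i][j] is 1 if value j is found in sentence i, else 0.
    """
    n = len(scores)
    m = len(value_matrix[0]) if n else 0
    # Only values reported somewhere in the document must be covered.
    reported = [j for j in range(m) if any(value_matrix[i][j] for i in range(n))]
    best, best_objective = (), float("-inf")
    # At least one sentence, and no more sentences than required values.
    for size in range(1, min(n, m) + 1):
        for subset in combinations(range(n), size):
            covered = all(
                any(value_matrix[i][j] for i in subset) for j in reported
            )
            if not covered:
                continue
            # Each term is <= 0, so extra sentences can only lower the total.
            objective = sum(scores[i] - 1.0 for i in subset)
            if objective > best_objective:
                best, best_objective = subset, objective
    return list(best)
```

Because every selected sentence contributes a non-positive term, the coverage constraint is what forces sentences in, and among covering sets the one made of the fewest, highest-scoring sentences wins.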

D. Contextual focused summary generation
In this section, we present an enhanced summarization approach that generates fixed-length extractive summaries for documents by checking for sentence representativeness along with the scores from the previous section. For each candidate sentence to be included in the summary, its 300-dimensional vector embedding is created using Infersent. The sentence score $Sc(S_i^j)$ for the $i$-th sentence in the $j$-th document is built from four components. $Sc_{Rank}(S_i^j)$ is the representativeness score assigned using the TextRank algorithm, by checking the sentence's similarity with all other sentences using the corresponding Infersent vectors. $Sc_{Title}(S_i^j)$ is computed using the cosine similarity between the title and sentence vectors. The position score $Sc_{Position}(S_i^j)$ proves to be very effective in document summarization, as it is a good indicator of significant sentences; it is computed from $Len_j$, the length of the $j$-th document, and $Pos_i$, the position of the $i$-th sentence in the document. $Sc_{Domain}(S_i^j)$ denotes the score computed in the earlier section based on the slot requirements. All these scores are normalized and added to give the final sentence score.
In order to remove redundancy, we use an algorithm similar to the MMR algorithm [10], which focuses on ensuring diversity in the sentences being selected. The sentences are sorted in decreasing order of their scores $Sc(S_i^j)$, and the highest-scored sentence is selected first for inclusion in the final summary. The next sentences are selected based on the following conditions:
• A sentence is added to the final summary iff its cosine similarity with the selected set of sentences is below a threshold β.
• A sentence whose similarity with a selected sentence is greater than the threshold β is discarded if the two belong to the same section in the document.
This process is repeated for all the remaining sentences, until the selected sentence count reaches a maximum count τ.
To ensure that the summaries are connected and coherent, the selected sentences are re-ordered according to their position in the document. Preserving document order guarantees that the summary has sentences from the aim and introduction presented first, followed by the methodology and finally, the results and conclusions.
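The redundancy filter and the final re-ordering can be sketched as follows. This is a simplified reading, not the authors' implementation: the sketch keeps a candidate only when its cosine similarity to every already selected sentence is below β, and it folds the same-section condition into that single test. The function names and the default values of `beta` and `tau` are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_summary(sentences, beta=0.8, tau=3):
    """Select up to `tau` diverse sentences and return them in document order.

    Each sentence is a dict with keys 'position', 'score' and 'vector'.
    """
    ranked = sorted(sentences, key=lambda s: s["score"], reverse=True)
    selected = []
    for candidate in ranked:
        if len(selected) >= tau:
            break
        # Keep the candidate only if it is dissimilar to everything chosen.
        if all(cosine(candidate["vector"], s["vector"]) < beta for s in selected):
            selected.append(candidate)
    # Restore the original document order for coherence.
    return sorted(selected, key=lambda s: s["position"])
```

Sorting by position at the end is what places sentences on aims and methodology ahead of results, regardless of the order in which the scores caused them to be picked.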

A. Dataset description
The Covid-19 Open Research Dataset (CORD-19) is a collection of scientific papers on Covid-19, SARS-CoV-2, and related historical coronaviruses. The dataset contains a primary metadata file containing unique paper id, author, journal, publication date, abstract etc. and link to full-text file name. Full texts are available for some files in json format.

B. Snippet evaluation and observations
We have evaluated the snippet generation system on recently published articles from the CORD-19 dataset. Due to the lack of gold-standard data, the evaluation was done manually for 10 queries across 4 categories (Table I), on 500 documents. We consider Study Type, Sample Size, Evidence of Measure, Sample Type and Measures/Results as the required slots and compare the findings with the values reported in the abstracts, while also measuring the overall correctness with respect to the document. The manual inspection of generated snippets with respect to the documents showed that study type and sample size were retrieved correctly in 70.2% and 67.4% of the cases, respectively. Of these, 82.24% and 66.52%, respectively, matched the study type and sample size reported in the abstract. For measures/results (i.e. the quantitative findings), we evaluate an extraction as correct if it is reported in association with the user query/keyword. We observed that, of the 73.6% correctly extracted values, only 26.3% of the snippet values matched the findings in the abstract. In 47.5% of the cases, the abstracts either did not report any statistical findings or reported findings that were not relevant to the query. This could be because the main theme of the document was different from that of the query. This further emphasizes the need for generating snippets and summaries from documents that answer the user queries.

C. Evaluating summaries-results and observations
We use Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [11] scores for evaluating the summaries. It determines the quality of a summary automatically, by comparing it to human (ideal) generated summaries (we use the abstracts as model summaries here). ROUGE-N (unigram and bigram match) and ROUGE-L (Longest Common Subsequence match) scores were chosen for our experiments.
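For readers unfamiliar with the metrics, the following sketch computes recall-style ROUGE-N and ROUGE-L for a single candidate and reference. It is a simplified illustration and not the official ROUGE toolkit: tokenization is plain lower-cased whitespace splitting, only the recall variant is shown, and the function names are assumptions of this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0
```

In the setting described here, `reference` would be the article's abstract and `candidate` the generated summary, so a high recall means the summary covers what the authors chose to report.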
The generated summaries were grouped based on the type of abstract (structured or unstructured) in the document. We observed that only 145 documents (25.3%), out of the 573 scientific documents summarized, had structured abstracts; the remaining documents either had no abstract or had an unstructured one.
We have generated two different types of summaries using the TextRank algorithm, as shown in Table IV. In the baseline approach, we have generated generic summaries using the $Sc_{Rank}(S_i^j)$, $Sc_{Title}(S_i^j)$ and $Sc_{Position}(S_i^j)$ scores. In our final approach (i.e. the contextual focused summary), we have incorporated user requirements by also using $Sc_{Domain}(S_i^j)$ for scoring sentences.
In order to determine the performance, results are also compared with some existing text summarization algorithms, such as LSA [12] and TextRank [13]. It can be seen from Table IV that our system performs better than these summarization algorithms. There is a 6.9% increase in ROUGE-L scores after including the $Sc_{Domain}(S_i^j)$ score in the case of structured abstracts. High ROUGE scores with structured abstracts indicate that the summaries generated by our method have been able to cover the important information and findings well. Unstructured abstracts, on the other hand, seldom include results or a description of the methodology. By including slots like Study Design, Sample Size and Statistical Measure/Results, the summaries generated by our approach become more informative and can present facts and details that are mostly not covered in the abstracts. Figure 4 shows the relation between the number of words in the abstract and the ROUGE scores for documents with unstructured abstracts. Since longer abstracts are expected to be more detailed and informative, the ROUGE scores increase with the word count. The evaluation with low word count abstracts provides a reverse indicator for measuring the quality of the summaries, as a lower overlap means that the summary has captured a lot of additional information that was missing in the abstract.

way sentences that discuss important topics are chosen as candidates for summaries. One of the most successful text summarization systems, called TextRank [13], was introduced in 2004. TextRank uses a graph-based algorithm similar to PageRank [14], in which similarity between two sentences is computed in terms of their content overlap. Later, [15] enhanced TextRank and proposed the use of longest common substrings based cosine distance between pairs of sentences. BM25 [16] can also be used as a ranking function to retrieve the candidate sentences for the summary.
A single-document summarization approach that maximizes concept coverage using Integer Linear Programming (ILP) was proposed in [17]. The authors also presented a weighting method that incorporates position to emphasize important concepts. The information available to clinicians and clinical researchers is growing exponentially, both in the biomedical literature and in patients' health records. Strategies are needed to cope with this information overload, as the biomedical literature provides clinicians and clinical researchers with a valuable source of knowledge to assess the latest advances, develop and validate new hypotheses, conduct experiments, and interpret their results [18,19].
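For reference, concept-coverage summarization of the kind mentioned above is commonly written as the following integer linear program. This is a generic sketch and not necessarily the exact formulation of [17]; all symbols are assumptions introduced here: $c_i$ indicates whether concept $i$ is covered, $w_i$ is its weight, $s_j$ indicates whether sentence $j$ is selected, $l_j$ is the length of sentence $j$, $o_{ij}$ is 1 if concept $i$ occurs in sentence $j$, and $L$ is the summary length budget.

```latex
\[
\begin{aligned}
\max_{c,\,s} \quad & \sum_i w_i \, c_i \\
\text{s.t.} \quad  & \sum_j l_j \, s_j \le L, \\
                   & s_j \, o_{ij} \le c_i \quad \forall i, j, \\
                   & \sum_j s_j \, o_{ij} \ge c_i \quad \forall i, \\
                   & c_i \in \{0, 1\}, \quad s_j \in \{0, 1\}.
\end{aligned}
\]
```

The second constraint marks a concept as covered whenever a sentence containing it is selected, and the third prevents a concept from being counted unless at least one selected sentence contains it.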
Several approaches have been proposed for summarization in the biomedical domain. The applications mainly include summarizing treatments [20], drug information [21], clinical reports [22], and electronic health records [23]. One such work is presented in [24], a graph-based summarizer that uses the Unified Medical Language System (UMLS) to identify concepts and the semantic relations between them, and constructs a semantic graph that represents the document. A degree-based clustering algorithm is then used to identify different themes or topics within the text. The authors of [25] proposed a Clustering and Itemset mining based Biomedical Summarizer (CIBS) that also utilizes UMLS to map text to concepts and then passes them to an itemset mining algorithm for topic extraction. Sentences are clustered, and related sentences from within these clusters are selected to produce a summary.
Text summarization approaches that focus on answering user queries are of particular interest, as they can help medical practitioners identify salient and relevant information. The work in [26] presented one such approach, which utilizes publicly available labeled data and pre-trained medical domain word embeddings, along with a set of simple features, to generate query-focused extractive summaries.
Query-based text summarization based on common-sense knowledge and word sense disambiguation was proposed in [27]. The technique computes a semantic relatedness score between the query and the input text document in order to extract relevant sentences. It determines the correct sense of each word with respect to the context of its sentence and hence provides query-relevant summaries.

VII. CONCLUSION
In this paper, we have presented a summarization mechanism that can create a query-specific, contextually focused summary of an article for the end-user. First, a query representation mechanism is defined that can accommodate the user requirements in terms of a fixed number of parameters that comprise the key aspects of a scientific study. Next, an optimization-driven mechanism is used to retrieve a minimal number of sentences relevant to an elaborate scientific query. These sentences form a snippet that provides the key outcomes at a glance. Finally, a contextual summary is created by rearranging the set of sentences selected by the optimizer and augmenting them with additional content. The aim of the current work is to generate a uniformly structured summary that contains all relevant information for a specific end-user, so that the summaries are customized to the needs of the user. The results have been evaluated using ROUGE scores. The summaries generated by the proposed method have high ROUGE scores against the author-written summary, whenever one is present. For the remaining documents, the generated summary is a useful addition. From an application point of view, we believe that our snippet generation and summarization approach can be easily applied to other data sets by updating the slot requirements.
In the future, we would like to further explore document structures, sentence type classification, and abstractive summarization approaches to reduce the information overload even further. We also intend to extend the methods to work for any scientific document collection, beyond biomedical literature. We are also evaluating the approach on a larger set of queries with sufficient variation in their structures, and we plan to design automated evaluation mechanisms, since manual feedback is difficult to obtain.