A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-

Marina Santini; Arne Jönsson; Mikael Nyström; Marjan Alirezaie

DOI: http://dx.doi.org/10.15439/2017F531

Citation: Position Papers of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 12, pages 71–78 (2017)

Abstract. In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support our claims, we describe the design, construction and limitations of a highly specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents about chronic diseases, written in Swedish. The sublanguage used in each document has been labelled as "lay" or "specialized" by a lay annotator. The corpus is designed as a flexible text resource, to which additional medical documents will be appended over time. The experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when the size of the datasets to be learned is increased from 156 to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the many disturbing factors in the corpus, such as machine-translated documents and low-quality texts.
