Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 15

Proceedings of the 2018 Federated Conference on Computer Science and Information Systems

A New Subject-based Document Retrieval from Digital Libraries Using Vector Space Model


DOI: http://dx.doi.org/10.15439/2018F260

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 161–164


Abstract. Document retrieval from digital libraries based on a user's query is strongly affected by the terms that appear in the query. In many cases, a digital library contains documents that do not share exactly the same terms with the query but are still relevant to the user's need. We address this challenge by introducing a new subject-based retrieval approach in which, in addition to ranking documents by the terms in the query, a new subject-based score is defined between the query and each document. We define this score through a new vector space model in which a vectorized subject-based representation is defined for each document and its keywords, as well as for the terms in the query. We tested the new subject-based scoring scheme on a database of scientific papers obtained from Web of Science. Our experimental results show that in 83% of cases users prefer the proposed scoring scheme over the classic one.
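
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not give the formulas, so the term-to-subject weights, the two-subject space, the cosine similarity, and the linear mixing parameter `alpha` below are all assumptions made for illustration. The sketch only shows how a subject-based score can separate two documents that share no terms with the query.

```python
import math
from collections import Counter

# Hypothetical term -> subject weights over two invented subjects:
# [information retrieval, biology]. All numbers are illustrative assumptions.
TERM_SUBJECTS = {
    "search": [0.9, 0.1],
    "ranking": [0.8, 0.2],
    "query": [0.9, 0.1],
    "index": [0.7, 0.3],
    "genome": [0.1, 0.9],
    "protein": [0.0, 1.0],
}
N_SUBJECTS = 2


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_vector(tokens, vocab):
    """Raw term-frequency vector over a shared vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def subject_vector(tokens):
    """Sum the (assumed) subject weights of all known tokens."""
    total = [0.0] * N_SUBJECTS
    for t in tokens:
        for i, w in enumerate(TERM_SUBJECTS.get(t, [0.0] * N_SUBJECTS)):
            total[i] += w
    return total


def combined_score(query, doc, alpha=0.5):
    """Mix a term-based and a subject-based cosine score (assumed combination)."""
    vocab = sorted(set(query) | set(doc))
    term_score = cosine(term_vector(query, vocab), term_vector(doc, vocab))
    subject_score = cosine(subject_vector(query), subject_vector(doc))
    return alpha * term_score + (1 - alpha) * subject_score


query = ["search", "ranking"]
doc_related = ["query", "index"]        # no shared terms, same subject
doc_unrelated = ["genome", "protein"]   # no shared terms, different subject

print(combined_score(query, doc_related))
print(combined_score(query, doc_unrelated))
```

Both documents have a term-based score of zero here, so a purely term-based ranker could not order them. Under the assumed subject weights, the subject-based component ranks `doc_related` above `doc_unrelated`.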

