Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 10

Proceedings of the Second International Conference on Research in Intelligent and Computing in Engineering

Metadata based Text Mining for Generation of Side Information


DOI: http://dx.doi.org/10.15439/2017R86

Citation: Proceedings of the Second International Conference on Research in Intelligent and Computing in Engineering, Vijender Kumar Solanki, Vijay Bhasker Semwal, Rubén González Crespo, Vishwanath Bijalwan (eds). ACSIS, Vol. 10, pages 135–141


Abstract. Text mining is a knowledge-analysis technique for finding patterns in text. In most metadata-based text mining applications, side information is also referred to as metadata. Side information comprises large volumes of data such as weblogs, metadata, and non-textual data (e.g., images and video). This data is available only in unprocessed form, which cannot be used directly for further text mining. Metadata-based text mining algorithms are therefore used to extract the useful information. The approach proposed in this paper applies several pre-processing steps: splitting, tokenization, stemming, parsing, and chunking. Natural language processing (NLP) is used to generate the side information, i.e., title, name, affiliation, email address, place, etc. To achieve effective clustering, the proposed approach uses a classical partitioning method together with a probabilistic model. The approach is evaluated in terms of the time required for mining words, accuracy, and efficiency. The presented results show that the proposed approach performs better in terms of accuracy and running time. In future work, security for metadata-based side-information generation is to be provided using an Intrusion Detection System (IDS).
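
As an illustration of the first pre-processing steps named in the abstract (splitting, tokenization, stemming) and of extracting one piece of side information (email addresses), a minimal Python sketch follows. It is not the authors' implementation: the abstract does not name any tools, so all function names, the regular expressions, and the toy suffix list are assumptions made here for illustration, and parsing and chunking are left out.

```python
import re

# Assumed patterns for illustration; the paper does not specify them.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")  # split after ., ! or ? followed by whitespace
TOKEN_RE = re.compile(r"[A-Za-z]+")

# Toy suffix list: a crude stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(text):
    """Splitting: break raw text into sentences."""
    return [s for s in SENTENCE_RE.split(text.strip()) if s]


def tokenize(sentence):
    """Tokenization: lowercase alphabetic tokens only."""
    return [t.lower() for t in TOKEN_RE.findall(sentence)]


def stem(token):
    """Stemming: strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed tokens per sentence."""
    return [[stem(t) for t in tokenize(s)] for s in split_sentences(text)]


def extract_emails(text):
    """Side information: collect email addresses found in the text."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.org for details. Mining texts is rewarding."
    print(extract_emails(sample))
    print(preprocess(sample))
```

The suffix stripper is deliberately simple and over-stems some words (for example, "mining" becomes "min"); a production pipeline would more plausibly rely on an established NLP toolkit for stemming, parsing, and chunking.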
