Importance of Text Data Preprocessing & Implementation in RapidMiner

Vaishali Kalra; Rashmi Aggarwal

DOI: http://dx.doi.org/10.15439/2017KM46

Citation: Proceedings of the 2017 International Conference on Information Technology and Knowledge Management, Ajay Jaiswal, Vijender Kumar Solanki, Zhongyu (Joan) Lu, Nikhil Rajput (eds). ACSIS, Vol. 14, pages 71–75 (2017)

Abstract. Data preparation is an important phase that precedes the application of any machine learning algorithm, and text data is no exception. Text is prepared through preprocessing, which means cleaning out noise such as stop words, punctuation, and terms that carry little weight in the context of the text. In this paper, we describe in detail how to prepare data for machine learning algorithms using the RapidMiner tool. The preprocessing is followed by conversion of the bag of words into a term vector model, and we describe the various algorithms that can be applied in RapidMiner for data analysis and predictive modeling. We also discuss the challenges and recent applications of text mining.
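The preprocessing steps named in the abstract can be illustrated outside RapidMiner with a minimal Python sketch. This is not the paper's RapidMiner process; it only mirrors the general idea of removing punctuation and stop words and then turning each bag of words into a term-frequency vector. The stop-word list, the function names `preprocess` and `term_vectors`, and the two sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real tools ship far larger ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "to", "it"}


def preprocess(text):
    """Lowercase the text, keep only runs of ASCII letters, drop stop words."""
    # Keeping only [a-z]+ discards punctuation and digits in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_vectors(documents):
    """Convert each document's bag of words into a term-frequency vector."""
    bags = [Counter(preprocess(doc)) for doc in documents]
    # The shared vocabulary fixes the meaning of each vector position.
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors


# Illustrative sample documents (not taken from the paper).
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocabulary, vectors = term_vectors(docs)
print(vocabulary)
print(vectors)
```

Each row of `vectors` is one document and each column is one term of `vocabulary`, so the rows can be passed to a classifier or clustering algorithm. Raw counts are used here for simplicity; a weighting such as TF-IDF could be substituted at the same point.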
