Enhancing naive classifier for positive unlabeled data based on logistic regression approach
Mateusz Płatek, Jan Mielniczuk
DOI: http://dx.doi.org/10.15439/2023F1402
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 225–233 (2023)
Abstract. It is argued that, for the analysis of Positive Unlabeled (PU) data under the Selected Completely At Random (SCAR) assumption, it is fruitful to view the problem as fitting a misspecified model to the data. Namely, it is shown that results on misspecified fits imply the following: when the posterior probability of the response is modelled by logistic regression, fitting logistic regression to the observable PU data, which do not follow this model, still yields a vector of estimated parameters approximately collinear with the true parameter vector. This observation, together with choosing the intercept of the classifier by optimising an analogue of the F1 measure, yields a classifier which performs on par with or better than its competitors on several real data sets considered.
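The two ideas in the abstract can be illustrated on simulated data. The following is a minimal sketch under stated assumptions, not the authors' implementation: predictors are standard Gaussian, the true coefficients `beta`, the label frequency `c`, and the sample size are arbitrary illustrative values, and the "analogue of F1" is taken here to be the Lee and Liu criterion recall² / P(ŷ = 1), computed from labeled examples only. The paper's exact criterion may differ.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Simulated PU data under SCAR (all constants are illustrative assumptions).
beta = [1.0, -2.0, 0.5]   # true slope vector of P(y=1 | x), true intercept 0
c = 0.5                   # label frequency P(s=1 | y=1)
n = 3000
X, y, s = [], [], []
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in beta]
    yi = 1 if random.random() < sigmoid(dot(beta, x)) else 0
    si = 1 if yi == 1 and random.random() < c else 0  # only positives get labeled
    X.append(x)
    y.append(yi)
    s.append(si)

def fit_logistic(X, t, steps=200, lr=1.0):
    """Plain gradient ascent on the mean log-likelihood; w[0] is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, ti in zip(X, t):
            r = ti - sigmoid(w[0] + dot(w[1:], x))
            grad[0] += r
            for j, xj in enumerate(x):
                grad[j + 1] += r * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Naive fit: treat the labeling indicator s as if it were the class y.
w = fit_logistic(X, s)
slope = w[1:]
cos_sim = dot(slope, beta) / math.sqrt(dot(slope, slope) * dot(beta, beta))

def f1(pred, truth):
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(pred), tp / sum(truth)
    return 2 * precision * recall / (precision + recall)

# Keep the fitted direction, then choose the threshold (i.e. the intercept)
# that maximises recall^2 / P(y_hat = 1), which needs only the labels s.
scores = [dot(slope, x) for x in X]
n_labeled = sum(s)
best_t, best_crit = None, -1.0
for t in sorted(scores)[::30]:            # grid of candidate thresholds
    pred = [sc >= t for sc in scores]
    recall_labeled = sum(1 for p, si in zip(pred, s) if p and si) / n_labeled
    crit = recall_labeled ** 2 / (sum(pred) / n)
    if crit > best_crit:
        best_t, best_crit = t, crit

# True labels y are used only for evaluation, which is possible in a simulation.
naive_pred = [w[0] + sc >= 0 for sc in scores]   # fitted P(s=1 | x) >= 0.5
tuned_pred = [sc >= best_t for sc in scores]
f1_naive = f1(naive_pred, y)
f1_tuned = f1(tuned_pred, y)

print(f"cosine(slope, beta) = {cos_sim:.3f}")
print(f"F1 naive = {f1_naive:.3f}, F1 with tuned intercept = {f1_tuned:.3f}")
```

In this setup the naive fit targets P(s=1 | x) = c · P(y=1 | x), which never exceeds `c`, so its default 0.5 cut-off predicts few positives; re-choosing only the intercept while keeping the fitted direction is what the sketch is meant to show.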