Three-way Learnability: A Learning Theoretic Perspective on Three-way Decision

In this article we study the theoretical properties of Three-way Decision (TWD) based Machine Learning, from the perspective of Computational Learning Theory, as a first attempt to bridge the gap between Machine Learning theory and Uncertainty Representation theory. Drawing on the mathematical theory of orthopairs, we provide a generalization of the PAC learning framework to the TWD setting, and we use this framework to prove a generalization of the Fundamental Theorem of Statistical Learning. We then show, by means of our main result, a connection between TWD and selective prediction.


I. INTRODUCTION
In recent years, there has been increasing interest in exploring the connections between learning theory and different uncertainty representation theories: this trend includes both the generalization of standard learning-theoretic tools and techniques to settings involving representation formalisms more general than probability theory [1], [2], and the theoretical study of algorithms inspired by uncertainty representation [3], [4].
Among other uncertainty representation theories, Three-way Decision (TWD) is an emerging computational paradigm, first proposed by Yao in the context of Rough Set Theory [5], based on the simple idea of thinking in three "dimensions" (rather than in binary terms) when representing and managing computational objects [6]: in the Machine Learning (ML) [7] setting, this notion usually takes the form of allowing ML models to abstain. This approach has attracted large interest, justified also by promising empirical results in different ML tasks such as active learning [8], [9], cost-sensitive classification [10], and clustering [11], [12], [9]. Despite these promising empirical results, the theoretical foundations of TWD-based ML have so far received little attention [13], [14]. Indeed, even though there has recently been increasing interest in generalizing computational learning theory (CLT) to cautious inference methods such as selective prediction [15] or the KWIK (Knows What It Knows) framework [16], such results cannot be easily applied to the TWD setting: while in TWD abstention is a property of single classifiers, in the latter two frameworks abstention is usually achieved by consensus voting.
In this article, we study the generalization of a standard CLT mathematical framework, the Probably Approximately Correct (PAC) learning framework, to the TWD setting: in particular, we provide a generalization of the Fundamental Theorem of Statistical Learning to the TWD setting, and we show that our result generalizes previous results in the selective prediction setting. The rest of this article is structured as follows: in Section II we provide the necessary mathematical background on TWD (Section II-A) and CLT (Section II-B); in Section III we describe the generalization of the PAC learning framework to the TWD setting and prove our main result; finally, in Section IV, we summarize our contribution and describe possible research directions.

A. Three-way Decision and Orthopairs
In this work we will refer to the formalization of TWD-based ML models (in the following, TW Classifiers) as orthopairs:

Definition 1. An orthopair [17] over the universe X (which represents the instance space) is a pair of sets O = (P, N) such that P, N ⊆ X and P ∩ N = ∅, with P and N standing, respectively, for positive and negative. The boundary is defined as Bnd = (P ∪ N)^c.
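As a concrete illustration, an orthopair over a finite universe can be sketched as follows (the class and method names are our own, chosen for illustration, not taken from the orthopair literature; abstention is encoded as None):

```python
# Minimal sketch of an orthopair O = (P, N) over a finite universe X.
# Names (Orthopair, classify, boundary) are illustrative only.

class Orthopair:
    def __init__(self, P, N):
        P, N = set(P), set(N)
        assert P.isdisjoint(N), "P and N must be disjoint"
        self.P, self.N = P, N

    def boundary(self, X):
        """Bnd = (P ∪ N)^c, relative to the universe X."""
        return set(X) - self.P - self.N

    def classify(self, x):
        """Return 1 on P, 0 on N, and abstain (None) on the boundary."""
        if x in self.P:
            return 1
        if x in self.N:
            return 0
        return None  # abstention


O = Orthopair(P={1, 2}, N={4})
print(O.classify(1), O.classify(4), O.classify(3))  # 1 0 None
print(O.boundary({1, 2, 3, 4}))  # {3}
```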
An orthopair represents an uncertain concept: specifically, the status of the elements in the boundary is uncertain (i.e., it is not known whether they belong to the concept). Thus, a given orthopair stands as an approximation for a collection of consistent concepts. Finally, we remark that it is possible to define different orderings between orthopairs: in particular, O_2 is less informative than O_1 (denoted O_2 ≤_I O_1) if P_2 ⊆ P_1 and N_2 ⊆ N_1.

B. Computational Learning Theory
Computational Learning Theory (CLT) [18] is the branch of Machine Learning and Theoretical Computer Science that focuses on the theoretical study of learning algorithms. Various mathematical formalisms have been proposed toward this goal; in this article we refer to the PAC (Probably Approximately Correct) learning framework, first proposed in [19]. Formally, let X be the instance space and Y the target space; in this article we focus on the binary classification setting, that is, Y = {0, 1}. We assume that the observable data is generated i.i.d. according to an unknown probability distribution D over X × Y. Let H be a hypothesis class, that is, a set of functions h : X → Y. The true risk of a hypothesis h is defined as L_D(h) = E_{(x,y)∼D}[l(h(x), y)], where l : Y^2 → R^+ is a loss function. Since D is unknown, the true risk cannot be computed: it is usually approximated through the so-called empirical risk, based on a sample, called the training set, S = ((x_1, y_1), ..., (x_m, y_m)): L_S(h) = (1/m) Σ_i l(h(x_i), y_i). Given a training set S, we denote by S_X the tuple S_X = (x_1, ..., x_m), and by S_Y the tuple S_Y = (y_1, ..., y_m). Empirical Risk Minimization w.r.t. the hypothesis class H is the family of algorithms ERM_{H,m} returning a hypothesis ERM_H(S) ∈ argmin_{h ∈ H} L_S(h). The Fundamental Theorem of Statistical Learning [20] establishes a relation between the true risk and the empirical risk for the ERM algorithm w.r.t. a hypothesis class H, which depends only on the so-called VC dimension, a combinatorial measure of the complexity of H.
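The empirical-risk machinery just recalled can be sketched for a finite hypothesis class under the 0-1 loss (the hypothesis representation and the toy threshold class below are our own illustration):

```python
# Sketch: empirical risk L_S(h) and ERM over a finite hypothesis class H,
# under the 0-1 loss. Representation choices here are illustrative only.

def empirical_risk(h, S, loss=lambda yhat, y: int(yhat != y)):
    """L_S(h) = (1/m) * sum_i loss(h(x_i), y_i)."""
    return sum(loss(h(x), y) for x, y in S) / len(S)

def erm(H, S):
    """Return a hypothesis in H minimizing the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy example: threshold classifiers on the real line.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
S = [(0.2, 0), (0.4, 0), (0.7, 1), (0.9, 1)]
best = erm(H, S)
print(empirical_risk(best, S))  # 0.0
```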
Theorem 1 (Fundamental Theorem of Statistical Learning). Let H be a hypothesis class with VC dimension d < ∞, and let ε, δ ∈ (0, 1). Then there exists n_0 = O((d + ln(1/δ))/ε^2) such that, if S is a training set of size m ≥ n_0, with probability greater than 1 − δ it holds that |L_D(ERM_H(S)) − L_S(ERM_H(S))| ≤ ε. If, further, the realizability assumption holds (that is, ∃h ∈ H s.t. L_D(h) = 0), then there exists n_1 = O((d ln(1/ε) + ln(1/δ))/ε) such that, if S is a training set of size m ≥ n_1, with probability greater than 1 − δ it holds that L_D(ERM_H(S)) ≤ ε.
Few works have studied the generalization of CLT results to hypotheses that can be described as orthopairs (that is, classifiers that can abstain on selected instances), mainly under the framework of selective prediction [21]. In this setting, the goal is to design learning algorithms A_{H,m} whose output is allowed to abstain on certain instances. This abstention is usually achieved either by combining a standard hypothesis h : X → Y with a rejection function r : X → {⊥, ⊤}, or, equivalently, by consensus voting based on a version space V ⊆ H [21]. As we show in the following sections (specifically, in Section III-A), the setting we consider is a proper generalization of selective prediction. More recently, the application of orthopairs in CLT has been studied in the setting of adversarial machine learning [22], as well as to characterize the generalization capacity of hypothesis classes under generative assumptions [23]. We note, however, that even though the above-mentioned works and the framework we study in this article all rely on the representation formalism of orthopairs, the aims of these three frameworks are essentially orthogonal, also in terms of the mathematical techniques adopted: indeed, while the three-way learning framework we study relies on a generalization of the ERM paradigm, the frameworks studied in [23], [22] rely on a transductive learning approach.
III. THREE-WAY LEARNING

In this section, we provide a first study of a generalization of standard Computational Learning Theory to the setting of TW Classifiers. As hinted in Section II-A, we will represent a TW Classifier as an orthopair O; a hypothesis space of TW Classifiers will then be represented as a collection O of orthopairs over X. In the TWD literature, the risk of a TW Classifier is usually evaluated by means of a cost-sensitive generalization of the 0-1 loss: l_TW(O(x), y) = 0 if O(x) = y; λ_a if O(x) = ⊥ (i.e., x ∈ Bnd); 1 if O(x) ⊥ y, where λ_a ∈ [0, 0.5) is the cost of abstention, and O(x) ⊥ y denotes the error case, that is, (x ∈ P and y = 0) or (x ∈ N and y = 1). Compared to the standard definition of risk adopted in the TWD literature, we assume that the cost of an error is always 1.
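A minimal sketch of this cost-sensitive loss (encoding abstention as None is our own representational choice):

```python
# Sketch of the three-way loss l_TW: 0 if the prediction is correct,
# lambda_a (the abstention cost, in [0, 0.5)) if the classifier abstains,
# and 1 in the error case. Encoding abstention as None is our choice.

def l_tw(prediction, y, lambda_a=0.3):
    if prediction is None:  # abstention: x lies in the boundary
        return lambda_a
    return 0.0 if prediction == y else 1.0  # unit error cost

print(l_tw(1, 1), l_tw(None, 1), l_tw(0, 1))  # 0.0 0.3 1.0
```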
Based on the loss function l_TW, we can define both the true risk L_TW_D(O) = E_{(x,y)∼D}[l_TW(O(x), y)] and the empirical risk L_TW_S(O) = (1/m) Σ_i l_TW(O(x_i), y_i). Evidently, the risk of O can be decomposed as the sum of two terms: letting E_D(O) = Pr_{(x,y)∼D}(O(x) ⊥ y) and A_D(O) = Pr_{x∼D}(O(x) = ⊥), it holds that L_TW_D(O) = E_D(O) + λ_a A_D(O). The same decomposition similarly applies to the empirical risk. We say that D is weakly realizable w.r.t. O if the set O_OPT = {O ∈ O : E_D(O) = 0} is non-empty; if, furthermore, ∃O* ∈ O_OPT s.t. A_D(O*) = 0, then we say that D is strongly realizable. Throughout this article, we will assume only weak realizability. Compared to the standard realizability assumption, weak realizability is indeed much weaker: as an example, if the vacuous TW Classifier O_⊥ = (∅, ∅) belongs to O, then every distribution D is trivially weakly realizable w.r.t. O, while it is clearly not, in general, strongly realizable. Let ε ∈ (0, 1) and α ∈ (0, λ_a); then O ∈ O makes an (ε, α)-failure if one of the following holds: E_D(O) > ε, or A_D(O) > min_{O' ∈ O_OPT} A_D(O') + α. Thus, O (ε, α)-fails if either its error is greater than ε, or its abstention rate exceeds, by a margin of at least α, the lowest abstention rate among those TW Classifiers that make no error. We thus define the notion of Three-way learnability: O is TW learnable if there exist an algorithm A and a sample-size function m(ε, α, δ) such that, for every weakly realizable D, whenever A is run on a training set of size at least m(ε, α, δ), with probability greater than 1 − δ its output does not (ε, α)-fail. We then want to provide a characterization of TW learnability, similar to Theorem 1. For this purpose, we first define a generalization of the ERM algorithm to the TWD setting, which we call Three-way Risk Minimization (TW-RM): TWRM_O(S) selects an orthopair maximizing the abstention rate on X \ S_X among those in argmin_{O ∈ O} L_TW_S(O). Thus, the TW-RM algorithm selects, among those TW Classifiers with minimal empirical risk, the TW Classifier with maximal abstention rate on the non-observed instances (that is, the instances in X \ S_X). This has the goal of minimizing errors on non-observed instances, and is analogous to the maximum margin principle and to the disagreement coefficient in version space learning, active learning, and selective prediction [15]. In order to characterize TW learnability, given a hypothesis class O (i.e., a collection of orthopairs), we define two derived hypothesis classes.
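The TW-RM selection rule described above can be sketched for a finite universe and a finite collection of orthopairs (all names are our own; ties beyond the two stated criteria are broken arbitrarily):

```python
# Sketch of Three-way Risk Minimization (TW-RM) over a finite universe X
# and a finite collection of orthopairs, each given as a pair (P, N).
# Among empirical-risk minimizers, TW-RM picks the orthopair abstaining
# the most on the unobserved instances X \ S_X. Names are illustrative.

def tw_predict(O, x):
    P, N = O
    return 1 if x in P else 0 if x in N else None

def tw_empirical_risk(O, S, lambda_a):
    def l(pred, y):
        return lambda_a if pred is None else float(pred != y)
    return sum(l(tw_predict(O, x), y) for x, y in S) / len(S)

def tw_rm(orthopairs, S, X, lambda_a=0.3):
    risks = [tw_empirical_risk(O, S, lambda_a) for O in orthopairs]
    best = min(risks)
    minimizers = [O for O, r in zip(orthopairs, risks) if r == best]
    unseen = set(X) - {x for x, _ in S}
    # among minimizers, maximize abstention on the unobserved instances
    return max(minimizers,
               key=lambda O: sum(tw_predict(O, x) is None for x in unseen))

X = {1, 2, 3, 4}
S = [(1, 1), (2, 0)]
orthopairs = [({1}, {2}), ({1, 3}, {2, 4}), ({1}, {2, 4})]
chosen = tw_rm(orthopairs, S, X)
print(chosen)  # ({1}, {2}): zero empirical risk, abstains on both 3 and 4
```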
Given any orthopair O ∈ O, we can define an associated binary classifier capturing its error behavior; we denote by H_O the binary hypothesis class derived from O in this way. In regard to the second derived hypothesis class, we observe that the order ≤_I defined in Section II-A endows O with a meet-semilattice structure [17], from which a second binary class, capturing the abstention behavior of O, can be derived. We now prove a generalization of Theorem 1 to the TWD setting, through which we show that the TW learnability of a hypothesis class O, using the TW-RM algorithm, can be characterized in terms of these two derived hypothesis classes. In order to do so, we denote by d_e the VC dimension of H_O and by d_a the VC dimension of the second derived class. Then, the following result holds: if d_e, d_a < ∞, then O is TW learnable by means of the TW-RM algorithm.
Proof. We want to guarantee that, with probability greater than 1 − δ, neither E_D(TWRM_O(S)) > ε nor A_D(TWRM_O(S)) > min_{O' ∈ O_OPT} A_D(O') + α holds; the result then follows by uniform convergence. By the union bound, the probability of an (ε, α)-failure is at most the sum of the probabilities of these two events; thus, it is sufficient to upper bound each of the two summands by δ/2. As regards the error rate (i.e., E) bound, we note that H_O is a binary hypothesis class; then, by Theorem 1, the error bound holds with probability greater than 1 − δ/2 as long as |S| ≥ (1/ε)(d_e + ln(2/δ)). Furthermore, by uniform convergence this holds, in particular, for TWRM_O(S).
For the abstention part, the same line of reasoning can be applied; however, as we only assume weak realizability, only the agnostic part of Theorem 1 can be used. Then, as long as |S| ≥ (λ_a/α^2)(d_a + ln(2/δ)), it holds that |A_D(O) − A_S(O)| < α with probability greater than 1 − δ/2. This holds, in particular, for TWRM_O(S), and thus the theorem follows by uniform convergence and Eq. (12).
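To make the resulting sample sizes concrete, the two bounds used in the proof can be combined numerically; the constants below follow the text of the proof and should be read as indicative only (the function name and example values are ours):

```python
import math

# Sketch: sample sizes suggested by the two bounds in the proof above,
#   m_e >= (1/eps) * (d_e + ln(2/delta))        (error part)
#   m_a >= (lambda_a/alpha^2) * (d_a + ln(2/delta))  (abstention part).
# Constants follow the extracted proof text and are indicative only.

def tw_sample_size(eps, alpha, delta, d_e, d_a, lambda_a):
    m_e = (d_e + math.log(2 / delta)) / eps
    m_a = lambda_a * (d_a + math.log(2 / delta)) / alpha**2
    return math.ceil(max(m_e, m_a))

print(tw_sample_size(eps=0.1, alpha=0.05, delta=0.05, d_e=5, d_a=5, lambda_a=0.3))
```

Note that, for small α, the abstention bound dominates: the α^2 dependence mirrors the agnostic part of Theorem 1, while the error part enjoys the faster 1/ε rate thanks to weak realizability.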
As a simple corollary, it can easily be verified that an analogous result holds in the strongly realizable setting.

A. Three-way Learning and Selective Prediction
Finally, we show that the proposed mathematical framework and the obtained results can be used to establish a connection between TWD and selective prediction. This result relies on the connection between version space theory and orthopairs [17], and allows us to derive a generalization bound, originally proven by El-Yaniv et al. [21], for selective prediction: this shows that the latter setting can be understood as a special case of TWD. Let H be a hypothesis class of binary classifiers. We call the Three-way Closure of H, denoted TW(H), the hypothesis space obtained by associating with each subset V ⊆ H the orthopair O_V = (P_V, N_V) on which all hypotheses in V unanimously agree, that is, P_V = {x ∈ X : ∀h ∈ V, h(x) = 1} and N_V = {x ∈ X : ∀h ∈ V, h(x) = 0}. The corollary then states an equality between the two settings, together with a bound that holds with probability greater than 1 − δ. Proof. The first equality easily follows from strong realizability and by noting the definition of TW(H). In regard to the second statement, the first inequality follows by standard algebraic manipulations; the equality, on the other hand, follows by noting that |TW(H)| = 2^|H| (as TW(H) contains a TW Classifier for each possible subset of hypotheses in H).
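The three-way closure can be sketched on a finite universe by enumerating subsets of H and taking unanimous votes; this follows the version-space reading of selective prediction recalled in Section II-B, with the empty subset mapped to the vacuous classifier (names and this last convention are our own):

```python
from itertools import combinations

# Sketch: the three-way closure TW(H) of a finite binary hypothesis class
# H over a finite universe X. Each subset V of H yields the orthopair
# (P_V, N_V) on which all hypotheses in V agree; elsewhere the classifier
# abstains. The empty subset is mapped to the vacuous classifier (∅, ∅).

def tw_closure(H, X):
    closure = []
    for k in range(len(H) + 1):
        for V in combinations(H, k):
            if not V:
                closure.append((frozenset(), frozenset()))  # vacuous classifier
                continue
            P = frozenset(x for x in X if all(h(x) == 1 for h in V))
            N = frozenset(x for x in X if all(h(x) == 0 for h in V))
            closure.append((P, N))
    return closure

X = {0, 1, 2}
H = [lambda x, t=t: int(x >= t) for t in (1, 2)]
closure = tw_closure(H, X)
print(len(closure))  # 4, i.e. one TW classifier per subset of H (2^|H|)
```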

IV. CONCLUSION
In this article, we aimed to provide an initial study of the generalization of CLT results to the TWD setting. To this purpose, we first proposed an extension of the standard PAC learning framework to the TWD setting, which we called Three-way Learning, and showed that our results generalize previously known results in the selective prediction literature. As our results represent only a first step in the theoretical study of TWD as applied to Machine Learning, we believe that the following questions would be of particular interest:
• Our analysis in Theorem 2 relies on a generalization of the VC dimension to the TWD setting. Tighter bounds can usually be obtained by relying on concepts such as Rademacher complexities or covering numbers [18]. How can these be generalized to TWD?
• In Corollary 2 we proved that, in the realizable case, selective prediction can be understood as a special case of TWD learning. Does this analysis also apply to the agnostic (i.e., non-realizable) setting [15]?
• PAC-Bayes bounds [24] study generalization bounds that apply when a probability distribution is defined over the hypothesis space. How can the PAC-Bayes framework be generalized to TWD? Interestingly, a very similar open problem has recently been posed in Belief Function Theory (BFT) [25]. Due to the connection with random sets, a belief function can be seen as a probability distribution over orthopairs [26]: the generalization of the PAC-Bayes framework to TWD would thus also enable studying the relationships between TWD and BFT.