Filtering Decision Rules Driven by Sequential Forward and Backward Selection of Attributes: An Illustrative Example in Stylometric Domain

The paper presents investigations concerning the decision rule filtering process controlled by the estimated relevance of available attributes. In the conducted study, two search directions were used, sequential forward selection and sequential backward elimination. The steps of sequential search were governed by three rankings obtained for variables, all related to characteristics of data and rules that can be induced, as follows, (i) a ranking based on the weighting factor referring to the occurrence of attributes in generated decision reducts, (ii) the OneR ranking exploiting short rule properties, and (iii) the proposed ranking defined through the operation of greedy algorithm for rule induction. The three rankings were confronted and compared from the perspective of their usefulness for the selection of rules performed in the two directions and with two strategies for rule selection. The resulting sets of rules were analysed with respect to the properties of the constituent decision rules and from the point of performance for all constructed rule-based classifiers. Substantial experiments were carried out in the stylometric domain, treating the task of authorship attribution as classification. The results obtained indicate that for all three rankings and search paths it was possible to obtain a noticeable reduction of attributes while at least maintaining the power of inducers, at the same time improving characteristics of rule sets.


I. INTRODUCTION
One of the main goals of data mining is the extraction of useful knowledge from large amounts of data or phenomena described by a high number of attributes. An important element of this process is the determination and selection of the most important attributes related to the described phenomenon [1]. The objective of this step, called feature selection, is to differentiate relevant variables from the entire set of features, while at the same time preserving the descriptive and representative qualities of the original set of attributes [2].
Feature selection can be accomplished by selecting a minimal subset of features that enables obtaining at least the same performance of a classifier as for the entire set of attributes [3]. In this case, feature subset selection requires assessing the quality of each discovered feature subset. Another way of proceeding is to construct a ranking of features based on a specific criterion. Then the variables are ordered from the most to the least important, and the top k features are selected based on a predefined threshold. Feature ranking is also known as feature weighting and involves evaluating individual attributes by assigning weights to them based on their relevance.
The technique used to search the space of variables during the attribute selection process is an important factor. Since the problem of locating an optimal subset of features, taking into account all possible variable subsets, is NP-hard, greedy techniques, such as forward selection and backward elimination, are often used instead of exhaustive search. Forward selection begins with an empty set, which is gradually expanded by adding one feature (or a group of features) at a time until specific criteria are met. Sequential backward elimination involves starting with all attributes and progressively discarding them. Depending on the adopted criterion, the added or rejected attributes can correspond to the highest positions in the ranking, or they can be the lowest ranking elements.
One of the disadvantages of sequential selection is that interactions among features are not closely studied and dependencies can be missed when only one path of selection is investigated [4]. This problem can be remedied to some extent by performing feature selection through patterns discovered in the data, such as decision rules, and discarding these rules only when they depend entirely on rejected variables, while keeping under consideration those that also refer to at least one attribute contained in the retained set. With this kind of processing, interactions among variables have more influence on the properties of the recalled sets of rules.
The aim of the research presented in the paper and its contribution is the investigation and comparison of three influential factors, as follows: (i) two search strategies, i.e., sequential forward selection vs. sequential backward elimination, applied not directly to the variables in the dataset but through the process of filtering decision rules, (ii) two approaches to rule selection, i.e., retaining rules that contain conditions only on the variables still under consideration vs. keeping the rules that include conditions on at least one of the attributes contained in the studied set, (iii) three ranking mechanisms, the OneR available in the WEKA workbench [5], and two proposed ones, exploiting the properties of data and patterns discovered in them. One of the proposed rankings referred to the defined weighting factor, which takes into account the number of reducts in which a given attribute occurs and the cardinalities of these reducts [6], while the other was based on the properties of the greedy algorithm for the induction of decision rules, the number of occurrences in the rules and their support.
All experiments were performed on two datasets from the stylometry domain. The writing styles of the considered writers were learnt from available texts through the analysis of quantitative linguistic descriptors and advanced processing. To prevent bias in the observations, the datasets were prepared for the task of binary authorship attribution with balanced classes. The performance of induced rule-based classifiers was estimated with the help of test sets, over which the classification accuracy was averaged.
The results obtained led to the conclusion that all search paths resulted in increased performance for reduced sets of features while improving the characteristics of the constructed rule sets. Backward elimination with keeping the rules referring to any attributes in the considered set allowed for the reduction of more attributes than forward selection with limiting conditions in rules only to the variables still present. The three investigated rankings produced close maximal predictions but for different numbers of attributes and rules. The greedy ranking held its ground when pitted against the other two, and it even led to the one case of perfect recognition. These observations proved the merits of the described research works and again validated the methodology for ranking-driven rule selection.
The paper is organised as follows. Section II presents background information related to feature selection and induction of decision rules. Section III provides a description of stylometric analysis of texts, as the application domain. Section IV contains the explanation for the experiments performed and comments on the results obtained. Conclusions and future research plans are given in Section V.

II. BACKGROUND INFORMATION
In this section, aspects related to feature selection and decision rules are provided. Search strategies are described in the context of feature selection, and the main approaches for induction of rules are presented. Finally, the processing steps of rule filtering driven by feature selection are given.

A. Feature selection
During recent years, due to increasing demands for dimensionality reduction, extensive efforts in feature selection research have been made. Feature selection can be realised as a stage of data mining, related to data pre-processing, and then it affects such elements as visualisation, learning algorithms, and performance of classifiers. The main task of feature selection is to remove irrelevant or redundant variables so that their elimination from the set of attributes will not affect the performance of the learning algorithms [7]. The process of feature selection allows for data reduction and lowering of storage requirements. Furthermore, since the goal is to find the most relevant variables, it is possible to strive to improve data quality by enhancing data mining algorithms, that is, reducing learning time and improving predictive capabilities.
A feature selection procedure can be considered to contain three stages: (i) search for potential subsets of variables, (ii) evaluation of the subset of attributes based on some criteria, and (iii) setting the stop condition for the search. The final stage is closely linked to the initial one, as the search is repeated iteratively until the stopping criterion is met.
Due to the large search space, feature selection is also perceived as a combinatorial problem: for a dataset with $N$ attributes, the search space contains $2^N$ candidate subsets. Searching for an optimal subset of features, taking into account all possible variable subsets, is an NP-hard problem [8]. An exhaustive search can be performed only if the number of attributes is relatively small. Instead, greedy [9] or meta-heuristic [10] approaches can be used.
To select a subset of variables from the input data, different search strategies can also be applied, including genetic algorithms, evolutionary computation techniques, heuristic search algorithms, and various hybrid strategies. Among greedy techniques, the sequential search performed as forward selection and backward elimination can be distinguished [11]. The sequential backward elimination method starts with all the variables, and then features are gradually removed from the set, either one by one or in groups. In each step, the eliminated variable or variables contribute the least to the criterion function. Forward selection starts with the empty set, to which features are added sequentially, again either one at a time or in groups, until certain criteria are met.
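The two sequential strategies described above can be illustrated with a minimal Python sketch. The function names, the stopping rule (a fixed target size `k`), and the toy additive criterion are assumptions made only for illustration; they are not taken from the experiments reported in the paper.

```python
from typing import Callable, FrozenSet, List, Sequence

Criterion = Callable[[FrozenSet[str]], float]


def forward_selection(features: Sequence[str], criterion: Criterion, k: int) -> List[str]:
    """Start from the empty set and add the single best feature until k are chosen."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: criterion(frozenset(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(features: Sequence[str], criterion: Criterion, k: int) -> List[str]:
    """Start from all features and drop the one contributing least until k remain."""
    selected = list(features)
    while len(selected) > k:
        # The feature whose removal leaves the highest criterion value contributes least.
        worst = max(selected, key=lambda f: criterion(frozenset(selected) - {f}))
        selected.remove(worst)
    return selected


# Hypothetical criterion: the sum of per-feature weights (purely illustrative).
weights = {"a": 3.0, "b": 1.0, "c": 2.0}


def score(subset: FrozenSet[str]) -> float:
    return sum(weights[f] for f in subset)


print(forward_selection(list(weights), score, 2))
print(backward_elimination(list(weights), score, 2))
```

Both procedures commit to each step and never revisit it, which is why neither can guarantee an optimal subset.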
Both search strategies are heuristic and cannot guarantee the optimality of the selected features. Among the alternatives to these approaches, floating, branch-and-bound, and randomised search methods can be mentioned [12]. Random search methods, for example, genetic algorithms, add some randomness to the search procedure to help escape from a local optimum. In certain cases, especially when dealing with high-dimensional datasets, an individual search is performed. Such methods evaluate each feature individually based on a specific criterion or condition. The branch-and-bound algorithm finds the optimal feature subset if the criterion function used is monotonic [3]. Floating search methods prevent the situation where a variable deleted in backward elimination cannot be reselected, and also where a feature added in forward selection cannot be deleted once it has been selected [11].

B. Ranking construction
Feature selection can be performed in two different ways, by selecting a subset of attributes or by creating a ranking of variables [13]. In the latter case, the variables are ordered according to the adopted criterion or evaluation function from the most important to the least important, and the top k attributes are selected from the ranking, with k being some pre-selected threshold number. Feature ranking plays an important role in directing the search process in different machine learning tasks, especially when an exhaustive search is computationally infeasible and a heuristic search approach is necessary. It determines the order in which the variables are explored by the algorithms within the feature space.
Feature ranking methods use different measures, for example, based on similarity score, statistics, information theory, or on some functions of the classifier's outputs [1]. Traditional ranking approaches evaluate variables without incorporating any learning algorithm. This category typically consists of filter-based feature selection methods, such as those referring to information gain, correlation, or the Relief algorithm. However, there are also some studies on wrapper techniques, which involve methods such as recursive feature elimination [14] and the classifier-aided feature ranking approach [15].
In the paper, three ranking mechanisms were studied, related to the properties of the data and to discovered patterns in the form of decision reducts and decision rules. One ranking was based on the defined weighting factor calculated through reducts, another was related to the OneR algorithm, and the third ranking was proposed by the authors and based on the properties of the greedy algorithm for rule induction. All the rankings obtained were used as filters for sets of induced rules.
1) Ranking of attributes based on reducts: A reduct is one of the key notions in rough set theory [16] and refers to feature selection performed within the framework of rough sets. There are many definitions of a reduct because they deal with different criteria related to the selection of attributes and computing the most relevant sets of variables, for example, decision and local reducts for decision tables, reducts for information systems, reducts based on the generalised decision, or fuzzy decision reducts.
A reduct can be defined as a minimal set of attributes that preserves the degree of dependency of the entire set of attributes. Taking into account the performance, the reduct is such a minimal subset of attributes that has the same classification power as the complete set of available attributes [17].
The problem of calculating reducts is NP-hard; therefore, different heuristic approaches are used for their construction, for example, finding reducts through sampling data from a decision table [18], heuristics based on the discernibility matrix [19], greedy algorithms [9], Boolean reasoning, and many others [20]. In the investigation presented in the paper, the genetic algorithm [21], implemented in the Rough Sets Exploration System (RSES) [22], was used to construct the reducts. It is a binary genetic algorithm where every binary individual encodes one subset of attributes that is a potential reduct. The fitness function of a subset R has the form:

$$F(R) = \frac{n - L_R}{n} + \frac{C_R}{(m^2 - m)/2}, \qquad (1)$$

where $n$ is the length of the bit strings, equal to the number of attributes, and $m$ gives the number of objects. $L_R$ denotes the number of "1"-s in the subset R, and $C_R$ denotes the number of object pairs (with different decision values) discerned by the attribute subset R. Calculating $C_R$ is the most time-consuming operation. It is accelerated by the "distinction table", a binary matrix of size $(n+1) \times (m^2 - m)/2$. Each column corresponds to one attribute (the last column corresponds to the decision), and each row corresponds to one pair of different objects. The value "1" denotes an attribute with a different value on the pair of objects. Finding a reduct means finding the minimal subset of columns that cover the matrix.
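The quantities named above can be computed directly for a small decision table. The sketch below is not the RSES implementation: the function names are hypothetical, pairs are enumerated naively instead of through a precomputed distinction table, and the fitness is assumed to combine the share of unused attributes with the share of discerned pairs, $(n - L_R)/n + C_R / ((m^2 - m)/2)$.

```python
from itertools import combinations
from typing import Hashable, Sequence


def discerned_pairs(rows: Sequence[Sequence[Hashable]],
                    decisions: Sequence[Hashable],
                    mask: Sequence[int]) -> int:
    """C_R: pairs of objects with different decisions that differ on some selected attribute."""
    count = 0
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] == decisions[j]:
            continue
        if any(bit and rows[i][a] != rows[j][a] for a, bit in enumerate(mask)):
            count += 1
    return count


def fitness(rows: Sequence[Sequence[Hashable]],
            decisions: Sequence[Hashable],
            mask: Sequence[int]) -> float:
    """Assumed fitness: reward short masks and masks that discern many object pairs."""
    n = len(mask)          # number of attributes (length of the bit string)
    m = len(rows)          # number of objects
    l_r = sum(mask)        # number of "1"-s, i.e. selected attributes
    c_r = discerned_pairs(rows, decisions, mask)
    return (n - l_r) / n + c_r / ((m * m - m) / 2)


# Toy decision table with two attributes and three objects (illustrative only).
rows = [(0, 0), (0, 1), (1, 1)]
decisions = [0, 1, 1]
for mask in [(1, 0), (0, 1), (1, 1)]:
    print(mask, round(fitness(rows, decisions, mask), 4))
```

In this toy table the second attribute alone discerns every pair with different decisions, so its mask scores higher than the mask selecting both attributes.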
The described genetic algorithm makes it possible to generate a satisfactorily high number of reducts in a relatively short time. The resulting reducts may contain different attributes and may also have different cardinalities. For the set of induced reducts, the weighting factor for features was proposed that takes into account the number of reducts in which a given attribute exists, and the cardinalities of these reducts [6]:

$$WF(G_{Red}, a) = \sum_{k=k_{min}}^{k_{max}} \frac{card(RED(G_{Red}, a, k))}{card(G_{Red}) \cdot k}, \qquad (2)$$

where $k_{min}$ and $k_{max}$ are respectively the minimal and the maximal reduct cardinalities detected for the group $G_{Red}$. $RED(G_{Red}, a)$ denotes the set of all reducts from the group $G_{Red}$ that include the attribute $a$, and $RED(G_{Red}, a, k)$ is the set of reducts of length $k$ that contain the attribute $a$.
Then $card(RED(G_{Red}, a, k))$ returns for the group $G_{Red}$ the number of reducts of length $k$ that contain the given attribute $a$. The values of $WF$ range from 0 (the attribute $a$ is not included in any of the reducts in this group) to $1/k_{min}$, when the attribute is included in all the reducts and all the reducts have the same cardinality (then $k_{min} = k_{max}$). A higher value of the weighting factor indicates that the attribute appears in more reducts with lower cardinalities, and low values of $WF$ are obtained for attributes that are included in fewer reducts containing more variables. All attributes included in a group can be ordered by the scores calculated for them, and a ranking is obtained as a result.
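A short Python sketch can make the behaviour of the weighting factor concrete. It assumes that each reduct containing the attribute contributes 1 / (card(G_Red) * k), where k is the cardinality of that reduct; this assumption reproduces the stated range from 0 to 1/k_min. The function names and the toy group of reducts are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def weighting_factor(reducts: Iterable[FrozenSet[str]], attribute: str) -> float:
    """Sum 1 / (group size * reduct cardinality) over the reducts that contain the attribute."""
    group = list(reducts)
    return sum(1.0 / (len(group) * len(r)) for r in group if attribute in r)


def reduct_ranking(reducts: Iterable[FrozenSet[str]]) -> List[str]:
    """Order the attributes occurring in the group by decreasing weighting factor."""
    group = list(reducts)
    attributes = sorted(set().union(*group))
    return sorted(attributes, key=lambda a: -weighting_factor(group, a))


# Hypothetical group of three reducts (illustrative only).
group = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c", "d"})]
for attr in reduct_ranking(group):
    print(attr, round(weighting_factor(group, attr), 4))
```

An attribute occurring in many short reducts, such as `a` in this toy group, receives a higher score than one occurring only in a longer reduct, such as `d`.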
The described weighting factor promotes reducts with a small number of attributes. This way of reasoning follows from the fact that in a situation where we have two reducts and one of them has a smaller number of attributes, according to the definition of a reduct, this smaller number of attributes is sufficient to preserve the performance of the system. Moreover, it complies with the Minimum Description Length principle [23]: "the best hypothesis for a given set of data is the one that leads to the largest compression of data". Additionally, reducts with smaller numbers of attributes are preferred from a knowledge representation perspective.
2) OneR algorithm: The OneR (One Rule) algorithm is a simple classification algorithm that is used in the field of machine learning. Its purpose is to select the most conclusive feature from all available features in the dataset, in order to create a simple classification model. This is done by calculating the number of occurrences of particular class labels for each value of a given attribute in the dataset. After this process, the OneR algorithm selects the feature for which the value is the most discriminating in the context of predicting class labels. In practice, for the selected feature, a single condition is created in a decision rule that is used to classify new instances. The algorithm generates one rule per unique attribute value of the selected best feature.
The main strength of the OneR algorithm is its ability to select the most relevant feature in the context of class prediction [24]. Although the OneR algorithm is simple and does not take into account interdependencies between features, it often yields satisfactory classification accuracy. In addition, this algorithm tends to choose the attribute value that occurs most frequently, and in this way it makes it possible to ignore noise existing in the data. OneR is also called a one-level decision tree algorithm. It selects attributes from a dataset one by one and generates a different set of rules based on the error rate from the training set. Finally, it chooses the attribute that offers rules with minimum error [25].
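The procedure described above can be sketched in a few lines of Python. This is a simplified illustration for already discretised attributes, not the WEKA implementation: the function names are hypothetical, and ordering attributes by the training error of their one-attribute rule sets is an assumption about how a OneR-style ranking can be obtained.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def one_r_errors(rows: Sequence[Sequence[Hashable]],
                 labels: Sequence[Hashable]) -> Dict[int, int]:
    """For each attribute index, count training errors of its one-rule-per-value model."""
    errors: Dict[int, int] = {}
    for a in range(len(rows[0])):
        counts: Dict[Hashable, Counter] = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        # Each attribute value predicts its majority class; all other objects are errors.
        errors[a] = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return errors


def one_r_ranking(rows: Sequence[Sequence[Hashable]],
                  labels: Sequence[Hashable]) -> List[int]:
    """Order attribute indices from the lowest to the highest training error."""
    errors = one_r_errors(rows, labels)
    return sorted(errors, key=lambda a: (errors[a], a))


# Toy data: attribute 0 determines the class, attribute 1 is uninformative.
rows = [("x", "p"), ("x", "q"), ("y", "p"), ("y", "q")]
labels = ["A", "A", "B", "B"]
print(one_r_errors(rows, labels))
print(one_r_ranking(rows, labels))
```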

3) Ranking of attributes based on greedy algorithm properties:
In the research, the authors propose a ranking mechanism exploiting the properties of the greedy algorithm for the induction of decision rules [26]. Such an algorithm constructs a decision rule for each row of a decision table. In each iteration, attributes are selected to form the conditions of the rules. The selected attribute separates the maximum number of rows from a set of rows with a different class label, so a decision table is divided into sub-tables as dictated by the given attribute and the corresponding value. The partitioning of a table is completed when all rows in the sub-table, corresponding to the selected attribute, have the same class labels.
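The greedy construction of a rule for a single row can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from [26]: the function name is hypothetical, ties between attributes are broken by the lowest attribute index, and an inconsistent table raises an error.

```python
from typing import Hashable, List, Sequence, Tuple


def greedy_rule(rows: Sequence[Sequence[Hashable]],
                decisions: Sequence[Hashable],
                target: int) -> Tuple[List[Tuple[int, Hashable]], Hashable]:
    """Build a rule for rows[target] as a list of (attribute index, value) conditions."""
    row, decision = rows[target], decisions[target]
    # Rows with a different decision that are not yet separated from the target row.
    conflicting = [i for i in range(len(rows)) if decisions[i] != decision]
    unused = set(range(len(row)))
    conditions: List[Tuple[int, Hashable]] = []
    while conflicting:
        # For each unused attribute, count the conflicting rows it would separate.
        gains = {a: sum(rows[i][a] != row[a] for i in conflicting) for a in unused}
        best = max(sorted(gains), key=gains.get) if gains else None
        if best is None or gains[best] == 0:
            raise ValueError("rows with identical values have different decisions")
        conditions.append((best, row[best]))
        unused.discard(best)
        conflicting = [i for i in conflicting if rows[i][best] == row[best]]
    return conditions, decision


# Toy decision table (illustrative only): the first row needs two conditions.
rows = [(0, 0), (0, 1), (1, 0)]
decisions = [1, 0, 0]
print(greedy_rule(rows, decisions, 0))
print(greedy_rule(rows, decisions, 2))
```

Each added condition shrinks the set of conflicting rows, and the rule is complete once that set is empty.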
As shown in previous research [27], given certain assumptions about the NP class, the greedy algorithm used to induce decision rules produces results that are not far from the best approximate polynomial algorithms for minimising the length of the rules, which is important for knowledge representation. Short rules can be considered more general, so they can reflect patterns hidden in the data and help prevent overfitting, which is important for the classification process.
During research focused on the greedy algorithm, it was observed that in the majority of cases, when constructing decision rules, the greedy algorithm at each iteration selects an attribute that separates at least 50% of the remaining rows with different decisions.
The proposed ranking was based on the attributes contained in the decision rules, the percentage of separated rows with decisions different from the decision attached to a given rule, and the support of the rule. The latter element is an important factor in assessing the quality of decision rules. In order to construct the ranking, the decision rules were induced by the greedy algorithm and duplicate rules were removed from the entire set of rules. Then, for each attribute, the number of its occurrences in the rules was determined, assigning the highest positions in the ranking to the attributes with the highest number of occurrences. If the number of occurrences was the same for several attributes, then the percentage of rows separated by the given attribute was taken into account. The third factor that played a role in determining the score for each attribute was the support of the rule in which the attribute appeared, which led to the assignment of higher positions in the ranking to attributes from the rules with higher support.

C. Decision rules
Decision rules belong to popular forms used for data representation. They are induced from datasets very often presented as a decision table $T = (U, A \cup \{d\})$ [16], where $U$ is a non-empty, finite set of objects, $A = \{a_1, \ldots, a_m\}$ is a set of condition attributes, i.e., $a_i : U \to V_a$, where $V_a$ is the set of values of attribute $a_i$ called the domain of $a_i$, and $d \notin A$ is a distinguished attribute called a decision, with values from the set $V_d$. The decision rules take the form:

$$(a_{i_1} = v_1) \wedge \ldots \wedge (a_{i_k} = v_k) \rightarrow d = v_d.$$

Pairs $(a_{i_1} = v_1)$ are called descriptors or conditions. The number of conditions in the premise part of a rule is its length. Short rules are preferred from the point of view of knowledge representation and with regard to the MDL principle. They are easier to understand and interpret. When assessing the quality of decision rules, support is another important factor. It is the number of such objects from the decision table whose attribute values satisfy the premise part of the rule, and which have the same decision as the one attached to the rule. This measure makes it possible to discover major patterns present in the data.
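The two quality measures defined above, length and support, are straightforward to compute. In the sketch below a rule is represented as a pair (conditions, decision), with conditions given as a dictionary from attribute index to required value; this representation and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

# A rule is (conditions, decision); conditions map attribute index -> required value.
Rule = Tuple[Dict[int, Hashable], Hashable]


def rule_length(rule: Rule) -> int:
    """Length: the number of conditions in the premise part of the rule."""
    conditions, _ = rule
    return len(conditions)


def rule_support(rule: Rule,
                 rows: Sequence[Sequence[Hashable]],
                 decisions: Sequence[Hashable]) -> int:
    """Support: objects that satisfy the premise and carry the rule's decision."""
    conditions, decision = rule
    return sum(
        1
        for row, d in zip(rows, decisions)
        if d == decision and all(row[a] == v for a, v in conditions.items())
    )


# Toy decision table (illustrative only).
rows = [(0, 1), (0, 0), (1, 1), (0, 1)]
decisions = ["x", "x", "y", "y"]
rule: Rule = ({0: 0}, "x")
print(rule_length(rule), rule_support(rule, rows, decisions))
```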
There is a wide variety of approaches for induction of decision rules. Among the exact ones, Boolean reasoning and extensions of dynamic programming should be mentioned [28]. The construction of decision rules with maximum support or minimum length is considered an NP-hard problem, so different heuristics are used. They are based on modifications of exact approaches, different kinds of greedy algorithms, methods relying on sequential covering, genetic algorithms, and many others. In rough set theory, induction of rules based on a reduct is also a popular approach. Then each rule has length equal to the cardinality of the reduct, and each object from a decision table has assigned values corresponding to condition attributes included only in this reduct.
Apart from serving as a form of knowledge representation, decision rules are very often used as classifiers. In this situation, the rule filtering process can be treated as a method of pruning the rule set to fine-tune the classifier by reducing the number of rules. The use of rule filtering in the framework of the feature selection process often leads to improved classification accuracy.
In the experiments performed, the decision rules were induced by the exhaustive algorithm implemented in the RSES system. It constructs all minimal decision rules, i.e., rules with minimal numbers of descriptors (pairs attribute = value) in their premise parts. Then the rules were filtered sequentially, that is, added or removed according to the search strategy, driven by the studied rankings of attributes.

III. STYLOMETRIC DATA
A writing style is an individual characteristic, based to some extent on social and cultural background, education, lifetime experiences, elements that are learnt, but also on personal linguistic preferences and habits. To obtain a definition of an authorial profile, access to some representative samples of writing is needed. Comparative analysis and stylometric data mining lead to the discovery of patterns specific to writers and the construction of approximating descriptions that can be applied to text samples of unknown or unconfirmed authorship to find the closest match. This way of carrying out the authorship attribution task means solving a classification problem [29]; therefore, a dataset to be prepared needs to include some training and test samples, all relying on a set of selected efficient style-markers [30].
836 PROCEEDINGS OF THE FEDCSIS. WARSAW, POLAND, 2023
Stylometric descriptors that work best refer to common language elements as they are used almost subconsciously, so they are less prone to forgery or imitation. Lexical and syntactic markers are often employed for the task [31]. They provide quantitative characteristics through frequency of occurrence for function words and punctuation marks, which results in real-valued features. In the experiments reported, the set of markers contained 24 elements with values calculated over text samples obtained by partitioning long novels by four acclaimed writers into smaller chunks. The authors studied, Edith Wharton, Mary Johnston, Jack London, and James Oliver Curwood, were paired according to gender [32], in order to form two datasets with binary authorship attribution.
The division of long texts into smaller parts resulted in imposing a specific stratification of the input space [33]. To avoid bias when evaluating the performance of a classifier, each prepared dataset (the male writer dataset and the female writer dataset) included one training set and two test sets. The samples contained in sets of different types were based on separate novels. With binary classification, balanced data and the same importance of all classes, classification accuracy was used as a measure of performance, providing information on the average portion of correctly attributed text samples from test sets.
Among popular data mining approaches, those that involve induction of decision rules belong to the most advantageous. They not only enable assigning authors to samples, but also enhance understanding of the stylometric domain by providing insight into linguistic patterns detected for authors through the transparent form of discovered rules. Short rules, with a few conditions in their premises, are preferred over long rules [24]. The former are more general, while the latter, with their overly detailed definitions, can cause over-fitting.
The datasets were discretised with the Fayyad and Irani algorithm [34]. It is one of the top-down supervised methods, which starts with assigning one large interval to represent in the discrete domain all the values of a transformed variable. Then, referring to the MDL principle and calculation of entropy [23], candidates for cut-points are evaluated to discover which of them best support the distinction of classes. If further partitioning is disadvantageous with respect to entropy, the processing stops. As a consequence, it is possible that some variables are removed from consideration in the discrete domain when they have a single categorical representation. In the experiments, for the female writer dataset 20 out of the total of 24 features received more than a single bin, and for the male writer dataset the set of attributes was reduced to 22.
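The acceptance test for a single cut-point can be sketched in Python. The code below follows the commonly stated form of the MDL stopping criterion of Fayyad and Irani (a cut is accepted when the information gain exceeds (log2(N - 1) + Delta) / N); it evaluates only one level of partitioning, the function names are hypothetical, and it is not the implementation used in the experiments.

```python
import math
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def mdlp_cut(values: Sequence[float], labels: Sequence[Hashable]) -> Optional[float]:
    """Return the best cut-point if the MDL criterion accepts it, otherwise None."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys: List[Hashable] = [labels[i] for i in order]
    n = len(xs)
    base = entropy(ys)
    best = None  # (gain, cut value, split index)
    for i in range(1, n):
        if xs[i - 1] == xs[i]:
            continue  # a cut can only lie between two distinct values
        gain = base - i / n * entropy(ys[:i]) - (n - i) / n * entropy(ys[i:])
        if best is None or gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2, i)
    if best is None:
        return None
    gain, cut, i = best
    left, right = ys[:i], ys[i:]
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    return cut if gain > (math.log2(n - 1) + delta) / n else None


# A feature that separates the classes well receives a cut-point ...
print(mdlp_cut([1, 2, 3, 4, 10, 11, 12, 13], ["A"] * 4 + ["B"] * 4))
# ... while an uninformative one keeps a single interval.
print(mdlp_cut([1, 2, 3, 4], ["A", "B", "A", "B"]))
```

A variable for which no cut is accepted at the top level keeps a single interval, which corresponds to the single categorical representation mentioned above.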

IV. PERFORMED EXPERIMENTS
The experimental process of the research works consisted of the following stages:
• Preparation of two datasets (female writers and male writers), which included discretisation by the Fayyad and Irani algorithm applied to all condition attributes;
• Construction of three rankings of attributes:
- Reducts: based on reducts and the proposed weighting factor. Using a genetic algorithm implemented in the RSES system, one group of 150 reducts was generated. The obtained reducts consisted of different attributes from the whole set of available features and had different cardinalities. The weighting factor defined in Eq. (2) took into account all these elements and returned scores for the variables. The ordering of attributes by their scores resulted in the ranking;
- OneR: based on the OneR algorithm implemented in WEKA software [5];
- Greedy: based on the properties of the greedy algorithm for induction of decision rules, that is, the number of rules in which a given attribute occurs, the percentage of separated rows with different decisions and the support of decision rules;
• Induction of decision rules by the exhaustive algorithm, for the input datasets;
• Filtration of sets of rules according to sequential forward selection and sequential backward elimination driven by attributes included in a given ranking;
• Evaluation of performance for rule-based classifiers with test sets;
• Assessment of the quality of rule sets from the point of view of knowledge representation, i.e., taking into account the number of rules, average length and average support;
• Comparative study of results, for two search directions, two rule selection strategies, and three rankings.
Details of all steps are provided below, along with comments on the results obtained.

A. Rankings
For the female and male writer datasets, the rankings obtained were presented in rows of Table I (where the letters F and M indicate the female and male writer datasets, respectively).The row Position denotes the position of the given attribute in a ranking, and 1 is considered the highest ranking position, assigned to the most important feature.
From the ranking created based on the properties of the greedy algorithm, in addition to the attributes with a single categorical representation, other variables were also excluded because they did not appear in the induced decision rules. Therefore, these rankings were shorter and contained 16 attributes for the female dataset and 17 for the male dataset.
It is worth noting that for the male writer dataset, all three rankings assigned the highest position to the same attribute: attr23. In the case of the female set, this attribute was ranked second only in the ranking related to the greedy algorithm. Furthermore, the features disregarded by the greedy ranking (attr9, attr12, attr15, and attr20 for female writers, and attr4, attr5, attr13, attr17, and attr20 for male writers) were not recognised as irrelevant or close to irrelevant by the other rankings; for example, for the male writers attr17 was ranked second by the OneR algorithm.

B. Strategies employed in decision rule filtering
Forward selection was performed by sequentially filtering and increasing the set of decision rules.Starting with the highest ranking attribute, from the entire set of rules those were selected that contained conditions (in their premises) relating only to this attribute.Then, in the second step, a subset of recalled attributes was extended to the top two positions, and such rules were selected that relied only on these two variables as conditions.Next, three top ranking features were studied, and so on.In each step of the sequential search, the conditions in the rules were limited only to the currently selected subset.The forward rule filtering process continued until all available features and rules were included in the set considered.
The backward elimination was achieved by sequentially decreasing the set of decision rules.Starting with the attribute in the lowest ranking position, those rules were selected from the entire set of rules, which contained in their premises the condition referring to this very attribute.If a rule included some other attributes that worked as conditions, then that rule was not removed from the set of rules.The second step of backward reduction meant rejection of rules with conditions limited to the two lowest ranking variables, and so on, until the set of rules was exhausted.The difference between the two strategies involved is shown in the illustrative small example.
Let us assume a set of five condition attributes, for simplicity ranked as follows, where 1 is considered the top ranking position, and 5 the bottom of the ranking: The set of rules, subject to filtering driven by ranking, consists of eight elements.
Rule 1: with condition on attr1
Rule 2: with conditions on attr2 and attr3
Rule 3: with conditions on attr1 and attr5
Rule 4: with conditions on attr1 and attr4
Rule 5: with condition on attr3
Rule 6: with conditions on attr3 and attr5
Rule 7: with conditions on attr2 and attr5
Rule 8: with condition on attr4

For backward elimination, the processing starts with all rules included in the recalled set. Then, for the filtering steps, the resulting sets are as follows.
The listed steps correspond to forward selection, where a rule is recalled only if all its conditions refer to the selected attributes.

Step 1: Selected attributes: attr1, recalled rules: 1
Step 2: Selected attributes: attr1, attr2, recalled rules: 1
Step 3: Selected attributes: attr1, attr2, attr3, recalled rules: 1, 2, 5
Step 4: Selected attributes: attr1, attr2, attr3, attr4, recalled rules: 1, 2, 4, 5, 8
Step 5: Selected attributes: all, recalled rules: all rules

The process of rule filtering carried out for the Greedy rankings was slightly different from that for the other two rankings, because the rule sets induced by the exhaustive algorithm included rules with conditions on features that were absent from the Greedy rankings. Therefore, the first step of rule elimination was to remove rules containing only attributes that did not appear in these rankings, while the last step of forward selection was to add these rules.
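The two strategies from the illustrative example can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `forward_recall` and `backward_retain` are hypothetical, rules are reduced to the sets of attributes in their premises, and the ranking attr1 to attr5 is assumed to follow positions 1 to 5. Both functions are indexed by the number `k` of top-ranked attributes still under consideration, so backward elimination with `k` corresponds to discarding the `5 - k` lowest-ranked attributes.

```python
# Toy data from the example: rule id -> attributes used in the rule premise.
# The ranking order attr1..attr5 (top to bottom) is an assumption of this sketch.
ranking = ["attr1", "attr2", "attr3", "attr4", "attr5"]
rules = {
    1: {"attr1"},
    2: {"attr2", "attr3"},
    3: {"attr1", "attr5"},
    4: {"attr1", "attr4"},
    5: {"attr3"},
    6: {"attr3", "attr5"},
    7: {"attr2", "attr5"},
    8: {"attr4"},
}


def forward_recall(rules, ranking, k):
    """Recall rules whose conditions refer only to the k top-ranked attributes."""
    selected = set(ranking[:k])
    return sorted(r for r, attrs in rules.items() if attrs <= selected)


def backward_retain(rules, ranking, k):
    """Keep rules unless all their conditions refer to discarded attributes,
    i.e. to attributes ranked below position k."""
    discarded = set(ranking[k:])
    return sorted(r for r, attrs in rules.items() if not attrs <= discarded)


if __name__ == "__main__":
    for k in range(1, len(ranking) + 1):
        print(f"forward,  top {k}: {forward_recall(rules, ranking, k)}")
    for k in range(len(ranking), 0, -1):
        print(f"backward, top {k} kept: {backward_retain(rules, ranking, k)}")
```

Under these assumptions the forward results reproduce the five steps listed above, while the backward results are derived from the verbal description of the strategy rather than taken from the paper.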

C. Performance of rule-based classifiers
For all rule-based classifiers obtained in the decision rule filtering process, performance was evaluated with test sets. Fig. 1 presents the average classification accuracy obtained. As can be observed in Fig. 1, in the case of the Greedy column and the backward elimination strategy, rejecting rules containing only attributes not included in the ranking resulted in the same classification accuracy as for the entire set of attributes. For forward selection, the results for the last step of selecting features included in the rankings differed slightly from the ones given in the bottom row, obtained after adding the rules with attributes that did not appear in this ranking. For female writers, a small improvement was noted, and for male writers, a small decrease was visible.
When the backward elimination strategy was combined with the Reduct and OneR rankings and applied to the female writer dataset, the classification accuracy was at least at the reference level for all ranking positions considered, even in the last step of filtering, for the attribute in the first position in the rankings. The highest classification accuracy of 0.989 was obtained for the Greedy and OneR rankings, and was related to the third position in these rankings. For the ranking based on reducts this value was slightly smaller (0.983) and occurred in processing of the fourth position in the ranking.
In the case of forward selection executed for the female writer dataset, the highest possible classification accuracy, equal to 1.0, was reached for the 11 attributes placed at the top positions in the Greedy ranking. For the OneR ranking the maximum was equal to 0.989 and for the Reduct ranking 0.978, and both were detected when the twelfth positions were processed.
For the male writer dataset, backward elimination returned the same results for the top position in all three rankings, since this position was occupied by the same attribute. The highest classification accuracy of 0.967 was obtained for the Greedy ranking, at the sixth top position in the ranking. It was also the highest improvement noted for this dataset. Apart from the top two ranking positions, in all remaining filtering steps the classification accuracy was either the same as or better than the reference point. For the OneR ranking, the best performance (0.956) corresponded to the fifth top ranking position. With the exception of the top ranking position, for the OneR ranking along the entire rule filtering path the reported performance was at least as good as for the entire sets of rules and attributes considered. The ranking based on reducts brought the worst results among the three rankings; however, even here there were still cases of maintaining or increasing performance for the reduced sets of rules.
In the forward search applied to the male dataset, the OneR ranking was the most advantageous: for the fifteenth ranking position the maximal classification accuracy of 0.961 was recorded. The second best level of predictions (0.956) was obtained for the nineteenth position of the Reduct ranking. The Greedy ranking came last with the highest accuracy of 0.928; however, this resulted from processing the ninth ranking position, so more decision rules and features were discarded than in the other two cases.

D. Characteristic of rule-based models
The entire process of rule filtering driven by rankings involved two search directions, two strategies for rule selection, and three rankings. For all the sets and subsets of decision rules constructed, their characteristics were observed, as shown in Table II. These observations included the number of rules (NoR column), average rule length (Len column), and average rule support (Supp column). The column Attr points to the ranking position considered.
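The three characteristics can be computed directly from a filtered rule set. The following is a minimal sketch under stated assumptions: the function name `rule_set_characteristics` and the rule representation (a pair of a condition dictionary and a precomputed support count) are hypothetical, and the example values are invented for illustration only. How support is counted is not restated in this section, so it is treated here as a given number per rule.

```python
from statistics import mean


def rule_set_characteristics(rules):
    """Return NoR, Len and Supp for a list of (conditions, support) pairs.

    conditions: dict mapping attribute name -> required value (rule premise)
    support:    precomputed support of the rule (assumed to be given)
    Returns None for an empty rule set, for which averages are undefined.
    """
    if not rules:
        return None
    return {
        "NoR": len(rules),                                  # number of rules
        "Len": mean(len(conds) for conds, _ in rules),      # average rule length
        "Supp": mean(support for _, support in rules),      # average rule support
    }


# Invented example: three rules with 1, 2 and 3 conditions.
example = [
    ({"attr1": "low"}, 30),
    ({"attr2": "high", "attr3": "low"}, 20),
    ({"attr1": "low", "attr4": "mid", "attr5": "high"}, 10),
]
print(rule_set_characteristics(example))
```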
As could be expected, analysis of the rule sets showed that as the number of rules in the set decreased, their average lengths tended to decrease, and the average supports increased. This was particularly evident in the rows at the bottom or close to the bottom of the tables. The average values relating to the shortest rules with the highest support were marked in bold.
In the case of forward selection, the differences regarding the number of rules, their length, and support were more visible than in the case of backward elimination. This was due to the different strategies employed for rule selection in the two cases.
With forward as the search direction, the processing started with the empty set of rules and then, gradually, in each step some recalled rules were added. These rules could include conditions limited to the variables in the subset considered. In the first step only the top ranking attribute was taken into account, in the second step the top two were accepted, and so on. If a rule also contained conditions on other features (placed somewhere lower in the ranking), then it was not included in the recalled set. Therefore, the ranking position being processed always directly gave the number of variables studied, and the subset of features present in the rules was explicitly visible.
The strategy applied in the backward elimination of decision rules started with the entire set of rules, and then groups of rules were gradually excluded, taking into account conditions on attributes from the lowest positions in a ranking. In this case, the assumption was that the eliminated rules should contain only attributes considered and discarded so far in the ranking. If a rule also included conditions on other, higher ranking features, then such a rule was kept in the remaining set. This processing resulted in operation on higher numbers of rules for the same ranking position than the strategy applied in the forward selection. In fact, for each ranking position the set of rules recalled by forward selection was a subset of the rules retained by backward elimination. This was especially striking for the Greedy ranking, in the numbers of rules reported as characteristics of the constructed rule sets.
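The subset relation between the two strategies follows directly from the two selection conditions. A short derivation is given here as a sketch; the notation is introduced for illustration only and does not come from the paper. It assumes that every rule has at least one condition in its premise.

```latex
% Assumed notation: R is the full rule set, A(r) the non-empty set of attributes
% in the premise of rule r, and T_k the set of the k top-ranked attributes.
\begin{align*}
F_k &= \{\, r \in R : A(r) \subseteq T_k \,\}
    && \text{(rules recalled by forward selection)} \\
B_k &= \{\, r \in R : A(r) \cap T_k \neq \emptyset \,\}
    && \text{(rules retained by backward elimination)} \\
r \in F_k &\;\Rightarrow\; \emptyset \neq A(r) \subseteq T_k
    \;\Rightarrow\; A(r) \cap T_k = A(r) \neq \emptyset
    \;\Rightarrow\; r \in B_k
\end{align*}
```

Hence $F_k \subseteq B_k$ for every ranking position $k$, with equality when $T_k$ contains all attributes, since both sets are then equal to $R$.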
The advantage of the backward elimination strategy was visible in the classification results, in particular for the female dataset, where for almost every position of the ranking the accuracy of rule-based classifiers was at least as good as the reference level considered for all variables from the set. Thus, this direction and rule filtering strategy contributed to enhancing the power of the classifier. The drawback of such processing lies in keeping higher numbers of attributes under consideration, and in the lack of a clear specification of the subset taken into account in each step. If there was a rule referring to all features, such a rule would be kept to the very end, to the last step of the filtering process, despite its significant length that indicates too close

In the case of the backward elimination strategy, the cut in the number of rules was generally smaller than for the forward search. The smallest reduction occurred for the Greedy ranking, for the female set. For the male set, the number of rules corresponding to the attribute in the highest ranking position was the same for all rankings, as was the average rule length. Furthermore, for this dataset, the number of rules decreased about 10 times under this search strategy. For the Reduct and OneR rankings and female writers the decrease was even greater.
The experiments carried out with varying search directions and strategies for rule selection enabled studying the effectiveness of the three rankings in the rule filtering process. The proposed Greedy ranking held its ground against the other two, leading to noticeably improved predictions for rule sets of decreased cardinalities. Its merits are evidenced by how often it led to the best results given in Table IV.

V. CONCLUSIONS
The paper provides an illustrative example for the proposed research methodology dedicated to decision rule filtering governed by attribute rankings. The process of rule selection was executed with sequential backward reduction, where an entire set of induced rules is available at the beginning and then some elements from this set are discarded; and with sequential forward search, where the processing starts with the empty set to which recalled elements are added gradually. Along with two search directions, two strategies for rule selection were used, one with recalling rules including conditions only on variables from the currently considered subset, and the other with finding rules dependent on at least one of the attributes in the studied set.
In the investigations, three rankings of attributes were employed. The proposed ranking based on the percentage of separated rows and the properties of the greedy algorithm was confronted with the previously defined ranking referring to decision reducts, and the OneR ranker available in the popular WEKA environment. For the three rankings, the selection of rules was performed in the two directions, and the resulting rule sets were analysed with respect to the properties of constituent decision rules, such as their numbers, average length, and average support, but also from the point of evaluation of performance for all constructed rule-based classifiers when applied for labelling of samples from test sets.
The results from the experiments indicate that for all three rankings and search paths it was possible to obtain a noticeable reduction of attributes while at least maintaining the power of inducers, at the same time improving characteristics of rule sets. The special focus on the Greedy ranking made it possible to discover that it not only led to discarding some variables from the available sets, treating them as irrelevant, but also proved effective for rule filtering.
Future research will include application of the Greedy ranking in the feature selection process for other types of inducers, with different mathematical backgrounds and modes of operation. Also, the influence of the discretisation step will be studied, as one of the factors greatly influencing the representation of data and the patterns present in it.
BEATA ZIELOSKO ET AL.: FILTERING DECISION RULES DRIVEN BY SEQUENTIAL FORWARD AND BACKWARD SELECTION OF ATTRIBUTES. PROCEEDINGS OF THE FEDCSIS. WARSAW, POLAND, 2023

Figure 1. Accuracy of rule-based classifiers, for female and male writers. Decision rules induced by the exhaustive algorithm were selected through the backward elimination (column Back) and forward search (column Forw) strategies, with the conditions for recalling rules governed by the three rankings (groups of columns: Reduct, OneR, Greedy). The results shown in the bottom row provide the reference point, because they correspond to the case where the entire sets of attributes and rules were taken into account. The X mark denotes the situation where no rules were included in the set of recalled rules. The coloured cells indicate where the classification accuracy exceeded the reference point. The intensity of cell colour depends on how much the accuracy was improved. For each step of the rule filtering process, the columns Attr indicate the ranking position considered (which for forward search corresponds to the number of variables taken into account).