Towards Reliable Rule Mining about Code Smells: The McPython Approach

CODE smell is a risky pattern in code that can lead, in the future, to problems with code maintenance. One of the approaches to identifying smells in code is metric-based smell detection. A classic example is the God Class smell, which can be detected using three metrics (see, e.g., [1]–[3]):
• Weighted Method Count (WMC - sum of McCabe's complexity of all methods in the analysed class),
• Tight Class Cohesion (TCC - relative number of directly connected methods within the analysed class), and
• Access to Foreign Data (ATFD - number of classes containing attributes referenced by the analysed class directly or via get/set methods).
To make a decision (smelly / not smelly), the computed metrics are compared against predefined thresholds. Thus, the quality of smell detection depends not only on the set of chosen metrics, but also on their thresholds.
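The threshold-based decision rule can be sketched as follows; the threshold constants below are illustrative placeholders, not values prescribed by any particular detector, and would in practice be calibrated or mined as discussed later.

```python
# A minimal sketch of metric-based God Class detection.
# The threshold values are hypothetical and for illustration only.
WMC_THRESHOLD = 47     # class counts as "very complex" above this value
TCC_THRESHOLD = 1 / 3  # class counts as "non-cohesive" below this value
ATFD_THRESHOLD = 5     # class "uses many foreign attributes" above this value

def is_god_class(wmc: int, tcc: float, atfd: int) -> bool:
    """Classic conjunction: complex, non-cohesive, and accessing foreign data."""
    return wmc > WMC_THRESHOLD and tcc < TCC_THRESHOLD and atfd > ATFD_THRESHOLD

print(is_god_class(wmc=120, tcc=0.1, atfd=12))  # → True (smelly)
print(is_god_class(wmc=10, tcc=0.8, atfd=0))    # → False (clean)
```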
Unfortunately, the quality of existing smell detectors is still not satisfactory (cf. [4]), and there is a need for more research in the area. One of the issues worth investigating is the impact of a set of code smells on the severity of the detriment they cause. To conduct this research in a clear and reproducible way, one needs an appropriate workbench (a critical review of the literature in the area is presented in [5]).

THE PROPOSED WORKBENCH
In this paper, it is postulated that empirical research on smell detectors should be based on (1) precise definitions of the analysed smells, and (2) smell detection rules (including metric thresholds) mined from software repositories using machine learning (ML).
The overall architecture of the proposed workbench is illustrated in Fig. 1. Given a code repository, a code smell detector identifies all smelly classes, while the issue detector identifies troublesome classes (e.g., defective classes; here one can use the idea proposed by Śliwerski et al. in [6]). The reports generated by both detectors are consolidated to produce a decision table (the decision table of Fig. 1 refers to the God Class smell with three thresholds, WMC, TCC, and ATFD, corresponding to the three metrics mentioned earlier). Given a decision table, one can use, e.g., the C4.5 algorithm to obtain a decision tree (see [7] or [8]). Another option is to apply the rough-set approach (see, e.g., [9]).

To run a smell detector whose definition is written in McPython, two steps are needed:
• to translate it to Python 3, and
• to generate a model of the analysed code (the smell detector expects a code model on its input, not the code itself).
The process is illustrated in Fig. 2. An advantage of running a smell detector on a code model instead of the code itself is the possibility of using the same definition of a code smell on repositories written in different programming languages, provided that one has a code modeller for a given language.
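The idea of mining a threshold from a decision table can be illustrated with a toy sketch. A real workbench would use C4.5 (e.g., via Weka) or a rough-set tool; here, a single binary split on one metric is found by brute force using the same information-gain criterion that C4.5 applies at each tree node. The decision table below is fabricated for illustration.

```python
# Toy sketch: mine one metric threshold from a decision table
# using the information-gain criterion (as in C4.5's node splits).
import math

# Hypothetical decision table rows: (WMC, TCC, ATFD, is_troublesome)
table = [
    (120, 0.10, 12, True),
    (90,  0.20, 8,  True),
    (15,  0.70, 1,  False),
    (30,  0.50, 2,  False),
    (60,  0.25, 6,  True),
    (10,  0.90, 0,  False),
]

def entropy(labels):
    """Shannon entropy of a boolean label list."""
    n = len(labels)
    result = 0.0
    for value in (True, False):
        p = sum(1 for l in labels if l == value) / n
        if p > 0:
            result -= p * math.log2(p)
    return result

def best_threshold(rows, column):
    """Brute-force the split on `column` with maximal information gain."""
    base = entropy([r[-1] for r in rows])
    best = (None, -1.0)
    for candidate in sorted({r[column] for r in rows}):
        left = [r[-1] for r in rows if r[column] <= candidate]
        right = [r[-1] for r in rows if r[column] > candidate]
        if not left or not right:
            continue
        gain = base - (len(left) / len(rows)) * entropy(left) \
                    - (len(right) / len(rows)) * entropy(right)
        if gain > best[1]:
            best = (candidate, gain)
    return best

threshold, gain = best_threshold(table, column=0)  # mine a WMC threshold
print(threshold, gain)  # → 30 1.0 (perfect split on this toy table)
```

On this toy table, WMC ≤ 30 perfectly separates clean from troublesome classes; a full C4.5 run would recurse on the resulting partitions to build the complete tree.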
The current version of the McPython translator is written in Python 3 (Python accepts Unicode as an input). The model of the analysed code is implemented as a list of all its entities (the position of an entity on the list serves as its identifier) and is read with the library function json.loads. McPython constructs concerning operations over sets, e.g., the universal quantifier (∀) or summation (∑), are translated as calls to an appropriate function (the definitions of those functions are added to the generated code). Those functions have two parameters: a set of code entities (represented by their identifiers) and a condition or expression that is evaluated for each element of the given set (here the lambda expressions of Python proved very useful).
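Such set-operation functions might look as follows; the function names, entity fields, and model contents are illustrative assumptions, not McPython's actual generated API.

```python
# Sketch of quantifier/summation helpers taking a set of entity
# identifiers and a lambda, in the spirit of the translation described
# above. Names and model fields are hypothetical.
import json

def forall(entity_ids, condition):
    """Universal quantifier (∀): condition holds for every entity."""
    return all(condition(e) for e in entity_ids)

def summation(entity_ids, expression):
    """Summation (∑): sum of an expression over each entity."""
    return sum(expression(e) for e in entity_ids)

# The code model is a JSON list of entities; an entity's position on
# the list serves as its identifier.
model = json.loads('[{"kind": "class", "name": "A"},'
                   ' {"kind": "method", "name": "m", "complexity": 3},'
                   ' {"kind": "method", "name": "n", "complexity": 4}]')

method_ids = [i for i, e in enumerate(model) if e["kind"] == "method"]
wmc = summation(method_ids, lambda i: model[i]["complexity"])
print(wmc)  # → 7 (sum of the two methods' complexities)
print(forall(method_ids, lambda i: model[i]["complexity"] < 10))  # → True
```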
The code modeller for the Python language (cf. Fig. 2) is built with the help of Python's ast module and the NodeVisitor class contained in it. First, all class nodes of a given abstract syntax tree are visited, and then their methods are analysed. The collected data are stored as an array of dictionaries and transformed into JSON with the dumps function of the json module.
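A minimal sketch of such a modeller is shown below; the collected fields are illustrative, and a full modeller would record far more detail (attributes, calls, accessed foreign data, etc.).

```python
# Minimal code modeller sketch: visit class nodes of the AST, record
# their methods, and dump the entity list to JSON. Field names are
# hypothetical, not the ones used by the actual McPython modeller.
import ast
import json

class ClassCollector(ast.NodeVisitor):
    def __init__(self):
        self.entities = []  # list position = entity identifier

    def visit_ClassDef(self, node):
        methods = [n.name for n in node.body
                   if isinstance(n, ast.FunctionDef)]
        self.entities.append({"kind": "class",
                              "name": node.name,
                              "methods": methods})
        self.generic_visit(node)

source = """
class Account:
    def deposit(self, amount): ...
    def withdraw(self, amount): ...
"""

collector = ClassCollector()
collector.visit(ast.parse(source))
print(json.dumps(collector.entities))
```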