CADM: Big Data to Limit Creative Accounting in Saudi-Listed Companies

Global financial scandals have demonstrated the harmful impact of creative accounting, a practice in which managers manipulate financial reports to conceal a company's actual performance and influence stakeholders' decision-making. Studies have shown that Saudi-listed companies use it in preparing financial statements. Although it poses a significant risk to the Saudi financial market, detecting it with ordinary auditing procedures remains challenging. Big data analytics has provided practical applications in auditing, and recently the use of Deep Learning in fraud detection has delivered remarkably accurate results. Still, limited research has considered it for detecting creative accounting. This study proposes a novel framework using a hybrid learning approach. The model is trained on a simulated dataset of financial statements prepared (i.e., deliberately manipulated) based on financial statements available in the literature for supervised learning. It is then tested on real-world financial reports from the Saudi Open Data and Saudi Statistics. Our framework contributes to the literature with a new governing approach to limit creative accounting and improve financial reporting quality.


I. INTRODUCTION
CREATIVE accounting (CA) practices have negatively affected financial reporting quality and undermined trust in the information extracted from financial statements. The issue with CA is that it does not necessarily violate the International Financial Reporting Standards (IFRS), yet it has the same severe consequences as Financial Statement Fraud (FSF) [1]. Besides, CA is more enigmatic and almost impossible to detect using traditional auditing techniques. Some studies consider CA and Earnings Management (EM) as FSF, while others identify the thin line between them [1]. Another harmful practice that can be identified alongside CA and FSF is Window Dressing (WD), where managers exploit the freedom of interpretation to manipulate the presentation of reports. The term 'creative' gives a positive impression of the practice whereas, in reality, it has been considered the primary cause behind many financial scandals, such as Enron and WorldCom in the U.S., and Parmalat, Royal Ahold, and Vivendi Universal in Europe [2][3][4][5][6][7][8]. These incidents confirm that account manipulation is designed to gain a temporary benefit, eventually leading to financial scandals and substantial losses. In Saudi Arabia, cases of CA exist and, according to the literature, employ the same accounting techniques for similar reasons. Considering the proposition that less efficient markets tend to have greater tolerance for manipulation, the weak-form efficiency of the Saudi stock market Tadawul, as shown by [9], indicates a high possibility of manipulation. Many studies have investigated the practice in the region, but none include real-time case studies [10] [11] [12] [13]. The results of these studies agree that financial statements do not represent the true and fair position of a company, despite being approved through auditing procedures.
However, big data analytics and AI models are currently employed in different business sectors. Many models and techniques have been developed and validated to replace or supplement traditional accounting and auditing procedures. In our context, the literature is rich with significant contributions to FSF detection using data mining and machine learning (ML). The availability of financial (FIN) and non-financial (N-FIN) data, and the possibility of including these data types in advanced intelligent models, motivated researchers to develop many applications that meet business needs. Indeed, detecting CA using traditional techniques (e.g., accrual-based detection) requires data that are non-public, inaccessible, or time-consuming to obtain [14]. On the other hand, the literature is rich with examples of ML and Deep Learning (DL) models that can learn from publicly available FIN and N-FIN data and produce highly accurate predictions.
The use of big data analytics has made it possible to include N-FIN data and aggregate accounting data with data from other sectors. Consequently, accounting results are now more accurate and credible [15]. Moreover, big accounting firms are adopting server-based platforms that support auditors by processing real-time financial data collected directly from clients. For example, in April 2022, PwC announced the use of a new cloud-based auditing platform named Aura that provides many services powered by advanced analytics. These capabilities motivate us to add to the capacity and efficiency of these analytical tools for enhanced detection procedures.
This study aims to overcome the misrepresentation of information in financial reports in Saudi Arabia and improve financial reporting quality by proposing a framework for the Creative Accounting Detection Model (CADM). The model suggests a Hybrid Deep Learning (HDL) approach that combines an Artificial Neural Network (ANN), a Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM). It uses FIN and N-FIN data related to selected Saudi-listed companies in different sectors. CADM introduces a new approach to evaluating published financial information more accurately and limiting CA. It will also create new insights into the quality of financial statements and indicate the extent to which these practices are incorporated.

II. RESEARCH BACKGROUND
A. Overview of CA
CA is a term that describes the accounting procedures used to present an enhanced image of an enterprise that misleads users [16]. The word 'creative' means using new, expertly invented ways of preparing accounts [6] to make the company more appealing to stakeholders without committing fraud. The practice is considered legal within IFRS but deviates from its goal and spirit, as it operates in the grey area between legitimacy (in the context of IFRS) and fraud. As shown in Fig 1, CA can exceed the regulations and become fraudulent, in which case it is easier to detect.
The quality of financial reporting (measured by different methods such as discretionary accruals, accounting conservatism, and asymmetry of information [17]) is significantly affected by CA practices [18]. On the international scale, further financial problems have been discussed in the literature [19], such as low liquidity, tightening credit conditions, and price volatility. However, the literature has no single agreed definition of the practice [20]. The most recent definition, by [6], describes these practices as follows: "They are the methods which deviate from the rules and regulations, it is an excessive complication and use of innovative ways to visualise income, assets, and liabilities, it is an innovative and aggressive way of reporting financial statements, it is a systematic misrepresentation of the true and fair financial statements." It can be concluded that any accounting procedure meant to present non-realistic financial information is a form of CA.

B. CA Incentives
Incentives differ according to the business type and size. A manager in a private company engages in this practice for different reasons than a manager in a public company, since the practice affects various parties differently, and sometimes affects more than one party in the same incident. Regardless of these different impacts, the incentives originate from the Agency Problem (AP). Accordingly, we classify the incentives based on three main areas of primary impact: internal affairs, the stock market, and third parties' decisions, as shown in TABLE I. Whether these incentives arise from personal, organisational, or political backgrounds, they ultimately motivate report preparers, individually or in groups, to engage in CA.

C. CA Techniques
Financial report preparation involves many techniques that can differ innovatively from one firm to another. There are rules and policies for these techniques, yet managers apply the approach that fulfils their personal or institutional interests. CA techniques can be categorised by the type of report being prepared [21], the accounting area in which they operate [12] [22], the accounting items being used [23], and the type of creativity being applied [24]. This study categorises CA techniques using two layers: the relevant accounting area, as in [12], and the accounting items, as in [23]. TABLE II shows accounting techniques categorised by both approaches and highlights the techniques applied in this study.

D. CA Detection
Whether CA practices should be recognised as a crime is still debated [19] [2] [6] [21]. Therefore, big data analytics studies have been limited to FSF detection [25][26] and FSF prediction models [27]. Earlier FSF detection models, like the M-score model [28], were mainly quantitative, depending on numerical FIN data. In contrast, recent models are qualitative intelligent models (i.e., ML models) that have proved to outperform the
Since the speed of uncovering FSF can limit its consequences [8], the need for faster and more accurate models is becoming essential. Unfortunately, to the best of our knowledge, the literature has no research on detecting CA as an IFRS practice using big data analytics. Due to the ambiguous nature of the practice and the sophisticated historical actions involved, no specific financial ratio or traditional mathematical model can accurately detect it. However, FSF prediction scenarios can be considered in our proposal for several reasons. First, FSF prediction models learn from historical FSs (i.e., a time-series dataset) labelled as fraudulent and reveal which variables can be used to classify a case. This can help predict future fraud or financial distress. Presuming that CA practices eventually lead to FSF (e.g., Enron started using SPEs legally, then it became 'increasingly doubtful' over time [1]), CA detection has domain characteristics similar to those of FSF prediction. The latter can therefore be adapted to detect activities that do not exceed IFRS limits. Another reason is that both procedures aim to prevent future fraud and provide alerting flags that do not normally appear to stakeholders in their usual reviews of financial information.
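To make the contrast with earlier quantitative detectors concrete, the sketch below computes an M-score from eight financial indices. It assumes that the M-score model cited as [28] is the widely used eight-variable Beneish formulation; the class and function names and the cut-off parameter are illustrative and are not part of CADM. The default cut-off of -1.78 is a commonly quoted value and is treated here as an adjustable assumption.

```python
from dataclasses import dataclass


@dataclass
class BeneishIndices:
    """Year-over-year indices used by the eight-variable Beneish M-score."""
    dsri: float  # days' sales in receivables index
    gmi: float   # gross margin index
    aqi: float   # asset quality index
    sgi: float   # sales growth index
    depi: float  # depreciation index
    sgai: float  # selling, general and administrative expenses index
    lvgi: float  # leverage index
    tata: float  # total accruals to total assets


def m_score(x: BeneishIndices) -> float:
    """Linear score; higher values indicate a higher likelihood of manipulation."""
    return (-4.84
            + 0.920 * x.dsri
            + 0.528 * x.gmi
            + 0.404 * x.aqi
            + 0.892 * x.sgi
            + 0.115 * x.depi
            - 0.172 * x.sgai
            + 4.679 * x.tata
            - 0.327 * x.lvgi)


def is_flagged(x: BeneishIndices, cutoff: float = -1.78) -> bool:
    """Flag a firm-year whose score exceeds the cut-off (assumed default)."""
    return m_score(x) > cutoff
```

A score of this kind relies on a fixed set of numerical FIN ratios, which is one reason a model that also learns from N-FIN data and from changes over time is proposed here.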

E. The case of Saudi Arabia
According to the ACFE occupational fraud report in 2022, Saudi Arabia had the second-highest number of occupational fraud cases in the Middle East [34]. Many studies have investigated the problem in Saudi-listed companies. Most of these studies were quantitative, as opinions from accounting academics and professionals were gathered and analysed for a better understanding of the incentives and techniques [35][36][4][37] [13]. Still, some studies were empirical papers focused on one business type [11] [35] or one business size [36][13], applying a detection model with successful results. However, the application of big data and ML in the Saudi business research domain has focused on marketing [38] and finance [39] but never on accounting. On the other hand, accounting professionals have been using ML models, especially in some audit procedures in accounting firms. Apart from the Big Four¹, firms do not use these platforms because of cost and technical training limitations, although the platforms guarantee efficiency for both the client and the auditing firm itself.
¹ A term used to refer to the four big accounting firms: PwC, KPMG, EY, and Deloitte.
Recently, Saudi Arabia has undergone several economic transformations that influenced the rules governing accounting. Examples of these steps are joining the World Trade Organization (WTO) and adopting the International Financial Reporting Standards (IFRS) [40]. In addition, during the last 20 years, the Kingdom has established new institutions to regulate businesses and control the Saudi stock market. Some of these institutions were specifically designed to fulfil an essential Saudi vision for the future: Vision 2030 [41]. One of these institutions is the Saudi Data and Artificial Intelligence Authority (SDAIA), which has provided many valuable services and facilities for market and academic research. It is a helpful attempt to support the effort to improve financial reporting and to equip the Saudi business environment with innovative technologies through adequate investment in the country's prospects.

III. BIG DATA IN ACCOUNTING AND AUDITING
The accounting literature is rich with innovative analytical models that have the potential to enrich the accounting environment, develop accounting regulations, and reduce the profession's defects [42]. In JETA, 51.5% of the publications between 2005 and 2015 were on data analytics [33]. Moreover, 13% of published research on emerging technologies in the accounting domain was about big data analytics [43]. Although some studies point to limitations in the literature on the use of big data in accounting [44] [45], there is a consensus that the use of ML algorithms on big data in accounting research is growing remarkably [46], providing services to the business environment that were not possible before. Further, standardising accounting data through unified formats like XBRL and secure data structures like Blockchain has added promising opportunities for efficient research and improved accounting outcomes. In professional practice, accounting and auditing embrace big data analytics in different procedures. The Big 4 are investing heavily in data analytics and artificial intelligence [47] and promoting the adoption of big data technologies. For instance, the Halo online platform recently adopted by PwC supports whole-population analysis, which outperforms the sampling techniques usually used in auditing procedures. It also offers many recalculation and risk assessment tools (e.g., journal entry testing and general ledger analysis) made possible by its enhanced connectivity and high server-based processing capabilities. Moreover, regulatory agencies' results have been enhanced by incorporating non-financial data as a supplement to the traditional financial data in their systems (e.g., the UK government's tax authority uses different sources of data from the internet, social media, land registry records, international tax authorities, and banks [48]).
The research on the application of big data in accounting (summarised in TABLE III) is more focused on auditing. The analytical nature of auditing procedures makes them more likely to benefit from these applications. Many ML frameworks have been used in this research, giving insightful results. Each framework used different models, datasets, and features for multiple objectives. However, recent studies in this domain used a subfield of ML, namely DL. It is currently trending in every field, particularly in accounting research, because of its potential to learn from massive amounts of data. ML and DL models can be used in parallel to build a model with improved capabilities, as in Hybrid Learning (HL). They can also be combined so that the output of one model is the input of the next, which is called Ensembled Learning (EL), as in [49] [29]. Accordingly, the CADM framework is designed to apply an HL approach for enhanced performance and for its suitability to our proposal. Training is performed in two stages: the feature extraction stage and the CA detection stage. The following section briefly describes the CADM framework and the models used.

IV. CADM PROPOSED FRAMEWORK
This study considers previous FSF detection and prediction research in designing the CADM framework. The peculiarity of our detection model lies in the hybrid nature of the DL models and the originality of the datasets. Another critical point of distinction is our innovative learning process design. FSF prediction models are built using real, publicly available FRs that regulators have recognised as fraudulent, whereas no labelled CA incidents are available to learn from. Consequently, this study intends to simulate manipulated accounting datasets to learn from and then test the model on real-time Saudi FSs that are publicly available.

A. Datasets and Data Sources
Datasets are divided into two groups: training data and test data. The training data is a simulated dataset of FSs prepared (i.e., deliberately manipulated) based on pre-FSF statements available in the literature for supervised learning. The second dataset consists of real-world FRs chosen from the Saudi Open Data Portal and Saudi Statistics Authority (for price history and market information), companies' websites (for reports on corporate governance information, marketing information, discretionary disclosures, and investments), corporate social media accounts, the Saudi Press Agency, the Zakat, Tax and Customs Authority (ZATCA) (for historical levels of compliance), and the Capital Market Authority (CMA) (for observations and statistics about the company). Both training and testing datasets consist of two types of data, FIN and N-FIN, from relevant sources, as shown in TABLE IV. As CA evolves, the datasets will be time-series datasets of Saudi-listed companies for 2012-2022. Finally, we recognise the usefulness of including audio and video data types like GPS and CCTV recordings, yet we postpone their inclusion to future work.
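As a minimal sketch of how a labelled training set could be simulated, the code below applies one hypothetical manipulation (early revenue recognition booked against receivables) to a share of otherwise unmodified statements. The field names, the 5-20% inflation range, and the choice of manipulation are assumptions made for illustration only; the techniques actually simulated in CADM are those highlighted in TABLE II.

```python
import random


def simulate_manipulation(statement, inflate_pct):
    """Return a manipulated copy of a statement (hypothetical field names).

    Revenue is inflated by inflate_pct and the same amount is booked to
    receivables, net income, and total assets, mimicking early recognition.
    """
    s = dict(statement)  # copy so the original statement stays untouched
    bump = s["revenue"] * inflate_pct
    s["revenue"] += bump
    s["receivables"] += bump
    s["net_income"] += bump
    s["total_assets"] += bump
    return s


def build_training_set(statements, share_manipulated=0.5, seed=0):
    """Label each statement 1 (manipulated) or 0 (left as published)."""
    rng = random.Random(seed)  # seeded for reproducible simulations
    rows = []
    for st in statements:
        if rng.random() < share_manipulated:
            pct = rng.uniform(0.05, 0.20)  # assumed inflation range
            rows.append((simulate_manipulation(st, pct), 1))
        else:
            rows.append((dict(st), 0))
    return rows
```

Because the label is assigned at simulation time, the resulting pairs can feed a supervised learner directly, while the real-world FRs remain unlabelled and are used only for testing.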

B. Variables
The data types and sources described above mean that features will be scattered across multiple types of datasets. We will select variables from prior FSF studies and test them to validate their inclusion. Researchers have tested many FIN features, such as indices and financial ratios, and N-FIN features, such as changes in the board, market-adjusted stock returns, governance measures, economic changes, and changes in regulations. Other N-FIN features can be included according to their availability in Saudi sources, such as board meetings, meeting minutes, and meeting content, as tested in [26]. As shown in TABLE IV, the selection of FIN and N-FIN features will follow the criteria in [50], [49], and [51] accordingly. For any AI model to be accurate, variables should be limited [52], though this study proposes to test an extended set of obtainable variables and exclude some in the exploration process if needed.
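One simple way to limit an extended variable set during exploration is to drop constant columns and variables that are almost perfectly correlated with one already kept. The sketch below illustrates this filter only; it is a swapped-in example rather than the selection criteria of [50], [49], and [51], and the feature names and the 0.95 threshold are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def drop_redundant(features, threshold=0.95):
    """Return the names of features to keep.

    features maps a variable name to its list of observed values.
    Constant variables are dropped; a variable is also dropped when its
    absolute correlation with an already kept variable reaches the threshold.
    """
    kept = []
    for name, values in features.items():
        if len(set(values)) <= 1:
            continue  # a constant column carries no signal
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

The order of the input dictionary decides which member of a correlated pair survives, so variables with stronger support in prior FSF studies would be listed first.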

C. Modeling and Testing Methodology
As shown in Fig 2, our proposed framework consists of multiple steps. First, data collection starts by accessing open Saudi databases to gather information about listed companies and then gathering FRs from available sources. Second, in the data pre-processing stage, we set up a wide selection of variables to be included (based on available datasets), which will then be limited during training. Third, feature extraction will be performed using ML models. Fourth, the feature selection process that gives the best results will be chosen. Fifth, we will use DL models, such as RNN and LSTM, for training. Finally, we test the model on a real-time group dataset described in the following section.
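Because the training step relies on sequence models, each company's yearly records have to be arranged as ordered sequences during pre-processing. The following sketch shows one possible arrangement, overlapping fixed-length windows of consecutive years; the function name, the input layout, and the window length are assumptions for illustration.

```python
def make_windows(yearly_rows, window):
    """Turn one company's (year, feature_vector) records into overlapping sequences.

    Records are sorted by year first, so each window holds the feature
    vectors of `window` consecutive reporting years in chronological order.
    """
    ordered = sorted(yearly_rows, key=lambda row: row[0])
    vectors = [vec for _, vec in ordered]
    return [vectors[i:i + window] for i in range(len(vectors) - window + 1)]
```

For example, eleven yearly records covering 2012-2022 with a window of three years yield nine sequences per company.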

D. Deep Learning (DL)
DL is a subfield of ML that can learn the patterns, structures, and relations that reside in data. Many models are used in DL; the most used are deep belief networks (DBN) and convolutional neural networks (CNN) [53], but they perform successfully mainly on image recognition problems. Since our dataset design needs a model that can handle sequential, interconnected data, we propose using ANN models that can handle such features, as the LSTM can. In the CADM framework, we suggest using ANN and RNN to train our model on the simulated dataset. We also recommend using LSTM to improve the performance of CADM.

E. Long Short-Term Memory (LSTM)
LSTM is a type of RNN that can handle time-series datasets and include previous states in the calculation. It is commonly used in DL application research in finance, such as stock price prediction and portfolio management [54]. It can remember short-term and long-term values, which makes it useful for sequential data [55]. CA techniques evolve over time, and the probability of detecting them increases significantly when changes over time are considered.
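To illustrate how an LSTM carries previous states into each calculation, the sketch below implements a single time step of a standard LSTM unit with forget, input, and output gates in plain Python. It is not the CADM implementation: the parameter layout (one weight matrix and bias vector per gate, acting on the concatenation of the previous hidden state and the current input) is an assumption chosen for readability, and a practical model would rely on a DL library.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    x: input vector for the current period (e.g., one year's features)
    h_prev, c_prev: previous hidden state and memory (cell) state
    params: dict with keys 'f', 'i', 'o', 'g'; each value is (W, b), where
            W has one row per hidden unit acting on h_prev + x (concatenated).
    Returns the new hidden state and the new memory state.
    """
    z = h_prev + x  # list concatenation: [h_prev, x]

    def layer(key, activation):
        W, b = params[key]
        return [activation(dot(row, z) + bias) for row, bias in zip(W, b)]

    f = layer("f", sigmoid)  # forget gate: how much old memory to keep
    i = layer("i", sigmoid)  # input gate: how much new information to store
    o = layer("o", sigmoid)  # output gate: how much memory to expose
    g = layer("g", tanh)     # candidate values for the memory state

    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * tanh(cj) for oj, cj in zip(o, c)]
    return h, c
```

Calling `lstm_step` once per reporting period, feeding each returned state into the next call, lets the memory state accumulate information about how a company's figures change over time.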

F. Python and Hadoop
Python has many efficient packages for dealing with big data. Our framework suggests the use of Python on the Hadoop platform. Hadoop is a distributed storage and processing framework that scales up to thousands of computers to store, process, and control big data operations. Since we plan to train a rather complex model, Hadoop with Apache Spark integration will be used, as it can run distributed programs (e.g., using the MapReduce method) in a reliable, fault-tolerant, scalable, and portable way.
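The MapReduce method mentioned above can be illustrated without a cluster. The sketch below mimics its three phases (map, shuffle, reduce) in a single Python process to compute the average revenue per company; it does not use any Hadoop or Spark API, and the record fields are hypothetical. On a real cluster the same logic would be expressed through the platform's own interfaces and executed in parallel across nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for rec in records:
        yield rec["company"], rec["revenue"]


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single result per key."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_revenue(records):
    return reduce_phase(shuffle_phase(map_phase(records)))
```

Each phase only needs the records or the key group in front of it, which is what allows a distributed platform to spread the work over many machines.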

V. CONCLUSION
CA has been shown to be harmful, and its consequences extend beyond long-lasting financial and reputational damage for corporations and their CEOs to national-level consequences, such as a country-wide lower level of financial market participation [8]. This paper aims to protect the Saudi financial market from the negative impacts of CA and offer stakeholders an opportunity to depend on enhanced sources of financial information for better decision-making. We introduced the framework of a new analytical model, CADM, that inspects the features available to detect CA used in preparing FRs. We suggested the inclusion of FIN and N-FIN data in the learning and testing stages through two DL models: ANN and LSTM. We also suggested two sources for the dataset that represent a sample of Saudi-listed companies from different sectors over a period of 10 years.

VI. LIMITATIONS AND FUTURE WORK
This research is part of ongoing research on big data applications in CA. Accessibility and availability of applicable datasets remain a challenge. However, open-source Saudi platforms like SDAIA, CMA Open Data, and ZATCA Open Data are expected to provide the dataset size required to train and test our model. Additionally, the implementation and testing stages of the CADM framework may encounter some pre-processing restrictions in the translation and standardisation stages. Yet, there is a reasonable chance that most FIN data are already standardised, given that Saudi companies are obligated to publish their FSs in XBRL format. Moreover, we are preparing a short survey designed for independent auditors in Saudi Arabia to validate the suitability and accessibility of our selected variables before starting the data collection process. The next stage's outcomes are expected to unveil new insights and contribute to the existing accounting and big data literature.
Fig 3 shows that an LSTM unit has three types of gates that regulate input. The unit receives an input and a previous state, then calculates the output and updates the memory state. This type of network is needed to train the model due to the nature of the time-series datasets we use. We argue that

TABLE II. CA TECHNIQUES

MAYSOON BINEID ET AL.: CADM: BIG DATA TO LIMIT CREATIVE ACCOUNTING IN SAUDI-LISTED COMPANIES

TABLE III. BIG DATA IN FSF RESEARCH

TABLE IV. DATA TYPES AND SOURCES