A mixed-methods measurement and evaluation methodology for mobile application usability studies

Abstract. Low usability of a mobile application is thought to diminish the quality perceived by its users, whose experience substantially determines the application's market success or failure. Because a single measurement method may produce an unreliable or incomplete evaluation, in this paper we propose to combine qualitative and quantitative methods to collect data that describe all usability attributes. In particular, in the scope of mobile application usability studies, this paper (i) presents the main assumptions of the elaborated M4MAUME methodology, (ii) describes the self-developed software tool (RVDA) for retrospective video data analysis, (iii) specifies the experimental setup, and (iv) discusses the preliminary results obtained from 20 experiments performed on four different groups of mobile applications. Our findings lead us to the conclusion that mixing different methods has produced reliable and valuable outcomes, which may be used to improve and manage usability in current and future projects, as well as to enhance existing software quality assurance (SQA) programs.

The long-lived marriage of hardware and communication technologies has brought about the inevitable shift from desktop to mobile computing [14,15], leading to new user requirements regarding mobile applications [16,17,18]. As a consequence, usability engineers face the issue of measuring and evaluating usability in new software settings, including user interface (UI) design [19], connectivity [20], and context-awareness [21], as well as hardware capabilities such as screen size [22], storage space [23] and overall performance [24].
In light of a systematic literature review and analysis covering 791 documents indexed by the Scopus database and published between 2001 and 2018, we determined the research methods applied to measure and evaluate the usability of mobile applications [25]. Our findings also show that only a few studies collected data with more than one technique, and very few utilized retrospective video data analysis. This research gap inspired us to establish a new laboratory, equipped with both hardware apparatus and software tools, and to elaborate a congruent methodology with all the necessary data collection techniques.
The remainder of the paper is organized as follows. In Section II, a brief research background is described, followed by the M4MAUME methodology (Section III). In Section IV the RVDA tool is specified. In Section V we detail the general settings of the experimental setup, followed by a discussion of the preliminary results obtained from the undertaken experiments (Section VI). Finally, Section VII concludes the paper.

II. DEFINITIONS, MEASURES, AND METHODS
The generally accepted definition of usability is the one given in ISO 9241-11, which states that usability is the "extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" [26].
This definition has been adopted in the majority of studies in the subject of mobile application usability.
Efficiency is the ability of a user to complete a task with speed and accuracy [27]. Efficiency is measured in a number of ways, such as the time taken to complete a given task or a set of tasks [27,28]. In general, two methods are put into use: controlled observation and survey.
Satisfaction is a user's perceived level of comfort and pleasure, or a user's perceived level of fulfillment of their expectations and needs [29]. Satisfaction is measured by using questionnaires and other qualitative techniques, typically used to capture a user's intangible attitude towards an application [29,30].
Effectiveness is the ability of a user to complete a task in a given context [31]. It is measured by the number of successfully completed tasks, the number of steps required to complete a task, the number of double taps unrelated to the operation of an application, and the number of times that the back button is used on the mobile device [31,32]. To collect the data needed to estimate these measures, two methods are in common use: controlled observation and survey.
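To make the effectiveness measures concrete, the following minimal Python sketch computes a task completion rate and the mean number of steps from observation records. The record structure and all field names are hypothetical and serve only as an illustration; they are not part of the methodology or of the RVDA tool.

```python
from dataclasses import dataclass


@dataclass
class TaskObservation:
    """One observed task attempt (all field names are hypothetical)."""
    task_id: str
    completed: bool
    steps: int
    back_presses: int


def completion_rate(observations):
    """Share of observed tasks that were completed successfully."""
    if not observations:
        raise ValueError("no observations")
    return sum(o.completed for o in observations) / len(observations)


def mean_steps(observations):
    """Average number of steps, taken over the completed tasks only."""
    done = [o.steps for o in observations if o.completed]
    return sum(done) / len(done) if done else None


# Illustrative data for one participant, not taken from the experiments.
session = [
    TaskObservation("T1", True, 6, 0),
    TaskObservation("T2", True, 9, 2),
    TaskObservation("T3", False, 14, 5),
    TaskObservation("T4", True, 5, 1),
]

rate = completion_rate(session)                   # 3 of 4 tasks completed
total_back = sum(o.back_presses for o in session)  # back-button usage count
```

Restricting the step average to completed tasks is one possible design choice; an evaluator may equally wish to report steps for abandoned tasks separately.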
Learnability is defined in two ways: first-time and over-time. The former refers to the degree of ease with which a user can interact with a newly-encountered application without seeking guidance or referring to documentation. It is measured by the number of attempts to solve a task, the number of assists when performing a task, and the number of errors made by a user [33]. In contrast, the latter is the capacity of a user to achieve proficiency with an application. Typically, a user's performance during a series of tasks is observed to measure how long it takes the participant to reach a pre-specified level of proficiency [33,34].
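As an illustration of the over-time view, the sketch below finds the first trial at which a participant's task time drops to a pre-specified proficiency threshold. The function name, the use of task time as the proficiency criterion, and the threshold value are assumptions made for this example only.

```python
def trials_to_proficiency(times, threshold):
    """Return the 1-based index of the first trial whose completion time
    is at or below the threshold, or None if it is never reached.

    `times` holds the completion times of successive trials, in seconds.
    """
    for trial, seconds in enumerate(times, start=1):
        if seconds <= threshold:
            return trial
    return None


# Illustrative completion times (seconds) for five repeated trials.
trial_times = [50.0, 41.0, 33.0, 28.0, 27.0]
reached_at = trials_to_proficiency(trial_times, threshold=30.0)
```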
Memorability is the degree of ease with which a user can remember how to use an application effectively. It is measured by asking users to perform a series of tasks after having become proficient with the use of the application, and afterwards asking them to perform similar tasks after a period of inactivity. To determine how memorable the application was, a comparison is made between the two sets of results [35,36]. In a few cases, eye tracking techniques were additionally utilized in usability studies.
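The comparison between the two sets of results can be sketched as a per-task relative change. The example below assumes that task completion time is the compared quantity and that tasks are matched by a hypothetical identifier; other measures could be compared in the same way.

```python
def memorability_change(before, after):
    """Relative change in task time between two sessions, per task.

    `before` and `after` map a task identifier to a completion time in
    seconds. Positive values mean the task took longer after the period
    of inactivity. Only tasks present in both sessions are compared.
    """
    common = sorted(set(before) & set(after))
    return {task: (after[task] - before[task]) / before[task] for task in common}


# Illustrative times, not measured data.
proficient = {"T1": 20.0, "T2": 40.0}
after_break = {"T1": 25.0, "T2": 40.0, "T3": 12.0}
change = memorability_change(proficient, after_break)
```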
Cognitive load refers to the amount of mental activity imposed on a user's working memory during application usage [37]. Cognitive load theory differentiates cognitive load into three types: extraneous, intrinsic and germane [38]. Extraneous load refers to instructional and presentation schemas and is caused by mental activities and elements that do not directly support application usage; intrinsic load refers to task complexity and is caused by the number of elements in a task and the degree to which these elements are related to each other; germane load refers to the amount of mental effort used to form schemas and actively integrate new information with prior knowledge during application usage. In the practice of cognitive load measurement, instruments such as a subjective rating scale, a thinking aloud dual task protocol or eye tracking have been used [32,39].
Errors relate to the number and type of errors which occur during task performance by a user [40]. Alternatively, the attribute is defined as the ability of an application to recover from errors that have occurred. Both definitions also reflect the way the attribute is measured [40,41]. Controlled observation and survey are the only methods used to observe both application and user performance or to collect a user's perceived level of error-free application usage.
Simplicity is the degree of being easy to understand or being uncomplicated in form or design, described by such characteristics as the number of menu levels, the number of gestures performed to reach a destination object, and the time spent searching for a button to perform a specific function. Alternatively, simplicity is the level of comfort with which a user is able to complete a task, measured by predefined statements rated on a Likert scale.
Ease of use is the perceived level of the user's effort related to usage of the application. The survey instrument is used to collect data from users on perceptions concerning their experienced interaction with the application [40,42].
Navigation is the perceived level of the user's effort to access relevant information. Similarly, the survey instrument is applied to collect data from users on perceptions concerning their understanding of the information architecture [43,44,45].

Theoretical background
By definition, a methodology is a body of methods, rules and postulates employed by a discipline [46]. In brief, a method is a process of doing something [47], a rule is a prescribed guide for conduct [48] and a postulate is a hypothesis advanced as an essential premise of a train of reasoning [49].
Theoretical triangulation is the use of multiple theories or hypotheses when examining a phenomenon [50], while methodical triangulation has been defined as multimethod, methods triangulation, or simply mixed-method [51]. The application of triangulation, in the scope of multiple sources of data, enhances the reliability of results [52]. There is a direct link between data triangulation and data saturation, where the former is a method to establish the latter.
The quantitative paradigm is based on positivism [53], where evidence is characterized by empirical research. This group of methods emphasizes objective measurements and the numerical analysis of data collected through polls, surveys and questionnaires. In contrast, the qualitative paradigm is based on interpretivism [54] and constructivism [55]. The three most common methods are: focus groups (group discussions), individual in-depth interviews, and participant observation [56].
Verbal protocol analysis (VPA) is a method for collecting and analyzing verbal data regarding cognitive processing [57]. In other words, VPA is the record of spoken thoughts provided by a subject when thinking aloud during, or immediately after, completing a task [58]. It is based on the premise that in order to capture a participant's real and authentic experiences, we must allow them to express themselves freely [59]. Alternatively, in a narrow sense, this method (protocol) is also defined as think aloud or thinking aloud, and is relatively free-form [60] and open-ended.

Main assumptions
The Mixed-Methods Methodology for Mobile Application Usability Measurement and Evaluation (3M4MAUME) is a body of three integrated methods, namely: (1) survey, (2) participant observation and (3) verbal protocol analysis. This multi-faceted approach is designed to take advantage of both quantitative (1) and qualitative (2 & 3) approaches, which in a specified combination are able to provide a full panorama of an application's quality-in-use characteristics, as well as of users' attitudes and perceptions.
In particular, each survey is implemented according to a series of steps, each of which includes an application-specific functionality and a defined set of formats and procedures.
Participant observation takes place in conjunction with the thinking aloud protocol. To preserve privacy and encourage subject confidence, only an observer is present during the testing session. Moreover, each participant can ask questions to clarify aspects of the task, but all answers given are as brief as possible to minimize the time burden during the application testing session.
The observer has access to an external monitor which displays the participant's interaction with the application in real time, while a third-party application records the session, including both video and voice data.
After the test, the participant is asked to run through the ideas generated and explain the thinking behind them, where not already mentioned.

Task no 1. Data collection
The usability testing procedure aims to collect data for analysis and follows a fixed sequence of three steps:
1) The pre-testing questionnaire collects demographic data and responses to ten statements regarding mobile application usage and observed usability issues, followed by a self-assessment of the skills needed to perform five tasks, which correspond to the main application functionality and to the tasks the participant is expected to perform afterwards.
2) The application testing session follows a protocol in which the participant is recorded during application usage by the audio/video hardware apparatus, in order to collect both voice and video data, without assistance or guidance. Moreover, participants are always encouraged to think aloud about the application properties and behavior, as well as to speak frankly of any other important issues.
3) The post-testing questionnaire aims to capture the perceived quality in use, specified by ten usability attributes, each typified by at least five statements.
The pre- and post-testing questionnaires are individually administered to the participants, before and after the test respectively. The output of this task therefore consists of two questionnaires and an audio-video binary file, which, once gathered from the whole group, are verified and eventually serve as the input for the data analysis task.
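As a rough illustration of how post-testing questionnaire responses could be summarized, the sketch below averages the ratings of the statements belonging to each usability attribute. The 5-point rating range, the attribute keys, and the use of the arithmetic mean are assumptions of this example; the text above only fixes that each attribute is typified by at least five statements.

```python
from statistics import mean


def attribute_scores(responses, scale_min=1, scale_max=5):
    """Mean rating per usability attribute for one participant.

    `responses` maps an attribute name to the list of ratings given to
    its statements. The 1..5 range is an assumed Likert scale.
    """
    scores = {}
    for attribute, ratings in responses.items():
        if len(ratings) < 5:
            raise ValueError(f"{attribute}: expected at least five statements")
        if any(not scale_min <= r <= scale_max for r in ratings):
            raise ValueError(f"{attribute}: rating outside the assumed scale")
        scores[attribute] = mean(ratings)
    return scores


# Illustrative responses for two of the ten attributes.
participant = {
    "efficiency": [4, 5, 4, 3, 4],
    "satisfaction": [5, 5, 4, 4, 5],
}
scores = attribute_scores(participant)
```

Since Likert data are ordinal, a median per attribute may be preferred over the mean; swapping `mean` for `statistics.median` is enough to change the summary.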

Task no 2. Data analysis
The data analysis task follows a fixed sequence of three steps:
1) Inspecting the video content, which comprises annotation procedures in which the user's actions and application responses are identified, separated and marked on the timeline.
2) Documenting all identified application bugs, defects, errors, and any reported usability issues.
3) Extracting the numerical values required to calculate usability attribute measures.
It is worth noting that specific software tools are required to perform the above task with reliable outcomes and acceptable accuracy. Moreover, all necessary calculations are undertaken to quantify attribute measures and estimate structural model parameters.

Task no 3. Information visualization
The third stage involves information visualization on the dashboard to empower cognition of the extracted and analyzed data, which precedes and facilitates usability evaluation. For that purpose we specified a weighted and labeled graph. Vertices show the sequence of the user's actions, and weights are used to represent the duration of application responses. In other words, the graph is a reconstructed image of the interaction which occurred between user and application, enriched by the duration of particular actions and responses.
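A weighted and labeled graph of this kind can be sketched with plain Python data structures. In the example below, each vertex is labeled with a user action and each edge carries the duration of the application response that followed that action. The closing "END" vertex, the tuple layout, and the sample actions are assumptions introduced for illustration only.

```python
def build_interaction_graph(steps):
    """Build a path-shaped weighted graph from an ordered interaction log.

    `steps` is a chronological list of (action_label, response_seconds).
    Returns (vertices, edges): vertices are action labels plus a final
    "END" vertex (an assumption of this sketch), and each edge is a
    (source_index, target_index, weight) triple whose weight is the
    duration of the application response to the source action.
    """
    vertices = [label for label, _ in steps] + ["END"]
    edges = [(i, i + 1, response) for i, (_, response) in enumerate(steps)]
    return vertices, edges


# Illustrative interaction, not taken from a recorded session.
log = [("open menu", 0.4), ("tap search", 1.2), ("submit query", 2.5)]
vertices, edges = build_interaction_graph(log)

# Total time the user spent waiting for application responses.
total_response_time = sum(weight for _, _, weight in edges)

# The slowest response points at a candidate performance issue.
slowest = max(edges, key=lambda edge: edge[2])
```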

Task no 4. Usability evaluation
Having measured and estimated all usability measures (see Section II) and visualized all relevant information, one can further analyze, classify and interpret the obtained outcome. In addition, one might go through the audio-video recordings, investigating the cause-and-effect relationships that may lead to a loss of effectiveness and a decrease in satisfaction.
In our opinion, the enriched action-response model provides an effective approach to evaluating particular usability attributes. At the interpretive level of research, the results of the quantitative analysis provide explanations that help evaluators form consistent and proper judgments.
The outcome of this step is a report, which in general presents results and conclusions, as well as a list of recommendations with applicable participants' reviews. In particular, the report categorizes usability issues into three groups: (1) bugs and errors, (2) design and (3) performance. Accordingly, these groups may respectively concern testers, designers and developers.

IV. THE RVDA TOOL
To develop the RVDA tool (Retrospective Video Data Analyzer) we used the Electron open source library. Developed by GitHub and intended for building cross-platform desktop applications with HTML, CSS, and JavaScript, Electron combines Chromium and Node.js into a single runtime environment [62]. We also used npm to manage packages for the JavaScript runtime environment. In general, the above are the major devkit components.

PAWEŁ WEICHBROTH: A MIXED-METHODS MEASUREMENT AND EVALUATION METHODOLOGY FOR MOBILE APPLICATION USABILITY STUDIES
The architecture of the tool has been conceived to allow for an easy and scalable video content analysis, supporting all common video file formats available for playback, where one or more files can be simultaneously opened and analyzed in separate windows.
From a technical viewpoint, there are main modules and rendering modules. The former are responsible for operations on windows, user interface properties, and communication with the operating system. The latter handle tasks on the video content and all associated functions. By design, neither communication nor data exchange occurs within modules.
The user interface (UI) is divided into three sections: (1) the video preview area, (2) the menu panel, and (3) the display manipulators and the timeline. The size of the sections can be easily modified, along with the position of the tool on the computer screen. There are four levels available, which are used to break down the tasks into the user's actions and the application's responses (events). Accordingly, the Start and Stop buttons are used to drop pins on the timeline which mark the beginning and the end of each event. Moreover, there are also Save, Reset, and Clear features available.
The Export feature allows timeline data to be saved to an external CSV file, which contains the following for each event: identifier, name, start time, end time, and level number. In this way, data obtained from a series of experiments can be consolidated for further reporting and analysis.
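To show how an exported timeline might be consolidated, the following Python sketch reads CSV text with the five fields listed above and sums event durations per level. The column header names, the comma delimiter, and the time unit (seconds) are assumptions, since the actual layout of the RVDA export file is not specified here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export content; header names and values are illustrative.
SAMPLE_EXPORT = """id,name,start,end,level
1,Login task,0.0,30.0,1
2,Tap login button,2.0,3.0,2
3,Login screen loads,3.0,5.5,2
"""


def durations_by_level(csv_text):
    """Sum event durations (end minus start) for each timeline level."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = float(row["end"]) - float(row["start"])
        if duration < 0:
            raise ValueError(f"event {row['id']} ends before it starts")
        totals[int(row["level"])] += duration
    return dict(totals)


totals = durations_by_level(SAMPLE_EXPORT)
```

Files from several sessions could be processed by calling `durations_by_level` once per file and merging the resulting dictionaries.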

V. EXPERIMENTAL SETUP
The laboratory was established in June 2018 and is located at WSB University in Gdansk. Both hardware and software apparatus were chosen to meet specified requirements regarding project scope and budget.
Each application testing session begins with a short introduction, including goals and research agenda. Afterwards the usability testing procedure is described and executed. Our experience shows that a single session lasts approximately 15 minutes.

Hardware apparatus
For real-time image and voice capture and recording we used the Genee Vision 150 document camera. Its design combines five major components: optics, camera, lighting system, motherboard with firmware, and software tool. The built-in camera resolution is 2 megapixels, resulting in image dimensions of 1920 by 1080 pixels. While resolution is an important specification, the values above proved adequate to extract all required data in the context of the study. The optical and digital zoom are 8x and 10x respectively, and the camera can rotate 180° vertically and 180° horizontally. The device is connected via a USB port to the local computer, equipped with the software apparatus and the latest hardware drivers and software libraries.

Software apparatus
In total we used four software tools. Firstly, VideoCap captures the audio-video data transmitted by the document camera and saves it in the mp4 file format. Secondly, the mp4 data is analyzed with the RVDA tool. Thirdly, in order to document the recognized usability issues and to actively collaborate on them with all interested parties, we installed, configured and deployed a self-hosted Git service, namely Gogs, on an external virtual server. Fourthly, for advanced statistical analysis we take advantage of commercial and open-source tools to make sense of quantitative data.

VI. METHODOLOGY TESTING AND VALIDATION
The elaborated methodology was tested from January to May 2019, in a group of four simultaneous student projects. Each project concerned a different set of mobile applications. Usability testing included five adult participants, both males and females with relevant knowledge and skills, who used their own smartphones during the session.
Moreover, the methodology underwent one proof of concept (PoC), despite the preliminary stage of its maturity. During the PoC meeting, the interested parties underlined its practicality, as well as the possible benefits to both designers and developers. Here, it is worth highlighting just one voice of appraisal: "Despite economic constraints, analysis of video has strong software quality validity, offering rich insight into both application and user behavior. Studying the interactions in a retrospective manner provides valuable lessons and affirmation for quality assurance practices".

VII. CONCLUSIONS
We are aware that the methodology needs further consultation and discussion with the software industry, with the goal of reaching an optimal cost-benefit ratio. From the economic perspective, the methodology is time-consuming and labor-intensive. Since mobile application development projects have relatively low budgets, there may be no allowance to reserve sufficient funds for this area.
However, in conclusion, since the methodology has been acknowledged to be application-agnostic, fully adaptive and replicable, we believe that in the near future some interested parties will decide to implement our methodology, or at least incorporate some of its components.