Int. J. Data Analysis Techniques and Strategies, Vol. 2, No. 3, 2010 243
Applications of data mining in software engineering
Quinn Taylor and Christophe Giraud-Carrier
Department of Computer Science,
Brigham Young University,
Provo, UT 84602, USA
Abstract: Software engineering processes are complex, and the related activities often produce a large number and variety of artefacts, making them well-suited to data mining. Recent years have seen an increase in the use of data mining techniques on such artefacts with the goal of analysing and improving software processes for a given organisation or project. After a brief survey of current uses, we offer insight into how data mining can make a significant contribution to the success of current software engineering efforts.
Keywords: data mining; software engineering; applications.
Reference to this paper should be made as follows: Taylor, Q.
and Giraud-Carrier, C. (2010) ‘Applications of data mining in software engineering’, Int. J. Data Analysis Techniques and Strategies, Vol. 2, No. 3, pp.243–257.
Biographical notes: Quinn Taylor is a student in the MS degree programme in Computer Science at Brigham Young University and a Researcher in the SEQuOIA Lab where he focuses on understanding and visualising software structure and development processes, including through the use of data mining techniques. His research interests include software architectures, software evolution, code maintenance and decay, software reverse engineering and refactoring.
Christophe Giraud-Carrier is an Associate Professor and the Director of the Data Mining Laboratory in the Department of Computer Science at Brigham Young University. His research interests include metalearning, social network analysis, medical informatics and applications of data mining. He received his BS, MS and PhD in Computer Science at BYU in 1991, 1993 and 1994, respectively.
Copyright © 2010 Inderscience Enterprises Ltd.

1 Introduction

Software systems are inherently complex and difficult to conceptualise. This complexity, compounded by intricate dependencies and disparate programming paradigms, slows development and maintenance activities, leads to faults and defects, and ultimately increases the cost of software. Most software development organisations develop some sort of process to manage software development activities. However, as in most other areas of business, software processes are often based only on hunches or anecdotal experience, rather than on empirical data.
Consequently, many organisations are ‘flying blind’ without fully understanding the impact of their process on the quality of the software that they produce. This is generally not due to apathy about quality, but rather to the difficulty inherent in discovery and measurement. Software quality is not simply a function of lines of code, bug count, number of developers, man-hours, money or previous experience – although it involves all those things – and it is never the same for any two organisations.
Software metrics have long been a standard tool for assessing the quality of software systems and the processes that produce them. However, there are pitfalls associated with the use of metrics. Managers often rely on metrics that they can easily obtain and understand, which may be worse than using no metrics at all. Metrics can seem interesting, yet be uninformative, irrelevant, invalid or not actionable. Truly valuable metrics may be unavailable or difficult to obtain. Metrics can be difficult to conceptualise, and changes in metrics can appear unrelated to changes in process.
Alternatively, software engineering activities generate a vast amount of data that, if harnessed properly through data mining techniques, can help provide insight into many parts of software development processes. Although many processes are domain- and organisation-specific, there are many common tasks which can benefit from such insight, and many common types of data which can be mined. Our purpose here is to bring software engineering to the attention of our community as an attractive testbed for data mining applications and to show how data mining can significantly contribute to software engineering research.
The paper is organised as follows. In Section 2, we briefly discuss related work, pointing to surveys and venues dedicated to recent applications of data mining to software engineering. Section 3 describes the sources of software data available for mining and Section 4 provides a brief, but broad, survey of current practices in this domain. Section 5 discusses issues specific to mining software engineering data and prerequisites for success. Finally, Section 6 concludes the paper.
2 Related work
Although the application of data mining to software engineering artefacts is relatively new, there are specific venues in which related papers are published and authors that have created resources similar to this survey.
Perhaps the earliest survey of the use of data mining in software engineering is the 1999 Data and Analysis Center for Software (DACS) state-of-the-art report (Mendonca and Sunderhaft, 1999). It consists of a thorough survey of data mining techniques, with emphasis on applications to software engineering, including a list of 55 data mining products with detailed descriptions of each product and summary information along a number of technical as well as process-dependent features.
Since then, and over the years, Xie (2010) has been compiling and maintaining an (almost exhaustive) online bibliography on mining software engineering data. He also presented tutorials on that subject at the International Conference on Knowledge Discovery in Databases in 2006 and at the International Conference on Software Engineering in 2007, 2008 and 2009 (e.g., see Xie et al., 2007). Many of the publications we cite here are also included in Xie’s bibliography and tutorials.
The Mining Software Repositories (MSR) Workshop, co-located with the International Conference on Software Engineering, was originally established in 2004.
Papers published in MSR focus on many of the same issues we have discussed in this survey and the goal of the workshops is to increase understanding of software development practices through data mining. Beyond tools and applications, topics include assessment of mining quality, models and meta-models, exchange formats, replicability and reusability, data integration and visualisation techniques.
Finally, Kagdi et al. (2007) have recently published a comprehensive survey of approaches for MSR in the context of software evolution. Although their survey is narrower in scope than the overview given here, it has greater depth of analysis, presents a detailed taxonomy of software evolution data mining methodologies and identifies a number of related research issues that require further investigation.
3 Software engineering data
The first step in the knowledge discovery process is to gain an understanding of the available data and the business goals that drive the process. This is essential for software engineering data mining endeavours, because the availability of data ultimately limits which questions can be effectively answered.
In this section, we describe the software engineering data that are available for mining and analysis. Current software development processes involve several types of resources from which software-related artefacts can be obtained. Software ‘artefacts’ are the products of software development processes. Artefacts are generally lossy and thus cannot provide a full history or context, but they can help piece together understanding and provide further insight. There are many data sources in software engineering; in this paper, we focus on four major groups and describe how each may be used for mining software engineering data.
First, the vast majority of collaborative software development organisations utilise revision control software (e.g., CVS, Subversion, Git, etc.) to manage the ongoing development of digital assets that may be worked on by a team of people. Such systems maintain a historical record of each revision and allow users to access and revert to previous versions. By extension, this provides a way to analyse historical artefacts produced during software development, such as the number of lines written, the authors who wrote particular lines, or any number of common software metrics.
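As a minimal sketch of this kind of analysis, the following parses a numstat-style revision log (the sample text and its `--<author>` delimiter format are illustrative, roughly what `git log --pretty='--%an' --numstat` would produce) and aggregates lines added per author:

```python
from collections import defaultdict

# Hypothetical excerpt of a numstat-style log; each commit starts with
# '--<author>' followed by tab-separated added/deleted/filename lines.
LOG = """\
--alice
10\t2\tsrc/parser.c
3\t0\tsrc/util.c
--bob
5\t5\tsrc/parser.c
--alice
1\t1\tREADME
"""

def lines_added_per_author(log_text):
    """Aggregate the number of added lines per author from the log."""
    totals = defaultdict(int)
    author = None
    for line in log_text.splitlines():
        if line.startswith("--"):
            author = line[2:]          # start of a new commit
        elif line.strip():
            added, _deleted, _path = line.split("\t")
            totals[author] += int(added)
    return dict(totals)

print(lines_added_per_author(LOG))  # {'alice': 14, 'bob': 5}
```

Real studies would of course compute richer metrics (churn, ownership, co-change), but the extraction pattern is the same: replay the history and aggregate per entity of interest.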
Second, most large organisations (and many smaller ones) also use a system for tracking software defects. Bug tracking software (such as Bugzilla, JIRA, FogBugz, etc.) associates bugs with meta-information (status, assignee, comments, dates and milestones, etc.) that can be mined to discover patterns in software development processes, including the time-to-fix, defect-prone components, problematic authors, etc.
Some bug trackers are able to correlate defects with source code in a revision system.
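A time-to-fix computation of the kind mentioned above can be sketched as follows; the records and their field names are hypothetical, standing in for whatever a tracker such as Bugzilla or JIRA would export:

```python
from datetime import datetime

# Hypothetical bug records; field names are illustrative, not any
# particular tracker's actual schema.
bugs = [
    {"id": 1, "component": "ui",   "opened": "2010-01-04", "closed": "2010-01-06"},
    {"id": 2, "component": "core", "opened": "2010-01-05", "closed": "2010-01-19"},
    {"id": 3, "component": "ui",   "opened": "2010-02-01", "closed": "2010-02-03"},
]

def mean_time_to_fix(records):
    """Average number of days between opening and closing, per component."""
    sums, counts = {}, {}
    for b in records:
        days = (datetime.strptime(b["closed"], "%Y-%m-%d")
                - datetime.strptime(b["opened"], "%Y-%m-%d")).days
        sums[b["component"]] = sums.get(b["component"], 0) + days
        counts[b["component"]] = counts.get(b["component"], 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

print(mean_time_to_fix(bugs))  # {'ui': 2.0, 'core': 14.0}
```

Components with unusually high mean time-to-fix are natural candidates for closer inspection as defect-prone or poorly understood areas of the system.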
Third, virtually all software development teams use some form of electronic communication (e-mail, instant messaging, etc.) as part of collaborative development (communication in small teams may be primarily or exclusively verbal, but such cases are inconsequential from a data mining perspective). Text mining techniques can be applied to archives of such communication to gain insight into development processes, bugs and design decisions.
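At its simplest, such text mining reduces to term extraction over the message archive; the toy messages and stopword list below are illustrative only, standing in for a real mailing-list or chat export and a proper stopword resource:

```python
import re
from collections import Counter

# Toy archive of developer messages.
messages = [
    "The parser crashes on empty input, see bug 42",
    "Fixed the parser crash, empty input now handled",
    "Design question: should the parser own the symbol table?",
]

STOPWORDS = {"the", "on", "see", "now", "should", "a", "of"}

def top_terms(texts, n=3):
    """Rank the most frequent non-stopword terms across all messages."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)
```

Here `top_terms(messages)` surfaces "parser" as the dominant topic of discussion; more sophisticated approaches (tf-idf weighting, topic models) follow the same pipeline of tokenise, filter and weight.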
Fourth, software documentation and knowledge bases can be mined to provide further insight into software development processes. This approach is useful to organisations that use the same processes across multiple projects and want to examine a process in terms of overall effectiveness or fitness for a given project. Although knowledge bases may contain source code, this approach focuses primarily on retrieval of information from natural languages.
4 Mining software engineering data: a brief survey
In this section, we give a technique-oriented overview of how traditional data mining techniques have been applied in the context of software engineering, followed by a more task-oriented view in which we show how software tasks in three broad groups can benefit from data mining.
4.1 Data mining techniques in software engineering
In this section, we discuss several data mining techniques and provide examples of ways they have been applied to software engineering data. Many of these techniques may be applied to software process improvement. We attempt to emphasise innovative and promising approaches and how they can benefit software organisations.
4.1.1 Association rules and frequent patterns

Zimmermann et al. (2005) have developed the Reengineering of Software Evolution (ROSE) tool to help guide programmers in performing maintenance tasks. The goals of ROSE are to:

1 suggest and predict likely changes
2 prevent errors due to incomplete changes
3 detect coupling undetectable by program analysis.
Similar to Amazon’s system for recommending related items, they aim to provide guidance akin to “programmers who changed these functions also changed... ”. They use association rules to distinguish between change types in CVS and try to predict the most likely classification of a change-in-progress.
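The core of such a recommendation can be illustrated with a small sketch, not ROSE's actual algorithm: treat each commit as a transaction of co-changed files (the file names below are hypothetical) and emit "changed X, also change Y" rules whose confidence exceeds a threshold:

```python
from itertools import combinations
from collections import Counter

# Hypothetical commit transactions: the set of files touched together.
commits = [
    {"db.c", "db.h"},
    {"db.c", "db.h", "main.c"},
    {"db.c", "db.h"},
    {"main.c"},
    {"db.c"},
]

def cochange_rules(transactions, min_conf=0.7):
    """Derive 'changed X => also change Y' rules whose confidence
    (co-change count / change count of X) meets the threshold."""
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        item_count.update(t)
        pair_count.update(combinations(sorted(t), 2))
    rules = {}
    for (a, b), n in pair_count.items():
        if n / item_count[a] >= min_conf:
            rules[(a, b)] = n / item_count[a]
        if n / item_count[b] >= min_conf:
            rules[(b, a)] = n / item_count[b]
    return rules
```

On this toy history, every change to `db.h` was accompanied by a change to `db.c` (confidence 1.0), so an edit to `db.h` alone would trigger a recommendation; `main.c` co-changes too rarely to produce a rule.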
Livshits and Zimmermann (2005) collaborated to create DynaMine, an automated tool that analyses code check-ins to discover application-specific coding patterns and identify violations which are likely to be errors. Their approach is based on the classic Apriori algorithm, combined with pattern categorisation and dynamic analysis. Their tool has been able to detect previously unseen patterns and several pattern violations in studies of the Eclipse and jEdit projects.