The panel discussed many things, including the "missing links" in the data mining literature between theory and practice.
Some notes on those gaps are offered below.
Data-driven management:
· Data mining is useless unless management listens to the conclusions. Our data repositories become data cemeteries if no one uses them. How can we foster a data-driven management culture [1]?
From anecdote to evidence:
· For a supposedly data-driven field, there are surprisingly few (a) exemplar case studies in the literature or (b) lists of lessons learned from all this work. If a business user or a graduate student asks “in this field, what works best and why do you think so?”, what do we answer? We know of some isolated successes [2,3], but can we build a compendium of successful projects delivering actionable, timely, and insightful results to industry?
Conclusions to impact:
· Making predictions about (say) defect densities or effort estimation is all well and good, but what about the myriad other issues that industry wants to address? For example, what about value-based concerns [4,5] or other management information needs [6]? How can we turn our prediction models into most-impactful prediction models, i.e. models that report the fewest factors that most affect the results [7]? (One reading of that question is sketched below.)
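By way of illustration, here is a minimal sketch of that reading: build a full prediction model, then prune it down to the handful of factors that matter most. The data file, its columns, the choice of learner, and the use of scikit-learn's recursive feature elimination are all assumptions invented for this example, not a method taken from the cited work.

```python
# Sketch: prune a defect predictor down to its few most impactful factors.
# "defects.csv" and its columns are hypothetical, invented for this example.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("defects.csv")                  # hypothetical project data
X, y = data.drop(columns="defects"), data["defects"]

# Recursively discard the weakest metrics until only three remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print("Fewest, most impactful factors:", list(X.columns[selector.support_]))
```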
Learning and re-learning:
· It is now possible for company X to access data from past projects of companies Y, Z, etc. Should it? How can old experience apply to new projects? Alternatively, how much new data is required to learn effective models? (One empirical way to ask that last question is sketched below.)
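A learning curve is one way to put the question to data: train on growing fractions of the examples and note where the test score flattens out. In the sketch below, the synthetic data set and the decision-tree learner are placeholders, not choices endorsed by any of the cited studies.

```python
# Sketch: a learning curve as an answer to "how much data is enough?".
# Synthetic data stands in for a company's (hypothetical) project records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Train on growing fractions of the data; note where cv scores flatten out.
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:4d} examples -> mean cv score {score:.2f}")
```

The point where extra examples stop improving the curve gives a rough estimate of how much local data a new project actually needs.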
User feedback:
· Much of our research is “one-shot”; i.e. a learner generates a model and that is the end of the process. But real-world reasoning involves extensive feedback, where old answers prompt new questions. Does any current commercial data mining process support such feedback? If not, then what? (A minimal picture of such a loop is sketched below.)
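For concreteness, the sketch below shows the smallest possible version of such a loop, in the style of active learning: the model answers, a user reacts, and the reaction becomes training data for the next round. Nothing here reflects any existing commercial tool; the "user" is simulated by the labels of a synthetic data set.

```python
# Sketch of a feedback loop: old answers prompt new questions.
# The "user" is simulated; in practice an analyst would supply each label.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = list(range(20))                     # begin with a few examples

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Ask about the prediction the model is least certain of.
    probs = model.predict_proba(X)[:, 1]
    unlabeled = set(range(len(X))) - set(labeled)
    query = min(unlabeled, key=lambda i: abs(probs[i] - 0.5))
    labeled.append(query)                     # the answer feeds the next round
    print(f"round {round_}: asked the user about example {query}")
```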
From prediction to decision making:
· Prediction is a well-studied problem. However, after prediction comes decision making. Can we bridge the gap between prediction and decision making [8,9]?
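A toy example of the gap: suppose a model predicts a module's defect probability, and the decision is whether to inspect that module. Even the simplest bridge needs cost assumptions that prediction research rarely states. The two costs below are invented purely for illustration.

```python
# Sketch: from a predicted probability to a decision, via expected cost.
INSPECT_COST = 1.0    # assumed cost of reviewing one module
ESCAPE_COST = 20.0    # assumed cost of a defect reaching the field

def decide(p_defect: float) -> str:
    """Inspect only when the expected escape cost exceeds inspection cost."""
    return "inspect" if p_defect * ESCAPE_COST > INSPECT_COST else "ship"

for p in (0.01, 0.05, 0.30):
    print(f"p(defect) = {p:.2f} -> {decide(p)}")
```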
Beyond open source:
· Much of the recent literature on mining software repositories concerns itself with open source software. Can we bridge the gap between reasoning about open source systems and (say) embedded software or closed-source proprietary software [10]?
Scaling up:
· We live in the age of big data. In practice, do
these techniques scale up? Or do we need new data miners to handle (e.g.) text
mining data sets?
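One standard answer, sketched below, is out-of-core learning: stream the data in mini-batches so that nothing need fit in memory at once. The two-batch "stream" of bug-report snippets is a stand-in for a real feed; the hashing trick keeps the vectorizer stateless, so the vocabulary never has to be held in memory either.

```python
# Sketch: out-of-core text mining; the model sees one mini-batch at a time.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # stateless: no vocabulary
model = SGDClassifier()

# A stand-in stream of (texts, labels) mini-batches, e.g. bug reports.
stream = [(["crash on null input", "typo in docs"], [1, 0]),
          (["segfault in parser", "update readme"], [1, 0])]

for texts, labels in stream:
    model.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

print(model.predict(vectorizer.transform(["null pointer crash"])))
```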
Precise precision vs general impact:
· Is there a tension between the precision of the data and the impact of the conclusion? There are many examples in the literature of effects reported to three significant figures that are of little interest to industry. Some publications on mining software repositories report “small results”; e.g. a 3% improvement in accuracy. Such results are not so impressive to industry. What industry often needs are “big results” offering (e.g.) a 30% increase in productivity. So can we document such “big results” in industrial data mining?
Training the next generation:
· Finally, what skills are needed for industrial data mining for software engineering? Do we need new kinds of training to develop those skills?
References
The following papers partially address the issues raised above. We
list them here to help bootstrap this discussion. However, it must be said that
any serious read of the following papers shows that they raise far more questions
than they answer.
[1] Michael Lewis. Moneyball: The Art of Winning an Unfair Game. W.W. Norton & Company, 2003.
[2] Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, and Alex Teterev. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows. ICST’11.
[3] Ayse Tosun, Ayse Bener, and Resat Kale. AI-Based Software Defect Predictors: Applications and Benefits in a Case Study. IAAI’10.
[4] Barry Boehm. Value-based software engineering. SIGSOFT Software Engineering Notes 28(2), March 2003.
[5] Phillip Green II, Tim Menzies, Steve Williams, and Oussama El-Rawas. Understanding the Value of Software Engineering Technologies. ASE’09.
[6] Raymond P.L. Buse and Thomas Zimmermann. Information Needs for Software Development Analytics. ICSE’12.
[7] Zhihao Chen, Barry W. Boehm, Tim Menzies, and Daniel Port. Finding the Right Data for Software Cost Modeling. IEEE Software 22(6): 38-46, 2005.
[8] Adam Brady and Tim Menzies. Case-based reasoning vs parametric models for software quality optimization. Promise’10.
[9] Raymond Borges and Tim Menzies. Learning to Change Projects. Promise’12.
[10] Fayola Peters and Tim Menzies. Privacy and Utility for Defect Prediction: Experiments with MORPH. ICSE’12.