Wednesday, June 13, 2012

Gaps in the literature


The panel discussed many things, including the "missing links" between theory and practice in the data mining literature.
Some notes on those gaps are offered below.

Data-driven management:

·        Data mining is useless unless management listens to the conclusions. Our data repositories become data cemeteries if no one uses them. How can we foster a data-driven management culture [1]?

From anecdote to evidence:

·        For a supposedly data-driven field, there are surprisingly few (a) exemplar case studies in the literature or (b) lists of lessons learned from all this work. If a business user or a graduate student asks, “In this field, what works best, and why do you think so?”, we have little to point to. We know of some isolated successes [2,3], but can we build a compendium of successful projects delivering actionable, timely, and insightful results to industry?

Conclusions to impact:

·        Making predictions about (say) defect densities or effort estimates is all well and good, but what about the myriad other issues that industry wants to address? For example, what about value-based concerns [4,5] or other management information needs [6]? How can we turn our prediction models into most-impactful prediction models, i.e. models that report the fewest factors that most affect the results [7]? (One way to hunt for such minimal factor sets is sketched below.)
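
To make that last question concrete, here is a minimal sketch of one possible approach, assuming a Python/scikit-learn setting: rank the factors by a learner's importance scores, then grow the factor set only while cross-validation keeps rewarding the additions. The synthetic data, the 0.01 threshold, and the choice of random forests are all illustrative, not anything the panel endorsed.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a project data set (e.g., effort or defect data).
    X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=1)

    # Rank the factors, most impactful first.
    model = RandomForestRegressor(random_state=1).fit(X, y)
    ranked = np.argsort(model.feature_importances_)[::-1]

    # Grow the factor set one factor at a time; stop once gains become trivial.
    best_score, best_k = -np.inf, 0
    for k in range(1, len(ranked) + 1):
        score = cross_val_score(RandomForestRegressor(random_state=1),
                                X[:, ranked[:k]], y, cv=5).mean()
        if score <= best_score + 0.01:   # negligible improvement: stop here
            break
        best_score, best_k = score, k

    print(best_k, "factors suffice; cross-val score =", round(best_score, 2))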

Learning and re-learning:

·        It is now possible for company X to access data from the past projects of companies Y, Z, etc. Should it? How can old experience apply to new projects? Alternatively, how much new data is required to learn effective models? (A learning-curve sketch of that last question follows.)
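
One way to put numbers on the "how much new data?" question is a learning curve: train on growing slices of local data and watch where performance plateaus. A minimal sketch in Python/scikit-learn; the sample sizes, the decision-tree learner, and the synthetic data are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for within-company project data.
    X, y = make_classification(n_samples=1000, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Train on growing slices of local data; the plateau hints at how much
    # new data is "enough" before borrowed cross-company models can be retired.
    for n in (25, 50, 100, 200, 400, len(X_pool)):
        clf = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
        print(n, "examples ->", round(clf.score(X_test, y_test), 2))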

User feedback:

·        Much of our research is “one-shot”; i.e. a learner generates a model and that is the end of the process. But real-world reasoning involves extensive feedback, where old answers prompt new questions. Does any current commercial data mining process support such feedback? If not, then what? (A sketch of one such feedback loop follows.)
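
We know of no standard commercial answer here, but one research-flavored possibility is a pool-based active-learning loop: the learner asks the user about its most uncertain case, retrains, and repeats. A minimal Python/scikit-learn sketch; the "user" is simulated by the known labels, and the five-round budget is arbitrary.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)
    labeled = list(range(10))           # a few initial answers from the user
    pool = list(range(10, len(X)))      # questions not yet asked

    for round_no in range(5):
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        # Ask about the case the current model is least certain of.
        proba = clf.predict_proba(X[pool])[:, 1]
        ask = pool[int(np.argmin(np.abs(proba - 0.5)))]
        labeled.append(ask)             # the "user" answers; old answers
        pool.remove(ask)                # prompt the next question
        print("round", round_no, "asked about case", ask,
              "accuracy now", round(clf.score(X, y), 2))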

From prediction to decision making:

·        Prediction is a well-studied problem. However, after prediction comes decision making. Can we bridge the gap between prediction and decision making [8,9]? (A small what-if sketch follows.)
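
A simple illustration of that gap: a predictor only scores a project as-is, while a decision requires comparing candidate changes to the project. A minimal what-if sketch in Python/scikit-learn; the actions and factors are hypothetical, and we assume a lower predicted outcome (say, fewer defects) is better.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=4, random_state=2)
    model = RandomForestRegressor(random_state=2).fit(X, y)

    # Prediction scores one project; decision making compares candidate
    # changes and recommends the one with the best predicted outcome.
    project = X[0]
    actions = {"do nothing":     project,
               "halve factor 0": project * np.array([0.5, 1, 1, 1]),
               "halve factor 3": project * np.array([1, 1, 1, 0.5])}

    for name, changed in actions.items():
        print(name, "-> predicted outcome",
              round(model.predict([changed])[0], 1))
    best = min(actions, key=lambda a: model.predict([actions[a]])[0])
    print("recommended action:", best)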

Beyond open source:

·        Much of the recent literature on mining software repositories concerns itself with open source software. Can we bridge the gap between reasoning about open source systems and (say) embedded software or closed-source proprietary software [10]?

Scaling up:

·        We live in the age of big data. In practice, do our current techniques scale up? Or do we need new data miners to handle (e.g.) large text mining data sets? (One known scaling tactic is sketched below.)
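
One known tactic is out-of-core (incremental) learning: stream the data through a learner chunk by chunk, rather than loading it all into memory. A minimal Python/scikit-learn sketch using SGDClassifier's partial_fit; the in-memory chunks here merely simulate data arriving off disk.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=10000, random_state=0)
    clf = SGDClassifier(random_state=0)
    classes = np.unique(y)

    # Feed the learner one chunk at a time, as if each chunk came off disk.
    for start in range(0, len(X), 1000):
        chunk = slice(start, start + 1000)
        clf.partial_fit(X[chunk], y[chunk], classes=classes)

    print("accuracy on the full set:", round(clf.score(X, y), 2))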

Precise precision vs general impact:

·        Is there a tension between the precision of the data and the impact of the conclusion? The literature contains many effects reported to three significant figures that are of little interest to industry. Some publications on mining software repositories report “small results”, e.g. a 3% improvement in accuracy. Such results do not impress industry. What industry often needs are “big results” offering (e.g.) a 30% increase in productivity. So can we document such “big results” in industrial data mining?

Training the next generation:

·        Finally, what are the skills needed for industrial data mining for software engineering? Do we need new kinds of training to develop those skills?

References

The following papers partially address the issues raised above. We list them here to help bootstrap this discussion. However, it must be said that any serious reading of the following papers shows that they raise far more questions than they answer.
  1. Moneyball: The Art of Winning an Unfair Game, Michael Lewis, W.W. Norton & Company Inc., 2003
  2. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows, Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo and Alex Teterev, ICST’11
  3. AI-Based Software Defect Predictors: Applications and Benefits in a Case Study, Ayse Tosun, Ayse Bener, Resat Kale, IAAI’10
  4. Value-Based Software Engineering, Barry Boehm, SIGSOFT Softw. Eng. Notes 28, 2 (March 2003)
  5. Understanding the Value of Software Engineering Technologies, Phillip Green II, Tim Menzies, Steve Williams, Oussama El-Rawas, ASE’09
  6. Information Needs for Software Development Analytics, Raymond P.L. Buse, Thomas Zimmermann, ICSE’12
  7. Finding the Right Data for Software Cost Modeling, Zhihao Chen, Barry W. Boehm, Tim Menzies, Daniel Port, IEEE Software 22(6): 38-46 (2005)
  8. Case-Based Reasoning vs Parametric Models for Software Quality Optimization, Adam Brady and Tim Menzies, PROMISE’10
  9. Learning to Change Projects, Raymond Borges and Tim Menzies, PROMISE’12
  10. Privacy and Utility for Defect Prediction: Experiments with MORPH, Fayola Peters and Tim Menzies, ICSE’12

Comments:

  1. I wonder, is there a place where we can gather information on the different shortcomings and on how far we as a research field have progressed?

  2. Well... we could start a comment thread here.
