Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginners in data science for software engineering, this edited volume proceeds to identify critical questions of contemporary software engineering related to data and models. Learn how to adapt data from other organizations to local problems, mine privatized data, prune spurious information, simplify complex results, update models for new platforms, and more. Each chapter is written by a prominent expert and offers a state-of-the-art solution to an identified problem facing data scientists in software engineering, presenting widely applicable experimental results through a blend of practitioner-focused domain expertise and commentary. Throughout, the editors share best practices collected from their experience training software engineering students and practitioners to master data science, and highlight the methods that are most useful and applicable to the widest range of projects.
Shares the specific experience of leading researchers and the techniques they developed to handle data problems in software engineering
Explains how to start a data science project for software engineering, as well as how to identify and avoid likely pitfalls
Provides a wide range of useful qualitative and quantitative principles, from the very simple to cutting-edge research
Addresses current challenges with software engineering data, such as the lack of local data, access restrictions due to data privacy, and improving data quality by cleaning spurious chunks from data
Researchers, graduate software engineering students, and practitioners with an interest in data science.
Why this book?
Foreword
List of Figures
Chapter 1: Introduction
1.1 Why read this book?
1.2 What do we mean by “sharing”?
1.3 What? (our executive summary)
1.4 How to read this book
1.5 But what about …? (what is not in this book)
1.6 Who? (about the authors)
1.7 Who else? (acknowledgments)
Part I: Data Mining for Managers
Chapter 2: Rules for Managers
Abstract
2.1 The inductive engineering manifesto
2.2 More rules
Chapter 3: Rule #1: Talk to the Users
Abstract
3.1 User biases
3.2 Data mining biases
3.3 Can we avoid bias?
3.4 Managing biases
3.5 Summary
Chapter 4: Rule #2: Know The Domain
Abstract
4.1 Cautionary tale #1: “discovering” random noise
4.2 Cautionary tale #2: jumping at shadows
4.3 Cautionary tale #3: it pays to ask
4.4 Summary
Chapter 5: Rule #3: Suspect Your Data
Abstract
5.1 Controlling Data Collection
5.2 Problems With Controlled Data Collection
5.3 Rinse (and Prune) Before Use
5.4 On the Value of Pruning
5.5 Summary
Chapter 6: Rule #4: Data Science is Cyclic
Abstract
6.1 The Knowledge Discovery Cycle
6.2 Evolving Cyclic Development
6.3 Summary
Part II: Data Mining: A Technical Tutorial
Chapter 7: Data Mining and SE
Abstract
7.1 Some Definitions
7.2 Some Application Areas
Chapter 8: Defect Prediction
Abstract
8.1 Defect Detection Economics
8.2 Static Code Defect Prediction
Chapter 9: Effort Estimation
Abstract
9.1 The Estimation Problem
9.2 How To Make Estimates
Chapter 10: Data Mining (Under The Hood)
Abstract
10.1 Data carving
10.2 About the data
10.3 Cohen pruning
10.4 Discretization
10.5 Column pruning
10.6 Row pruning
10.7 Cluster pruning
10.8 Contrast pruning
10.9 Goal pruning
10.10 Extensions for continuous classes
Part III: Sharing Data
Chapter 11: Sharing Data: Challenges and Methods
Abstract
11.1 Houston, We Have A Problem
11.2 Good News, Everyone
Chapter 12: Learning Contexts
Abstract
12.1 Background
12.2 Manual Methods for Contextualization
12.3 Automatic Methods
12.4 Other Motivation To Find Contexts
12.5 How To Find Local Regions
12.6 Inside Chunk
12.7 Putting It All Together
12.8 Using Chunk
12.9 Closing Remarks
Chapter 13: Cross-Company Learning: Handling The Data Drought
Abstract
13.1 Motivation
13.2 Setting the ground for analyses
13.3 Analysis #1: can CC data be useful for an organization?
13.4 Analysis #2: how to clean up CC data for local tuning?
13.5 Analysis #3: how much local data does an organization need for a local model?
13.6 How trustworthy are these results?
13.7 Are these useful in practice or just number crunching?
13.8 What's new on cross-learning?
13.9 What's the takeaway?
Chapter 14: Building Smarter Transfer Learners
Abstract
14.1 What is actually the problem?
14.2 What do we know so far?
14.3 An example technology: TEAK
14.4 The details of the experiments
14.5 Results
14.6 Discussion
14.7 What are the takeaways?
Chapter 15: Sharing Less Data (Is a Good Thing)
Abstract
15.1 Can We Share Less Data?
15.2 Using Less Data
15.3 Why Share Less Data?
15.4 How To Find Less Data
15.5 What's Next?
Chapter 16: How To Keep Your Data Private
Abstract
16.1 Motivation
16.2 What is PPDP and why is it important?
16.3 What is considered a breach of privacy?
16.4 How to avoid privacy breaches?
16.5 How are privacy-preserving algorithms evaluated?
16.6 Case study: privacy and cross-company defect prediction
Chapter 17: Compensating for Missing Data
Abstract
17.1 Background notes on SEE and instance selection
17.2 Data sets and performance measures
17.3 Experimental conditions
17.4 Results
17.5 Summary
Chapter 18: Active Learning: Learning More With Less
Abstract
18.1 How does the QUICK algorithm work?
18.2 Notes on active learning
18.3 The application and implementation details of QUICK
18.4 How the experiments are designed
18.5 Results
18.6 Summary
Part IV: Sharing Models
Chapter 19: Sharing Models: Challenges and Methods
Abstract
Chapter 20: Ensembles of Learning Machines
Abstract
20.1 When and why ensembles work
20.2 Bootstrap aggregating (bagging)
20.3 Regression trees (RTs) for bagging
20.4 Evaluation framework
20.5 Evaluation of bagging + RTs in SEE
20.6 Further understanding of bagging + RTs in SEE
20.7 Summary
Chapter 21: How to Adapt Models in a Dynamic World
Abstract
21.1 Cross-company data and questions tackled
21.2 Related work
21.3 Formulation of the problem
21.4 Databases
21.5 Potential benefit of CC data
21.6 Making better use of CC data
21.7 Experimental analysis
21.8 Discussion and implications
21.9 Summary
Chapter 22: Complexity: Using Assemblies of Multiple Models
Abstract
22.1 Ensemble of methods
22.2 Solo methods and multimethods
22.2.3 Experimental conditions
22.3 Methodology
22.4 Results
22.5 Summary
Chapter 23: The Importance of Goals in Model-Based Reasoning
Abstract
23.1 Introduction
23.2 Value-based modeling
23.3 Setting up
23.4 Details
23.5 An experiment
23.6 Inside the models
23.7 Results
23.8 Discussion
Chapter 24: Using Goals in Model-Based Reasoning
Abstract
24.1 Multilayer Perceptrons
24.2 Multiobjective evolutionary algorithms
24.3 HaD-MOEA
24.4 Using MOEAs for creating SEE models
24.5 Experimental setup
24.6 The relationship among different performance measures
24.7 Ensembles based on concurrent optimization of performance measures
24.8 Emphasizing particular performance measures
24.9 Further analysis of the model choice
24.10 Comparison against other types of models
24.11 Summary
Chapter 25: A Final Word
Abstract
Bibliography
No. of pages: 406
Language: English
Published: December 15, 2014
Imprint: Morgan Kaufmann
Paperback ISBN: 9780124172951
eBook ISBN: 9780124173071
Tim Menzies
Tim Menzies is a Full Professor of Computer Science at NC State and a former software research chair at NASA. He has authored 200+ publications, many in the area of software analytics. He serves on the editorial boards of IEEE Transactions on Software Engineering, the Automated Software Engineering journal, and the Empirical Software Engineering journal. His research covers artificial intelligence, data mining, and search-based software engineering. He is best known for his work on the PROMISE open source repository of data for reusable software engineering experiments.
Affiliations and expertise
Professor, Computer Science, North Carolina State University, Raleigh, NC, USA
Ekrem Kocaguneli
Ekrem Kocaguneli received his Ph.D. from the Lane Department of Computer Science and Electrical Engineering, West Virginia University. His research focuses on empirical software engineering, on data and model problems associated with software estimation, and on tackling those problems with smarter machine learning algorithms.
Affiliations and expertise
Software Development Engineer at Microsoft
Burak Turhan
Burak Turhan is a Professor of Software Engineering at the University of Oulu, Finland. His research interests include empirical studies of software quality, defect prediction, and cost estimation, as well as data mining for software engineering.
Affiliations and expertise
Professor of Software Engineering, University of Oulu, Finland
Leandro Minku
Leandro L. Minku is a Research Fellow II at the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, University of Birmingham, UK. His research focuses on software prediction models; he is co-author of the first approach able to improve the performance of software predictors trained on cross-company data over single-company data by taking into account the changing environments of software prediction tasks.
Affiliations and expertise
Research Fellow II, Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), University of Birmingham, UK
Fayola Peters
Fayola Peters is a Postdoctoral Researcher at LERO, the Irish Software Engineering Research Center, University of Limerick, Ireland. Along with Mark Grechanik, she is the author of one of the two known algorithms (presented at ICSE'12) that can privatize data while still preserving the data mining properties of that data.
Affiliations and expertise
Postdoctoral Researcher at LERO, the Irish Software Engineering Research Center, University of Limerick, Ireland