Perspectives on Data Science for Software Engineering
1st Edition - July 12, 2016
Authors: Tim Menzies, Laurie Williams, Thomas Zimmermann
Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book originated during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics.
At that conference, many discussions centered on how to transfer the knowledge of seasoned software engineers and data scientists to newcomers in the field. While many books cover the basics of data mining and software engineering, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights from the community's leaders, who gathered to share hard-won lessons from the trenches.
Ideas are presented in digestible chapters designed to be applicable across many domains. Topics covered include data collection, data sharing, data mining, and how to apply these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.
Presents the wisdom of community experts, derived from a summit on software analytics
Provides contributed chapters that share discrete ideas and techniques from the trenches
Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
Presented in clear chapters designed to be applicable across many domains
Software engineering data scientists; may be of interest to students in a graduate seminar.
Perspectives on data science for software engineering
Why This Book?
About This Book
Software analytics and its application in practice
Six Perspectives of Software Analytics
Experiences in Putting Software Analytics into Practice
Seven principles of inductive software engineering: What we do is different
Different and Important
Principle #1: Humans Before Algorithms
Principle #2: Plan for Scale
Principle #3: Get Early Feedback
Principle #4: Be Open Minded
Principle #5: Be Smart With Your Learning
Principle #6: Live With the Data You Have
Principle #7: Develop a Broad Skill Set That Uses a Big Toolkit
The need for data analysis patterns (in software engineering)
The Remedy Metaphor
Software Engineering Data
Needs of Data Analysis Patterns
Building Remedies for Data Analysis in Software Engineering Research
From software data to software theory: The path less traveled
Pathways of Software Repository Research
From Observation, to Theory, to Practice
Why theory matters
How to Use Theory
How to Build Theory
In Summary: Find a Theory or Build One Yourself
Mining apps for anomalies
The Million-Dollar Question
Detecting Abnormal Behavior
A Treasure Trove of Data …
… but Also Obstacles
Embrace dynamic artifacts
Can We Minimize the USB Driver Test Suite?
Still Not Convinced? Here’s More
Dynamic Artifacts Are Here to Stay
Mobile app store analytics
Understanding End Users
The naturalness of software
Transforming Software Practice
Advances in release readiness
Predictive Test Metrics
Universal Release Criteria Model
Best Estimation Technique
Using Models in Release Management
Research to Implementation: A Difficult (but Rewarding) Journey
How to tame your online services
Service Analysis Studio
Measuring individual productivity
No Single and Simple Best Metric for Success/Productivity
Measure the Process, Not Just the Outcome
Allow for Measures to Evolve
Goodhart’s Law and the Effect of Measuring
How to Measure Individual Productivity?
Stack traces reveal attack surfaces
Another Use of Stack Traces?
Attack Surface Approximation
Visual analytics for software engineering data
Gameplay data plays nicer when divided into cohorts
Cohort Analysis as a Tool for Gameplay Data
Play to Lose
Case Studies of Gameplay Data
Challenges of Using Cohorts
A success story in applying data science in practice
Communication Process—Best Practices
There's never enough time to do all the testing you want
The Impact of Short Release Cycles (There's Not Enough Time)
Learn From Your Test Execution History
The Art of Testing Less
Tests Evolve Over Time
The perils of energy mining: measure a bunch, compare just once
A Tale of Two HTTPs
Let's ENERGISE Your Software Energy Experiments
Identifying fault-prone files in large industrial software systems
A tailored suit: The big opportunity in personalizing issue tracking
Many Choices, Nothing Great
The Need for Personalization
Developer Dashboards or “A Tailored Suit”
Room for Improvement
What counts is decisions, not numbers—Toward an analytics design sheet
The Decision-Making Process
The Analytics Design Sheet
Example: App Store Release Analysis
A large ecosystem study to understand the effect of programming languages on code quality
Study Design and Analysis
Code reviews are not for finding defects—Even established tools need occasional evaluation
The Interview Guide
Collecting Background Data
Conducting the Interview
Post-Interview Discussion and Notes
Now Go Interview!
Look for state transitions in temporal data
Bikeshedding in Software Engineering
Summarizing Temporal Data
Card-sorting: From text to themes
Tools! Tools! We need tools!
Tools in Science
The Tools We Need
Recommendations for Tool Building
Evidence-based software engineering
The Aim and Methodology of EBSE
Strength of Evidence
Evidence and Theory
Which machine learning method do you need?
Do Additional Data Arrive Over Time?
Are Changes Likely to Happen Over Time?
If You Have a Prediction Problem, What Do You Really Need to Predict?
Do You Have a Prediction Problem Where Unlabeled Data are Abundant and Labeled Data are Expensive?
Are Your Data Imbalanced?
Do You Need to Use Data From Different Sources?
Do You Have Big Data?
Do You Have Little Data?
Structure your unstructured data first!: The case of summarizing unstructured data with tag clouds
Unstructured Data in Software Engineering
Summarizing Unstructured Software Data
Parse that data! Practical tips for preparing your raw data for analysis
Use Assertions Everywhere
Print Information About Broken Records
Use Sets or Counters to Store Occurrences of Categorical Variables
Restart Parsing in the Middle of the Data Set
Test on a Small Subset of Your Data
Redirect Stdout and Stderr to Log Files
Store Raw Data Alongside Cleaned Data
Finally, Write a Verifier Program to Check the Integrity of Your Cleaned Data
Natural language processing is no free lunch
Natural Language Data in Software Projects
Natural Language Processing
How to Apply NLP to Software Projects
Aggregating empirical evidence for more trustworthy decisions
What Does Data From Empirical Studies Look Like?
The Evidence-Based Paradigm and Systematic Reviews
How Far Can We Use the Outcomes From Systematic Review to Make Decisions?
If it is software engineering, it is (probably) a Bayesian factor
Causing the Future With Bayesian Networks
The Need for a Hybrid Approach in Software Analytics
Use the Methodology, Not the Model
Becoming Goldilocks: Privacy and data sharing in “just right” conditions
The “Data Drought”
Change is Good
Don’t Share Everything
Share Your Leaders
The wisdom of the crowds in predictive modeling for software engineering
The Wisdom of the Crowds
So… How is That Related to Predictive Modeling for Software Engineering?
Examples of Ensembles and Factors Affecting Their Accuracy
Crowds for Transferring Knowledge and Dealing With Changes
Crowds for Multiple Goals
A Crowd of Insights
Ensembles as Versatile Tools
Combining quantitative and qualitative methods (when mining software data)
Prologue: We Have Solid Empirical Evidence!
Correlation is Not Causation and, Even If We Can Claim Causation…
Collect Your Data: People and Artifacts
Build a Theory Upon Your Data
Conclusion: The Truth is Out There!
A process for surviving survey design and sailing through survey deployment
The Lure of the Sirens: The Attraction of Surveys
Navigating the Open Seas: A Successful Survey Process in Software Engineering
Log it all?
A Parable: The Blind Woman and an Elephant
Misinterpreting Phenomena in Software Engineering
Using Data to Expand Perspectives
Why provenance matters
What are the Key Entities?
What are the Key Tasks?
Open from the beginning
Why the Difference?
Be Open or Be Irrelevant
Reducing time to insight
What is Insight Anyway?
Time to Insight
The Insight Value Chain
What To Do
A Warning on Waste
Five steps for success: How to deploy data science in your organizations
Step 1. Choose the Right Questions for the Right Team
Step 2. Work Closely With Your Consumers
Step 3. Validate and Calibrate Your Data
Step 4. Speak Plainly to Give Results Business Value
Step 5. Go the Last Mile—Operationalizing Predictive Models
How the release process impacts your software analytics
Linking Defect Reports and Code Changes to a Release
How the Version Control System Can Help
Security cannot be measured
Gotcha #1: Security is Negatively Defined
Gotcha #2: Having Vulnerabilities is Actually Normal
Gotcha #3: “More Vulnerabilities” Does Not Always Mean “Less Secure”
Gotcha #4: Design Flaws Are Not Usually Tracked
Gotcha #5: Hackers are Innovative Too
An Unfair Question
Gotchas from mining bug reports
Do Bug Reports Describe Code Defects?
It's the User That Defines the Work Item Type
Do Developers Apply Atomic Changes?
Make visualization part of your analysis process
Leveraging Visualizations: An Example With Software Repository Histories
How to Jump the Pitfalls
Don't forget the developers! (and be careful with your assumptions)
Are We Actually Helping Developers?
Some Observations and Recommendations
Limitations and context of research
Small Research Projects
Data Quality of Open Source Repositories
Lack of Industrial Representatives at Conferences
Research From Industry
Actionable metrics are better metrics
What Would You Say… I Should DO?
Cyclomatic Complexity: An Interesting Case
Are Unactionable Metrics Useless?
Replicated results are more trustworthy
The Replication Crisis
Reliability and Validity in Studies
So What Should Researchers Do?
So What Should Practitioners Do?
Diversity in software engineering research
What Is Diversity and Representativeness?
What Can We Do About It?
Once is not enough: Why we need replication
Motivating Example and Tips
Exploring the Unknown
Types of Empirical Results
Do's and Don'ts
Mere numbers aren't enough: A plea for visualization
Numbers Are Good, but…
Case Studies on Visualization
What to Do
Don’t embarrass yourself: Beware of bias in your data
Dewey Defeats Truman
Impact of Bias in Software Engineering
Which Features Should I Look At?
Operational data are missing, incorrect, and decontextualized
A Life of a Defect
What to Do?
Data science revolution in process improvement and assessment?
Correlation is not causation (or, when not to scream “Eureka!”)
What Not to Do
Examples from Software Engineering
What to Do
In Summary: Wait and Reflect Before You Report
Software analytics for small software companies: More questions than answers
The Reality for Small Software Companies
Small Software Companies Projects: Smaller and Shorter
Different Goals and Needs
What to Do About the Dearth of Data?
What to Do on a Tight Budget?
Software analytics under the lamp post (or what Star Trek teaches us about the importance of asking the right questions)
Learning from Data
Which Bin is Mine?
What can go wrong in software engineering experiments?
Evaluate Different Design Alternatives
Match Data Analysis and Experimental Design
Do Not Rely on Statistical Significance Alone
Do a Power Analysis
Find Explanations for Results
Follow Guidelines for Reporting Experiments
Improving the reliability of experimental results
One size does not fit all
While models are good, simple explanations are better
How Do We Compare a USB2 Driver to a USB3 Driver?
The Issue With Our Initial Approach
“Just Tell us What Is Different and Nothing More”
Users Prefer Simple Explanations
The white-shirt effect: Learning from failed expectations
The Right Reaction
Simpler questions can lead to better insights
Context of the Software Analytics Project
Providing Predictions on Buggy Changes
How to Read the Graph?
(Anti-)Patterns in the Error-Handling Graph
How to Act on (Anti-)Patterns?
Continuously experiment to assess values early on
Most Ideas Fail to Show Value
Every Idea Can Be Tested With an Experiment
How Do We Find Good Hypotheses and Conduct the Right Experiments?
Lies, damned lies, and analytics: Why big data needs thick data
How Great It Is, to Have Data Like You
Looking for Answers in All the Wrong Places
Beware the Reality Distortion Field
Build It and They Will Come, but Should We?
To Classify Is Human, but Analytics Relies on Algorithms
Lean In: How Ethnography Can Improve Software Analytics and Vice Versa
Finding the Ethnographer Within
The world is your test suite
Watch the World and Learn
Crashes, Hangs, and Bluescreens
The Need for Speed
Protecting Data and Identity
Discovering Confusion and Missing Requirements
Monitoring Is Mandatory
No. of pages: 408
Published: July 12, 2016
Imprint: Morgan Kaufmann
Paperback ISBN: 9780128042069
eBook ISBN: 9780128042618
Tim Menzies is a Full Professor of Computer Science at NC State University and a former software research chair at NASA. He has authored 200+ publications, many in the area of software analytics. He serves on the editorial boards of IEEE Transactions on Software Engineering, the Automated Software Engineering journal, and the Empirical Software Engineering journal. His research spans artificial intelligence, data mining, and search-based software engineering. He is best known for his work on the PROMISE open-source repository of data for reusable software engineering experiments.
Affiliations and expertise
Professor, Computer Science, North Carolina State University, Raleigh, NC, USA
Laurie Williams is a Full Professor and Associate Department Head of Computer Science at NC State University. She has authored 180+ publications, many applying software analytics. She is on the editorial boards of IEEE Transactions on Software Engineering, Information and Software Technology, and IEEE Software.
Affiliations and expertise
Professor and Associate Department Head of Computer Science, North Carolina State University, Raleigh, NC, USA
Thomas Zimmermann is a researcher in the Research in Software Engineering (RiSE) group at Microsoft Research, an adjunct assistant professor at the University of Calgary, and affiliate faculty at the University of Washington. He is best known for his work on systematic mining of version archives and bug databases to conduct empirical studies and to build tools that support developers and managers. He received two ACM SIGSOFT Distinguished Paper Awards for his work published at the ICSE '07 and FSE '08 conferences.