Guerrilla Analytics
A Practical Approach to Working with Data
- 1st Edition - September 23, 2014
- Author: Enda Ridge
- Language: English
- Paperback ISBN: 978-0-12-800218-6
- eBook ISBN: 978-0-12-800503-3
Doing data science is difficult. Projects are typically very dynamic, with requirements that change as data understanding grows. The data itself arrives piecemeal, is added to and replaced, contains undiscovered flaws, and comes from a variety of sources. Teams also have mixed skill sets, and tooling is often limited. Despite these disruptions, a data science team must get off the ground fast and begin demonstrating value with traceable, tested work products. This is when you need Guerrilla Analytics.
In this book, you will learn about:
- The Guerrilla Analytics Principles: simple rules of thumb for maintaining data provenance across the entire analytics life cycle from data extraction, through analysis to reporting
- Reproducible, traceable analytics: how to design and implement work products that are reproducible, testable and stand up to external scrutiny
- Practice tips and war stories: 90 practice tips and 16 war stories based on real-world project challenges encountered in consulting, pre-sales and research
- Preparing for battle: how to set up your team's analytics environment in terms of tooling, skill sets, workflows and conventions
- Data gymnastics: over a dozen analytics patterns that your team will encounter again and again in projects
Data analytics consultants and contractors; industry data analysts in internal business intelligence roles; data analytics managers; (business) students and academics studying data analytics
- Preface
- Part 1: Principles
- Chapter 1: Introducing Guerrilla Analytics
- Summary
- 1.1. What is data analytics?
- 1.2. Types of data analytics projects
- 1.3. Introducing Guerrilla Analytics projects
- 1.4. Guerrilla Analytics definition
- 1.5. Example Guerrilla Analytics projects
- 1.6. Some terminology
- 1.7. Wrap up
- Chapter 2: Guerrilla Analytics: Challenges and Risks
- Summary
- 2.1. The Guerrilla Analytics workflow
- 2.2. Challenges of managing analytics projects
- 2.3. Risks
- 2.4. Impact of failure to address analytics risks
- 2.5. Wrap up
- Chapter 3: Guerrilla Analytics Principles
- Summary
- 3.1. Maintain data provenance despite disruptions
- 3.2. The principles
- 3.3. Applying the principles
- 3.4. Wrap up
- Part 2: Practice
- Chapter 4: Stage 1: Data Extraction
- Summary
- 4.1. Guerrilla Analytics workflow
- 4.2. Pitfalls and risks
- 4.3. Practice tip 1: freeze the source system during data extraction
- 4.4. Practice tip 2: extract data into an agreed file format
- 4.5. Practice tip 3: calculate checksums before data extraction
- 4.6. Practice tip 4: capture front-end reports
- 4.7. Practice tip 5: save raw copies of web pages
- 4.8. Practice tip 6: consistency check OCR data
- 4.9. Wrap up
- Chapter 5: Stage 2: Data Receipt
- Summary
- 5.1. Guerrilla Analytics workflow
- 5.2. Pitfalls and risks
- 5.3. Practice tip 7: have a single location for all data received
- 5.4. Practice tip 8: create unique identifiers for received data
- 5.5. Practice tip 9: store data tracking information in a data log
- 5.6. Practice tip 10: never modify raw data files
- 5.7. Practice tip 11: keep supporting material near the data
- 5.8. Practice tip 12: version-control data received
- 5.9. Bringing it all together
- 5.10. Wrap up
- Chapter 6: Stage 3: Data Load
- Summary
- 6.1. Guerrilla Analytics Workflow
- 6.2. Pitfalls and risks
- 6.3. Practice tip 13: minimize modifications to data before load
- 6.4. Practice tip 14: do data load preparations on a copy of raw data files
- 6.5. Practice tip 15: add identifiers to raw data before loading
- 6.6. Practice tip 16: prefer one-to-one Data Loads
- 6.7. Practice tip 17: preserve the raw file name and data UID
- 6.8. Practice tip 18: load data as plain text
- 6.9. Common challenges
- 6.10. Wrap up
- Chapter 7: Stage 4: Analytics Coding for Ease of Review
- Summary
- 7.1. Guerrilla Analytics workflow
- 7.2. Pitfalls and risks
- 7.3. Practice tip 19: use one code file per data output
- 7.4. Practice tip 20: produce clearly identifiable data outputs
- 7.5. Practice tip 21: write code that runs from start to finish
- 7.6. Practice tip 22: favor code that is not embedded in proprietary file formats
- 7.7. Practice tip 23: clearly label the running order of code files
- 7.8. Practice tip 24: drop all datasets at the start of code execution
- 7.9. Practice tip 25: break up data flows into “data steps”
- 7.10. Practice tip 26: don’t jump in and out of a code file
- 7.11. Practice tip 27: log code execution
- 7.12. Common Challenges
- 7.13. Wrap up
- Chapter 8: Stage 4: Analytics Coding to Maintain Data Provenance
- Summary
- 8.1. Guerrilla Analytics workflow
- 8.2. Examples
- 8.3. Pitfalls and risks
- 8.4. Practice tip 28: clean data at a minimum of locations in a data flow
- 8.5. Practice tip 29: when cleaning a data field, keep the original raw field
- 8.6. Practice tip 30: filter data with flags, not deletions
- 8.7. Practice tip 31: identify fields with metadata
- 8.8. Practice tip 32: create a unique identifier for DATA records
- 8.9. Practice tip 33: rename data fields with a field mapping
- 8.10. Wrap up
- Chapter 9: Stage 6: Creating Work Products
- Summary
- 9.1. Guerrilla Analytics workflow
- 9.2. Examples
- 9.3. The essence of a work product
- 9.4. Pitfalls and risks
- 9.5. Practice tip 34: track work products with a Unique Identifier (UID)
- 9.6. Practice tip 35: keep work product generators and outputs close together
- 9.7. Practice tip 36: avoid clutter in the file system
- 9.8. Practice tip 37: avoid clutter in the DME
- 9.9. Practice tip 38: give output data records a UID
- 9.10. Practice tip 39: version control work products
- 9.11. Practice tip 40: use a convention to name complex outputs
- 9.12. Practice tip 41: log all Work Products
- 9.13. Wrap up
- Chapter 10: Stage 7: Reporting
- Summary
- 10.1. Guerrilla Analytics workflow
- 10.2. What is a report?
- 10.3. Why reports are complicated
- 10.4. Report components
- 10.5. Pitfalls and risks
- 10.6. Practice tip 42: liaise with report writers
- 10.7. Practice tip 43: create one work product per report component
- 10.8. Practice tip 44: make presentation quality work products
- 10.9. Extreme reporting
- 10.10. Wrap up
- Chapter 11: Stage 5: Consolidating Knowledge in Builds
- Summary
- 11.1. Introduction
- 11.2. Pitfalls and risks
- 11.3. Example: the customer address problem
- 11.4. Sources of variation
- 11.5. Definition of a build
- 11.6. The customer address example using a Build
- 11.7. Data Builds
- 11.8. Service Builds
- 11.9. When to start a build
- 11.10. Wrap up
- Part 3: Testing
- Chapter 12: Introduction to Testing
- Summary
- 12.1. Guerrilla Analytics workflow
- 12.2. What is testing?
- 12.3. Why do testing?
- 12.4. Areas of testing
- 12.5. Comparing expected and actual
- 12.6. The challenge of testing Guerrilla Analytics
- 12.7. Practice Tip 61: establish a testing culture
- 12.8. Practice Tip 62: test early
- 12.9. Practice Tip 63: test often
- 12.10. Practice Tip 64: give tests unique identifiers
- 12.11. Practice Tip 65: organize test data by test UID
- 12.12. Next chapters on testing
- 12.13. Wrap up
- Chapter 13: Testing Data
- Summary
- 13.1. Guerrilla Analytics workflow
- 13.2. The five C’s of testing data
- 13.3. Testing data completeness
- 13.4. Testing data correctness
- 13.5. Testing consistency
- 13.6. Testing data coherence
- 13.7. Testing accountability
- 13.8. Implementing data testing
- 13.9. Wrap up
- Chapter 14: Testing Builds
- Summary
- 14.1. Structure of a data build
- 14.2. An illustrative example
- 14.3. Types of build tests
- 14.4. Test code development
- 14.5. Organizing build test code
- 14.6. Organizing test data
- 14.7. Wrap up
- Chapter 15: Testing Work Products
- Summary
- 15.1. Types of testable work products
- 15.2. Ordinary work products
- 15.3. General tips on testing ordinary work products
- 15.4. Testing statistical models
- 15.5. General tips on testing models
- 15.6. Wrap up
- Part 4: Building Guerrilla Analytics Capability
- Introduction
- Chapter 16: People
- Summary
- 16.1. That question again – what is data analytics?
- 16.2. Guerrilla Analytics skills
- 16.3. Programming
- 16.4. Substantive expertise
- 16.5. Communication
- 16.6. “Maths and stats”
- 16.7. Visualization
- 16.8. Software engineering
- 16.9. Mindset
- 16.10. Wrap up
- Chapter 17: Process
- Summary
- 17.1. What is workflow management?
- 17.2. Workflows in Analytics
- 17.3. Levels of review
- 17.4. Linking work products
- 17.5. Classifying work products
- 17.6. Granularity
- 17.7. When to use workflow management
- 17.8. Wrap up
- Chapter 18: Technology
- Summary
- 18.1. Analytics capabilities
- 18.2. Data manipulation environment
- 18.3. Source code control
- 18.4. Access to the command line
- 18.5. High-level scripting language
- 18.6. Visualization
- 18.7. Build tool
- 18.8. Access to the internet
- 18.9. Encryption
- 18.10. Code libraries for data wrangling
- 18.11. Machine learning and statistics libraries
- 18.12. Centralized and controlled file system
- 18.13. Additional technology capabilities
- 18.14. Wrap up
- Chapter 19: Closing Remarks
- 19.1. What was this book about?
- 19.2. Next steps for Guerrilla Analytics
- 19.3. Keep in touch
- Acknowledgments
- Appendix: Data Gymnastics
- References
- Index
- No. of pages: 276
- Imprint: Morgan Kaufmann
Enda Ridge
He has consulted for clients in the public and private sectors, including financial services, insurance, audit and IT security. Enda is an expert in agile analytics for real-world projects where data and requirements change often, resources and tooling are sometimes very limited, and results must be traceable and auditable for high-profile stakeholders. His experience includes analytics to support the forensic investigation of a major US bankruptcy and the remediation of a UK bank’s mis-selling of financial products. He has also applied machine learning and NoSQL approaches to problems in document classification, surveillance and IT access controls. His PhD used Design of Experiments techniques to methodically evaluate algorithm performance.
Enda has authored or co-authored 12 academic research papers, is an invited contributor to edited books and has spoken at several analytics practitioner conferences.
Enda holds a Bachelor’s degree in Mechanical Engineering and Master’s in Applied Computing from the National University of Ireland at Galway and was awarded the National University of Ireland’s Travelling Studentship in Engineering. His PhD was awarded by the University of York, UK.