Foreword
- Extending the Sports Car Analogy to Higher Performance
- What Exactly Is The Unfair Advantage?
- Peak Performance Versus Drivable/Usable Performance
- How Does The Unfair Advantage Relate to This Book?
- Closing Comments
Preface
- Sports Car Tutorial: Introduction for Many-Core Is Online
- Parallelism Pearls: Inspired by Many Cores
- Organization
- Structured Parallel Programming
- What’s New?
- lotsofcores.com
Section I: Knights Landing
Introduction
Chapter 1: Introduction
- Abstract
- Introduction to Many-Core Programming
- Trend: More Parallelism
- Why Intel® Xeon Phi™ Processors Are Needed
- Processors Versus Coprocessor
- Measuring Readiness for Highly Parallel Execution
- What About GPUs?
- Enjoy the Lack of Porting Needed but Still Tune!
- Transformation for Performance
- Hyper-Threading Versus Multithreading
- Programming Models
- Why We Could Skip To Section II Now
- For More Information
Chapter 2: Knights Landing overview
- Abstract
- Overview
- Instruction Set
- Architecture Overview
- Motivation: Our Vision and Purpose
- Summary
- For More Information
Chapter 3: Programming MCDRAM and Cluster modes
- Abstract
- Programming for Cluster Modes
- Programming for Memory Modes
- Query Memory Mode and MCDRAM Available
- SNC Performance Implications of Allocation and Threading
- How to Not Hard Code the NUMA Node Numbers
- Approaches to Determining What to Put in MCDRAM
- Why Rebooting Is Required to Change Modes
- BIOS
- Summary
- For More Information
Chapter 4: Knights Landing architecture
- Abstract
- Tile Architecture
- Cluster Modes
- Memory Interleaving
- Memory Modes
- Interactions of Cluster and Memory Modes
- Summary
- For More Information
Chapter 5: Intel Omni-Path Fabric
- Abstract
- Overview
- Performance and Scalability
- Transport Layer APIs
- Quality of Service
- Virtual Fabrics
- Unicast Address Resolution
- Multicast Address Resolution
- Summary
- For More Information
Chapter 6: μarch optimization advice
- Abstract
- Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
- Memory Subsystem
- μarch Nuances (tile)
- Direct Mapped MCDRAM Cache
- Advice: Use AVX-512
- Summary
- For More Information
Section II: Parallel Programming
Introduction
Chapter 7: Programming overview for Knights Landing
- Abstract
- To Refactor, or Not to Refactor, That Is the Question
- Evolutionary Optimization of Applications
- Revolutionary Optimization of Applications
- Know When to Hold’em and When to Fold’em
- For More Information
Chapter 8: Tasks and threads
- Abstract
- OpenMP
- Fortran 2008
- Intel TBB
- hStreams
- Summary
- For More Information
Chapter 9: Vectorization
- Abstract
- Why Vectorize?
- How to Vectorize
- Three Approaches to Achieving Vectorization
- Six-Step Vectorization Methodology
- Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
- Compiler Tips
- Compiler Options
- Compiler Directives
- Use Array Sections to Encourage Vectorization
- Look at What the Compiler Created: Assembly Code Inspection
- Numerical Result Variations With Vectorization
- Summary
- For More Information
Chapter 10: Vectorization advisor
- Abstract
- Getting Started With Intel Advisor for Knights Landing
- Enabling and Improving AVX-512 Code With the Survey Report
- Memory Access Pattern Report
- AVX-512 Gather/Scatter Profiler
- Mask Utilization and FLOPs Profiler
- Advisor Roofline Report
- Explore AVX-512 Code Characteristics Without AVX-512 Hardware
- Example — Analysis of a Computational Chemistry Code
- Summary
- For More Information
Chapter 11: Vectorization with SDLT
- Abstract
- What Is SDLT?
- Getting Started
- SDLT Basics
- Example Normalizing 3D Points With SIMD
- What Is Wrong With AOS Memory Layout and SIMD?
- SIMD Prefers Unit-Stride Memory Accesses
- Alpha-Blended Overlay Reference
- Alpha-Blended Overlay With SDLT
- Additional Features
- Summary
- For More Information
Chapter 12: Vectorization with AVX-512 intrinsics
- Abstract
- What Are Intrinsics?
- AVX-512 Overview
- Migrating From Knights Corner
- AVX-512 Detection
- Learning AVX-512 Instructions
- Learning AVX-512 Intrinsics
- Step-by-Step Example Using AVX-512 Intrinsics
- Results Using Our Intrinsics Code
- For More Information
Chapter 13: Performance libraries
- Abstract
- Intel Performance Library Overview
- Intel Math Kernel Library Overview
- Intel Data Analytics Library Overview
- Together: MKL and DAAL
- Intel Integrated Performance Primitives Library Overview
- Intel Performance Libraries and Intel Compilers
- Native (Direct) Library Usage
- Offloading to Knights Landing While Using a Library
- Precision Choices and Variations
- Performance Tip for Faster Dynamic Libraries
- For More Information
Chapter 14: Profiling and timing
- Abstract
- Introduction to Knights Landing Tuning
- Event-Monitoring Registers
- Efficiency Metrics
- Potential Performance Issues
- Intel VTune Amplifier XE Product
- Performance Application Programming Interface
- MPI Analysis: ITAC
- HPCToolkit
- Tuning and Analysis Utilities
- Timing
- Summary
- For More Information
Chapter 15: MPI
- Abstract
- Internode Parallelism
- MPI on Knights Landing
- MPI Overview
- How to Run MPI Applications
- Analyzing MPI Application Runs
- Tuning of MPI Applications
- Heterogeneous Clusters
- Recent Trends in MPI Coding
- Putting It All Together
- Summary
- For More Information
Chapter 16: PGAS programming models
- Abstract
- To Share or Not to Share
- Why Use PGAS on Knights Landing?
- Programming With PGAS
- Performance Evaluation
- Beyond PGAS
- Summary
- For More Information
Chapter 17: Software-defined visualization
- Abstract
- Motivation for Software-Defined Visualization
- Software-Defined Visualization Architecture
- OpenSWR: OpenGL Raster-Graphics Software Rendering
- Embree: High-Performance Ray Tracing Kernel Library
- OSPRay: Scalable Ray Tracing Framework
- Summary
- Image Attributions
- For More Information
Chapter 18: Offload to Knights Landing
- Abstract
- Offload Programming Model—Using With Knights Landing
- Processors Versus Coprocessor
- Offload Model Considerations
- OpenMP Target Directives
- Concurrent Host and Target Execution
- Offload Over Fabric
- Summary
- For More Information
Chapter 19: Power analysis
- Abstract
- Power Demand Gates Exascale
- Power 101
- Hardware-Based Power Analysis Techniques
- Software-Based Knights Landing Power Analyzer
- ManyCore Platform Software Package Power Tools
- Running Average Power Limit
- Performance Profiling on Knights Landing
- Intel Remote Management Module
- Summary
- For More Information
Section III: Pearls
Introduction
Chapter 20: Optimizing classical molecular dynamics in LAMMPS
- Abstract
- Acknowledgment
- Molecular Dynamics
- LAMMPS
- Knights Landing Processors
- LAMMPS Optimizations
- Data Alignment
- Data Types and Layout
- Vectorization
- Neighbor List
- Long-Range Electrostatics
- MPI and OpenMP Parallelization
- Performance Results
- System, Build, and Run Configurations
- Workloads
- Organic Photovoltaic Molecules
- Hydrocarbon Mixtures
- Rhodopsin Protein in Solvated Lipid Bilayer
- Coarse-Grain Liquid Crystal Simulation
- Coarse-Grain Water Simulation
- Summary
- For More Information
Chapter 21: High performance seismic simulations
- Abstract
- High-Order Seismic Simulations
- Numerical Background
- Application Characteristics
- Intel Architecture as Compute Engine
- Highly Efficient Small Matrix Kernels
- Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
- Dense Matrix Kernel Generation: AVX2
- Dense Matrix Kernel Generation: AVX-512
- Kernel Performance Benchmarking
- Incorporating Knights Landing’s Different Memory Subsystems
- Performance Evaluation
- Mount Merapi
- 1992 Landers
- Summary and Take-Aways
- For More Information
Chapter 22: Weather research and forecasting (WRF)
- Abstract
- WRF Overview
- WRF Execution Profile: Relatively Flat
- History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
- Our Early Experiences With WRF on Knights Landing
- Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
- WRF CONUS12km Benchmark Performance
- MCDRAM Bandwidth
- Vectorization: Boost of AVX-512 Over AVX2
- Core Scaling
- Summary
- For More Information
Chapter 23: N-Body simulation
- Abstract
- Parallel Programming for Noncomputer Scientists
- Step-by-Step Improvements
- N-Body Simulation
- Optimization
- Initial Implementation (Optimization Step 0)
- Thread Parallelism (Optimization Step 1)
- Scalar Performance Tuning (Optimization Step 2)
- Vectorization With SOA (Optimization Step 3)
- Memory Traffic (Optimization Step 4)
- Impact of MCDRAM on Performance
- Summary
- For More Information
Chapter 24: Machine learning
- Abstract
- Convolutional Neural Networks
- OverFeat-FAST Results
- For More Information
Chapter 25: Trinity workloads
- Abstract
- Out of the Box Performance
- Optimizing MiniGhost OpenMP Performance
- Summary
- For More Information
Chapter 26: Quantum chromodynamics
- Abstract
- LQCD
- The QPhiX Library and Code Generator
- Wilson-Dslash Operator
- Configuring the QPhiX Code Generator
- The Experimental Setup
- Results
- Conclusion
- For More Information