High Performance Parallelism Pearls Volume One

Multicore and Many-core Programming Approaches

1st Edition - November 3, 2014
Latest edition
Authors: James Reinders, James Jeffers
Language: English

Description

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming – illustrating the most effective ways to better tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon processors or other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work includes detailed explanations of the programming techniques used, while showing high performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies: "success stories" that demonstrate not just the features of these powerful systems, but also how to leverage parallelism across these heterogeneous systems.

Key features

Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and Intel® Xeon Phi™
Examples from multiple vertical domains illustrating parallel optimizations to modernize real-world codes
Source code available for download to facilitate further exploration

Readership

Software engineers in high-performance computing and system developers in vertical domains hoping to leverage HPC

Foreword

Humongous computing needs: Science years in the making
Open standards
Keen on many-core architecture
Xeon Phi is born: Many cores, excellent vector ISA
Learn highly scalable parallel programming
Future demands grow: Programming models matter

Preface

Inspired by 61 cores: A new era in programming

Chapter 1: Introduction

Abstract
Learning from successful experiences
Code modernization
Modernize with concurrent algorithms
Modernize with vectorization and data locality
Understanding power usage
ISPC and OpenCL anyone?
Intel Xeon Phi coprocessor specific
Many-core, neo-heterogeneous
No “Xeon Phi” in the title, neo-heterogeneous programming
The future of many-core
Downloads

Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme

Abstract
Scientific computing on contemporary computers
A numerical method for shock hydrodynamics
Features of modern architectures
Paths to performance
Summary

Chapter 3: Better Concurrency and SIMD on HBM

Abstract
The application: HIROMB-BOOS-Model
Key usage: DMI
HBM execution profile
Overview for the optimization of HBM
Data structures: Locality done right
Thread parallelism in HBM
Data parallelism: SIMD vectorization
Results
Profiling details
Scaling on processor vs. coprocessor
Contiguous attribute
Summary

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Abstract
Getting started
Version 1.0: Baseline
Version 2.0: ThreadBox
Version 3.0: Stack memory
Version 4.0: Blocking
Version 5.0: Vectorization
Intel Xeon Phi coprocessor results
Summary

Chapter 5: Plesiochronous Phasing Barriers

Abstract
What can be done to improve the code?
What more can be done to improve the code?
Hyper-Thread Phalanx
What is nonoptimal about this strategy?
Coding the Hyper-Thread Phalanx
Back to work
Data alignment
The plesiochronous phasing barrier
Let us do something to recover this wasted time
A few “left to the reader” possibilities
Xeon host performance improvements similar to Xeon Phi
Summary

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Abstract
Motivation and background
Example implementation
Other considerations
Summary

Chapter 7: Deep-Learning Numerical Optimization

Abstract
Fitting an objective function
Objective functions and principal components analysis
Software and example data
Training data
Runtime results
Scaling results
Summary

Chapter 8: Optimizing Gather/Scatter Patterns

Abstract
Gather/scatter instructions in Intel® architecture
Gather/scatter patterns in molecular dynamics
Optimizing gather/scatter patterns
Summary

Chapter 9: A Many-Core Implementation of the Direct N-Body Problem

Abstract
N-Body simulations
Initial solution
Theoretical limit
Reduce the overheads, align your data
Optimize the memory hierarchy
Improving our tiling
What does all this mean to the host version?
Summary

Chapter 10: N-Body Methods

Abstract
Fast N-body methods and direct N-body kernels
Applications of N-body methods
Direct N-body code
Performance results
Summary

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Abstract
Maximizing hardware usage
The N-Body kernel
The offloaded version
A first processor combined with coprocessor version
Version for processor with multiple coprocessors

Chapter 12: Concurrent Kernel Offloading

Abstract
Setting the context
Concurrent kernels on the coprocessor
Force computation in PD using concurrent kernel offloading
The bottom line

Chapter 13: Heterogeneous Computing with MPI

Abstract
Acknowledgments
MPI in the modern clusters
MPI task location
Selection of the DAPL providers
Summary

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Abstract
Power analysis 101
Measuring power and temperature with software
Hardware-based power analysis methods
Summary

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Abstract
Acknowledgments
Early explorations
Beacon system history
Beacon system architecture
Intel MPSS installation procedure
Setting up the resource and workload managers
Health checking and monitoring
Scripting common commands
User software environment
Future directions
Summary

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Abstract
Network configuration concepts and goals
Coprocessor file systems support
Summary

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Abstract
Introduction
Overview of single-reference CC formalism
NWChem software architecture
Engineering an offload solution
Offload architecture
Kernel optimizations
Performance evaluation
Summary
Acknowledgments

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Abstract
Motivation
The benchmark
Baseline benchmarking
Pipeline approach—flat_arena class
Intel® TBB user-managed task arenas
Hierarchical approach—hierarchical_arena class
Performance evaluation
Implication on NUMA architectures
Summary

Chapter 19: Performance Optimization of Black-Scholes Pricing

Abstract
Financial market model basics and the Black-Scholes formula
Case study
Summary

Chapter 20: Data Transfer Using the Intel COI Library

Abstract
First steps with the Intel COI library
COI buffer types and transfer performance
Applications
Summary

Chapter 21: High-Performance Ray Tracing

Abstract
Background
Vectorizing ray traversal
The Embree ray tracing kernels
Using Embree in an application
Performance
Summary

Chapter 22: Portable Performance with OpenCL

Abstract
The dilemma
A brief introduction to OpenCL
A matrix multiply example in OpenCL
OpenCL and the Intel Xeon Phi Coprocessor
Matrix multiply performance results
Case study: Molecular docking
Results: Portable performance
Related work
Summary

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Abstract
Introduction
Performance evaluation
Standard optimizations
Summary

Chapter 24: Profiling-Guided Optimization

Abstract
Matrix transposition in computer science
Tools and methods
“Serial”: Our original in-place transposition
“Parallel”: Adding parallelism with OpenMP
“Tiled”: Improving data locality
“Regularized”: Microkernel with multiversioning
“Planned”: Exposing more parallelism
Summary

Chapter 25: Heterogeneous MPI application optimization with ITAC

Abstract
Asian options pricing
Application design
Synchronization in heterogeneous clusters
Finding bottlenecks with ITAC
Setting up ITAC
Unbalanced MPI run
Manual workload balance
Dynamic “Boss-Workers” load balancing
Conclusion

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Abstract
Introduction
An OOC factorization based on ScaLAPACK
Porting from NVIDIA GPU to the Intel Xeon Phi coprocessor
Numerical results
Conclusions and future work
Acknowledgments

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Abstract
Acknowledgments
Background
Sparse matrix data structures
Parallel SpMV multiplication
Vectorization on the Intel Xeon Phi coprocessor
Evaluation
Summary

Chapter 28: Morton Order Improves Performance

Abstract
Improving cache locality by data ordering
Improving performance
Matrix transpose
Matrix multiply
Summary

Review quotes

"This book will make it much easier in general to exploit high levels of parallelism including programming optimally for the Intel Xeon Phi products. The common programming methodology between the Xeon and Xeon Phi families is good news for the entire scientific and engineering community; the same programming can realize parallel scaling and vectorization for both multicore and many-core." - from the Foreword by Sverre Jarp, CERN Openlab CTO

Product details

Edition: 1
Latest edition
Published: November 4, 2014
Language: English

About the authors

James Reinders

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world’s first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. James has published numerous articles, has contributed to several books, and is widely interviewed on parallelism. James has managed software development groups, customer service and consulting teams, and business development and marketing teams. James is sought after to keynote on parallel programming, and is the author/co-author of three books currently in print, including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Affiliations and expertise

Director and Programming Model Architect, Intel Corporation

James Jeffers

Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead SW Engineering Manager on the program and formed and launched the SW development team. As the program evolved, he became the workloads (applications) and SW performance team manager. He has some of the deepest insight into the market, architecture and programming usages of the MIC product line. He has been a developer and development manager for embedded and high performance systems for close to 30 years.

Affiliations and expertise

Principal Engineer and Visualization Lead, Intel Corporation
