Intel Xeon Phi Processor High Performance Programming

Knights Landing Edition

2nd Edition - May 31, 2016
Latest edition
Authors: James Jeffers, James Reinders, Avinash Sodani
Language: English

Description

Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the Second-Generation Intel Xeon Phi product family, also called Knights Landing. The authors provide detailed and timely Knights Landing-specific information, programming advice, and real-world examples. The authors distill their years of Xeon Phi programming experience, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative book on the essentials of programming for Intel Xeon Phi products.

Intel® Xeon Phi™ Processor High-Performance Programming is useful even before you ever program a system with an Intel Xeon Phi processor. To help ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi processors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and prepare you better for Intel Xeon Phi processors.

Key features

A practical guide to the essentials for programming Intel Xeon Phi processors
Definitive coverage of the Knights Landing architecture
Presents best practices for portable, high-performance computing and a familiar and proven threads and vectors programming model
Includes real-world code examples that highlight uses of the unique aspects of this new highly parallel, high-performance computational product
Covers use of MCDRAM, AVX-512, Intel® Omni-Path fabric, many cores (up to 72), and many threads (4 per core)
Covers software developer tools, libraries and programming models
Covers using Knights Landing as a processor and a coprocessor

Readership

Software engineers, high-performance computing and supercomputing developers, and scientific researchers in need of high-performance computing resources

Table of contents

Foreword

Extending the Sports Car Analogy to Higher Performance
What Exactly Is The Unfair Advantage?
Peak Performance Versus Drivable/Usable Performance
How Does The Unfair Advantage Relate to This Book?
Closing Comments

Preface

Sports Car Tutorial: Introduction for Many-Core Is Online
Parallelism Pearls: Inspired by Many Cores
Organization
Structured Parallel Programming
What’s New?
lotsofcores.com

Section I: Knights Landing

Introduction

Chapter 1: Introduction

Abstract
Introduction to Many-Core Programming
Trend: More Parallelism
Why Intel® Xeon Phi™ Processors Are Needed
Processors Versus Coprocessor
Measuring Readiness for Highly Parallel Execution
What About GPUs?
Enjoy the Lack of Porting Needed but Still Tune!
Transformation for Performance
Hyper-Threading Versus Multithreading
Programming Models
Why We Could Skip To Section II Now
For More Information

Chapter 2: Knights Landing overview

Abstract
Overview
Instruction Set
Architecture Overview
Motivation: Our Vision and Purpose
Summary
For More Information

Chapter 3: Programming MCDRAM and Cluster modes

Abstract
Programming for Cluster Modes
Programming for Memory Modes
Query Memory Mode and MCDRAM Available
SNC Performance Implications of Allocation and Threading
How to Not Hard Code the NUMA Node Numbers
Approaches to Determining What to Put in MCDRAM
Why Rebooting Is Required to Change Modes
BIOS
Summary
For More Information

Chapter 4: Knights Landing architecture

Abstract
Tile Architecture
Cluster Modes
Memory Interleaving
Memory Modes
Interactions of Cluster and Memory Modes
Summary
For More Information

Chapter 5: Intel Omni-Path Fabric

Abstract
Overview
Performance and Scalability
Transport Layer APIs
Quality of Service
Virtual Fabrics
Unicast Address Resolution
Multicast Address Resolution
Summary
For More Information

Chapter 6: μarch optimization advice

Abstract
Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
Memory Subsystem
μarch Nuances (tile)
Direct Mapped MCDRAM Cache
Advice: Use AVX-512
Summary
For More Information

Section II: Parallel Programming

Introduction

Chapter 7: Programming overview for Knights Landing

Abstract
To Refactor, or Not to Refactor, That Is the Question
Evolutionary Optimization of Applications
Revolutionary Optimization of Applications
Know When to Hold’em and When to Fold’em
For More Information

Chapter 8: Tasks and threads

Abstract
OpenMP
Fortran 2008
Intel TBB
hStreams
Summary
For More Information

Chapter 9: Vectorization

Abstract
Why Vectorize?
How to Vectorize
Three Approaches to Achieving Vectorization
Six-Step Vectorization Methodology
Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
Compiler Tips
Compiler Options
Compiler Directives
Use Array Sections to Encourage Vectorization
Look at What the Compiler Created: Assembly Code Inspection
Numerical Result Variations With Vectorization
Summary
For More Information

Chapter 10: Vectorization advisor

Abstract
Getting Started With Intel Advisor for Knights Landing
Enabling and Improving AVX-512 Code With the Survey Report
Memory Access Pattern Report
AVX-512 Gather/Scatter Profiler
Mask Utilization and FLOPs Profiler
Advisor Roofline Report
Explore AVX-512 Code Characteristics Without AVX-512 Hardware
Example — Analysis of a Computational Chemistry Code
Summary
For More Information

Chapter 11: Vectorization with SDLT

Abstract
What Is SDLT?
Getting Started
SDLT Basics
Example: Normalizing 3D Points With SIMD
What Is Wrong With AOS Memory Layout and SIMD?
SIMD Prefers Unit-Stride Memory Accesses
Alpha-Blended Overlay Reference
Alpha-Blended Overlay With SDLT
Additional Features
Summary
For More Information

Chapter 12: Vectorization with AVX-512 intrinsics

Abstract
What Are Intrinsics?
AVX-512 Overview
Migrating From Knights Corner
AVX-512 Detection
Learning AVX-512 Instructions
Learning AVX-512 Intrinsics
Step-by-Step Example Using AVX-512 Intrinsics
Results Using Our Intrinsics Code
For More Information

Chapter 13: Performance libraries

Abstract
Intel Performance Library Overview
Intel Math Kernel Library Overview
Intel Data Analytics Library Overview
Together: MKL and DAAL
Intel Integrated Performance Primitives Library Overview
Intel Performance Libraries and Intel Compilers
Native (Direct) Library Usage
Offloading to Knights Landing While Using a Library
Precision Choices and Variations
Performance Tip for Faster Dynamic Libraries
For More Information

Chapter 14: Profiling and timing

Abstract
Introduction to Knights Landing Tuning
Event-Monitoring Registers
Efficiency Metrics
Potential Performance Issues
Intel VTune Amplifier XE Product
Performance Application Programming Interface
MPI Analysis: ITAC
HPCToolkit
Tuning and Analysis Utilities
Timing
Summary
For More Information

Chapter 15: MPI

Abstract
Internode Parallelism
MPI on Knights Landing
MPI Overview
How to Run MPI Applications
Analyzing MPI Application Runs
Tuning of MPI Applications
Heterogeneous Clusters
Recent Trends in MPI Coding
Putting It All Together
Summary
For More Information

Chapter 16: PGAS programming models

Abstract
To Share or Not to Share
Why Use PGAS on Knights Landing?
Programming With PGAS
Performance Evaluation
Beyond PGAS
Summary
For More Information

Chapter 17: Software-defined visualization

Abstract
Motivation for Software-Defined Visualization
Software-Defined Visualization Architecture
OpenSWR: OpenGL Raster-Graphics Software Rendering
Embree: High-Performance Ray Tracing Kernel Library
OSPRay: Scalable Ray Tracing Framework
Summary
Image Attributions
For More Information

Chapter 18: Offload to Knights Landing

Abstract
Offload Programming Model—Using With Knights Landing
Processors Versus Coprocessor
Offload Model Considerations
OpenMP Target Directives
Concurrent Host and Target Execution
Offload Over Fabric
Summary
For More Information

Chapter 19: Power analysis

Abstract
Power Demand Gates Exascale
Power 101
Hardware-Based Power Analysis Techniques
Software-Based Knights Landing Power Analyzer
ManyCore Platform Software Package Power Tools
Running Average Power Limit
Performance Profiling on Knights Landing
Intel Remote Management Module
Summary
For More Information

Section III: Pearls

Introduction

Chapter 20: Optimizing classical molecular dynamics in LAMMPS

Abstract
Acknowledgment
Molecular Dynamics
LAMMPS
Knights Landing Processors
LAMMPS Optimizations
Data Alignment
Data Types and Layout
Vectorization
Neighbor List
Long-Range Electrostatics
MPI and OpenMP Parallelization
Performance Results
System, Build, and Run Configurations
Workloads
Organic Photovoltaic Molecules
Hydrocarbon Mixtures
Rhodopsin Protein in Solvated Lipid Bilayer
Coarse Grain Liquid Crystal Simulation
Coarse-Grain Water Simulation
Summary
For More Information

Chapter 21: High performance seismic simulations

Abstract
High-Order Seismic Simulations
Numerical Background
Application Characteristics
Intel Architecture as Compute Engine
Highly Efficient Small Matrix Kernels
Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
Dense Matrix Kernel Generation: AVX2
Dense Matrix Kernel Generation: AVX-512
Kernel Performance Benchmarking
Incorporating Knights Landing’s Different Memory Subsystems
Performance Evaluation
Mount Merapi
1992 Landers
Summary and Take-Aways
For More Information

Chapter 22: Weather research and forecasting (WRF)

Abstract
WRF Overview
WRF Execution Profile: Relatively Flat
History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
Our Early Experiences With WRF on Knights Landing
Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
WRF CONUS12km Benchmark Performance
MCDRAM Bandwidth
Vectorization: Boost of AVX-512 Over AVX2
Core Scaling
Summary
For More Information

Chapter 23: N-Body simulation

Abstract
Parallel Programming for Noncomputer Scientists
Step-by-Step Improvements
N-Body Simulation
Optimization
Initial Implementation (Optimization Step 0)
Thread Parallelism (Optimization Step 1)
Scalar Performance Tuning (Optimization Step 2)
Vectorization With SOA (Optimization Step 3)
Memory Traffic (Optimization Step 4)
Impact of MCDRAM on Performance
Summary
For More Information

Chapter 24: Machine learning

Abstract
Convolutional Neural Networks
OverFeat-FAST Results
For More Information

Chapter 25: Trinity workloads

Abstract
Out of the Box Performance
Optimizing MiniGhost OpenMP Performance
Summary
For More Information

Chapter 26: Quantum chromodynamics

Abstract
LQCD
The QPhiX Library and Code Generator
Wilson-Dslash Operator
Configuring the QPhiX Code Generator
The Experimental Setup
Results
Conclusion
For More Information

Review quotes

"I believe you will find this book is an invaluable reference to help develop your own Unfair Advantage." – James A. Ang, Ph.D., Manager, Exascale Computing Program, Sandia National Laboratories, New Mexico, USA

Product details

Edition: 2
Latest edition
Published: June 17, 2016
Language: English

About the authors

James Jeffers

Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead software engineering manager on the program and formed and launched the software development team. As the program evolved, he became the workloads (applications) and software performance team manager. He has some of the deepest insight into the market, architecture, and programming usages of the MIC product line. He has been a developer and development manager for embedded and high-performance systems for close to 30 years.

Affiliations and expertise

Principal Engineer and Visualization Lead, Intel Corporation

James Reinders

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world’s first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. James has published numerous articles, contributed to several books and is widely interviewed on parallelism. James has managed software development groups, customer service and consulting teams, business development and marketing teams. James is sought after to keynote on parallel programming, and is the author/co-author of three books currently in print including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Affiliations and expertise

Director and Programming Model Architect, Intel Corporation

Avinash Sodani

Avinash Sodani is the chief architect of the Knights Landing Xeon Phi Processor. He has many years of experience architecting high-end processors and was previously one of the architects of the first Core™ processor, codenamed Nehalem.

Affiliations and expertise

PhD, Senior Principal Engineer and Chief Architect of Knights Landing Processor, Intel
