LIMITED OFFER
Save 50% on book bundles
Immediately download your ebook while waiting for your print delivery. No promo code needed.
Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative first book on the essentials of programming for this new architecture and these new products.
This book is useful even before you ever touch a system with an Intel Xeon Phi coprocessor. To ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether it is based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.
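To give a flavor of the "scale across cores, vectorize within a core" style of technique the book emphasizes, here is a minimal sketch in C with OpenMP. It is not code from the book; the saxpy kernel and its names are illustrative only.

/* Illustrative sketch, not from the book: a SAXPY kernel (y = a*x + y)
 * written so it both scales across cores and vectorizes on each core. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* OpenMP splits the iterations across all available hardware threads;
     * the simple stride-1 loop body lets the compiler auto-vectorize it
     * for the target's SIMD width. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Compiled with OpenMP enabled (for example, -fopenmp), the same source runs efficiently on Intel Xeon processors and, in the book's "transform-and-tune" spirit, carries over to highly parallel hardware such as the Intel Xeon Phi coprocessor.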
Software engineers, high-performance and supercomputing developers, and scientific researchers in need of high-performance computing resources
Foreword
Preface
Organization
Lots-of-cores.com
Acknowledgements
Chapter 1. Introduction
Trend: more parallelism
Why Intel® Xeon Phi™ coprocessors are needed
Platforms with coprocessors
The first Intel® Xeon Phi™ coprocessor
Keeping the “Ninja Gap” under control
Transforming-and-tuning double advantage
When to use an Intel® Xeon Phi™ coprocessor
Maximizing performance on processors first
Why scaling past one hundred threads is so important
Maximizing parallel program performance
Measuring readiness for highly parallel execution
What about GPUs?
Beyond the ease of porting to increased performance
Transformation for performance
Hyper-threading versus multithreading
Coprocessor major usage model: MPI versus offload
Compiler and programming models
Cache optimizations
Examples, then details
For more information
Chapter 2. High Performance Closed Track Test Drive!
Looking under the hood: coprocessor specifications
Starting the car: communicating with the coprocessor
Taking it out easy: running our first code
Starting to accelerate: running more than one thread
Pedal to the metal: hitting full speed using all cores
Easing in to the first curve: accessing memory bandwidth
High speed banked curve: maximizing memory bandwidth
Back to the pit: a summary
Chapter 3. A Friendly Country Road Race
Preparing for our country road trip: chapter focus
Getting a feel for the road: the 9-point stencil algorithm
At the starting line: the baseline 9-point stencil implementation
Rough road ahead: running the baseline stencil code
Cobblestone street ride: vectors but not yet scaling
Open road all-out race: vectors plus scaling
Some grease and wrenches!: a bit of tuning
Summary
For more information
Chapter 4. Driving Around Town: Optimizing a Real-World Code Example
Choosing the direction: the basic diffusion calculation
Turn ahead: accounting for boundary effects
Finding a wide boulevard: scaling the code
Thunder road: ensuring vectorization
Peeling out: peeling code from the inner loop
Trying higher octane fuel: improving speed using data locality and tiling
High speed driver certificate: summary of our high speed tour
Chapter 5. Lots of Data (Vectors)
Why vectorize?
How to vectorize
Five approaches to achieving vectorization
Six-step vectorization methodology
Streaming through caches: data layout, alignment, prefetching, and so on
Compiler tips
Compiler options
Compiler directives
Use array sections to encourage vectorization
Look at what the compiler created: assembly code inspection
Numerical result variations with vectorization
Summary
For more information
Chapter 6. Lots of Tasks (not Threads)
OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL
OpenMP
Fortran 2008
Intel® TBB
Intel® Cilk™ Plus
Summary
For more information
Chapter 7. Offload
Two offload models
Choosing offload vs. native execution
Language extensions for offload
Using pragma/directive offload
Using offload with shared virtual memory
About asynchronous computation
About asynchronous data transfer
Applying the target attribute to multiple declarations
Performing file I/O on the coprocessor
Logging stdout and stderr from offloaded code
Summary
For more information
Chapter 8. Coprocessor Architecture
The Intel® Xeon Phi™ coprocessor family
Coprocessor card design
Intel® Xeon Phi™ coprocessor silicon overview
Individual coprocessor core architecture
Instruction and multithread processing
Cache organization and memory access considerations
Prefetching
Vector processing unit architecture
Coprocessor PCIe system interface and DMA
Coprocessor power management capabilities
Reliability, availability, and serviceability (RAS)
Coprocessor system management controller (SMC)
Benchmarks
Summary
For more information
Chapter 9. Coprocessor System Software
Coprocessor software architecture overview
Coprocessor programming models and options
Coprocessor software architecture components
Intel® Manycore Platform Software Stack
Linux support for Intel® Xeon Phi™ coprocessors
Tuning memory allocation performance
Summary
For more information
Chapter 10. Linux on the Coprocessor
Coprocessor Linux baseline
Introduction to coprocessor Linux bootstrap and configuration
Default coprocessor Linux configuration
Changing coprocessor configuration
The micctrl utility
Adding software
Coprocessor Linux boot process
Coprocessors in a Linux cluster
Summary
For more information
Chapter 11. Math Library
Intel Math Kernel Library overview
Intel MKL and Intel compiler
Coprocessor support overview
Using the coprocessor in native mode
Using automatic offload mode
Using compiler-assisted offload
Precision choices and variations
Summary
For more information
Chapter 12. MPI
MPI overview
Using MPI on Intel® Xeon Phi™ coprocessors
Prerequisites (batteries not included)
Offload from an MPI rank
Using MPI natively on the coprocessor
Summary
For more information
Chapter 13. Profiling and Timing
Event monitoring registers on the coprocessor
Efficiency metrics
Potential performance issues
Intel® VTune™ Amplifier XE product
Performance application programming interface
MPI analysis: Intel Trace Analyzer and Collector
Timing
Summary
For more information
Chapter 14. Summary
Advice
Additional resources
Another book coming?
Feedback appreciated
Glossary
Index