LIMITED OFFER
Save 50% on book bundles
Immediately download your ebook while waiting for your print delivery. No promo code needed.
Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative first book on the essentials of programming for this new architecture and these new products.
This book is useful even before you ever touch a system with an Intel Xeon Phi coprocessor. To ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether it is based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.
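To give a flavor of the "scale across cores, vectorize within a core" style of technique the book emphasizes, here is a minimal sketch in C with OpenMP. It is not code from the book; the saxpy kernel and its names are illustrative only.

/* Illustrative sketch, not from the book: a SAXPY kernel (y = a*x + y)
 * written so it both scales across cores and vectorizes on each core. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* OpenMP splits the iterations across all available hardware threads;
     * the simple stride-1 loop body lets the compiler auto-vectorize it
     * for the target's SIMD width. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Compiled with OpenMP enabled (for example, -fopenmp), the same source runs efficiently on Intel Xeon processors and, in the book's "transform-and-tune" spirit, carries over to highly parallel hardware such as the Intel Xeon Phi coprocessor.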
Software engineers, high-performance and supercomputing developers, and scientific researchers in need of high-performance computing resources
Foreword
Preface
Organization
Lots-of-cores.com
Acknowledgements
Chapter 1. Introduction
Trend: more parallelism
Why Intel® Xeon Phi™ coprocessors are needed
Platforms with coprocessors
The first Intel® Xeon Phi™ coprocessor
Keeping the “Ninja Gap” under control
Transforming-and-tuning double advantage
When to use an Intel® Xeon Phi™ coprocessor
Maximizing performance on processors first
Why scaling past one hundred threads is so important
Maximizing parallel program performance
Measuring readiness for highly parallel execution
What about GPUs?
Beyond the ease of porting to increased performance
Transformation for performance
Hyper-threading versus multithreading
Coprocessor major usage model: MPI versus offload
Compiler and programming models
Cache optimizations
Examples, then details
For more information
Chapter 2. High Performance Closed Track Test Drive!
Looking under the hood: coprocessor specifications
Starting the car: communicating with the coprocessor
Taking it out easy: running our first code
Starting to accelerate: running more than one thread
Pedal to the metal: hitting full speed using all cores
Easing in to the first curve: accessing memory bandwidth
High speed banked curve: maximizing memory bandwidth
Back to the pit: a summary
Chapter 3. A Friendly Country Road Race
Preparing for our country road trip: chapter focus
Getting a feel for the road: the 9-point stencil algorithm
At the starting line: the baseline 9-point stencil implementation
Rough road ahead: running the baseline stencil code
Cobblestone street ride: vectors but not yet scaling
Open road all-out race: vectors plus scaling
Some grease and wrenches!: a bit of tuning
Summary
For more information
Chapter 4. Driving Around Town: Optimizing a Real-World Code Example
Choosing the direction: the basic diffusion calculation
Turn ahead: accounting for boundary effects
Finding a wide boulevard: scaling the code
Thunder road: ensuring vectorization
Peeling out: peeling code from the inner loop
Trying higher octane fuel: improving speed using data locality and tiling
High speed driver certificate: summary of our high speed tour
Chapter 5. Lots of Data (Vectors)
Why vectorize?
How to vectorize
Five approaches to achieving vectorization
Six-step vectorization methodology
Streaming through caches: data layout, alignment, prefetching, and so on
Compiler tips
Compiler options
Compiler directives
Use array sections to encourage vectorization
Look at what the compiler created: assembly code inspection
Numerical result variations with vectorization
Summary
For more information
Chapter 6. Lots of Tasks (not Threads)
OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL
OpenMP
Fortran 2008
Intel® TBB
Intel® Cilk™ Plus
Summary
For more information
Chapter 7. Offload
Two offload models
Choosing offload vs. native execution
Language extensions for offload
Using pragma/directive offload
Using offload with shared virtual memory
About asynchronous computation
About asynchronous data transfer
Applying the target attribute to multiple declarations
Performing file I/O on the coprocessor
Logging stdout and stderr from offloaded code
Summary
For more information
Chapter 8. Coprocessor Architecture
The Intel® Xeon Phi™ coprocessor family
Coprocessor card design
Intel® Xeon Phi™ coprocessor silicon overview
Individual coprocessor core architecture
Instruction and multithread processing
Cache organization and memory access considerations
Prefetching
Vector processing unit architecture
Coprocessor PCIe system interface and DMA
Coprocessor power management capabilities
Reliability, availability, and serviceability (RAS)
Coprocessor system management controller (SMC)
Benchmarks
Summary
For more information
Chapter 9. Coprocessor System Software
Coprocessor software architecture overview
Coprocessor programming models and options
Coprocessor software architecture components
Intel® Manycore Platform Software Stack
Linux support for Intel® Xeon Phi™ coprocessors
Tuning memory allocation performance
Summary
For more information
Chapter 10. Linux on the Coprocessor
Coprocessor Linux baseline
Introduction to coprocessor Linux bootstrap and configuration
Default coprocessor Linux configuration
Changing coprocessor configuration
The micctrl utility
Adding software
Coprocessor Linux boot process
Coprocessors in a Linux cluster
Summary
For more information
Chapter 11. Math Library
Intel Math Kernel Library overview
Intel MKL and Intel compiler
Coprocessor support overview
Using the coprocessor in native mode
Using automatic offload mode
Using compiler-assisted offload
Precision choices and variations
Summary
For more information
Chapter 12. MPI
MPI overview
Using MPI on Intel® Xeon Phi™ coprocessors
Prerequisites (batteries not included)
Offload from an MPI rank
Using MPI natively on the coprocessor
Summary
For more information
Chapter 13. Profiling and Timing
Event monitoring registers on the coprocessor
Efficiency metrics
Potential performance issues
Intel® VTune™ Amplifier XE product
Performance application programming interface
MPI analysis: Intel Trace Analyzer and Collector
Timing
Summary
For more information
Chapter 14. Summary
Advice
Additional resources
Another book coming?
Feedback appreciated
Glossary
Index