Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, teaches students how to program massively parallel processors. It offers a detailed discussion of various techniques for constructing parallel programs. Case studies are used to demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.
This guide shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. This revised edition contains more parallel programming examples, commonly used libraries such as Thrust, and explanations of the latest tools. It also provides new coverage of CUDA 5.0, improved performance, enhanced development tools, and increased hardware support; expanded coverage of related technologies, including OpenCL, with new material on algorithm patterns, GPU clusters, host programming, and data parallelism; and two new case studies (on MRI reconstruction and molecular visualization) that explore the latest applications of CUDA and GPUs for scientific research and high-performance computing.
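As a taste of the material in Chapter 3, the following is a minimal CUDA C vector-addition program in the style the book teaches. It is an illustrative sketch written for this description, not code reproduced from the book.

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard threads mapped past the end of the array
        c[i] = a[i] + b[i];
    }
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host memory.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device global memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Each thread handles exactly one element, so the launch configuration simply rounds n up to a whole number of 256-thread blocks; the bounds check inside the kernel keeps the extra threads from writing out of range.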
This book should be a valuable resource for advanced students, software engineers, programmers, and hardware engineers.
Preface
Target Audience
How to Use the Book
Online Supplements
Acknowledgements
Dedication
Chapter 1. Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
References
Chapter 2. History of GPU Computing
2.1 Evolution of Graphics Pipelines
2.2 GPGPU: An Intermediate Step
2.3 GPU Computing
References and Further Reading
Chapter 3. Introduction to Data Parallelism and CUDA C
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Vector Addition Kernel
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
3.7 Exercises
References
Chapter 4. Data-Parallel Execution Model
4.1 CUDA Thread Organization
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix-Matrix Multiplication—A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
4.8 Summary
4.9 Exercises
Chapter 5. CUDA Memories
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
5.6 Summary
5.7 Exercises
Chapter 6. Performance Considerations
6.1 Warps and Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
6.5 Summary
6.6 Exercises
References
Chapter 7. Floating-Point Considerations
7.1 Floating-Point Format
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Numerical Stability
7.7 Summary
7.8 Exercises
References
Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
8.1 Background
8.2 1D Parallel Convolution—A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution—General Caching
8.6 Summary
8.7 Exercises
Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
9.1 Background
9.2 A Simple Parallel Scan
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
9.6 Summary
9.7 Exercises
Reference
Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
10.1 Background
10.2 Parallel SpMV Using CSR
10.3 Padding and Transposition
10.4 Using Hybrid to Control Padding
10.5 Sorting and Partitioning for Regularization
10.6 Summary
10.7 Exercises
References
Chapter 11. Application Case Study: Advanced MRI Reconstruction
11.1 Application Background
11.2 Iterative Reconstruction
11.3 Computing F^H d
11.4 Final Evaluation
11.5 Exercises
References
Chapter 12. Application Case Study: Molecular Visualization and Analysis
12.1 Application Background
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
12.4 Memory Coalescing
12.5 Summary
12.6 Exercises
References
Chapter 13. Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
13.2 Problem Decomposition
13.3 Algorithm Selection
13.4 Computational Thinking
13.5 Summary
13.6 Exercises
References
Chapter 14. An Introduction to OpenCL™
14.1 Background
14.2 Data Parallelism Model
14.3 Device Architecture
14.4 Kernel Functions
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
14.7 Summary
14.8 Exercises
References
Chapter 15. Parallel Programming with OpenACC
15.1 OpenACC Versus CUDA C
15.2 Execution Model
15.3 Memory Model
15.4 Basic OpenACC Programs
15.5 Future Directions of OpenACC
15.6 Exercises
Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
16.1 Background
16.2 Motivation
16.3 Basic Thrust Features
16.4 Generic Programming
16.5 Benefits of Abstraction
16.6 Programmer Productivity
16.7 Best Practices
16.8 Exercises
References
Chapter 17. CUDA FORTRAN
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Array in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C Via iso_c_binding
17.6 Kernel Loop Directives and Reduction Operations
17.7 Dynamic Shared Memory
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
17.11 Exercises
Chapter 18. An Introduction to C++ AMP
18.1 Core C++ AMP Features
18.2 Details of the C++ AMP Execution Model
18.3 Managing Accelerators
18.4 Tiled Execution
18.5 C++ AMP Graphics Features
18.6 Summary
18.7 Exercises
Chapter 19. Programming a Heterogeneous Computing Cluster
19.1 Background
19.2 A Running Example
19.3 MPI Basics
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
19.7 Summary
19.8 Exercises
Reference
Chapter 20. CUDA Dynamic Parallelism
20.1 Background
20.2 Dynamic Parallelism Overview
20.3 Important Details
20.4 Memory Visibility
20.5 A Simple Example
20.6 Runtime Limitations
20.7 A More Complex Example
20.8 Summary
Reference
Chapter 21. Conclusion and Future Outlook
21.1 Goals Revisited
21.2 Memory Model Evolution
21.3 Kernel Execution Control Evolution
21.4 Core Performance
21.5 Programming Environment
21.6 Future Outlook
References
Appendix A. Matrix Multiplication Host-Only Version Source Code
Appendix Outline
A.1 matrixmul.cu
A.2 matrixmul_gold.cpp
A.3 matrixmul.h
A.4 assist.h
A.5 Expected Output
Appendix B. GPU Compute Capabilities
Appendix Outline
B.1 GPU Compute Capability Tables
B.2 Memory Coalescing Variations
Index
David B. Kirk
At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.
Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.
Wen-mei W. Hwu