Skip to main content

Programming Massively Parallel Processors

A Hands-on Approach


Description

Programming Massively Parallel Processors discusses the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.

This book describes computational thinking techniques that will enable students to think about problems in ways that are amenable to high-performance parallel computing. It utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments. Students learn how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.
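The following is not from the book, but it gives a flavor of the CUDA programming model the early chapters build up: a minimal vector-addition sketch in which each thread computes one output element (the book's own running example is matrix–matrix multiplication). It assumes an installed CUDA toolkit and a CUDA-capable GPU; error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread adds one pair of elements.
// blockIdx, blockDim, and threadIdx are the predefined
// variables covered in Chapters 3 and 4.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host allocations and initialization.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device allocations and host-to-device transfers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

The decomposition shown here, one thread per output element with a bounds guard, is the same pattern the book scales up to tiled matrix multiplication and the MRI and molecular visualization case studies.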

This book is recommended for advanced students, software engineers, programmers, and hardware engineers.

Key features

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing.
  • Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments.
  • Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.

Readership

Advanced students, software engineers, programmers, and hardware engineers

Table of contents

Preface
Acknowledgments
Dedication
Chapter 1 Introduction
    1.1 GPUs as Parallel Computers
    1.2 Architecture of a Modern GPU
    1.3 Why More Speed or Parallelism?
    1.4 Parallel Programming Languages and Models
    1.5 Overarching Goals
    1.6 Organization of the Book
Chapter 2 History of GPU Computing
    2.1 Evolution of Graphics Pipelines
        2.1.1 The Era of Fixed-Function Graphics Pipelines
        2.1.2 Evolution of Programmable Real-Time Graphics
        2.1.3 Unified Graphics and Computing Processors
        2.1.4 GPGPU: An Intermediate Step
    2.2 GPU Computing
        2.2.1 Scalable GPUs
        2.2.2 Recent Developments
    2.3 Future Trends
Chapter 3 Introduction to CUDA
    3.1 Data Parallelism
    3.2 CUDA Program Structure
    3.3 A Matrix–Matrix Multiplication Example
    3.4 Device Memories and Data Transfer
    3.5 Kernel Functions and Threading
    3.6 Summary
        3.6.1 Function declarations
        3.6.2 Kernel launch
        3.6.3 Predefined variables
        3.6.4 Runtime API
Chapter 4 CUDA Threads
    4.1 CUDA Thread Organization
    4.2 Using blockIdx and threadIdx
    4.3 Synchronization and Transparent Scalability
    4.4 Thread Assignment
    4.5 Thread Scheduling and Latency Tolerance
    4.6 Summary
    4.7 Exercises
Chapter 5 CUDA™ Memories
    5.1 Importance of Memory Access Efficiency
    5.2 CUDA Device Memory Types
    5.3 A Strategy for Reducing Global Memory Traffic
    5.4 Memory as a Limiting Factor to Parallelism
    5.5 Summary
    5.6 Exercises
Chapter 6 Performance Considerations
    6.1 More on Thread Execution
    6.2 Global Memory Bandwidth
    6.3 Dynamic Partitioning of SM Resources
    6.4 Data Prefetching
    6.5 Instruction Mix
    6.6 Thread Granularity
    6.7 Measured Performance and Summary
    6.8 Exercises
Chapter 7 Floating Point Considerations
    7.1 Floating-Point Format
        7.1.1 Normalized Representation of M
        7.1.2 Excess Encoding of E
    7.2 Representable Numbers
    7.3 Special Bit Patterns and Precision
    7.4 Arithmetic Accuracy and Rounding
    7.5 Algorithm Considerations
    7.6 Summary
    7.7 Exercises
Chapter 8 Application Case Study: Advanced MRI Reconstruction
    8.1 Application Background
    8.2 Iterative Reconstruction
    8.3 Computing FHd
        Step 1. Determine the Kernel Parallelism Structure
        Step 2. Getting Around the Memory Bandwidth Limitation
        Step 3. Using Hardware Trigonometry Functions
        Step 4. Experimental Performance Tuning
    8.4 Final Evaluation
    8.5 Exercises
Chapter 9 Application Case Study: Molecular Visualization and Analysis
    9.1 Application Background
    9.2 A Simple Kernel Implementation
    9.3 Instruction Execution Efficiency
    9.4 Memory Coalescing
    9.5 Additional Performance Comparisons
    9.6 Using Multiple GPUs
    9.7 Exercises
Chapter 10 Parallel Programming and Computational Thinking
    10.1 Goals of Parallel Programming
    10.2 Problem Decomposition
    10.3 Algorithm Selection
    10.4 Computational Thinking
    10.5 Exercises
Chapter 11 A Brief Introduction to OpenCL™
    11.1 Background
    11.2 Data Parallelism Model
    11.3 Device Architecture
    11.4 Kernel Functions
    11.5 Device Management and Kernel Launch
    11.6 Electrostatic Potential Map in OpenCL
    11.7 Summary
    11.8 Exercises
Chapter 12 Conclusion and Future Outlook
    12.1 Goals Revisited
    12.2 Memory Architecture Evolution
        12.2.1 Large Virtual and Physical Address Spaces
        12.2.2 Unified Device Memory Space
        12.2.3 Configurable Caching and Scratch Pad
        12.2.4 Enhanced Atomic Operations
        12.2.5 Enhanced Global Memory Access
    12.3 Kernel Execution Control Evolution
        12.3.1 Function Calls within Kernel Functions
        12.3.2 Exception Handling in Kernel Functions
        12.3.3 Simultaneous Execution of Multiple Kernels
        12.3.4 Interruptible Kernels
    12.4 Core Performance
        12.4.1 Double-Precision Speed
        12.4.2 Better Control Flow Efficiency
    12.5 Programming Environment
    12.6 A Bright Outlook
Appendix A Matrix Multiplication Host-Only Version Source Code
    A.1 matrixmul.cu
    A.2 matrixmul_gold.cpp
    A.3 matrixmul.h
    A.4 assist.h
    A.5 Expected Output
Appendix B GPU Compute Capabilities
    B.1 GPU Compute Capability Tables
    B.2 Memory Coalescing Variations
Index

Review quotes

"For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA (tm), a C-like data parallel language, and Tesla(tm), the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on the heterogeneous CPU-GPU hardware ...This book is a valuable addition to the recently reinvigorated parallel computing literature."—David Patterson, Director of The Parallel Computing Research Laboratory and the Pardee Professor of Computer Science, U.C. Berkeley. Co-author of Computer Architecture: A Quantitative Approach

"Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors—a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems."—Nicolas Pinto, MIT, NVIDIA Fellow, 2009

"I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book. This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massive parallel programming and CUDA."—Mateo Valero, Director, Barcelona Supercomputing Center

"The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors."—Mike Giles, Professor of Scientific Computing, University of Oxford

"This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come."—Hanspeter Pfister, Harvard University

"This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcomed and highly useful across inter-disciplinary fields."—Shannon Steinfadt, Kent State University

"GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors."–-CNNMoney.com

"This book is a valuable resource for all students from science and engineering disciplines where parallel programming skills are needed to allow solving compute-intensive problems."—BCS: The British Computer Society’s online journal


About the authors


David B. Kirk

David B. Kirk is known for major contributions to graphics, hardware, and algorithms. Before pursuing his Ph.D. at Caltech, he earned B.S. and M.S. degrees in mechanical engineering from MIT and worked at Raster Technologies and Hewlett-Packard’s Apollo Systems Division. After completing his doctorate, he served as chief scientist and head of technology at Crystal Dynamics. In 1997, he became Chief Scientist at NVIDIA. Dr. Kirk has received numerous honors including the IEEE Seymour Cray Computer Engineering Award and ACM SIGGRAPH Computer Graphics Achievement Award. He is a member of the U.S. National Academy of Engineering.
Affiliations and expertise
NVIDIA Fellow


Wen-mei W. Hwu

Wen-mei W. Hwu is a Senior Director of Research at NVIDIA and the Sanders-AMD Endowed Chair Professor Emeritus of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His work focuses on parallel computing, covering architecture, implementation, compilers, and algorithms. Dr. Hwu has received numerous honors, including the ACM/IEEE Eckert-Mauchly Award, the ACM Grace Murray Hopper Award, and the IEEE B.R. Rau Award. He is an IEEE and ACM Fellow. He earned his Ph.D. in Computer Science from UC Berkeley.
Affiliations and expertise
CTO, MulticoreWare; Professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA