Preface
Acknowledgements

CHAPTER 1 Introduction
  1.1 Heterogeneous Parallel Computing
  1.2 Architecture of a Modern GPU
  1.3 Why More Speed or Parallelism?
  1.4 Speeding Up Real Applications
  1.5 Challenges in Parallel Programming
  1.6 Parallel Programming Languages and Models
  1.7 Overarching Goals
  1.8 Organization of the Book
  References

CHAPTER 2 Data Parallel Computing
  2.1 Data Parallelism
  2.2 CUDA C Program Structure
  2.3 A Vector Addition Kernel
  2.4 Device Global Memory and Data Transfer
  2.5 Kernel Functions and Threading
  2.6 Kernel Launch
  2.7 Summary
    Function Declarations
    Kernel Launch
    Built-in (Predefined) Variables
    Run-time API
  2.8 Exercises
  References

CHAPTER 3 Scalable Parallel Execution
  3.1 CUDA Thread Organization
  3.2 Mapping Threads to Multidimensional Data
  3.3 Image Blur: A More Complex Kernel
  3.4 Synchronization and Transparent Scalability
  3.5 Resource Assignment
  3.6 Querying Device Properties
  3.7 Thread Scheduling and Latency Tolerance
  3.8 Summary
  3.9 Exercises

CHAPTER 4 Memory and Data Locality
  4.1 Importance of Memory Access Efficiency
  4.2 Matrix Multiplication
  4.3 CUDA Memory Types
  4.4 Tiling for Reduced Memory Traffic
  4.5 A Tiled Matrix Multiplication Kernel
  4.6 Boundary Checks
  4.7 Memory as a Limiting Factor to Parallelism
  4.8 Summary
  4.9 Exercises

…

CHAPTER 17 Parallel Programming and Computational Thinking
  17.1 Goals of Parallel Computing
  17.2 Problem Decomposition
  17.3 Algorithm Selection
  17.4 Computational Thinking
  17.5 Single Program, Multiple Data, Shared Memory and Locality
  17.6 Strategies for Computational Thinking
  17.7 A Hypothetical Example: Sodium Map of the Brain
  17.8 Summary
  17.9 Exercises
  References

CHAPTER 18 Programming a Heterogeneous Computing Cluster
  18.1 Background
  18.2 A Running Example
  18.3 Message Passing Interface Basics
  18.4 Message Passing Interface Point-to-Point Communication
  18.5 Overlapping Computation and Communication
  18.7 CUDA-Aware Message Passing Interface
  18.8 Summary
  18.9 Exercises
  Reference

CHAPTER 19 Parallel Programming with OpenACC
  19.1 The OpenACC Execution Model
  19.2 OpenACC Directive Format
  19.3 OpenACC by Example
    The OpenACC Kernels Directive
    The OpenACC Parallel Directive
    Comparison of Kernels and Parallel Directives
    OpenACC Data Directives
    OpenACC Loop Optimizations
    OpenACC Routine Directive
    Asynchronous Computation and Data
  19.4 Comparing OpenACC and CUDA
    Portability
    Performance
    Simplicity
  19.5 Interoperability with CUDA and Libraries
    Calling CUDA or Libraries with OpenACC Arrays
    Using CUDA Pointers in OpenACC
    Calling CUDA Device Kernels from OpenACC
  19.6 The Future of OpenACC
  19.7 Exercises

CHAPTER 20 More on CUDA and Graphics Processing Unit Computing
  20.1 Model of Host/Device Interaction
  20.2 Kernel Execution Control
  20.3 Memory Bandwidth and Compute Throughput
  20.4 Programming Environment
  20.5 Future Outlook
  References

CHAPTER 21 Conclusion and Outlook
  21.1 Goals Revisited
  21.2 Future Outlook

Appendix A: An Introduction to OpenCL
Appendix B: Thrust: A Productivity-Oriented Library for CUDA
Appendix C: CUDA Fortran
Appendix D: An Introduction to C++ AMP
Index