Programming Massively Parallel Processors (大规模并行处理器程序设计)
Book Information
Author: Kirk et al. (USA)
Publisher: Tsinghua University Press
Publication date: July 1, 2010
Format: 16开 (16mo)
ISBN: 9787302229735
Category: Books >> Computers/Networking >> Programming >> Other
List price: ¥36.00
Synopsis
This book introduces the basic concepts of parallel programming and GPU architecture, explores in detail the techniques used to build parallel programs, and uses case studies to demonstrate the entire development process of a parallel program, from the initial idea of parallel computing through to a final implementation that is both practical and efficient.
Features of the Book
Introduces the ideas behind parallel computing, so that readers can carry this way of thinking about problems into high-performance parallel computing.
Introduces the use of CUDA, a software development platform created by NVIDIA specifically for massively parallel environments (a minimal sketch follows this list).
Shows how to use the CUDA programming model and OpenCL to obtain high performance and high reliability.
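To give a flavor of the CUDA programming model the book teaches, the following is a minimal, hedged sketch (not taken from the book): a vector-addition kernel together with the host-side device-memory allocation, data transfer, and kernel launch that Chapters 3 and 4 of the table of contents cover. blockIdx, threadIdx, and blockDim are CUDA's predefined thread-index variables; all other names are illustrative only.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device memory and host-to-device transfer (Chapter 3 topics)
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: a grid of thread blocks (Chapter 4 topics)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check it
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}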
Table of Contents
Preface
Acknowledgments
Dedication
CHAPTER 1 INTRODUCTION
1.1 GPUs as Parallel Computers
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Parallel Programming Languages and Models
1.5 Overarching Goals
1.6 Organization of the Book
CHAPTER 2 HISTORY OF GPU COMPUTING
2.1 Evolution of Graphics Pipelines
2.1.1 The Era of Fixed-Function Graphics Pipelines
2.1.2 Evolution of Programmable Real-Time Graphics
2.1.3 Unified Graphics and Computing Processors
2.1.4 GPGPU: An Intermediate Step
2.2 GPU Computing
2.2.1 Scalable GPUs
2.2.2 Recent Developments
2.3 Future Trends
CHAPTER 3 INTRODUCTION TO CUDA
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Matrix-Matrix Multiplication Example
3.4 Device Memories and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
3.6.1 Function Declarations
3.6.2 Kernel Launch
3.6.3 Predefined Variables
3.6.4 Runtime API
CHAPTER 4 CUDA THREADS
4.1 CUDA Thread Organization
4.2 Using blockIdx and threadIdx
4.3 Synchronization and Transparent Scalability
4.4 Thread Assignment
4.5 Thread Scheduling and Latency Tolerance
4.6 Summary
4.7 Exercises
CHAPTER 5 CUDA™ MEMORIES
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 Memory as a Limiting Factor to Parallelism
5.5 Summary
5.6 Exercises
CHAPTER 6 PERFORMANCE CONSIDERATIONS
6.1 More on Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of SM Resources
6.4 Data Prefetching
6.5 Instruction Mix
6.6 Thread Granularity
6.7 Measured Performance and Summary
6.8 Exercises
CHAPTER 7 FLOATING POINT CONSIDERATIONS
7.1 Floating-Point Format
7.1.1 Normalized Representation of M
7.1.2 Excess Encoding of E
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Summary
7.7 Exercises
CHAPTER 8 APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION
8.1 Application Background
8.2 Iterative Reconstruction
8.3 Computing FHd
Step 1. Determine the Kernel Parallelism Structure
Step 2. Getting Around the Memory Bandwidth Limitation
Step 3. Using Hardware Trigonometry Functions
Step 4. Experimental Performance Tuning
8.4 Final Evaluation
8.5 Exercises
CHAPTER 9 APPLICATION CASE STUDY: MOLECULAR VISUALIZATION AND ANALYSIS
CHAPTER 10 PARALLEL PROGRAMMING AND COMPUTATIONAL THINKING
CHAPTER 11 A BRIEF INTRODUCTION TO OPENCL™
CHAPTER 12 CONCLUSION AND FUTURE OUTLOOK
APPENDIX A MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE
APPENDIX B GPU COMPUTE CAPABILITIES
Index