Abstract

General-purpose computing on graphics processing units (GPGPU) is dramatically changing the landscape of high-performance computing in astronomy. In this paper, we identify and investigate several key decision areas, with the goal of simplifying the early adoption of GPGPU in astronomy. We consider the merits of OpenCL as an open standard in order to reduce the risks associated with coding in a native, vendor-specific programming environment, and present a GPU programming philosophy based on using brute-force solutions. We assert that effective use of new GPU-based supercomputing facilities will require a change in approach from astronomers. This will likely include improved programming training, an increased need for software development best practice through the use of profiling and related optimisation tools, and a greater reliance on third-party code libraries. As with any new technology, those willing to take the risks and make the investment of time and effort to become early adopters of GPGPU in astronomy stand to reap great benefits.

Keywords:
[2] 1 flop = 1 floating-point operation; 1 flop/s = 1 floating-point operation/second.
[5] Published prior to the release of CUDA; some of the implementation issues they raise have been resolved.
[6] Online versions of volumes 1–3 are freely available from
[7] NVIDIA CUDA:
[8] Other architecture-specific SDKs include the ATI Stream SDK (AMD) for programming ATI Radeon GPUs and the Cell Broadband Engine SDK (IBM).
[9] Khronos OpenCL:
[10] Khronos:
[11] SiSoftware CUDA and OpenCL comparison:
[12] Only one of the two GPUs on the Radeon card was used in these tests.
[13] That is, the difference between OpenCL and CUDA kernel execution performance is typically a factor of 10–100 smaller than the gain achieved by using GPUs instead of CPUs.
[14] We note that at the time of code development, OpenCL had not been publicly released, hence our choice of CUDA.
[16] A single processor of a 2 quad-core Clovertown Processor.
[17] We note that a naïve use of the Mathematica
[18] The standalone C implementation was a single-core code; we report quad-core timings by assuming perfect scaling, which is reasonable for this task.
* Research undertaken as part of the Commonwealth Cosmology Initiative (CCI:





