Illustration of the caching mechanism of the proposed GPU-CUDA method.
By slicing the image volume orthogonal to the predominant TOR direction, the area of the intersection between the TOR and the slice is bounded. Here, a y-direction TOR and an x-z slice are shown as an example.
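The bound described in this caption can be illustrated with a minimal Python sketch. It is not the authors' GPU code: the function names are hypothetical, and unit isotropic voxels and a non-degenerate line (two distinct endpoints) are assumed. Along the predominant axis the line moves by one voxel per slice, so its in-plane shift between adjacent slices never exceeds one voxel.

```python
import numpy as np

def predominant_axis(p0, p1):
    """Return the axis index (0 = x, 1 = y, 2 = z) with the largest extent of the line."""
    d = np.abs(np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float))
    return int(np.argmax(d))

def in_plane_shift_per_slice(p0, p1):
    """In-plane shift (in voxels) of the line centre between two adjacent slices
    taken orthogonal to the predominant axis.

    Assumes unit isotropic voxels and p0 != p1 (hypothetical helper)."""
    d = np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float)
    axis = predominant_axis(p0, p1)
    # Displacement per unit step along the predominant axis.
    shift = np.abs(d / d[axis])
    # Drop the predominant axis; the two remaining components are at most 1.
    return axis, np.delete(shift, axis)

# Example: a line whose largest extent is along y, so x-z slices are used.
axis, shift = in_plane_shift_per_slice((0.0, 0.0, 0.0), (10.0, 40.0, -5.0))
print(axis, shift)
```

Because each in-plane shift is at most one voxel per slice, the footprint of a fixed-width TOR on any slice stays within a fixed-size window, which is what makes a fixed-size per-slice working set possible.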
Number of collisions per voxel for backprojecting a random set of 1 million lines. White means no collision, gray means one collision, and black means two collisions.
With the ToF information, fewer LOR-slice pairs need to be processed. Four LORs are shown intersecting the current slice. The dots on each line denote the ToF center, and the bell shapes denote the ToF kernels. Only LORs 2 and 3 contribute to the current slice significantly.
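The culling of LOR-slice pairs described in this caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the published implementation: the function names, the Gaussian kernel parameterized by its FWHM, the 3-sigma cutoff, and the example geometry are all hypothetical choices.

```python
import math

# FWHM = 2 * sqrt(2 * ln 2) * sigma for a Gaussian.
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def tof_weight(distance_mm, fwhm_mm):
    """Gaussian ToF kernel value (peak normalised to 1) at a distance from the ToF centre."""
    sigma = fwhm_mm * FWHM_TO_SIGMA
    return math.exp(-0.5 * (distance_mm / sigma) ** 2)

def contributes(p0, p1, tof_center_t, slice_y, fwhm_mm, cutoff_sigmas=3.0):
    """Decide whether an LOR contributes significantly to the x-z slice at y = slice_y.

    p0, p1        : LOR endpoints in mm (x, y, z)
    tof_center_t  : ToF centre as a fraction in [0, 1] along p0 -> p1
    cutoff_sigmas : assumed truncation radius of the kernel, in sigmas
    """
    dy = p1[1] - p0[1]
    if dy == 0.0:
        return False  # LOR parallel to the slice; a different slicing direction would apply
    t_slice = (slice_y - p0[1]) / dy
    if not 0.0 <= t_slice <= 1.0:
        return False  # the LOR does not cross this slice
    length = math.sqrt(sum((b - a) ** 2 for a, b in zip(p0, p1)))
    distance = abs(t_slice - tof_center_t) * length
    return distance <= cutoff_sigmas * fwhm_mm * FWHM_TO_SIGMA

# Example with a 600 mm LOR along y and an assumed 90 mm FWHM kernel.
p0, p1 = (0.0, -300.0, 0.0), (0.0, 300.0, 0.0)
print(contributes(p0, p1, 0.6, 0.0, 90.0))  # ToF centre 60 mm from the slice
print(contributes(p0, p1, 0.9, 0.0, 90.0))  # ToF centre 240 mm from the slice
```

With these assumed values, the first LOR lies within the cutoff and is kept, while the second is skipped for this slice, which mirrors the distinction between contributing and non-contributing LORs in the figure.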
Cumulative contributions of different optimization strategies to the overall speedup of the GPU-CUDA method compared to the CPU-based code. Simple GPU implementation refers to the method that directly maps the computation to the GPU hardware without using the subsequent optimizations listed in this figure.
Number of randomly generated LORs that can be processed per second, as a function of the number of thread blocks for the GPU-CUDA method. Due to hardware limitations, the block size of the GTX 285 cannot be set to 1024.
Hot rod phantom (a) acquired on a preclinical PET scanner and reconstructed with 2 iterations and 40 subsets of list-mode OSEM, using (c) the CPU method and (d) the GPU-CUDA method. (b) The normalization map. Profiles of (c) and (d) through the centers of the hot rods, depicted in (a), are shown in (e). Contrast between the two ROIs in (a) and noise as functions of the number of iterations are shown in (f). The methods for computing contrast and noise are explained in Sec. ???. The processing times for the GPU and the CPU are 7.0 s and 23 min, respectively.
Mouse PET scan (maximum intensity projection), reconstructed with three iterations and five subsets of list-mode OSEM using the CPU method (a) and the GPU-CUDA method (b). The processing times for the GPU and the CPU are 8.0 s and 28 min, respectively.
Transaxial image taken through slice 30 of the liver of the reconstructed patient data from a Philips Gemini TF PET/CT scanner. A CT image at the same slice location is shown in (e) with a soft tissue window and inverse gray scale to provide an anatomical frame of reference. The (cropped) normalization image is shown in (f). The lesion is visualized with higher contrast for the ToF data. For non-ToF, the lesion contrasts for the CPU and GPU methods are 2.6 and 2.7, respectively. For ToF, the values are 3.0 and 3.1, respectively. The processing times for the GPU and the CPU are 7.7 s and 42 min, respectively.
Execution time (ms) for processing varying numbers of randomly generated LORs for the GPU-CUDA method.
Effect of using fast math for one iteration of 1 million random ToF LORs in a 75 × 75 × 26 image.
Execution time for processing 1 million random events in an image matrix of L × L × L with TOR width T_w increasing simultaneously with L.
Execution time for processing 1 million random events in an image matrix of L × L × L with fixed TOR width 3 × 3.
Execution time for processing 1 million random LORs in an image matrix of 75 × 75 × 26 for different TOR widths T_w. The maximum number of voxels in a TOR-slice intersection is also listed.