Forum > Windows
optimized sources
_heinz:
Hi Jason,
I ran gp_fft_1-8 on the ION (R3600 ATOM).
Could you have a closer look? Result file attached.
Edit:
I compiled 1-8 on the Xeon (GF470) and ran it to see whether there are general differences from the ION.
Have a look at the runtimes.
Result file attached.
A bit later:
I ran GPU-Z while 1-8 were running. At the beginning, while the short sizes 1, 2, 3, 4 were running, GPU usage started at 95% and then slowly fell back to 40% by the time 8 was running.
Look at the picture here:
The ION shows a completely different picture: during 1-4 the load starts at about 70% at the beginning of 1, but then degenerates into spikes separated by periods of no GPU load, which means the necessary calculations between the batches take too much time to feed the GPU continuously.
ion_fft_1-4_gpu_load
ion_fft_6-7_gpu_load
heinz
Raistmer:
To emulate the current MB FFT situation, it's worth testing FFTs in bunches where num_of_ffts*size_of_fft always == 1M dots.
That is, a small FFT size means a big number of FFTs that can be done at once, whereas large FFTs come in smaller numbers.
If a few cfft pairs are unrolled, the rule stays the same; just replace 1M with 1M*number_of_unrolled_cfft_pairs.
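Raistmer's rule above can be sketched in a few lines. This is just an illustration of the arithmetic, not code from the benchmark; `TOTAL_POINTS` and `batch_for_size` are names I've made up.

```python
# Raistmer's batching rule: num_of_ffts * size_of_fft == 1M dots,
# scaled by the number of unrolled cfft pairs.
TOTAL_POINTS = 1 << 20  # 1M points per bunch

def batch_for_size(fft_size, unrolled_pairs=1):
    """Number of FFTs per bunch so that batch * fft_size stays at
    1M points (times the number of unrolled cfft pairs)."""
    total = TOTAL_POINTS * unrolled_pairs
    assert total % fft_size == 0, "fft_size must divide the total"
    return total // fft_size

# Small FFT size -> many FFTs at once; large FFT size -> few.
for n in (8, 64, 1024, 8192):
    print(n, batch_for_size(n))
```

Note that the table further down the thread keeps batch*N at 8M points (the benchmark default), not the 1M suggested here.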
Jason G:
Yes, it's interesting that Heinz's ION shows no speed difference when the batch is changed to reflect the 1M points we use in multibeam.
The smaller ION GPU seems to be filled already by this smaller batch. I have previously made a modified version of this test that sticks to 1M points but chains the FFT with the getpowerspectrum kernel, as in the app, and includes the flops for that accordingly, so it better matches what we'll need for profiling/refinement. I'll have to dig that one out of backups after a recent OS reinstall; it also pits the small-size freaky powerspectrum prototypes against the stock method.
All that is clear so far is that 1M points doesn't fill the 480 here, so concurrency at the chirp-rate level may be necessary for larger cards.
Curiously, nSight measures my smaller-sized FFTs at 0.33 occupancy and the CuFFT ones at 0.17, and I tuned mine for generic compute capability 1.0 devices. That would seem to further indicate that CuFFT is really meant for larger batches than our 1M points... (or is being used incorrectly somehow :o)
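For readers following along: the "chain the FFT with getpowerspectrum" idea can be sketched on the CPU with NumPy. This is my own illustrative sketch of what such a chained step computes and how its flops might be counted; `fft_plus_powerspectrum` and `flops` are names I've invented, and the per-bin flop count is an assumption, not the app's actual accounting.

```python
import numpy as np

def fft_plus_powerspectrum(x, fft_size):
    """Batched FFT followed by a power-spectrum step:
    each output bin becomes re^2 + im^2."""
    batch = x.size // fft_size
    X = np.fft.fft(x.reshape(batch, fft_size), axis=1)
    return X.real ** 2 + X.imag ** 2

def flops(fft_size, batch):
    """Flop estimate: the usual 5*N*log2(N) per complex FFT,
    plus ~3 flops per bin (two squares and an add) for the
    power spectrum."""
    return batch * (5 * fft_size * np.log2(fft_size) + 3 * fft_size)
```

A GPU version would fuse the power-spectrum step into a kernel launched right after the FFT, avoiding a round trip through device memory.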
_heinz:
To make things clearer, I compared the GPU load of 1 against 8.
The first compact part you see is 1; after that comes 8, in exactly 9 pieces. The max value in the middle is 512.
Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
--------CUFFT------- ---This prototype--- ---two way---
N Batch Gflop/s GB/s error Gflop/s GB/s error Gflop/s error
8 1048576 4.3 4.6 1.5 6.7 7.2 1.6 6.8 3.0
16 524288 5.5 4.4 1.7 7.0 5.6 1.4 7.0 2.3
64 131072 8.6 4.6 1.7 10.3 5.5 2.2 10.3 3.4
256 32768 10.0 4.0 2.0 9.5 3.8 2.0 9.7 3.5
512 16384 10.5 3.7 2.1 12.4 4.4 2.5 12.4 4.2
1024 8192 9.0 2.9 2.5 9.1 2.9 2.4 9.0 4.5
2048 4096 8.5 2.5 2.7 8.8 2.6 3.0 8.8 5.1
4096 2048 7.0 1.9 2.4 9.9 2.7 3.3 10.2 5.4
8192 1024 6.4 1.6 2.4 9.5 2.3 3.4 9.4 5.7
You can clearly see how the 9 batches show up in the GPU load on the ION:
ion_fft_1_and_8_gpu_load
heinz
Jason G:
You need nSight, Heinz: