[Split] PowerSpectrum Unit Test
Raistmer:
For spike finding, the whole array has to be scanned. That means either a long loop inside one thread or cooperation between threads via shared memory and barriers.
Power spectrum computation, by contrast, is inherently parallel, and each thread can be mapped to a separate matrix point.
I tried fusing the power spectrum computation with normalization, and performance decreased because of a huge drop in the number of independent threads available (normalization requires computing the mean, i.e., again, access to the whole PoT array).
Jason G:
--- Quote from: Raistmer on 22 Nov 2010, 04:19:21 am --- (normalization requires computing the mean, i.e., again, access to the whole PoT array).
--- End quote ---
Mmm, there may be a way partially around that using some sync barrier / reduction. Averaging the large dataset *should* be parallelisable (by swapping local means). [Edit:] That summax stuff seems to be doing that, but it looks fairly generalised, with lots of 'TODO's and unnecessary stuff. I'll work out how to do the reduction in the powerspectrum kernel later, since it seems pointless to rescan the whole array when we just had it in hand for the powerspectrum.
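The "swapping local means" idea can be sketched as a two-stage reduction. This is a minimal host-side Python sketch, not the summax code; `block_partials`, `combine_partials`, and the block size of 256 are names and values assumed for illustration. It carries (sum, count) pairs instead of bare means so that a short final block does not skew the result.

```python
def block_partials(values, block_size=256):
    """Stage 1: each block reduces its own slice to a (sum, count) pair."""
    partials = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        partials.append((sum(chunk), len(chunk)))
    return partials


def combine_partials(partials):
    """Stage 2: merge the per-block pairs into one global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical PoT-like data: 1000 points, so the last block is short.
data = [float(i % 7) for i in range(1000)]
mean = combine_partials(block_partials(data))
```

Only the small list of partials has to be revisited in the second stage, which is the point of not rescanning the whole array.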
Raistmer:
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to each thread instead of 1 point, where N >> 1, or else you will have too many reduction steps.
]
Jason G:
--- Quote from: Raistmer on 22 Nov 2010, 06:47:49 am --- summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to each thread instead of 1 point, where N >> 1, or else you will have too many reduction steps.
]
--- End quote ---
I agree to some extent, except that we're already memory bound here, so pinching at least portions of it (say, the first stage of the reduction over the 256 points in the block) should be almost free via shared memory (compared to memory access time, anyway). If it doesn't work out, it all leads to a better understanding of these complicated things anyway ;)
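A rough way to picture that first stage: each block computes the power of its 256 points and, while the values are still in shared memory, folds them into one partial sum with a tree reduction. The sketch below simulates this on the host in Python; it is not the actual kernel, the names `power_and_block_sum` and `fused_power_spectrum` are hypothetical, and it assumes the input length is a multiple of 256.

```python
BLOCK = 256  # threads per block, as in the benchmark lines in this thread


def power_and_block_sum(freq_block):
    """One block: per-point power plus a shared-memory style tree reduction."""
    assert len(freq_block) == BLOCK
    # Each "thread" computes the power of its own point: re^2 + im^2.
    power = [re * re + im * im for re, im in freq_block]
    shared = list(power)  # stands in for the block's shared memory
    stride = BLOCK // 2
    steps = 0
    while stride > 0:
        # On a GPU, a barrier would separate each of these steps.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
        steps += 1
    return power, shared[0], steps


def fused_power_spectrum(freq):
    """Run every block; return the power array and its mean."""
    powers, partial_sums = [], []
    for start in range(0, len(freq), BLOCK):
        p, s, _ = power_and_block_sum(freq[start:start + BLOCK])
        powers.extend(p)
        partial_sums.append(s)
    return powers, sum(partial_sums) / len(powers)


# Hypothetical input: 1024 (re, im) pairs, i.e. 4 full blocks.
freq = [(float(i % 5), float(i % 3)) for i in range(1024)]
powers, mean_power = fused_power_spectrum(freq)
```

With 256 points per block the in-block reduction takes 8 halving steps, and only one partial sum per block is left for a later stage.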
Jason G:
Further confirmation that this kernel is memory bound first: I wound up the memory clock without changing the core clock.
At the original OC (not stock) 2088 MHz memory clock:
--- Quote ---GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
256 threads: 44.1 GFlops 176.4 GB/s 121.7ulps
--- End quote ---
At 2208 MHz:
--- Quote ---GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
256 threads: 46.7 GFlops 186.8 GB/s 121.7ulps
--- End quote ---
So a ~5.9% increase in throughput for a ~5.7% increase in memory clock (essentially linear scaling with memory clock).
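The scaling can be checked directly from the two quoted result lines. This is just the arithmetic in Python; the helper name `pct_increase` is made up for the example.

```python
# Figures quoted above for "mod 3" at 256 threads.
mem_clock_mhz = (2088.0, 2208.0)
gflops = (44.1, 46.7)
gb_per_s = (176.4, 186.8)


def pct_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after / before - 1.0) * 100.0


clock_gain = pct_increase(*mem_clock_mhz)  # about 5.7 %
flops_gain = pct_increase(*gflops)         # about 5.9 %
bw_gain = pct_increase(*gb_per_s)          # about 5.9 %

# Arithmetic intensity implied by the two reported columns.
flops_per_byte = [g / b for g, b in zip(gflops, gb_per_s)]  # 0.25 both times
```

The ratio stays at 0.25 flops per byte in both runs. That would be consistent with 3 flops per point against 12 bytes of traffic (an 8-byte complex read and a 4-byte power write), though that breakdown is an assumption and not something stated in the posts.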