1) register pressure
2) local (shared in NV terms) memory amount
3) depth of the call stack
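A quick way to see where a particular kernel stands on those three is cudaFuncGetAttributes(), which reports registers per thread, static shared memory per block, and local memory per thread (which roughly covers register spills and the per-thread stack). The kernel below is just a placeholder for illustration, not the actual GetPowerSpectrum source; compiling with nvcc --ptxas-options=-v prints the same numbers at build time.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: |X|^2 per point (float2 in, float out), for illustration only.
__global__ void GetPowerSpectrum_kernel(const float2* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, GetPowerSpectrum_kernel);
    printf("registers per thread : %d\n",        attr.numRegs);
    printf("shared mem per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    return 0;
}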
192 threads: 44.1 GFlops 17.6 GB/s 121.7ulps
The GB calculation was dividing by 10e9 instead of 1e9, so the reported bandwidth was a factor of 10 too low:

float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;

should be

float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
256 threads: 44.2 GFlops 176.8 GB/s 121.7ulps
Fits with the theories so far ... and it turns out we can multiply the earlier memory throughput figures ( GB/s ) by 10.
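For completeness, this is roughly how such a GB/s figure is produced: time the kernel with CUDA events, count the bytes it reads (n float2) and writes (n float), and divide by the elapsed time. This is only a sketch reusing the placeholder GetPowerSpectrum_kernel from above, with the corrected 1e9 divisor; measure() and the launch configuration are illustrative, not the benchmark's actual code.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void GetPowerSpectrum_kernel(const float2* in, float* out, int n);  // as sketched earlier

void measure(const float2* d_in, float* d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    GetPowerSpectrum_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                          // milliseconds

    float GB = ((n * sizeof(float2)) + (n * sizeof(float))) / 1e9f;  // bytes moved, in GB
    printf("%.1f GB/s\n", GB / (ms / 1000.0f));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}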
Just to add a little bit... I'm running Vista 32 on an E5400 dual-core at 2.7 GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840 / memory 400, which gives me 118 GFLOPS peak.
And maybe consider whether the kernels might be memory bound on some cards?...
(normalization requires mean computation, i.e., again, access to the whole PoT array).
summax uses thread cooperation/synching and barriers. [I just want to say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.]
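For what it's worth, the cooperative pattern being described looks roughly like the sketch below: each thread first folds several PoT points into a private partial sum and max (the "N points per thread" part), and the block then reduces those partials through shared memory and __syncthreads() barriers. This is only an illustration of the pattern, not the actual summax kernel; THREADS, POINTS_PER_THREAD and the names are made up.

#include <cfloat>

#define THREADS            256
#define POINTS_PER_THREAD  8      // N points bound to a thread instead of 1

__global__ void summax_sketch(const float* pot, float* block_sum, float* block_max, int n)
{
    __shared__ float s_sum[THREADS];
    __shared__ float s_max[THREADS];

    int tid  = threadIdx.x;
    int base = blockIdx.x * THREADS * POINTS_PER_THREAD + tid;

    // Serial part: each thread accumulates POINTS_PER_THREAD elements privately.
    float sum = 0.0f;
    float mx  = -FLT_MAX;
    for (int k = 0; k < POINTS_PER_THREAD; ++k) {
        int i = base + k * THREADS;          // consecutive threads read consecutive elements
        if (i < n) {
            float v = pot[i];
            sum += v;
            mx   = fmaxf(mx, v);
        }
    }
    s_sum[tid] = sum;
    s_max[tid] = mx;
    __syncthreads();

    // Cooperative part: classic tree reduction over the block, barrier after each step.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_sum[tid] += s_sum[tid + stride];
            s_max[tid]  = fmaxf(s_max[tid], s_max[tid + stride]);
        }
        __syncthreads();
    }

    if (tid == 0) {
        block_sum[blockIdx.x] = s_sum[0];
        block_max[blockIdx.x] = s_max[0];
    }
}

A second, tiny pass (or a host-side loop over the per-block results) then gives the total sum for the mean and the overall maximum.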
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads) 256 threads: 44.1 GFlops 176.4 GB/s 121.7ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads) 256 threads: 46.7 GFlops 186.8 GB/s 121.7ulps