
[Split] PowerSpectrum Unit Test


Jason G:
Interesting.  The CUDA Visual Profiler reports global memory throughput of ~175 GB/s for Mod 3 with 256 threads.  That means the measurement in the UnitTest is out by a factor of 10  ::)  (PowerSpectrum4 reported ~17.7 GB/s)

Raistmer:
Possible reason:
the profiler counts all memory transfers, including overhead. Your code probably counts only the useful data transfers.
It could be a sign of a large amount of overhead.

Jason G:
Hmmm, yes I read that.  Whatever the reason, it will pop up as I analyse the crap out of the 256 thread version to see why it's faster on Fermi.  I'm looking for a counter for uncoalesced global loads, but can't find it so far  :-\

Raistmer:
It's in the memory operations section. For NV it is present.

Regarding workgroup size, quite a few factors can influence it:
1) register pressure
2) local (shared in NV terms) memory amount
3) depth of call stack

All these factors can limit the number of warps in flight simultaneously on a single compute unit. That is, they influence how well memory latency is hidden.
This adds to all the other issues with memory access patterns vs workgroup dimensions (at the same workgroup size).

Jason G:
Ahh, found that the counter for uncoalesced reads & writes isn't supported above compute capability 1.1... oh well.


--- Quote ---1) register pressure
2) local (shared in NV terms) memory amount
3) depth of call stack

--- End quote ---

We're all good there with Mod3.  Only 6 registers / thread, occupancy is 1, no shared mem usage in this variant, and only a single call.
So it looks like a clean memory-bound kernel with no issues.

I did notice the Memcopy kernels use 192 threads, possibly to fit extra blocks per SM despite being memory bound, so I'm going to try that.

Mod3/256 threads fits 6 blocks/SM, and the max is 8, so it might be worth checking.
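
A rough sketch of that blocks/SM arithmetic, for anyone following along. It assumes the compute capability 2.0 (Fermi) limits of 1536 resident threads and 8 resident blocks per SM, and it ignores register and shared memory limits, which aren't the constraint for Mod3. The names SmLimits, blocks_per_sm and occupancy are made up for illustration and aren't from the unit test.

```cpp
#include <algorithm>

// Assumed per-SM limits for a compute capability 2.0 (Fermi) part.
struct SmLimits {
    int max_threads;  // resident threads per SM
    int max_blocks;   // resident blocks per SM
};

constexpr SmLimits kFermi{1536, 8};

// Resident blocks per SM when only the thread and block limits bind
// (no register or shared-memory pressure, as reported for Mod3).
int blocks_per_sm(int threads_per_block, SmLimits lim = kFermi) {
    return std::min(lim.max_blocks, lim.max_threads / threads_per_block);
}

// Fraction of the SM's thread slots that are filled.
double occupancy(int threads_per_block, SmLimits lim = kFermi) {
    return double(blocks_per_sm(threads_per_block, lim) * threads_per_block)
           / lim.max_threads;
}
```

Under those assumptions blocks_per_sm(256) gives 6 and blocks_per_sm(192) gives 8, and both fill all 1536 thread slots, so 192 threads reaches the block limit without changing occupancy.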

[Edit:] much the same:

--- Quote ---    192 threads:       44.1 GFlops   17.6 GB/s 121.7ulps
--- End quote ---

Not much more to squeeze out of kernels like this, I think.  Will add concurrent kernels next (and take my time doing so).

[Later:] Oops:

--- Quote ---float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;
--- End quote ---
fixing:

--- Quote ---float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
--- End quote ---
That's better:

--- Quote ---    256 threads:       44.2 GFlops  176.8 GB/s 121.7ulps
--- End quote ---
Near maximum I think, will have to calculate the theoretical.

Theoretical max of the GTX480 @ 2088MHz memclock = 200.448 GB/s, so 176.8 GB/s (effective) is pushing pretty hard.  Onto concurrency....
