Author Topic: [Split] PowerSpectrum Unit Test  (Read 162873 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #75 on: 21 Nov 2010, 05:32:37 am »
Interesting.  The CUDA Visual Profiler reports the global memory throughput of Mod 3 with 256 threads as ~175 GB/s.  That means the measurement in the UnitTest is a factor of 10 out  ::)  (PowerSpectrum4 reported ~17.7 GB/s.)

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #76 on: 21 Nov 2010, 05:42:36 am »
Possible reason:
the profiler counts all memory transfers, including overhead, while your code probably counts only useful data transfers.
It could be a sign of a large amount of overhead.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #77 on: 21 Nov 2010, 05:53:24 am »
Hmmm, yes I read that.  Whatever the reason is, it will pop up as I analyse the crap out of the 256 thread version to see why it's faster on Fermi.  I'm looking for a counter for uncoalesced global loads, but can't find it so far  :-\

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #78 on: 21 Nov 2010, 05:56:58 am »
It's in the memory operations section; for NV it is present.

Regarding workgroup size, quite a few factors can have an influence:
1) register pressure
2) local (shared, in NV terms) memory usage
3) depth of the call stack

All these factors can limit the number of warps in flight simultaneously on a single compute unit, that is, they influence the quality of memory latency hiding.
That comes on top of all the other issues with memory access patterns vs workgroup dimensions (at the same workgroup size).
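
As a rough illustration of how those limits interact, here is a back-of-the-envelope sketch (not project code), using the GF100 / compute capability 2.0 per-SM limits (32768 registers, 48 KB shared memory, 48 resident warps, 8 resident blocks) and the Mod 3 figures quoted later in the thread (6 registers/thread, 256 threads/block, no shared memory):
Quote
/* Back-of-the-envelope estimate of resident blocks per SM (plain C sketch).
   Hardware limits are the GF100 / compute 2.0 values; kernel figures are
   the Mod 3 numbers from this thread - adjust both for your own case. */
#include <stdio.h>

static int imin(int a, int b) { return a < b ? a : b; }

int main(void)
{
    const int regs_per_sm = 32768, smem_per_sm = 49152;
    const int max_blocks_per_sm = 8, max_warps_per_sm = 48;

    const int regs_per_thread = 6, threads_per_block = 256, smem_per_block = 0;

    int by_regs  = regs_per_sm / (regs_per_thread * threads_per_block);   /* 21 */
    int by_smem  = smem_per_block ? smem_per_sm / smem_per_block
                                  : max_blocks_per_sm;                    /*  8 */
    int by_warps = max_warps_per_sm / (threads_per_block / 32);           /*  6 */

    int resident = imin(imin(by_regs, by_smem),
                        imin(by_warps, max_blocks_per_sm));
    printf("resident blocks per SM: %d\n", resident);                     /*  6 */
    return 0;
}
With those numbers the warp limit is the binding one, which matches the 6 blocks/SM figure mentioned further down; a heavier kernel would hit the register or shared memory limit first, cutting the warps in flight and with them the latency hiding.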

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #79 on: 21 Nov 2010, 06:05:20 am »
Ahh, found that the counter for uncoalesced reads & writes isn't supported on greater than compute capability 1.1... oh well.

Quote
1) register pressure
2) local (shared in NV terms) memory amount
3) deepness of call stack

We're all good there with Mod 3: only 6 registers/thread, occupancy is 1, no shared mem usage in this variant, and only a single call.
So it looks like a clean memory bound kernel with no issues.

I did notice the memcopy kernels use 192 threads, possibly to fit extra blocks per SM despite being memory bound, so I'm going to try that.

Mod3/256 threads fits 6 blocks/SM, and the max is 8, so it might be worth checking.

[Edit:] much the same:
Quote
    192 threads:       44.1 GFlops   17.6 GB/s 121.7ulps

Not much more to squeeze out of kernels like this I think.  Will add concurrent kernels next (taking my time doing so).

[Later:] Oops:
Quote
float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;
fixing (10e9 is 1e10, not 1e9 - hence the factor of 10):
Quote
float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
That's better:
Quote
    256 threads:       44.2 GFlops  176.8 GB/s 121.7ulps
Near maximum I think, will have to calculate the theoretical.

Theoretical max of a GTX 480 @ 2088 MHz memclock = 200.448 GB/s, so 176.8 GB/s (effective) is pushing pretty hard.  Onto concurrency....
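
For the record, a quick check of where that 200.448 GB/s figure comes from (assuming the GTX 480's 384-bit memory bus, and treating the 2088 MHz as the double-data-rate clock the driver reports):
Quote
2088e6 clocks/s x 2 transfers/clock x (384 bits / 8) bytes = 200.448 GB/s
So the measured 176.8 GB/s is roughly 88% of theoretical.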
« Last Edit: 21 Nov 2010, 06:59:32 am by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #80 on: 21 Nov 2010, 11:13:01 am »
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:        4.6 GFlops    1.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        3.2 GFlops    1.3 GB/s 121.7ulps
     64 threads:        4.6 GFlops    1.8 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.2 GFlops    1.7 GB/s 121.7ulps
    512 threads:        3.5 GFlops    1.4 GB/s 121.7ulps
   1024 threads: N/A


HTH
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #81 on: 21 Nov 2010, 11:15:39 am »
Fits with the theories so far  :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10  ::)

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: [Split] PowerSpectrum Unit Test
« Reply #82 on: 21 Nov 2010, 02:42:32 pm »
Fits with the theories so far  :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10  ::)

And maybe consider whether the kernels might be memory bound on some cards?

Just to add a little bit... I'm running Vista 32 on an E5400 dual 2.7 GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840, with memory at 400, to give me 118 GFLOPS peak.

Yeah, 1840 shader gives 117.76 GFLOPS per the nVidia formula with 32 CUDA cores (aka shaders). What I find interesting is that they're trying to discourage use of Furmark and such which actually try to achieve the highest possible performance...
                                                                                Joe
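
For reference, the arithmetic behind that figure (assuming the usual 2 FLOPs per CUDA core per shader clock, i.e. one multiply-add):
Quote
1.840 GHz x 32 CUDA cores x 2 FLOPs/clock = 117.76 GFLOPS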

Offline M_M

  • Squire
  • *
  • Posts: 32
Re: [Split] PowerSpectrum Unit Test
« Reply #83 on: 21 Nov 2010, 03:09:49 pm »
Actually, Nvidia seems to be differentiating gaming GeForce and HPC Tesla products even further by putting a limitation in the GTX580 (and probably future high-end gaming GeForce products) that downclocks it when its usage reaches a very high level (as in FurMark or OCCT, for example). The reason they give is that games will never put such a high workload on the GPU, and they are probably right. However, some highly optimized real-life CUDA applications could reach it too - my guess is that Nvidia will respond with "buy a (much more expensive) Tesla if HPC is what you want"... :(

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #84 on: 21 Nov 2010, 11:16:03 pm »
And maybe consider whether the kernels might be memory bound on some cards?...

This one is memory bound on all of them, with only 3 compute instructions, partially fused, in each thread.  Getting the right kernel geometry per compute capability does seem to let us push bandwidth upward from stock though, and it appears the stock code for that kernel was compute capability 1.0 optimised (reasonable).  There are automatic ways to switch kernels at run time now though, in the driver.

With a memory bound computation like this, then, it does seem logical to increase the compute density, which the first freaky powerspectrum's inclusion of the power spectrum into the FFT output does do; I will need to test & refine those implementations for extension to more sizes in the long run.

Next is probably to try to rearrange the FFT & powerspectrum into chunks, in order to better exploit the cache available on Fermi (~768k L2), which the FFT -> powerspectrum sequence appears to be thrashing solely due to dataset size. I'm hoping that the concurrent kernels mechanism is intelligent enough to discriminate cache-hot data.

In either case the next test will probably need some extra compute density, which should see the GFlops rise against a hopefully similar bandwidth figure.

I haven't yet explored whether any processing subsequent to the powerspectrum could also be embedded to further raise the compute density, finding spikes immediately, for example, but it's looking like a possibility.  Further on, dealing with individual PulsePoTs for the pulsefinding looks like an option, if the FFT sequence preceding it can be done in suitable blocks.
« Last Edit: 21 Nov 2010, 11:37:18 pm by Jason G »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #85 on: 22 Nov 2010, 04:19:21 am »
For spike finding the whole array should be scanned. That is, either a long loop inside a thread, or thread cooperation via shared memory and barriers.
The power spectrum computation, on the other hand, is inherently parallel and each thread can be mapped to a separate matrix point.
I tried to fuse the power spectrum computation with normalization - performance decreased because of a huge drop in the number of available separate threads (normalization requires a mean computation, i.e., again, access to the whole PoT array).
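
For illustration, a minimal sketch of that per-point mapping (a hypothetical kernel, not the project's actual GetPowerSpectrum): each thread reads one complex FFT bin and writes one power value, so no cooperation or barriers are needed.
Quote
// Hypothetical per-point power spectrum kernel: one thread per FFT bin.
__global__ void power_spectrum(const float2 *freq, float *power, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        power[i] = freq[i].x * freq[i].x + freq[i].y * freq[i].y;
}

// launch example: power_spectrum<<<(n + 255) / 256, 256>>>(d_freq, d_power, n);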

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #86 on: 22 Nov 2010, 04:46:21 am »
(normalization requires a mean computation, i.e., again, access to the whole PoT array).
  Mmm, there may be a way partially around that with some sync barrier / reduction.  Averaging the large dataset *should* be parallelisable (by swapping local means).  [Edit:] That summax stuff seems to be doing that, but seems to be fairly generalised, with lots of 'TODO' and unnecessary stuff.  Will work out how to reduce in the powerspectrum kernel later, since it seems pointless rescanning the whole array when we just had it there for the powerspectrum.
« Last Edit: 22 Nov 2010, 05:27:23 am by Jason G »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #87 on: 22 Nov 2010, 06:47:49 am »
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N >> 1, or you will have too many reduce steps.
]
« Last Edit: 22 Nov 2010, 07:19:59 am by Raistmer »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #88 on: 22 Nov 2010, 09:14:50 am »
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N >> 1, or you will have too many reduce steps.
]

I agree to some extent, except that we're already memory bound here, so pinching at least portions (say the first stage of the reduction of the 256 points in the block) should be almost free via shared memory (compared to memory access time anyway).  If it doesn't work out, then it all leads to a better understanding of these complicated things anyway  ;)
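
A minimal sketch of that idea, for illustration only (hypothetical code, not the project's kernels; it assumes a 256-thread block): the power spectrum is written out as before, and the same block then folds its 256 freshly computed values into one partial sum in shared memory, leaving one value per block for a much smaller follow-up pass.
Quote
// Hypothetical sketch: power spectrum fused with a first-stage block sum.
// Assumes blockDim.x == 256.  Each block writes its 256 power values plus
// one partial sum; a tiny second kernel (or the host) finishes the mean.
__global__ void power_spectrum_partial_sum(const float2 *freq, float *power,
                                           float *block_sums, int n)
{
    __shared__ float s[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float p = 0.0f;
    if (i < n) {
        p = freq[i].x * freq[i].x + freq[i].y * freq[i].y;
        power[i] = p;
    }
    s[threadIdx.x] = p;
    __syncthreads();

    // standard shared-memory tree reduction over the block's 256 values
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = s[0];
}
The power spectrum itself stays one thread per point; the extra cost is one reduction over data already sitting in the block, and only the follow-up pass over the per-block sums has to bind many points to a thread, which is the trade-off being discussed.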

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #89 on: 24 Nov 2010, 01:35:28 am »
First, further confirmation of this kernel being memory bound: I wound up the memory clock without changing the core clock.

At the original OC (not stock) 2088 MHz memory clock:
Quote
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
    256 threads:       44.1 GFlops  176.4 GB/s 121.7ulps

At 2208 MHz:
Quote
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
    256 threads:       46.7 GFlops  186.8 GB/s 121.7ulps

So a ~5.9% increase in throughput (186.8 / 176.4) for a ~5.7% increase in memory clock (2208 / 2088), i.e. near-linear scaling with memory clock.

 
