Author Topic: [Split] PowerSpectrum Unit Test  (Read 165152 times)

Offline PatrickV2

  • Knight o' The Round Table
  • ***
  • Posts: 139
Re: [Split] PowerSpectrum Unit Test
« Reply #90 on: 25 Nov 2010, 03:05:46 am »
Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?

Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results. ;)

Regards, Patrick.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #91 on: 25 Nov 2010, 04:59:06 am »
Hi Patrick,
   I'm currently working on adding the next part of stock code to the reference set.  It seems that the method used for the next part of processing in stock cuda code is very slow (Though I'm busily checking my numbers).  Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).

Until then you can just multiply the Memory throughput figures by ten in your head  ;).

As part of the next refinements, whether they turn out to be replacing or integrating the summax reduction kernels, or something else if that proves unworkable as Raistmer suggests, I'll be trying to include the threads per block heuristic we work out for the powerspectrum Mod3.  All going well, I should have something more worth testing in a day or so.

Jason

[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it  ::) (Powerspectrum performance is constant)
Quote
Stock:
   PowerSpectrum<  64 thrd/blk>   29.0 GFlops  115.9 GB/s   0.0ulps
   SumMax (     8 )    4.3 GFlops   19.0 GB/s
   SumMax (    16)    3.8 GFlops   16.5 GB/s
   SumMax (    32)    1.8 GFlops    8.0 GB/s
   SumMax (    64)    3.1 GFlops   13.5 GB/s
   SumMax (   128)    4.7 GFlops   20.5 GB/s
   SumMax (   256)    6.3 GFlops   27.6 GB/s
   SumMax (   512)   11.2 GFlops   48.9 GB/s
   SumMax (  1024)   17.2 GFlops   75.1 GB/s
   SumMax (  2048)   20.3 GFlops   88.9 GB/s
   SumMax (  4096)   24.3 GFlops  106.3 GB/s
   SumMax (  8192)   25.2 GFlops  110.2 GB/s
   SumMax ( 16384)   24.8 GFlops  108.7 GB/s
   SumMax ( 32768)   28.3 GFlops  123.8 GB/s
   SumMax ( 65536)   18.4 GFlops   80.4 GB/s
   SumMax (131072)   10.1 GFlops   44.3 GB/s

   Powerspectrum + SumMax (     8 )   12.0 GFlops   49.1 GB/s
   Powerspectrum + SumMax (    16)   10.8 GFlops   44.4 GB/s
   Powerspectrum + SumMax (    32)    6.2 GFlops   25.2 GB/s
   Powerspectrum + SumMax (    64)    9.3 GFlops   38.3 GB/s
   Powerspectrum + SumMax (   128)   12.6 GFlops   51.7 GB/s
   Powerspectrum + SumMax (   256)   15.3 GFlops   62.5 GB/s
   Powerspectrum + SumMax (   512)   20.8 GFlops   85.1 GB/s
   Powerspectrum + SumMax (  1024)   24.8 GFlops  101.5 GB/s
   Powerspectrum + SumMax (  2048)   26.3 GFlops  107.5 GB/s
   Powerspectrum + SumMax (  4096)   27.7 GFlops  113.5 GB/s
   Powerspectrum + SumMax (  8192)   28.0 GFlops  114.6 GB/s
   Powerspectrum + SumMax ( 16384)   27.8 GFlops  113.8 GB/s
   Powerspectrum + SumMax ( 32768)   28.8 GFlops  117.9 GB/s
   Powerspectrum + SumMax ( 65536)   25.4 GFlops  104.0 GB/s
   Powerspectrum + SumMax (131072)   19.8 GFlops   81.1 GB/s
« Last Edit: 25 Nov 2010, 05:52:55 am by Jason G »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #92 on: 25 Nov 2010, 10:39:19 am »
Yes, it should be so.
Different sizes mean different block counts, and so, at the very least, different memory latency hiding.
The power spectrum, by contrast, always has a constant number of threads (1M): each thread is mapped to just a single spectrum point, and there are always 1M points regardless of the sizes, because X*Y == 1024*1024 always holds even if X varies.
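
To make the point above concrete, here is a minimal Python sketch (not the CUDA or OpenCL code itself) of how the number of independent arrays handed to the reduction changes with FFT length while the total point count stays at 1024*1024. The names `TOTAL_POINTS` and `arrays_per_call` are hypothetical, chosen only for illustration.

```python
# Total number of power-spectrum points, as stated in the post: X * Y == 1024 * 1024.
TOTAL_POINTS = 1024 * 1024


def arrays_per_call(fft_len: int) -> int:
    """Number of independent spectra the summax reduction has to process."""
    if fft_len <= 0 or TOTAL_POINTS % fft_len != 0:
        raise ValueError("fft_len must be a positive divisor of TOTAL_POINTS")
    return TOTAL_POINTS // fft_len


if __name__ == "__main__":
    # The power spectrum always sees TOTAL_POINTS work items;
    # the reduction sees a very different amount of parallelism per FFT length.
    for fft_len in (8, 1024, 8192, 131072):
        print(f"fft_len={fft_len:6d} -> {arrays_per_call(fft_len):6d} arrays")
```

Short FFT lengths give many short arrays and long FFT lengths give a handful of long ones, which is why the reduction throughput moves with size while the power spectrum throughput does not.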

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #93 on: 25 Nov 2010, 04:22:15 pm »
@Raistmer:  Now I restore the stock Memory transfers, and find this response from stock code:

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0](     8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0](    16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0](    32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0](    64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0](   128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0](   256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0](   512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0](  1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0](  2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0](  4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0](  8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
                PowerSpectrum+summax Unit test
Stock:
   PowerSpectrum<  64 thrd/blk>   29.1 GFlops  116.3 GB/s   0.0ulps
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
   SumMax (    16)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
   SumMax (    32)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
   SumMax (    64)    1.8 GFlops    7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
   SumMax (   128)    2.6 GFlops   11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
   SumMax (   256)    3.5 GFlops   15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
   SumMax (   512)    5.0 GFlops   21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
   SumMax (  1024)    6.1 GFlops   26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
   SumMax (  2048)    6.7 GFlops   29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
   SumMax (  4096)    7.2 GFlops   31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
   SumMax ( 16384)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
   SumMax ( 32768)    7.3 GFlops   32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
   SumMax ( 65536)    6.2 GFlops   27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

Did you also find that the stock reduction code prior to FindSpikes is a pile of poo? Or do you think my test is broken?
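
For readers who want something to check the mean and peak columns against, a CPU reference along these lines could be used. This is only a sketch under assumptions: the function name `summax_reference` is hypothetical, and treating the peak as the maximum power divided by the mean is a guess from the printed values, not something confirmed from the stock code.

```python
from typing import List, Tuple


def summax_reference(power: List[float], fft_len: int) -> List[Tuple[float, float, int]]:
    """Return one (mean, peak, position) triple per spectrum of length fft_len.

    Assumption: peak is reported as max / mean. The position is the index of
    the first maximum inside its own spectrum.
    """
    if fft_len <= 0 or len(power) % fft_len != 0:
        raise ValueError("len(power) must be a multiple of fft_len")
    results = []
    for start in range(0, len(power), fft_len):
        chunk = power[start:start + fft_len]
        mean = sum(chunk) / fft_len
        pos = max(range(fft_len), key=chunk.__getitem__)
        peak = chunk[pos] / mean if mean > 0 else 0.0
        results.append((mean, peak, pos))
    return results


if __name__ == "__main__":
    data = [1.0, 1.0, 2.0, 4.0, 3.0, 3.0, 3.0, 3.0]
    print(summax_reference(data, 4))
```

Comparing GPU output against such a reference for every FFT length is one way to tell a slow kernel from a broken test.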

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #94 on: 25 Nov 2010, 05:47:47 pm »
Do you mean that summax reaches much lower throughput than the power spectrum?
If yes, that is as it should be; I described the reasons earlier.
For now I am converting CUDA summax32 into OpenCL for HD5xxx GPUs. Will see if it is better than my own reduction kernels, which don't use local memory at all.
Reduction in summax allows increasing the number of workitems involved. Without reduction, for long FFTs (and there are many long FFT calls, far more than short ones) find spike would have only a few workitems, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [ and surely it will decrease throughput, but to a much lesser degree ]
{BTW, summax32 starts from FFTsize==32. From your table it looks like the (template-based) codelets for sizes less than 32 are not very good ones: their throughput is too low. Good info, I'll think twice before using them in OpenCL now ;D }
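
The idea of splitting one long array across several workitems and then combining their partial results can be sketched on the host as follows. This is a plain Python illustration, not summax32 or any of the OpenCL kernels; `partial_summax`, `combine` and `reduced_summax` are hypothetical names, and the even split across workitems is an assumption.

```python
from typing import List, Tuple

Partial = Tuple[float, float, int]  # (sum, max, global position of max)


def partial_summax(chunk: List[float], offset: int) -> Partial:
    """What a single workitem would compute over its slice of the spectrum."""
    pos = max(range(len(chunk)), key=chunk.__getitem__)
    return sum(chunk), chunk[pos], offset + pos


def combine(partials: List[Partial]) -> Partial:
    """Second stage: merge the per-workitem partial results."""
    total = sum(p[0] for p in partials)
    best = max(partials, key=lambda p: p[1])
    return total, best[1], best[2]


def reduced_summax(spectrum: List[float], workitems: int) -> Tuple[float, float, int]:
    """Return (mean, max, position) for one spectrum using `workitems` slices."""
    n = len(spectrum)
    if workitems <= 0 or n % workitems != 0:
        raise ValueError("workitems must evenly divide the spectrum length")
    step = n // workitems
    partials = [partial_summax(spectrum[i:i + step], i) for i in range(0, n, step)]
    total, peak, pos = combine(partials)
    return total / n, peak, pos


if __name__ == "__main__":
    spectrum = [1.0, 5.0, 2.0, 0.0, 3.0, 9.0, 4.0, 0.0]
    print(reduced_summax(spectrum, 4))  # four workitems, two points each
    print(reduced_summax(spectrum, 1))  # one workitem walking the whole array
```

Both calls give the same answer; the difference on a GPU is only in how many workitems are in flight to hide memory latency.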
« Last Edit: 25 Nov 2010, 05:54:29 pm by Raistmer »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #95 on: 25 Nov 2010, 06:07:03 pm »
Yep, I mean exactly what you're getting at:
Stock:
...
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

I think my old Pentium 3 would calculate the average & peak for a 1MiB point dataset at similar GFlops speeds, and need much less power to do so. (compared to an overclocked GTX 480)

Part of the waste is definitely the memory copies back to host for result reduction (OK) but not that much.  I'll continue playing around and see if I can determine whether something like this should really be done as is, with improved GPU code, partially on CPU, or fully on CPU.

Jason

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #96 on: 25 Nov 2010, 06:13:55 pm »
Full CPU transfer will be slower; I started with that in the early stage of OpenCL MB.
The 4M memory transfer of the power matrix costs too much.
For low FFT sizes I use a flag transfer to see whether the mean/max/pos data needs to be downloaded from the GPU or not, but for big FFT sizes (a low number of mean/max/pos elements) I found it's easier to transfer them than the flag.
A memory transaction (for ATI at least) has some threshold size (about 16kb) below which the transfer time almost doesn't change. That is, whether a single byte or 16kb is transferred, the overhead will be the same. So there is no sense in downloading a flag instead of 16kb of the original mean/max/pos data.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #97 on: 25 Nov 2010, 06:20:43 pm »
OK, then I have a middle ground in mind that might restore some throughput & hopefully be friendly to the preceding powerspectrum threadblock layout.  Will give it a go.

[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results.  I'll go onto FindSpikes to see if all that data is really needed.
« Last Edit: 27 Nov 2010, 06:58:14 am by Jason G »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #98 on: 27 Nov 2010, 07:20:53 am »
Quote
[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results.  I'll go onto FindSpikes to see if all that data is really needed.

That is 3 float numbers (12 bytes) for each array. When the FFT size is big enough (most of the time it is ~8-16k) that is not much to transfer: at an FFT size of 8K we have 1M/8K = 128 arrays => 128*12 = 1.5kB of memory transfer per kernel call. That is well within the threshold for constant-time memory transfer on ATI. So there is no sense in reducing the size of the transfer (and I see no way to eliminate it completely without doing the full reduction and spike bookkeeping entirely on the GPU).
Will look at the NV profiler data to see whether it shows a different dependence of transfer time on transfer size...
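
The arithmetic above can be checked for other FFT sizes with a few lines of Python. The 12 bytes per array and the roughly 16kb ATI threshold are taken from the posts; the function names, and treating "16kb" as exactly 16 * 1024 bytes, are assumptions made for the sketch.

```python
TOTAL_POINTS = 1024 * 1024          # points in the power spectrum
BYTES_PER_RESULT = 3 * 4            # 3 floats (mean, max, pos) per array
ATI_THRESHOLD_BYTES = 16 * 1024     # "about 16kb"; the exact value is an assumption


def result_transfer_bytes(fft_len: int) -> int:
    """Bytes of summax results downloaded per kernel call for a given FFT length."""
    return (TOTAL_POINTS // fft_len) * BYTES_PER_RESULT


def fits_in_threshold(fft_len: int) -> bool:
    """True if the whole result array stays within the constant-time transfer region."""
    return result_transfer_bytes(fft_len) <= ATI_THRESHOLD_BYTES


if __name__ == "__main__":
    for fft_len in (8, 512, 1024, 2048, 8192, 131072):
        size = result_transfer_bytes(fft_len)
        print(f"fft_len={fft_len:6d}: {size:8d} bytes, within threshold: {fits_in_threshold(fft_len)}")
```

Under these assumptions the result array fits inside the threshold from an FFT length of 1024 upwards, and grows to about 1.5 MB at an FFT length of 8.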

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #99 on: 27 Nov 2010, 08:25:36 am »
OK, this data is for the OpenCL build, but you will perhaps get about the same with CUDA:

Transfer of flag (in my case flag in uint4 so 16 bytes):
4us GPU time and ~120us CPU time (as NV profiler shows)
transfer of full results array in case of big FFT sizes, for example, transfer of 8k data: GPU: 14us, CPU:128us;
4k of data: GPU: 9us, CPU:117us

It looks like the same rule applies for NV GPUs, maybe with a slightly lower threshold value: if the transfer size is below some threshold, the transfer time no longer depends on the transfer size.
And it should be so, because of the rather large quantum of data in a bus transfer.
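
Putting the profiler figures quoted above side by side shows how flat the cost is. The timings are the ones from this post; reading "8k of data" as 8 KiB is an assumption, and the variable names are only illustrative.

```python
# Figures quoted from the NV profiler run above (OpenCL build).
FLAG_BYTES = 16             # flag stored in a uint4
FLAG_GPU_US = 4.0
FLAG_CPU_US = 120.0

DATA_BYTES = 8 * 1024       # assumption: "8k" means 8 KiB of result data
DATA_GPU_US = 14.0
DATA_CPU_US = 128.0

size_ratio = DATA_BYTES / FLAG_BYTES        # how much more data is moved
gpu_time_ratio = DATA_GPU_US / FLAG_GPU_US  # how much longer the GPU side takes
cpu_time_ratio = DATA_CPU_US / FLAG_CPU_US  # how much longer the CPU side takes

if __name__ == "__main__":
    print(f"data moved: x{size_ratio:.0f}")
    print(f"GPU time:   x{gpu_time_ratio:.2f}")
    print(f"CPU time:   x{cpu_time_ratio:.2f}")
```

With those readings, moving several hundred times more data costs only a few percent more CPU time, which is the fixed-overhead behaviour described above.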

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #100 on: 27 Nov 2010, 09:38:00 am »
OK,
    While playing, it is definitely the numdatapoints/fftlen transfers (of size float3) that are killing performance here.  I've tried variations on partial reductions, to be completed on host, even with mapped/pinned memory. I find it's going to be more efficient to transmit the thresholds into the kernel, transferring flag/results only if necessary, solely because we have an upper limit of 30 signals... will keep playing.
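
The "pass the threshold in, fetch results only if necessary" idea can be sketched on the host side like this. It is a Python stand-in for the control flow only, not CUDA code; `kernel_side_flag`, `host_fetch`, the (mean, peak, position) layout and the threshold value used in the example are all hypothetical.

```python
from typing import List, Tuple

Result = Tuple[float, float, int]  # (mean, peak, position); layout is an assumption


def kernel_side_flag(results: List[Result], threshold: float) -> bool:
    """Stand-in for the in-kernel comparison: True if any peak exceeds the threshold."""
    return any(peak > threshold for _mean, peak, _pos in results)


def host_fetch(results: List[Result], threshold: float):
    """Return (hits, triples_transferred).

    If the flag is clear, nothing but the flag crosses the bus, so zero
    result triples are transferred. Otherwise the whole array is fetched.
    """
    if not kernel_side_flag(results, threshold):
        return [], 0
    hits = [(i, r) for i, r in enumerate(results) if r[1] > threshold]
    return hits, len(results)


if __name__ == "__main__":
    quiet = [(0.66, 2.9, 3), (0.67, 3.0, 7)]
    loud = quiet + [(0.65, 30.0, 1)]
    print(host_fetch(quiet, 24.0))  # common case: flag clear, nothing downloaded
    print(host_fetch(loud, 24.0))   # rare case: flag set, full array downloaded
```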

Jason


Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #101 on: 27 Nov 2010, 10:25:27 am »
A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: [Split] PowerSpectrum Unit Test
« Reply #102 on: 27 Nov 2010, 10:29:38 am »
Quote
solely because we have an upper limit of 30 signals...

Per single kernel call it doesn't matter. You can always download the whole array back if needed, and this is a very rare case. In the common case only a uint flag of 4 bytes needs to be transferred, if the threshold/best comparison is done inside the kernel.

But again, it works well only while FFT sizes are small (many short arrays).
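
The size-based choice described in the last two posts could be written down roughly as follows. The 2048 cut-off, the 4-byte flag and the 12 bytes per array come from the thread; whether 2048 itself falls on the flag side is an assumption, and `transfer_strategy` and `common_case_bytes` are hypothetical names.

```python
TOTAL_POINTS = 1024 * 1024     # points in the power spectrum
BYTES_PER_RESULT = 3 * 4       # mean, max, pos as 3 floats
FLAG_BYTES = 4                 # single uint flag
FLAG_FFT_THRESHOLD = 2048      # cut-off from the post; boundary handling is an assumption


def transfer_strategy(fft_len: int) -> str:
    """Pick 'flag' for short FFTs (many short arrays), 'values' for long ones."""
    return "flag" if fft_len <= FLAG_FFT_THRESHOLD else "values"


def common_case_bytes(fft_len: int) -> int:
    """Bytes downloaded per kernel call when no signal is found."""
    if transfer_strategy(fft_len) == "flag":
        return FLAG_BYTES
    return (TOTAL_POINTS // fft_len) * BYTES_PER_RESULT


if __name__ == "__main__":
    for fft_len in (8, 2048, 4096, 131072):
        print(fft_len, transfer_strategy(fft_len), common_case_bytes(fft_len))
```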

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #103 on: 27 Nov 2010, 10:48:35 am »
Quote
A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.

  Yep, that gels with what I'm seeing in cuda calls.  I'll avoid looking too hard at OpenCL implementation in respect of strength in diversity, but the principles match up so far.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #104 on: 27 Nov 2010, 10:50:53 am »
Quote
But again, it works well only while FFT sizes are small (many short arrays).

Yep, same again.  The shorter FFT lengths match with the longer pulsePoTs, so I want those as short as possible.  I'll feed through flags to the kernels at least for the shorter FFT lengths.

 
