Author Topic: optimized sources  (Read 615582 times)

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #600 on: 05 Sep 2010, 06:17:02 pm »
CUDA 3.1 on W7 was a hard nut to crack: the env vars were not set by the installer as they should have been.
After sorting that out and making the entries by hand, it is running now.
Qxh compiled
compiled with:2011 beta updat1
configuration:(Release) Platform(Win32)
g2011_Qxh_ATOM_fft  started
23:55:22.036
g2011_Qxh_ATOM_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    4.3    4.6   1.5      6.8    7.3   1.6      6.9   3.0
  16  524288    5.7    4.6   1.7      7.1    5.7   1.4      7.0   2.3
  64  131072    8.6    4.6   1.7     10.3    5.5   2.2     10.3   3.4
 256   32768   10.0    4.0   2.0      9.4    3.8   2.0      9.5   3.5
 512   16384   10.4    3.7   2.1     12.4    4.4   2.5     12.4   4.2
1024    8192    9.0    2.9   2.5      9.1    2.9   2.4      9.1   4.5
2048    4096    8.5    2.5   2.7      8.8    2.6   3.0      8.9   5.1
4096    2048    7.0    1.9   2.4     10.1    2.7   3.3     10.2   5.4
8192    1024    6.4    1.6   2.4      9.5    2.3   3.4      9.5   5.7

Errors are supposed to be of order of 1 (ULPs).

elapsed time=64 seconds
23:56:26.683
g2011_Qxh_ATOM_fft ended
-------------------------------------------------
compiled with:MS Compiler
configuration:(Release) Platform(Win32)
gmsc_fft  started
23:30:06.184
gmsc_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    0.5    0.5   1.8      6.9    7.3   1.6      6.9   2.1
  16  524288    1.0    0.8   2.2      7.0    5.6   1.5      7.1   1.9
  64  131072    5.8    3.1   1.7     10.3    5.5   2.4     10.3   3.0
 256   32768    8.6    3.4   1.7      9.5    3.8   1.9      9.4   2.9
 512   16384   13.2    4.7   2.1     12.5    4.4   2.5     12.4   3.7
1024    8192    9.5    3.0   2.3     10.6    3.4   2.4     10.6   3.9
2048    4096    8.6    2.5   2.6      8.8    2.6   3.0      8.9   4.5
4096    2048    7.4    2.0   2.2     10.1    2.7   3.3     10.2   4.9
8192    1024    9.1    2.3   2.8      9.6    2.4   3.4      9.5   5.2

Errors are supposed to be of order of 1 (ULPs).

elapsed time=249 seconds
23:34:15.909
gmsc_fft ended
-------------------------------------------------
The MS compiler build with CUDA 3000 needs about 4 times as long for the same job (249 vs. 64 seconds).
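When reading the tables above, it can help to see how the Gflop/s and GB/s columns relate to each other. The sketch below assumes the common convention of counting 5*N*log2(N) flops per complex FFT of size N, and one read plus one write of each 8-byte single-precision complex point. That convention is an assumption and is not taken from the benchmark source, although the ratio it predicts is close to the rows above (for N=8, 4.3 / 4.6 is about 0.94). All function names are hypothetical.

```c
#include <assert.h>

/* Integer log2 for power-of-two FFT sizes (8, 16, 64, ...). */
int ilog2_pow2(int n)
{
    int k = 0;
    assert(n > 0);
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Assumed convention: 5*N*log2(N) floating-point operations per complex FFT. */
double fft_flops(int n, long batch)
{
    assert(n > 1 && batch > 0);
    return 5.0 * n * ilog2_pow2(n) * (double)batch;
}

/* Assumed traffic: each 8-byte complex float is read once and written once. */
double fft_bytes(int n, long batch)
{
    assert(n > 0 && batch > 0);
    return 2.0 * 8.0 * n * (double)batch;
}

/* Flops per byte; Gflop/s divided by GB/s in one table row should be close to this. */
double flops_per_byte(int n, long batch)
{
    return fft_flops(n, batch) / fft_bytes(n, batch);
}
```

Under these assumptions the batch size cancels out, so the ratio depends only on N: flops_per_byte(8, 1048576) gives 0.9375 and flops_per_byte(512, 16384) gives 2.8125, which is in line with 10.4 / 3.7 in the N=512 row.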
heinz
« Last Edit: 05 Sep 2010, 06:32:11 pm by _heinz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #601 on: 05 Sep 2010, 06:48:10 pm »
Hi Heinz,

Numbers come out differently when you change to the same data set size that the Multibeam apps use (1*1024*1024 complex data points).
CUFFT is not very fast at the small sizes for that small amount of data.  It gets relatively better as the FFT size goes up.  I haven't optimised these custom ones (so they remain arranged for ~G80 GPUs), but I did change them to give in-order output.  I didn't need two-way, so I made forward & inverse transforms instead.

You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---FFT--------------  ---IFFT------
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  131072    7.1    7.6   1.4    140.0  149.4   1.1    140.5   1.1
  16   65536   16.1   12.9   1.7    183.1  146.5   1.0    183.7   1.0
  64   16384  259.2  138.2   1.4    280.0  149.4   1.4    279.7   1.4
 256    4096  352.2  140.9   1.4    352.8  141.1   1.5    352.0   1.5
 512    2048  413.3  146.9   1.8    411.8  146.4   1.8    412.2   1.8

Errors are supposed to be of order of 1 (ULPs).


« Last Edit: 05 Sep 2010, 06:51:37 pm by Jason G »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #602 on: 06 Sep 2010, 08:35:37 pm »
Thanks Jason, I will try it as you said.

The modified bench shows how the runtime depends on the compiler and the compiler options. The reference is the MSC compiler. It is always the same source.

CLEANUP DONE

 1 reference science app(s) found
   └─(gmsc_fft.exe)

 7 science app(s) found
   └─(g054_fft.exe)
   └─(g060_fft.exe)
   └─(g065_fft.exe)
   └─(g2011_fft.exe)
   └─(g2011_Qxh_fft.exe)
   └─(g2011_SSSE3_fft.exe)
   └─(gcomp_u6_fft.exe)

======================================

------------------------
Running app : gmsc_fft.exe
Started at  : 00:39:39.679
Ended at    : 00:41:06.070
     86.346 secs Elapsed
     85.703 secs CPU time
Result      : stored as reference.
------------------------
Running app : g054_fft.exe
Started at  : 00:41:06.137
Ended at    : 00:41:25.189
     19.008 secs Elapsed
     26.172 secs CPU time
Speedup     : 69.46%
Ratio       : 3.27 x
------------------------
Running app : g060_fft.exe
Started at  : 00:41:25.318
Ended at    : 00:41:44.833
     19.466 secs Elapsed
     18.844 secs CPU time
Speedup     : 78.01%
Ratio       : 4.55 x
------------------------
Running app : g065_fft.exe
Started at  : 00:41:44.967
Ended at    : 00:42:04.481
     19.467 secs Elapsed
     18.813 secs CPU time
Speedup     : 78.05%
Ratio       : 4.56 x
------------------------
Running app : g2011_fft.exe
Started at  : 00:42:04.602
Ended at    : 00:42:23.763
     19.117 secs Elapsed
     18.391 secs CPU time
Speedup     : 78.54%
Ratio       : 4.66 x
------------------------
Running app : g2011_Qxh_fft.exe
Started at  : 00:42:23.896
Ended at    : 00:42:42.663
     18.722 secs Elapsed
     26.109 secs CPU time
Speedup     : 69.54%
Ratio       : 3.28 x
------------------------
Running app : g2011_SSSE3_fft.exe
Started at  : 00:42:42.786
Ended at    : 00:43:01.603
     18.774 secs Elapsed
     24.594 secs CPU time
Speedup     : 71.30%
Ratio       : 3.48 x
------------------------
Running app : gcomp_u6_fft.exe
Started at  : 00:43:01.725
Ended at    : 00:43:21.231
     19.463 secs Elapsed
     19.047 secs CPU time
Speedup     : 77.78%
Ratio       : 4.50 x
------------------------

Collecting hardware / OS infos, please wait...
Sorting ...

Bench results file V8-SK01-07.09.2010-046-bench.txt
stored in .\Testdatas\ directory.

Quick timetable
--------------------------------------
gmsc_fft.exe : 85.703 secs CPU
Result      : stored as reference.
--------------------------------------
g054_fft.exe : 26.172 secs CPU
Speedup     : 69.46%
Ratio       : 3.27 x
--------------------------------------
g060_fft.exe : 18.844 secs CPU
Speedup     : 78.01%
Ratio       : 4.55 x
--------------------------------------
g065_fft.exe : 18.813 secs CPU
Speedup     : 78.05%
Ratio       : 4.56 x
--------------------------------------
g2011_fft.exe : 18.391 secs CPU
Speedup     : 78.54%
Ratio       : 4.66 x
--------------------------------------
g2011_Qxh_fft.exe : 26.109 secs CPU
Speedup     : 69.54%
Ratio       : 3.28 x
--------------------------------------
g2011_SSSE3_fft.exe : 24.594 secs CPU
Speedup     : 71.30%
Ratio       : 3.48 x
--------------------------------------
gcomp_u6_fft.exe : 19.047 secs CPU
Speedup     : 77.78%
Ratio       : 4.50 x
--------------------------------------

======================================
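The Speedup and Ratio lines in the log above appear to be derived from the CPU times rather than the elapsed times. The sketch below reproduces them; the formulas are inferred from the printed numbers and are not taken from the bench script itself, and the function names are hypothetical.

```c
#include <assert.h>

/* Ratio: how many times faster than the reference (reference / candidate). */
double ratio_vs_reference(double ref_cpu_secs, double app_cpu_secs)
{
    assert(ref_cpu_secs > 0.0 && app_cpu_secs > 0.0);
    return ref_cpu_secs / app_cpu_secs;
}

/* Speedup: percentage of the reference CPU time that was saved. */
double speedup_percent(double ref_cpu_secs, double app_cpu_secs)
{
    assert(ref_cpu_secs > 0.0 && app_cpu_secs > 0.0);
    return 100.0 * (1.0 - app_cpu_secs / ref_cpu_secs);
}
```

With the reference gmsc_fft.exe at 85.703 secs CPU and g054_fft.exe at 26.172 secs CPU, this gives a ratio of about 3.27 and a speedup of about 69.46%, as printed. Using the elapsed times instead (86.346 vs. 19.008 secs) would give about 4.54, which is why the builds whose CPU time exceeds their elapsed time look slower in this ranking.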
heinz

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #603 on: 07 Sep 2010, 03:38:45 pm »
Hi Jason,
compiled a g2011_QxSSE3_ATOM_fft_small with
//    int n_entries = 8*1024*1024;
   int n_entries = 1*1024*1024;
I can't see any big differences in the Gflop/s compared with the full matrix.
But to confirm this I will run a test series for all 1-8*1024*1024 matrices.
Protocol attached.
heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #604 on: 07 Sep 2010, 04:11:44 pm »
Aha. Yes, there will be some consideration of how the FFTs fit on the GPU.  Of course the 480 can fit more at once, so the small dataset is underutilising the GPU.  It's pretty clear, then, that for the larger GPUs concurrent streams must be used with the small dataset.

It will be interesting to make a modified test to do 4 or 16 of the smaller batches of FFTs at the same time on Fermi to see if that'll bring GPU usage up.

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #605 on: 07 Sep 2010, 04:52:12 pm »
Hi Jason,
I ran gp_fft_1-8 on the ION (R3600 ATOM).
If you would like a closer look, the result file is attached.
Edit:
I compiled 1-8 on the Xeon (GF470) and ran it to see whether there are general differences from the ION.
Have a look at the runtimes;
the result file is attached.
Somewhat later:
I ran GPU-Z while 1-8 were running. At the beginning, while the short runs 1, 2, 3, 4 were running, GPU usage started at 95% and then slowly fell back to 40% by the time 8 was running.
Look at the picture here.

The ION shows a completely different picture: 1-4 show about 70% at the beginning of 1, but then the load turns into spikes and periods of no CPU usage, which means the necessary calculations between the batches took too much time to feed the GPU continuously.
ion_fft_1-4_gpu_load
ion_fft_6-7_gpu_load
heinz
« Last Edit: 07 Sep 2010, 07:58:25 pm by _heinz »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: optimized sources
« Reply #606 on: 08 Sep 2010, 02:38:41 am »
To emulate the current MB FFT situation, it's worth testing FFTs in bunches where num_of_ffts*size_of_fft always == 1M points.
That is, a small FFT size means a large number of FFTs that can be done at once, whereas large FFTs come in smaller numbers.
If a few cfft pairs are unrolled, the rule stays the same; 1M is just replaced by 1M*number_of_unrolled_cfft_pairs.
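The rule above can be sketched as a small helper that derives the batch count from the FFT size while keeping the total number of points fixed. This is only an illustration: the function name and the unrolled_pairs parameter are hypothetical, and it assumes the FFT size divides the total evenly, as the power-of-two sizes in this thread do.

```c
#include <assert.h>

/* 1M complex points, the data set size the thread quotes for Multibeam. */
#define MB_POINTS (1L * 1024 * 1024)

/* Number of FFTs of length fft_size that cover unrolled_pairs * 1M points.
   unrolled_pairs is a hypothetical parameter standing in for
   number_of_unrolled_cfft_pairs; pass 1 for the plain case. */
long fft_batch_count(long fft_size, long unrolled_pairs)
{
    long total;
    assert(fft_size > 0 && unrolled_pairs > 0);
    total = MB_POINTS * unrolled_pairs;
    assert(total % fft_size == 0); /* sizes are assumed to divide the total evenly */
    return total / fft_size;
}
```

For example, fft_batch_count(8, 1) is 131072 and fft_batch_count(512, 1) is 2048, matching the Batch column of the GTX 480 table earlier in the thread, while fft_batch_count(8, 8) is 1048576, matching the 8*1024*1024 runs on the ION.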

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #607 on: 08 Sep 2010, 02:55:38 am »
Yes, it's interesting that Heinz's ION shows no speed difference with changes to the batch to reflect the 1M points we use in multibeam. 

The smaller ION GPU seems already filled by this smaller batch.  I previously made a modified version of this test that sticks to 1M points but chains it with the getpowerspectrum kernel as in the app, and includes the flops for that accordingly, so it better matches what we'll need for profiling / refinement.  I'll have to dig that one out of backups, due to a recent OS reinstall; it has the freaky small-size powerspectrum prototypes in it, compared against the stock method.

All that is clear so far is that 1M points doesn't fill the 480 here, so concurrency at the chirp rate level may be necessary for larger cards.

Curiously, nSight measures my smaller sized FFTs at .33 occupancy and the CuFFT ones at .17, and I tuned those for generic compute capability 1.0 devices.  That would seem to further indicate to me that CuFFT is really meant for larger batches than our 1M points ... (Or is being used incorrectly somehow  :o)


Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #608 on: 08 Sep 2010, 06:20:14 am »
To make things clearer, I compared the GPU load of 1 against 8.
The first compact part you see is 1; after that comes 8 in exactly 9 pieces. The max value in the middle is 512.

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    4.3    4.6   1.5      6.7    7.2   1.6      6.8   3.0
  16  524288    5.5    4.4   1.7      7.0    5.6   1.4      7.0   2.3
  64  131072    8.6    4.6   1.7     10.3    5.5   2.2     10.3   3.4
 256   32768   10.0    4.0   2.0      9.5    3.8   2.0      9.7   3.5
 512   16384   10.5    3.7   2.1     12.4    4.4   2.5     12.4   4.2
1024    8192    9.0    2.9   2.5      9.1    2.9   2.4      9.0   4.5
2048    4096    8.5    2.5   2.7      8.8    2.6   3.0      8.8   5.1
4096    2048    7.0    1.9   2.4      9.9    2.7   3.3     10.2   5.4
8192    1024    6.4    1.6   2.4      9.5    2.3   3.4      9.4   5.7

You can clearly see how the 9 batches show up in the GPU load on the ION.
ion_fft_1_and_8_gpu_load

heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #609 on: 08 Sep 2010, 11:09:21 am »
You need nSight Heinz:


Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #610 on: 08 Sep 2010, 04:58:19 pm »
Hi Jason,
I'm downloading nSight now. Do I need a Standard license code or a Professional one?
btw 2011 is installed on the R3600 Atom now.
heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #611 on: 08 Sep 2010, 06:16:41 pm »
They are giving away a short pro license with the free standard one I think.  I entered the pro one to try out, but I'm not sure what the features difference is.

Also note that it must be used on Visual Studio 2008 with Service pack 1 for now. 

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #612 on: 15 Sep 2010, 05:35:45 pm »
I have now applied to NVIDIA to become a registered developer. At the beginning of CUDA I was already a registered developer, but when they reorganized their websites I lost my status and a new registration was necessary.
Now I'm waiting for an answer.

heinz

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #613 on: 15 Sep 2010, 05:57:47 pm »
Quote from: Jason G
Also note that it must be used on Visual Studio 2008 with Service pack 1 for now.

I tried to install "Service Pack 1", but it says Service Pack 1 is already included in "VS2008 Professional" and aborts the operation.
When I then try to install "Parallel_Nsight_Host_Win32_1.0.10200 (Jul 2010)", it says Service Pack 1 is not installed and rolls back the installation.
Curious, anyhow...
Any ideas?
Perhaps I should post this in the NVIDIA forums.
heinz
« Last Edit: 15 Sep 2010, 06:10:24 pm by _heinz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #614 on: 15 Sep 2010, 06:14:27 pm »
Yeah, really weird, heinz  :o  My installation was from raw VS2008; I then applied the service pack and installed nSight with no problems.  Maybe they have some slight problem with the SP1-integrated version.  My guess is you wouldn't be the first to see this issue and you'll find a workaround/fix on the forum already.

 
