
optimized sources


_heinz:
CUDA 3.1 on W7 was a hard nut to crack: the environment variables were not set by the installer as they should have been.
After sorting that out and making the entries by hand, it is running now.
Qxh compiled
compiled with:2011 beta updat1
configuration:(Release) Platform(Win32)
g2011_Qxh_ATOM_fft  started
23:55:22.036
g2011_Qxh_ATOM_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    4.3    4.6   1.5      6.8    7.3   1.6      6.9   3.0
  16  524288    5.7    4.6   1.7      7.1    5.7   1.4      7.0   2.3
  64  131072    8.6    4.6   1.7     10.3    5.5   2.2     10.3   3.4
 256   32768   10.0    4.0   2.0      9.4    3.8   2.0      9.5   3.5
 512   16384   10.4    3.7   2.1     12.4    4.4   2.5     12.4   4.2
1024    8192    9.0    2.9   2.5      9.1    2.9   2.4      9.1   4.5
2048    4096    8.5    2.5   2.7      8.8    2.6   3.0      8.9   5.1
4096    2048    7.0    1.9   2.4     10.1    2.7   3.3     10.2   5.4
8192    1024    6.4    1.6   2.4      9.5    2.3   3.4      9.5   5.7

Errors are supposed to be of order of 1 (ULPs).

elapsed time=64 seconds
23:56:26.683
g2011_Qxh_ATOM_fft ended
-------------------------------------------------
compiled with:MS Compiler
configuration:(Release) Platform(Win32)
gmsc_fft  started
23:30:06.184
gmsc_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    0.5    0.5   1.8      6.9    7.3   1.6      6.9   2.1
  16  524288    1.0    0.8   2.2      7.0    5.6   1.5      7.1   1.9
  64  131072    5.8    3.1   1.7     10.3    5.5   2.4     10.3   3.0
 256   32768    8.6    3.4   1.7      9.5    3.8   1.9      9.4   2.9
 512   16384   13.2    4.7   2.1     12.5    4.4   2.5     12.4   3.7
1024    8192    9.5    3.0   2.3     10.6    3.4   2.4     10.6   3.9
2048    4096    8.6    2.5   2.6      8.8    2.6   3.0      8.9   4.5
4096    2048    7.4    2.0   2.2     10.1    2.7   3.3     10.2   4.9
8192    1024    9.1    2.3   2.8      9.6    2.4   3.4      9.5   5.2

Errors are supposed to be of order of 1 (ULPs).

elapsed time=249 seconds
23:34:15.909
gmsc_fft ended
-------------------------------------------------
The MS compiler build with CUDA 3000 needs about 4 times as long for the same job (249 s versus 64 s elapsed).
heinz
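
The "about 4 times" figure can be checked from the start and end stamps in the two logs. A minimal sketch, assuming both stamps of a run fall on the same day; the function names `clock_to_seconds` and `elapsed_seconds` are made up for illustration and are not part of the bench tool:

```c
#include <assert.h>
#include <stdio.h>

/* Convert a "HH:MM:SS.mmm" wall-clock stamp to seconds since midnight.
   Returns -1.0 if the string does not match that layout. */
double clock_to_seconds(const char *stamp)
{
    int h, m, s, ms;
    assert(stamp != NULL);
    if (sscanf(stamp, "%d:%d:%d.%d", &h, &m, &s, &ms) != 4)
        return -1.0;
    return h * 3600.0 + m * 60.0 + s + ms / 1000.0;
}

/* Elapsed seconds between two stamps taken on the same day
   (assumption: the run does not cross midnight). */
double elapsed_seconds(const char *start, const char *end)
{
    return clock_to_seconds(end) - clock_to_seconds(start);
}
```

With the stamps above, `elapsed_seconds("23:55:22.036", "23:56:26.683")` is roughly 64.6 s and `elapsed_seconds("23:30:06.184", "23:34:15.909")` is roughly 249.7 s, a ratio of about 3.9.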

Jason G:
Hi Heinz,

The numbers come out different when you change to the same data set size that the Multibeam apps use (1*1024*1024 complex data points).
CUFFT is not very fast at the small FFT sizes for that small amount of data.  It gets relatively better as the FFT size goes up.  I haven't optimised these custom ones (so they are still laid out for roughly G80-class GPUs), but I did change them to give in-order output.  I didn't need two-way, so I made forward & inverse transforms instead.

You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---FFT--------------  ---IFFT------
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  131072    7.1    7.6   1.4    140.0  149.4   1.1    140.5   1.1
  16   65536   16.1   12.9   1.7    183.1  146.5   1.0    183.7   1.0
  64   16384  259.2  138.2   1.4    280.0  149.4   1.4    279.7   1.4
 256    4096  352.2  140.9   1.4    352.8  141.1   1.5    352.0   1.5
 512    2048  413.3  146.9   1.8    411.8  146.4   1.8    412.2   1.8

Errors are supposed to be of order of 1 (ULPs).
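
The GB/s and Gflop/s columns in these tables appear to differ only by a fixed factor per N. A minimal sketch of that relation, assuming the common 5*N*log2(N) flop count per complex FFT and one read plus one write of single-precision complex data (8 bytes per point); the benchmark source is not shown in this thread, so both formulas and all function names here are assumptions:

```c
#include <assert.h>

/* Integer log2 of a power-of-two FFT length. */
static int ilog2(int n)
{
    int k = 0;
    assert(n > 0 && (n & (n - 1)) == 0);  /* power of two only */
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Assumed flop count for one complex FFT of length n: 5*n*log2(n). */
double fft_flops(int n)
{
    return 5.0 * n * ilog2(n);
}

/* Assumed memory traffic for one FFT: n points read and n points written,
   8 bytes per single-precision complex point. */
double fft_bytes(int n)
{
    return 2.0 * 8.0 * n;
}

/* GB/s implied by a Gflop/s figure under the two assumptions above. */
double gbps_from_gflops(double gflops, int n)
{
    return gflops * fft_bytes(n) / fft_flops(n);
}
```

Under these assumptions, 183.1 Gflop/s at N=16 corresponds to about 146.5 GB/s, which agrees with the table row. If the assumptions hold, the custom kernels sit at roughly 140 to 150 GB/s at every size, while CUFFT only reaches that range from N=64 upwards.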

_heinz:
Thanks Jason, I will try it as you said.

The modified bench shows how the runtime depends on the compiler and the compiler options. The reference is the MSC compiler build; the source is the same in every case.

CLEANUP DONE

 1 reference science app(s) found
   └─(gmsc_fft.exe)

 7 science app(s) found
   └─(g054_fft.exe)
   └─(g060_fft.exe)
   └─(g065_fft.exe)
   └─(g2011_fft.exe)
   └─(g2011_Qxh_fft.exe)
   └─(g2011_SSSE3_fft.exe)
   └─(gcomp_u6_fft.exe)

======================================

------------------------
Running app : gmsc_fft.exe
Started at  : 00:39:39.679
Ended at    : 00:41:06.070
     86.346 secs Elapsed
     85.703 secs CPU time
Result      : stored as reference.
------------------------
Running app : g054_fft.exe
Started at  : 00:41:06.137
Ended at    : 00:41:25.189
     19.008 secs Elapsed
     26.172 secs CPU time
Speedup     : 69.46%
Ratio       : 3.27 x
------------------------
Running app : g060_fft.exe
Started at  : 00:41:25.318
Ended at    : 00:41:44.833
     19.466 secs Elapsed
     18.844 secs CPU time
Speedup     : 78.01%
Ratio       : 4.55 x
------------------------
Running app : g065_fft.exe
Started at  : 00:41:44.967
Ended at    : 00:42:04.481
     19.467 secs Elapsed
     18.813 secs CPU time
Speedup     : 78.05%
Ratio       : 4.56 x
------------------------
Running app : g2011_fft.exe
Started at  : 00:42:04.602
Ended at    : 00:42:23.763
     19.117 secs Elapsed
     18.391 secs CPU time
Speedup     : 78.54%
Ratio       : 4.66 x
------------------------
Running app : g2011_Qxh_fft.exe
Started at  : 00:42:23.896
Ended at    : 00:42:42.663
     18.722 secs Elapsed
     26.109 secs CPU time
Speedup     : 69.54%
Ratio       : 3.28 x
------------------------
Running app : g2011_SSSE3_fft.exe
Started at  : 00:42:42.786
Ended at    : 00:43:01.603
     18.774 secs Elapsed
     24.594 secs CPU time
Speedup     : 71.30%
Ratio       : 3.48 x
------------------------
Running app : gcomp_u6_fft.exe
Started at  : 00:43:01.725
Ended at    : 00:43:21.231
     19.463 secs Elapsed
     19.047 secs CPU time
Speedup     : 77.78%
Ratio       : 4.50 x
------------------------

Collecting hardware / OS infos, please wait...
Sorting ...

Bench results file V8-SK01-07.09.2010-046-bench.txt
stored in .\Testdatas\ directory.

Quick timetable
--------------------------------------
gmsc_fft.exe : 85.703 secs CPU
Result      : stored as reference.
--------------------------------------
g054_fft.exe : 26.172 secs CPU
Speedup     : 69.46%
Ratio       : 3.27 x
--------------------------------------
g060_fft.exe : 18.844 secs CPU
Speedup     : 78.01%
Ratio       : 4.55 x
--------------------------------------
g065_fft.exe : 18.813 secs CPU
Speedup     : 78.05%
Ratio       : 4.56 x
--------------------------------------
g2011_fft.exe : 18.391 secs CPU
Speedup     : 78.54%
Ratio       : 4.66 x
--------------------------------------
g2011_Qxh_fft.exe : 26.109 secs CPU
Speedup     : 69.54%
Ratio       : 3.28 x
--------------------------------------
g2011_SSSE3_fft.exe : 24.594 secs CPU
Speedup     : 71.30%
Ratio       : 3.48 x
--------------------------------------
gcomp_u6_fft.exe : 19.047 secs CPU
Speedup     : 77.78%
Ratio       : 4.50 x
--------------------------------------

======================================
heinz
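
For reading the "Speedup" and "Ratio" lines: the printed values match the CPU times rather than the elapsed times. A minimal sketch of the arithmetic that reproduces them, assuming this is how the bench tool derives both figures (its source is not shown here, and the function names are made up for illustration):

```c
#include <assert.h>

/* How many times faster the app is than the reference (CPU seconds). */
double ratio(double ref_secs, double app_secs)
{
    assert(app_secs > 0.0);
    return ref_secs / app_secs;
}

/* Share of the reference CPU time that is saved, in percent. */
double speedup_percent(double ref_secs, double app_secs)
{
    assert(ref_secs > 0.0);
    return (1.0 - app_secs / ref_secs) * 100.0;
}
```

For g054_fft.exe, `ratio(85.703, 26.172)` is about 3.27 and `speedup_percent(85.703, 26.172)` is about 69.46, as in the log. Note that the elapsed times of all seven builds are within about a second of each other (18.7 to 19.5 s), so the spread in the ratios comes from the CPU time alone.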

_heinz:

--- Quote from: Jason G on 05 Sep 2010, 06:48:10 pm ---
Hi Heinz,

The numbers come out different when you change to the same data set size that the Multibeam apps use (1*1024*1024 complex data points).
CUFFT is not very fast at the small FFT sizes for that small amount of data.  It gets relatively better as the FFT size goes up.  I haven't optimised these custom ones (so they are still laid out for roughly G80-class GPUs), but I did change them to give in-order output.  I didn't need two-way, so I made forward & inverse transforms instead.

You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.

--- End quote ---
Hi Jason,
I compiled a g2011_QxSSE3_ATOM_fft_small with
//    int n_entries = 8*1024*1024;
   int n_entries = 1*1024*1024;
I can't see any big difference in Gflop/s compared with the full 8*1024*1024 data set.
But to confirm this I will run a test series for all data set sizes from 1*1024*1024 to 8*1024*1024.
Log attached.
heinz
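
The Batch column in the tables follows from `n_entries`: 8*1024*1024 points at N=8 give 1048576 transforms, and 1*1024*1024 points give 131072. A minimal sketch of the batch sizes for the planned 1 to 8 series, assuming Batch is simply `n_entries` divided by the FFT length; `batch_count` and `batch_series` are hypothetical helper names, not taken from the benchmark source:

```c
#include <assert.h>

/* Number of FFTs of length fft_len that fit in n_entries complex points. */
int batch_count(int n_entries, int fft_len)
{
    assert(fft_len > 0 && n_entries % fft_len == 0);
    return n_entries / fft_len;
}

/* Fill out[0..7] with the batch counts for data sets of
   1*1024*1024 up to 8*1024*1024 points at the given FFT length. */
void batch_series(int fft_len, int out[8])
{
    int k;
    for (k = 1; k <= 8; k++)
        out[k - 1] = batch_count(k * 1024 * 1024, fft_len);
}
```

For example, `batch_series(8, out)` runs from 131072 up to 1048576, which are the first-row Batch values of the GTX 480 table and the ION tables respectively.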

Jason G:
Aha. Yes, how the FFTs fit on the GPU needs some consideration.  Of course the 480 can fit more at once, so the small dataset is underutilising the GPU.  It's pretty clear, then, that on the larger GPUs concurrent streams must be used with the small dataset.

It will be interesting to make a modified test to do 4 or 16 of the smaller batches of FFTs at the same time on Fermi to see if that'll bring GPU usage up.
