optimized sources
_heinz:
CUDA 3.1 on W7 was a hard nut to crack: the environment variables were not set by the setup as they should have been.
After sorting that out and making the entries by hand, it is running now.
Qxh compiled
compiled with:2011 beta updat1
configuration:(Release) Platform(Win32)
g2011_Qxh_ATOM_fft started
23:55:22.036
g2011_Qxh_ATOM_fft.exe
Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
               -------CUFFT-------  --This prototype---  ---two way---
   N    Batch  Gflop/s  GB/s error  Gflop/s  GB/s error  Gflop/s error
   8  1048576      4.3   4.6   1.5      6.8   7.3   1.6      6.9   3.0
  16   524288      5.7   4.6   1.7      7.1   5.7   1.4      7.0   2.3
  64   131072      8.6   4.6   1.7     10.3   5.5   2.2     10.3   3.4
 256    32768     10.0   4.0   2.0      9.4   3.8   2.0      9.5   3.5
 512    16384     10.4   3.7   2.1     12.4   4.4   2.5     12.4   4.2
1024     8192      9.0   2.9   2.5      9.1   2.9   2.4      9.1   4.5
2048     4096      8.5   2.5   2.7      8.8   2.6   3.0      8.9   5.1
4096     2048      7.0   1.9   2.4     10.1   2.7   3.3     10.2   5.4
8192     1024      6.4   1.6   2.4      9.5   2.3   3.4      9.5   5.7
Errors are supposed to be of order of 1 (ULPs).
elapsed time=64 seconds
23:56:26.683
g2011_Qxh_ATOM_fft ended
-------------------------------------------------
compiled with:MS Compiler
configuration:(Release) Platform(Win32)
gmsc_fft started
23:30:06.184
gmsc_fft.exe
Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3000.
               -------CUFFT-------  --This prototype---  ---two way---
   N    Batch  Gflop/s  GB/s error  Gflop/s  GB/s error  Gflop/s error
   8  1048576      0.5   0.5   1.8      6.9   7.3   1.6      6.9   2.1
  16   524288      1.0   0.8   2.2      7.0   5.6   1.5      7.1   1.9
  64   131072      5.8   3.1   1.7     10.3   5.5   2.4     10.3   3.0
 256    32768      8.6   3.4   1.7      9.5   3.8   1.9      9.4   2.9
 512    16384     13.2   4.7   2.1     12.5   4.4   2.5     12.4   3.7
1024     8192      9.5   3.0   2.3     10.6   3.4   2.4     10.6   3.9
2048     4096      8.6   2.5   2.6      8.8   2.6   3.0      8.9   4.5
4096     2048      7.4   2.0   2.2     10.1   2.7   3.3     10.2   4.9
8192     1024      9.1   2.3   2.8      9.6   2.4   3.4      9.5   5.2
Errors are supposed to be of order of 1 (ULPs).
elapsed time=249 seconds
23:34:15.909
gmsc_fft ended
-------------------------------------------------
The MS compiler build with CUDA 3000 needs about 4 times as long for the same job (249 seconds vs. 64 seconds).
heinz
Jason G:
Hi Heinz,
Numbers come out different when you change to the same data set size that Multibeam apps use ( 1*1024*1024 complex data points).
CUFFT is not very fast at the small sizes for that small amount of data. It gets relatively better as the FFT size goes up. I haven't optimised these custom ones (so they remain arranged for ~G80 GPUs), but I did change them to give in-order output. I didn't need two-way, so I made forward & inverse transforms instead.
You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
               -------CUFFT-------  --------FFT--------  ----IFFT-----
   N    Batch  Gflop/s  GB/s error  Gflop/s  GB/s error  Gflop/s error
   8   131072      7.1   7.6   1.4    140.0 149.4   1.1    140.5   1.1
  16    65536     16.1  12.9   1.7    183.1 146.5   1.0    183.7   1.0
  64    16384    259.2 138.2   1.4    280.0 149.4   1.4    279.7   1.4
 256     4096    352.2 140.9   1.4    352.8 141.1   1.5    352.0   1.5
 512     2048    413.3 146.9   1.8    411.8 146.4   1.8    412.2   1.8
Errors are supposed to be of order of 1 (ULPs).
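The Gflop/s and GB/s columns in these tables look tied together by a fixed cost per data point. Here is a minimal sketch of that relationship, assuming the benchmark counts the nominal 5*log2(N) flops per point and one read plus one write of a single-precision complex value (16 bytes) per point. Neither assumption is stated in the thread, but both agree with the GTX 480 rows above. The function names are hypothetical.

```c
#include <assert.h>

/* Integer log2 for a power-of-two FFT size. */
int ilog2_pow2(int n) {
    int k = 0;
    assert(n > 0 && (n & (n - 1)) == 0);   /* must be a power of two */
    while (n > 1) { n >>= 1; ++k; }
    return k;
}

/* Assumed flop count per point of a size-n complex FFT: 5 * log2(n). */
double flops_per_point(int n) {
    assert(n >= 2);
    return 5.0 * ilog2_pow2(n);
}

/* Assumed traffic per point: read + write of one complex float. */
double bytes_per_point(void) {
    return 2.0 * (2.0 * sizeof(float));
}

/* GB/s implied by a measured Gflop/s figure under the assumptions above. */
double implied_gbs(int n, double gflops) {
    return gflops * bytes_per_point() / flops_per_point(n);
}
```

For N = 8 and 140.0 Gflop/s this gives about 149.3, close to the 149.4 GB/s in the table. It also suggests why the small sizes top out near 150 GB/s: under these assumptions they are limited by memory bandwidth rather than by arithmetic.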
_heinz:
Thanks Jason, I will try it as you said.
The modified bench shows how the runtime depends on the compiler and the compiler options. The reference is the MSC compiler build. The source is always the same.
CLEANUP DONE
1 reference science app(s) found
└─(gmsc_fft.exe)
7 science app(s) found
└─(g054_fft.exe)
└─(g060_fft.exe)
└─(g065_fft.exe)
└─(g2011_fft.exe)
└─(g2011_Qxh_fft.exe)
└─(g2011_SSSE3_fft.exe)
└─(gcomp_u6_fft.exe)
======================================
------------------------
Running app : gmsc_fft.exe
Started at : 00:39:39.679
Ended at : 00:41:06.070
86.346 secs Elapsed
85.703 secs CPU time
Result : stored as reference.
------------------------
Running app : g054_fft.exe
Started at : 00:41:06.137
Ended at : 00:41:25.189
19.008 secs Elapsed
26.172 secs CPU time
Speedup : 69.46%
Ratio : 3.27 x
------------------------
Running app : g060_fft.exe
Started at : 00:41:25.318
Ended at : 00:41:44.833
19.466 secs Elapsed
18.844 secs CPU time
Speedup : 78.01%
Ratio : 4.55 x
------------------------
Running app : g065_fft.exe
Started at : 00:41:44.967
Ended at : 00:42:04.481
19.467 secs Elapsed
18.813 secs CPU time
Speedup : 78.05%
Ratio : 4.56 x
------------------------
Running app : g2011_fft.exe
Started at : 00:42:04.602
Ended at : 00:42:23.763
19.117 secs Elapsed
18.391 secs CPU time
Speedup : 78.54%
Ratio : 4.66 x
------------------------
Running app : g2011_Qxh_fft.exe
Started at : 00:42:23.896
Ended at : 00:42:42.663
18.722 secs Elapsed
26.109 secs CPU time
Speedup : 69.54%
Ratio : 3.28 x
------------------------
Running app : g2011_SSSE3_fft.exe
Started at : 00:42:42.786
Ended at : 00:43:01.603
18.774 secs Elapsed
24.594 secs CPU time
Speedup : 71.30%
Ratio : 3.48 x
------------------------
Running app : gcomp_u6_fft.exe
Started at : 00:43:01.725
Ended at : 00:43:21.231
19.463 secs Elapsed
19.047 secs CPU time
Speedup : 77.78%
Ratio : 4.50 x
------------------------
Collecting hardware / OS infos, please wait...
Sorting ...
Bench results file V8-SK01-07.09.2010-046-bench.txt
stored in .\Testdatas\ directory.
Quick timetable
--------------------------------------
gmsc_fft.exe : 85.703 secs CPU
Result : stored as reference.
--------------------------------------
g054_fft.exe : 26.172 secs CPU
Speedup : 69.46%
Ratio : 3.27 x
--------------------------------------
g060_fft.exe : 18.844 secs CPU
Speedup : 78.01%
Ratio : 4.55 x
--------------------------------------
g065_fft.exe : 18.813 secs CPU
Speedup : 78.05%
Ratio : 4.56 x
--------------------------------------
g2011_fft.exe : 18.391 secs CPU
Speedup : 78.54%
Ratio : 4.66 x
--------------------------------------
g2011_Qxh_fft.exe : 26.109 secs CPU
Speedup : 69.54%
Ratio : 3.28 x
--------------------------------------
g2011_SSSE3_fft.exe : 24.594 secs CPU
Speedup : 71.30%
Ratio : 3.48 x
--------------------------------------
gcomp_u6_fft.exe : 19.047 secs CPU
Speedup : 77.78%
Ratio : 4.50 x
--------------------------------------
======================================
heinz
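The Speedup and Ratio lines in the bench output appear to be derived from the CPU times, not the elapsed times. A minimal sketch of that arithmetic, assuming Speedup is the fraction of reference CPU time saved and Ratio is reference time divided by app time (the function names are hypothetical, not taken from the bench tool):

```c
#include <assert.h>

/* Percentage of the reference CPU time that the app saves. */
double speedup_percent(double t_ref, double t_app) {
    assert(t_ref > 0.0);
    return (1.0 - t_app / t_ref) * 100.0;
}

/* How many times faster the app is than the reference. */
double speed_ratio(double t_ref, double t_app) {
    assert(t_app > 0.0);
    return t_ref / t_app;
}
```

With the reference at 85.703 secs CPU and g054_fft.exe at 26.172 secs CPU, this reproduces the reported 69.46% and 3.27 x. Note that all builds finish in about 19 secs elapsed, so the CPU-time spread (18.4 to 26.2 secs) is where the builds actually differ.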
_heinz:
--- Quote from: Jason G on 05 Sep 2010, 06:48:10 pm ---
Hi Heinz,
Numbers come out different when you change to the same data set size that Multibeam apps use ( 1*1024*1024 complex data points).
CUFFT is not very fast at the small sizes for that small amount of data. It gets relatively better as the FFT size goes up. I haven't optimised these custom ones (so they remain arranged for ~G80 GPUs), but I did change them to give in-order output. I didn't need two-way, so I made forward & inverse transforms instead.
You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
               -------CUFFT-------  --------FFT--------  ----IFFT-----
   N    Batch  Gflop/s  GB/s error  Gflop/s  GB/s error  Gflop/s error
   8   131072      7.1   7.6   1.4    140.0 149.4   1.1    140.5   1.1
  16    65536     16.1  12.9   1.7    183.1 146.5   1.0    183.7   1.0
  64    16384    259.2 138.2   1.4    280.0 149.4   1.4    279.7   1.4
 256     4096    352.2 140.9   1.4    352.8 141.1   1.5    352.0   1.5
 512     2048    413.3 146.9   1.8    411.8 146.4   1.8    412.2   1.8
Errors are supposed to be of order of 1 (ULPs).
--- End quote ---
Hi Jason,
I compiled a g2011_QxSSE3_ATOM_fft_small build with
// int n_entries = 8*1024*1024;
int n_entries = 1*1024*1024;
I can't see any big differences in the Gflop/s compared with the full matrix.
But to confirm this I will run a test series for all matrix sizes from 1*1024*1024 to 8*1024*1024.
Log attached.
heinz
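For reference, the Batch column in the tables follows directly from n_entries: each row runs n_entries / N transforms of size N. A minimal sketch of that relation and of the planned 1 to 8 series, assuming n_entries is always a multiple of N (the helper names are hypothetical):

```c
#include <assert.h>

/* Number of size-n transforms that cover n_entries complex points. */
int batch_for(int n_entries, int n) {
    assert(n > 0 && n_entries % n == 0);
    return n_entries / n;
}

/* Batch counts for the test series k*1024*1024, k = 1..8, at FFT size n. */
void series_batches(int n, int out[8]) {
    for (int k = 1; k <= 8; ++k) {
        out[k - 1] = batch_for(k * 1024 * 1024, n);
    }
}
```

With n_entries = 8*1024*1024 and N = 8 this gives the 1048576 batches of the ION runs, and with 1*1024*1024 it gives the 131072 batches of the GTX 480 run.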
Jason G:
Aha. Yes, there will be some consideration of how the FFTs fit on the GPU. Of course the 480 can fit more at once, so the small dataset is underutilising the GPU. It's pretty clear, then, that on the larger GPUs concurrent streams must be used with the small dataset.
It will be interesting to make a modified test to do 4 or 16 of the smaller batches of FFTs at the same time on Fermi to see if that'll bring GPU usage up.
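As a rough idea of the host-side bookkeeping such a test would need, here is a minimal sketch that places k equal small datasets back to back in one buffer and reports the memory they take. It assumes single-precision complex data processed in place, and all names are hypothetical. The stream handling itself is not shown; each dataset would be issued on its own CUDA stream, for instance by attaching a CUFFT plan to a stream with cufftSetStream.

```c
#include <assert.h>
#include <stddef.h>

/* One single-precision complex point: real + imaginary float (assumed layout). */
#define BYTES_PER_POINT (2 * sizeof(float))

/* Element offset of dataset s when k equal datasets share one buffer. */
size_t dataset_offset(size_t n_entries, int k, int s) {
    assert(k > 0 && s >= 0 && s < k);
    return (size_t)s * n_entries;
}

/* Total bytes needed for k datasets transformed in place. */
size_t total_bytes(size_t n_entries, int k) {
    assert(k > 0);
    return (size_t)k * n_entries * BYTES_PER_POINT;
}
```

Sixteen datasets of 1*1024*1024 points come to 128 MB under these assumptions, which fits easily in the 1503 MB of the GTX 480. Out-of-place transforms would double that figure.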