Author Topic: optimized sources (Read 767827 times)

_heinz · « **Reply #600 on:** 05 Sep 2010, 06:17:02 pm »

CUDA 3.1 on W7 was a hard nut to crack: the environment variables were not set by the setup as they should have been.
After sorting that out and making the entries by hand, it is running now.
Qxh compiled
compiled with:2011 beta update 1
configuration:(Release) Platform(Win32)
g2011_Qxh_ATOM_fft started
23:55:22.036
g2011_Qxh_ATOM_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
              --------CUFFT-------  ---This prototype---  ---two way---
   N    Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  1048576     4.3   4.6    1.5      6.8   7.3    1.6      6.9   3.0
  16   524288     5.7   4.6    1.7      7.1   5.7    1.4      7.0   2.3
  64   131072     8.6   4.6    1.7     10.3   5.5    2.2     10.3   3.4
 256    32768    10.0   4.0    2.0      9.4   3.8    2.0      9.5   3.5
 512    16384    10.4   3.7    2.1     12.4   4.4    2.5     12.4   4.2
1024     8192     9.0   2.9    2.5      9.1   2.9    2.4      9.1   4.5
2048     4096     8.5   2.5    2.7      8.8   2.6    3.0      8.9   5.1
4096     2048     7.0   1.9    2.4     10.1   2.7    3.3     10.2   5.4
8192     1024     6.4   1.6    2.4      9.5   2.3    3.4      9.5   5.7

Errors are supposed to be of order of 1 (ULPs).

elapsed time=64 seconds
23:56:26.683
g2011_Qxh_ATOM_fft ended
-------------------------------------------------
compiled with:MS Compiler
configuration:(Release) Platform(Win32)
gmsc_fft started
23:30:06.184
gmsc_fft.exe

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3000.
              --------CUFFT-------  ---This prototype---  ---two way---
   N    Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  1048576     0.5   0.5    1.8      6.9   7.3    1.6      6.9   2.1
  16   524288     1.0   0.8    2.2      7.0   5.6    1.5      7.1   1.9
  64   131072     5.8   3.1    1.7     10.3   5.5    2.4     10.3   3.0
 256    32768     8.6   3.4    1.7      9.5   3.8    1.9      9.4   2.9
 512    16384    13.2   4.7    2.1     12.5   4.4    2.5     12.4   3.7
1024     8192     9.5   3.0    2.3     10.6   3.4    2.4     10.6   3.9
2048     4096     8.6   2.5    2.6      8.8   2.6    3.0      8.9   4.5
4096     2048     7.4   2.0    2.2     10.1   2.7    3.3     10.2   4.9
8192     1024     9.1   2.3    2.8      9.6   2.4    3.4      9.5   5.2

Errors are supposed to be of order of 1 (ULPs).

elapsed time=249 seconds
23:34:15.909
gmsc_fft ended
-------------------------------------------------
The MS compiler build with CUDA 3000 needs about 4 times as long for the same job (249 seconds versus 64 seconds elapsed).
heinz

Jason G · « **Reply #601 on:** 05 Sep 2010, 06:48:10 pm »

Hi Heinz,

Numbers come out different when you change to the same data set size that Multibeam apps use (1*1024*1024 complex data points).
CUFFT is not very fast at the small sizes for that small amount of data. It gets relatively better as the FFT size goes up. I haven't optimised these custom ones (so they remain arranged for roughly G80-class GPUs), but I did change them to give in-order output. I didn't need two-way, so I made forward & inverse transforms instead.

You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---FFT--------------  ---IFFT------
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  131072    7.1    7.6   1.4    140.0  149.4   1.1    140.5   1.1
  16   65536   16.1   12.9   1.7    183.1  146.5   1.0    183.7   1.0
  64   16384  259.2  138.2   1.4    280.0  149.4   1.4    279.7   1.4
 256    4096  352.2  140.9   1.4    352.8  141.1   1.5    352.0   1.5
 512    2048  413.3  146.9   1.8    411.8  146.4   1.8    412.2   1.8

Errors are supposed to be of order of 1 (ULPs).
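
A note on reading the columns: the Gflop/s and GB/s figures above look consistent with the common convention of counting 5*N*log2(N) flops per N-point complex FFT and 16 bytes of memory traffic per point (one 8-byte single-precision complex value read, one written). That convention is inferred from the numbers, not confirmed from the benchmark source, and the function names below are made up for illustration. A minimal C sketch under that assumption:

```c
#include <assert.h>

/* log2 of a power-of-two FFT size, computed with integer shifts. */
int log2_pow2(unsigned n)
{
    int bits = 0;
    assert(n != 0 && (n & (n - 1)) == 0);   /* n must be a power of two */
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

/* ASSUMPTION: 5*N*log2(N) flops per FFT and 16 bytes moved per point,
   which gives 5*log2(N)/16 flops per byte of traffic. */
double flops_per_byte(unsigned fft_size)
{
    return 5.0 * log2_pow2(fft_size) / 16.0;
}

/* Memory bandwidth implied by a measured Gflop/s figure under that assumption. */
double implied_gb_per_s(double gflops, unsigned fft_size)
{
    return gflops / flops_per_byte(fft_size);
}
```

For N = 8, 140.0 Gflop/s works out to about 149.3 GB/s, and for N = 512, 413.3 Gflop/s works out to about 147 GB/s, both in line with the GB/s columns. If the assumption holds, the custom kernels sit at roughly the same memory bandwidth for every size, which would suggest they are bandwidth bound rather than compute bound.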

_heinz · « **Reply #602 on:** 06 Sep 2010, 08:35:37 pm »

Thanks Jason, I will try it as you said.

The modified bench shows how the runtime depends on the compiler and the compiler options. The reference is the MSC compiler build. The source is the same in every case.

CLEANUP DONE

1 reference science app(s) found
└─(gmsc_fft.exe)

7 science app(s) found
└─(g054_fft.exe)
└─(g060_fft.exe)
└─(g065_fft.exe)
└─(g2011_fft.exe)
└─(g2011_Qxh_fft.exe)
└─(g2011_SSSE3_fft.exe)
└─(gcomp_u6_fft.exe)

======================================

------------------------
Running app : gmsc_fft.exe
Started at : 00:39:39.679
Ended at : 00:41:06.070
86.346 secs Elapsed
85.703 secs CPU time
Result : stored as reference.
------------------------
Running app : g054_fft.exe
Started at : 00:41:06.137
Ended at : 00:41:25.189
19.008 secs Elapsed
26.172 secs CPU time
Speedup : 69.46%
Ratio : 3.27 x
------------------------
Running app : g060_fft.exe
Started at : 00:41:25.318
Ended at : 00:41:44.833
19.466 secs Elapsed
18.844 secs CPU time
Speedup : 78.01%
Ratio : 4.55 x
------------------------
Running app : g065_fft.exe
Started at : 00:41:44.967
Ended at : 00:42:04.481
19.467 secs Elapsed
18.813 secs CPU time
Speedup : 78.05%
Ratio : 4.56 x
------------------------
Running app : g2011_fft.exe
Started at : 00:42:04.602
Ended at : 00:42:23.763
19.117 secs Elapsed
18.391 secs CPU time
Speedup : 78.54%
Ratio : 4.66 x
------------------------
Running app : g2011_Qxh_fft.exe
Started at : 00:42:23.896
Ended at : 00:42:42.663
18.722 secs Elapsed
26.109 secs CPU time
Speedup : 69.54%
Ratio : 3.28 x
------------------------
Running app : g2011_SSSE3_fft.exe
Started at : 00:42:42.786
Ended at : 00:43:01.603
18.774 secs Elapsed
24.594 secs CPU time
Speedup : 71.30%
Ratio : 3.48 x
------------------------
Running app : gcomp_u6_fft.exe
Started at : 00:43:01.725
Ended at : 00:43:21.231
19.463 secs Elapsed
19.047 secs CPU time
Speedup : 77.78%
Ratio : 4.50 x
------------------------

Collecting hardware / OS infos, please wait...
Sorting ...

Bench results file V8-SK01-07.09.2010-046-bench.txt
stored in .\Testdatas\ directory.

Quick timetable
--------------------------------------
gmsc_fft.exe : 85.703 secs CPU
Result : stored as reference.
--------------------------------------
g054_fft.exe : 26.172 secs CPU
Speedup : 69.46%
Ratio : 3.27 x
--------------------------------------
g060_fft.exe : 18.844 secs CPU
Speedup : 78.01%
Ratio : 4.55 x
--------------------------------------
g065_fft.exe : 18.813 secs CPU
Speedup : 78.05%
Ratio : 4.56 x
--------------------------------------
g2011_fft.exe : 18.391 secs CPU
Speedup : 78.54%
Ratio : 4.66 x
--------------------------------------
g2011_Qxh_fft.exe : 26.109 secs CPU
Speedup : 69.54%
Ratio : 3.28 x
--------------------------------------
g2011_SSSE3_fft.exe : 24.594 secs CPU
Speedup : 71.30%
Ratio : 3.48 x
--------------------------------------
gcomp_u6_fft.exe : 19.047 secs CPU
Speedup : 77.78%
Ratio : 4.50 x
--------------------------------------

======================================
heinz
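
For anyone reading the log: the Speedup and Ratio lines appear to be derived from the CPU times relative to the reference app, not from the elapsed times. The formulas below are inferred from the printed numbers, not taken from the bench tool's source, and the function names are hypothetical. A minimal C sketch:

```c
#include <assert.h>

/* Ratio: how many times faster than the reference, t_ref / t_app. */
double ratio_vs_reference(double ref_secs, double app_secs)
{
    assert(ref_secs > 0.0 && app_secs > 0.0);
    return ref_secs / app_secs;
}

/* Speedup: percentage of the reference time saved, 100 * (1 - t_app / t_ref). */
double speedup_percent(double ref_secs, double app_secs)
{
    assert(ref_secs > 0.0 && app_secs > 0.0);
    return 100.0 * (1.0 - app_secs / ref_secs);
}
```

With the CPU times from the log, speedup_percent(85.703, 26.172) comes to about 69.46 and ratio_vs_reference(85.703, 26.172) to about 3.27, matching the g054_fft.exe entry. Because these are CPU-time figures, the builds whose CPU time exceeds their elapsed time (g054, g2011_Qxh, g2011_SSSE3) look slower in the table even though all seven builds finish within about a second of each other by elapsed time.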

_heinz · « **Reply #603 on:** 07 Sep 2010, 03:38:45 pm »

Quote from: Jason G on 05 Sep 2010, 06:48:10 pm

Hi Heinz,

Numbers come out different when you change to the same data set size that Multibeam apps use (1*1024*1024 complex data points).
CUFFT is not very fast at the small sizes for that small amount of data. It gets relatively better as the FFT size goes up. I haven't optimised these custom ones (so they remain arranged for roughly G80-class GPUs), but I did change them to give in-order output. I didn't need two-way, so I made forward & inverse transforms instead.

You can see CUFFT goes pretty slowly when doing many small transforms on our smaller dataset.
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3000.
             --------CUFFT-------  ---FFT--------------  ---IFFT------
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  131072    7.1    7.6   1.4    140.0  149.4   1.1    140.5   1.1
  16   65536   16.1   12.9   1.7    183.1  146.5   1.0    183.7   1.0
  64   16384  259.2  138.2   1.4    280.0  149.4   1.4    279.7   1.4
 256    4096  352.2  140.9   1.4    352.8  141.1   1.5    352.0   1.5
 512    2048  413.3  146.9   1.8    411.8  146.4   1.8    412.2   1.8

Errors are supposed to be of order of 1 (ULPs). 

Hi Jason,
I compiled a g2011_QxSSE3_ATOM_fft_small with
// int n_entries = 8*1024*1024;
int n_entries = 1*1024*1024;
I can't see any big differences in the Gflop/s compared with the full 8*1024*1024 data set.
But to confirm this I will run a test series for all sizes from 1*1024*1024 to 8*1024*1024.
Log attached.
heinz

Jason G · « **Reply #604 on:** 07 Sep 2010, 04:11:44 pm »

Aha. Yes, there will be some consideration of how the FFTs fit on the GPU. Of course the 480 can fit more at once, so the small dataset is underutilising the GPU. It's pretty clear, then, that for the larger GPUs concurrent streams must be used with the small dataset.

It will be interesting to make a modified test to do 4 or 16 of the smaller batches of FFTs at the same time on Fermi to see if that'll bring GPU usage up.

_heinz · « **Reply #605 on:** 07 Sep 2010, 04:52:12 pm »

Hi Jason,
I ran gp_fft_1-8 on the ION (R3600 ATOM).
If you would like a closer look, the result file is attached.
Edit:
I compiled 1-8 on the Xeon (GF470) and ran it to see if there are general differences from the ION.
Have a look at the runtimes.
Result file attached.
Somewhat later:
I ran GPU-Z while 1-8 were running. At the beginning, while the short runs 1, 2, 3 and 4 were running, the GPU usage started at 95% and then slowly fell back to 40% by the time 8 was running.
Look at the picture here.

The ION shows a completely different picture: 1-4 show about 70% at the beginning of 1, but then the load turns into spikes and periods of no GPU usage. That means the necessary calculations between the batches took too much time to feed the GPU continuously.
ion_fft_1-4_gpu_load
ion_fft_6-7_gpu_load
heinz

Raistmer · « **Reply #606 on:** 08 Sep 2010, 02:38:41 am »

To emulate the current MB FFT situation, it's worth testing FFTs in batches where num_of_ffts*size_of_fft always == 1M points.
That is, a small FFT size means a large number of FFTs that can be done at once, whereas large FFTs come in smaller numbers.
If a few cfft pairs are unrolled, the rule stays the same; 1M should just be replaced with 1M*number_of_unrolled_cfft_pairs.
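
The rule above can be written down as a small helper. This is only a sketch: the function name, the MB_POINTS macro and the unrolled_pairs parameter are hypothetical names chosen for illustration, and the sketch assumes FFT sizes that divide the data set evenly.

```c
#include <assert.h>

/* Points in one Multibeam-sized data set: 1*1024*1024 complex values. */
#define MB_POINTS (1u * 1024u * 1024u)

/* Number of FFTs in a batch so that num_of_ffts * size_of_fft equals
   MB_POINTS * unrolled_pairs. unrolled_pairs = 1 is the plain case. */
unsigned batch_for_fft_size(unsigned fft_size, unsigned unrolled_pairs)
{
    unsigned total;
    assert(fft_size > 0 && unrolled_pairs > 0);
    total = MB_POINTS * unrolled_pairs;
    assert(total % fft_size == 0);   /* size must divide the data set evenly */
    return total / fft_size;
}
```

batch_for_fft_size(8, 1) gives 131072 and batch_for_fft_size(512, 1) gives 2048, the batch counts in Jason's GTX 480 table. With a factor of 8, size 8 gives 1048576, the same count as in the 8*1024*1024-point runs on the ION.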

Jason G · « **Reply #607 on:** 08 Sep 2010, 02:55:38 am »

Yes, it's interesting that Heinz's ION shows no speed difference with changes to the batch to reflect the 1M points we use in multibeam.

The smaller ION GPU seems to be filled already by this smaller batch. I have previously made a modified version of this test that sticks to 1M points, but chains it with the getpowerspectrum kernel like in the app, and includes flops for that accordingly, so it better matches what we'll need for profiling / refinement. That one I'll have to dig out from backups, due to a recent OS reinstall, and it has the small-size freaky powerspectrum prototypes in it for comparison against the stock method.

All that is clear so far is that 1M points doesn't fill the 480 here, so concurrency at the chirp rate level may be necessary for larger cards.

Curiously, nSight measures my smaller sized FFTs at .33 occupancy, and the CuFFT ones at .17, and I tuned those for generic compute capability 1.0 devices. That would seem to further indicate to me that CuFFT is really meant for larger batches than our 1M points ... (or is being used incorrectly somehow)

_heinz · « **Reply #608 on:** 08 Sep 2010, 06:20:14 am »

To make things clearer, I compared the GPU load of 1 against 8.
The first compact part you see is 1; after that comes 8 in exactly 9 pieces. The maximum in the middle is the 512 run.

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3010.
              --------CUFFT-------  ---This prototype---  ---two way---
   N    Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8  1048576     4.3   4.6    1.5      6.7   7.2    1.6      6.8   3.0
  16   524288     5.5   4.4    1.7      7.0   5.6    1.4      7.0   2.3
  64   131072     8.6   4.6    1.7     10.3   5.5    2.2     10.3   3.4
 256    32768    10.0   4.0    2.0      9.5   3.8    2.0      9.7   3.5
 512    16384    10.5   3.7    2.1     12.4   4.4    2.5     12.4   4.2
1024     8192     9.0   2.9    2.5      9.1   2.9    2.4      9.0   4.5
2048     4096     8.5   2.5    2.7      8.8   2.6    3.0      8.8   5.1
4096     2048     7.0   1.9    2.4      9.9   2.7    3.3     10.2   5.4
8192     1024     6.4   1.6    2.4      9.5   2.3    3.4      9.4   5.7

It is clear to see how the 9 batches show up in the GPU load on the ION.
ion_fft_1_and_8_gpu_load

heinz

Jason G · « **Reply #609 on:** 08 Sep 2010, 11:09:21 am »

You need nSight Heinz:

_heinz · « **Reply #610 on:** 08 Sep 2010, 04:58:19 pm »

Hi Jason,
I'm downloading nSight now. Do I need a Standard license code or a Professional one?
btw 2011 is installed on the R3600 Atom now.
heinz

Jason G · « **Reply #611 on:** 08 Sep 2010, 06:16:41 pm »

They are giving away a short pro license with the free standard one, I think. I entered the pro one to try it out, but I'm not sure what the difference in features is.

Also note that it must be used on Visual Studio 2008 with Service pack 1 for now.

_heinz · « **Reply #612 on:** 15 Sep 2010, 05:35:45 pm »

I have now applied to NVIDIA to become a registered developer. In the early days of CUDA I was already a registered developer, but when they reorganized their websites I lost my status and a new registration was necessary.
Now I'm waiting for an answer.

heinz

_heinz · « **Reply #613 on:** 15 Sep 2010, 05:57:47 pm »

Quote from: Jason G on 08 Sep 2010, 06:16:41 pm

Also note that it must be used on Visual Studio 2008 with Service pack 1 for now.

I tried to install "Service Pack 1", but it says Service Pack 1 is already included in "VS2008 Professional" and aborts the operation.
When I then try to install "Parallel_Nsight_Host_Win32_1.0.10200 (Jul 2010)", it says Service Pack 1 is not installed and rolls back the installation.
Rather curious...
Any ideas?
Perhaps I should post this in the NVIDIA forums.
heinz

Jason G · « **Reply #614 on:** 15 Sep 2010, 06:14:27 pm »

Yeah, really weird, heinz.

My installation was from a raw VS2008; I then applied the service pack and installed nSight with no problems. Maybe they have some slight problem with the SP1-integrated version. My guess is you wouldn't be the first to see this issue and you'll find a workaround/fix on the forum already.
