Author Topic: optimized sources (Read 548393 times)

_heinz · « **Reply #405 on:** 26 Nov 2008, 01:35:14 pm »

Quote from: Raistmer on 25 Nov 2008, 04:35:48 pm

You are among the first to receive notification of the groundbreaking Intel® Parallel Composer Beta. Download this exciting new tool and get instant access to an advanced parallelism C/C++ compiler, debugger, and libraries that can change the way you develop parallel applications.

Will look...

I'm registered and downloading now..
We will see how this works...
:-)

_heinz · « **Reply #406 on:** 26 Nov 2008, 02:01:59 pm »

Hi Raistmer,
have you seen this?
Note when Installing the Intel(R) Parallel Composer Beta on a system with Intel(R) C++ Compiler
This is a limitation of the Intel(R) Parallel Composer beta's Integration with Microsoft Visual Studio*.

If you install the Intel Parallel Composer beta on a system that has Intel C++ Compiler 9.x, 10.x or 11.0 installed already, the IDE integration of Intel Parallel Composer will replace the existing IDE integration from Intel C++ Compiler. This causes the existing Intel C++ Compiler 9.x, 10.x or 11.0 not usable from within the Visual Studio IDE.

If you'd like to use the Intel C++ Compiler 9.x, 10.x or 11.0, please uninstall the Intel Parallel Composer, and repair the old compiler.

-------------------------
uuuuhhh... I'd need a VM to try it out... or a second, parallel installation of the OS..
greetings...

Raistmer · « **Reply #407 on:** 26 Nov 2008, 02:20:33 pm »

Hi

No problem, I'm not using ICC right now.

Raistmer · « **Reply #408 on:** 26 Nov 2008, 02:34:39 pm »

Just Released: AMD Core Math Library v4.2.0

New features in v4.2.0 include:
Further optimized DGEMM for better performance, and requiring less memory bandwidth
Improved 3D Complex-Complex FFT routines with significantly reduced work space requirements
New optimized RNG base generators for 32-bit builds
Updated version of GFORTRAN to 4.3.2
And more news from AMD: the new Shanghai core
http://forums.amd.com/devblog/blogpost.cfm?threadid=103010&catid=271

"
A comprehensive suite of Fast Fourier Transforms (FFTs) in both single-, double-, single-complex and double-complex data types.
"
I've always wanted to do a comparison between IPP/FFTW and ACML on AMD CPUs.

OMG, it's a FORTRAN library %)
http://developer.amd.com/cpu/Libraries/acml/downloads/Pages/default.aspx#downloads
And it comes in many flavors... Interesting: can it be used without any FORTRAN installation, just as a simple lib file? ....

_heinz · « **Reply #409 on:** 26 Nov 2008, 03:47:42 pm »

Quote from: Raistmer on 26 Nov 2008, 02:34:39 pm

Just Released: AMD Core Math Library v4.2.0

OMG, it's a FORTRAN library %)
http://developer.amd.com/cpu/Libraries/acml/downloads/Pages/default.aspx#downloads
And it comes in many flavors... Interesting: can it be used without any FORTRAN installation, just as a simple lib file? ....

From my point of view, we can link against different libs... this is often done in scientific work.
I have seen that AMD Core Math Library v4.2.0 has been published...
Thanks

heinz

_heinz · « **Reply #410 on:** 26 Nov 2008, 07:24:20 pm »

Quote from: Jason G on 26 Nov 2008, 11:44:44 am

@Heinz: Do you happen to have any single and multithreaded FFT processing times benched on your skulltrail? Time for 1,2,4 & 8 threads would be nice for 32k element &/or 128k elements, if you have them.

I'm trying to verify/refine some efficiency calculations & have no reference but my dual core.

Jason

I compiled the FFTW project (single-threaded) as 32-bit:
/I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt

Results:
C:\Windows\system32>echo off
fftw-3.1.2 benchfsse(VS2005) started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 131072
Problem: 8, setup: 300.32 us, time: 169.69 ns, ``mflops'': 707.16
Problem: 16, setup: 288.86 us, time: 332.84 ns, ``mflops'': 961.43
Problem: 32, setup: 7.91 ms, time: 726.79 ns, ``mflops'': 1100.7
Problem: 64, setup: 27.46 ms, time: 1.67 us, ``mflops'': 1148.4
Problem: 128, setup: 62.98 ms, time: 4.19 us, ``mflops'': 1069.1
Problem: 256, setup: 137.48 ms, time: 9.18 us, ``mflops'': 1115
Problem: 512, setup: 267.80 ms, time: 20.95 us, ``mflops'': 1099.6
Problem: 1024, setup: 575.47 ms, time: 46.10 us, ``mflops'': 1110.7
Problem: 2048, setup: 1.37 s, time: 99.17 us, ``mflops'': 1135.8
Problem: 4096, setup: 3.42 s, time: 220.42 us, ``mflops'': 1115
Problem: 8192, setup: 8.83 s, time: 530.79 us, ``mflops'': 1003.2
Problem: 16384, setup: 21.99 s, time: 1.13 ms, ``mflops'': 1014.9
Problem: 32768, setup: 53.80 s, time: 2.41 ms, ``mflops'': 1020
Problem: 131072, setup: 369.12 s, time: 9.89 ms, ``mflops'': 1126
fftw-3.1.2 benchfsse ended.
Press any key to continue . . .
----------------------------------------------------------------------------------------------------
For the threaded variants I must first read the documentation again...
Is this what you meant? If you want other compiler options, let me know..
Once I have installed the Intel® Parallel Composer Beta, I will recompile the project...

regards heinz
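
A note on reading the ``mflops'' column: the figures in the run above are consistent with a nominal operation count of 5·N·log2(N) for a complex transform of length N, divided by the run time in microseconds, rather than with a count of instructions actually executed. The Python sketch below checks that reading against three rows copied from the output; the function name is made up for illustration, and the formula is an inference from the numbers, not something stated in the thread.

```python
import math

def nominal_mflops(n, seconds):
    """Nominal MFLOPS for a complex FFT of length n.

    Assumes the conventional 5*n*log2(n) operation count,
    divided by the run time expressed in microseconds.
    """
    return 5.0 * n * math.log2(n) / (seconds * 1e6)

# (length, reported time in seconds, reported mflops), copied from the run above
rows = [
    (8, 169.69e-9, 707.16),
    (1024, 46.10e-6, 1110.7),
    (131072, 9.89e-3, 1126.0),
]

for n, t, reported in rows:
    print(f"{n:>7}: computed {nominal_mflops(n, t):8.1f}, reported {reported}")
```

Small differences are expected because the printed times are rounded.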

Jason G · « **Reply #411 on:** 27 Nov 2008, 07:34:50 am »

Thanks Heinz,
Could you let me know:
- Current CPU speed at time of test
- Cache sizes per package
- Bus speed

My single core computations are so far within around 10% of your numbers at least, but don't allow for those overheads for large problems, so I factor them into the instruction cost at the moment.

For multithreaded FFTW (eventually), I think it would require a different package they have (alpha?). In any case, the purpose is to refine my textbook efficiency approximations into more practical ones that can be used to assess the scalability of parallel FFT algorithms.

_heinz · « **Reply #412 on:** 27 Nov 2008, 06:09:38 pm »

Quote from: Jason G on 27 Nov 2008, 07:34:50 am

Thanks Heinz,
Could you let me know:
- Current CPU speed at time of test
- Cache sizes per package
- Bus speed

CPU speed 2398 MHz
FSB speed 400(QP) 1600
Cache sizes per package ... I must look that up (where can I find it in the source?)
ahh.. cpu package.. 12 MB
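
For context on why the cache size matters here, a rough working-set estimate can be sketched as follows. It assumes 8 bytes per element (single-precision complex, going by the FFTW_SINGLE define in the build flags) and reads the 12 MB per package as 12 MiB; the number of buffers (one for in-place, two for out-of-place) is also an assumption, and the function name is hypothetical.

```python
def working_set_bytes(n, bytes_per_element=8, buffers=1):
    """Approximate data footprint of a length-n transform.

    bytes_per_element=8 assumes single-precision complex data;
    buffers=2 would model an out-of-place transform.
    """
    return n * bytes_per_element * buffers

CACHE_BYTES = 12 * 1024 * 1024  # 12 MB per package, read here as MiB (assumption)

for n in (32768, 131072):
    ws = working_set_bytes(n, buffers=2)
    print(f"N={n}: {ws / 1024:.0f} KiB, {100 * ws / CACHE_BYTES:.1f}% of package cache")
```

Under these assumptions both requested sizes fit in the package cache, so any slowdown at large N would more likely come from the smaller per-core cache levels, which this sketch does not model.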

Leaps-from-Shadows · « **Reply #413 on:** 27 Nov 2008, 07:46:06 pm »

Current Nehalem CPUs (920, 940, 965) have 32k L1 instruction cache per core, 32k L1 data cache per core, 256k L2 cache per core, and 8MB shared L3 cache.

_heinz · « **Reply #414 on:** 27 Nov 2008, 07:47:11 pm »

Intel® Parallel Composer Beta is installed and running, but not in the VS2005/2008 Express versions.
>------ Build started: Project: fibonacci, Configuration: Release x64 ------
1>Compiling with Intel(R) C++ Compiler 11.1.032 [Intel(R) 64]... (Intel C++ Environment)
1>Intel(R) C++ Compiler for applications running on Intel(R) 64, Version 11.1 Beta Build 20081112 Package ID: composer_beta_update2.032
1>Copyright (C) 1985-2008 Intel Corporation. All rights reserved.
1>icl /c /I C:\I\INTEL\tbb21_012oss\include -D WIN64 -D NDEBUG -D _CONSOLE -D _MBCS /EHsc /MD /GS /fp:fast /FoC:\Users\heinz\AppData\Local\Temp\tbb_examples\fibonacci\x64\Release/ /W1 /nologo /Qvc9 "/Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\x86_amd64" ..\Fibonacci.cpp
1>
1>Fibonacci.cpp
1>Linking... (Intel C++ Environment)
1>xilink: executing 'link'
1>Embedding manifest... (Microsoft VC++ Environment)
1>Copying tbb.dll (Microsoft VC++ Environment)
1> 1 file(s) copied.
1>Build log was saved at "file://C:\Users\heinz\AppData\Local\Temp\tbb_examples\fibonacci\x64\Release\BuildLog.htm"
1>fibonacci - 0 error(s), 0 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Here are two results, both compiled with VS2008, one of them with the integrated Parallel Composer.
VS2008 TBB --> fibonacci_1000_out.txt
VS2008 TBB Parallel Composer -->fibonacciopt_1000_out.txt
files attached

heinz

[attachment deleted by admin]

Jason G · « **Reply #415 on:** 28 Nov 2008, 01:58:02 am »

Quote from: _heinz on 27 Nov 2008, 06:09:38 pm

Quote from: Jason G on 27 Nov 2008, 07:34:50 am
Thanks Heinz,
Could you let me know:
- Current CPU speed at time of test
- Cache sizes per package
- Bus speed
CPU speed 2398 MHz
FSB speed 400(QP) 1600
Cache sizes per package ... I must look up ( where can I find in the source ? )
ahh.. cpu package.. 12 MB

Thanks again. It looks like my single-thread estimates hold up for your parameters. Could you try a comparison run with this bench I compiled (attached)? It is still single-threaded, but it will make sure we have a reference for future numbers.

same parameter usage: benchf_sse_icc -opatient [same FFT lengths as before]

Jason

[attachment deleted by admin]

_heinz · « **Reply #416 on:** 28 Nov 2008, 04:27:50 am »

fftw-3.1.2 benchf_sse_icc(jason) started
benchf_sse_icc.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 131072
Problem: 8, setup: 273.78 us, time: 49.65 ns, ``mflops'': 2416.8
Problem: 16, setup: 262.88 us, time: 98.21 ns, ``mflops'': 3258.2
Problem: 32, setup: 7.68 ms, time: 117.86 ns, ``mflops'': 6787.9
Problem: 64, setup: 26.83 ms, time: 222.62 ns, ``mflops'': 8624.6
Problem: 128, setup: 61.58 ms, time: 429.96 ns, ``mflops'': 10420
Problem: 256, setup: 124.30 ms, time: 925.40 ns, ``mflops'': 11066
Problem: 512, setup: 235.98 ms, time: 2.13 us, ``mflops'': 10816
Problem: 1024, setup: 401.79 ms, time: 4.50 us, ``mflops'': 11366
Problem: 2048, setup: 710.67 ms, time: 11.17 us, ``mflops'': 10080
Problem: 4096, setup: 1.39 s, time: 27.94 us, ``mflops'': 8797.1
Problem: 8192, setup: 3.08 s, time: 60.62 us, ``mflops'': 8783.6
Problem: 16384, setup: 6.91 s, time: 134.93 us, ``mflops'': 8499.6
Problem: 32768, setup: 15.86 s, time: 289.70 us, ``mflops'': 8483.2
Problem: 131072, setup: 86.42 s, time: 1.39 ms, ``mflops'': 7988.8
fftw-3.1.2 benchf_sse_icc ended.
----------------------------------------------
... great results

heinz
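
To put the two runs side by side, the per-size speedup of the ICC build over the earlier VS2005 build can be computed directly from the reported times. The sketch below only re-uses numbers posted in this thread (converted to seconds); the dictionary and function names are made up for illustration.

```python
# Benchmark times in seconds, copied from the two runs posted in this thread.
VS2005 = {
    8: 169.69e-9, 16: 332.84e-9, 32: 726.79e-9, 64: 1.67e-6,
    128: 4.19e-6, 256: 9.18e-6, 512: 20.95e-6, 1024: 46.10e-6,
    2048: 99.17e-6, 4096: 220.42e-6, 8192: 530.79e-6,
    16384: 1.13e-3, 32768: 2.41e-3, 131072: 9.89e-3,
}
ICC = {
    8: 49.65e-9, 16: 98.21e-9, 32: 117.86e-9, 64: 222.62e-9,
    128: 429.96e-9, 256: 925.40e-9, 512: 2.13e-6, 1024: 4.50e-6,
    2048: 11.17e-6, 4096: 27.94e-6, 8192: 60.62e-6,
    16384: 134.93e-6, 32768: 289.70e-6, 131072: 1.39e-3,
}

def speedup(n):
    """Ratio of VS2005 time to ICC time for transform length n."""
    return VS2005[n] / ICC[n]

for n in sorted(VS2005):
    print(f"{n:>7}: {speedup(n):5.2f}x")
```

On these numbers the ratio ranges from roughly 3.4x at the smallest sizes to roughly 10x around length 1024, and is about 7x at 131072.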

Jason G · « **Reply #417 on:** 28 Nov 2008, 04:47:30 am »

Huh... ~~now my estimates are way out~~

That places the cost of a complex multiply-add pair at about 1.5 cycles, with about half the initial startup latency (now 35 ns). What was your original bench? Non-SSE floats, FFTW 3.1.2? (Before, the cost estimate was 10.5 cycles per mul-add and the startup latency was 60 ns.) We must be seeing the effect of SSE instruction-level parallelism, with out-of-order execution maybe hiding some of the latency.
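
The exact cost model behind these estimates is not given in the thread. One simple form that happens to reproduce the quoted figures is time ≈ startup + c·N·log2(N)/f, with f = 2398 MHz as reported by heinz. The Python sketch below uses that form purely as an assumption; the function names are hypothetical.

```python
import math

CPU_HZ = 2398e6  # CPU speed reported by heinz earlier in the thread

def model_time(n, cycles_per_unit, startup_s, cpu_hz=CPU_HZ):
    """Predicted run time: startup latency plus c cycles per N*log2(N) unit."""
    return startup_s + cycles_per_unit * n * math.log2(n) / cpu_hz

def fitted_cycles(n, measured_s, startup_s, cpu_hz=CPU_HZ):
    """Cycles per N*log2(N) unit implied by a measured run time."""
    return (measured_s - startup_s) * cpu_hz / (n * math.log2(n))

# ICC build: 1.5 cycles and 35 ns startup against the measured 49.65 ns at N=8
print(model_time(8, 1.5, 35e-9))
# Earlier build: 10.5 cycles and 60 ns startup against the measured 169.69 ns at N=8
print(model_time(8, 10.5, 60e-9))
# Cost implied by the largest transform in each run
print(fitted_cycles(131072, 1.39e-3, 35e-9))
print(fitted_cycles(131072, 9.89e-3, 60e-9))
```

If this form is close to what was intended, the 1.5 and 10.5 cycle figures match both the smallest and the largest transforms in each run reasonably well.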

_heinz · « **Reply #418 on:** 28 Nov 2008, 08:45:53 am »

Quote from: Jason G on 28 Nov 2008, 04:47:30 am

What was your original bench? non-sse floats fftw 3.1.2?

Configuration: Active(Release float SSE) Platform: Active(Win32)
/I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt
******************************************
/OUT:"..\benchf_sse.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"C:\I\SC\fftw-3.1.2\libfftwf_sse.lib" /MANIFEST /MANIFESTFILE:".\bench___Win32_Release_float_SSE\benchf_sse.exe.intermediate.manifest" /PDB:".\bench___Win32_Release_float/benchf.pdb" /SUBSYSTEM:CONSOLE /MACHINE:X86 /ERRORREPORT:PROMPT ..\libfftwf_sse.lib kernel32.lib

heinz

Jason G · « **Reply #419 on:** 28 Nov 2008, 08:54:48 am »

ugghhh... ever stranger... same build (except mine is with ICC). I guess when they say ICC builds aren't much faster, they must mean compared with GCC builds. I don't have my MinGW/GCC setup anymore to try that build, and that one managed to strangle my P4 last year. Maybe I'll have better luck this year with improved hardware.

[In any case, we now have some reference FFT speeds for the Skulltrail, thanks. Next is to come up with something that equals them and can be scaled to parallel more easily.]

Jason
