Author Topic: optimized sources  (Read 624482 times)

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #405 on: 26 Nov 2008, 01:35:14 pm »
@
You are among the first to receive notification of the groundbreaking Intel® Parallel Composer Beta. Download this exciting new tool and get instant access to an advanced parallelism C/C++ compiler, debugger, and libraries that can change the way you develop parallel applications. @
 ;D
Will look...
I'm registered and downloading now..
We will see how this works...
:-)

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #406 on: 26 Nov 2008, 02:01:59 pm »
Hi Raistmer,
have you seen this
Note when Installing the Intel(R) Parallel Composer Beta on a system with Intel(R) C++ Compiler
This is a limitation of the Intel(R) Parallel Composer beta's Integration with Microsoft Visual Studio*.

If you install the Intel Parallel Composer beta on a system that has Intel C++ Compiler 9.x, 10.x or 11.0 installed already, the IDE integration of Intel Parallel Composer will replace the existing IDE integration from Intel C++ Compiler. This causes the existing Intel C++ Compiler 9.x, 10.x or 11.0 not usable from within the Visual Studio IDE.

If you'd like to use the Intel C++ Compiler 9.x, 10.x or 11.0, please uninstall the Intel Parallel Composer, and repair the old compiler.

-------------------------
Uh-oh... I would need a VM to try this out, or a second parallel installation of the OS.
greetings... ;D

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: optimized sources
« Reply #407 on: 26 Nov 2008, 02:20:33 pm »
Hi :)
No prob, I'm not using ICC right now :)

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: optimized sources
« Reply #408 on: 26 Nov 2008, 02:34:39 pm »
Just Released: AMD Core Math Library v4.2.0


New features in v4.2.0 include:
Further optimized DGEMM for better performance, and requiring less memory bandwidth
Improved 3D Complex-Complex FFT routines with significantly reduced work space requirements
New optimized RNG base generators for 32-bit builds
Updated version of GFORTRAN to 4.3.2
And more news from AMD: the new Shanghai core
http://forums.amd.com/devblog/blogpost.cfm?threadid=103010&catid=271

"
A comprehensive suite of Fast Fourier Transforms (FFTs) in both single-, double-, single-complex and double-complex data types.
"
Always wanted to do comparison between IPP/FFTW and ACML for AMD CPUs :)

OMG it's FORTRAN library %)
http://developer.amd.com/cpu/Libraries/acml/downloads/Pages/default.aspx#downloads
And it has many flavors... Interesting, can it be used w/o any FORTRAN installation, just as simple lib-file? ....
« Last Edit: 26 Nov 2008, 02:54:01 pm by Raistmer »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #409 on: 26 Nov 2008, 03:47:42 pm »
Just Released: AMD Core Math Library v4.2.0

OMG it's FORTRAN library %)
http://developer.amd.com/cpu/Libraries/acml/downloads/Pages/default.aspx#downloads
And it has many flavors... Interesting, can it be used w/o any FORTRAN installation, just as simple lib-file? ....
In my view we can link against different libs; this is often done in scientific work.
I have seen that AMD Core Math Library v4.2.0 is published...
Thanks

heinz

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #410 on: 26 Nov 2008, 07:24:20 pm »
@Heinz: Do you happen to have any single and multithreaded FFT processing times benched on your skulltrail?  Time for 1,2,4 & 8 threads would be nice for 32k element &/or 128k elements, if you have them. 

I'm trying to verify/refine some efficiency calculations & have no reference but my dual core.

Jason

compiled the fftw project (single thread) as 32 bit
 /I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt

Results:
C:\Windows\system32>echo off
fftw-3.1.2 benchfsse(VS2005) started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 131072
Problem: 8, setup: 300.32 us, time: 169.69 ns, ``mflops'': 707.16
Problem: 16, setup: 288.86 us, time: 332.84 ns, ``mflops'': 961.43
Problem: 32, setup: 7.91 ms, time: 726.79 ns, ``mflops'': 1100.7
Problem: 64, setup: 27.46 ms, time: 1.67 us, ``mflops'': 1148.4
Problem: 128, setup: 62.98 ms, time: 4.19 us, ``mflops'': 1069.1
Problem: 256, setup: 137.48 ms, time: 9.18 us, ``mflops'': 1115
Problem: 512, setup: 267.80 ms, time: 20.95 us, ``mflops'': 1099.6
Problem: 1024, setup: 575.47 ms, time: 46.10 us, ``mflops'': 1110.7
Problem: 2048, setup: 1.37 s, time: 99.17 us, ``mflops'': 1135.8
Problem: 4096, setup: 3.42 s, time: 220.42 us, ``mflops'': 1115
Problem: 8192, setup: 8.83 s, time: 530.79 us, ``mflops'': 1003.2
Problem: 16384, setup: 21.99 s, time: 1.13 ms, ``mflops'': 1014.9
Problem: 32768, setup: 53.80 s, time: 2.41 ms, ``mflops'': 1020
Problem: 131072, setup: 369.12 s, time: 9.89 ms, ``mflops'': 1126
fftw-3.1.2 benchfsse ended.
Press any key to continue . . .
----------------------------------------------------------------------------------------------------
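A side note on the ``mflops'' column above: it is not a hardware counter. As far as I know, the FFTW bench program follows the benchFFT convention and derives it from the measured time, counting 5·N·log2(N) floating-point operations for a complex transform of size N and dividing by the time in microseconds. A minimal Python sketch that recomputes a few of the figures above (the function name `bench_mflops` is made up for illustration):

```python
import math

def bench_mflops(n, time_us):
    """Nominal MFLOPS for a complex FFT of size n that took time_us microseconds.

    Assumption: the benchFFT convention of 5 * n * log2(n) operations.
    """
    return 5.0 * n * math.log2(n) / time_us

# (size, time in microseconds, mflops reported in the run above)
samples = [
    (8, 0.16969, 707.16),
    (1024, 46.10, 1110.7),
    (131072, 9890.0, 1126.0),
]

for n, t_us, reported in samples:
    print(f"N={n}: recomputed {bench_mflops(n, t_us):.1f}, reported {reported}")
```

Small differences between recomputed and reported values are expected, because the printed times are rounded.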
For the threaded variants I must first read the docs again...
Is this what you meant? If you want some other compiler options, let me know.
Once I have installed the Intel® Parallel Composer Beta, I will recompile the project...

regards heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #411 on: 27 Nov 2008, 07:34:50 am »
Thanks Heinz,
   Could you let me know:
   - Current CPU speed at time of test
   - Cache sizes per package
   - Bus speed

My single core computations are so far within around 10% of your numbers at least, but don't allow for those overheads for large problems, so I factor them into the instruction cost at the moment. 

For multithreaded FFTW (eventually) I think it would require a different package they have (alpha?). In any case the purpose is to refine my textbook efficiency approximations into more practical ones that can be used to assess scalability of parallel FFT algorithms.
« Last Edit: 27 Nov 2008, 08:29:37 am by Jason G »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #412 on: 27 Nov 2008, 06:09:38 pm »
Thanks Heinz,
   Could you let me know:
   - Current CPU speed at time of test
   - Cache sizes per package
   - Bus speed
CPU speed 2398 MHz
FSB speed 400(QP) 1600
Cache sizes per package ... I must look up ( where can I find in the source ? )
ahh.. cpu package.. 12 MB
« Last Edit: 27 Nov 2008, 07:51:19 pm by _heinz »

Leaps-from-Shadows

  • Guest
Re: optimized sources
« Reply #413 on: 27 Nov 2008, 07:46:06 pm »
Current Nehalem CPUs (920, 940, 965) have 32k L1 instruction cache per core, 32k L1 data cache per core, 256k L2 cache per core, and 8MB shared L3 cache.
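To relate these cache sizes to the FFT lengths being benchmarked in this thread, here is a rough Python sketch. It assumes single-precision complex data at 8 bytes per element and counts only one in-place data array, ignoring twiddle factors, out-of-place buffers and anything else competing for cache, so it is a lower bound on the real working set. The name `smallest_fitting_cache` is hypothetical.

```python
BYTES_PER_COMPLEX_FLOAT = 8  # assumption: two 4-byte floats per element

# Cache sizes quoted in the post above, in bytes (per core for L1d and L2).
NEHALEM_CACHES = [
    ("L1d", 32 * 1024),
    ("L2", 256 * 1024),
    ("L3", 8 * 1024 * 1024),
]

def smallest_fitting_cache(n, caches=NEHALEM_CACHES):
    """Return the first cache level whose nominal size holds n complex floats."""
    working_set = n * BYTES_PER_COMPLEX_FLOAT
    for name, size in caches:
        if working_set <= size:
            return name
    return "RAM"

for n in (1024, 32768, 131072):
    kib = n * BYTES_PER_COMPLEX_FLOAT // 1024
    print(f"N={n}: {kib} KiB -> {smallest_fitting_cache(n)}")
```

Note that a 32k-element array is exactly the nominal L2 size here, so in practice it would spill into L3 once other data is counted.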

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #414 on: 27 Nov 2008, 07:47:11 pm »
Intel® Parallel Composer Beta is installed and running, but not in the VS2005/2008 Express versions.
>------ Build started: Project: fibonacci, Configuration: Release x64 ------
1>Compiling with Intel(R) C++ Compiler 11.1.032 [Intel(R) 64]... (Intel C++ Environment)
1>Intel(R) C++ Compiler for applications running on Intel(R) 64, Version 11.1  Beta  Build 20081112 Package ID: composer_beta_update2.032
1>Copyright (C) 1985-2008 Intel Corporation.  All rights reserved.
1>icl /c /I C:\I\INTEL\tbb21_012oss\include -D WIN64 -D NDEBUG -D _CONSOLE -D _MBCS /EHsc /MD /GS /fp:fast /FoC:\Users\heinz\AppData\Local\Temp\tbb_examples\fibonacci\x64\Release/ /W1 /nologo /Qvc9 "/Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\x86_amd64" ..\Fibonacci.cpp
1>
1>Fibonacci.cpp
1>Linking... (Intel C++ Environment)
1>xilink: executing 'link'
1>Embedding manifest... (Microsoft VC++ Environment)
1>Copying tbb.dll (Microsoft VC++ Environment)
1>        1 file(s) copied.
1>Build log was saved at "file://C:\Users\heinz\AppData\Local\Temp\tbb_examples\fibonacci\x64\Release\BuildLog.htm"
1>fibonacci - 0 error(s), 0 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Here are 2 results for you, both compiled with VS2008, but one with the integrated Parallel Composer.
VS2008 TBB --> fibonacci_1000_out.txt
VS2008 TBB Parallel Composer -->fibonacciopt_1000_out.txt
files attached

heinz  ;D

[attachment deleted by admin]

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #415 on: 28 Nov 2008, 01:58:02 am »
Thanks Heinz,
   Could you let me know:
   - Current CPU speed at time of test
   - Cache sizes per package
   - Bus speed
CPU speed 2398 MHz
FSB speed 400(QP) 1600
Cache sizes per package ... I must look up ( where can I find in the source ? )
ahh.. cpu package.. 12 MB

Thanks again, looks like my single thread estimates come good for your parameters:  Could you try a comparison run to this bench I compiled? (attached) Still Single threaded, but will make sure we have reference for future numbers.

same parameter usage: benchf_sse_icc  -opatient [same FFT lengths as before]

Jason



[attachment deleted by admin]

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #416 on: 28 Nov 2008, 04:27:50 am »
fftw-3.1.2 benchf_sse_icc(jason) started
benchf_sse_icc.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 131072
Problem: 8, setup: 273.78 us, time: 49.65 ns, ``mflops'': 2416.8
Problem: 16, setup: 262.88 us, time: 98.21 ns, ``mflops'': 3258.2
Problem: 32, setup: 7.68 ms, time: 117.86 ns, ``mflops'': 6787.9
Problem: 64, setup: 26.83 ms, time: 222.62 ns, ``mflops'': 8624.6
Problem: 128, setup: 61.58 ms, time: 429.96 ns, ``mflops'': 10420
Problem: 256, setup: 124.30 ms, time: 925.40 ns, ``mflops'': 11066
Problem: 512, setup: 235.98 ms, time: 2.13 us, ``mflops'': 10816
Problem: 1024, setup: 401.79 ms, time: 4.50 us, ``mflops'': 11366
Problem: 2048, setup: 710.67 ms, time: 11.17 us, ``mflops'': 10080
Problem: 4096, setup: 1.39 s, time: 27.94 us, ``mflops'': 8797.1
Problem: 8192, setup: 3.08 s, time: 60.62 us, ``mflops'': 8783.6
Problem: 16384, setup: 6.91 s, time: 134.93 us, ``mflops'': 8499.6
Problem: 32768, setup: 15.86 s, time: 289.70 us, ``mflops'': 8483.2
Problem: 131072, setup: 86.42 s, time: 1.39 ms, ``mflops'': 7988.8
fftw-3.1.2 benchf_sse_icc ended.
----------------------------------------------
... great results   ;D
heinz
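
To put a number on the difference, the per-size speedup can be computed from the times in the VS2005 run (reply #410) and the ICC run directly above. A small Python sketch using a few rows copied from those two logs, converted to nanoseconds (the helper name `speedup` is just for illustration):

```python
# (FFT size, VS2005 build time in ns, ICC build time in ns),
# copied from the two benchmark logs in this thread.
RUNS = [
    (8, 169.69, 49.65),
    (1024, 46.10e3, 4.50e3),
    (32768, 2.41e6, 289.70e3),
    (131072, 9.89e6, 1.39e6),
]

def speedup(baseline_ns, new_ns):
    """How many times faster the new build is than the baseline."""
    return baseline_ns / new_ns

for n, vs_ns, icc_ns in RUNS:
    print(f"N={n:>6}: {speedup(vs_ns, icc_ns):.1f}x")
```

The remaining sizes can be added to `RUNS` in the same way.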

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #417 on: 28 Nov 2008, 04:47:30 am »
Huh... now my estimates are way out :o. That puts the cost of a complex multiply-add pair at about 1.5 cycles and roughly halves the initial startup latency (now 35 ns). What was your original bench? Non-SSE floats, fftw 3.1.2? (Before, the cost estimate was 10.5 cycles per mul-add and the startup latency 60 ns.) Maybe we are seeing the effect of SSE instruction-level parallelism and out-of-order execution hiding some of the latency.
« Last Edit: 28 Nov 2008, 05:34:02 am by Jason G »
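
The post does not spell out the model behind these figures. One possible reading, which is purely an assumption here, is a linear cost model: time = startup latency + (cycles per multiply-add pair) × N·log2(N) / clock frequency, with the 2398 MHz clock reported earlier in the thread. The Python sketch below prints predicted and measured times side by side so the fit can be judged; the operation count N·log2(N) and the name `predicted_time_ns` are not from the thread.

```python
import math

CPU_MHZ = 2398.0  # clock speed reported earlier in the thread

def predicted_time_ns(n, cycles_per_op, startup_ns, cpu_mhz=CPU_MHZ):
    """Assumed model: startup latency plus a fixed cycle cost per operation."""
    ops = n * math.log2(n)      # assumption: n*log2(n) multiply-add pairs
    cycle_ns = 1000.0 / cpu_mhz  # duration of one clock cycle in ns
    return startup_ns + cycles_per_op * ops * cycle_ns

# (label, cycles per mul-add, startup latency in ns, {size: measured ns})
cases = [
    ("VS2005 build", 10.5, 60.0, {8: 169.69, 1024: 46.10e3}),
    ("ICC build", 1.5, 35.0, {8: 49.65, 1024: 4.50e3}),
]

for label, cycles, startup, measured in cases:
    for n, meas_ns in measured.items():
        pred = predicted_time_ns(n, cycles, startup)
        print(f"{label}, N={n}: predicted {pred:.1f} ns, measured {meas_ns:.1f} ns")
```

If the real model uses a different operation count (for example N/2·log2(N) butterflies), only the `ops` line needs to change.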

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #418 on: 28 Nov 2008, 08:45:53 am »
What was your original bench? non-sse floats fftw 3.1.2?
Configuration: Active(Release float SSE) Platform: Active(Win32)
/I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt
******************************************
/OUT:"..\benchf_sse.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"C:\I\SC\fftw-3.1.2\libfftwf_sse.lib" /MANIFEST /MANIFESTFILE:".\bench___Win32_Release_float_SSE\benchf_sse.exe.intermediate.manifest" /PDB:".\bench___Win32_Release_float/benchf.pdb" /SUBSYSTEM:CONSOLE /MACHINE:X86 /ERRORREPORT:PROMPT ..\libfftwf_sse.lib  kernel32.lib

heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #419 on: 28 Nov 2008, 08:54:48 am »
ugghhh... ever stranger...same build (except mine with ICC), I guess when they say ICC builds aren't much faster they must mean against GCC builds.  Don't have my MinGW/GCC setup anymore to try that build, and that one managed to strangle my p4 back last year. Maybe I'll have better luck this year with improved hardware.

[In any case, we have some reference FFT speeds for the skulltrail now thanks, next is to come up with something that equals that, that can be more easily scaled to parallel.]

Jason

 
