Author Topic: optimized sources (Read 548388 times)

_heinz · « **Reply #435 on:** 28 Nov 2008, 12:46:23 pm »

Quote from: Jason G on 28 Nov 2008, 12:30:43 pm

Ahhh, 6 meg per package ( 1.5 meg per core )... Okay, yep it is 12 meg total for the 8 cores.

Compared 32 bit ICC 10.1 / TBB 2.0 build of fibonacci, and it IS slower than Parallel composer 32 bit build under XP64 ... Will have to try that build under XP32 to confirm though. I will probably update all my ICC/IPP base packages as soon as I get time, in a few weeks.

Jason

12 MB per chip:
BX80574E5405A, active cooler or for 1U systems, 45 nm, E5405, 2.00 GHz (80 W), 1333, 12 MB total
We have 2 processors, so we have 24 MB for 8 cores.
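
The cache arithmetic in this post can be written out as a tiny sketch. The figures are the ones quoted above; the four cores per package is an assumption derived from "2 processors" and "8 cores", and the variable names are only illustrative.

```python
# Figures quoted in the post above; names are illustrative only.
CACHE_PER_PACKAGE_MB = 12   # "12 MB total" per E5405 package
PACKAGES = 2                # "we have 2 processors"
CORES_PER_PACKAGE = 4       # assumption: 8 cores spread over 2 packages

total_cache_mb = CACHE_PER_PACKAGE_MB * PACKAGES
total_cores = CORES_PER_PACKAGE * PACKAGES
cache_per_core_mb = total_cache_mb / total_cores

print(f"{total_cache_mb} MB total for {total_cores} cores, "
      f"{cache_per_core_mb} MB per core")
```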

Jason G · « **Reply #436 on:** 28 Nov 2008, 12:49:42 pm »

Err well CPU-Z shows only per core then? In any case:

Hmm, not a lot of Fibonacci difference here, but some: (fastest thread number was 2)

Built under xp32 with ICC 10.1 + TBB (run on XP 32)

Quote

Threads number is 2
Shared serial (mutex) - in 0.286294 msec
Shared serial (spin_mutex) - in 0.196978 msec
Shared serial (queuing_mutex) - in 0.301214 msec
Shared serial (Conc.HashTable) - in 4.313505 msec
Parallel while+for/queue - in 1.485761 msec
Parallel pipe/queue - in 1.980293 msec
Parallel reduce - in 0.523162 msec
Parallel scan - in 0.338611 msec
Parallel tasks - in 0.566134 msec

and Built under XP64 with Parallel Composer Beta Update 2 + TBB 2.0 ( but run on XP 32 also)

Quote

Threads number is 2
Shared serial (mutex) - in 0.279819 msec
Shared serial (spin_mutex) - in 0.208223 msec
Shared serial (queuing_mutex) - in 0.284642 msec
Shared serial (Conc.HashTable) - in 4.461598 msec
Parallel while+for/queue - in 1.718736 msec
Parallel pipe/queue - in 2.188073 msec
Parallel reduce - in 0.571781 msec
Parallel scan - in 0.357319 msec
Parallel tasks - in 0.534837 msec

So some things look a bit slower, but I will carefully consider shifting to ICC 11 soon, and check how our projects of interest compare.
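
To put a number on "a bit slower", here is a minimal sketch that computes the percent change of the Parallel Composer build relative to the ICC 10.1 build for the two 2-thread runs quoted above. The timings are copied from the post; the dictionary and function names are made up for illustration.

```python
# Timings in msec, copied from the two 2-thread runs quoted above.
icc101 = {
    "mutex": 0.286294, "spin_mutex": 0.196978, "queuing_mutex": 0.301214,
    "Conc.HashTable": 4.313505, "while+for/queue": 1.485761,
    "pipe/queue": 1.980293, "reduce": 0.523162, "scan": 0.338611,
    "tasks": 0.566134,
}
composer = {
    "mutex": 0.279819, "spin_mutex": 0.208223, "queuing_mutex": 0.284642,
    "Conc.HashTable": 4.461598, "while+for/queue": 1.718736,
    "pipe/queue": 2.188073, "reduce": 0.571781, "scan": 0.357319,
    "tasks": 0.534837,
}

def percent_change(baseline, candidate):
    """Percent change of candidate vs. baseline; positive means slower."""
    return {name: 100.0 * (candidate[name] - baseline[name]) / baseline[name]
            for name in baseline}

changes = percent_change(icc101, composer)
for name, pct in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {pct:+6.1f} %")

# Tests where the Parallel Composer build took longer.
slower = [name for name, pct in changes.items() if pct > 0]
print(f"{len(slower)} of {len(changes)} tests slower")
```

Single runs in the sub-millisecond range are noisy, so small differences here should not be over-interpreted.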

_heinz · « **Reply #437 on:** 28 Nov 2008, 12:58:28 pm »

How many numbers did you let it generate? 1000?

Jason G · « **Reply #438 on:** 28 Nov 2008, 01:00:42 pm »

No, just used default which was 100... will try 1000

[Later:] Fastest 32 bit run built on XP32 ICC10.1 / TBB2.0 now 3 threads:

Quote

Threads number is 3
Shared serial (mutex) - in 162.014407 msec
Shared serial (spin_mutex) - in 11.609819 msec
Shared serial (queuing_mutex) - in 50.960339 msec
Shared serial (Conc.HashTable) - in 401.327768 msec
Parallel while+for/queue - in 93.399315 msec
Parallel pipe/queue - in 164.994829 msec
Parallel reduce - in 27.500117 msec
Parallel scan - in 22.918168 msec
Parallel tasks - in 25.904447 msec

Getting parallel composer build data:

Quote

Threads number is 3
Shared serial (mutex) - in 76.449678 msec
Shared serial (spin_mutex) - in 13.449323 msec
Shared serial (queuing_mutex) - in 50.961819 msec
Shared serial (Conc.HashTable) - in 413.186277 msec
Parallel while+for/queue - in 93.995606 msec
Parallel pipe/queue - in 171.541281 msec
Parallel reduce - in 28.647254 msec
Parallel scan - in 27.231642 msec
Parallel tasks - in 24.389762 msec
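
The biggest spread in these 3-thread runs is between the lock types of the "Shared serial" variants. A small sketch, using only the timings quoted above and a made-up helper name, expresses each lock as a multiple of the spin_mutex time in the same build:

```python
# "Shared serial" timings in msec from the two 3-thread runs quoted above.
locks_icc101 = {
    "mutex": 162.014407, "spin_mutex": 11.609819, "queuing_mutex": 50.960339,
}
locks_composer = {
    "mutex": 76.449678, "spin_mutex": 13.449323, "queuing_mutex": 50.961819,
}

def relative_to(timings, reference):
    """Express every timing as a multiple of the reference entry."""
    ref = timings[reference]
    return {name: t / ref for name, t in timings.items()}

for label, timings in (("ICC 10.1", locks_icc101),
                       ("Parallel Composer", locks_composer)):
    ratios = relative_to(timings, "spin_mutex")
    for name, ratio in ratios.items():
        print(f"{label:18s} {name:14s} {ratio:6.2f}x spin_mutex")
```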

_heinz · « **Reply #439 on:** 28 Nov 2008, 02:48:58 pm »

Quote from: Jason G on 28 Nov 2008, 01:00:42 pm

No, just used default which was 100... will try 1000

[Later:] Fastest 32 bit run built on XP32 ICC10.1 / TBB2.0 now 3 threads:
Quote
Threads number is 3

Now you know why I chose 5: an odd number.
We can create any number of threads: 1, 2, 3, 4 ... 128, 256, 512, etc., and odd numbers as well.
We can also use /QxHOST, which targets the latest processor features supported by the compilation host for best performance.

heinz
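
The idea of trying arbitrary thread counts, odd ones included, and keeping the fastest can be sketched as below. This is not the TBB fibonacci example itself: it is a stand-in using Python's `ThreadPoolExecutor`, the workload and the chosen thread counts are assumptions, and because of Python's global interpreter lock a CPU-bound task will not scale the way a native TBB build does. It only illustrates the sweep.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    """Iterative Fibonacci, used here only as a small CPU-bound task."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def time_with_threads(num_threads, numbers):
    """Compute fib() for all numbers on a pool of num_threads workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fib, numbers))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, results

numbers = list(range(100))        # hypothetical workload size
timings = {}
for n_threads in (1, 2, 3, 5, 8):  # odd counts are fine too
    elapsed_ms, results = time_with_threads(n_threads, numbers)
    timings[n_threads] = elapsed_ms
    print(f"Threads number is {n_threads}: {elapsed_ms:.3f} msec")

fastest = min(timings, key=timings.get)
print(f"fastest thread number was {fastest}")
```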

_heinz · « **Reply #440 on:** 28 Nov 2008, 04:02:41 pm »

Quote from: _heinz on 26 Nov 2008, 07:24:20 pm

Quote from: Jason G on 26 Nov 2008, 11:44:44 am
@Heinz: Do you happen to have any single and multithreaded FFT processing times benched on your skulltrail? Time for 1,2,4 & 8 threads would be nice for 32k element &/or 128k elements, if you have them.

I'm trying to verify/refine some efficiency calculations & have no reference but my dual core.

Jason

compiled the fftw project (single thread) as 32 bit
/I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt

Results:
C:\Windows\system32>echo off
fftw-3.1.2 benchfsse(VS2005) started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
131072
Problem: 8, setup: 300.32 us, time: 169.69 ns, ``mflops'': 707.16
Problem: 16, setup: 288.86 us, time: 332.84 ns, ``mflops'': 961.43
Problem: 32, setup: 7.91 ms, time: 726.79 ns, ``mflops'': 1100.7
Problem: 64, setup: 27.46 ms, time: 1.67 us, ``mflops'': 1148.4
Problem: 128, setup: 62.98 ms, time: 4.19 us, ``mflops'': 1069.1
Problem: 256, setup: 137.48 ms, time: 9.18 us, ``mflops'': 1115
Problem: 512, setup: 267.80 ms, time: 20.95 us, ``mflops'': 1099.6
Problem: 1024, setup: 575.47 ms, time: 46.10 us, ``mflops'': 1110.7
Problem: 2048, setup: 1.37 s, time: 99.17 us, ``mflops'': 1135.8
Problem: 4096, setup: 3.42 s, time: 220.42 us, ``mflops'': 1115
Problem: 8192, setup: 8.83 s, time: 530.79 us, ``mflops'': 1003.2
Problem: 16384, setup: 21.99 s, time: 1.13 ms, ``mflops'': 1014.9
Problem: 32768, setup: 53.80 s, time: 2.41 ms, ``mflops'': 1020
Problem: 131072, setup: 369.12 s, time: 9.89 ms, ``mflops'': 1126
fftw-3.1.2 benchfsse ended.
Press any key . . .
----------------------------------------------------------------------------------------------------
For the threaded variants I must first read the documentation again...
Is this what you meant? If you want some other compiler options, let me know.
Once I have installed the Intel® Parallel Composer Beta, I will recompile the project...

regards heinz

The sample above was compiled with the Microsoft C compiler (VS2005).
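
As a cross-check of the listing above: the ``mflops'' column appears to be a nominal figure, 5·N·log2(N) divided by the time in microseconds, rather than a count of operations actually executed. That formula is an assumption here, but a minimal sketch shows it reproduces three of the reported rows to within rounding of the printed times:

```python
import math

def nominal_mflops(n, time_us):
    """Nominal FFT rate: 5 * N * log2(N) operations per microsecond."""
    return 5.0 * n * math.log2(n) / time_us

# (problem size, time in microseconds, reported mflops) from the listing above.
rows = [
    (8, 0.16969, 707.16),      # 169.69 ns
    (1024, 46.10, 1110.7),     # 46.10 us
    (131072, 9890.0, 1126.0),  # 9.89 ms
]

for n, time_us, reported in rows:
    computed = nominal_mflops(n, time_us)
    print(f"N={n:6d}  computed {computed:8.1f}  reported {reported:8.1f}")
```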

C:\Windows\system32>echo off
compiled with Parallel Composer Configuration(Release float SSE) Platform(Win32)
fftw-3.1.2 benchf_sse started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
131072
Problem: 8, setup: 241.93 us, time: 49.93 ns, ``mflops'': 2403.6
Problem: 16, setup: 276.57 us, time: 94.39 ns, ``mflops'': 3390
Problem: 32, setup: 7.91 ms, time: 117.86 ns, ``mflops'': 6787.9
Problem: 64, setup: 26.76 ms, time: 219.35 ns, ``mflops'': 8753.3
Problem: 128, setup: 61.71 ms, time: 447.42 ns, ``mflops'': 10013
Problem: 256, setup: 124.16 ms, time: 855.56 ns, ``mflops'': 11969
Problem: 512, setup: 238.18 ms, time: 1.99 us, ``mflops'': 11575
Problem: 1024, setup: 403.56 ms, time: 4.47 us, ``mflops'': 11455
Problem: 2048, setup: 719.56 ms, time: 10.62 us, ``mflops'': 10611
Problem: 4096, setup: 1.41 s, time: 25.84 us, ``mflops'': 9510.4
Problem: 8192, setup: 3.14 s, time: 58.67 us, ``mflops'': 9076.4
Problem: 16384, setup: 7.01 s, time: 125.16 us, ``mflops'': 9163.6
Problem: 32768, setup: 16.08 s, time: 279.92 us, ``mflops'': 8779.5
Problem: 131072, setup: 87.35 s, time: 1.29 ms, ``mflops'': 8658.3
fftw-3.1.2 benchf_sse ended.

With 128K: 8658.3 mflops.
Best ratio ~1:10.
Let everybody draw their own conclusions..
heinz
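
The "~1:10" figure can be checked directly from the two listings. A short sketch, using only the mflops values reported above (the dictionary names are illustrative), computes the ratio per problem size and picks the largest:

```python
# Reported mflops per problem size, copied from the two listings above.
msvc = {
    8: 707.16, 16: 961.43, 32: 1100.7, 64: 1148.4, 128: 1069.1,
    256: 1115, 512: 1099.6, 1024: 1110.7, 2048: 1135.8, 4096: 1115,
    8192: 1003.2, 16384: 1014.9, 32768: 1020, 131072: 1126,
}
composer = {
    8: 2403.6, 16: 3390, 32: 6787.9, 64: 8753.3, 128: 10013,
    256: 11969, 512: 11575, 1024: 11455, 2048: 10611, 4096: 9510.4,
    8192: 9076.4, 16384: 9163.6, 32768: 8779.5, 131072: 8658.3,
}

# Ratio of the Parallel Composer build to the Microsoft build per size.
ratios = {n: composer[n] / msvc[n] for n in msvc}
for n, ratio in ratios.items():
    print(f"Problem {n:6d}: {ratio:5.2f}x")

best_size = max(ratios, key=ratios.get)
print(f"best ratio at N={best_size}: {ratios[best_size]:.2f}x")
```

The ratio is smallest for the tiny transforms and settles between roughly 8x and 11x from 128 elements upward.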

Jason G · « **Reply #441 on:** 28 Nov 2008, 10:26:52 pm »

Ahhh, so FFTW's warnings about MS compiler generating incorrect SSE code for FFTW might be correct. Good to know. I'm pretty sure the stock DLL would have been built with GCC/MinGW.

Much better numbers

_heinz · « **Reply #442 on:** 29 Dec 2008, 05:26:30 am »

Hi Jason,
the new Intel board is available --> Intel SmackOver DX58SO X58, price 228,58 € in Germany
Product type: Motherboard
Form factor: ATX
Dimensions (width x depth x height): 30.5 cm x 24.4 cm
Chipset: Intel X58 Express / Intel ICH10R
Multi-core support: 4-core
Processor: 0 ( 1 ) - LGA1366 socket
Compatible processors: Core i7, Core i7 Extreme
64-bit processor compatibility: built in
RAM: 0 MB (installed) / 16 GB (max)
Supported RAM technology: DDR3 SDRAM
Supported RAM integrity check: non-ECC
Storage controller: Serial ATA-300 (RAID)
USB port configuration: 12 x USB
Storage port configuration: 6 x SATA, 2 x eSATA
FireWire port configuration: 2 x FireWire
Audio output: sound card - 7.1 channel surround
Network: network adapter - Intel 82567LM - Ethernet, Fast Ethernet, Gigabit Ethernet

Have a look at http://www.kmelektronik.de/

heinz

_heinz · « **Reply #443 on:** 05 Jan 2009, 08:10:12 am »

Happy New Year,
the new year has started with some big topics.
Shortly before Christmas the latest AP was released; thanks to everyone involved in making it possible.
1. AP rev69 run time is now 9 - 10 hours; standard AP needs about 70 - 90 hours (measured on an Intel E8600 @ 3.6 GHz).
2. We are working on AP to make it fit for much more parallelism.
3. My test and development machine AK-V8 suffered from a bad disk, which I removed today. Now it runs again.
4. Some support requests are still open regarding the ATI 8.12 driver, which I need for the ATI developer environment.
5. Our work will take place in the closed forums, so expect a surprise from time to time.

heinz
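
The run times in point 1 imply a speedup range that can be worked out as follows. This is only arithmetic on the figures quoted above; the variable names are illustrative.

```python
# Run times in hours from point 1 above (Intel E8600 @ 3.6 GHz).
optimized_hours = (9, 10)    # AP rev69
standard_hours = (70, 90)    # standard AP

# Most pessimistic pairing: fastest standard run vs. slowest optimized run.
min_speedup = standard_hours[0] / optimized_hours[1]
# Most optimistic pairing: slowest standard run vs. fastest optimized run.
max_speedup = standard_hours[1] / optimized_hours[0]

print(f"speedup between {min_speedup:.1f}x and {max_speedup:.1f}x")
```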

Crunch3r · « **Reply #444 on:** 05 Jan 2009, 01:36:35 pm »

Quote from: _heinz on 28 Nov 2008, 04:02:41 pm

You have to be careful with fftw and which compiler you use. In my own experience the pre-packaged gcc builds were always faster than the icc-compiled code!

Jason G · « **Reply #445 on:** 05 Jan 2009, 06:37:36 pm »

Quote from: Crunch3r on 05 Jan 2009, 01:36:35 pm

You have to be careful with fftw and which compiler you use. In my own experience the pre-packaged gcc builds were always faster than the icc-compiled code!

Good tip. When I tried this on my P4 last year, the machine didn't seem to handle the GCC build. I will have to get around to trying again on newer hardware.

_heinz · « **Reply #446 on:** 24 Feb 2009, 11:36:26 am »

The new astropulse_v5 5.03 was published on February 21st; have a look at our front page http://lunatics.kwsn.net/index.php

_heinz · « **Reply #447 on:** 31 Mar 2009, 03:12:18 am »

Running the new astropulse 5.03:
21 _heinz 13,969.73 1,857,340 GenuineIntel
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz [Intel64 Family 6 Model 23 Stepping 6]
(8 processors)
---------------------------------------------------------------------------
Now number 21 with a RAC of 13,969.73, without any GPU app.
Hope to reach 14,000 tonight.

heinz

_heinz · « **Reply #448 on:** 01 Apr 2009, 12:19:03 pm »

21 _heinz 14,405.11 1,880,786
Edit, two days later:
18 _heinz 15,307.10 1,915,535
15 _heinz 15,699.14 1,924,733
V8-SK01 home 16,033.16 1,929,683
14 _heinz 16,033.16 1,929,683

Jason G · « **Reply #449 on:** 01 Apr 2009, 12:25:46 pm »

LoL, Go Heinz! My downward dive has started. Both machines cooling down ready for cleanout. Two more weeks 'till our refactoring microscopic disassembly sessions. Dust off your dork hat, I still need to find mine .. I left it somewhere...
