Author Topic: optimized sources  (Read 615825 times)

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #435 on: 28 Nov 2008, 12:46:23 pm »
Quote
Ahhh, 6 meg per package ( 1.5 meg per core )... Okay, yep it is 12 meg total for the 8 cores.

Compared the 32 bit ICC 10.1 / TBB 2.0 build of fibonacci, and it IS slower than the Parallel Composer 32 bit build under XP64 ... Will have to try that build under XP32 to confirm though.  I will probably update all my ICC/IPP base packages as soon as I get time, in a few weeks.

Jason


12 MB per chip:
BX80574E5405A, active cooler or for 1U systems, 45 nm, E5405, 2.00 GHz (80 W), 1333, 12 MB total
We have 2 processors, so we have 24 MB for 8 cores.
« Last Edit: 28 Nov 2008, 12:48:58 pm by _heinz »
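
The cache arithmetic in this post can be written out as a small sketch. The figure of four cores per package is an assumption derived from "2 processors" and "8 cores"; the per-core number is only an average and says nothing about how the cache is actually shared between cores.

```python
# Figures taken from the post above: two E5405 packages, 12 MB L2 each.
# CORES_PER_PACKAGE is derived from "2 processors ... 8 cores" (assumption).
PACKAGES = 2
L2_PER_PACKAGE_MB = 12
CORES_PER_PACKAGE = 4

total_cores = PACKAGES * CORES_PER_PACKAGE
total_l2_mb = PACKAGES * L2_PER_PACKAGE_MB
l2_per_core_mb = total_l2_mb / total_cores

print(f"{total_l2_mb} MB L2 across {total_cores} cores "
      f"= {l2_per_core_mb:g} MB per core on average")
```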

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #436 on: 28 Nov 2008, 12:49:42 pm »
Err well CPU-Z shows only per core then?  In any case:

Hmm, not a lot of Fibonacci difference here, but some: (fastest thread number was 2)

Built under xp32 with ICC 10.1 + TBB (run on XP 32)
Quote
Threads number is 2
Shared serial (mutex)           - in 0.286294 msec
Shared serial (spin_mutex)      - in 0.196978 msec
Shared serial (queuing_mutex)   - in 0.301214 msec
Shared serial (Conc.HashTable)  - in 4.313505 msec
Parallel while+for/queue        - in 1.485761 msec
Parallel pipe/queue             - in 1.980293 msec
Parallel reduce                 - in 0.523162 msec
Parallel scan                   - in 0.338611 msec
Parallel tasks                  - in 0.566134 msec

And built under XP64 with Parallel Composer Beta Update 2 + TBB 2.0 (but also run on XP 32):
Quote
Threads number is 2
Shared serial (mutex)           - in 0.279819 msec
Shared serial (spin_mutex)      - in 0.208223 msec
Shared serial (queuing_mutex)   - in 0.284642 msec
Shared serial (Conc.HashTable)  - in 4.461598 msec
Parallel while+for/queue        - in 1.718736 msec
Parallel pipe/queue             - in 2.188073 msec
Parallel reduce                 - in 0.571781 msec
Parallel scan                   - in 0.357319 msec
Parallel tasks                  - in 0.534837 msec

So some things look a bit slower, but I will carefully consider shifting to ICC 11 soon, and check how our projects of interest compare.
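
To put a number on "a bit slower", the two 2-thread runs can be compared row by row. This is only a sketch using the timings quoted above; the dictionary keys are shortened labels chosen here, and single runs this short are likely to be noisy.

```python
# Timings in msec copied from the two quoted runs (2 threads, default 100 numbers).
icc101 = {
    "mutex": 0.286294,
    "spin_mutex": 0.196978,
    "queuing_mutex": 0.301214,
    "Conc.HashTable": 4.313505,
    "while+for/queue": 1.485761,
    "pipe/queue": 1.980293,
    "reduce": 0.523162,
    "scan": 0.338611,
    "tasks": 0.566134,
}
composer = {
    "mutex": 0.279819,
    "spin_mutex": 0.208223,
    "queuing_mutex": 0.284642,
    "Conc.HashTable": 4.461598,
    "while+for/queue": 1.718736,
    "pipe/queue": 2.188073,
    "reduce": 0.571781,
    "scan": 0.357319,
    "tasks": 0.534837,
}


def percent_change(old, new):
    """Positive result means `new` took longer than `old`."""
    return (new - old) / old * 100.0


changes = {name: percent_change(icc101[name], composer[name]) for name in icc101}
slower = sorted(name for name, pct in changes.items() if pct > 0)

for name, pct in changes.items():
    print(f"{name:16s} {pct:+6.1f} %")
print("slower with Parallel Composer:", ", ".join(slower))
```

Six of the nine variants come out slower in the Parallel Composer build and three faster, which fits the "some things look a bit slower" summary.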

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #437 on: 28 Nov 2008, 12:58:28 pm »
How many numbers did you let it generate? 1000?

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #438 on: 28 Nov 2008, 01:00:42 pm »
No, just used default which was 100... will try 1000

[Later:]  Fastest 32 bit run built on XP32 ICC10.1 / TBB2.0 now 3 threads  :o:
Quote
Threads number is 3
Shared serial (mutex)           - in 162.014407 msec
Shared serial (spin_mutex)      - in 11.609819 msec
Shared serial (queuing_mutex)   - in 50.960339 msec
Shared serial (Conc.HashTable)  - in 401.327768 msec
Parallel while+for/queue        - in 93.399315 msec
Parallel pipe/queue             - in 164.994829 msec
Parallel reduce                 - in 27.500117 msec
Parallel scan                   - in 22.918168 msec
Parallel tasks                  - in 25.904447 msec

Getting parallel composer build data:
Quote
Threads number is 3
Shared serial (mutex)           - in 76.449678 msec
Shared serial (spin_mutex)      - in 13.449323 msec
Shared serial (queuing_mutex)   - in 50.961819 msec
Shared serial (Conc.HashTable)  - in 413.186277 msec
Parallel while+for/queue        - in 93.995606 msec
Parallel pipe/queue             - in 171.541281 msec
Parallel reduce                 - in 28.647254 msec
Parallel scan                   - in 27.231642 msec
Parallel tasks                  - in 24.389762 msec


« Last Edit: 28 Nov 2008, 01:07:33 pm by Jason G »
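
The output format above is regular enough to parse mechanically. The sketch below reads the ICC 10.1 run (3 threads, 1000 numbers) and ranks the variants relative to the fastest one; the regular expression is an assumption about the format based only on the lines shown here.

```python
import re

# Output copied from the ICC 10.1 / TBB 2.0 run above (3 threads, 1000 numbers).
RUN = """\
Threads number is 3
Shared serial (mutex)           - in 162.014407 msec
Shared serial (spin_mutex)      - in 11.609819 msec
Shared serial (queuing_mutex)   - in 50.960339 msec
Shared serial (Conc.HashTable)  - in 401.327768 msec
Parallel while+for/queue        - in 93.399315 msec
Parallel pipe/queue             - in 164.994829 msec
Parallel reduce                 - in 27.500117 msec
Parallel scan                   - in 22.918168 msec
Parallel tasks                  - in 25.904447 msec
"""

# "<name>   - in <number> msec"; lines that do not match are skipped.
LINE = re.compile(r"^(?P<name>.+?)\s+- in (?P<msec>\d+\.\d+) msec$")


def parse(text):
    """Return {variant name: time in msec} for every timing line in `text`."""
    results = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            results[match.group("name")] = float(match.group("msec"))
    return results


timings = parse(RUN)
fastest = min(timings, key=timings.get)
for name, msec in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:32s} {msec:10.3f} msec  {msec / timings[fastest]:5.1f}x")
```

In this run the plain mutex variant takes roughly 14 times as long as the spin_mutex variant, while the Parallel Composer run reports about half that mutex time (76.449678 msec), so that row is the one worth re-measuring.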

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #439 on: 28 Nov 2008, 02:48:58 pm »
Quote
No, just used default which was 100... will try 1000

[Later:]  Fastest 32 bit run built on XP32 ICC10.1 / TBB2.0 now 3 threads  :o:
Quote
Threads number is 3
Now you know why I chose 5, an odd number.
We can create any number of threads: 1, 2, 3, 4 ... 128, 256, 512, etc., odd numbers too.
And we can use /QxHOST, which targets the latest processor features supported by the compilation host for best performance.
 ::)
heinz
« Last Edit: 28 Nov 2008, 02:56:17 pm by _heinz »
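
The point that the thread count is a free parameter, odd values included, can be sketched as follows. This is not the TBB code discussed in this thread: it uses Python's standard `concurrent.futures` pool purely as an illustration, the helper names `fib` and `fib_table` are made up here, and on standard CPython the global interpreter lock means this CPU-bound loop will not get faster with more threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fib(n):
    """Iterative Fibonacci with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_table(count, threads):
    """Compute fib(0) .. fib(count - 1) on a pool with `threads` workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fib, range(count)))


# Odd counts such as 3 and 5 are as valid as 2, 4 or 8.
for threads in (1, 2, 3, 5, 8):
    table = fib_table(100, threads)
    print(f"{threads} thread(s): fib(99) has {len(str(table[-1]))} digits")
```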

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #440 on: 28 Nov 2008, 04:02:41 pm »
Quote
@Heinz: Do you happen to have any single and multithreaded FFT processing times benched on your skulltrail?  Times for 1, 2, 4 & 8 threads would be nice for 32k elements and/or 128k elements, if you have them.

I'm trying to verify/refine some efficiency calculations & have no reference but my dual core.

Jason

Compiled the fftw project (single thread) as 32 bit:
 /I "." /I ".." /I "../libbench2" /I "../api" /I "../kernel" /I "../dft" /I "../rdft" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "FFTW_SINGLE" /D "BENCHFFT_SINGLE" /D "HAVE_SSE" /D "_VC80_UPGRADE=0x0710" /D "_MBCS" /FD /EHsc /MT /Fp".\bench___Win32_Release_float/bench.pch" /Fo".\bench___Win32_Release_float/" /Fd".\bench___Win32_Release_float/" /W3 /nologo /c /errorReport:prompt

Results:
C:\Windows\system32>echo off
fftw-3.1.2 benchfsse(VS2005) started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
131072
Problem: 8, setup: 300.32 us, time: 169.69 ns, ``mflops'': 707.16
Problem: 16, setup: 288.86 us, time: 332.84 ns, ``mflops'': 961.43
Problem: 32, setup: 7.91 ms, time: 726.79 ns, ``mflops'': 1100.7
Problem: 64, setup: 27.46 ms, time: 1.67 us, ``mflops'': 1148.4
Problem: 128, setup: 62.98 ms, time: 4.19 us, ``mflops'': 1069.1
Problem: 256, setup: 137.48 ms, time: 9.18 us, ``mflops'': 1115
Problem: 512, setup: 267.80 ms, time: 20.95 us, ``mflops'': 1099.6
Problem: 1024, setup: 575.47 ms, time: 46.10 us, ``mflops'': 1110.7
Problem: 2048, setup: 1.37 s, time: 99.17 us, ``mflops'': 1135.8
Problem: 4096, setup: 3.42 s, time: 220.42 us, ``mflops'': 1115
Problem: 8192, setup: 8.83 s, time: 530.79 us, ``mflops'': 1003.2
Problem: 16384, setup: 21.99 s, time: 1.13 ms, ``mflops'': 1014.9
Problem: 32768, setup: 53.80 s, time: 2.41 ms, ``mflops'': 1020
Problem: 131072, setup: 369.12 s, time: 9.89 ms, ``mflops'': 1126
fftw-3.1.2 benchfsse ended.
Press any key . . .
----------------------------------------------------------------------------------------------------
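
As a rough cross-check of the ``mflops'' column: the printed values are consistent with the nominal figure 5 * N * log2(N) divided by the time in microseconds, which is the scaling the FFTW benchmark uses for complex transforms. That this run used exactly that definition is an assumption; the sketch only shows that a few rows agree with it to within rounding of the printed times.

```python
import math


def fftw_mflops(n, seconds):
    """Nominal rate: 5 * n * log2(n) / (time in microseconds)."""
    return 5.0 * n * math.log2(n) / (seconds * 1e6)


# (size, time in seconds, reported "mflops") copied from the run above.
rows = [
    (8, 169.69e-9, 707.16),
    (1024, 46.10e-6, 1110.7),
    (32768, 2.41e-3, 1020.0),
    (131072, 9.89e-3, 1126.0),
]

for n, seconds, reported in rows:
    computed = fftw_mflops(n, seconds)
    print(f"n={n:6d}  computed {computed:7.1f}  reported {reported:7.1f}")
```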
For the threaded variants I must first read the docs again...
Did you mean this? Or if you want some other compiler options, let me know.
Once I have installed the Intel® Parallel Composer Beta, I will recompile the project...

regards heinz

The sample above was compiled with the MSC compiler (VS2005).

C:\Windows\system32>echo off
compiled with Parallel Composer  Configuration(Release float SSE) Platform(Win32)
fftw-3.1.2 benchf_sse started
benchf_sse.exe -opatient 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
131072
Problem: 8, setup: 241.93 us, time: 49.93 ns, ``mflops'': 2403.6
Problem: 16, setup: 276.57 us, time: 94.39 ns, ``mflops'': 3390
Problem: 32, setup: 7.91 ms, time: 117.86 ns, ``mflops'': 6787.9
Problem: 64, setup: 26.76 ms, time: 219.35 ns, ``mflops'': 8753.3
Problem: 128, setup: 61.71 ms, time: 447.42 ns, ``mflops'': 10013
Problem: 256, setup: 124.16 ms, time: 855.56 ns, ``mflops'': 11969
Problem: 512, setup: 238.18 ms, time: 1.99 us, ``mflops'': 11575
Problem: 1024, setup: 403.56 ms, time: 4.47 us, ``mflops'': 11455
Problem: 2048, setup: 719.56 ms, time: 10.62 us, ``mflops'': 10611
Problem: 4096, setup: 1.41 s, time: 25.84 us, ``mflops'': 9510.4
Problem: 8192, setup: 3.14 s, time: 58.67 us, ``mflops'': 9076.4
Problem: 16384, setup: 7.01 s, time: 125.16 us, ``mflops'': 9163.6
Problem: 32768, setup: 16.08 s, time: 279.92 us, ``mflops'': 8779.5
Problem: 131072, setup: 87.35 s, time: 1.29 ms, ``mflops'': 8658.3
fftw-3.1.2 benchf_sse ended.

With 128K: 8658.3 mflops.
Best ratio ~1:10.
Let everybody draw their own conclusions.
heinz
« Last Edit: 28 Nov 2008, 04:09:39 pm by _heinz »
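
The "~1:10" figure can be checked against the two tables. The sketch below divides the Parallel Composer ``mflops'' by the MSC ones for each size; the numbers are copied from the runs above and nothing else is assumed.

```python
# Reported "mflops" per transform size, copied from the two runs above.
sizes = [8, 16, 32, 64, 128, 256, 512, 1024,
         2048, 4096, 8192, 16384, 32768, 131072]
msc = [707.16, 961.43, 1100.7, 1148.4, 1069.1, 1115, 1099.6, 1110.7,
       1135.8, 1115, 1003.2, 1014.9, 1020, 1126]
composer = [2403.6, 3390, 6787.9, 8753.3, 10013, 11969, 11575, 11455,
            10611, 9510.4, 9076.4, 9163.6, 8779.5, 8658.3]

# Speedup of the Parallel Composer build over the MSC build, per size.
speedups = {n: c / m for n, m, c in zip(sizes, msc, composer)}
best = max(speedups, key=speedups.get)

for n, factor in speedups.items():
    print(f"n={n:6d}  {factor:5.2f}x")
print(f"best: n={best} at {speedups[best]:.2f}x")
```

The ratio peaks at about 10.7x for 256 elements and is about 7.7x at 128K, so "~1:10" describes the best case rather than the large sizes.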

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #441 on: 28 Nov 2008, 10:26:52 pm »
Ahhh, so FFTW's warnings about MS compiler generating incorrect SSE code for FFTW might be correct.   Good to know.  I'm pretty sure the stock DLL would have been built with GCC/MinGW.

Much better numbers  ;D

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #442 on: 29 Dec 2008, 05:26:30 am »
Hi Jason,
The new Intel board is available: Intel SmackOver DX58SO X58, price 228.58 € in Germany.
Product type: Motherboard
Form factor: ATX
Dimensions (width x depth x height): 30.5 cm x 24.4 cm
Chipset: Intel X58 Express / Intel ICH10R
Multi-core support: 4-core
Processor: 0 (1) - LGA1366 socket
Compatible processors: Core i7, Core i7 Extreme
64-bit processor compatibility: built in
RAM: 0 MB (installed) / 16 GB (max)
Supported RAM technology: DDR3 SDRAM
Supported RAM integrity check: non-ECC
Storage controller: Serial ATA-300 (RAID)
USB port configuration: 12 x USB
Storage port configuration: 6 x SATA, 2 x eSATA
FireWire port configuration: 2 x FireWire
Audio output: sound card - 7.1 channel surround
Network: network adapter - Intel 82567LM - Ethernet, Fast Ethernet, Gigabit Ethernet

Have a look at http://www.kmelektronik.de/
 
heinz


Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #443 on: 05 Jan 2009, 08:10:12 am »
Happy New Year,
the new year has started with some big issues.
Shortly before Christmas the latest AP was released; thanks to everyone involved in making it possible.
1. AP rev69 run time is now 9-10 hours; standard AP needs about 70-90 hours (measured on an Intel E8600 @ 3.6 GHz).
2. We are working on AP to make it fit for much more parallelism.
3. My test and development machine AK-V8 suffered from a bad disk, which I removed today. Now it runs again.
4. Some support requests are still open regarding the ATI 8.12 driver, which I need for the ATI developer environment.
5. Our activity will be in the closed forums, so let yourself be surprised from time to time.

heinz
 ;D
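
For item 1, the run times translate into a speedup range as sketched below. The bounds simply pair the extremes of the two quoted ranges and are not a measurement of any particular work unit.

```python
# Run times in hours from item 1 above (Intel E8600 @ 3.6 GHz).
standard_hours = (70, 90)
rev69_hours = (9, 10)

# Pessimistic bound: fastest standard run against slowest rev69 run.
lowest = standard_hours[0] / rev69_hours[1]
# Optimistic bound: slowest standard run against fastest rev69 run.
highest = standard_hours[1] / rev69_hours[0]

print(f"AP rev69 speedup: between {lowest:g}x and {highest:g}x")
```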

Offline Crunch3r

  • Knight who says 'Ni!'
  • *****
  • Posts: 602
    • 64 bit boinc clients
Re: optimized sources
« Reply #444 on: 05 Jan 2009, 01:36:35 pm »
You gotta be careful with fftw and which compiler to use. From my own experience the pre-packaged gcc builds were always faster than the icc compiled code!


Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #445 on: 05 Jan 2009, 06:37:36 pm »
Quote
You gotta be careful with fftw and which compiler to use. From my own experience the pre-packaged gcc builds were always faster than the icc compiled code!

Good tip. When I tried this on my P4 last year, the machine didn't seem to handle the GCC build; I will have to get around to trying again on newer hardware.
« Last Edit: 05 Jan 2009, 06:41:44 pm by Jason G »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #446 on: 24 Feb 2009, 11:36:26 am »
The new astropulse_v5 5.03 was published on February 21st; have a look at our front page: http://lunatics.kwsn.net/index.php

 ;D

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #447 on: 31 Mar 2009, 03:12:18 am »
Running the new astropulse 5.03:
21 _heinz 13,969.73 1,857,340 GenuineIntel
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz [Intel64 Family 6 Model 23 Stepping 6]
(8 processors) 
---------------------------------------------------------------------------
Now number 21 with a RAC of 13,969.73, without any GPU app.
Hope to reach 14,000 tonight.

heinz  ;D

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #448 on: 01 Apr 2009, 12:19:03 pm »
21 _heinz 14,405.11 1,880,786
Edit, two days later:
18 _heinz 15,307.10 1,915,535
15 _heinz 15,699.14 1,924,733
V8-SK01 home 16,033.16 1,929,683
14 _heinz 16,033.16 1,929,683
« Last Edit: 03 Apr 2009, 10:18:15 am by _heinz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #449 on: 01 Apr 2009, 12:25:46 pm »
LoL, Go Heinz! My downward dive has started. Both machines cooling down ready for cleanout.  Two more weeks 'til our refactoring microscopic disassembly sessions.  Dust off your dork hat, I still need to find mine .. I left it somewhere...

 
