Author Topic: BOINC as library  (Read 32929 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: BOINC as library
« Reply #30 on: 29 Oct 2007, 10:27:07 am »
Well, "results strong similar" ....

Nice :D, sounds to me like it might be just a few compiler flags that differ, for only a 2% difference!

Jason

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: BOINC as library
« Reply #31 on: 16 Nov 2007, 01:45:04 pm »
Here are all the diffs I made to compile the 2.4 sources (actually 2.39S, but there are only 2 differences, in #define strings, which were added to the diffs after the build and should not get in the way of rebuilding the client) with VS 2005 and trial versions of ICC and IPP.
opt_config.h was added to simplify tuning of the conditional defines and compilation across the few source files affected.



[attachment deleted by admin]

Offline Jason G

Re: BOINC as library
« Reply #32 on: 17 Nov 2007, 09:29:35 am »
Good idea to record the changes like that; mine are scribbled on an old envelope :P. Looks like similar changes overall. Did you end up with favourite compiler settings? The 2.4lunatics ones for QxN look pretty close to the ones I've played with on my P4.

Jason

Offline Raistmer

Re: BOINC as library
« Reply #33 on: 17 Nov 2007, 12:34:16 pm »
Well, I didn't even record my changes ;) I just downloaded the lunatics 2.4 sources yesterday from the link on the main SETI board, ran the WinDiff utility and collected all the discrepancies in one rar ::)

I used SSE2 build options because that binary was intended to be run and profiled on an AMD 64 host. I use CodeAnalyst as the profiling tool (on the assumption that AMD should know their own CPUs better than Intel ;) ). It would be interesting to compare your vTune data with the CodeAnalyst data to highlight areas of interest for improvement.
I probably need to check the option set more precisely, because my build is a little less than optimal. Another possibility: the options are fine and the 2% difference in speed comes from the trial nature of my IPP installation. Intel allows only dynamic linking with the trial IPP library. So DLL calls... I don't really know whether this could account for the 2% slowdown or not (even the 2% is still preliminary: tested only on a short WU).


Offline Jason G

Re: BOINC as library
« Reply #34 on: 17 Nov 2007, 12:54:40 pm »
Very good idea to compare SSE2 QxN P4 vTune data against your SSE2 AMD build. There are some arguments about that! :D

Maybe you found hotspots in an inner folding routine? Mine chooses FoldArrayBy2AL and spends about 10% of the total time in there. Maybe yours chooses a different routine? Either way, we could even compare the asm listing output of those, which might explain some differences between the chips! (Those functions don't depend on IPP as far as I know.)

[Note also that, because I am using ICC, about 11% of the time is being spent in _Intel_fast_memcpy. Having looked at a mixture of improved memcopies, eliminating them, and hybrid processing techniques in other areas, there might be some generally applicable improvements to make there (not just for Intel chips).]

Even though yours calls the dynamic library, it would be nice to see whether the dispatching calls the same IPP functions (but DLL versions) or some different, maybe more generic, ones. Mine calls the w7 static ones, which are P4 SSE2, but the internal names given by vTune / CodeAnalyst will give the real names.

Jason
« Last Edit: 17 Nov 2007, 01:09:41 pm by j_groothu »

Offline Raistmer

Re: BOINC as library
« Reply #35 on: 22 Nov 2007, 05:49:44 pm »
Well, some initial results.
My version spends most of its time in sse3_ChirpData_ak
(1 function, 78 instructions, Total: 12300 samples, 19.37% of samples in the module, 5.12% of total session samples)
[these lines take the most:
    Address     Line    Trace    Source                                                                              Code Bytes    Timer samples
    0x4a9dfe    125              m = vec_recip3(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)));                                   3989
    0x4a9d76    111              c = _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(y, CC3),                    3631
]

It's pretty strange, because I used SSE2 build options... Maybe the function name is not quite accurate? (Or maybe something is wrong with the profiler or with my understanding of its results :) )

Next one is fastcopy_I
(1 function, 24 instructions, Total: 8714 samples, 13.72% of samples in the module, 3.63% of total session samples)

and in the IPP DLL most samples hit ippsZero_8u
(1 function, 1160 instructions, Total: 46923 samples, 99.89% of samples in the module, 19.55% of total session samples)
[this line leads:
Address      Code Bytes            Instruction                  Symbol                 Timer samples
0x200ede9    0x 0F 28 4C 32 10     movaps xmm1,[edx+esi+10h]    ippsZero_8u+1473833    5965
]
1 instructions, Total: 5965 samples, 4.87% of samples in module p:\bin\intel\ipp\5.3\ia32\bin\ippst7-5.3.dll, 0.99% of total session samples
As one can see, it's almost the only function called in the whole DLL... (very strange too).
It was a 240 sec profiling run. What time scale is best suited for profiling all the main app activities? I will try to increase the profiling time; maybe that will give more adequate results...

An addendum:
sse_sum2,3,4,5 and sse_f_GetPeak have the highest number of unaligned accesses.
sse3_ChirpData_ak and fastcopy_I have the most data cache misses.
« Last Edit: 22 Nov 2007, 06:39:19 pm by Raistmer »

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: BOINC as library
« Reply #36 on: 22 Nov 2007, 07:51:20 pm »
Quote
Well, some initial results.
My version spends most of its time in sse3_ChirpData_ak
...
It's pretty strange, because I used SSE2 build options... Maybe the function name is not quite accurate? (Or maybe something is wrong with the profiler or with my understanding of its results :) )

The program design is to build all the hand-optimized code with at least whatever minimum options each routine requires, then use run-time detection of the host's capabilities to decide which of those routines to speed-test. So the opt_SSE3.cpp module is built with its needed SSE3 setting, your CPU supports SSE3, and that routine tests faster than the other chirp routines on your system, so it is chosen as the one to use during actual crunching.
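The selection scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the application's actual code: the feature names, the `choose_fastest` and `wall_clock` helpers, and the two stand-in chirp routines are all hypothetical.

```python
import time

def choose_fastest(candidates, host_features, measure):
    """Return the name of the fastest candidate the host can run.

    candidates: dict mapping name -> (required_feature, callable)
    host_features: set of feature names the host supports
    measure: callable taking a function and returning its run time in seconds
    """
    timings = {}
    for name, (required, func) in candidates.items():
        if required in host_features:  # skip routines the CPU cannot execute
            timings[name] = measure(func)
    if not timings:
        raise ValueError("no candidate routine is supported on this host")
    return min(timings, key=timings.get)

def wall_clock(func, repeats=5):
    """Best-of-N wall-clock time for one call of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-ins for the real chirp routines.
def chirp_generic():
    return sum(i * i for i in range(2000))

def chirp_sse3():
    return sum(i * i for i in range(1000))

CANDIDATES = {
    "chirp_generic": ("fpu", chirp_generic),
    "sse3_chirp": ("sse3", chirp_sse3),
}
chosen = choose_fastest(CANDIDATES, {"fpu", "sse3"}, wall_clock)
print("routine chosen for crunching:", chosen)
```

The point of the sketch is that the build options of the main binary do not decide which routine runs: a routine only has to be supported by the host and win the timing test.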

The fraction of time spent chirping is very much affected by the angle range. The reason WUs at high angle range are quick is that they do no Gaussian fitting and not much Pulse or Triplet finding. Chirping is also reduced, but not so much, so it becomes more of the total run time. I don't know CodeAnalyst, so I don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction.
                                                       Joe

Offline Jason G

Re: BOINC as library
« Reply #37 on: 23 Nov 2007, 01:36:08 am »
Quote
...so don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction.
Joe

As well as the familiar/traditional instrumented 'Device Under Test' style of profiling, vTune (and, I guess from this data, CodeAnalyst too) collects the OS/system counters, so data is available on all processes/threads running at the time of the test.

Without having seen the rest of the data (and presuming time-based sampling was used rather than event-based sampling):
From the given information, if it were vTune, the module/process spent about 20% of its time in the chirp routine, and that 20% self time constituted about 5% of system time. This 'might' imply that the total self time of the module makes up about 25% of the system time.
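The 25% estimate follows from dividing the two percentages reported for sse3_ChirpData_ak. A minimal sketch of that arithmetic, assuming both percentages refer to the same set of samples (the function name is made up for illustration):

```python
def module_share_of_session(pct_of_module, pct_of_session):
    """Fraction of all session samples that fall in the module,
    derived from one function's two reported percentages."""
    return pct_of_session / pct_of_module

# sse3_ChirpData_ak: 19.37% of samples in the module, 5.12% of total session samples
share = module_share_of_session(19.37, 5.12)
print(f"module share of session: {share:.1%}")  # prints "module share of session: 26.4%"
```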

That might suggest a single-threaded module going full pelt (constant 100% usage) on 1 core of a quad. Constant 100% usage would be one of the first system-level optimisation goals. !!!!GOAL!!!! Then move on to further optimisation levels: --> application architecture level --> microarchitecture level

Otherwise, if it's a dual or single core, then it may be using less than 100% of the available system CPU time: either other processes were running and taking system resources during the profile (you can diagnose system problems like this), or the module is either IO or memory bound (which might suggest deeper optimisation once system problems are eliminated).

Again, those are just guesses / general guidelines without looking at other data. At system level, for example, what proportion the total module samples were of the total system samples, and especially a CPU usage graph by module, might tell you that you forgot to stop BOINC (done it many times), maybe a virus scan had started, a Windows update, maybe you were watching a DVD? LOL (joke)

Jason
« Last Edit: 23 Nov 2007, 01:59:14 am by j_groothu »

Offline Raistmer

Re: BOINC as library
« Reply #38 on: 23 Nov 2007, 03:18:31 am »
:)
I wasn't watching a DVD ;) The host under test is an AMD 64 3200 Venice, so SSE3 support is indeed available.
Yes, with a timer-based profile CodeAnalyst gathers data on the whole system. Yes, BOINC was running in the background (the Einstein project at that very time). I am interested only in the time distribution inside the SETI exe and the IPP DLL, so I didn't care about stopping/restarting BOINC during the test. That makes "total system time %" meaningless, sure. But SETI should take ~50% of CPU time in this situation, not just 25%. Maybe CodeAnalyst counted the IPP DLL as a distinct module?...
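One way to check the "distinct module" idea is to back out sample totals from the two function lines posted earlier and add the two modules together. This is a rough sketch only: it assumes both lines come from the same profiling session and that the exe and the IPP DLL are reported as separate modules; the helper name is hypothetical.

```python
def totals_from_function(samples, pct_of_module, pct_of_session):
    """Back out (module_total, session_total) sample counts
    from one function's sample count and its two percentages."""
    module_total = samples / (pct_of_module / 100.0)
    session_total = samples / (pct_of_session / 100.0)
    return module_total, session_total

# Figures as posted for the two hottest functions.
exe_module, session_a = totals_from_function(12300, 19.37, 5.12)   # sse3_ChirpData_ak
dll_module, session_b = totals_from_function(46923, 99.89, 19.55)  # ippsZero_8u

# The two session estimates should agree if both lines describe one session.
session = (session_a + session_b) / 2.0
combined = (exe_module + dll_module) / session
print(f"session totals: {session_a:.0f} vs {session_b:.0f}")
print(f"exe + IPP DLL share of session: {combined:.0%}")
```

With these numbers the two session totals agree to within about 0.1%, and the exe plus the DLL together come to roughly 46% of the session, which is close to the expected ~50%.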

Work Unit Info
True angle range:  0.405774

Any comments about why ippsZero_8u takes the most time, please?
And (going by the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).

Offline Jason G

Re: BOINC as library
« Reply #39 on: 23 Nov 2007, 07:01:12 am »
Quote
:)
Not watched DVD ;) ... But SETI should take ~50% of CPU time in this situation, not just 25%. ....

Right, so single core (like my non-HT P4).
More guesses:
    ~50% - 1 Einstein task
    ~1 to 5% - BOINC (it is higher because of context switching on a single core; I've measured it, see below)
    ~2 to 20% - CodeAnalyst (a high sampling rate increases load)
    ~2 to 10% - other system/kernel drivers & services
           subtotal: 55% -> 85% ... average ~70% ;)
    remaining 45% -> 15%, average 30% - your SETI run.

So before you can move on to a deeper optimisation level, you need to measure/graph with CodeAnalyst whatever the equivalent system counters are for these vTune names:
1) With BOINC + Einstein + your SETI task (same conditions as you had)
        - "System: Processor Queue Length"
        - "System: Context Switches/sec" might also be helpful
2) Without BOINC + Einstein, just your SETI task
        - "System: Processor Queue Length"
        - "System: Context Switches/sec" might also be helpful

Maybe too some memory usage might show something if you have limited physical RAM etc...

"System: Processor Queue Length" (vTune name) gives a reading of how many NON-IDLE threads are waiting in the queue for CPU time. On my 2.0GHz non-HT P4 this typically averages about 5 with a seti run (but no boinc+seti), which means that, for the software I run, I could benefit from a dual core of at least 2GHz, preferably a bit more, to bring it into the range of 1 to 2. (A fast quad would probably be wasted on me, but would give practically every running thread, on average, a fresh whole core to itself...)

"System: Context Switches/sec" might also give an idea of how much priority competition is happening on your machine (threads/modules competing). You see this rise slightly during mouse moves, or when more active background programs poll for something regularly (e.g. speedfan, boincview); that looks like speed humps in the context switches/sec.

Quote
Any comments about why ippsZero_8u takes the most time, please?
And (going by the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).
Mine spends a large amount of time in a few of the IPP functions. When you get to do some application and/or architectural level performance measurement you will see the reasons. In some small way it might partially be related to the 'denormal data' issue you brought up before (take a look at the IPP flash tutorials about that). I've been thinking about ways to approach a custom (stripped down) FFTW build for a while now, but I'm not ready yet.

The use of the standard library, and the fact that it would be a DLL on a single core, would be an issue too (probably extra context switches / CPU queue length). It means that, like me, you'd probably benefit from an extra core ;) so if you need to justify going to more cores for Santa to bring one, then "I need one for software development purposes" is probably a pretty good reason to add to the list ;D.

IMO, from the measurements I get, it is a myth that software doesn't benefit from multicore or even HT yet. Who runs only 1 single-threaded process at a time? Only DOS! [And perhaps reviewers doing synthetic benchmarks.] The Windows OS handles all the thread switching much better with multicore or even HT, for DLLs and services. Even without boinc/seti running, system responsiveness and use of system resources would improve for us :D

Jason
« Last Edit: 23 Nov 2007, 07:18:40 am by j_groothu »

Offline Raistmer

Re: BOINC as library
« Reply #40 on: 23 Nov 2007, 01:11:09 pm »
Yes... but if it were multicore, there would be multiple seti/einstein processes to eat the CPU too ;)
It seems my version is still not appropriate for profiling; it is better suited to debugging, since checkpointing is broken.

Offline Jason G

Re: BOINC as library
« Reply #41 on: 23 Nov 2007, 01:27:01 pm »
Quote
Yes... but if it were multicore, there would be multiple seti/einstein processes to eat the CPU too ;)
It seems my version is still not appropriate for profiling; it is better suited to debugging, since checkpointing is broken.

LOL, good point, though you would tend to use the profile data from fully loaded cores just for overall system performance analysis rather than for program profile information. You would stop BOINC for a deeper module profile, so as not to obscure the run.

Checkpointing? Sounds like a boincapi problem, maybe.
« Last Edit: 23 Nov 2007, 04:39:55 pm by j_groothu »

Offline Josef W. Segur

Re: BOINC as library
« Reply #42 on: 24 Nov 2007, 02:37:38 pm »
Quote
...checkpointing broken.

The default checkpoint interval is 300 seconds. When running with BOINC, the "Write to disk" preference overrides that. When running standalone, you need to use an init_data.xml file to supply that and maybe a useful memory size. The knabench package has a suitable one, but I often use this simpler one:
-----------------------------------------------------------------------------
<app_init_data>
<wu_cpu_time>0</wu_cpu_time>
<checkpoint_period>60.000000</checkpoint_period>
<host_info>
    <m_nbytes>134217728.000000</m_nbytes>
</host_info>
</app_init_data>
-----------------------------------------------------------------------------
                                                       Joe
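For anyone who wants to vary the checkpoint period or memory size between standalone runs, a file with the same elements as the one above could be generated along these lines. This is a hedged sketch: the `build_init_data` helper is hypothetical, only the element names shown in the post are used, and whether the application cares about whitespace or element order has not been checked.

```python
import xml.etree.ElementTree as ET

def build_init_data(checkpoint_period_s=60.0, memory_bytes=128 * 1024 * 1024):
    """Return init_data.xml text with the same elements as the sample above."""
    root = ET.Element("app_init_data")
    ET.SubElement(root, "wu_cpu_time").text = "0"
    ET.SubElement(root, "checkpoint_period").text = f"{checkpoint_period_s:.6f}"
    host = ET.SubElement(root, "host_info")
    # 128 MiB is 134217728 bytes, the value used in the sample file.
    ET.SubElement(host, "m_nbytes").text = f"{float(memory_bytes):.6f}"
    return ET.tostring(root, encoding="unicode")

xml_text = build_init_data()
print(xml_text)
```

Writing `xml_text` to a file named init_data.xml in the directory the application is started from would then stand in for the hand-edited version.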

Offline Raistmer

Re: BOINC as library
« Reply #43 on: 25 Nov 2007, 03:51:59 am »
Thank you, but my app does write a checkpoint (maybe only every 300 sec, but it does). It can't restore the computation state from the saved data; that is what I meant when I wrote "checkpointing broken".
« Last Edit: 25 Nov 2007, 04:31:49 am by Raistmer »

 
