BOINC as library
Raistmer:
Well, some initial results.
Most of the time in my version is spent in sse3_ChirpData_ak
(1 function, 78 instructions, Total: 12300 samples, 19.37% of samples in the module, 5.12% of total session samples)
[these lines take the most:
Address Line Trace Source Code Bytes Timer samples
0x4a9dfe 125 m = vec_recip3(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y))); 3989
0x4a9d76 111 c = _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(y, CC3), 3631
]
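For context, vec_recip3 is most likely a fast packed-reciprocal helper; a minimal sketch of what such a helper typically looks like (this is a guess at its shape, not the app's actual source):
--- Code: ---
#include <xmmintrin.h>  // plain SSE intrinsics; nothing here actually needs SSE3

// Hypothetical sketch of a vec_recip3-style helper: approximate 1/x for four
// packed floats with the rcpps estimate, then refine with one Newton-Raphson
// step  r' = r * (2 - x * r).  It replaces a slow divps with a few multiplies,
// which is exactly the kind of code that ends up hot inside a chirp loop.
static inline __m128 vec_recip3_sketch(__m128 x)
{
    __m128 r   = _mm_rcp_ps(x);                                // ~12-bit estimate of 1/x
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(x, r)));   // one refinement step
}
--- End code ---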
It's pretty strange because I used SSE2 build options... Maybe the function name is not quite adequate?... (or maybe something is wrong with the profiler or my understanding of its results :) )
The next one is fastcopy_I
(1 function, 24 instructions, Total: 8714 samples, 13.72% of samples in the module, 3.63% of total session samples)
and in the IPP DLL most samples hit ippsZero_8u
(1 function, 1160 instructions, Total: 46923 samples, 99.89% of samples in the module, 19.55% of total session samples)
[this line is the leader:
Address Code Bytes Instruction Symbol Timer samples
0x200ede9 0x 0F 28 4C 32 10 movaps xmm1,[edx+esi+10h] ippsZero_8u+1473833 5965
]
1 instructions, Total: 5965 samples, 4.87% of samples in module p:\bin\intel\ipp\5.3\ia32\bin\ippst7-5.3.dll, 0.99% of total session samples
As one can see, it's almost the only function called in the whole DLL... (very strange too).
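For reference, ippsZero_8u only clears a byte buffer, so for it to dominate the DLL the code must be zeroing large buffers very often; a hedged illustration of the kind of call involved (the buffer name and error handling are made up):
--- Code: ---
#include <ipps.h>   // Intel IPP signal-processing header

// Illustration only: zeroing a large scratch buffer each pass.
// ippsZero_8u(pDst, len) writes len zero bytes, typically with wide aligned
// stores (hence the movaps seen in the profile).  Called on a multi-megabyte
// buffer per iteration, it can easily dominate the DLL's samples.
void clear_scratch(Ipp8u* scratch, int lenBytes)
{
    IppStatus st = ippsZero_8u(scratch, lenBytes);
    if (st != ippStsNoErr) {
        // report the error; ippGetStatusString(st) gives a readable message
    }
}
--- End code ---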
It was a 240-second profiling run. What time scale is best suited for profiling all the main app activities? I will try to increase the profiling time; maybe that will give more adequate results...
An addendum:
sse_sum2, 3, 4, 5 and sse_f_GetPeak have the highest number of unaligned accesses.
sse3_ChirpData_ak and fastcopy_I have the most data cache misses.
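If those routines are hitting unaligned loads because their buffers are not on 16-byte boundaries, forcing alignment at allocation time is the usual fix; a minimal sketch (the array names are illustrative, not the app's actual buffers):
--- Code: ---
#include <xmmintrin.h>
#include <malloc.h>   // _mm_malloc / _mm_free on MSVC

// Illustration only: allocate a float array on a 16-byte boundary so the SSE
// summing/peak routines can use aligned loads (movaps) instead of unaligned ones.
float* alloc_power_array(size_t n)
{
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

void free_power_array(float* p)
{
    _mm_free(p);
}
--- End code ---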
Josef W. Segur:
--- Quote from: Raistmer on 22 Nov 2007, 05:49:44 pm ---Well, some initial results.
Most of the time in my version is spent in sse3_ChirpData_ak
...
It's pretty strange because I used SSE2 build options... Maybe the function name is not quite adequate?... (or maybe something is wrong with the profiler or my understanding of its results :) )
--- End quote ---
The program design is to build all the hand-optimized code with at least whatever minimum options are required, then use run-time testing of the host to decide which of those routines to test. So the opt_SSE3.cpp module is built with its needed SSE3 setting, your CPU supports SSE3, and it tests faster than the other chirp routines on your system so is chosen as the one to use during actual crunching.
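In other words, every variant is compiled in (each source file with the ISA options it needs) and a quick benchmark at start-up picks the winner; a rough sketch of that pattern, with made-up names rather than the app's actual code:
--- Code: ---
#include <vector>
#include <ctime>

// Sketch of benchmark-based dispatch: all hand-tuned variants are linked in,
// the host's CPU features decide which are eligible, and the fastest eligible
// routine is used for the rest of the run.
typedef void (*ChirpFn)(float* data, int len, double chirpRate);

struct ChirpCandidate {
    ChirpFn fn;
    bool    supported;   // e.g. result of a CPUID check for SSE3
};

ChirpFn pick_chirp(const std::vector<ChirpCandidate>& candidates,
                   float* testData, int len)
{
    ChirpFn best = 0;
    double  bestTime = 1e300;
    for (size_t i = 0; i < candidates.size(); ++i) {
        if (!candidates[i].supported) continue;
        std::clock_t t0 = std::clock();
        candidates[i].fn(testData, len, 1.0);                    // timed trial run
        double dt = double(std::clock() - t0) / CLOCKS_PER_SEC;
        if (dt < bestTime) { bestTime = dt; best = candidates[i].fn; }
    }
    return best;   // e.g. sse3_ChirpData_ak on an SSE3-capable host
}
--- End code ---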
The fraction of time spent chirping is very much affected by the angle range. The reason WUs at high angle range are quick is that they do no Gaussian fitting and not much Pulse or Triplet finding. Chirping is also reduced, but not so much, so it becomes more of the total run time. I don't know CodeAnalyst, so don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction.
Joe
Jason G:
--- Quote from: Josef W. Segur on 22 Nov 2007, 07:51:20 pm ---...so don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction. Joe
--- End quote ---
As well as the familiar/traditional instrumented 'Device Under Test' style of profiling, VTune, and I guess from this data CodeAnalyst too, collect the OS/system counters, so data is available on all processes/threads running at the time of the test.
Without having seen the rest of the data (and presuming time-based sampling was used rather than event-based sampling):
From the given information, if it were VTune: for the module/process which spent ~20% of its time in the chirp routine, that 20% self time constituted about 5% of system time. This 'might' imply the total self time of the module makes up roughly 25% of the system time (5.12% / 19.37% ≈ 26%).
That might suggest a single-threaded module going full pelt (constant 100% usage) on 1 core of a quad. Constant 100% usage would be one of the first system-level optimisation goals: !!!!GOAL!!!! achieved, move on to the further optimisation levels --> application architecture level --> microarchitecture level.
Otherwise, if it's a dual or single core, then it may be using less than 100% of the available system CPU time ... either other processes were running and taking system resources during the profile (you can diagnose system problems like this), or the module is IO or memory bound (which might suggest deeper optimisation once system problems are eliminated).
Again those are just guesses / general guidelines without looking at other data. At system level, for example, the proportion that the total module samples were of total system samples, and especially a CPU usage graph by module, might tell you that you forgot to stop BOINC (done it many times), or that a virus scan had started, a Windows update, maybe you were watching a DVD? LOL (joke)
Jason
Raistmer:
:)
I wasn't watching a DVD ;) The host under test is an AMD Athlon 64 3200+ (Venice core); SSE3 support is indeed available.
Yes, with a timer-based profile CodeAnalyst gathers data on the whole system. Yes, BOINC was running in the background (an Einstein task at that very time). I'm interested only in the time distribution inside the SETI exe and the IPP DLL, so I didn't bother stopping/restarting BOINC during the test. That makes the "total system time %" meaningless, sure. But SETI should take ~50% of CPU time in this situation, not just 25%. Maybe CodeAnalyst counted the IPP DLL as a distinct module?...
Work Unit Info
True angle range: 0.405774
Any comments about why ippsZero_8u takes the most time, please?
And (judging by the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).
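One way to confirm which code path the dispatcher actually picked is to ask the library itself; a small sketch (worth checking the exact IppLibraryVersion field names against the 5.3 headers):
--- Code: ---
#include <cstdio>
#include <ipps.h>

// Prints which ipps variant the dispatcher loaded.  On IPP 5.x the short CPU
// code (e.g. "px" = generic C, "w7" = P4/SSE2, "t7" = Prescott/SSE3) tells you
// the optimisation target that was selected for this host.
void report_ipp_dispatch()
{
    const IppLibraryVersion* v = ippsGetLibVersion();
    std::printf("IPP signal lib: %s %s, target %c%c\n",
                v->Name, v->Version, v->targetCpu[0], v->targetCpu[1]);
}
--- End code ---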
Jason G:
--- Quote from: Raistmer on 23 Nov 2007, 03:18:31 am ---:)
I wasn't watching a DVD ;) ... But SETI should take ~50% of CPU time in this situation, not just 25%. ....
--- End quote ---
Right, so a single core (like my non-HT P4).
More guesses:
~50% 1 Einstein task
~1 to 5% - BOINC (higher because of context switching on a single core; I've measured it, see below)
~2 to 20% - CodeAnalyst (a high sampling rate increases the load)
~2 to 10% - other system/kernel drivers & services
subtotal: 55% -> 85% ... average ~70% ;)
remaining 45% -> 15%, average ~30% - your SETI run.
So before you can move on to a deeper optimisation level, you need to measure/graph with CodeAnalyst whatever the equivalents are of these VTune counter names:
1) With Boinc+Einstein+your seti task (Same conditions as you did)
- "System: Processor Queue Length"
- "System: Context Switches/sec" might also be helpful
2) Without Boinc+Einstein, just your seti task
- "System: Processor Queue Length"
- "System: Context Switches/sec" might also be helpful
Maybe some memory usage counters might also show something, if you have limited physical RAM etc...
"System: Processor Queue Length" (VTune name) gives a reading of how many NON-IDLE threads are waiting in the queue for CPU time .... On my 2.0 GHz non-HT P4 this typically averages about 5 with a SETI run (but not BOINC+SETI), which means that, for the software I run, I could benefit from a dual core of at least 2 GHz, preferably a bit more, to bring it into the range of 1 to 2. (A fast quad would probably be wasted on me, but it would give practically every running thread, on average, a fresh whole core to itself...)
"System: Context Switches/sec" might also give an idea of how much priority competition is happening on your machine (threads/modules competing). You see this rise slightly during mouse moves, or when more active background programs poll for something regularly (e.g. SpeedFan, BoincView); that shows up as speed humps in the context switches/sec graph.
--- Quote ---Any comments about why ippsZero_8u takes the most time, please?
And (judging by the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).
--- End quote ---
Mine spends quite a lot of time in a few of the IPP functions. When you get to do some application- and architecture-level performance measurement you will see the reasons; in some small way it might be partially related to the 'denormal data' issue you brought up before (take a look at the IPP flash tutorials about that). I've been thinking about ways to approach a custom (stripped-down) FFTW build for a while now, but I'm not ready yet.
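On the denormal point, the usual mitigation is to put the SSE unit into flush-to-zero / denormals-are-zero mode in the crunching thread; a hedged sketch (whether that is numerically acceptable for the science is a separate question):
--- Code: ---
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (needs CPU/OS support)

// Illustration only: treat denormal (subnormal) floats as zero so the SSE unit
// doesn't drop into its slow microcoded path as signal levels decay toward
// zero.  Call once per worker thread, before the heavy loops start.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
--- End code ---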
The use of the standard library, and the fact that it would be a DLL on a single core, would be an issue too (probably extra context switches / CPU queue length)... which means that, like me, you'd probably benefit from an extra core ;) So if you need to justify going to more cores for Santa to bring one, then "I need one for software development purposes" is probably a pretty good reason to add to the list ;D.
IMO, from the measurements I get, it is a myth that software doesn't benefit from multicore or even HT yet. Who runs only one single-threaded process at a time? Only DOS! [And perhaps reviewers doing synthetic benchmarks.] The Windows OS handles all the thread switching much better with multicore or even HT, for DLLs and services. Even without BOINC/SETI running, system responsiveness and use of system resources would improve for us :D
Jason