First Time build try, 2.2B [&2.3S9] science App

Jason G:
Okay, New profile pretty much agrees with the last, except for a few key differences/observations.

[Notes for my later reference, and for others' interest, comparison, etc.]

Platform: Intel Northwood p4 2.0A @ 2.1 GHz , 1Gb DDR400 RAM@CAS 3
OS Windows XP Pro w/sp2
App: Local 2.3S9 Build, ICC+IPP, xW , SSE2
Applied build fixes: Multiplier, MKL Include, Switches, dinclude directories
Performance Collectors: Call Graph, Counter Data (cpu% & sys proc queue length), Primary EBS events
Total 19 Runs, with Test Workunit Only

From Counter summaries /Sampling (Test Unit Used expected to have skewed results, will rerun with real WU when more is understood)
-- memcpy is now second on the list @ 9.5%, and was auto mapped to _intel_new/fast_memcpy. (Call graph works, so I can see this properly now).
[removed error]
--- w7_ipps_BitRev1_C ~8% (Intel IPP library, called internally it seems)
--- sse_vTranspose4ntw ~7%
--- GaussFit ~5.3%
--- Then find_pulse ~5%

Top 4 Hotspots by Self Time from Call Graph
---- GaussFit, find_pulse !!! (That's better), w7_ipps_BitRev1_C, _intel_fast_memcpy

TODO: Compare Call Graph vs Sampling Timings...

Memcpy:
--- _intel_fast_memcpy looks good for small to medium transfers, a bit better than memcpy. It's a standard dword copy, no MMX/SSE. We'll get little or no benefit from changing this one for small runs (except perhaps with an inlined, dedicated, tailless copy).
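
For reference, here is a rough sketch of what an "inline, dedicated, tailless" copy could look like. The name copy_dwords is made up for illustration and is not in the science app or the ICC runtime. It assumes the caller always passes a whole number of 32-bit words and buffers that do not overlap, which is what lets it skip the byte-wise tail loop. Whether it would actually beat _intel_fast_memcpy is untested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the science app or of ICC's runtime.
   Copies n_dwords 32-bit words. The caller guarantees that the length is a
   whole number of dwords and that the buffers do not overlap, so there is
   no byte-wise tail loop and no overlap check. */
static inline void copy_dwords(uint32_t *restrict dst,
                               const uint32_t *restrict src,
                               size_t n_dwords)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_dwords; ++i)
        dst[i] = src[i];
}
```

Being static inline, it can be expanded at the call site, so a small fixed-size copy avoids the call overhead. That is the only benefit hoped for here.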

--- of the roughly 10% total time in intel_memcpy:
- 75% = called from seti_analyse ---> 7.5% of total execution time
- 9% = called from optimize_init ---> ~0.9% of total execution
- 8% = called from analyse_pot --> ~0.8% of total execution
- 3.3% = called from GetFixedPot --> ~0.3% of total execution ( Matches previous profile/Joe's/My estimations!)
Remaining tiny-percentage calls to intel memcpy total <1% of execution.
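
The arithmetic behind the breakdown above, as a small sketch. share_of_run is a made-up helper name, and it uses the rounded 10% figure for memcpy rather than the 9.5% sampled value.

```c
#include <assert.h>

/* Hypothetical helper: converts a caller's share of memcpy time into a
   share of the whole run. Both arguments and the result are percentages. */
double share_of_run(double memcpy_pct_of_run, double caller_pct_of_memcpy)
{
    assert(memcpy_pct_of_run >= 0.0 && memcpy_pct_of_run <= 100.0);
    assert(caller_pct_of_memcpy >= 0.0 && caller_pct_of_memcpy <= 100.0);
    return memcpy_pct_of_run * caller_pct_of_memcpy / 100.0;
}
```

For example, share_of_run(10.0, 75.0) gives the 7.5% quoted for seti_analyse. The four listed callers add up to 75 + 9 + 8 + 3.3 = 95.3% of memcpy time, which leaves about 4.7% of memcpy time (roughly 0.5% of the run) for everything else, in line with the "<1%" remainder.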

Jason

Jason G:
Ahhh, now it's finally starting to lead me to more rational places. The second profile run has brought me full circle back to the FoldArray routines underneath find_pulse (the SSE routines I mentioned in an early post). There seem to be a few P4-specific issues in there for me to learn about :D It looks like even a tiny change here would affect the whole run.

No more messin' with memcopies just yet! There be cache problems in there :D

Jason

_heinz:
@ jason
here you see what francois found...
;)

Jason G:
Cool, nice to see some similarity despite different platforms (I assume he's done that on a Core 2). I think that may be showing total events grouped by function, which includes all callees etc. That can help you drill down to see what inside the function is expensive. He would drill into seti_analyze now to get more detail.

If you do drill in through seti_analyze, you find the source of the fat memcpys and stop at those (they have no callees). If you then back up and drill through find_pulse, you find that as deep as you can go is the FoldArray routines (they have no callees either). The FoldArray routines have the SSE2 instructions, pointer aliasing and dependency loops on a P4. It looks like the same issues may be relevant on the Core 2 after all.

(I thought many were just p4 issues)
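
To illustrate the pointer aliasing point only: below is a minimal scalar sketch, not the app's actual FoldArray code. fold_add and its parameters are made-up names, and it assumes the fold step boils down to adding two segments into a separate output buffer. Without restrict the compiler has to assume a store to out might change a or b, which can force reloads and block vectorization. With restrict the iterations are declared independent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fold step, not the app's FoldArray code.
   Adds two segments element by element into a separate output buffer.
   `restrict` promises the compiler that the three buffers do not overlap,
   so it may keep values in registers and vectorize the loop. */
void fold_add(float *restrict out,
              const float *restrict a,
              const float *restrict b,
              size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

If the real routines fold in place (output overlapping an input), this promise would not hold and restrict could not be used as written. That would need checking against the source.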

All good fun, Nice to know I seem to be using vtune OK. Gotta get back to Maths study for a while .. but I'll be Back! :D
