Seti@Home optimized science apps and information
Optimized Seti@Home apps => Windows => Topic started by: Jason G on 30 Sep 2007, 01:46:00 pm
-
Well, I tested the 2.2B science app I built, using a Bench1.cmd I modified to test just my build and the default 515:
"Results Strongly Similar" :D Maybe tomorrow I'll fix up the Multiplier and test the other WUs...
Northwood P4 2.0 @ 2.1 GHz:
2.2B built here (QxW, USE_SSE2): 11 mins 39 secs
Default 515 (from KWSN-test-Package2): 28 mins 20 secs
Impressive work Chickenses! :D
I have some minor questions from my experience in getting this to build this evening....
1) Old BOINC API source: I gather the ones I'm using are old (the ones from the 1.31 source package, in which I 'jiggered' the deprecated code in a couple of places to get it to work on VS2005 Pro). I tried getting newer ones from BOINC but their site was down :( Is it recommended to use the latest available for this, or doesn't it matter?
2) Setting Project Options: I found I had to set /QxW in almost every one of the nine projects, and change USE_SSE3 to USE_SSE2 where applicable. Is there a master place to set this that I've missed? If not, then it must have been a pain for you guys to build all those different versions!
3) Include Directories: Some of these seemed to have "Program Files (x86)"; is that a Vista thing? (Sorry, I don't do Vista :D)
4) 2.4 Sources: Are the 2.4 or 2.4V sources available? I noticed running this no longer has the 'stutter'; was there much change to achieve that?
Thanks for the hard work you guys have put in; it's been fun having a go, that's for sure!
Jason
-
...
I have some minor questions from my experience in getting this to build this evening....
1) Old BOINC API source: I gather the ones I'm using are old (the ones from the 1.31 source package, in which I 'jiggered' the deprecated code in a couple of places to get it to work on VS2005 Pro). I tried getting newer ones from BOINC but their site was down :( Is it recommended to use the latest available for this, or doesn't it matter?
The 2.2B sources were branched off the official CVS sources more than a year ago. I tend to use BOINC source from about the same time frame for my test builds. I'm not sure what Simon used for the 2.2B builds, but I think it was unchanged from earlier builds. Since you're using VS2005, perhaps the best thing would be to choose BOINC sources from the time they were trying that; I think they've dropped back to VC2003 for the most recent builds.
2) Setting Project Options: I found I had to set /QxW in almost every one of the nine projects, and change USE_SSE3 to USE_SSE2 where applicable. Is there a master place to set this that I've missed? If not, then it must have been a pain for you guys to build all those different versions!
For the 2.4 builds Simon worked up a combined arrangement where he could just select which target he was building for; it does look like for 2.2B he was going through and making a bunch of changes.
3) Include Directories: Some of these seemed to have "Program Files (x86)"; is that a Vista thing? (Sorry, I don't do Vista :D)
4) 2.4 Sources: Are the 2.4 or 2.4V sources available? I noticed running this no longer has the 'stutter'; was there much change to achieve that?
Thanks for the hard work you guys have put in; it's been fun having a go, that's for sure!
Jason
Simon doesn't do Vista either, but does have 64-bit XP.
I don't have the exact 2.4 sources, but have uploaded the final development branch (2.3S9) code as seti_boinc_2k3_2.3-S9-Win32-Sources.zip (http://users.westelcom.com/jsegur/SAHenh/seti_boinc_2k3_2.3-S9-Win32-Sources.zip). It is not actually PKzip, rather 7zip, but the servers at my ISP are dumb. I believe the code is identical to what Simon used for 2.4, the only changes were to compile options and the identification.
I also don't have Crunch3r's final 2.4V sources; IIRC there were a few revisions after the latest I have.
Joe
-
Thank you very much, sir. That clarifies A LOT. I have come across some of the extra-strict compile-time type checking that seems to have been introduced in VS2005 WRT the BOINC API etc. It won't bother me while playing with the source. I suspect there might also be added run-time overhead involved there, so maybe 2003 produced slightly faster executables as a result... LOL. Thanks for the newer source; I'll give it a go against the boincapi I have tomorrow, then maybe start investigating some of the much more important math stuff.
Regards, Jason
-
Well, that source works! (with similar 'jiggering' necessary for VS2005)
Northwood P4 2.0 @ 2.1 GHz (Test WU 1 only; made sure nothing else was running this time :D):
jason-KWSN2-2B-xW, (SSE2) 11 mins 01 secs
jason-KWSN_2.3S9_MB_xW, (SSE2) 10 mins 59 secs
default-515 27 mins 10 secs
Jason
-
Hi Jason,
I'm impressed by your fast work, congratulations.
Which compiler did you use, Intel or MSC?
Heinz
-
Thanks, haven't really started yet though! Just exploring the sources etc. I'm using Intel [ICC & IPP] on VS2005 Pro, which seems to require a tweak here and there to compile. It seems easier to build this later one, as the /Qx & SSEn configuration is set up to apply itself to each project... much better.
-
Okay, some crude attempts at profiling [2.3S9] with the dummy workunits seem to be leading me straight to the already heavily vectorised SSE code (several sum and transpose functions). They look damn good at asm level.
I am a bit surprised that a lot of time (about 11% on my system) seems to be spent in Intel's implementation of memcpy. Haven't worked out why yet. I'm pretty sure I've seen better vectorised versions of that, but can't be sure... there seems to be a little something extra in that function... on a hunch I really think MSVC's version might possibly be faster [due to that something extra].
The pulse finding and chirping functions themselves showed up much lower down the list as far as percent of total execution is concerned.
I guess this might be because I'm using dummy test WUs. At some stage I think I'll have to test with a few copies of real ones out of my BOINC cache, as I may be being led up the garden path :D.
Jason
-
Okay, some crude attempts at profiling [2.3S9] with the dummy workunits seem to be leading me straight to the already heavily vectorised SSE code (several sum and transpose functions). They look damn good at asm level.
I am a bit surprised that a lot of time (about 11% on my system) seems to be spent in Intel's implementation of memcpy. Haven't worked out why yet. I'm pretty sure I've seen better vectorised versions of that, but can't be sure... there seems to be a little something extra in that function... on a hunch I really think MSVC's version might possibly be faster [due to that something extra].
Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.
The pulse finding and chirping functions themselves showed up much lower down the list as far as percent of total execution is concerned.
I guess this might be because I'm using dummy test WUs. At some stage I think I'll have to test with a few copies of real ones out of my BOINC cache, as I may be being led up the garden path :D.
Jason
The shortened test WUs do emphasize what's done during startup and give more weight to the zero chirp testing. I agree that profiling to determine the hot spots would be more accurate with full WUs, but getting a spread of angle ranges is important too.
Joe
-
Hmm, thanks again Joe. After I collect some more data (to see if memcpy is really as hot as it looks... well, at least on my old P4 beast) I'll see if I still have some of my old memcpy versions on backups, and try to figure out some comparisons. I vaguely remember an MMX version that worked out faster than either the regular or SSE/SSE2 versions for data blocks from 1 MB to 200 MB. If I can find it I'll try to figure out whether it could be applicable.
-
Hi Jason,
If you find a very fast memcopy for MMX it would be great; I have a diskless (2 GB CompactFlash) dual Pentium 200 MMX box crunching as a test machine.
Here (http://setiathome.berkeley.edu/show_host_detail.php?hostid=3670222) you can see it.
Regards, Heinz
-
I think I found some of the versions. It was two years ago and I went through quite a lot of experiments at the time, so I'm not sure which one was best. I'll try to make a test with a few different sizes over the next few days. [I'll try testing them against MS's and Intel's.]
Jason
-
Okay, crude preliminary memcpy tests:
[edited bandwidths & typos]
Machine:
P4 (Northwood 2.0A) @ 2.1 GHz, 512 KB L2 cache, 1 GB DDR400 @ CAS 3
Running XP SP2, BOINC disabled, minimal background processes
Test:
200 MB buffer, copied 100 times per function, then verified using memcmp; the source contents are changed for each function.
Timing uses the clock() function around the copies only; I can't find my good macros for this but I'll keep looking.
Functions are 'thrown into' asm blocks; alignment and register preservation still need checking.
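Roughly, the harness is just this (clock() granularity is why it's crude; each asm-block variant gets plugged in as a function pointer, and the names here are mine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (200u * 1024u * 1024u)  /* 200 MB source & dest buffers */
#define COPIES    100                     /* copies per candidate function */

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

static void std_copy(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }

/* fill src with a fresh pattern, time COPIES copies, verify with memcmp */
static void bench(const char *name, copy_fn fn, void *dst, void *src, int fill)
{
    clock_t t0, t1;
    int i;
    memset(src, fill, BUF_BYTES);
    t0 = clock();
    for (i = 0; i < COPIES; i++)
        fn(dst, src, BUF_BYTES);
    t1 = clock();
    if (memcmp(dst, src, BUF_BYTES) != 0)
        printf("%s: MISCOMPARE!\n", name);
    else
        printf("%s: %.1f s = %.2f MB/s\n", name,
               (t1 - t0) / (double)CLOCKS_PER_SEC,
               COPIES * 200.0 / ((t1 - t0) / (double)CLOCKS_PER_SEC));
}

int main(void)
{
    void *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    if (!src || !dst) return 1;
    bench("memcpy (CRT)", std_copy, dst, src, 'J');
    /* ...each asm-block variant benched the same way... */
    free(src); free(dst);
    return 0;
}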
Results:
memcpy - MS Visual Studio Pro 2005, 59.1 seconds = 338.41 MB/s, reference, ~2.64 Gbit/s
mmxcopy - bog-standard MMX code, 59.3 seconds = 337.27 MB/s, 0% speed change, ~2.63 Gbit/s
dwordcopy - standard instructions, 59.7 seconds = 335.01 MB/s, -1% speed change, ~2.62 Gbit/s
xmmcopy - bog-standard SSE/2, 59.0 seconds = 338.98 MB/s, 0% speed change, ~2.65 Gbit/s
xmmcopynt - SSE2 non-temporal writes, 41.2 seconds = 485.44 MB/s, +43% speed change, ~3.79 Gbit/s
mmxcopynt - MMX non-temporal writes, 40.4 seconds = 495.05 MB/s, +46% speed change, ~3.87 Gbit/s
Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty much even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique somewhere on their website in PDF form, but I don't own an AMD processor so I can't test it.
I haven't tested the Intel memcopy yet either (SETI compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... maybe I'll get to that later in the week. Until then I'll poke at this code, see if I can clean it up a little, and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup from bodgy asm blocks.
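For reference, the mmxcopynt inner loop boils down to something like this in intrinsics form (a cleaned-up sketch rather than my actual asm block; assumes 8-byte aligned pointers, n a multiple of 32 bytes, and a 32-bit build, since MSVC has no MMX intrinsics on x64):

#include <stddef.h>
#include <mmintrin.h>   /* MMX */
#include <xmmintrin.h>  /* _mm_stream_pi, _mm_sfence */

/* mmxcopynt: MMX loads with non-temporal (movntq) stores, so the huge
   destination buffer never pollutes the cache. */
static void mmx_copy_nt(void *dst, const void *src, size_t n)
{
    __m64 *d = (__m64 *)dst;
    const __m64 *s = (const __m64 *)src;
    size_t i;
    for (i = 0; i < n / 32; i++, s += 4, d += 4) {
        _mm_stream_pi(d + 0, s[0]);   /* movntq: write around the cache */
        _mm_stream_pi(d + 1, s[1]);
        _mm_stream_pi(d + 2, s[2]);
        _mm_stream_pi(d + 3, s[3]);
    }
    _mm_sfence();   /* flush the write-combining buffers */
    _mm_empty();    /* emms: hand the MMX registers back to the FPU */
}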
Jason
-
...
Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty much even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique somewhere on their website in PDF form, but I don't own an AMD processor so I can't test it.
If you're thinking of Using Block Prefetch for Optimized Memory Performance (http://cdrom.amd.com/devconn/events/AMD_block_prefetch_paper.pdf), that software block prefetch should work on Intel processors too.
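The core of the paper's technique, as a rough intrinsics sketch (my rendering, untested; assumes 16-byte alignment and a length that's a multiple of the block size):

#include <stddef.h>
#include <emmintrin.h>

#define PF_BLOCK 4096  /* bytes fetched per pass, in the spirit of the paper */

/* Block prefetch copy: phase 1 touches one qword per cache line of the
   block, walking backwards so the reads stream from memory at full rate;
   phase 2 copies the now-cached block out with non-temporal stores. */
static void block_prefetch_copy(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t blk, off;
    volatile long long sink;  /* volatile so the touch loads aren't elided */
    for (blk = 0; blk < n; blk += PF_BLOCK) {
        /* phase 1: pull the whole block into cache, last line first */
        for (off = PF_BLOCK; off > 0; off -= 64)
            sink = *(const long long *)(s + blk + off - 64);
        /* phase 2: stream it to the destination, bypassing the cache */
        for (off = 0; off < PF_BLOCK; off += 16)
            _mm_stream_si128((__m128i *)(d + blk + off),
                             _mm_load_si128((const __m128i *)(s + blk + off)));
    }
    _mm_sfence();
    (void)sink;
}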
I haven't tested the Intel memcopy yet either (SETI compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... maybe I'll get to that later in the week. Until then I'll poke at this code, see if I can clean it up a little, and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup from bodgy asm blocks.
Jason
I haven't yet started counting the places where memcpy is used on large blocks vs. those which are small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
Joe
-
If you're thinking of Using Block Prefetch for Optimized Memory Performance (http://cdrom.amd.com/devconn/events/AMD_block_prefetch_paper.pdf), that software block prefetch should work on Intel processors too.
A-hah! There 'tis; I've been looking for that... damn, that code looks a lot like the one I just tested. Maybe I can squeeze a bit more out of it then :D
I haven't yet started counting the places where memcpy is used on large blocks vs. those which are small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
Joe
Neither have I, as I don't know the datasets well enough yet to determine what some of the size parameters mean. However, two places that struck my eye were the three calls in pulse_find, which seem to be at the end of their code blocks, perhaps suggesting that the data is not immediately needed; and secondly the one at the start of chirpdata, which looks to be needed in cache immediately and may one day be worth a test, as the source and destination both seem to be in use in the following code.
There may be many more places, I simply haven't looked that far.
All good fun
-
Trying with ICC, it seems less than useful for SETI at this stage, both from looking at the source and because converting to ICC is doing something weird and making it slower! Back to maths homework tomorrow :S I hope to figure out what it's doing (just for fun) in another couple of weeks. It's been interesting... 'till next time :D
-
Okay, between maths homework, I'm still trying to get a handle on the overall algorithms employed in the client. I found this page a while back, from SETI Classic, that gave me some starters: http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html. Can I expect major differences other than the sizes/durations of datasets in enhanced/MB? On the surface the logical sequence and methods appear similar...
-
http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html....
Yes, that's still a fairly good overall description.
The duration of a WU is unchanged since the beginning of S@H, ~107.37 seconds (2^20 data points at 9765.625 samples per second). The classic project started with the assumption that the beam width of the Line feed was 0.1 degrees, that was later refined to 0.083 degrees, and for the ALFA receivers it's 0.05 degrees. So a target now is seen as within the beam width for half the time originally assumed. Other changes for setiathome_enhanced cause significantly more processing to improve the sensitivity, basically that reduced the spacing in chirps at which searches at a particular FFT length are done. And of course the MB WUs now do de-chirping all the way from -100 to +100 Hz/sec.
I did a quick survey of the memcpy uses today. There's one in seti.cpp which copies the full 8MB array, but it is only used once. Those at the beginning of the chirp functions you noted are similar, but they are also only used once, for the zero chirp case. But I'm thinking that it might be effective to use the block prefetch at the entry to all chirp functions, turning it into an effective memcpy for that zero chirp case but a three-phase operation for the actual chirps. I certainly want to add an MMX chirp variant with the non-temporal writes, if I can figure out the best way to do so for MS, Intel, and GCC. I like assembler, but it gets complex to deal with both MASM and AT&T versions.
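For orientation, a chirp function in outline is something like this scalar sketch (my simplification, not the actual project code; sign conventions glossed over). At rate == 0 every (c, s) pair is (1, 0), so the loop degenerates to a straight copy, which is why the zero chirp case can be served by a memcpy:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Scalar sketch of de-chirping: rotate each complex sample by a phase
   that grows quadratically with time. src/dst are interleaved complex
   floats (re, im), n samples, rate in Hz/sec. */
static void chirp_sketch(float *dst, const float *src, long n,
                         double rate, double sample_rate)
{
    long i;
    for (i = 0; i < n; i++) {
        double t   = i / sample_rate;
        double ang = 2.0 * M_PI * 0.5 * rate * t * t;
        float  c   = (float)cos(ang);
        float  s   = (float)sin(ang);
        dst[2*i]   = src[2*i] * c - src[2*i+1] * s;
        dst[2*i+1] = src[2*i] * s + src[2*i+1] * c;
    }
}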
The instances in find_pulse you noted can't be used more than 31 times in one WU, as they are used in reporting a found pulse. And the length can be anything from 16 to 40960 single floats.
The heavy use of memcpy is probably mainly in two areas.
1. The GetFixedPoT() function is used before every Gaussian fitting operation, and 3/4 of those calls are handled with memcpy, something like 4.33e8 times in an ar=0.41 MB WU. The length is 64 single floats, but like many other things is specified by a value in the WU header so the possibility it could change must be provided for.
2. Copying the part of the chirped data on which an FFT is going to be done. Those range from 64 bytes to 1 MiB, but are always multiples of 64. It might be better to change to out-of-place IPP FFTs than to try to optimize the memcpy, but I'm not sure.
Joe
-
Thanks for the expanded descriptions. I've been weeding my way through the main seti_analyze to map the general process against what was written. FFTs hiding in all sorts of places :D
As far as usage of assembled files (integration difficulty) or even _asm blocks goes (for some reason inline asm seems to disable compiler optimisations in my ICC/IPP build :(), I too would stay away from them in this particular code (even though I prefer assembly). If you can use intrinsics to do the same thing, it should be fairly portable across compilers with a few defines (intrinsics seem to be used extensively elsewhere to good effect) and avoids prologue/epilogue hassles too.
Just for interest, general rules of thumb for mem copies (sounds like it matches what you're saying already; a rough dispatcher sketch is below):
1) Less than 64 bytes: use MOV, MOVQ, or MOVDQA (incremented pointers).
2) 64+ bytes: use REP MOVSD (standard C libraries' memcpy is usually a 'version' of this one, but I think the compiler can't always inline memcpy and flatten the code, so a direct REP MOVSD would be a lot less overhead here if possible).
3) 1 KB plus: use MOVQ or MOVDQA (unrolled x4 or x8, interleaved loops).
4) 2 MB plus: use MOVNTDQ or other non-temporal stores (limited [hardware] prefetch; interleaved counters, unrolled x8).
5) Large read -> process -> store passes require block prefetch, non-temporal writes, and interleaved loop counters.
I noticed a couple of caveats that AMD don't appear to mention in that document: 1) interleave counters and use more registers, and 2) on the classic P4 (SSE2), avoid pointer differences that are a multiple of 64 KB, particularly for IPP calls.
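Here's that dispatcher sketch (hypothetical code, not from the project source; thresholds straight from the rules above, SSE2 paths assume 16-byte aligned pointers, tails handed to plain memcpy):

#include <string.h>
#include <emmintrin.h>

/* Hypothetical size-based dispatcher following the rules of thumb. */
void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < 1024) {
        memcpy(dst, src, n);  /* small: CRT memcpy (rep movsd) is fine */
    } else if (n < 2u * 1024u * 1024u) {
        /* medium: cached SSE2 moves, 64 bytes per iteration */
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i, blocks = n / 64;
        for (i = 0; i < blocks; i++, s += 4, d += 4) {
            _mm_store_si128(d + 0, _mm_load_si128(s + 0));
            _mm_store_si128(d + 1, _mm_load_si128(s + 1));
            _mm_store_si128(d + 2, _mm_load_si128(s + 2));
            _mm_store_si128(d + 3, _mm_load_si128(s + 3));
        }
        memcpy(d, s, n % 64);
    } else {
        /* huge: non-temporal stores, destination won't fit in cache */
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i, blocks = n / 64;
        for (i = 0; i < blocks; i++, s += 4, d += 4) {
            _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
            _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
            _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
            _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
        }
        _mm_sfence();
        memcpy(d, s, n % 64);
    }
}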
(PS: Playing with the code a little more, I came across that disabled Intel Math include that says 'needs more work'. It doesn't; there are just duplicate includes. LOL, maybe that'll save 0.01%.)
[4.33e8 x 64 x 4 = 103 Gbytes ? :o]
-
Here's a quick play with copying 256 bytes (64 floats). I've left out all the MMX & SSE variants for these short copies, as the EMMS & fence overhead is too high and makes them take ~twice as long. [Note: no attention to alignment as yet in these.]
Testing blocksize 256 bytes
Memory space allocated for Source and Dest blocks
-----------------------------------------------
memsetting Source to 'JJJJ...'
Copying with memcopy (Source to Dest) 104857600 times
Comparing Source to Dest
...identical .... duration 16093 ms
-----------------------------------------------
memsetting Source to 'PPPP...'
Copying with inline rep movsd (Source to Dest) 104857600 times
floats=no tail needed here, should be similar to memcpy
Comparing Source to Dest
...identical .... duration 16359 ms
-----------------------------------------------
memsetting Source to 'QQQQ...'
Copying with inline register movs 2x unroll (Source to Dest) 104857600 times
no tail here, must have even #floats, can add tail if needed.
Should be slightly faster than memcpy & rep movsd
Comparing Source to Dest
...identical .... duration 7375 ms
-----------------------------------------------
Memory freed
Press any key to continue . . .
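For reference, the 2x unroll is nothing exotic; in C terms the asm block amounts to this (assumes an even float count, no tail handling):

#include <stddef.h>

/* C rendering of the 2x unrolled register-move copy: two independent
   loads then two stores per iteration, which gives the P4 a bit more
   to schedule than rep movsd's microcode. */
static void copy_floats_2x(float *dst, const float *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; i += 2) {
        float a = src[i];
        float b = src[i + 1];
        dst[i]     = a;
        dst[i + 1] = b;
    }
}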
I thought about the portability issue a little; what about .libs (existing ones or new)?
Still using MS here, though I found ICC had some inconvenient debug settings that spoiled things, so maybe I'll try some tests with that later in the week.
Jason
-
So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.
...
I thought about the portability issue a little, what about .libs? (existing ones or new)
...
Jason
I like the .lib idea, if we come up with enough tweaks to make it worthwhile.
Joe
-
So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.
Sounds reasonable, except that my ability to drive VTune at this stage is dubious at best, and the 'one other' who has publicly used it on the code is focusing on totally different areas. My hotspots [in terms of total execution time] went something like: memcpy, ...sum_ptt (or transpose something)..., GaussFit... an FFT or two... a possible chirp or two, then pulse_find.
Of course I recognise that's a completely different generation architecture etc...
64.4 seconds (std memcopy in ...) on this machine suggests to me about 0.3% of total execution time, more or less (with a potential 0.15% [of total] improvement (LOL) at that spot in GetFixedPoT), so there may well be another 10.7% waiting to be halved.
A call graph would help at this stage, but that part of VTune seems to crash the instrumented app for some reason... Anyone know how to stop it doing that (apart from disabling instrumentation)? [Or perhaps suggest a, preferably lighter-weight, profiler that will work with MSVC/ICC executables & debug info...]
Jason
-
Digging through the 'questionable profiles', the P4/Xeon 64K aliasing conflict issue I mentioned earlier seems to have registered 58.7e9 events. Avoiding the cache with those writes may help... [But it looks like over 50% of them are generated in a particular IPP call; there was a mention somewhere about operating on 'denormalised data' and 'using thresholds', I wonder if that applies...]
This event counts the number of 64K/4M-aliasing conflicts. On the Intel(R) Pentium(R) 4 processor, a 64K-aliasing conflict occurs when a virtual address memory reference causes a cache line to be loaded and the address is modulo 64K bytes apart from another address that already resides in the first level cache. Only one cache line with a virtual address modulo 64K bytes can reside in the first level cache at the same time. On Intel(R) Xeon(R) processors, the conflict occurs when the memory references are modulo 4M bytes.
For example, accessing a byte at virtual addresses 0x10000 and 0x3000F would cause a 64K/4M aliasing conflict. This is because the virtual addresses for the two bytes reside on cache lines that are modulo 64K bytes apart.
On the Intel(R) Pentium(R) 4 and Intel(R) Xeon(R) processors with CPUID signature of family encoding 15, model encoding of 0, 1 or 2, the 64K/4M aliasing conflict also occurs when addresses have identical values in bits 15:6. If you avoid this kind of aliasing, you can speed up programs by a factor of three if they load frequently from preceding stores with aliased addresses and there is little other instruction-level parallelism available. The gain is smaller when loads alias with other loads, which cause thrashing in the first-level cache.
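The usual dodge, as a quick sketch (my own illustration, not project code): stagger large allocations by an odd number of cache lines so the hot arrays stop colliding modulo 64K:

#include <stddef.h>
#include <stdlib.h>

/* Two large allocations often come back with the same alignment modulo
   64K, so corresponding elements of two hot arrays fight over the same
   L1 lines. Staggering each buffer by an odd number of cache lines
   (3 * 64 bytes times its index) breaks that pattern.
   NB: a real version must keep the raw pointer around for free(). */
float *alloc_staggered(size_t n_floats, int index)
{
    size_t pad = (size_t)index * 3 * 64;
    char *raw = (char *)malloc(n_floats * sizeof(float) + pad);
    return raw ? (float *)(raw + pad) : NULL;
}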
-
A call graph would help at this stage, but that part of VTune seems to crash the instrumented app for some reason...
Sorted ;) [New P4 primary events profile in the oven.]
-
@Jason
At the end of opt_FPU.cpp, line 600 ff., there is a memcopy named ultra_memcpy for different processor types.
Heinz
-
Hi again. I did see these a while back... when Joe Segur explained:
Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.
What I'm trying to establish is whether the preliminary measurements I did, which indicate my vintage P4 is spending so much time in memcpy, are valid. I have just finished the second profiling (19 calibrated runs of primary event monitors through a full workunit), so I'll know more soon.
Obtaining code for good memory copies isn't so much the issue for me, as I have tons of those optimised for different applications (sizes of data, platforms and contexts, tailorable to any dedicated context at will). What is gradually becoming clearer to me is that I only have a 512 KB L2 cache, and the generation of P4 that I have has an artifact called '64K aliasing conflicts' which causes cache thrashing [in L1 I think, but I'm still reading up on that], particularly during certain large IPP calls... and GaussFit and others. [The new run has 44.3 thousand million 64K aliasing conflicts sampled.]
Some areas, as Joe is pointing out, may benefit from interleaved (or phased) operations instead of a discrete memcpy. This allows multiple read/store/write operations in parallel AND avoids cache problems, but modifying the algorithms is challenging. It would require no direct use of memcpy as such, but a comprehensive understanding of how memory is used in a given part of the code, to implement a superior technique.
Another [most important] issue which Joe brought up is portability. In many cases that'll require more thought as there are many platforms, but there is a mechanism in place. That'll be the most challenging part of any improvements [Joe's Job :P LOL].
[Later: Also note that a REP MOVSD on a PC, like the unused one in that file, will always show 'about' the same performance as standard memcpy (because that's what memcpy is on a PC). I see no other implementations in the source I have, only a commented-out selection function (ultra_memcpy) which looks promising, but may or may not be the best approach if the cause of the memory issues remains elusive. It may turn out 'all' the heavy-traffic memcpy()s won't be there at all, which would be nice for my P4 :D]
-
Okay, the new profile pretty much agrees with the last, except for a few key differences/observations.
[Notes for my later reference, and others' interest, comparison etc...]
Platform: Intel Northwood P4 2.0A @ 2.1 GHz, 1 GB DDR400 RAM @ CAS 3
OS: Windows XP Pro w/ SP2
App: Local 2.3S9 build, ICC+IPP, xW, SSE2
Applied build fixes: Multiplier, MKL include, switches, include directories
Performance collectors: Call Graph, Counter Data (CPU % & system processor queue length), primary EBS events
Total 19 runs, with test workunit only
From counter summaries/sampling (test unit used, expected to have skewed results; will rerun with a real WU when more is understood):
-- memcpy is now second on the list @ 9.5%, and was auto-mapped to _intel_new/fast_memcpy. (Call graph works, so I can see this properly now.)
[removed error]
--- w7_ipps_BitRev1_C ~ 8% (called internally by the Intel IPP library, it seems)
--- sse_vTranspose4ntw ~7%
--- GaussFit ~5.3%
--- Then find_pulse ~5%
Top 4 Hotspots by Self Time from Call Graph
---- GaussFit, find_pulse !!! (That's better!) then w7_ipps_BitRev1_C, _intel_fast_memcpy
TODO: Compare Call Graph vs Sampling Timings...
Memcpy:
--- _intel_fast_memcpy looks good for small-to-medium transfers, a bit better than memcpy. It's a standard, no-MMX/SSE dword copy. We'll get little or no benefit changing this one for small runs (except perhaps via an inline, dedicated, tailless copy; rough sketch at the end of this post).
--- of the roughly 10% of total time in intel_memcpy:
- 75% = called from seti_analyze --> 7.5% of total execution time
- 9% = called from optimize_init --> ~0.9% of total execution
- 8% = called from analyze_pot --> ~0.8% of total execution
- 3.3% = called from GetFixedPoT --> ~0.3% of total execution (matches the previous profile, Joe's and my estimations!)
The remaining tiny-percentage calls to intel memcpy total <1% of execution.
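The inline, dedicated, tailless idea for the GetFixedPoT case could be as dumb as this sketch (assumes 16-byte alignment; a real version needs a memcpy fallback in case the WU header ever specifies a length other than 64 floats):

#include <xmmintrin.h>

/* Dedicated, tailless copy for the GetFixedPoT case: exactly 64 floats
   (256 bytes), cached stores since the data is consumed immediately.
   Fixed trip count and no size tests, so the compiler can inline and
   fully unroll it. */
static void copy_64_floats(float *dst, const float *src)
{
    int i;
    for (i = 0; i < 64; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}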
Jason
-
Ahhh, now finally starting to lead me to more rational places. The second profile run has brought me full circle back to the FoldArray routines underneath find_pulse (the SSE routines I mentioned in an earlier post). There seem to be a few P4-specific issues in there for me to learn about :D It looks like even a tiny change here would affect the whole run.
No more messin' with memcopies just yet! There be cache problems in there :D
Jason
-
@Jason
Here (http://www.v12extreme.com/seti/profile.PNG) you can see what Francois found...
;)
-
Cool, nice to see some similarity despite different platforms (I assume he's done that on a Core 2). I think that may be showing total events grouped by function, which will include all callees etc.; that can help you drill down to see what inside the function is expensive. He would drill into seti_analyze now to get more detail.
If you do drill in through seti_analyze, you find the source of the fat memcpys and stop at those (they have no callees); and if you then back up and drill through find_pulse, you find the deepest you can go is the FoldArray routines (they have no callees either). The FoldArray routines have the SSE2 instructions, pointer aliasing, and dependency loops on a P4. It looks like the same issues may be relevant on the Core 2 after all.
(I thought many were just P4 issues.)
All good fun. Nice to know I seem to be using VTune OK. Gotta get back to maths study for a while... but I'll be back! :D