Author Topic: First Time build try, 2.2B [&2.3S9] science App (Read 23508 times)

Jason G · « **Reply #15 on:** 07 Oct 2007, 11:43:45 am »

Okay, between maths homework, I'm still trying to get a handle on the overall algorithms employed in the client. I found this page a while back, from SETI Classic, that gave me some starters: http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html. Can I expect major differences, other than sizes/durations of datasets, in enhanced/MB? On the surface the logical sequence and methods appear similar ....

Josef W. Segur · « **Reply #16 on:** 07 Oct 2007, 07:27:32 pm »

Quote from: j_groothu on 07 Oct 2007, 11:43:45 am

http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html....

Yes, that's still a fairly good overall description.

The duration of a WU is unchanged since the beginning of S@H, ~107.37 seconds (2^20 data points at 9765.625 samples per second). The classic project started with the assumption that the beam width of the Line feed was 0.1 degrees, that was later refined to 0.083 degrees, and for the ALFA receivers it's 0.05 degrees. So a target now is seen as within the beam width for half the time originally assumed. Other changes for setiathome_enhanced cause significantly more processing to improve the sensitivity, basically that reduced the spacing in chirps at which searches at a particular FFT length are done. And of course the MB WUs now do de-chirping all the way from -100 to +100 Hz/sec.
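As a quick check of the figures in the paragraph above, here is a minimal sketch that recomputes the WU duration from the sample count and sample rate, and the ratio between the originally assumed beam width and the ALFA beam width. The variable and function names are illustrative only and do not come from the SETI@home source.

```python
# Values taken from the post above; names are hypothetical.
SAMPLES_PER_WU = 2 ** 20        # data points in one work unit
SAMPLE_RATE_HZ = 9765.625       # samples per second

def wu_duration_seconds(samples=SAMPLES_PER_WU, rate_hz=SAMPLE_RATE_HZ):
    """Length of one work unit in seconds."""
    return samples / rate_hz

duration = wu_duration_seconds()
print(f"WU duration: {duration:.2f} s")

# Beam widths in degrees: original line feed assumption vs ALFA receivers.
LINE_FEED_BEAM_DEG = 0.1
ALFA_BEAM_DEG = 0.05

# A target drifting through the beam is in view for a time proportional
# to the beam width, so this ratio is the "half the time" in the post.
beam_ratio = ALFA_BEAM_DEG / LINE_FEED_BEAM_DEG
print(f"Time in beam relative to original assumption: {beam_ratio:.2f}")
```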

I did a quick survey of the memcpy uses today. There's one in seti.cpp which copies the full 8MB array, but is only used once. Those at the beginning of the chirp functions you noted are similar but they are also only used once, for the zero chirp case. But I'm thinking that it might be effective to use the block prefetch at the entry to all chirp functions, turn it into an effective memcpy for that zero chirp case but a three phase operation for the actual chirps. I certainly want to add an MMX chirp variant with the nontemporal writes, if I can figure out the best way to do so for MS, Intel, and GCC. I like assembler, but it gets complex to deal with both masm and AT&T versions.

The instances in find_pulse you noted can't be used more than 31 times in one WU, as they are used in reporting a found pulse. And the length can be anything from 16 to 40960 single floats.

The heavy use of memcpy is probably mainly in two areas.

1. The GetFixedPoT() function is used before every Gaussian fitting operation, and 3/4 of those calls are handled with memcpy, something like 4.33e8 times in an ar=0.41 MB WU. The length is 64 single floats, but like many other things is specified by a value in the WU header so the possibility it could change must be provided for.

2. Copying the part of the chirped data on which an FFT is going to be done. Those range from 64 bytes to 1 MiB, but always multiples of 64. It might be better to change to out of place IPP FFTs than try to optimize the memcpy, but I'm not sure.
Joe

Jason G · « **Reply #17 on:** 07 Oct 2007, 08:12:27 pm »

Thanks for the expanded descriptions. I've been weeding my way through the main analyse_seti to map the general process against what was written. FFTs hiding in all sorts of places.

As far as usage of assembled files (integration difficulty) or even _asm blocks (for some reason it seems to disable compiler optimisations in my IPP), I too would stay away from them in this particular code (even though I prefer assembly). If you can use intrinsics to do the same thing, it should be fairly portable across compilers with a few defines (intrinsics seem to be used extensively elsewhere to good effect) and avoid having hassles with prologue/epilogue too.

Just for interest, a general rule of thumb for mem copies (sounds like it matches what you're saying already):
1) less than 64 bytes: use MOV, MOVQ or MOVDQA (incremental pointers)
2) 64+ bytes: use REP MOVSD (std C libraries' 'memcpy' is usually a 'version' of this one, but I think the compiler can't convert/inject memcpy and flatten the code, so a direct rep movsd would be a lot less overhead here if possible)
3) 1 KB plus: use MOVQ or MOVDQA (unrolled x4 or x8, interleaved loops)
4) 2 MB plus: use MOVNTDQ or other non-temporal stores (limited [hardware] prefetch, interleaved counters, unrolled x8)
5) large read --> process --> store: requires block prefetch, non-temporal writes + interleaved loop counters
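The size thresholds in the list above can be written down as a small lookup. This is only a sketch: `pick_copy_strategy` and its return strings are hypothetical, reading "1KB" and "2MB" as binary units is an assumption, and item 5 is left out because it describes a read/process/store pattern, not a plain copy size.

```python
# Hypothetical helper that maps a copy size in bytes to the approach named
# in the rule-of-thumb list. Thresholds assume 1 KB = 1024 bytes and
# 2 MB = 2 * 1024 * 1024 bytes.
KIB = 1024
MIB = 1024 * KIB

def pick_copy_strategy(nbytes: int) -> str:
    if nbytes < 0:
        raise ValueError("size must be non-negative")
    if nbytes < 64:
        return "MOV/MOVQ/MOVDQA, incremental pointers"
    if nbytes < 1 * KIB:
        return "REP MOVSD"
    if nbytes < 2 * MIB:
        return "MOVQ/MOVDQA, unrolled x4 or x8"
    return "MOVNTDQ non-temporal stores"

# 256 bytes is the 64-float GetFixedPoT case; 8 MiB is the full array copy.
for size in (16, 256, 64 * KIB, 8 * MIB):
    print(f"{size:>8} bytes -> {pick_copy_strategy(size)}")
```

The boundaries are a starting point and not a guarantee; the best choice on a given machine still has to be measured.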

I noticed a couple of caveats that AMD don't appear to mention in that document: 1) interleaved counters use more registers, and 2) avoid the p4 classic (SSE2) multiple-of-64k pointer difference, particularly needed for IPP calls.

(PS: Playing with the code a little more, I came across that disabled Intel Math include that says "needs more work". It doesn't, there are just duplicate includes. LOL, maybe that'll save 0.01%)

[4.33e8 x 64 x 4 = 103 Gbytes ?]
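The bracketed estimate can be checked with a few lines of arithmetic. The sketch below multiplies the call count from Joe's post by 64 floats of 4 bytes each. Reading "Gbytes" as binary gigabytes (GiB) is an assumption; it is the reading under which the figure comes out at about 103.

```python
# Figures from the thread; variable names are illustrative.
calls = 4.33e8            # approximate memcpy calls from GetFixedPoT per WU
floats_per_call = 64      # length of each copy in single floats
bytes_per_float = 4

total_bytes = calls * floats_per_call * bytes_per_float
total_gib = total_bytes / 2 ** 30   # binary gigabytes
total_gb = total_bytes / 1e9        # decimal gigabytes

print(f"{total_bytes:.4g} bytes = {total_gib:.1f} GiB = {total_gb:.1f} GB")
```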

Jason G · « **Reply #18 on:** 08 Oct 2007, 08:43:56 am »

Here's a quick play with copying 256 bytes (64 floats), I've left out all the mmx & sse variants for these short copies as the emms & fences overhead is too high and makes them take ~twice as long. [Note: no attention to alignment as yet in these]

Code:

Testing blocksize 256 bytes
Memory space allocated for Source and Dest blocks
-----------------------------------------------
memsetting Source to 'JJJJ...'
Copying with memcopy (Source to Dest) 104857600 times
Comparing Source to Dest
...identical .... duration 16093 ms
-----------------------------------------------
memsetting Source to 'PPPP...'
Copying with inline rep movsd (Source to Dest) 104857600 times
floats=no tail needed here, should be similar to memcpy
Comparing Source to Dest
...identical .... duration 16359 ms
-----------------------------------------------
memsetting Source to 'QQQQ...'
Copying with inline register movs 2x unroll (Source to Dest) 104857600 times
no tail here, must have even #floats, can add tail if needed.
Should be slightly faster than memcpy & rep movsd
Comparing Source to Dest
...identical .... duration 7375 ms
-----------------------------------------------
Memory freed
Press any key to continue . . .

I thought about the portability issue a little, what about .libs? (existing ones or new)

Still using MS here, though I found icc had some inconvenient debug settings that spoiled things, so maybe I'll try some tests with that later in the week.

Jason

Josef W. Segur · « **Reply #19 on:** 08 Oct 2007, 11:49:09 am »

So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.
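The estimate above follows from the benchmark durations posted in Reply #18. A minimal sketch of the arithmetic, scaling the measured saving to the roughly 4.33e8 copies per ar=0.41 WU mentioned earlier (variable names are illustrative):

```python
# Measured values from the benchmark output in Reply #18.
ITERATIONS = 104_857_600      # copies of 256 bytes in each test
MEMCPY_MS = 16_093            # standard memcpy
UNROLLED_MS = 7_375           # inline register movs, 2x unroll

# Approximate GetFixedPoT memcpy count for an ar=0.41 WU (Reply #16).
COPIES_PER_WU = 4.33e8

saved_s = (MEMCPY_MS - UNROLLED_MS) / 1000.0
saved_per_copy_s = saved_s / ITERATIONS
projected_saving_s = saved_per_copy_s * COPIES_PER_WU

print(f"Saved in benchmark: {saved_s:.1f} s")
print(f"Saved per copy:     {saved_per_copy_s * 1e9:.0f} ns")
print(f"Projected per WU:   {projected_saving_s:.0f} s")
```

With the unrounded inputs the projection lands at roughly 36 seconds, in line with the "about 35 seconds" figure.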

Quote from: j_groothu on 08 Oct 2007, 08:43:56 am

...
I thought about the portability issue a little, what about .libs? (existing ones or new)
...
Jason

I like the .lib idea, if we come up with enough tweaks to make it worthwhile.
Joe

Jason G · « **Reply #20 on:** 08 Oct 2007, 12:29:04 pm »

Quote from: Josef W. Segur on 08 Oct 2007, 11:49:09 am

So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.

Sounds reasonable, except that my ability to drive vtune at this stage is dubious at best, and the 'one other' who has publicly used it on the code is focussing on totally different areas. My hotspots [in terms of total execution time] went something like: memcpy, ... sum_ptt (or transpose something) ..., gaussfit, ... something else, an fft or two, ... a possible chirp or two, then pulse_find.

Of course I recognise that's a completely different generation architecture etc...

64.4 seconds (std memcpy) on this machine suggests to me about 0.3% of total execution time, more or less (with a potential 0.15% [of total] improvement (LOL) at that spot in GetFixedPoT), so there may well be another 10.7% waiting to be halved.

A call graph would help at this stage, but that part of vtune seems to crash the instrumented app for some reason.... ANYONE know how to stop it doing that (apart from disabling instrumentation) ? [ or perhaps suggest a, preferably lighter weight, profiler that will work with msvc/icc executables & debug info... ]

jason

Jason G · « **Reply #21 on:** 08 Oct 2007, 01:09:46 pm »

Digging through the 'questionable profiles', the p4/xeon 64k aliasing conflict issue I mentioned earlier seems to have registered 58.7e9 events. Avoiding the cache with those writes may help ... [But it looks like over 50% of them are generated in a particular IPP call. There was a mention somewhere about operating on 'denormalised data' and 'using thresholds', I wonder if that applies....]

Quote

This event counts the number of 64K/4M-aliasing conflicts. On the Intel(R) Pentium(R) 4 processor, a 64K-aliasing conflict occurs when a virtual address memory reference causes a cache line to be loaded and the address is modulo 64K bytes apart from another address that already resides in the first level cache. Only one cache line with a virtual address modulo 64K bytes can reside in the first level cache at the same time. On Intel(R) Xeon(R) processors, the conflict occurs when the memory references are modulo 4M bytes.

For example, accessing a byte at virtual addresses 0x10000 and 0x3000F would cause a 64K/4M aliasing conflict. This is because the virtual addresses for the two bytes reside on cache lines that are modulo 64K bytes apart.

On the Intel(R) Pentium(R) 4 and Intel(R) Xeon(R) processors with CPUID signature of family encoding 15, model encoding of 0, 1 or 2, the 64K/4M aliasing conflict also occurs when addresses have identical value in bits 15:6. If you avoid this kind of aliasing, you can speed up programs by a factor of three if they load frequently from preceding stores with aliased addresses and there is little other instruction-level parallelism available. The gain is smaller when loads alias with other loads, which cause thrashing in the first-level cache.
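To make the rule in the quoted text concrete, here is a small sketch that tests whether two virtual addresses fall on different cache lines that share bits 15:6, i.e. lines a multiple of 64K bytes apart. It is a simplified model for illustration only: the function names are hypothetical, 64-byte lines are assumed from the "bits 15:6" wording, and the 4M case for Xeon is not modelled.

```python
LINE_SHIFT = 6    # assumed 64-byte lines: bits 5:0 are the offset in a line
INDEX_BITS = 10   # bits 15:6 of the address

def alias_index(addr: int) -> int:
    """Extract bits 15:6 of an address."""
    return (addr >> LINE_SHIFT) & ((1 << INDEX_BITS) - 1)

def is_64k_alias(a: int, b: int) -> bool:
    """True if a and b are on different lines with identical bits 15:6."""
    same_line = (a >> LINE_SHIFT) == (b >> LINE_SHIFT)
    return (not same_line) and alias_index(a) == alias_index(b)

# The example from the quoted text: 0x10000 and 0x3000F conflict.
print(is_64k_alias(0x10000, 0x3000F))   # True
# Neighbouring lines do not conflict.
print(is_64k_alias(0x10000, 0x10040))   # False
```

A check like this could be applied to source and destination buffer addresses before a large copy, to see whether offsetting one buffer by a cache line or more would avoid the conflict.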

Jason G · « **Reply #22 on:** 08 Oct 2007, 07:45:43 pm »

Quote

A call graph would help at this stage, but that part of vtune seems to crash the instrumented app for some reason...

Sorted

[new p4 primary events profile in the oven]

_heinz · « **Reply #23 on:** 09 Oct 2007, 04:01:30 am »

@jason
at the end of opt_FPU.cpp, line 600 ff, is a memcopy named ultra_memcpy for different processor types.
heinz

Jason G · « **Reply #24 on:** 09 Oct 2007, 06:45:27 am »

Hi again. I did see these a while back.... when Joe Segur explained:

Quote

Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.

What I'm trying to establish is whether the preliminary measurements I did, which indicate my vintage p4 is spending so much time in memcpy, are valid. I have just finished the second profiling (19 calibrated runs of primary event monitors through a full workunit), so I'll know more soon.

Obtaining code for good memory copies isn't so much the issue for me, as I have tons of those optimised for different applications (sizes of data, platforms and context, tailorable to any dedicated context at will). What is gradually becoming clearer for me is that I only have a 512k L2 cache, and the generation of p4 that I have has an artifact called '64k Aliasing Conflicts' which causes cache thrashing [in L1 I think, but am still reading up on that], particularly during certain large IPP calls .... and GaussFit and others. [New run has 44.3 thousand million 64k Aliasing Conflicts sampled]

Some areas, as Joe is pointing out, may benefit from interleaved (or phased) operations instead of a discrete memcpy. This allows multiple read/store/write operations in parallel AND avoids cache problems, but it is challenging to modify the algorithms. This would require no direct use of memcpy as such, but a comprehensive understanding of how memory is used in a given part of the code, to implement a superior technique.

Another [most important] issue which Joe brought up is portability. In many cases that'll require more thought as there are many platforms, but there is a mechanism in place. That'll be the most challenging part of any improvements [Joe's job, LOL].

[Later: Also note that a rep movsd on a PC, like the unused one in that file, will always show 'about' the same performance as standard memcpy (because that's what memcpy is on a PC). I see no other implementations in the source I have, only a commented-out selection function (ultra_memcpy) which looks promising, but may or may not be the best approach if the cause of memory issues remains elusive. It may turn out 'all' heavy-traffic memcpy()s won't be there at all, which would be nice for my p4.]

Jason G · « **Reply #25 on:** 09 Oct 2007, 09:26:36 am »

Okay, New profile pretty much agrees with the last, except for a few key differences/observations.

[ Notes for my later reference, and others' interest, comparison etc...]

Platform: Intel Northwood p4 2.0A @ 2.1 GHz, 1 GB DDR400 RAM @ CAS 3
OS Windows XP Pro w/sp2
App: Local 2.3S9 Build, ICC+IPP, xW , SSE2
Applied build fixes: Multiplier, MKL Include, Switches, include directories
Performance Collectors: Call Graph, Counter Data (cpu% & sys proc queue length), Primary EBS events
Total 19 Runs, with Test Workunit Only

From Counter summaries /Sampling (Test Unit Used expected to have skewed results, will rerun with real WU when more is understood)
-- memcpy is now second on the list @ 9.5%, and was auto mapped to _intel_new/fast_memcpy. (Call graph works, so I can see this properly now).
[removed error]
--- w7_ipps_BitRev1_C ~8% (Intel IPP library, internally called it seems)
--- sse_vTranspose4ntw ~7%
--- GaussFit ~5.3%
--- Then find_pulse ~5%

Top 4 Hotspots by Self Time from Call Graph
---- GaussFit, find_pulse !!! (That's better), w7_ipps_BitRev1_C, _intel_fast_memcpy

TODO: Compare Call Graph vs Sampling Timings...

Memcpy:
--- _intel_fast_memcpy looks good for small to medium transfers, a bit better than memcpy. It's a standard dword copy, no mmx/sse. We'll get little or no benefit changing this one for small runs (except perhaps by an inline, dedicated, tailless version)

--- of the roughly 10% total time in intel_memcpy:
- 75% = called from seti_analyse ---> 7.5% of total execution time
- 9% = called from optimize_init ---> ~0.9% of total execution
- 8% = called from analyse_pot --> ~0.8% of total execution
- 3.3% = called from GetFixedPot --> ~0.3% of total execution (matches previous profile / Joe's / my estimations!)
The remaining tiny percentage of calls to intel memcpy totals <1% of execution.
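The breakdown above is each caller's share of the memcpy time multiplied by memcpy's roughly 10% share of the whole run. A short sketch of that multiplication, using the percentages as posted (the dictionary layout and variable names are illustrative):

```python
# memcpy's approximate share of total execution time, from the profile.
MEMCPY_SHARE_OF_TOTAL = 0.10

# Each caller's share of the time spent in the Intel memcpy, as posted.
caller_share_of_memcpy = {
    "seti_analyse": 0.75,
    "optimize_init": 0.09,
    "analyse_pot": 0.08,
    "GetFixedPot": 0.033,
}

share_of_total = {
    name: share * MEMCPY_SHARE_OF_TOTAL
    for name, share in caller_share_of_memcpy.items()
}

# Whatever is left over belongs to the remaining small callers.
remaining_of_memcpy = 1.0 - sum(caller_share_of_memcpy.values())
remaining_of_total = remaining_of_memcpy * MEMCPY_SHARE_OF_TOTAL

for name, share in share_of_total.items():
    print(f"{name:<14} {share * 100:.2f}% of total execution")
print(f"{'other callers':<14} {remaining_of_total * 100:.2f}% of total execution")
```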

Jason

Jason G · « **Reply #26 on:** 09 Oct 2007, 01:09:54 pm »

Ahhh, now finally starting to lead me to more rational places. The second profile run has brought me full circle back to the FoldArray routines underneath find_pulse (SSE routines I mentioned in an early post). There seem to be a few p4 specific issues in there for me to learn about.

It looks like even a tiny change here would affect the whole run.

No more messin' with memcopies just yet! There be cache problems in there.

Jason

_heinz · « **Reply #27 on:** 12 Oct 2007, 04:52:16 am »

@ jason
here you see what francois found...

Jason G · « **Reply #28 on:** 12 Oct 2007, 11:01:05 am »

Cool, nice to see some similarity despite different platforms (I assume he's done that on a core2). I think that may be showing total events grouped by function, which will get all callees etc.; that can help you drill down to see what inside the function is expensive. He would drill into seti_analyze now to get more detail.

If you do drill in through seti_analyze, you find the source of the fat memcpys and stop at those (they have no callees). Then you back up and drill through find_pulse, and you find that as deep as you can go is the FoldArray routines (they have no callees either). The FoldArray routines have the SSE2 instructions, pointer aliasing and dependency loops on a p4. It looks like the same issues may be relevant on the core2 after all.

(I thought many were just p4 issues)

All good fun, Nice to know I seem to be using vtune OK. Gotta get back to Maths study for a while .. but I'll be Back!
