Forum > Windows

First Time build try, 2.2B [&2.3S9] science App


Jason G:

--- Quote from: Josef W. Segur on 08 Oct 2007, 11:49:09 am ---So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.

--- End quote ---

Sounds reasonable, except that my ability to drive VTune at this stage is dubious at best, and that the 'one other' who has publicly used it on the code is focussing on totally different areas.  My hotspots [in terms of total execution time] went something like: memcpy, ...sum_ptt (or transpose something)..., GaussFit..., an FFT or two, a possible chirp or two, then pulse_find.

Of course I recognise that's a completely different generation architecture etc...

64.4 seconds (std memcpy in) on this machine suggests to me about 0.3% of total execution time, more or less (with a potential 0.15% [of total] improvement (LOL) at that spot in GetFixedPoT), so there may well be another 10.7% waiting to be halved.

A call graph would help at this stage, but that part of VTune seems to crash the instrumented app for some reason.... ANYONE know how to stop it doing that (apart from disabling instrumentation)? [Or perhaps suggest a, preferably lighter weight, profiler that will work with msvc/icc executables & debug info...]

jason

Jason G:
Digging through the 'questionable profiles', the P4/Xeon 64k aliasing conflict issue I mentioned earlier seems to have registered 58.7e9 events.  Avoiding the cache with those writes may help... [But it looks like over 50% of them are generated in a particular IPP call; there was a mention somewhere about operating on 'denormalised data' and 'using thresholds', I wonder if that applies....]

--- Quote ---This event counts the number of 64K/4M-aliasing conflicts. On the Intel(R) Pentium(R) 4 processor, a 64K-aliasing conflict occurs when a virtual address memory reference causes a cache line to be loaded and the address is modulo 64K bytes apart from another address that already resides in the first level cache. Only one cache line with a virtual address modulo 64K bytes can reside in the first level cache at the same time.  On Intel(R) Xeon(R) processors, the conflict occurs when the memory references are modulo 4M bytes.

For example, accessing a byte at virtual addresses 0x10000 and 0x3000F would cause a 64K/4M aliasing conflict. This is because the virtual addresses for the two bytes reside on cache lines that are modulo 64K bytes apart.

On the Intel(R) Pentium(R) 4 and Intel(R) Xeon(R) processors with CPUID signature of family encoding 15, model encoding of 0, 1 or 2, the 64K-aliasing conflict also occurs when addresses have identical value in bits 15:6. If you avoid this kind of aliasing, you can speed up programs by a factor of three if they load frequently from preceding stores with aliased addresses and there is little other instruction-level parallelism available. The gain is smaller when loads alias with other loads, which cause thrashing in the first-level cache.

--- End quote ---

Jason G:

--- Quote ---A call graph would help at this stage, but that part of vtune seems to crash the instrumented app for some reason...
--- End quote ---

Sorted ;) [new p4 primary events profile in the oven]

_heinz:
@jason
at the end of opt_FPU.cpp, line 600 ff., there is a memcpy named ultra_memcpy for different processor types.
heinz

Jason G:
Hi again.  I did see these a while back, when Joe Segur explained:

--- Quote ---Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.

--- End quote ---
What I'm trying to establish is whether the preliminary measurements I did, which indicate my vintage P4 is spending so much time in memcpy, are valid.  I have just finished the second profiling pass (19 calibrated runs of primary event monitors through a full workunit), so I'll know more soon.

Obtaining code for good memory copies isn't so much the issue for me, as I have tons of those optimised for different applications (sizes of data, platforms and context, tailorable to any dedicated context at will).  What is gradually becoming clearer to me is that I only have a 512k L2 cache, and the generation of P4 that I have has an artifact called '64k Aliasing Conflicts' which causes cache thrashing [in L1 I think, but I am still reading up on that], particularly during certain large IPP calls... and GaussFit and others.  [The new run sampled 44.3e9 64k Aliasing Conflicts.]

Some areas, as Joe is pointing out, may benefit from interleaved (or phased) operations instead of a discrete memcpy.  This allows multiple read/store/write operations in parallel AND avoids cache problems, but modifying the algorithms is challenging.  It would require no direct use of memcpy as such, but a comprehensive understanding of how memory is used in a given part of the code, to implement a superior technique.

Another [most important] issue which Joe brought up is portability.  In many cases that'll require more thought, as there are many platforms, but there is a mechanism in place.  That'll be the most challenging part of any improvements [Joe's job :P LOL].

[Later: Also note that a rep movsd on a PC, like the unused one in that file, will always show 'about' the same performance as the standard memcpy (because that's what memcpy is on a PC).  I see no other implementations in the source I have, only a commented-out selection function (ultra_memcpy) which looks promising, but may or may not be the best approach if the cause of the memory issues remains elusive.  It may turn out 'all' heavy-traffic memcpy()'s won't be there at all, which would be nice for my P4 :D]
