http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html....
Testing blocksize 256 bytes
Memory space allocated for Source and Dest blocks
-----------------------------------------------
memsetting Source to 'JJJJ...'
Copying with memcopy (Source to Dest) 104857600 times
Comparing Source to Dest...identical .... duration 16093 ms
-----------------------------------------------
memsetting Source to 'PPPP...'
Copying with inline rep movsd (Source to Dest) 104857600 times
floats=no tail needed here, should be similar to memcpy
Comparing Source to Dest...identical .... duration 16359 ms
-----------------------------------------------
memsetting Source to 'QQQQ...'
Copying with inline register movs 2x unroll (Source to Dest) 104857600 times
no tail here, must have even #floats, can add tail if needed.
Should be slightly faster than memcpy & rep movsd
Comparing Source to Dest...identical .... duration 7375 ms
-----------------------------------------------
Memory freed
Press any key to continue . . .
...I thought about the portability issue a little, what about .libs? (existing ones or new)...

Jason
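On the portability point, the "2x unroll" copy from the test output can be approximated in plain C. This is only a sketch of the idea, not the inline assembly used in the test above: the function names `copy_floats_unroll2` and `blocks_identical` are made up for illustration, and a compiler may well turn the loop into something different from the hand-written register moves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n floats, two per loop iteration.
   Like the test above, it has no tail loop, so n must be even. */
static void copy_floats_unroll2(float *dst, const float *src, size_t n)
{
    assert(n % 2 == 0);              /* "must have even #floats" */
    for (size_t i = 0; i < n; i += 2) {
        float a = src[i];            /* load two values first ... */
        float b = src[i + 1];
        dst[i] = a;                  /* ... then store both */
        dst[i + 1] = b;
    }
}

/* Hypothetical helper: the "Comparing Source to Dest" step.
   Returns 1 when both blocks hold the same bytes, 0 otherwise. */
static int blocks_identical(const void *a, const void *b, size_t bytes)
{
    return memcmp(a, b, bytes) == 0;
}
```

A 256-byte block as in the test is 64 floats, so a call would look like `copy_floats_unroll2(dst, src, 64)` followed by `blocks_identical(src, dst, 256)`. Whether this beats `memcpy` on a given compiler and CPU would have to be measured.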
So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.
This event counts the number of 64K/4M-aliasing conflicts. On the Intel(R) Pentium(R) 4 processor, a 64K-aliasing conflict occurs when a virtual address memory reference causes a cache line to be loaded and the address is modulo 64K bytes apart from another address that already resides in the first level cache. Only one cache line with a virtual address modulo 64K bytes can reside in the first level cache at the same time. On Intel(R) Xeon(R) processors, the conflict occurs when the memory references are modulo 4M bytes.

For example, accessing a byte at virtual addresses 0x10000 and 0x3000F would cause a 64K/4M aliasing conflict. This is because the virtual addresses for the two bytes reside on cache lines that are modulo 64K bytes apart.

On the Intel(R) Pentium(R) 4 and Intel(R) Xeon(R) processors with CPUID signature of family encoding 15, model encoding of 0, 1 or 2, the 64K/4M aliasing conflict also occurs when addresses have identical value in bits 15:6. If you avoid this kind of aliasing, you can speed up programs by a factor of three if they load frequently from preceding stores with aliased addresses and there is little other instruction-level parallelism available. The gain is smaller when loads alias with other loads, which causes thrashing in the first-level cache.
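To check whether two buffers in our own code might hit this, the "identical value in bits 15:6" rule from the description above can be tested directly. The sketch below is a simplified reading of that description, not anything VTune does internally; the function name `aliases_64k` and the 64-byte cache line size it assumes are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when two addresses sit on different
   64-byte cache lines whose address bits 15:6 are identical, i.e. the
   lines are a multiple of 64K bytes apart. Assumes 64-byte lines. */
static int aliases_64k(uintptr_t a, uintptr_t b)
{
    const uintptr_t line_mask = ~(uintptr_t)0x3F;  /* drop offset bits 5:0 */
    const uintptr_t set_mask  = (uintptr_t)0xFFC0; /* keep bits 15:6 */

    if ((a & line_mask) == (b & line_mask))
        return 0;                                  /* same cache line */
    return ((a ^ b) & set_mask) == 0;              /* bits 15:6 match */
}
```

With the addresses from the example, `aliases_64k(0x10000, 0x3000F)` reports a conflict, while `aliases_64k(0x10000, 0x10040)` (the next cache line) does not. If source and destination buffers turn out to alias, offsetting one of them by a cache line or more is the usual thing to try, though the effect would need to be measured.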
A call graph would help at this stage, but that part of VTune seems to crash the instrumented app for some reason...
Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.