
First Time build try, 2.2B [&2.3S9] science App


Jason G:
Okay, between maths homework I'm still trying to get a handle on the overall algorithms employed in the client.  I found this page a while back, from SETI Classic, that gave me some starters: http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html.  Can I expect major differences other than the sizes/durations of datasets in enhanced/MB?  On the surface the logical sequence and methods appear similar ....

Josef W. Segur:

--- Quote from: j_groothu on 07 Oct 2007, 11:43:45 am ---http://seticlassic.ssl.berkeley.edu/about_seti/about_seti_at_home_4.html....
--- End quote ---

Yes, that's still a fairly good overall description.

The duration of a WU is unchanged since the beginning of S@H, ~107.37 seconds (2^20 data points at 9765.625 samples per second). The classic project started with the assumption that the beam width of the line feed was 0.1 degrees; that was later refined to 0.083 degrees, and for the ALFA receivers it's 0.05 degrees. So a target is now seen as within the beam width for half the time originally assumed. Other changes for setiathome_enhanced cause significantly more processing to improve the sensitivity; basically they reduced the spacing in chirps at which searches at a particular FFT length are done. And of course the MB WUs now do de-chirping all the way from -100 to +100 Hz/sec.
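For reference, the WU duration quoted above follows directly from those two numbers. A quick check (the function name is just illustrative, not anything in the client):

```c
/* The workunit duration quoted above: 2^20 samples at 9765.625 samples/sec. */
double wu_duration_seconds(void)
{
    const double samples = 1048576.0;  /* 2^20 data points per WU */
    const double rate    = 9765.625;   /* samples per second */
    return samples / rate;             /* ~107.37 seconds */
}
```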

I did a quick survey of the memcpy uses today. There's one in seti.cpp which copies the full 8 MB array, but it is only used once. Those at the beginning of the chirp functions you noted are similar, but they are also only used once, for the zero-chirp case. I'm thinking it might be effective to use the block prefetch at the entry to all chirp functions: that turns it into an effective memcpy for the zero-chirp case but a three-phase operation for the actual chirps. I certainly want to add an MMX chirp variant with the non-temporal writes, if I can figure out the best way to do so for MS, Intel, and GCC. I like assembler, but it gets complex dealing with both MASM and AT&T versions.
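On the "best way to do so for MS, Intel, and GCC" question: a minimal sketch of a cache-bypassing copy using SSE streaming-store intrinsics (rather than the MMX/assembler route), since the same `<xmmintrin.h>` intrinsics compile under all three compilers. The function name, the 16-byte alignment requirement, and the 16-floats-per-iteration unroll are my own assumptions for illustration, not code from the client:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: accepted by MSVC, ICC, and GCC */

/* Copy n floats (n a multiple of 16, both pointers 16-byte aligned)
   with non-temporal stores, so the destination bypasses the cache. */
void copy_stream(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128 a = _mm_load_ps(src + i);
        __m128 b = _mm_load_ps(src + i + 4);
        __m128 c = _mm_load_ps(src + i + 8);
        __m128 d = _mm_load_ps(src + i + 12);
        _mm_stream_ps(dst + i,      a);
        _mm_stream_ps(dst + i + 4,  b);
        _mm_stream_ps(dst + i + 8,  c);
        _mm_stream_ps(dst + i + 12, d);
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

As the benchmark later in this thread shows, the fence cost makes this a loss for short blocks; it only pays off for copies well beyond cache size.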

The instances in find_pulse you noted can't be used more than 31 times in one WU, as they are used in reporting a found pulse. And the length can be anything from 16 to 40960 single floats.

The heavy use of memcpy is probably mainly in two areas.

1. The GetFixedPoT() function is used before every Gaussian fitting operation, and 3/4 of those calls are handled with memcpy, something like 4.33e8 times in an ar=0.41 MB WU. The length is 64 single floats, but like many other things it is specified by a value in the WU header, so the possibility that it could change must be provided for.

2. Copying the part of the chirped data on which an FFT is going to be done. Those range from 64 bytes to 1 MiB, but always multiples of 64. It might be better to change to out of place IPP FFTs than try to optimize the memcpy, but I'm not sure.
                                                      Joe

Jason G:
Thanks for the expanded descriptions.  I've been working my way through the main analyse_seti to map the general process against what was written.   FFTs hiding in all sorts of places :D.

As far as using assembled files (integration difficulty) or even _asm blocks goes (for some reason _asm seems to disable compiler optimisations in my IPP builds :( ), I too would stay away from them in this particular code, even though I prefer assembly.  If you can use intrinsics to do the same thing, it should be fairly portable across compilers with a few defines (intrinsics already seem to be used extensively elsewhere to good effect) and avoids the prologue/epilogue hassles too.

Just for interest, general rules of thumb for memory copies (sounds like they match what you're saying already):
1) Less than 64 bytes: use MOV, MOVQ, or MOVDQA with incrementing pointers.
2) 64+ bytes: use REP MOVSD.  (Standard C libraries' memcpy is usually a version of this one, but I think the compiler can't convert/inject memcpy and flatten the code, so a direct REP MOVSD would have a lot less overhead here if possible.)
3) 1 KB plus: use MOVQ or MOVDQA (unrolled x4 or x8 interleaved loops).
4) 2 MB plus: use MOVNTDQ or other non-temporal stores (limited [hardware] prefetch, interleaved counters, unrolled x8).
5) Large read --> process --> store sequences require block prefetch, non-temporal writes, and interleaved loop counters.
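As an illustration of the small-block end of those rules applied to the 64-float GetFixedPoT case: a plain-C unrolled copy that the compiler can keep in registers, roughly what the "inline register movs" benchmark variant below does. The function name and x4 unroll factor are my own choices for the sketch, not the client's code:

```c
#include <stddef.h>

/* Small fixed-block rule applied to GetFixedPoT's 64-float (256-byte)
   copy: unroll by four so the compiler keeps four values in registers
   per iteration instead of paying the call/dispatch cost of memcpy. */
void copy64f_unroll4(float *dst, const float *src)
{
    for (size_t i = 0; i < 64; i += 4) {
        float a = src[i],     b = src[i + 1];
        float c = src[i + 2], d = src[i + 3];
        dst[i]     = a; dst[i + 1] = b;
        dst[i + 2] = c; dst[i + 3] = d;
    }
}
```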

I noticed a couple of caveats that AMD don't appear to mention in that document: 1) interleave counters and use more registers, and 2) avoid pointer differences that are a multiple of 64 KB on classic (SSE2) P4s, particularly needed for IPP calls.

(PS: Playing with the code a little more, I came across that disabled Intel Math include marked "needs more work". It doesn't need more work; there are just duplicate includes. LOL, maybe that'll save 0.01%.)

[4.33e8 x 64 x 4 =  103 Gbytes ? :o]

Jason G:
Here's a quick play with copying 256 bytes (64 floats).  I've left out all the MMX & SSE variants for these short copies, as the EMMS & fence overhead is too high and makes them take ~twice as long.  [Note: no attention to alignment yet in these.]


--- Code: ---Testing blocksize 256 bytes
Memory space allocated for Source and Dest blocks
-----------------------------------------------
memsetting Source to 'JJJJ...'
Copying with memcopy (Source to Dest) 104857600 times
Comparing Source to Dest
...identical .... duration 16093 ms
-----------------------------------------------
memsetting Source to 'PPPP...'
Copying with inline rep movsd (Source to Dest) 104857600 times
floats=no tail needed here, should be similar to memcpy
Comparing Source to Dest
...identical .... duration 16359 ms
-----------------------------------------------
memsetting Source to 'QQQQ...'
Copying with inline register movs 2x unroll (Source to Dest) 104857600 times
no tail here, must have even #floats, can add tail if needed.
Should be slightly faster than memcpy & rep movsd
Comparing Source to Dest
...identical .... duration 7375 ms
-----------------------------------------------
Memory freed
Press any key to continue . . .
--- End code ---

I thought about the portability issue a little, what about .libs? (existing ones or new)

Still using MS here, though I found ICC had some inconvenient debug settings that spoiled things, so maybe I'll try some tests with that later in the week.

Jason
 

Josef W. Segur:
So for 1.05e8 iterations about 8.7 seconds is saved. That would be about 35 seconds for an ar=0.41 WU. It seems the GetFixedPoT memcpy was probably only a small part of the 11% you measured, though memory accesses caused by that memcpy probably increase the impact.
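Joe's scaling can be checked directly from the benchmark numbers above: 16093 - 7375 = 8718 ms saved over ~1.05e8 copies, which projects to roughly 36 seconds over the 4.33e8 GetFixedPoT copies, in line with his ~35 second figure. A sketch (the function name is illustrative):

```c
/* Project the benchmark saving onto one ar=0.41 workunit. */
double projected_wu_saving_seconds(void)
{
    const double bench_iters = 104857600.0;      /* copies in the benchmark run */
    const double saved_ms    = 16093.0 - 7375.0; /* memcpy time minus unrolled-copy time */
    const double wu_copies   = 4.33e8;           /* GetFixedPoT memcpy calls per ar=0.41 WU */
    return saved_ms / bench_iters * wu_copies / 1000.0;  /* ~36 seconds */
}
```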


--- Quote from: j_groothu on 08 Oct 2007, 08:43:56 am ---...
I thought about the portability issue a little, what about .libs? (existing ones or new)
...
Jason
--- End quote ---

I like the .lib idea, if we come up with enough tweaks to make it worthwhile.
                                                     Joe
