Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => Topic started by: BenHer on 09 Aug 2006, 08:29:12 pm

Title: Current Profile Analysis and points to optimze
Post by: BenHer on 09 Aug 2006, 08:29:12 pm
Hi Simon,

Got it all put together, compiled and running without errors.

Haven't modified source so I'm going to assume the results strongly match...will next.

However, even without this the profile Ive just done should show places to work on (AMD DeviceAnalyst).

The following areas showed most CPU time usage (I limited it to one Core only on this Athlon 64 3800+ X2)

WU used:  testWU-1
Test time:  ~10 min

13.32% - AnalyzePot
7.09% - findpulse
6.48% - GaussFit


AnalyzePot Locations:
Code: [Select]
66% An inlined version of "getFixedPot" function
      1/2 from this line -       fp_PoT[ul_PoT_i] = fp_PowerSpectrum[ul_PoT + ul_PoTChunk_i];

     Mostly due to cache misses (the value being copied is grabbed from what amounts to random memory addresses)

33% - The following loop
   for(i = 0; i < PulsePoTLen; i++) {
       PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset+i) * FftLength];
}

findpulse
Inside the 4 folding loops I mentioned earlier

GaussFit
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 10 Aug 2006, 12:45:10 am
Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.

Title: Re: Current Profile Analysis and points to optimze
Post by: Josef W. Segur on 10 Aug 2006, 02:57:43 pm
Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

True, there's not a lot of difference between the 0.6xx angle range of WU-1 and the 0.775 of the project test WU. The distribution of angle ranges from Arecibo to date peaks in the 0.42 to 0.44 range; WU-2, WU-3, and WU-5 are more typical. The v_chirpdata contribution should not be affected by angle range, so would become a lower percentage.

Quote
13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.

I'm very glad to see real figures aimed at identifying the hot spots, thanks!

Question:

Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable. The only thing which actually needs the complex fft output is baseline smoothing, all signal analysis is performed on the PowerSpectrum.
                                                                      Joe
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 10 Aug 2006, 04:40:40 pm
Quote
Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable.

You have exceeded the level of my competence  :o  - This is a question best left for Eric K and maybe Dave A. (and other mathmaticians in the audience).

My forte is in programming, optimization, assembly programming, SIMD, cpu pipeline latencies and the like.  I've forgotten any signals analysis I learned in college, what little there was, long since.

I can spot inneficient code and understand the programming goal of most code though, so I can understand what the original programmer intended and often come at a solution from many angles. 

As an example Alex Kan designed a new method of caching the POT information, and submitted it to Eric K, who put it into release code 5.17.  The primary value of this method puts most pot table access in adjacent memory addresses and not semi-randomly placed throughout the table.  CPUs like this and can pre-fetch future memory into the cache.  Perhaps he understands the underlying signals purpose of the pot values int the table, but the data reorganization method is clear to my coding understanding.
Title: Re: Current Profile Analysis and points to optimze
Post by: Simon on 11 Aug 2006, 08:04:03 pm
Whew,

I'm away for a few days and scientific discussion erupts on these boards ;) Glad to see it. Also sorry to say that my math skills are largely constrained to what I learned in high school, so no help there.

However, thank you very much for analysis of possible hotspots, Ben! This does indeed help to identify possible, and more importantly, sensible parts of the code to try and improve, not to mention most people who can compile applications cannot analyze how they run very well.

In any case, I guess Michael's vTune runs should have run their full course (he said he cancelled them before they were done I believe, at 50% or so), because his results were that find_pulse was the biggest bottleneck (~20% of time spent or thereabouts as I recall). It still seems part of the top few (but at ~5%, not quite as important) and also explains why his optimizations never yielded as much performance as he (and I, obviously ;)) hoped. That's not to say his work isn't worthwhile, it most definitely is. It has, at the very least, attracted the attention of other capable people like yourselves - and that is no small feat.

Hats off :)
Simon.
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 11 Aug 2006, 10:31:41 pm
Simon,

Could you post a full WU for testing that

a. has typical  angle_range
b. crunches for a typical amount of time on one of your systems (ie the average time for that system)

As Josef suggested doing a full run with this type of full WU would be best.

Thanks,
=Ben
Title: Re: Current Profile Analysis and points to optimze
Post by: Simon on 11 Aug 2006, 11:27:51 pm
Josef is talking about the most common angle ranges - apart from the adjusted chirp_limit parameters, they are the exact same WUs.

Change the two occurrences of <chirp_limit>Value</chirp_limit> to 20 and 50, respectively, so the passage in each WU (there is an XML-style header on each) looks like this:

Code: [Select]
  <chirps>
    <chirp_parameter_t>
      <chirp_limit>20</chirp_limit>
      <fft_len_flags>262136</fft_len_flags>
    </chirp_parameter_t>
    <chirp_parameter_t>
      <chirp_limit>50</chirp_limit>
      <fft_len_flags>65528</fft_len_flags>
    </chirp_parameter_t>
  </chirps>

Only change the chirp_limit values please.

testWU-2, 3 and 5 fall in the "most common ARs" category.

I've modified these WUs and attached them to this post, since I realized you may not have a test pack.

HTH,
Simon.

[attachment deleted by admin]
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 12 Aug 2006, 02:37:04 pm
Thanks Simon,

I had downloaded those 7 WUs and the testbench already.

What I was looking for was a WU that would take the full normal ammount of time...2  3 and 5 are all "shorter" WUs.

Which aspect makes them short I`m not certain...perhaps changing those chirp_limit values you suggested would make the time long again (run it as it would if downloaded from seti )

Title: Re: Current Profile Analysis and points to optimze
Post by: Simon on 12 Aug 2006, 04:39:47 pm
Exactly my point :)

Those were downloaded from S@H by one of my hosts about two months ago.

The files I attached to my previous post have the chirp_limit values already reset, so please use them. The one that would run the longest would be testWU-6 (after changing chirp_limit as pointed out previously) as it is a VLAR (very low angle range) WU.

HTH,
Simon.
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 12 Aug 2006, 10:06:28 pm
Simon,

You mentioned you and Michael weren't able to get speed improvments out of find_pulse...

Here is a section of the Intel C++ compiler  SSE2 optimized code for the loop which "folds" two tables into a third table.
 
Code: [Select]
float tmp_max = 0;
for (int i=0;i<length;i++) {
register float tmpfloat=(ptr1[i]+ptr2[i])/2;
sums[i]=tmpfloat;
if (tmpfloat>tmp_max) {
tmp_max=tmpfloat;
}
}

... becomes...

Tree Samples Address Code Bytes Source Line # CPU0 CPU1
5 0x410629 F3 0F 10 05 48 17 50 00 movss xmm0,[00501748h] 5 5
0x41063d 89 0C 24 mov [esp],ecx 6
36 0x410640 8B 4D 14 mov ecx,[ebp+14h] 7 17 19
0x41064b 8B 75 10 mov esi,[ebp+10h] 8
38 0x41064e 33 FF xor edi,edi 9 18 20
0x410650 F3 0F 10 14 BE movss xmm2,[esi+edi*4] 10
99 0x410655 F3 0F 58 14 B9 addss xmm2,[ecx+edi*4] 11 40 59
11 0x41065a F3 0F 59 D0 mulss xmm2,xmm0 12 7 4
23 0x41067c 0F 28 05 A0 16 50 00 movaps xmm0,[005016a0h] 13 10 13
0x410699 0F 10 14 B9 movups xmm2,[ecx+edi*4] 14
495 0x41069d 0F 58 14 BA addps xmm2,[edx+edi*4] 15 194 301
270 0x4106a1 0F 59 D0 mulps xmm2,xmm0 16 122 148
175 0x4106ab 0F 10 4C B9 10 movups xmm1,[ecx+edi*4+10h] 17 74 101
359 0x4106b0 0F 58 4C BA 10 addps xmm1,[edx+edi*4+10h] 18 154 205
199 0x4106b5 0F 59 C8 mulps xmm1,xmm0 19 76 123
470 0x4106cf 0F 10 24 B9 movups xmm4,[ecx+edi*4] 20 201 269
1284 0x4106d3 0F 10 14 BA movups xmm2,[edx+edi*4] 21 540 744
1204 0x4106d7 0F 58 E2 addps xmm4,xmm2 22 532 672
510 0x4106da 0F 59 E0 mulps xmm4,xmm0 23 204 306
535 0x4106e4 0F 10 4C B9 10 movups xmm1,[ecx+edi*4+10h] 24 201 334
1066 0x4106e9 0F 10 5C BA 10 movups xmm3,[edx+edi*4+10h] 25 461 605
1142 0x4106ee 0F 58 CB addps xmm1,xmm3 26 478 664
539 0x4106f1 0F 59 C8 mulps xmm1,xmm0 27 213 326
119 0x410726 F3 0F 10 05 48 17 50 00 movss xmm0,[00501748h] 28 49 70
136 0x41072e 8B 45 14 mov eax,[ebp+14h] 29 60 76
114 0x410731 8B 55 10 mov edx,[ebp+10h] 30 48 66
5 0x410734 F3 0F 10 14 BA movss xmm2,[edx+edi*4] 31 1 4
1751 0x410739 F3 0F 58 14 B8 addss xmm2,[eax+edi*4] 32 683 1068
143 0x41073e F3 0F 59 D0 mulss xmm2,xmm0 33 44 99
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 13 Aug 2006, 01:12:11 am
...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:

pretty clever compiler.
Title: Re: Current Profile Analysis and points to optimze
Post by: Gecko_R7 on 13 Aug 2006, 10:35:14 pm
Question for you gents.

The company I work for, we sometimes "farm-out" business studies to a local University where these studies are assigned as an MBA project to be researched and solutions/recommendations presented by an MBA student group. Good for them as a learning exercise, great for us in practicality as it efficiently expands our resource base and brings-in a fresh perspective.

I was thinking something similar could offer a comparable benefit from a mathematical stand-point if 4 or 5 of the most mathematically intensive tasks re: various functions/analysis algorithims could be presented w/ the goal of finding the most efficient mathematical solutions & implementations that could easily be x-ferred to programming.  Would this idea be worth pursuing? If so, your profile results have probably identified some likely candidate areas that a mathematician could likely improve upon.

Surely Seti has enough recognition and academic respect where it would present a credible subject & project for a graduate-level, aspiring mathematician? The real trick might be in aligning an individual's mathematical knowledge w/ prerequisite fundamentals supporting the kind of signal processing/analysis that S@H performs.

With a clear set of objectives and deliverables, I'm sure a few folks have the necessary contacts at respective universities where this could be possible.  I may be able to accomplish this at my former Alma Mater. ;)
Could even be interesting to see differing solutions to the same problems.
Just an idea if anyone thinks there's any merit to it.
Cheers! 


Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 13 Aug 2006, 10:53:08 pm
Great idea Gecko,

Heck...Anderson, and Korpela are working as part of a grant on U.C. Berkeley campus...plenty of student bodies (and brains) available I should think.

=Ben
Title: Re: Current Profile Analysis and points to optimze
Post by: Simon on 14 Aug 2006, 07:42:51 am
...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:
  • Loop to do single   ( ptr1 + ptr2 ) * .5  // automatically uses reciprocal to avoid divison
     until address of ptr1+i  ptr2+i  ptr3+i  are on a 16 byte boundary
  • loop to do SIMD adds like above until not enough bytes are left for an entire SIMD register
  • final loop to catch remaining values in buffers

pretty clever compiler.
Yup, that it is ;)

Also would explain why Michael's code couldn't gain much over the code produced by ICC by default, since he basically did the same - wrote a few inline assembly functions like sumtables-sum5tables that did exactly that.

Gecko, great idea! Sadly, I don't have any acquaintances that could help in the maths department, but I do hope some of you guys do ;)

Regards,
Simon.
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 15 Aug 2006, 02:32:53 pm
Figured out f_GetPeak optimization...new version MUCH faster than original.

The faster optimized GetPeak version currently using FPU vs original using SSE2.

Not sure exactly how much faster yet, probably at least 5 times...working on a timing/benchmark routine (like Eric's 'analyzeFuncs_vector.cpp' code) to time various function versions and also verify their output.

Found some interesting compile differences between  /Arch:SSE /G6  vs.  /OxW  vs. /OxN .  Sometimes /OxN isnt fastest.

Figured out how to tell ICC to super optimize v_getPowerSpectrum...hand coding could hardly improve on it.

I'm at work now so can't check assembly output, but I believe same super optimize worked for v_ChirpData (with a few code changes to avoid conditionals within loops).

GetTrueMean is really only a simple summing loop - this can get vectorized easilly.

getChiSq has lots of additional math in it but starts out with same speed problem as f_GetPeak...this could be  improved.

Completed method for converting an existing function call into a function pointer call with easy code to produce alternate versions of existing functions (that need improvement).

Created separate 'optimize' library project so I can set individual file's optimize options and avoid things like "/Quipo" or "/QxW" for some files.  Just an additional project in the IDE.

 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 15 Aug 2006, 04:36:21 pm
Hmm...just checked out the older version of the seti source by Alex Kan & Rick Berry optimized mac source code from their website http://writhe.org.uk/seti@home/.  Note: the lastest modified file was 9-15-2005 so it was pre "enhanced" I'm guessing...

They not only optimized existing functions they cleaned up formatting, added documentation, re-wrote entire sections and changed the way computations were performed (chirping)...so apparently they have reviewed some of the math.  ::)

They also commented many undocumented routines inside the source, so they seem to have worked through what Eric K. et al were trying to achieve with many of their functions.

Regarding an earlier question..."can some students be tasked with reviewing the math..."  Alex is apparently a U.C. Berkeley engineering student.
Title: Re: Current Profile Analysis and points to optimze
Post by: Josef W. Segur on 19 Aug 2006, 04:38:23 pm
Hmm...just checked out the older version of the seti source by Alex Kan & Rick Berry optimized mac source code from their website http://writhe.org.uk/seti@home/.  Note: the lastest modified file was 9-15-2005 so it was pre "enhanced" I'm guessing...

They not only optimized existing functions they cleaned up formatting, added documentation, re-wrote entire sections and changed the way computations were performed (chirping)...so apparently they have reviewed some of the math.  ::)

They also commented many undocumented routines inside the source, so they seem to have worked through what Eric K. et al were trying to achieve with many of their functions.

I was impressed, too. Later source can be found at http://tbp.berkeley.edu/~alexkan/seti/ (http://tbp.berkeley.edu/~alexkan/seti/). I'm wondering if I can restate some of the vectorized routines from the 6.1 source to compile with DevC++/MinGW. If I can get up to speed soon enough, I'll try to get at least some x86 SIMD variants into 5.17+. OTOH, you could probably do that much more efficiently than I...

Quote
Regarding an earlier question..."can some students be tasked with reviewing the math..."  Alex is apparently a U.C. Berkeley engineering student.

Graduate, now. I was reading the Macnn forum posts related to those optimized S@H apps, that was also quite interesting.
                                                                       Joe
Title: Re: Current Profile Analysis and points to optimze
Post by: Josef W. Segur on 19 Aug 2006, 04:49:35 pm
Figured out how to tell ICC to super optimize v_getPowerSpectrum...hand coding could hardly improve on it.

Is that ippsPowerSpectr_32fc() ?
                                                                     Joe
Title: Re: Current Profile Analysis and points to optimze
Post by: chboss on 19 Aug 2006, 05:15:40 pm
Yes, Alex's Mac client is impressive....

MacMini G4 1.25GHz  RAC 219
Athlon XP 2600+ (Linux) RAC 212

If some of their improvements can be brought over to the Linux version it would certainly be helpful.

Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 20 Aug 2006, 01:43:04 am
I've gotten about a 20% improvement so far vs the Simon's SSE3 Athlon exe.

SIMD is only a part of it...many of the bottlenecks are simple programming optimization.

1st identify what is slow...2nd identify why...fix.   Several have been float/int conversions that aren't needed...others if-then's inside of loops...big no no...another was an  'abs( )' inside a loop...big speed up from that.

I've also incorporated Alex's power spectrum re-ordered table from 5.17, but without using another table...its all inside of the original powerspectrum table.

Have to verify it all vs the test WUs now...am only testing against short WU 2 vs release-515 for general development.  WU2 verifies strongly...time on my Athlon 64 3800 X2 - using only core #2 -   537 seconds

In my latest...find_pulse (and i'ts new sub functions) uses 19.02% of WU time...and Intel's FFT uses 17.92%...the cache misses for Pot functions are down to 15.7%.

Might be able to squeeze another 5-10% out...harder now though.

Quote
Is that ippsPowerSpectr_32fc() ?    - Joe
No...I just let Intel compiler vectorize the loop, but I gave it better hints that it could be vectorized.


Simon,
Suggest you check out the program AutoIt3 at http://www.autoitscript.com/autoit3/ (http://www.autoitscript.com/autoit3/)  for automating the testing...I'm going to write a short one myself...time seconds...etc.

Title: Re: Current Profile Analysis and points to optimze
Post by: Simon on 20 Aug 2006, 09:14:52 am
Hi Ben,

Auto-It is pretty impressive stuff. Even more, so, the 20% you said you got out of the 5.15 sources :) Any chance of getting an archive of your changes or a full source snapshot anytime soon? If I seem eager, I am ;)

Also, do those 20% translate to Intel systems too or is it AMD-only?

About telling ICC to vectorize things - are you doing that with "#pragma vector aligned" or "#pragma vector always"?

Regards,
Simon.
Title: Re: Current Profile Analysis and points to optimze
Post by: BenHer on 20 Aug 2006, 01:52:28 pm
Simon,

I use this code to tell it what pointers point to aligned buffers (in powerspectrum its both)
Code: [Select]
#ifdef __INTEL_COMPILER
#define ALIGNED_YES( buffer_ ) __assume_aligned( buffer_, SIMD_ALIGN );
#else
#define ALIGNED_YES( buffer_ )
#endif
Title: Re: Current Profile Analysis and points to optimze
Post by: Josef W. Segur on 21 Aug 2006, 12:04:12 am
For approximate comparison, I built 5.17 on DevC++/MinGW with profiling enabled. I had to drop optimization to O2 because the profiling code won't work with -fomit_frame_pointer. So FWIW here are some values from running WU2 with chirp limits 10 and 25, about 3 hours 41 minutes on my 1.4 GHz Pentium-m:

37.90% find_pulse()
11.09% v_Transpose4()
 6.04% v_ChirpData()
 5.28% CalcTrigArray()
 5.24% GaussFit()
 5.22% f_GetChiSq()
 4.71% f_GetTrueMean()
 3.61% FindSpikes()
 3.29% f_GetPeak()
 2.57% lcgf()
 2.51% find_triplets()
 2.36% v_GetPowerSpectrum()
 1.95% float_to_uchar()
 1.62% t_funct()
 1.53% GetFixedPoT()
 1.27% analyze_pot()
                                                                   Joe