+- +-
Say hello if visiting :) by Gecko
11 Jan 2023, 07:43:05 pm

Seti is down again by Mike
09 Aug 2017, 10:02:44 am

Some considerations regarding OpenCL MultiBeam app tuning from algorithm view by Raistmer
11 Dec 2016, 06:30:56 am

Loading APU to the limit: performance considerations by Mike
05 Nov 2016, 06:49:26 am

Better sleep on Windows - new round by Raistmer
26 Aug 2016, 02:02:31 pm

Author Topic: Current Profile Analysis and points to optimze  (Read 23258 times)

BenHer

  • Guest
Current Profile Analysis and points to optimze
« on: 09 Aug 2006, 08:29:12 pm »
Hi Simon,

Got it all put together, compiled and running without errors.

Haven't modified source so I'm going to assume the results strongly match...will next.

However, even without this the profile Ive just done should show places to work on (AMD DeviceAnalyst).

The following areas showed most CPU time usage (I limited it to one Core only on this Athlon 64 3800+ X2)

WU used:  testWU-1
Test time:  ~10 min

13.32% - AnalyzePot
7.09% - findpulse
6.48% - GaussFit


AnalyzePot Locations:
Code: [Select]
66% An inlined version of "getFixedPot" function
      1/2 from this line -       fp_PoT[ul_PoT_i] = fp_PowerSpectrum[ul_PoT + ul_PoTChunk_i];

     Mostly due to cache misses (the value being copied is grabbed from what amounts to random memory addresses)

33% - The following loop
   for(i = 0; i < PulsePoTLen; i++) {
       PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset+i) * FftLength];
}

findpulse
Inside the 4 folding loops I mentioned earlier

GaussFit

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #1 on: 10 Aug 2006, 12:45:10 am »
Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.


Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: Current Profile Analysis and points to optimze
« Reply #2 on: 10 Aug 2006, 02:57:43 pm »
Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

True, there's not a lot of difference between the 0.6xx angle range of WU-1 and the 0.775 of the project test WU. The distribution of angle ranges from Arecibo to date peaks in the 0.42 to 0.44 range; WU-2, WU-3, and WU-5 are more typical. The v_chirpdata contribution should not be affected by angle range, so would become a lower percentage.

Quote
13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.

I'm very glad to see real figures aimed at identifying the hot spots, thanks!

Question:

Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable. The only thing which actually needs the complex fft output is baseline smoothing, all signal analysis is performed on the PowerSpectrum.
                                                                      Joe

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #3 on: 10 Aug 2006, 04:40:40 pm »
Quote
Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable.

You have exceeded the level of my competence  :o  - This is a question best left for Eric K and maybe Dave A. (and other mathmaticians in the audience).

My forte is in programming, optimization, assembly programming, SIMD, cpu pipeline latencies and the like.  I've forgotten any signals analysis I learned in college, what little there was, long since.

I can spot inneficient code and understand the programming goal of most code though, so I can understand what the original programmer intended and often come at a solution from many angles. 

As an example Alex Kan designed a new method of caching the POT information, and submitted it to Eric K, who put it into release code 5.17.  The primary value of this method puts most pot table access in adjacent memory addresses and not semi-randomly placed throughout the table.  CPUs like this and can pre-fetch future memory into the cache.  Perhaps he understands the underlying signals purpose of the pot values int the table, but the data reorganization method is clear to my coding understanding.

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Re: Current Profile Analysis and points to optimze
« Reply #4 on: 11 Aug 2006, 08:04:03 pm »
Whew,

I'm away for a few days and scientific discussion erupts on these boards ;) Glad to see it. Also sorry to say that my math skills are largely constrained to what I learned in high school, so no help there.

However, thank you very much for analysis of possible hotspots, Ben! This does indeed help to identify possible, and more importantly, sensible parts of the code to try and improve, not to mention most people who can compile applications cannot analyze how they run very well.

In any case, I guess Michael's vTune runs should have run their full course (he said he cancelled them before they were done I believe, at 50% or so), because his results were that find_pulse was the biggest bottleneck (~20% of time spent or thereabouts as I recall). It still seems part of the top few (but at ~5%, not quite as important) and also explains why his optimizations never yielded as much performance as he (and I, obviously ;)) hoped. That's not to say his work isn't worthwhile, it most definitely is. It has, at the very least, attracted the attention of other capable people like yourselves - and that is no small feat.

Hats off :)
Simon.

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #5 on: 11 Aug 2006, 10:31:41 pm »
Simon,

Could you post a full WU for testing that

a. has typical  angle_range
b. crunches for a typical amount of time on one of your systems (ie the average time for that system)

As Josef suggested doing a full run with this type of full WU would be best.

Thanks,
=Ben

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Re: Current Profile Analysis and points to optimze
« Reply #6 on: 11 Aug 2006, 11:27:51 pm »
Josef is talking about the most common angle ranges - apart from the adjusted chirp_limit parameters, they are the exact same WUs.

Change the two occurrences of <chirp_limit>Value</chirp_limit> to 20 and 50, respectively, so the passage in each WU (there is an XML-style header on each) looks like this:

Code: [Select]
  <chirps>
    <chirp_parameter_t>
      <chirp_limit>20</chirp_limit>
      <fft_len_flags>262136</fft_len_flags>
    </chirp_parameter_t>
    <chirp_parameter_t>
      <chirp_limit>50</chirp_limit>
      <fft_len_flags>65528</fft_len_flags>
    </chirp_parameter_t>
  </chirps>

Only change the chirp_limit values please.

testWU-2, 3 and 5 fall in the "most common ARs" category.

I've modified these WUs and attached them to this post, since I realized you may not have a test pack.

HTH,
Simon.

[attachment deleted by admin]
« Last Edit: 11 Aug 2006, 11:32:24 pm by Simon »

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #7 on: 12 Aug 2006, 02:37:04 pm »
Thanks Simon,

I had downloaded those 7 WUs and the testbench already.

What I was looking for was a WU that would take the full normal ammount of time...2  3 and 5 are all "shorter" WUs.

Which aspect makes them short I`m not certain...perhaps changing those chirp_limit values you suggested would make the time long again (run it as it would if downloaded from seti )


Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Re: Current Profile Analysis and points to optimze
« Reply #8 on: 12 Aug 2006, 04:39:47 pm »
Exactly my point :)

Those were downloaded from S@H by one of my hosts about two months ago.

The files I attached to my previous post have the chirp_limit values already reset, so please use them. The one that would run the longest would be testWU-6 (after changing chirp_limit as pointed out previously) as it is a VLAR (very low angle range) WU.

HTH,
Simon.

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #9 on: 12 Aug 2006, 10:06:28 pm »
Simon,

You mentioned you and Michael weren't able to get speed improvments out of find_pulse...

Here is a section of the Intel C++ compiler  SSE2 optimized code for the loop which "folds" two tables into a third table.
 
Code: [Select]
float tmp_max = 0;
for (int i=0;i<length;i++) {
register float tmpfloat=(ptr1[i]+ptr2[i])/2;
sums[i]=tmpfloat;
if (tmpfloat>tmp_max) {
tmp_max=tmpfloat;
}
}

... becomes...

Tree Samples Address Code Bytes Source Line # CPU0 CPU1
5 0x410629 F3 0F 10 05 48 17 50 00 movss xmm0,[00501748h] 5 5
0x41063d 89 0C 24 mov [esp],ecx 6
36 0x410640 8B 4D 14 mov ecx,[ebp+14h] 7 17 19
0x41064b 8B 75 10 mov esi,[ebp+10h] 8
38 0x41064e 33 FF xor edi,edi 9 18 20
0x410650 F3 0F 10 14 BE movss xmm2,[esi+edi*4] 10
99 0x410655 F3 0F 58 14 B9 addss xmm2,[ecx+edi*4] 11 40 59
11 0x41065a F3 0F 59 D0 mulss xmm2,xmm0 12 7 4
23 0x41067c 0F 28 05 A0 16 50 00 movaps xmm0,[005016a0h] 13 10 13
0x410699 0F 10 14 B9 movups xmm2,[ecx+edi*4] 14
495 0x41069d 0F 58 14 BA addps xmm2,[edx+edi*4] 15 194 301
270 0x4106a1 0F 59 D0 mulps xmm2,xmm0 16 122 148
175 0x4106ab 0F 10 4C B9 10 movups xmm1,[ecx+edi*4+10h] 17 74 101
359 0x4106b0 0F 58 4C BA 10 addps xmm1,[edx+edi*4+10h] 18 154 205
199 0x4106b5 0F 59 C8 mulps xmm1,xmm0 19 76 123
470 0x4106cf 0F 10 24 B9 movups xmm4,[ecx+edi*4] 20 201 269
1284 0x4106d3 0F 10 14 BA movups xmm2,[edx+edi*4] 21 540 744
1204 0x4106d7 0F 58 E2 addps xmm4,xmm2 22 532 672
510 0x4106da 0F 59 E0 mulps xmm4,xmm0 23 204 306
535 0x4106e4 0F 10 4C B9 10 movups xmm1,[ecx+edi*4+10h] 24 201 334
1066 0x4106e9 0F 10 5C BA 10 movups xmm3,[edx+edi*4+10h] 25 461 605
1142 0x4106ee 0F 58 CB addps xmm1,xmm3 26 478 664
539 0x4106f1 0F 59 C8 mulps xmm1,xmm0 27 213 326
119 0x410726 F3 0F 10 05 48 17 50 00 movss xmm0,[00501748h] 28 49 70
136 0x41072e 8B 45 14 mov eax,[ebp+14h] 29 60 76
114 0x410731 8B 55 10 mov edx,[ebp+10h] 30 48 66
5 0x410734 F3 0F 10 14 BA movss xmm2,[edx+edi*4] 31 1 4
1751 0x410739 F3 0F 58 14 B8 addss xmm2,[eax+edi*4] 32 683 1068
143 0x41073e F3 0F 59 D0 mulss xmm2,xmm0 33 44 99

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #10 on: 13 Aug 2006, 01:12:11 am »
...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:
  • Loop to do single   ( ptr1 + ptr2 ) * .5  // automatically uses reciprocal to avoid divison
     until address of ptr1+i  ptr2+i  ptr3+i  are on a 16 byte boundary
  • loop to do SIMD adds like above until not enough bytes are left for an entire SIMD register
  • final loop to catch remaining values in buffers

pretty clever compiler.

Gecko_R7

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #11 on: 13 Aug 2006, 10:35:14 pm »
Question for you gents.

The company I work for, we sometimes "farm-out" business studies to a local University where these studies are assigned as an MBA project to be researched and solutions/recommendations presented by an MBA student group. Good for them as a learning exercise, great for us in practicality as it efficiently expands our resource base and brings-in a fresh perspective.

I was thinking something similar could offer a comparable benefit from a mathematical stand-point if 4 or 5 of the most mathematically intensive tasks re: various functions/analysis algorithims could be presented w/ the goal of finding the most efficient mathematical solutions & implementations that could easily be x-ferred to programming.  Would this idea be worth pursuing? If so, your profile results have probably identified some likely candidate areas that a mathematician could likely improve upon.

Surely Seti has enough recognition and academic respect where it would present a credible subject & project for a graduate-level, aspiring mathematician? The real trick might be in aligning an individual's mathematical knowledge w/ prerequisite fundamentals supporting the kind of signal processing/analysis that S@H performs.

With a clear set of objectives and deliverables, I'm sure a few folks have the necessary contacts at respective universities where this could be possible.  I may be able to accomplish this at my former Alma Mater. ;)
Could even be interesting to see differing solutions to the same problems.
Just an idea if anyone thinks there's any merit to it.
Cheers! 



BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #12 on: 13 Aug 2006, 10:53:08 pm »
Great idea Gecko,

Heck...Anderson, and Korpela are working as part of a grant on U.C. Berkeley campus...plenty of student bodies (and brains) available I should think.

=Ben

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Re: Current Profile Analysis and points to optimze
« Reply #13 on: 14 Aug 2006, 07:42:51 am »
...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:
  • Loop to do single   ( ptr1 + ptr2 ) * .5  // automatically uses reciprocal to avoid divison
     until address of ptr1+i  ptr2+i  ptr3+i  are on a 16 byte boundary
  • loop to do SIMD adds like above until not enough bytes are left for an entire SIMD register
  • final loop to catch remaining values in buffers

pretty clever compiler.
Yup, that it is ;)

Also would explain why Michael's code couldn't gain much over the code produced by ICC by default, since he basically did the same - wrote a few inline assembly functions like sumtables-sum5tables that did exactly that.

Gecko, great idea! Sadly, I don't have any acquaintances that could help in the maths department, but I do hope some of you guys do ;)

Regards,
Simon.

BenHer

  • Guest
Re: Current Profile Analysis and points to optimze
« Reply #14 on: 15 Aug 2006, 02:32:53 pm »
Figured out f_GetPeak optimization...new version MUCH faster than original.

The faster optimized GetPeak version currently using FPU vs original using SSE2.

Not sure exactly how much faster yet, probably at least 5 times...working on a timing/benchmark routine (like Eric's 'analyzeFuncs_vector.cpp' code) to time various function versions and also verify their output.

Found some interesting compile differences between  /Arch:SSE /G6  vs.  /OxW  vs. /OxN .  Sometimes /OxN isnt fastest.

Figured out how to tell ICC to super optimize v_getPowerSpectrum...hand coding could hardly improve on it.

I'm at work now so can't check assembly output, but I believe same super optimize worked for v_ChirpData (with a few code changes to avoid conditionals within loops).

GetTrueMean is really only a simple summing loop - this can get vectorized easilly.

getChiSq has lots of additional math in it but starts out with same speed problem as f_GetPeak...this could be  improved.

Completed method for converting an existing function call into a function pointer call with easy code to produce alternate versions of existing functions (that need improvement).

Created separate 'optimize' library project so I can set individual file's optimize options and avoid things like "/Quipo" or "/QxW" for some files.  Just an additional project in the IDE.

 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )

 

Welcome, Guest.
Please login or register.
 
 
 
Forgot your password?
Members
Total Members: 97
Latest: ToeBee
New This Month: 0
New This Week: 0
New Today: 0
Stats
Total Posts: 59559
Total Topics: 1672
Most Online Today: 26
Most Online Ever: 983
(20 Jan 2020, 03:17:55 pm)
Users Online
Members: 0
Guests: 23
Total: 23
Powered by EzPortal