Current Profile Analysis and Points to Optimize

BenHer:
Hi Simon,

Got it all put together, compiled and running without errors.

I haven't modified the source yet, so I'm going to assume the results strongly match; I'll verify that next.

However, even without that check, the profile I've just done (with AMD CodeAnalyst) should show places to work on.

The following areas showed the most CPU time (I limited the run to a single core on this Athlon 64 X2 3800+):

WU used:  testWU-1
Test time:  ~10 min

13.32% - AnalyzePot
7.09% - findpulse
6.48% - GaussFit


AnalyzePot Locations:

--- Code: ---
66% - An inlined version of the "getFixedPot" function
      Half of that comes from this line:
          fp_PoT[ul_PoT_i] = fp_PowerSpectrum[ul_PoT + ul_PoTChunk_i];
      Mostly due to cache misses (the value being copied is fetched from what amounts to random memory addresses).

33% - The following loop:
      for (i = 0; i < PulsePoTLen; i++) {
          PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset + i) * FftLength];
      }
--- End code ---

findpulse
Inside the 4 folding loops I mentioned earlier

GaussFit

BenHer:
Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcpy routine, 5 times faster than the current one (a 400% speedup), you would speed up overall WU crunching by only about 1.3%, so it's not a great place to put a lot of effort.

analyze_pot - Alex Kan's transposed PoT caching would probably speed this up a lot (and it uses memcpy, so maybe a faster memcpy is a good idea after all).
find_pulse - with SSE-optimized functions this could probably be sped up by 200% (3x), saving about 3.5% of overall WU time.
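
The arithmetic behind those two estimates can be written down as a small helper. This is only a sketch of the usual "share of run time times fraction removed" calculation; the function name and parameters are made up for illustration and are not from the SETI@home source.

```c
#include <assert.h>

/* Percentage of total run time saved when a function that accounts for
 * share_pct percent of the run is made `speedup` times faster.
 * Hypothetical helper, for illustration only. */
double overall_saving_pct(double share_pct, double speedup)
{
    assert(speedup > 0.0);
    /* The function's new cost is share_pct / speedup, so the saving is
     * the difference between the old and new cost. */
    return share_pct * (1.0 - 1.0 / speedup);
}
```

With the profile figures above, overall_saving_pct(1.61, 5.0) comes to roughly 1.3 for memcpy, and overall_saving_pct(5.35, 3.0) to roughly 3.6 for find_pulse.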

Josef W. Segur:

--- Quote from: BenHer on 10 Aug 2006, 12:45:10 am ---Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.
--- End quote ---

True, there's not a lot of difference between the 0.6xx angle range of WU-1 and the 0.775 of the project test WU. The distribution of angle ranges from Arecibo to date peaks in the 0.42 to 0.44 range; WU-2, WU-3, and WU-5 are more typical. The v_ChirpData contribution should not be affected by angle range, so it would become a lower percentage on those.


--- Quote ---13.5% - analyze_pot ...  again - same 3 code sections
 7.17% - gauss_fit   (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
 5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%]  )
 5.35% - find_pulse
 3.22% - an ipp fft routine
 2.68% - an ipp fft routine
 2.54% - chooseGaussEvent
 1.61% - memcpy

If you were to implement a blindingly fast memcpy routine, 5 times faster than the current one (a 400% speedup), you would speed up overall WU crunching by only about 1.3%, so it's not a great place to put a lot of effort.

analyze_pot - Alex Kan's transposed PoT caching would probably speed this up a lot (and it uses memcpy, so maybe a faster memcpy is a good idea after all).
find_pulse - with SSE-optimized functions this could probably be sped up by 200% (3x), saving about 3.5% of overall WU time.
--- End quote ---

I'm very glad to see real figures aimed at identifying the hot spots, thanks!

Question:

Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a spectrum analyzer display which might be usable. The only thing which actually needs the complex fft output is baseline smoothing; all signal analysis is performed on the PowerSpectrum.
                                                                      Joe
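
As a rough sketch of what merging two of those steps could look like: the function below takes the complex output of one fft, computes the power per bin, and writes it straight into a transposed table, so that no separate transpose pass is needed. Everything here is an assumption for illustration: the cplx type, the function name, and the bin-major layout are hypothetical, the fft itself is not included, and no particular IPP call is implied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex sample type; the real code has its own. */
typedef struct { float re, im; } cplx;

/* For the fft taken at time index t, compute the power in each bin and
 * store it bin-major: pot_major[bin * num_ffts + t].
 * All power values for one bin then sit next to each other in memory. */
void power_spectrum_transposed(const cplx *fft_out, size_t fft_len,
                               size_t t, size_t num_ffts, float *pot_major)
{
    assert(t < num_ffts);
    for (size_t bin = 0; bin < fft_len; bin++) {
        float re = fft_out[bin].re;
        float im = fft_out[bin].im;
        pot_major[bin * num_ffts + t] = re * re + im * im;
    }
}
```

Note that this moves the strided access from the reads to the writes, so a blocked variant that handles several ffts at once might be needed before it pays off. Whether it helps at all would have to be measured.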

BenHer:

--- Quote ---Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable.
--- End quote ---

You have exceeded the level of my competence  :o  - This is a question best left for Eric K and maybe Dave A. (and other mathematicians in the audience).

My forte is programming, optimization, assembly programming, SIMD, CPU pipeline latencies and the like.  I've long since forgotten what little signal analysis I learned in college.

I can spot inefficient code and understand the programming goal of most code, though, so I can work out what the original programmer intended and often come at a solution from many angles.

As an example, Alex Kan designed a new method of caching the PoT information and submitted it to Eric K, who put it into release code 5.17.  The primary value of this method is that it puts most PoT table accesses at adjacent memory addresses rather than semi-randomly scattered throughout the table.  CPUs like this and can prefetch upcoming memory into the cache.  Perhaps he understands the signal-processing purpose of the PoT values in the table; either way, the data reorganization method is clear from a coding standpoint.
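
To make the layout point concrete, here is a minimal sketch of the two access patterns. It is not Alex Kan's actual code; the function names, parameters, and the assumption that the transposed table is stored one row per frequency bin are mine, for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Original layout: one row per fft (time), one column per frequency bin.
 * Consecutive PoT samples are fft_len floats apart, so each read is
 * likely to land on a different cache line. */
void get_pot_strided(const float *power, size_t fft_len, size_t bin,
                     size_t t_offset, size_t pot_len, float *pot)
{
    for (size_t i = 0; i < pot_len; i++)
        pot[i] = power[bin + (t_offset + i) * fft_len];
}

/* Transposed layout: one row per bin, one column per time index.
 * The same PoT is now a contiguous run, so a plain memcpy fetches it
 * and the hardware prefetcher can stream it in. */
void get_pot_transposed(const float *power_t, size_t num_ffts, size_t bin,
                        size_t t_offset, size_t pot_len, float *pot)
{
    assert(t_offset + pot_len <= num_ffts);
    memcpy(pot, &power_t[bin * num_ffts + t_offset], pot_len * sizeof *pot);
}
```

Both functions return the same values for the same bin and time range; only the memory traffic differs.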

Simon:
Whew,

I'm away for a few days and scientific discussion erupts on these boards ;) Glad to see it. Also sorry to say that my math skills are largely constrained to what I learned in high school, so no help there.

However, thank you very much for the analysis of possible hotspots, Ben! This does indeed help to identify possible and, more importantly, sensible parts of the code to try to improve. Most people who can compile applications cannot analyze very well how they run.

In any case, I guess Michael's VTune runs should have run their full course (I believe he said he cancelled them at 50% or so, before they were done), because his results showed find_pulse as the biggest bottleneck (~20% of time spent, as I recall). It still seems to be among the top few (but at ~5% it's not quite as important), which also explains why his optimizations never yielded as much performance as he (and I, obviously ;)) hoped. That's not to say his work isn't worthwhile; it most definitely is. It has, at the very least, attracted the attention of other capable people like yourselves - and that is no small feat.

Hats off :)
Simon.
