Author Topic: Current Profile Analysis and points to optimze (Read 33131 times)

BenHer · « **on:** 09 Aug 2006, 08:29:12 pm »

Hi Simon,

Got it all put together, compiled and running without errors.

Haven't modified source so I'm going to assume the results strongly match...will next.

However, even without this the profile Ive just done should show places to work on (AMD DeviceAnalyst).

The following areas showed most CPU time usage (I limited it to one Core only on this Athlon 64 3800+ X2)

WU used: testWU-1
Test time: ~10 min

13.32% - AnalyzePot
7.09% - findpulse
6.48% - GaussFit

AnalyzePot Locations:

Code: [Select]

66% An inlined version of "getFixedPot" function
      1/2 from this line -       fp_PoT[ul_PoT_i] = fp_PowerSpectrum[ul_PoT + ul_PoTChunk_i];

     Mostly due to cache misses (the value being copied is grabbed from what amounts to random memory addresses)

33% - The following loop
   for(i = 0; i < PulsePoTLen; i++) {	
       PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset+i) * FftLength];	
	 }

findpulse
Inside the 4 folding loops I mentioned earlier

GaussFit

BenHer · « **Reply #1 on:** 10 Aug 2006, 12:45:10 am »

Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

13.5% - analyze_pot ... again - same 3 code sections
7.17% - gauss_fit (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%] )
5.35% - find_pulse
3.22% - an ipp fft routine
2.68% - an ipp fft routine
2.54% - chooseGaussEvent
1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.

Josef W. Segur · « **Reply #2 on:** 10 Aug 2006, 02:57:43 pm »

Quote from: BenHer on 10 Aug 2006, 12:45:10 am

Did a run on a full WU (the test WU included with the source).

Function usage pretty much the same, not many areas for improvement.

True, there's not a lot of difference between the 0.6xx angle range of WU-1 and the 0.775 of the project test WU. The distribution of angle ranges from Arecibo to date peaks in the 0.42 to 0.44 range; WU-2, WU-3, and WU-5 are more typical. The v_chirpdata contribution should not be affected by angle range, so would become a lower percentage.

Quote

13.5% - analyze_pot ... again - same 3 code sections
7.17% - gauss_fit (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%] )
5.35% - find_pulse
3.22% - an ipp fft routine
2.68% - an ipp fft routine
2.54% - chooseGaussEvent
1.61% - memcpy

If you were to implement a blindingly fast memcopy routine, speeding it up by 5 times (or 400%) you would speedup overall WU crunching by only 1.2%...thus not a great place to put a lot of effort.

analyze_pot - The alex kahn transposed pot caching would probably speed this up a lot (and it uses memcpy so maybe thats a good idea)
Find_pulse - with SSE optimized functions could probably be speeded up by 200% (3x), saving overall WU time of 3.5%.

I'm very glad to see real figures aimed at identifying the hot spots, thanks!

Question:

Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable. The only thing which actually needs the complex fft output is baseline smoothing, all signal analysis is performed on the PowerSpectrum.
Joe

BenHer · « **Reply #3 on:** 10 Aug 2006, 04:40:40 pm »

Quote

Do you think it would be practical to combine the fft, getPowerSpectrum, and transpose steps? It seems it might be more efficient, and perhaps IPP has some functions meant for generating a Spectrum Analyzer display which might be usable.

You have exceeded the level of my competence

- This is a question best left for Eric K and maybe Dave A. (and other mathmaticians in the audience).

My forte is in programming, optimization, assembly programming, SIMD, cpu pipeline latencies and the like. I've forgotten any signals analysis I learned in college, what little there was, long since.

I can spot inneficient code and understand the programming goal of most code though, so I can understand what the original programmer intended and often come at a solution from many angles.

As an example Alex Kan designed a new method of caching the POT information, and submitted it to Eric K, who put it into release code 5.17. The primary value of this method puts most pot table access in adjacent memory addresses and not semi-randomly placed throughout the table. CPUs like this and can pre-fetch future memory into the cache. Perhaps he understands the underlying signals purpose of the pot values int the table, but the data reorganization method is clear to my coding understanding.

Simon · « **Reply #4 on:** 11 Aug 2006, 08:04:03 pm »

Whew,

I'm away for a few days and scientific discussion erupts on these boards

Glad to see it. Also sorry to say that my math skills are largely constrained to what I learned in high school, so no help there.

However, thank you very much for analysis of possible hotspots, Ben! This does indeed help to identify possible, and more importantly, sensible parts of the code to try and improve, not to mention most people who can compile applications cannot analyze how they run very well.

In any case, I guess Michael's vTune runs should have run their full course (he said he cancelled them before they were done I believe, at 50% or so), because his results were that find_pulse was the biggest bottleneck (~20% of time spent or thereabouts as I recall). It still seems part of the top few (but at ~5%, not quite as important) and also explains why his optimizations never yielded as much performance as he (and I, obviously

) hoped. That's not to say his work isn't worthwhile, it most definitely is. It has, at the very least, attracted the attention of other capable people like yourselves - and that is no small feat.

Hats off

Simon.

BenHer · « **Reply #5 on:** 11 Aug 2006, 10:31:41 pm »

Simon,

Could you post a full WU for testing that

a. has typical angle_range
b. crunches for a typical amount of time on one of your systems (ie the average time for that system)

As Josef suggested doing a full run with this type of full WU would be best.

Thanks,
=Ben

Simon · « **Reply #6 on:** 11 Aug 2006, 11:27:51 pm »

Josef is talking about the most common angle ranges - apart from the adjusted chirp_limit parameters, they are the exact same WUs.

Change the two occurrences of <chirp_limit>Value</chirp_limit> to 20 and 50, respectively, so the passage in each WU (there is an XML-style header on each) looks like this:

Code: [Select]

  <chirps>
    <chirp_parameter_t>
      <chirp_limit>20</chirp_limit>
      <fft_len_flags>262136</fft_len_flags>
    </chirp_parameter_t>
    <chirp_parameter_t>
      <chirp_limit>50</chirp_limit>
      <fft_len_flags>65528</fft_len_flags>
    </chirp_parameter_t>
  </chirps>

Only change the chirp_limit values please.

testWU-2, 3 and 5 fall in the "most common ARs" category.

I've modified these WUs and attached them to this post, since I realized you may not have a test pack.

HTH,
Simon.

[attachment deleted by admin]

BenHer · « **Reply #7 on:** 12 Aug 2006, 02:37:04 pm »

Thanks Simon,

I had downloaded those 7 WUs and the testbench already.

What I was looking for was a WU that would take the full normal ammount of time...2 3 and 5 are all "shorter" WUs.

Which aspect makes them short I`m not certain...perhaps changing those chirp_limit values you suggested would make the time long again (run it as it would if downloaded from seti )

Simon · « **Reply #8 on:** 12 Aug 2006, 04:39:47 pm »

Exactly my point

Those were downloaded from S@H by one of my hosts about two months ago.

The files I attached to my previous post have the chirp_limit values already reset, so please use them. The one that would run the longest would be testWU-6 (after changing chirp_limit as pointed out previously) as it is a VLAR (very low angle range) WU.

HTH,
Simon.

BenHer · « **Reply #9 on:** 12 Aug 2006, 10:06:28 pm »

Simon,

You mentioned you and Michael weren't able to get speed improvments out of find_pulse...

Here is a section of the Intel C++ compiler SSE2 optimized code for the loop which "folds" two tables into a third table.

Code: [Select]

	float	tmp_max = 0;
	for (int i=0;i<length;i++) {
		register float tmpfloat=(ptr1[i]+ptr2[i])/2;
		sums[i]=tmpfloat;
		if (tmpfloat>tmp_max) {
			tmp_max=tmpfloat;
		}
	}

... becomes...

Tree	Samples	Address	Code Bytes	Source	Line #	CPU0	CPU1	
	5	0x410629	F3 0F 10 05 48 17 50 00 	movss xmm0,[00501748h]	5		5	
		0x41063d	89 0C 24 	mov [esp],ecx	6			
	36	0x410640	8B 4D 14 	mov ecx,[ebp+14h]	7	17	19	
		0x41064b	8B 75 10 	mov esi,[ebp+10h]	8			
	38	0x41064e	33 FF 	xor edi,edi	9	18	20	
		0x410650	F3 0F 10 14 BE 	movss xmm2,[esi+edi*4]	10			
	99	0x410655	F3 0F 58 14 B9 	addss xmm2,[ecx+edi*4]	11	40	59	
	11	0x41065a	F3 0F 59 D0 	mulss xmm2,xmm0	12	7	4	
	23	0x41067c	0F 28 05 A0 16 50 00 	movaps xmm0,[005016a0h]	13	10	13	
		0x410699	0F 10 14 B9 	movups xmm2,[ecx+edi*4]	14			
	495	0x41069d	0F 58 14 BA 	addps xmm2,[edx+edi*4]	15	194	301	
	270	0x4106a1	0F 59 D0 	mulps xmm2,xmm0	16	122	148	
	175	0x4106ab	0F 10 4C B9 10 	movups xmm1,[ecx+edi*4+10h]	17	74	101	
	359	0x4106b0	0F 58 4C BA 10 	addps xmm1,[edx+edi*4+10h]	18	154	205	
	199	0x4106b5	0F 59 C8 	mulps xmm1,xmm0	19	76	123	
	470	0x4106cf	0F 10 24 B9 	movups xmm4,[ecx+edi*4]	20	201	269	
	1284	0x4106d3	0F 10 14 BA 	movups xmm2,[edx+edi*4]	21	540	744	
	1204	0x4106d7	0F 58 E2 	addps xmm4,xmm2	22	532	672	
	510	0x4106da	0F 59 E0 	mulps xmm4,xmm0	23	204	306	
	535	0x4106e4	0F 10 4C B9 10 	movups xmm1,[ecx+edi*4+10h]	24	201	334	
	1066	0x4106e9	0F 10 5C BA 10 	movups xmm3,[edx+edi*4+10h]	25	461	605	
	1142	0x4106ee	0F 58 CB 	addps xmm1,xmm3	26	478	664	
	539	0x4106f1	0F 59 C8 	mulps xmm1,xmm0	27	213	326	
	119	0x410726	F3 0F 10 05 48 17 50 00 	movss xmm0,[00501748h]	28	49	70	
	136	0x41072e	8B 45 14 	mov eax,[ebp+14h]	29	60	76	
	114	0x410731	8B 55 10 	mov edx,[ebp+10h]	30	48	66	
	5	0x410734	F3 0F 10 14 BA 	movss xmm2,[edx+edi*4]	31	1	4	
	1751	0x410739	F3 0F 58 14 B8 	addss xmm2,[eax+edi*4]	32	683	1068	
	143	0x41073e	F3 0F 59 D0 	mulss xmm2,xmm0	33	44	99

BenHer · « **Reply #10 on:** 13 Aug 2006, 01:12:11 am »

...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:

Loop to do single ( ptr1 + ptr2 ) * .5 // automatically uses reciprocal to avoid divison
until address of ptr1+i ptr2+i ptr3+i are on a 16 byte boundary
loop to do SIMD adds like above until not enough bytes are left for an entire SIMD register
final loop to catch remaining values in buffers

pretty clever compiler.

Gecko_R7 · « **Reply #11 on:** 13 Aug 2006, 10:35:14 pm »

Question for you gents.

The company I work for, we sometimes "farm-out" business studies to a local University where these studies are assigned as an MBA project to be researched and solutions/recommendations presented by an MBA student group. Good for them as a learning exercise, great for us in practicality as it efficiently expands our resource base and brings-in a fresh perspective.

I was thinking something similar could offer a comparable benefit from a mathematical stand-point if 4 or 5 of the most mathematically intensive tasks re: various functions/analysis algorithims could be presented w/ the goal of finding the most efficient mathematical solutions & implementations that could easily be x-ferred to programming. Would this idea be worth pursuing? If so, your profile results have probably identified some likely candidate areas that a mathematician could likely improve upon.

Surely Seti has enough recognition and academic respect where it would present a credible subject & project for a graduate-level, aspiring mathematician? The real trick might be in aligning an individual's mathematical knowledge w/ prerequisite fundamentals supporting the kind of signal processing/analysis that S@H performs.

With a clear set of objectives and deliverables, I'm sure a few folks have the necessary contacts at respective universities where this could be possible. I may be able to accomplish this at my former Alma Mater.

Could even be interesting to see differing solutions to the same problems.
Just an idea if anyone thinks there's any merit to it.
Cheers!

BenHer · « **Reply #12 on:** 13 Aug 2006, 10:53:08 pm »

Great idea Gecko,

Heck...Anderson, and Korpela are working as part of a grant on U.C. Berkeley campus...plenty of student bodies (and brains) available I should think.

=Ben

Simon · « **Reply #13 on:** 14 Aug 2006, 07:42:51 am »

Quote from: BenHer on 13 Aug 2006, 01:12:11 am

...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:
Loop to do single ( ptr1 + ptr2 ) * .5 // automatically uses reciprocal to avoid divison
until address of ptr1+i ptr2+i ptr3+i are on a 16 byte boundary
loop to do SIMD adds like above until not enough bytes are left for an entire SIMD register
final loop to catch remaining values in buffers

pretty clever compiler.

Yup, that it is

Also would explain why Michael's code couldn't gain much over the code produced by ICC by default, since he basically did the same - wrote a few inline assembly functions like sumtables-sum5tables that did exactly that.

Gecko, great idea! Sadly, I don't have any acquaintances that could help in the maths department, but I do hope some of you guys do

Regards,
Simon.

BenHer · « **Reply #14 on:** 15 Aug 2006, 02:32:53 pm »

Figured out f_GetPeak optimization...new version MUCH faster than original.

The faster optimized GetPeak version currently using FPU vs original using SSE2.

Not sure exactly how much faster yet, probably at least 5 times...working on a timing/benchmark routine (like Eric's 'analyzeFuncs_vector.cpp' code) to time various function versions and also verify their output.

Found some interesting compile differences between /Arch:SSE /G6 vs. /OxW vs. /OxN . Sometimes /OxN isnt fastest.

Figured out how to tell ICC to super optimize v_getPowerSpectrum...hand coding could hardly improve on it.

I'm at work now so can't check assembly output, but I believe same super optimize worked for v_ChirpData (with a few code changes to avoid conditionals within loops).

GetTrueMean is really only a simple summing loop - this can get vectorized easilly.

getChiSq has lots of additional math in it but starts out with same speed problem as f_GetPeak...this could be improved.

Completed method for converting an existing function call into a function pointer call with easy code to produce alternate versions of existing functions (that need improvement).

Created separate 'optimize' library project so I can set individual file's optimize options and avoid things like "/Quipo" or "/QxW" for some files. Just an additional project in the IDE.

7.17% - gauss_fit (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%] )
5.62% - seti_analyze (inline: v_ChirpData[78%] - v_getPowerSpectrum[27%] )

Author Topic: Current Profile Analysis and points to optimze (Read 33131 times)

BenHer

Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Josef W. Segur

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Simon

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Simon

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Simon

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Gecko_R7

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze

Simon

Re: Current Profile Analysis and points to optimze

BenHer

Re: Current Profile Analysis and points to optimze