66% - An inlined version of the "getFixedPot" function. About half of that comes from this line:

    fp_PoT[ul_PoT_i] = fp_PowerSpectrum[ul_PoT + ul_PoTChunk_i];

mostly due to cache misses (the value being copied is fetched from what amounts to random memory addresses).

33% - The following loop:

    for (i = 0; i < PulsePoTLen; i++) {
        PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset + i) * FftLength];
    }
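To make the access pattern concrete, here is a sketch of the strided copy and of what a transposed-PoT version (the caching idea mentioned below) would look like. The function names, the NumSpectra parameter, and the layout are my own assumptions, not the actual client code:

    #include <cstddef>
    #include <cstring>

    // Sketch of why the copy misses the cache: the spectrum is stored time-major,
    // so walking one PoT (one frequency bin across time) jumps FftLength floats
    // per step, and nearly every load touches a fresh cache line.
    void copy_pot_strided(const float* PowerSpectrum, float* PulsePoT,
                          int ThisPoT, int TOffset, int PulsePoTLen, int FftLength)
    {
        for (int i = 0; i < PulsePoTLen; i++)
            PulsePoT[i] = PowerSpectrum[ThisPoT + (TOffset + i) * FftLength];  // stride = FftLength floats
    }

    // With a one-time transposed (frequency-major) copy of the spectrum, the same
    // PoT is contiguous and the copy becomes a sequential read / plain memcpy.
    void copy_pot_transposed(const float* TransposedSpectrum, float* PulsePoT,
                             int ThisPoT, int TOffset, int PulsePoTLen, int NumSpectra)
    {
        std::memcpy(PulsePoT,
                    TransposedSpectrum + (std::size_t)ThisPoT * NumSpectra + TOffset,
                    (std::size_t)PulsePoTLen * sizeof(float));
    }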
Did a run on a full WU (the test WU included with the source). Function usage is pretty much the same, so there are not many new areas for improvement.
13.5% - analyze_pot ... again, the same 3 code sections
7.17% - gauss_fit (inlined: getChiSq [36%], GetTrueMean [36%], f_GetPeak [27%])
5.62% - seti_analyze (inlined: v_ChirpData [78%], v_getPowerSpectrum [27%])
5.35% - find_pulse
3.22% - an IPP FFT routine
2.68% - an IPP FFT routine
2.54% - chooseGaussEvent
1.61% - memcpy

If you were to implement a blindingly fast memcpy routine, speeding it up by 5 times (400%), you would speed up overall WU crunching by only about 1.3%... thus not a great place to put a lot of effort.

analyze_pot - The Alex Kahn transposed PoT caching would probably speed this up a lot (and it uses memcpy, so maybe a fast memcpy is a good idea after all).

find_pulse - With SSE-optimized functions it could probably be sped up by 200% (3x), saving about 3.5% of overall WU time.
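To spell out the memcpy arithmetic, here is the Amdahl's-law check on the 1.61% figure above (a throwaway snippet, not client code):

    #include <cstdio>

    // Amdahl's law: a 5x faster memcpy applied to a 1.61% share of the run time.
    int main() {
        double fraction  = 0.0161;                                  // memcpy share of WU time (from the profile)
        double speedup   = 5.0;                                     // hypothetical 5x faster memcpy
        double new_total = (1.0 - fraction) + fraction / speedup;   // rest unchanged + faster memcpy part
        std::printf("overall saving: %.2f%%\n", (1.0 - new_total) * 100.0);  // prints ~1.29%
        return 0;
    }

The same arithmetic on find_pulse (5.35% sped up 3x) gives the ~3.5% saving quoted above.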
Do you think it would be practical to combine the FFT, getPowerSpectrum, and transpose steps? It seems like it might be more efficient, and perhaps IPP has some functions meant for generating a spectrum-analyzer display that could be used here.
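Something like the following is what I have in mind. This is only a sketch under my own assumptions: fft_inplace stands in for whatever FFT routine is actually used, the array names and layout are made up, and I haven't checked what IPP's spectrum functions actually provide:

    #include <complex>
    #include <cstddef>

    // Stand-in for the real (e.g. IPP) FFT call; assumed to be provided elsewhere.
    void fft_inplace(std::complex<float>* data, int fft_len);

    // Sketch of fusing the power-spectrum step with the transpose: after each FFT,
    // write |X|^2 directly into a frequency-major array, so every PoT ends up
    // contiguous and the later PoT copies become sequential reads.
    void power_spectrum_transposed(std::complex<float>* ChirpedData,   // NumSpectra x FftLength (time-major)
                                   float* TransposedSpectrum,          // FftLength x NumSpectra (freq-major)
                                   int FftLength, int NumSpectra)
    {
        for (int s = 0; s < NumSpectra; s++) {
            std::complex<float>* X = ChirpedData + (std::size_t)s * FftLength;
            fft_inplace(X, FftLength);
            for (int bin = 0; bin < FftLength; bin++)
                TransposedSpectrum[(std::size_t)bin * NumSpectra + s] = std::norm(X[bin]);
        }
    }

The scattered writes in the inner loop cost cache misses too, of course, but each element is written once here instead of being re-read many times during pulse finding.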
<chirps>
  <chirp_parameter_t>
    <chirp_limit>20</chirp_limit>
    <fft_len_flags>262136</fft_len_flags>
  </chirp_parameter_t>
  <chirp_parameter_t>
    <chirp_limit>50</chirp_limit>
    <fft_len_flags>65528</fft_len_flags>
  </chirp_parameter_t>
</chirps>
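If I'm reading fft_len_flags right, it's a bitmask of FFT lengths, bit n set meaning length 2^n is searched up to that chirp limit. That interpretation is a guess on my part, but decoding the two values on that assumption gives lengths 8..131072 up to chirp 20 and 8..32768 up to chirp 50:

    #include <cstdio>

    // Guess: bit n of fft_len_flags means "FFT length 2^n is searched up to this
    // chirp limit".  262136 has bits 3..17 set, 65528 has bits 3..15 set.
    int main() {
        const unsigned flags[]  = { 262136u, 65528u };
        const int      limits[] = { 20, 50 };
        for (int j = 0; j < 2; j++) {
            std::printf("chirp_limit %d: fft lengths", limits[j]);
            for (int n = 0; n < 32; n++)
                if (flags[j] & (1u << n))
                    std::printf(" %d", 1 << n);
            std::printf("\n");
        }
        return 0;
    }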
float tmp_max = 0;
for (int i = 0; i < length; i++) {
    register float tmpfloat = (ptr1[i] + ptr2[i]) / 2;
    sums[i] = tmpfloat;
    if (tmpfloat > tmp_max) {
        tmp_max = tmpfloat;
    }
}

... becomes ...

Tree Samples  Address   Code Bytes               Source                        Line #  CPU0  CPU1
          5   0x410629  F3 0F 10 05 48 17 50 00  movss xmm0,[00501748h]             5     5
              0x41063d  89 0C 24                 mov [esp],ecx                      6    36
              0x410640  8B 4D 14                 mov ecx,[ebp+14h]                  7    17    19
              0x41064b  8B 75 10                 mov esi,[ebp+10h]                  8    38
              0x41064e  33 FF                    xor edi,edi                        9    18    20
              0x410650  F3 0F 10 14 BE           movss xmm2,[esi+edi*4]            10    99
              0x410655  F3 0F 58 14 B9           addss xmm2,[ecx+edi*4]            11    40    59
         11   0x41065a  F3 0F 59 D0              mulss xmm2,xmm0                   12     7     4
         23   0x41067c  0F 28 05 A0 16 50 00     movaps xmm0,[005016a0h]           13    10    13
              0x410699  0F 10 14 B9              movups xmm2,[ecx+edi*4]           14   495
              0x41069d  0F 58 14 BA              addps xmm2,[edx+edi*4]            15   194   301
        270   0x4106a1  0F 59 D0                 mulps xmm2,xmm0                   16   122   148
        175   0x4106ab  0F 10 4C B9 10           movups xmm1,[ecx+edi*4+10h]       17    74   101
        359   0x4106b0  0F 58 4C BA 10           addps xmm1,[edx+edi*4+10h]        18   154   205
        199   0x4106b5  0F 59 C8                 mulps xmm1,xmm0                   19    76   123
        470   0x4106cf  0F 10 24 B9              movups xmm4,[ecx+edi*4]           20   201   269
       1284   0x4106d3  0F 10 14 BA              movups xmm2,[edx+edi*4]           21   540   744
       1204   0x4106d7  0F 58 E2                 addps xmm4,xmm2                   22   532   672
        510   0x4106da  0F 59 E0                 mulps xmm4,xmm0                   23   204   306
        535   0x4106e4  0F 10 4C B9 10           movups xmm1,[ecx+edi*4+10h]       24   201   334
       1066   0x4106e9  0F 10 5C BA 10           movups xmm3,[edx+edi*4+10h]       25   461   605
       1142   0x4106ee  0F 58 CB                 addps xmm1,xmm3                   26   478   664
        539   0x4106f1  0F 59 C8                 mulps xmm1,xmm0                   27   213   326
        119   0x410726  F3 0F 10 05 48 17 50 00  movss xmm0,[00501748h]            28    49    70
        136   0x41072e  8B 45 14                 mov eax,[ebp+14h]                 29    60    76
        114   0x410731  8B 55 10                 mov edx,[ebp+10h]                 30    48    66
          5   0x410734  F3 0F 10 14 BA           movss xmm2,[edx+edi*4]            31     1     4
       1751   0x410739  F3 0F 58 14 B8           addss xmm2,[eax+edi*4]            32   683  1068
        143   0x41073e  F3 0F 59 D0              mulss xmm2,xmm0                   33    44    99
...so basically:

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one. It creates three loops out of the one loop:

- a scalar loop doing single (ptr1[i] + ptr2[i]) * .5 operations (it automatically multiplies by the reciprocal to avoid the division) until the addresses ptr1+i, ptr2+i, and sums+i reach a 16-byte boundary
- a loop doing SIMD adds like the ones above, until there aren't enough elements left to fill an entire SIMD register
- a final loop to catch the remaining values in the buffers

Pretty clever compiler.
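For reference, here is roughly what that three-loop pattern looks like written out by hand with SSE intrinsics. This is a sketch of the shape of the generated code, not the compiler's actual output; it leaves out the running-max part of the original loop, and the function name and the choice to align on the output pointer are my own assumptions:

    #include <cstdint>
    #include <xmmintrin.h>   // SSE intrinsics

    // Peel / packed / tail structure for sums[i] = (ptr1[i] + ptr2[i]) * 0.5f.
    // (The tmp_max bookkeeping from the original loop is omitted to keep the pattern visible.)
    void average_buffers(const float* ptr1, const float* ptr2, float* sums, int length)
    {
        const __m128 half = _mm_set1_ps(0.5f);   // multiply by the reciprocal instead of dividing
        int i = 0;

        // 1) scalar peel loop until the output pointer reaches a 16-byte boundary
        while (i < length && ((uintptr_t)(sums + i) & 15) != 0) {
            sums[i] = (ptr1[i] + ptr2[i]) * 0.5f;
            ++i;
        }

        // 2) packed body: four floats per iteration; the inputs may still be
        //    unaligned, hence movups-style loads like the ones in the dump above
        for (; i + 4 <= length; i += 4) {
            __m128 a = _mm_loadu_ps(ptr1 + i);
            __m128 b = _mm_loadu_ps(ptr2 + i);
            _mm_store_ps(sums + i, _mm_mul_ps(_mm_add_ps(a, b), half));
        }

        // 3) scalar tail loop for whatever elements remain
        for (; i < length; ++i)
            sums[i] = (ptr1[i] + ptr2[i]) * 0.5f;
    }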