optimized sources

_heinz:
Hi Jason,

The compact loop construction PFLOOP now runs.
Some first impressions:

FFTlen 8192, PulsePoTLen 24,  1048576 loops.
             Standard:     9049838772 ticks, 8630.60 per loop, 0 rpt
Plan < 512 FPU swi ! :     3892589440 ticks, 3712.26 per loop, 0 rpt
 Plan < 512 AK SSE ! :     5260680348 ticks, 5016.98 per loop, 0 rpt
Plan < 512 BHx SSE ! :    13525734128 ticks, 12899.15 per loop, 0 rpt
 Plan < 512 BH SSE ! :     9339515956 ticks, 8906.86 per loop, 0 rpt

ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5    Totals:  Ratio            Ticks
             standard:  1.000      87462139372
Plan < 512 FPU swi ! :  0.609      53291444096
 Plan < 512 AK SSE ! :  0.634      55471031448
Plan < 512 BHx SSE ! :  0.990      86608697300
 Plan < 512 BH SSE ! :  0.772      67556177968

I'm surprised, I did not expect this:
Against the standard optimized case construct, the loop construct is ~12% faster at FFTlen 8192, PulsePoTLen 24.
The loop construct at FFTlen 8192, PulsePoTLen 24: 3892589440 ticks, 3712.26 per loop
The original code at FFTlen 8192, PulsePoTLen 24: 4427492996 ticks, 4222.39 per loop
So the LOOP construct looks faster in this case, but not in the overall summary...
Further measurements will have to confirm it.
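
For anyone who wants to re-check the arithmetic, here is a minimal sketch in C. It assumes the ~12% comes from the two per-loop figures quoted above and that "per loop" is simply total ticks divided by the 1048576 loops. The function names are made up for illustration and are not from the actual sources.

```c
#include <stdio.h>

/* Average tick cost of one iteration: total ticks divided by loop count. */
double ticks_per_loop(unsigned long long ticks, unsigned long long loops)
{
    return (double)ticks / (double)loops;
}

/* Fraction of time saved by `candidate` relative to `baseline`. */
double time_saved(double baseline, double candidate)
{
    return (baseline - candidate) / baseline;
}

/* Hypothetical helper: prints the comparison using the tick counts
   quoted in the post (loop construct vs. original code). */
void print_comparison(void)
{
    const unsigned long long loops = 1048576ULL;
    double loop_version = ticks_per_loop(3892589440ULL, loops);
    double original     = ticks_per_loop(4427492996ULL, loops);

    printf("loop construct: %.2f ticks per loop\n", loop_version);
    printf("original code : %.2f ticks per loop\n", original);
    printf("time saved    : %.1f%%\n",
           100.0 * time_saved(original, loop_version));
}
```

Calling print_comparison() from main() should give back the 3712.26 and 4222.39 per-loop values and a saving of roughly 12%.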

All cases were compiled with /Zi /Od (no compiler optimization) using the MS compiler,
so further improvement can be expected from using the Intel compiler  ;D



Jason G:
Yeah, it is good that I can have a look at Pulsefind again now that msvs is fixed! :D  I think you might be finding something similar to what I've been seeing in a different part of the code:

     -  the code is very sensitive to certain optimisation settings,

I haven't worked out yet whether this is because the optimiser is improving some weakness in the code, or whether the code is written to take advantage of the optimisers, or perhaps [more likely] a little of both.  Time [and examining the assembly output listing  ;) ] will tell.

I would easily place that 12% within the range of the optimiser's effect, and I have learned [the hard way] to take care to use final settings for timing comparisons.  That unexpected surprise might be a phantom, though it will be nice if it isn't :D
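
As a rough illustration of comparing like with like, here is a minimal timing sketch in C. It is not the harness used for the numbers in this thread: best_of, sample_work and the counters are hypothetical names, and it uses the standard clock() rather than a CPU tick counter. The idea is only that both variants are built with the same final settings and timed by the same routine, taking the fastest of several runs to reduce noise.

```c
#include <time.h>

/* Hypothetical signature for a routine under test. */
typedef void (*bench_fn)(void);

volatile double sink;        /* keeps the sample work from being removed */
unsigned long call_count;    /* how many times the sample work has run */

/* Stand-in workload; in practice this would call the code being compared. */
void sample_work(void)
{
    double acc = 0.0;
    for (int i = 0; i < 100000; ++i)
        acc += (double)i * 0.5;
    sink = acc;
    ++call_count;
}

/* Runs fn `reps` times and returns the fastest run in clock() units. */
clock_t best_of(bench_fn fn, int reps)
{
    clock_t best = 0;
    for (int i = 0; i < reps; ++i) {
        clock_t start = clock();
        fn();
        clock_t elapsed = clock() - start;
        if (i == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

A comparison would then call best_of() once per variant, with every variant compiled under the same settings that will be used for the final build.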

Jason

_heinz:
Yes, all this must be analyzed... I will have a look at the asm code to see what is really going on there.
As Joe already stated, shorter code is not automatically the fastest.

I have eliminated a lot of unnecessary assignments to variables, so intermediate results are no longer stored to those variables; they are held in registers for the next operations.  ;)  Fewer instructions for the same result.
Keep in mind that every instruction must be loaded into the instruction register and decoded before it is executed, so well-squeezed code can show an effect too, especially in big loops.  ;)  Fewer instructions mean shorter times.
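
A small sketch of the kind of change described above, in C. The two functions are hypothetical and not taken from the Pulsefind code; they only show the pattern of dropping named intermediate variables. Whether it makes a measurable difference depends on the compiler and its settings: with optimization switched on, both versions will often compile to the same instructions.

```c
/* Version with named intermediates: an unoptimized build may write each
   one to memory and read it back for the next operation. */
float power_with_temps(float re, float im)
{
    float re2 = re * re;
    float im2 = im * im;
    float sum = re2 + im2;
    return sum;
}

/* Version without the intermediate assignments: the partial results are
   plain expression temporaries, which the compiler is free to keep in
   registers. */
float power_direct(float re, float im)
{
    return re * re + im * im;
}
```

Both functions should return the same value for the same input, so the listing files can be compared directly to see which stores actually disappear.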

Optimizing by setting compiler flags, unrolling loops, filling the cache properly etc. is another big field....

heinz

_heinz:
23 000 accesses... I didn't expect that... looks like a hot thread
Greetings to all who are reading here
heinz  ;)

Jason G:

--- Quote from: seti_britta on 12 Nov 2007, 11:41:57 am ---23 000 accesses... I didn't expect that... looks like a hot thread
Greetings to all who are reading here
heinz  ;)

--- End quote ---

I think it'll get much bigger yet, with the optimisations you are trying, and maybe a few from me too if I can consolidate a little :D (and get some more study and work done first!  ;D)
