optimized sources
_heinz:
Hi Jason,
The compact loop construction PFLOOP runs.
some first impressions: --->
FFTlen 8192, PulsePoTLen 24, 1048576 loops.
Standard: 9049838772 ticks, 8630.60 per loop, 0 rpt
Plan < 512 FPU swi ! : 3892589440 ticks, 3712.26 per loop, 0 rpt
Plan < 512 AK SSE ! : 5260680348 ticks, 5016.98 per loop, 0 rpt
Plan < 512 BHx SSE ! : 13525734128 ticks, 12899.15 per loop, 0 rpt
Plan < 512 BH SSE ! : 9339515956 ticks, 8906.86 per loop, 0 rpt
ar=0.435000 done. Total flop count: 108711033335.208650
PulTimB 0.5 Totals: Ratio Ticks
standard: 1.000 87462139372
Plan < 512 FPU swi ! : 0.609 53291444096
Plan < 512 AK SSE ! : 0.634 55471031448
Plan < 512 BHx SSE ! : 0.990 86608697300
Plan < 512 BH SSE ! : 0.772 67556177968
I'm surprised, I did not expect it --->
Compared with the standard optimized case construct, the loop construct is ~12% faster at FFTlen 8192, PulsePoTLen 24:
LOOP construct, FFTlen 8192, PulsePoTLen 24 ---> 3892589440 ticks, 3712.26 per loop
original code, FFTlen 8192, PulsePoTLen 24 ---> 4427492996 ticks, 4222.39 per loop
It looks like the LOOP construct is faster in this case, but not in the overall totals....
Further measurements must confirm it.
All cases were compiled with /Zi /Od (no compiler optimization) using the MS compiler,
so further improvement can be expected from using the Intel Compiler ;D
Jason G:
Yeah, it is good that I can have a look at Pulsefind again now that msvs is fixed! :D I think you might be finding something similar to what I've been seeing in a different part of the code:
- the code is very sensitive to certain optimisation settings,
I haven't worked out yet whether this is because the optimiser is improving some weakness in the code, or whether the code is written to take advantage of the optimisers, or perhaps [more likely] a little of both. Time [and examining the assembly output listing ;) ] will tell,
I would easily place that 12% within the range of what the optimiser can change, so I have learned [the hard way] to take care to use final settings for timing comparisons. That unexpected surprise might be a phantom, though it will be nice if it isn't :D
Jason
_heinz:
yes, all this must be analyzed.... I will have a look at the asm code...to see what is really going on there.
As Joe already stated, shorter code is not automatically the fastest.
I have eliminated a lot of unnecessary assignments to some variables, so intermediate results are no longer stored back to those variables; they are held in registers for the next operations. ;) fewer instructions for the same result
Keep in mind that every instruction must be loaded into the instruction register and executed by the instruction decoder, so well-squeezed code can show an effect too, especially in big loops. ;) fewer instructions mean shorter times
Optimizing by setting compiler flags ... unrolling loops, filling the cache properly etc. is another big field....
heinz
_heinz:
23 000 accesses .... I didn't expect it..... looks like a hot thread
greetings to all who are reading here
heinz ;)
Jason G:
--- Quote from: seti_britta on 12 Nov 2007, 11:41:57 am ---23 000 accesses .... I didn't expect it..... looks like a hot thread
greetings to all who are reading here
heinz ;)
--- End quote ---
I think it'll get much bigger yet, with the optimisations you are trying, and maybe a few from me too if I can consolidate a little :D (and get some more study and work done first! ;D)