optimized sources

_heinz:

--- Quote from: j_groothu on 03 Nov 2007, 01:34:41 pm ---Surprise Surprise, a QxN build is faster on my Northwood :P
LOL

--- End quote ---
I have a Northwood too --->
CPU(s)
Number of CPUs 1

Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines ---> ;D

Jason G:
BOINCstats host CPUs, top 10 by number of hosts on SETI@home:

Pos.  CPU                                # Hosts       Total Credit
1     Intel(R) Pentium(R) 4 CPU 3.00GHz  104,449   1,920,980,979.29
2     Intel(R) Pentium(R) 4 CPU 2.80GHz   88,848   1,254,181,274.59
3     Intel(R) Pentium(R) 4 CPU 2.40GHz   57,309     633,952,931.43
4     Intel(R) Pentium(R) 4 CPU 3.20GHz   45,737     875,822,530.51
5     AMD Athlon(tm) 64 Processor 3000+   31,878     257,872,702.50
6     AMD Athlon(tm) 64 Processor 3200+   30,304     288,741,370.07
7     AMD Athlon(tm) Processor            27,726     129,774,610.58
8     Intel(R) Pentium(R) 4 CPU 2.00GHz   21,701     197,541,843.70
9     Intel(R) Pentium(R) 4 CPU 2.66GHz   19,200     208,668,039.95
10    AMD Athlon(tm) 64 Processor 3500+   19,049     191,994,766.55

We're both in the top 10 most popular :D, I have a #8 and a #4 :P [Doesn't it feel good to know you're with the 'in crowd'?]

[Must get around to trying to strip-mine those inner pulse folding loops for the P4 64K / 1 MB aliasing problem]
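
For readers who have not met the term: strip mining splits one long loop into an outer loop over fixed-size strips and an inner loop that stays inside a single strip. Below is a minimal sketch of that transformation applied to a three-way fold. The function names, the signature, and the strip length `kStrip` are all assumptions made for illustration and are not taken from pulsefind.cpp. Whether this helps with the P4 aliasing issue depends on the strip size and on the offsets between the three input streams, so it would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical strip length in floats; a real value would have to be tuned.
constexpr std::size_t kStrip = 1024;

// Plain three-way fold: dest[i] = one[i] + two[i] + three[i].
// Returns the largest sum, starting from 0.0f (inputs assumed non-negative).
float fold3_plain(const float* one, const float* two, const float* three,
                  float* dest, std::size_t n) {
    float tmax = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dest[i] = one[i] + two[i] + three[i];
        if (dest[i] > tmax) tmax = dest[i];
    }
    return tmax;
}

// Same fold, strip-mined: the outer loop walks strips of kStrip elements,
// the inner loop never leaves the current strip.
float fold3_strips(const float* one, const float* two, const float* three,
                   float* dest, std::size_t n) {
    assert(one && two && three && dest);
    float tmax = 0.0f;
    for (std::size_t base = 0; base < n; base += kStrip) {
        const std::size_t end = std::min(base + kStrip, n);  // last strip may be short
        for (std::size_t i = base; i < end; ++i) {
            dest[i] = one[i] + two[i] + three[i];
            if (dest[i] > tmax) tmax = dest[i];
        }
    }
    return tmax;
}
```

Both versions do the same arithmetic in the same order, so they should give identical results; only the traversal is restructured. The inner loop is then the natural place to try further changes (a local copy of the strip, different offsets) without touching the outer bookkeeping.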

_heinz:
It is worth speeding them up.... ;D

Although Dr. Who is already running his code... we give the old boxes a chance

squeezed the code of pulsefind.cpp again
sum1 and sum2 are no longer needed

here the case construct --->
switch (i) {
// case 30:
// sum1 = one[29] + two[29]; sum2 = one[28] + two[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
// if (sum1 > tmax1) tmax1 = sum1; if (sum2 > tmax2) tmax2 = sum2;
//seti_britta: new code:
case 30:
P->dest[29]= one[29] + two[29]+three[29]; P->dest[28]= one[28] + two[28]+three[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
if (P->dest[29] > tmax1) tmax1 = P->dest[29]; if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------

and here the loop construct
// ----------------------------------------------------------------------------
//   Function:   sum_func_ptt( sw_sum3_t31 )
//   Type     :   float
//   Content  :   folding subroutines, FPU optimized
//   parameter:   sw_sum3_t31
//   last update:23.09.2007   by:seti_britta   new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
register int i, j, k;
float tmax2, tmax1; //seti_britta: new
float *one = ss[0];
float *two = ss[0]+P->tmp0;
float *three = ss[0]+P->tmp1;
tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
i = P->di;
if ( i & 1 )
{
i -= 1;
P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
}
   for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
   {
P->dest[j]= one[j] + two[j] + three[j]; P->dest[k]= one[k] + two[k] + three[k];
if (P->dest[j] > tmax1) tmax1 = P->dest[j]; if (P->dest[k] > tmax2) tmax2 = P->dest[k];
   }
if (tmax1 > tmax2) return tmax1;
return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
maybe the compact loop has a chance
so far it compiles well... now we must measure to find the fastest
have fun
regards heinz ;D ;D
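
For anyone who wants to check the compact loop outside the project, here is a minimal standalone sketch. The real function is generated through the `sum_func_ptt` macro and reads its data from `ss[0]` and the `P` structure, none of which are shown in this thread, so the plain-pointer signature and the names `fold3_pairs` and `fold3_reference` are assumptions for illustration only.

```cpp
#include <cassert>

// Standalone stand-in for the posted loop: folds three arrays into dest,
// two elements per iteration, keeping two running maxima.
float fold3_pairs(const float* one, const float* two, const float* three,
                  float* dest, int n) {
    assert(n >= 0);
    float tmax1 = 0.0f, tmax2 = 0.0f;
    int i = n;
    if (i & 1) {
        // Odd length: handle the top element on its own so the rest is even.
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    // i is even here; j covers the odd indices, k the even ones, down to k == 0.
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}

// Straightforward one-element-at-a-time version, used only as a cross-check.
float fold3_reference(const float* one, const float* two, const float* three,
                      float* dest, int n) {
    float tmax = 0.0f;
    for (int i = 0; i < n; ++i) {
        dest[i] = one[i] + two[i] + three[i];
        if (dest[i] > tmax) tmax = dest[i];
    }
    return tmax;
}
```

Comparing the two on odd and even lengths is a cheap way to confirm that dropping `sum1` and `sum2` did not change the result before any timing is done.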

Jason G:
Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread for incorporating 3-phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the P4 at the same time.

Intel's theory suggests a possible 3 to 5 times improvement in certain code from fixing those P4 problems, and the 3-phase and prefetch techniques [à la the AMD paper] even more. If it adds up to a 10 to 20% crunch-time improvement I'll be happy, because it would bring my P4 3.2 back over 1000 RAC :D
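
A rough sketch of the three-phase idea, assuming it means: (1) touch a block of input so it is pulled into cache, (2) do the work on that block into a small in-cache scratch buffer, (3) write the scratch buffer out in one go. This is portable C++ only; it uses no prefetch instructions or streaming stores. The function name, the block size, and the scaling operation are placeholders and are not the chirp function or anything from the SETI sources. The 64-byte line size is taken from the CPU-Z listing earlier in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kBlockFloats = 2048;  // hypothetical block size (8 KB of floats)
constexpr std::size_t kLineFloats  = 16;    // 64-byte cache line / 4-byte float

// Placeholder operation: dst[i] = src[i] * gain, processed block by block.
void scale_three_phase(const float* src, float* dst, std::size_t n, float gain) {
    assert(src && dst);
    float scratch[kBlockFloats];
    volatile float sink = 0.0f;  // keeps the touch loop from being optimized away

    for (std::size_t base = 0; base < n; base += kBlockFloats) {
        const std::size_t len = std::min(kBlockFloats, n - base);

        // Phase 1: read one float per cache line to bring the block into cache.
        float touched = 0.0f;
        for (std::size_t i = 0; i < len; i += kLineFloats) touched += src[base + i];
        sink = touched;

        // Phase 2: do the actual work on the (now cached) block.
        for (std::size_t i = 0; i < len; ++i) scratch[i] = src[base + i] * gain;

        // Phase 3: write the finished block out in a single pass.
        std::memcpy(dst + base, scratch, len * sizeof(float));
    }
    (void)sink;
}
```

The point of the split is that reads and writes to main memory happen in long runs instead of being interleaved. How much that buys on a given CPU is an open question here and would need timing against the plain single loop.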

Jason G:
Progress so far, long way to go :D :
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, P4 Northwood 2.0A @ 2.1GHz, no HT, WinXP]

Tactic                                                              Type              Status        Effect
1- Better memcpy in GetFixedPot                                     Generic x86       Prelim Tests  ~0.3%
2- Out of Place FFTs / eliminating associated memcopies             Intel IPP         Initial       ~?.?%
3- Once off seti.cpp 8meg memcpy                                    Generic x86       Untested      ~0.?%
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp  Generic x86       Untested      ~?.?%
5- Compiler Flags (xN SSE2 p4 Specific)                             P4 specific       Tested        ~10%
6- Strip Mined Inner loops (p4 specific, 64k & 1M variants)         P4, possible x86  Untested      ~??%
7- GaussFit Improvements                                            To be determined

~ means approximate, my system, 'your mileage may vary'.

[Anyone, please feel free to suggest additions, updates or corrections to this list:
either fairly generic OR P4-specific will do :D. Consider equivalent xP SSE3 builds as already on the list for later]

Jason
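
For filling in the Effect column on other machines, a minimal timing sketch is given below. It assumes the percentage is read as crunch-time reduction relative to the baseline build. The helper names `percent_faster` and `time_seconds` are made up for illustration and are not part of any SETI or BOINC code.

```cpp
#include <cassert>
#include <chrono>

// Percentage by which the candidate is faster than the baseline
// (negative means the candidate is slower).
double percent_faster(double baseline_seconds, double candidate_seconds) {
    assert(baseline_seconds > 0.0);
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds;
}

// Wall-clock time, in seconds, of calling `work` `reps` times.
template <typename Work>
double time_seconds(Work work, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Typical use would be to time the baseline and the modified routine on the same input with the same `reps`, then pass the two results to `percent_faster`. Repeating the run a few times and taking the best figure helps keep background load out of the numbers.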
