optimized sources
_heinz:
--- Quote from: j_groothu on 03 Nov 2007, 01:34:41 pm ---Surprise Surprise, a QxN build is faster on my Northwood :P
LOL
--- End quote ---
I have a Northwood too --->
CPU(s)
Number of CPUs 1
Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines ---> ;D
Jason G:
Boincstats Host cpus, top 10 highest number on seti@home:
Pos. | CPU | # | Total Credit
1 | Intel(R) Pentium(R) 4 CPU 3.00GHz | 104,449 | 1,920,980,979.29
2 | Intel(R) Pentium(R) 4 CPU 2.80GHz | 88,848 | 1,254,181,274.59
3 | Intel(R) Pentium(R) 4 CPU 2.40GHz | 57,309 | 633,952,931.43
4 | Intel(R) Pentium(R) 4 CPU 3.20GHz | 45,737 | 875,822,530.51
5 | AMD Athlon(tm) 64 Processor 3000+ | 31,878 | 257,872,702.50
6 | AMD Athlon(tm) 64 Processor 3200+ | 30,304 | 288,741,370.07
7 | AMD Athlon(tm) Processor | 27,726 | 129,774,610.58
8 | Intel(R) Pentium(R) 4 CPU 2.00GHz | 21,701 | 197,541,843.70
9 | Intel(R) Pentium(R) 4 CPU 2.66GHz | 19,200 | 208,668,039.95
10 | AMD Athlon(tm) 64 Processor 3500+ | 19,049 | 191,994,766.55
We're both in the top 10 most popular :D, I have a #8 & #4 :P [Doesn't it feel good to know you're with the 'in crowd'?]
[Must get around to trying to strip mine those inner pulse folding loops for the p4 64k / 1meg aliasing problem]
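For readers unfamiliar with the term, strip mining just means splitting one long loop into an outer loop over fixed-size strips and an inner loop within each strip. A minimal sketch of a three-way fold written that way, assuming plain arrays instead of the real pulsefind data layout; the function name `fold3_strip_mined` and the strip length `kStrip` are made up for illustration. On its own this does not change the memory access order; it is only the scaffold into which per-strip copying or prefetching could be placed, and whether any of that helps with the P4 aliasing would have to be measured.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical strip length in floats; the right value for a P4 is a tuning question.
const std::size_t kStrip = 1024;

// Three-way fold done strip by strip; returns the maximum of the folded sums.
// Starts from 0.0f, like the tmax variables in the pulsefind code in this thread.
float fold3_strip_mined(const float* one, const float* two, const float* three,
                        float* dest, std::size_t n) {
    float tmax = 0.0f;
    for (std::size_t base = 0; base < n; base += kStrip) {   // outer loop: one strip
        const std::size_t end = std::min(base + kStrip, n);  // last strip may be short
        for (std::size_t j = base; j < end; ++j) {           // inner loop: within strip
            dest[j] = one[j] + two[j] + three[j];
            if (dest[j] > tmax) tmax = dest[j];
        }
    }
    return tmax;
}
```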
_heinz:
It is worth speeding them up.... ;D
Although Dr. Who is already running his code... we give the old boxes a chance
I squeezed the code of pulsefind.cpp again
sum1 and sum2 are no longer needed
here is the case construct --->
switch (i) {
// case 30:
// sum1 = one[29] + two[29]; sum2 = one[28] + two[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
// if (sum1 > tmax1) tmax1 = sum1; if (sum2 > tmax2) tmax2 = sum2;
//seti_britta: new code:
case 30:
P->dest[29]= one[29] + two[29]+three[29]; P->dest[28]= one[28] + two[28]+three[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
if (P->dest[29] > tmax1) tmax1 = P->dest[29]; if (P->dest[28] > tmax2) tmax2 = P->dest[28];
and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------
and here is the loop construct
// ----------------------------------------------------------------------------
// Function: sum_func_ptt( sw_sum3_t31 )
// Type : float
// Content : folding subroutines, FPU optimized
// parameter: sw_sum3_t31
// last update:23.09.2007 by:seti_britta new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
register int i, j, k;
float tmax2, tmax1; //seti_britta: new
float *one = ss[0];
float *two = ss[0]+P->tmp0;
float *three = ss[0]+P->tmp1;
tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
i = P->di;
if ( i & 1 )
{
i -= 1;
P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
}
for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
{
P->dest[j]= one[j] + two[j] + three[j]; P->dest[k]= one[k] + two[k] + three[k];
if (P->dest[j] > tmax1) tmax1 = P->dest[j]; if (P->dest[k] > tmax2) tmax2 = P->dest[k];
}
if (tmax1 > tmax2) return tmax1;
return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
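To try the compact loop outside the client, it can be pulled into a free function. This is only a sketch under assumptions: plain pointers and an explicit length `di` stand in for `P->dest`, the `ss[0]` offsets and `P->di`, and the name `fold3_pairs` is invented here. The loop logic itself follows the posted code, including the odd-length case, where the last element is folded on its own before the pairwise walk.

```cpp
// Hypothetical standalone version of the compact loop: same pairwise walk from the
// top index down, with the odd element handled first.
float fold3_pairs(const float* one, const float* two, const float* three,
                  float* dest, int di) {
    float tmax1 = 0.0f, tmax2 = 0.0f;
    int i = di;
    if (i & 1) {                  // odd length: fold the last element on its own
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    // j walks the odd indices, k the even ones; the last pass handles j = 1, k = 0.
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}
```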
maybe the compact loop has a chance
so far it compiles well... now we must measure to find the fastest
have fun
regards heinz ;D ;D
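The "measure to find the fastest" step could be sketched with a small timing harness like the one below. It assumes a C++11 compiler for `std::chrono`, and the names `FoldFn`, `fold3_simple` and `time_fold` are hypothetical. Real numbers should of course come from crunching actual work units; this only times one fold function over synthetic buffers.

```cpp
#include <chrono>
#include <vector>

// Signature shared by the fold variants to be compared (an assumption of this sketch).
typedef float (*FoldFn)(const float*, const float*, const float*, float*, int);

// Plain reference fold, used here only as something to time.
float fold3_simple(const float* one, const float* two, const float* three,
                   float* dest, int n) {
    float tmax = 0.0f;
    for (int j = 0; j < n; ++j) {
        dest[j] = one[j] + two[j] + three[j];
        if (dest[j] > tmax) tmax = dest[j];
    }
    return tmax;
}

// Runs fn `reps` times over the same synthetic buffers and returns elapsed seconds.
// The accumulated return values are handed back through result_out so the
// compiler cannot discard the calls.
double time_fold(FoldFn fn, int n, int reps, float* result_out) {
    std::vector<float> one(n, 1.0f), two(n, 2.0f), three(n, 3.0f), dest(n);
    float sink = 0.0f;
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        sink += fn(one.data(), two.data(), three.data(), dest.data(), n);
    const auto t1 = std::chrono::steady_clock::now();
    *result_out = sink;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `time_fold` once per variant with the same `n` and `reps` gives comparable figures; several runs per variant help to smooth out noise from other processes.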
Jason G:
Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread for incorporating 3 phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the p4 at the same time.
Intel theories suggest a possible 3 to 5 times improvement in certain code by fixing those p4 problems, and the 3 phase & prefetch techniques [a la AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my p4 3.2 back over 1000 RAC :D
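As a rough illustration of the 3 phase / block prefetch idea mentioned above: phase 1 touches one word per cache line of a block so the whole block is pulled into cache, phase 2 does the work out of cache into a small scratch buffer, and phase 3 writes the scratch buffer out. This is only a sketch under assumptions: a 64-byte cache line, a made-up block size `kBlockFloats`, and a trivial scaling standing in for the real chirp work; `scale_three_phase` is a hypothetical name. A tuned version would likely use streaming stores in phase 3, which are left out here to keep the code portable.

```cpp
#include <algorithm>
#include <cstddef>

const std::size_t kLineFloats  = 64 / sizeof(float);  // assumed 64-byte cache line
const std::size_t kBlockFloats = 2048;                 // hypothetical block size

// Scales src into dst block by block using the three phases described above.
void scale_three_phase(const float* src, float* dst, std::size_t n, float factor) {
    float temp[kBlockFloats];      // scratch buffer, small enough to stay cached
    volatile float sink = 0.0f;    // keeps the phase 1 reads from being optimized out
    for (std::size_t base = 0; base < n; base += kBlockFloats) {
        const std::size_t len = std::min(kBlockFloats, n - base);
        // Phase 1: block prefetch, one read per cache line of the input block.
        for (std::size_t j = 0; j < len; j += kLineFloats)
            sink = sink + src[base + j];
        // Phase 2: process the block from cache into the scratch buffer.
        for (std::size_t j = 0; j < len; ++j)
            temp[j] = src[base + j] * factor;
        // Phase 3: write the finished block out to its destination.
        std::copy(temp, temp + len, dst + base);
    }
}
```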
Jason G:
Progress so far, Long way to go :D :
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz,NoHT, WinXP]
Tactic | Type | Status | Effect
1- Better memcpy in GetFixedPot | Generic x86 | Prelim Tests | ~0.3%
2- Out of Place FFTs / eliminating associated memcopies | Intel IPP | Initial | ~?.?%
3- Once off seti.cpp 8meg memcpy | Generic x86 | Untested | ~0.?%
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp | Generic x86 | Untested | ~?.?%
5- Compiler Flags (xN SSE2 p4 Specific) | P4 specific | Tested | ~10%
6- Strip Mined Inner loops (p4 specific, 64k & 1M variants) | P4, possible x86 | Untested | ~??%
7- GaussFit Improvements | To be Determined
~ means approximate, my system, 'your mileage may vary'.
[Please anyone feel free to suggest additions, updates or corrections to this list:
either fairly generic OR p4 specific will do :D, Consider equivalent xP SSE3 builds as already on the list for later]
Jason