optimized sources

_heinz:

--- Quote from: j_groothu on 03 Nov 2007, 01:34:41 pm ---Surprise Surprise, a  QxN build is faster on my Northwood :P
LOL     

--- End quote ---
I have a Northwood too --->
CPU(s)   
Number of CPUs 1
 
Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines --->  ;D


Jason G:
BOINCstats host CPUs, top 10 by number of hosts on seti@home:

Pos.  CPU                                  #        Total Credit
1     Intel(R) Pentium(R) 4 CPU 3.00GHz    104,449  1,920,980,979.29
2     Intel(R) Pentium(R) 4 CPU 2.80GHz     88,848  1,254,181,274.59
3     Intel(R) Pentium(R) 4 CPU 2.40GHz     57,309    633,952,931.43
4     Intel(R) Pentium(R) 4 CPU 3.20GHz     45,737    875,822,530.51
5     AMD Athlon(tm) 64 Processor 3000+     31,878    257,872,702.50
6     AMD Athlon(tm) 64 Processor 3200+     30,304    288,741,370.07
7     AMD Athlon(tm) Processor              27,726    129,774,610.58
8     Intel(R) Pentium(R) 4 CPU 2.00GHz     21,701    197,541,843.70
9     Intel(R) Pentium(R) 4 CPU 2.66GHz     19,200    208,668,039.95
10    AMD Athlon(tm) 64 Processor 3500+     19,049    191,994,766.55

We're both in the top 10 most popular :D, I have a #8 & a #4 :P [Doesn't it feel good to know you're with the 'in crowd'?]

[Must get around to trying to strip mine those inner pulse folding loops for the P4 64K / 1 MB aliasing problem]
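
A minimal sketch of what strip mining a folding loop could look like, for anyone following along. The function name, signature and strip length are all assumptions for illustration, not code from pulsefind.cpp. It only shows the loop restructuring: fold a short strip, then scan that same strip for the maximum while it should still be in L1. Whether this actually helps with the 64K aliasing depends on the offsets between the buffers and would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical strip length: 256 floats = 1 KB per stream, so three input
// strips plus the output strip (4 KB total) fit in an 8 KB L1 data cache.
const std::size_t kStrip = 256;

// Folds three streams into dest and returns the largest folded value.
// Illustrative stand-in only; not the real pulsefind.cpp interface.
float fold3_strip_mined(const float* one, const float* two, const float* three,
                        float* dest, std::size_t n) {
    assert(n > 0);
    float best = one[0] + two[0] + three[0];
    for (std::size_t base = 0; base < n; base += kStrip) {
        const std::size_t end = std::min(base + kStrip, n);
        // Pass 1 over the strip: fold the three streams.
        for (std::size_t i = base; i < end; ++i)
            dest[i] = one[i] + two[i] + three[i];
        // Pass 2 over the same strip: track the maximum.
        for (std::size_t i = base; i < end; ++i)
            if (dest[i] > best) best = dest[i];
    }
    return best;
}
```

The two passes are kept separate on purpose, so the effect of the strip size can be seen when timing different values of kStrip.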

_heinz:
It is worth speeding them up.... ;D

Although Dr. Who is already running his code... we give the old boxes a chance

I squeezed the code in pulsefind.cpp again:
sum1 and sum2 are no longer needed

here the case construct --->
  switch (i) {
//    case 30:
//      sum1 = one[29] + two[29];           sum2 = one[28] + two[28];
//      sum1 += three[29];                  sum2 += three[28];
//      P->dest[29] = sum1;                 P->dest[28] = sum2;
//      if (sum1 > tmax1) tmax1 = sum1;     if (sum2 > tmax2) tmax2 = sum2;
 //seti_britta: new code:
    case 30:
      P->dest[29]= one[29] + two[29]+three[29];           P->dest[28]= one[28] + two[28]+three[28];
 //     sum1 += three[29];                  sum2 += three[28];
 //     P->dest[29] = sum1;                 P->dest[28] = sum2;
      if (P->dest[29] > tmax1) tmax1 = P->dest[29];     if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------

and here the loop construct
// ----------------------------------------------------------------------------
//   Function:   sum_func_ptt( sw_sum3_t31 )
//   Type     :   float
//   Contents :   folding subroutines, FPU optimized
//   parameter:   sw_sum3_t31         
//   last update:23.09.2007   by:seti_britta   new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
  register int i, j, k;
  float tmax2, tmax1; //seti_britta: new
  float *one   = ss[0];
  float *two   = ss[0]+P->tmp0;
  float *three = ss[0]+P->tmp1;
  tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
  i = P->di;
  if ( i & 1 )
  {
    i -= 1;
    P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
  }
  for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
  {
    P->dest[j] = one[j] + two[j] + three[j];        P->dest[k] = one[k] + two[k] + three[k];
    if (P->dest[j] > tmax1) tmax1 = P->dest[j];     if (P->dest[k] > tmax2) tmax2 = P->dest[k];
  }
  if (tmax1 > tmax2) return tmax1;
  return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
maybe the compact loop has a chance
so far it compiles well... now we must measure to find the fastest version
have fun
regards heinz   ;D  ;D
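
For measuring outside the full build, the compact loop can be lifted into a standalone function, roughly as sketched here. The name fold3_max and its parameters are hypothetical stand-ins for the P-> fields and the ss[0] offsets, and the indexes on the peeled odd element are written out in full. As in the posted code, the maxima start at 0.0f, which assumes the folded values are not negative.

```cpp
#include <cassert>

// Standalone analogue of the compact folding loop (illustrative names only).
// Folds three streams into dest, two elements per iteration, and returns
// the largest folded value.
float fold3_max(const float* one, const float* two, const float* three,
                float* dest, int n) {
    assert(n >= 0);
    float tmax1 = 0.0f, tmax2 = 0.0f;
    int i = n;
    if (i & 1) {
        // Odd length: peel off the last element so the rest pairs up.
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    // j walks the odd indexes, k the even ones, both downwards.
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return tmax1 > tmax2 ? tmax1 : tmax2;
}
```

Running it against the unrolled case version on the same input is a quick way to check that both give identical dest arrays before timing them.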

Jason G:
Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread about incorporating 3-phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle strip mining for the P4 at the same time.

Intel theories suggest a possible 3 to 5 times improvement in certain code from fixing those P4 problems, and the 3-phase & prefetch techniques [a la the AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my P4 3.2 back over 1000 RAC :D
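
As a rough illustration of the 3-phase idea as I understand it (pull a block into cache, work on it in cache, then write it out in one burst), here is a portable sketch. The function name, the block size and the simple scaling operation are assumptions chosen for the example; a tuned P4 or AMD version would probably use real prefetch instructions and streaming stores, which are left out here.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kLineFloats  = 16;    // 64-byte cache line / sizeof(float)
const std::size_t kBlockFloats = 1024;  // hypothetical 4 KB working block

// Scales src into dst block by block in three phases (illustrative only).
void scale_three_phase(const float* src, float* dst, std::size_t n, float gain) {
    assert(src != nullptr && dst != nullptr);
    float block[kBlockFloats];
    for (std::size_t base = 0; base < n; base += kBlockFloats) {
        const std::size_t len = (n - base < kBlockFloats) ? n - base : kBlockFloats;
        // Phase 1: touch one float per cache line to pull the block into cache.
        volatile float sink = 0.0f;
        for (std::size_t i = 0; i < len; i += kLineFloats) sink = src[base + i];
        (void)sink;
        // Phase 2: compute from the cached block into a small local buffer.
        for (std::size_t i = 0; i < len; ++i) block[i] = src[base + i] * gain;
        // Phase 3: write the finished block to the destination sequentially.
        for (std::size_t i = 0; i < len; ++i) dst[base + i] = block[i];
    }
}
```

The volatile sink is only there to stop the compiler from dropping the phase 1 reads; it has no effect on the result.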

Jason G:
Progress so far,  Long way to go :D :
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz,NoHT, WinXP]

Tactic                                                                 Type              Status        Effect
1- Better memcpy in GetFixedPot                                        Generic x86       Prelim tests  ~0.3%
2- Out-of-place FFTs / eliminating associated memcopies                Intel IPP         Initial       ~?.?%
3- Once-off seti.cpp 8meg memcpy                                       Generic x86       Untested      ~0.?%
4- Chirp function block prefetch, memcpy++ zerocase & 3-phase chirp    Generic x86       Untested      ~?.?%
5- Compiler flags (xN SSE2, P4 specific)                               P4 specific       Tested        ~10%
6- Strip-mined inner loops (P4 specific, 64K & 1M variants)            P4, possibly x86  Untested      ~??%
7- GaussFit improvements                                               To be determined

~ means approximate, my system, 'your mileage may vary'.

[Please anyone feel free to suggest additions, updates or corrections to this list: 
            either fairly generic OR p4 specific will do :D, Consider equivalent xP SSE3 builds as already on the list for later]

Jason
