Author Topic: optimized sources  (Read 615581 times)

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #240 on: 05 Nov 2007, 11:53:36 am »
Quote
Surprise Surprise, a QxN build is faster on my Northwood :P

LOL
I have a Northwood too --->
CPU(s)   
Number of CPUs 1
 
Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines --->  ;D



Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #241 on: 05 Nov 2007, 12:21:38 pm »
Boincstats Host cpus, top 10 highest number on seti@home:
Pos.  CPU                                     #          Total Credit
1     Intel(R) Pentium(R) 4 CPU 3.00GHz       104,449    1,920,980,979.29
2     Intel(R) Pentium(R) 4 CPU 2.80GHz       88,848     1,254,181,274.59
3     Intel(R) Pentium(R) 4 CPU 2.40GHz       57,309     633,952,931.43
4     Intel(R) Pentium(R) 4 CPU 3.20GHz       45,737     875,822,530.51
5     AMD Athlon(tm) 64 Processor 3000+       31,878     257,872,702.50
6     AMD Athlon(tm) 64 Processor 3200+       30,304     288,741,370.07
7     AMD Athlon(tm) Processor                27,726     129,774,610.58
8     Intel(R) Pentium(R) 4 CPU 2.00GHz       21,701     197,541,843.70
9     Intel(R) Pentium(R) 4 CPU 2.66GHz       19,200     208,668,039.95
10    AMD Athlon(tm) 64 Processor 3500+       19,049     191,994,766.55

We're both in the top 10 most popular :D, I have a #8 and a #4 :P [Doesn't it feel good to know you're with the 'in crowd'?]

[Must get around to trying to strip-mine those inner pulse folding loops for the P4 64k / 1 MB aliasing problem]
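
For readers who have not met the term: strip mining splits one long loop into fixed-size strips, so that every pass over the data finishes on one strip before moving to the next. Below is a minimal sketch of the idea only. The function name `fold_strip_mined` and the strip length `kStripFloats` are hypothetical and not taken from the SETI sources; a real strip size would have to be tuned and measured on the target CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical strip length in floats; a placeholder, to be tuned per CPU.
constexpr std::size_t kStripFloats = 1024;

// Folds a + b into dest and returns the largest value written
// (0.0f if n == 0 or if no sum exceeds zero).
// Both passes run over one strip before the next strip is touched,
// so the second pass finds its data still in cache.
float fold_strip_mined(const float* a, const float* b, float* dest,
                       std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && dest != nullptr));
    float tmax = 0.0f;
    for (std::size_t start = 0; start < n; start += kStripFloats) {
        const std::size_t end = std::min(start + kStripFloats, n);
        // Pass 1 on this strip: element-wise fold.
        for (std::size_t i = start; i < end; ++i) {
            dest[i] = a[i] + b[i];
        }
        // Pass 2 on the same strip: track the maximum.
        for (std::size_t i = start; i < end; ++i) {
            if (dest[i] > tmax) tmax = dest[i];
        }
    }
    return tmax;
}
```

Whether this helps with the aliasing issue depends on how the strip boundaries line up with the buffer addresses, which only a profiler run can confirm.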
« Last Edit: 05 Nov 2007, 12:31:43 pm by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #242 on: 05 Nov 2007, 10:02:53 pm »
It is worth speeding them up.... ;D

Although Dr. Who is already running his code... we give the old boxes a chance

I squeezed the code of pulsefind.cpp again;
sum1 and sum2 are no longer needed

here is the case construct --->
  switch (i) {
//    case 30:
//      sum1 = one[29] + two[29];           sum2 = one[28] + two[28];
//      sum1 += three[29];                  sum2 += three[28];
//      P->dest[29] = sum1;                 P->dest[28] = sum2;
//      if (sum1 > tmax1) tmax1 = sum1;     if (sum2 > tmax2) tmax2 = sum2;
 //seti_britta: new code:
    case 30:
      P->dest[29]= one[29] + two[29]+three[29];           P->dest[28]= one[28] + two[28]+three[28];
 //     sum1 += three[29];                  sum2 += three[28];
 //     P->dest[29] = sum1;                 P->dest[28] = sum2;
      if (P->dest[29] > tmax1) tmax1 = P->dest[29];     if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------

and here is the loop construct
// ----------------------------------------------------------------------------
//   Function:    sum_func_ptt( sw_sum3_t31 )
//   Type:        float
//   Contents:    folding subroutines, FPU optimized
//   Parameter:   sw_sum3_t31
//   Last update: 23.09.2007   by: seti_britta   new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
  register int i, j, k;
  float tmax2, tmax1; //seti_britta: new
  float *one   = ss[0];
  float *two   = ss[0]+P->tmp0;
  float *three = ss[0]+P->tmp1;
  tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
  i = P->di;
  if ( i & 1 )
  {
    // odd count: fold the unpaired last element first
    i -= 1;
    P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
  }
  // fold the remaining elements two at a time, from the top down to index 0
  for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
  {
    P->dest[j] = one[j] + two[j] + three[j];
    P->dest[k] = one[k] + two[k] + three[k];
    if (P->dest[j] > tmax1) tmax1 = P->dest[j];
    if (P->dest[k] > tmax2) tmax2 = P->dest[k];
  }
  if (tmax1 > tmax2) return tmax1;
  return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
maybe the compact loop has a chance
so far it compiles well... now we must measure to find the fastest
have fun
regards heinz   ;D  ;D
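
To measure the loop construct outside the application, it can be lifted into a free function. This is only a sketch under assumptions: the name `fold3_max` and its plain pointer parameters are hypothetical stand-ins for the `P->dest`, `ss[0]`, `P->tmp0`, `P->tmp1` and `P->di` used inside the real `sum_func_ptt` macro.

```cpp
#include <cassert>
#include <cstddef>

// Folds three arrays into dest (dest[x] = one[x] + two[x] + three[x])
// and returns the larger of the two running maxima.
// Mirrors the loop construct above: an odd n handles its last element
// first, the rest is processed in pairs from the top down to index 0.
// As in the original, the maxima start at 0.0f, which suits
// non-negative power values.
float fold3_max(const float* one, const float* two, const float* three,
                float* dest, int n) {
    assert(n >= 0);
    float tmax1 = 0.0f;
    float tmax2 = 0.0f;
    int i = n;
    if (i & 1) {
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}
```

Feeding the same inputs to this and to the case construct gives a quick check that both variants produce identical sums before any timing is done.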

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #243 on: 06 Nov 2007, 02:32:21 am »
Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread for incorporating 3-phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the P4 at the same time.

Intel theories suggest a possible 3 to 5 times improvement in certain code by fixing those P4 problems, and the 3-phase & prefetch techniques [à la AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my P4 3.2 back over 1000 RAC :D

« Last Edit: 06 Nov 2007, 05:18:39 am by j_groothu »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #244 on: 06 Nov 2007, 07:34:37 am »
Progress so far,  Long way to go :D :
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz,NoHT, WinXP]

Tactic                                                               Type               Status          Effect
1- Better memcpy in GetFixedPot                                      Generic x86        Prelim Tests    ~0.3%
2- Out of Place FFTs / eliminating associated memcopies              Intel IPP          Initial         ~?.?%
3- Once off seti.cpp 8meg memcpy                                     Generic x86        Untested        ~0.?%
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp   Generic x86        Untested        ~?.?%
5- Compiler Flags (xN SSE2 p4 Specific)                              P4 specific        Tested          ~10%
6- Strip Mined Inner loops (p4 specific, 64k & 1M variants)          P4, possible x86   Untested        ~??%
7- GaussFit Improvements                                             To be Determined

~ means approximate, my system, 'your mileage may vary'.

[Please anyone feel free to suggest additions, updates or corrections to this list: 
            either fairly generic OR p4 specific will do :D, Consider equivalent xP SSE3 builds as already on the list for later]

Jason
« Last Edit: 06 Nov 2007, 09:54:55 am by j_groothu »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #245 on: 07 Nov 2007, 07:13:57 am »
Quote
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp                  Generic x86   Untested        ~?.?%

Took a quick look between school and work; looks like this may be easier to try than I thought. On my configuration the consistently selected chirping function is the outstanding "sse2_ChirpData_ak". Nice one.

The structure is already there for potential 3-phase processing, though it is currently straight SSE2, rendering it vectorised SIMD as far as I can see. The existing prefetch, processing and writing sections are all SSE2, clearly laid out, and exhibit the clean crystal-vase-like 'niceness' quality that makes you reluctant to tamper :D

With a few other adaptations, adjusting the prefetch, changing the processing to FPU, and suitably adjusting the streaming writes should do the trick,
  ... though for the P4 I would like to keep the aliasing issue in mind, which might just dictate some of the block sizes and the order they are processed in.
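
The three phases described above (block prefetch, FPU processing, streaming writes) could be laid out roughly as follows. This is a hedged sketch, not the code of sse2_ChirpData_ak: the function name `chirp_three_phase`, the block size `kBlock`, the interleaved (re, im) layout and the chirp angle formula pi * chirp_rate * t^2 are all assumptions made for illustration. The SSE intrinsics used (`_mm_prefetch`, `_mm_loadu_ps`, `_mm_stream_ps`, `_mm_sfence`) come from `<xmmintrin.h>` and are compiled in only when SSE is available.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define CHIRP_SKETCH_SSE 1
#else
#define CHIRP_SKETCH_SSE 0
#endif

// Hypothetical block size in complex samples; to be tuned and measured.
constexpr std::size_t kBlock = 256;

// Rotates n interleaved complex samples (re, im) by
// angle(i) = pi * chirp_rate * t^2 with t = i / sample_rate (assumed form).
void chirp_three_phase(const float* in, float* out, std::size_t n,
                       double chirp_rate, double sample_rate) {
    assert(sample_rate > 0.0);
    const double kPi = 3.14159265358979323846;
    float block[2 * kBlock];  // small buffer that stays in cache
    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t count = (n - start < kBlock) ? (n - start) : kBlock;
        const float* src = in + 2 * start;
        float* dst = out + 2 * start;
#if CHIRP_SKETCH_SSE
        // Phase 1: prefetch the block, one request per 64-byte line (16 floats).
        for (std::size_t f = 0; f < 2 * count; f += 16) {
            _mm_prefetch(reinterpret_cast<const char*>(src + f), _MM_HINT_NTA);
        }
#endif
        // Phase 2: scalar (FPU) processing into the cached buffer.
        for (std::size_t i = 0; i < count; ++i) {
            const double t = static_cast<double>(start + i) / sample_rate;
            const double angle = kPi * chirp_rate * t * t;
            const float c = static_cast<float>(std::cos(angle));
            const float s = static_cast<float>(std::sin(angle));
            block[2 * i]     = src[2 * i] * c - src[2 * i + 1] * s;
            block[2 * i + 1] = src[2 * i] * s + src[2 * i + 1] * c;
        }
        // Phase 3: write the block out, with streaming stores when the
        // destination is 16-byte aligned, plain stores otherwise.
        std::size_t f = 0;
#if CHIRP_SKETCH_SSE
        if (reinterpret_cast<std::uintptr_t>(dst) % 16 == 0) {
            for (; f + 4 <= 2 * count; f += 4) {
                _mm_stream_ps(dst + f, _mm_loadu_ps(block + f));
            }
        }
#endif
        for (; f < 2 * count; ++f) dst[f] = block[f];
    }
#if CHIRP_SKETCH_SSE
    _mm_sfence();  // make the streaming stores visible before returning
#endif
}
```

The block size is the knob that would interact with the P4 aliasing issue, so it is kept as a single constant here.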

Oh for the weekend :D

« Last Edit: 07 Nov 2007, 07:22:50 am by j_groothu »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #246 on: 07 Nov 2007, 11:05:59 am »
First run of the original code [will need to run it more times for a baseline though]: (very nice function already)

--------------------------------------------------------------------------------------
Testing xN SSE2 Build.

sse2_ChirpData_ak:

NumDataPoints = 1024*1024
test_points = 32768

Timer Frequency in:

Hz  =       3579545
MHz =       3.57955
GHz =    0.00358

Start Time =    1585115997106 Ticks
Stop Time  =    1585116003199 Ticks

Duration in Ticks   =  6093
Duration in seconds =  0.0017021716447

--------------------------------------------------------------------------------------

Inner loop executes 8192 times
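
The duration figure in the log above is just the tick difference divided by the timer frequency. A small sketch of that conversion, with a hypothetical helper name `ticks_to_seconds` (the test harness itself is not shown in this thread):

```cpp
#include <cassert>
#include <cstdint>

// Converts a start/stop pair of timer ticks to seconds,
// given the timer frequency in ticks per second.
double ticks_to_seconds(std::int64_t start_ticks, std::int64_t stop_ticks,
                        std::int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    assert(stop_ticks >= start_ticks);
    return static_cast<double>(stop_ticks - start_ticks) /
           static_cast<double>(ticks_per_second);
}
```

With the values from the log, `ticks_to_seconds(1585115997106, 1585116003199, 3579545)` is 6093 / 3579545, about 0.0017 seconds. If the whole interval were spent in the inner loop, that would be roughly 0.2 microseconds per pass over 8192 passes, which is close to the timer resolution, so repeated runs are needed for a trustworthy baseline.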
« Last Edit: 07 Nov 2007, 11:10:42 am by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #247 on: 07 Nov 2007, 11:47:04 am »
Measuring is the best way to try code and find the optimal variants.  ;D

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there... we will see...


Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #248 on: 07 Nov 2007, 12:14:29 pm »
Quote
Measuring is the best way to try code and find the optimal variants.  ;D

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there... we will see...

Great! A pulsefind baseline will be good too. Underneath pulsefind, it seems my machine also always selects the AK folding routines and spends much of its time in the x2AL version. I am running VTune on the chirp one now to look for any P4-specific slowdowns; wickedly fast code though :D

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #249 on: 07 Nov 2007, 01:55:39 pm »



Quote
I am running VTune on the chirp one now to look for any P4-specific slowdowns; wickedly fast code though :D

I have a heavily modified chirpfft.cpp which we can try too

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #250 on: 07 Nov 2007, 04:47:27 pm »
We can now easily compile all 3 cases via a preprocessor definition --->
---------------------------------------------------------------------------------------------------
// USE_PFLOOP  --> preprocessor definition: use the loop construct
// USE_PFCASE  --> preprocessor definition: use the modified case construct
// neither     --> use the original code
#if defined( USE_PFLOOP )
   #pragma message ("-----PFLOOP-----")
   #include "pfloop.h" //use the loop-construct
#elif defined( USE_PFCASE )
   #pragma message ("-----PFCASE-----")
   #include "pfcase.h" //use the modified case-construct
#else
   //use original code
#endif // USE_PFLOOP / USE_PFCASE
-----------------------------------------------------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFLOOP" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFLOOP-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

regards   ;D
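
When several builds are benchmarked side by side it helps if the binary can report which variant it was compiled with. A minimal sketch reusing the USE_PFLOOP / USE_PFCASE definitions from this post; the names `pulse_fold_variant` and `is_variant` are hypothetical helpers, not part of the SETI sources.

```cpp
#include <cassert>
#include <cstring>

// Same precedence as the include selection: PFLOOP wins over PFCASE,
// and with neither defined the original code is used.
#if defined(USE_PFLOOP)
static const char* const kPulseFoldVariant = "PFLOOP";
#elif defined(USE_PFCASE)
static const char* const kPulseFoldVariant = "PFCASE";
#else
static const char* const kPulseFoldVariant = "ORIGINAL";
#endif

// Returns the name of the variant selected at compile time.
const char* pulse_fold_variant() { return kPulseFoldVariant; }

// True if this build was compiled with the named variant.
bool is_variant(const char* name) {
    assert(name != nullptr);
    return std::strcmp(kPulseFoldVariant, name) == 0;
}
```

Printing `pulse_fold_variant()` next to each timing result avoids mixing up the numbers from different builds.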


Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #251 on: 08 Nov 2007, 04:50:05 am »
Quote
I have a heavily modified chirpfft.cpp which we can try too

Good, we'll do that; I think it is a very good idea. I have P4 SSE2 primary performance data (VTune) for sse2_ChirpData_ak: 10000 loops on a P4 Northwood with 512k L2 cache, which took a total of 10 seconds of execution time (19 runs' worth of data gathered).
(preliminary data, subject to verification with further runs)
   64k aliasing: almost none... accounts for 1.34% of the function workload (about 0.13 secs)
   Second-level cache misses: account for 10.28% of the workload (about 1 second)

Other statistics (preliminary, subject to verification):
128-bit MMX instructions: ~82 million (no 64-bit MMX instructions counted)
Packed double-precision floating-point SSE instructions: ~1.4 billion (thousand million)
Packed single-precision floating-point SSE instructions: ~4 billion (thousand million)

Mispredicted branches = 0 !!!  :o

No machine clear counts (pipeline flushes), split loads or blocked store forwards at all :D

I think that's a really good function, with much better statistics than the pulse folding functions gave me, but I'll have to retest those in isolation too, as I'm getting better at selecting the correct compiler settings and at driving VTune.

Well, I'll check a few build settings and run the primary performance measures again to verify those results, and add secondary performance indicators to see what else turns up.... Then on the weekend maybe fiddle with that 3-phase idea to see if it actually works.... All good fun :D...

Jason


« Last Edit: 08 Nov 2007, 05:06:50 am by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #252 on: 08 Nov 2007, 12:12:38 pm »
the modified PFCASE is ready now
-----------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFCASE" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFCASE-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
 ;D

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #253 on: 08 Nov 2007, 09:50:55 pm »
modified PFCASE rocks

here is how it was before --->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5    Totals:  Ratio            Ticks
             standard:  1.000      87303043476
Plan < 512 FPU swi ! :  0.575      50201832416
 Plan < 512 AK SSE ! :  0.634      55338411648
Plan < 512 BHx SSE ! :  0.993      86661631716
 Plan < 512 BH SSE ! :  0.774      67545465584

PFCASE ---->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5    Totals:  Ratio            Ticks
             standard:  1.000      87387438720
Plan < 512 FPU swi ! :  0.504      44014700492
 Plan < 512 AK SSE ! :  0.633      55324520388
Plan < 512 BHx SSE ! :  0.992      86681643504
 Plan < 512 BH SSE ! :  0.773      67531081560
----------------------------------------------------------------------------------------------------
modified PFCASE ---> ~13% faster     ;D
heinz
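
The improvement can be read off the "FPU swi" tick counts in two ways: as the fraction of ticks saved, or as a speedup factor. A small sketch with hypothetical helper names `tick_reduction` and `speedup`:

```cpp
#include <cassert>

// Fraction of ticks saved by the new build: (before - after) / before.
double tick_reduction(double ticks_before, double ticks_after) {
    assert(ticks_before > 0.0 && ticks_after > 0.0);
    return (ticks_before - ticks_after) / ticks_before;
}

// Speedup factor of the new build: before / after.
double speedup(double ticks_before, double ticks_after) {
    assert(ticks_before > 0.0 && ticks_after > 0.0);
    return ticks_before / ticks_after;
}
```

With 50201832416 ticks before and 44014700492 after, by my arithmetic this gives about 12.3% fewer ticks, or a speedup factor of about 1.14, so the "~13%" above sits between the two ways of counting.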

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #254 on: 09 Nov 2007, 01:45:24 am »
Woohoo! It's the weekend! Was that function built with just the changes you made before? I'd guess that maybe the compiler did vectorise some of that; I would like to look at the disassembly output. If the compiler was smart enough to combine prefetch, FPU processing and streaming stores, then that IS 3-phase :D Anything is possible. Have you compared for accuracy as well?
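
On the accuracy question, one simple approach is to run the original and the modified routine on the same input and compare the output arrays element by element. A minimal sketch; the names `max_abs_diff` and `outputs_match` and the idea of an absolute tolerance are assumptions for illustration, and a suitable tolerance would have to be chosen for the real data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest absolute difference between a reference and a test output.
float max_abs_diff(const float* ref, const float* test, std::size_t n) {
    assert(n == 0 || (ref != nullptr && test != nullptr));
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = std::fabs(ref[i] - test[i]);
        if (d > worst) worst = d;
    }
    return worst;
}

// True if every element of test is within tol of the reference.
bool outputs_match(const float* ref, const float* test, std::size_t n,
                   float tol) {
    assert(tol >= 0.0f);
    return max_abs_diff(ref, test, n) <= tol;
}
```

Summing in a different order (for example one[j] + two[j] + three[j] in one expression instead of via sum1) can change the last bits of a float, so an exact bitwise match is not always to be expected.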

« Last Edit: 09 Nov 2007, 01:50:00 am by j_groothu »

 
