Author Topic: optimized sources (Read 922618 times)

_heinz · « **Reply #240 on:** 05 Nov 2007, 11:53:36 am »

Quote from: j_groothu on 03 Nov 2007, 01:34:41 pm

Surprise Surprise, a QxN build is faster on my Northwood
LOL

have a Northwood too --->
CPU(s)
Number of CPUs 1

Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines --->

Jason G · « **Reply #241 on:** 05 Nov 2007, 12:21:38 pm »

Boincstats host CPUs, top 10 by highest number on seti@home:

Pos.  CPU                                  #        Total Credit
1     Intel(R) Pentium(R) 4 CPU 3.00GHz    104,449  1,920,980,979.29
2     Intel(R) Pentium(R) 4 CPU 2.80GHz     88,848  1,254,181,274.59
3     Intel(R) Pentium(R) 4 CPU 2.40GHz     57,309    633,952,931.43
4     Intel(R) Pentium(R) 4 CPU 3.20GHz     45,737    875,822,530.51
5     AMD Athlon(tm) 64 Processor 3000+     31,878    257,872,702.50
6     AMD Athlon(tm) 64 Processor 3200+     30,304    288,741,370.07
7     AMD Athlon(tm) Processor              27,726    129,774,610.58
8     Intel(R) Pentium(R) 4 CPU 2.00GHz     21,701    197,541,843.70
9     Intel(R) Pentium(R) 4 CPU 2.66GHz     19,200    208,668,039.95
10    AMD Athlon(tm) 64 Processor 3500+     19,049    191,994,766.55

We're both in the top 10 most popular; I have a #8 & a #4.

[Doesn't it feel good to know you're with the 'in crowd'?]

[Must get around to trying to strip mine those inner pulse folding loops for the p4 64k / 1meg aliasing problem]

_heinz · « **Reply #242 on:** 05 Nov 2007, 10:02:53 pm »

It is worth speeding them up....

Although Dr. Who is already running his code... we give the old boxes a chance

squeezed the code of pulsefind.cpp again
sum1 and sum2 are no longer needed

here the case construct --->
switch (i) {
// case 30:
// sum1 = one[29] + two[29]; sum2 = one[28] + two[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
// if (sum1 > tmax1) tmax1 = sum1; if (sum2 > tmax2) tmax2 = sum2;
//seti_britta: new code:
case 30:
P->dest[29]= one[29] + two[29]+three[29]; P->dest[28]= one[28] + two[28]+three[28];
// sum1 += three[29]; sum2 += three[28];
// P->dest[29] = sum1; P->dest[28] = sum2;
if (P->dest[29] > tmax1) tmax1 = P->dest[29]; if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------

and here the loop construct
// ----------------------------------------------------------------------------
//   Function:   sum_func_ptt( sw_sum3_t31 )
//   Type     :   float
//   Content  :   folding subroutines, FPU optimized
//   parameter:   sw_sum3_t31
//   last update:23.09.2007   by:seti_britta   new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
register int i, j, k;
float tmax2, tmax1; //seti_britta: new
float *one = ss[0];
float *two = ss[0]+P->tmp0;
float *three = ss[0]+P->tmp1;
tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
i = P->di;
if ( i & 1 )
{
i -= 1;
P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
}
   for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
   {
P->dest[j]= one[j] + two[j] + three[j]; P->dest[k]= one[k] + two[k] + three[k];
if (P->dest[j] > tmax1) tmax1 = P->dest[j]; if (P->dest[k] > tmax2) tmax2 = P->dest[k];
   }
if (tmax1 > tmax2) return tmax1;
return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
maybe the compact loop has a chance
so far it compiles well... now we must measure to find the fastest
have fun
regards heinz
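
For anyone who wants to try the compact loop outside the seti sources, here is a minimal standalone sketch of the same idea. The function name `fold3_max` and its pointer/length parameters are assumptions made for illustration; the real routine takes its strips from `ss[0]` and the `P` structure instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical standalone version of the three-way fold: adds three equally
// long strips element by element into dest and returns the largest sum.
// Like the posted loop, the maxima start at 0.0f, so the result is only
// meaningful for non-negative input (power values).
float fold3_max(const float* one, const float* two, const float* three,
                float* dest, std::size_t n) {
    assert(n == 0 || (one && two && three && dest));
    float tmax1 = 0.0f, tmax2 = 0.0f;    // two independent running maxima
    std::size_t i = n;
    if (i & 1) {                         // odd length: peel off the last element
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    // i is even now; walk down in pairs, writing the sums straight into dest
    // so that no sum1/sum2 temporaries are needed.
    for (std::size_t j = i; j >= 2; j -= 2) {
        dest[j - 1] = one[j - 1] + two[j - 1] + three[j - 1];
        dest[j - 2] = one[j - 2] + two[j - 2] + three[j - 2];
        if (dest[j - 1] > tmax1) tmax1 = dest[j - 1];
        if (dest[j - 2] > tmax2) tmax2 = dest[j - 2];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}
```

Keeping two maxima lets the two compare chains proceed independently; whether that actually beats the case construct on a given CPU is something only measurement can tell.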

Jason G · « **Reply #243 on:** 06 Nov 2007, 02:32:21 am »

Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread on incorporating 3 phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the p4 at the same time.

Intel theory suggests a possible 3 to 5 times improvement in certain code from fixing those p4 problems, and the 3 phase & prefetch techniques [à la the AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my p4 3.2 back over 1000 RAC.

Jason G · « **Reply #244 on:** 06 Nov 2007, 07:34:37 am »

Progress so far, long way to go:
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz,NoHT, WinXP]

Tactic                                                              Type              Status        Effect
1- Better memcpy in GetFixedPot                                     Generic x86       Prelim Tests  ~0.3%
2- Out of Place FFTs / eliminating associated memcopies             Intel IPP         Initial       ~?.?%
3- Once off seti.cpp 8meg memcpy                                    Generic x86       Untested      ~0.?%
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp  Generic x86       Untested      ~?.?%
5- Compiler Flags (xN SSE2 p4 Specific)                             P4 specific       Tested        ~10%
6- Strip Mined Inner loops (p4 specific, 64k & 1M variants)         P4, possible x86  Untested      ~??%
7- GaussFit Improvements                                            To be determined

~ means approximate, my system, 'your mileage may vary'.

[Please, anyone feel free to suggest additions, updates or corrections to this list: either fairly generic OR p4 specific will do. Consider equivalent xP SSE3 builds as already on the list for later.]

Jason

Jason G · « **Reply #245 on:** 07 Nov 2007, 07:13:57 am »

Quote

4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp Generic x86 Untested ~?.?%

Took a quick look between school and work; looks like this may be easier to try than I thought. On my configuration the consistently selected chirping function is the outstanding "sse2_ChirpData_ak". Nice one.

The structure is already there for potential 3 phase processing, though it is currently straight SSE2, which makes it vectorised SIMD as far as I can see. The existing prefetch, processing and writing sections are all SSE2, clearly laid out, and exhibit the clean, crystal-vase-like 'niceness' that makes you reluctant to tamper.

With a few other adaptations, adjusting the prefetch, changing the processing to FPU, and suitably adjusting the streaming writes should do the trick,
... though for the p4 I would like to try to keep the aliasing issue in mind which might just dictate some of the block sizes and order they are processed.

Oh for the weekend
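
To make the three phases concrete, here is a rough, portable sketch of the pattern, not the actual chirp code. The function name `scale_three_phase`, the block size and the simple multiply standing in for the chirp maths are all assumptions; phase 1 uses plain loads that touch one float per 64-byte line (one common reading of "block prefetch"), and phase 3 uses an ordinary copy where tuned code would use streaming stores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative sizes only (assumptions, not taken from the seti sources).
constexpr std::size_t kLineFloats  = 16;    // 64-byte cache line / 4-byte float
constexpr std::size_t kBlockFloats = 1024;  // 4 KB block, small enough to stay cached

// Multiplies `in` by `gain` into `out`, one block at a time, in three phases.
// Returns the number of blocks processed.
std::size_t scale_three_phase(const float* in, float* out,
                              std::size_t n, float gain) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    float tmp[kBlockFloats];        // in-cache scratch buffer for phase 2
    volatile float sink = 0.0f;     // keeps the touch loads from being optimized away
    std::size_t blocks = 0;
    for (std::size_t base = 0; base < n; base += kBlockFloats) {
        const std::size_t len = std::min(kBlockFloats, n - base);
        // Phase 1: block prefetch. Touch one float per cache line so the
        // whole block is pulled into cache before any arithmetic starts.
        for (std::size_t i = 0; i < len; i += kLineFloats) sink = in[base + i];
        // Phase 2: process from cache into the small scratch buffer.
        for (std::size_t i = 0; i < len; ++i) tmp[i] = in[base + i] * gain;
        // Phase 3: write the finished block out in one sequential run.
        std::copy(tmp, tmp + len, out + base);
        ++blocks;
    }
    return blocks;
}
```

On the p4 the block size would be the knob to experiment with, since the aliasing issue may dictate which sizes and processing orders behave well.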

Jason G · « **Reply #246 on:** 07 Nov 2007, 11:05:59 am »

First run of the original code [will need to run it more times for a baseline, though] (very nice function already):

--------------------------------------------------------------------------------------
Testing xN SSE2 Build.

sse2_ChirpData_ak:

NumDataPoints = 1024*1024
test_points = 32768

Timer Frequency in:

Hz = 3579545
MHz = 3.57955
GHz = 0.00358

Start Time = 1585115997106 Ticks
Stop Time = 1585116003199 Ticks

Duration in Ticks = 6093
Duration in seconds = 0.0017021716447

--------------------------------------------------------------------------------------

Inner loop executes 8192 times
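
The duration line in that output is just the tick difference divided by the timer frequency. A small sketch of the conversion follows; the helper names `elapsed_ticks` and `ticks_to_seconds` are made up for illustration and are not part of the test harness.

```cpp
#include <cassert>
#include <cstdint>

// Ticks elapsed between two readings of a monotonically increasing counter.
std::int64_t elapsed_ticks(std::int64_t start, std::int64_t stop) {
    assert(stop >= start);
    return stop - start;
}

// Converts a tick count to seconds for a timer running at `hz` ticks per second.
double ticks_to_seconds(std::int64_t ticks, std::int64_t hz) {
    assert(hz > 0);
    return static_cast<double>(ticks) / static_cast<double>(hz);
}
```

With the values above, 1585116003199 - 1585115997106 gives 6093 ticks, and 6093 / 3579545 Hz is roughly 0.0017 seconds, matching the reported duration.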

_heinz · « **Reply #247 on:** 07 Nov 2007, 11:47:04 am »

Measuring is the best way to try code and find the optimal variants.

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there... we will see...

Jason G · « **Reply #248 on:** 07 Nov 2007, 12:14:29 pm »

Quote from: seti_britta on 07 Nov 2007, 11:47:04 am

Measuring is the best way to try code and find the optimal variants.

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there... we will see...

Great! A pulsefind baseline will be good too. Underneath pulsefind, it seems my machine also always selects the AK folding routines and spends much of its time in the x2AL version. I am running vtune on the chirp one now to look for any p4 specific slowdowns; wickedly fast code though.

_heinz · « **Reply #249 on:** 07 Nov 2007, 01:55:39 pm »

Quote from: j_groothu on 07 Nov 2007, 12:14:29 pm

Quote from: seti_britta on 07 Nov 2007, 11:47:04 am

I am running vtune on the chirp one now to look for any p4 specific slowdowns, wickedly fast code though

I have a heavily modified chirpfft.cpp which we can try too

_heinz · « **Reply #250 on:** 07 Nov 2007, 04:47:27 pm »

We can now easily compile all 3 cases using preprocessor definitions --->
---------------------------------------------------------------------------------------------------
// USE_PFLOOP --> preprocessor directive
// USE_PFCASE --> preprocessor directive
#if defined( USE_PFLOOP )
   #pragma message ("-----PFLOOP-----")
   #include "pfloop.h" //use the loop-construct
#else
#if defined( USE_PFCASE )
   #pragma message ("-----PFCASE-----")
   #include "pfcase.h" //use the modified case-construct
#else
   //use original code
#endif // USE_PFCASE
#endif // USE_PFLOOP
-----------------------------------------------------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFLOOP" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFLOOP-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

regards

Jason G · « **Reply #251 on:** 08 Nov 2007, 04:50:05 am »

Quote from: seti_britta on 07 Nov 2007, 01:55:39 pm

I have a heavily modified chirpfft.cpp which we can try too

Good, we'll do that; I think it is a very good idea. I have p4 sse2 primary performance data (vtune) for sse2_ChirpData_ak: 10000 loops on a p4 Northwood with 512k L2 cache, which took a total of 10 secs execution time (19 runs' worth of data gathered):
(preliminary data, subject to verification with further runs)
64k Aliasing: almost none... Accounts for 1.34% of function workload (about 0.13 secs)
Second Level Cache misses: Accounts for 10.28% of the workload (about 1 second)

other statistics (preliminary, subject to verification) :
128 bit mmx instructions ~82 million (no 64 bit MMX instructions counted)
packed double precision Floating Point SSE instructions ~1.4 billion (thousand million)
packed single precision Floating Point SSE instructions ~4 billion (thousand million)

Mispredicted Branches = 0 !!!

No Machine Clear counts (Pipeline flushes), split loads or blocked store forwards at all

I think that's a really good function, with much better statistics than the pulse folding functions gave me, but I'll have to retest those in isolation too, as I'm getting better at selecting the correct compiler settings and at driving vtune.

Well, I'll check a few build settings and run the primary performance measures again to verify those results, and add secondary performance indicators to see what else turns up.... Then on the weekend maybe fiddle with that 3 phase idea to see if it actually works.... All good fun

...

Jason

_heinz · « **Reply #252 on:** 08 Nov 2007, 12:12:38 pm »

the modified PFCASE is ready now
-----------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFCASE" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFCASE-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

_heinz · « **Reply #253 on:** 08 Nov 2007, 09:50:55 pm »

modified PFCASE rocks

here as it was before --->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5 Totals: Ratio Ticks
standard: 1.000 87303043476
Plan < 512 FPU swi ! : 0.575 50201832416
Plan < 512 AK SSE ! : 0.634 55338411648
Plan < 512 BHx SSE ! : 0.993 86661631716
Plan < 512 BH SSE ! : 0.774 67545465584

PFCASE ---->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5 Totals: Ratio Ticks
standard: 1.000 87387438720
Plan < 512 FPU swi ! : 0.504 44014700492
Plan < 512 AK SSE ! : 0.633 55324520388
Plan < 512 BHx SSE ! : 0.992 86681643504
Plan < 512 BH SSE ! : 0.773 67531081560
----------------------------------------------------------------------------------------------------
modified PFCASE ---> ~13% faster

heinz
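
For reference, the gain on the "FPU swi" line can be expressed two ways from the tick counts, and the two give slightly different percentages. A small sketch, with helper names invented for illustration:

```cpp
#include <cassert>

// Fraction of ticks saved going from `before` to `after`
// (0.10 means the new code needs 10% fewer ticks).
double ticks_saved_fraction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}

// Throughput ratio: how many times faster the new code runs.
double speedup_factor(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

Using 50201832416 ticks before and 44014700492 ticks after, the first works out to about 12% fewer ticks and the second to about 1.14x the speed, so the quoted ~13% sits between the two readings.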

Jason G · « **Reply #254 on:** 09 Nov 2007, 01:45:24 am »

Woohoo! It's the weekend! Was that function with just the changes you made before? I'd guess that maybe the compiler did vectorise some of that. I would like to look at the disassembly output; if the compiler was smart enough to put prefetch plus FPU plus streaming stores, then that IS 3-Phase. Anything is possible. Have you compared for accuracy as well?
