Forum > Windows
sources with Orcas
Josef W. Segur:
It hasn't been mentioned lately that in 2004 Eric Korpela set up a setiboinc sourceforge project to encourage submission of optimized code. Ben Herndon participated in that, and his code in ../setiboinc/client/opt is where I got the original form of the GetPSO functions. I just made it into four separate versions to compare the prefetching. On some systems and some builds there have been clear differences particularly on my Pentium-M system and on Core 2 systems. On P4 or PD systems, OTOH, the differences have been small.
Francois Piednoel's PowerSpectrum code from the "Who? optimizations Part 2" NC thread may be even better. I hope he'll release source soon but that part would be fairly easy to fit into our codebase anyhow.
Joe
Jason G:
Thanks Joe,
Though I did run maybe a couple of workunits when sah classic first started (I remember the news release on tv here in Oz... Incidently, could never rember the email I used then....), I just came back around last November....
so really '2004' is indeed before my time... Thanks for the link, more reading is always handy.
Just opened the Pulse Timing project for the first time, Looks like I'll be occupied for the weekend trying to get a working build, Looks Extremely handy for testing some of the things been worked on behind the scenes.
Jason
_heinz:
Looking now to vectorize the most used functions of S@H.
Here I show you a modified FillTrigArray as it performs as part of chirpfft.cpp
Have fun ;D
Heinz
---------------------------------------------------------------------------
PUBLIC ?FillTrigArray@@YAXH@Z ; FillTrigArray
; Function compile flags: /Ogtpy
; COMDAT ?FillTrigArray@@YAXH@Z
_TEXT SEGMENT
_k$ = 8 ; size = 4
?FillTrigArray@@YAXH@Z PROC ; FillTrigArray, COMDAT
; 731 : CurrentTrig[k].Sin = ((CurrentTrig[k].Sin * TrigStep[k].Cos) + (CurrentTrig[k].Cos * TrigStep[k].Sin));
mov edx, DWORD PTR ?TrigStep@@3PAUSinCosArray@@A ; TrigStep
mov ecx, DWORD PTR ?CurrentTrig@@3PAUSinCosArray@@A ; CurrentTrig
mov eax, DWORD PTR _k$[esp-4]
shl eax, 4
movsd xmm1, QWORD PTR [eax+edx+8]
mulsd xmm1, QWORD PTR [eax+ecx]
movsd xmm0, QWORD PTR [eax+ecx+8]
mulsd xmm0, QWORD PTR [eax+edx]
addsd xmm0, xmm1
movsd QWORD PTR [eax+ecx], xmm0
; 732 : CurrentTrig[k].Cos = ((CurrentTrig[k].Cos * TrigStep[k].Cos) - (CurrentTrig[k].Sin * TrigStep[k].Sin));
mov edx, DWORD PTR ?TrigStep@@3PAUSinCosArray@@A ; TrigStep
movsd xmm0, QWORD PTR [eax+edx+8]
push esi
mov esi, DWORD PTR ?CurrentTrig@@3PAUSinCosArray@@A ; CurrentTrig
movsd xmm1, QWORD PTR [eax+esi]
mulsd xmm0, QWORD PTR [eax+esi+8]
mulsd xmm1, QWORD PTR [eax+edx]
lea ecx, DWORD PTR [eax+esi+8]
subsd xmm0, xmm1
movsd QWORD PTR [ecx], xmm0
pop esi
; 733 : }
ret 0
?FillTrigArray@@YAXH@Z ENDP ; FillTrigArray
_TEXT ENDS
_heinz:
I. Going parallel or how to cut the leek !
This morning I was in the kitchen to make a salad of leek. After washing the leafs I took one to cut it in fit parts.
But how big is a fit part ? 1mm, 10mm, 100mm ?
I have a relative big leaf of 24 cm so we choose 10mm= 1cm as a fit part. Now we know I must cut it into 24 parts.
How todo that ?
1. we take the knife and cut it into 24 pieces one after the other. Wee need 24-1 = 23 cuts
We have stiil one tree to cut. This means sequential works.
or we do following-->
2. we cut the leaf in two eaqual parts(1 cut), then lay both parts parallel to each other and cut it. We need 12-1 = 11 cuts +1 extra cut from the first. Summary 12 cuts.
This means parallel work. We have 2 parallel trees to cut.The one extra cut and lay both parts parallel is the overhead(organize parallel work).
3. we cut the leaf into 2 parts (1 cut), laying both parts parallel, cut now again into 2 parts(1 cut), laying again the 2 parts parallel(have now 4 parallel) and cut it. We need still 6-1 cuts plus the 2 extra cuts = summary 7 cuts.
This means much more parallel. We have 4 trees. The overhead is now grown to 2 cuts plus laying 4 parts parallel.
4. we cut the leaf into 2 parts(1cut) laying both parts parallel, cut now again into 2 parts(1 cut), laying again both parts parallel and cut it into 2 parts(1 cut), laying the two parallel(have now 8 parts parallel) and cut. We need 3-1=2 cuts plus 3 extra cuts summary 5 cuts. The overhead grows now to 3 cuts plus laying 3 times (2³ = 8parts) parallel.
Summary of all:
1. sequential = 23 cuts --> no overhead
2. parallel (2) = 11+1=12 cuts (1 cut overhead)
3. parallel (4) = 5+2=7 cuts (2 cuts overhead)
4. parallel ( 8 ) = 2+3=5 cuts (3 cuts overhead)
-----------------------------------------------------------------------
In this way (4.) we can do the same work with still 5 cuts against (1.) 23 cuts if we use sequential work.
But attention the overhead with 3 is bigger as the real work-cuts(2) and we run 8 trees parallel.
This method of organize parallel work is called blocking. The problem is to determine the length of the pieces(fit parts) and the choose of parallel trees to get maximal performance. I believe every max performance solution is for every given work (1000), (100 000), (1 000 000) and machine a other. Therefore this thema is relative complex and so difficult to handle.
Believe me, the parallel cutted salad has a fine taste. ;D
Have fun ;D ;D ;D
regards heinz
Jason G:
8)...Or the 'other' kind of coarse grained parallelism ... where you ask your mum to make you a sandwich instead, then you watch TV in parallel ;D [ 0 cuts + 1 small communication overhead]
Jason
Navigation
[0] Message Index
[#] Next page
[*] Previous page
Go to full version