GPU/CPU performance dependence from AR value
Raistmer:
Sure, any external move of tasks between CPU and GPU will break BOINC's expectations about how much work it has and how much it needs.
That is, no such action should be taken with a 10-day cache at all. Better would be for such rebranding to happen on a regular basis, fairly often but in pretty small chunks, so BOINC's work estimate isn't thrown off too much each time.
Richard Haselgrove:
You probably need to add some theoretical underpinning to the 'WU selection for re-branding', too. I've just finished an AR=0.127238 task, which, while exhibiting some of the sluggishness of the 'true' VLARs, ran for about 30 minutes on a 9800GT - compared to 90 minutes for a ~0.01, or 20 minutes for a ~0.4.
This WU would have come from the first peak of Joe's graph recently published at Main.
I've also had (rare) tasks from the second peak, around AR=0.2, with similar runtimes. Joe, if you're watching, is there any way of knowing how many times these big pulse PoT arrays are processed during a run at different ARs? If there's a massive array at 40960, but it's only run once, the problem is much less than the arrays at 32768 which seem to run forever at VLAR.
Also, another thought. Several people (myself included) have observed and reported that VHAR tasks don't scale well on multi-core hosts: there's a big penalty for running 8 x VHAR on my dual Xeon (memory bus saturation, even with quad-channel FB-DIMMs, we think). Your efficiency cross-over point may be different if you measure with all CPU cores saturated with VHAR work - which would likely be the case after re-branding.
Josef W. Segur:
--- Quote from: Richard Haselgrove on 18 Apr 2009, 09:33:08 am ---...
Joe, if you're watching, is there any way of knowing how many times these big pulse PoT arrays are processed during a run at different ARs? If there's a massive array at 40960, but it's only run once, the problem is much less than the arrays at 32768 which seem to run forever at VLAR.
...
--- End quote ---
Yes, it is possible. First, here's a table which applies to all LAR pulse finding:
--- Code: ---
FFTLen   Stepsize  NumCfft
     8  17.072753       11
    16   8.536377       23
    32   4.268188       47
    64   2.134094       93
   128   1.067047      187
   256   0.533524      375
   512   0.266762      749
  1024   0.133381     1499
  2048   0.066690     2999
  4096   0.033345     5997
  8192   0.016673    11995
--- End code ---
The Stepsize is in chirp values, and the work goes to a final chirp limit of +/- 100, so the NumCfft (number of chirp/fft pairs) is simply derived by dividing 100 by Stepsize, truncating to an integer, doubling (for positive and negative), and adding 1 (for processing at zero chirp). The reason the same set of values applies to all LAR work is that the code uses pot_min_slew to establish the Stepsize for any work below AR 0.225485775. FFTLen 8 and 16 are only used if there's sufficient motion that PoTlen will be below the 40960 limit.
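To make the arithmetic concrete, here's a minimal standalone sketch (not the actual client code) that reproduces the Stepsize and NumCfft columns from the FFTLen 8 value, assuming Stepsize simply halves each time FFTLen doubles, as the table shows:
--- Code: ---
// Sketch only, not SETI@home client code: rebuild the Stepsize/NumCfft table,
// assuming Stepsize halves with each doubling of FFTLen.
#include <cstdio>
#include <cmath>

int main()
{
    const double chirp_limit = 100.0;   // final chirp limit is +/-100
    double stepsize = 17.072753;        // FFTLen 8 value from the table above

    for (int fftlen = 8; fftlen <= 8192; fftlen *= 2) {
        // divide 100 by Stepsize, truncate, double (for +/-), add 1 (zero chirp)
        int num_cfft = 2 * (int)std::floor(chirp_limit / stepsize) + 1;
        std::printf("%6d %12.6f %8d\n", fftlen, stepsize, num_cfft);
        stepsize /= 2.0;
    }
    return 0;
}
--- End code ---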
When an FFTLen 32 is done it produces 32 arrays 32768 long, but only 31 are searched for signals. For true VLAR, motion is less than one beamwidth for the full WU duration, so each of the 31 is done in one gulp. Because there are 47 chirp/fft pairs to process, we get 31*47=1457 of them.
For anything above one beamwidth the data is processed with PoTs overlapped by at least 50%, so for 1 to 1.5 beamwidths two PoTs are needed, then three for 1.5 to 2 beamwidths, four for 2 to 2.5, etc. Here's a table for true VLAR plus the two peaks which reach 40960 length:
--- Code: ---
FFTLen  AR<=0.05        AR=0.08          AR=0.16
------  --------------  ---------------  --------------
     8                                   462(@40960)
    16                  1035(@40960)     2070(@20480)
    32  1457(@32768)    4371(@20480)     8742(@10240)
    64  5859(@16384)    17577(@10240)    35154(@5120)
   128  23749(@8192)    71247(@5120)     142494(@2560)
   256  95625(@4096)    286875(@2560)    573750(@1280)
   512  382739(@2048)   1148217(@1280)   2296434(@640)
  1024  1.53e6(@1024)   4.6e6(@640)      9.2e6(@320)
  2048  6.14e6(@512)    1.84e7(@320)     3.68e7(@160)
  4096  2.46e7(@256)    7.37e7(@160)     1.47e8(@80)
  8192  9.83e7(@128)    2.95e8(@80)      5.9e8(@40)
--- End code ---
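For anyone who wants to play with the numbers, here's a minimal sketch (again, not the actual client code) that reproduces the table from the rules above. It assumes one beamwidth corresponds to AR 0.05, takes the overlapped PoT count above one beamwidth as floor(2 * beamwidths) per the 50% overlap rule, and assumes each overlapped PoT covers about one beamwidth of data, which matches the lengths shown:
--- Code: ---
// Sketch only, not SETI@home client code: rebuild the LAR pulse PoT table.
// Working assumptions (not taken from the client source): one beamwidth is
// AR 0.05, overlapped PoTs number floor(2 * beamwidths), and each overlapped
// PoT covers roughly one beamwidth of data.
#include <cstdio>
#include <cmath>

int main()
{
    const int    num_cfft[]  = {11, 23, 47, 93, 187, 375, 749, 1499, 2999, 5997, 11995};
    const double ar_values[] = {0.05, 0.08, 0.16};

    for (double ar : ar_values) {
        std::printf("AR=%.2f\n", ar);
        double beamwidths = ar / 0.05;              // assumed: 1 beamwidth ~ AR 0.05

        int fftlen = 8;
        for (int i = 0; i < 11; ++i, fftlen *= 2) {
            int data_len = 1024 * 1024 / fftlen;    // PoT data points per FFT bin
            int num_pots, pot_len;
            if (beamwidths <= 1.0) {                // true VLAR: whole WU in one gulp
                num_pots = 1;
                pot_len  = data_len;
            } else {                                // overlapped PoTs, >=50% overlap
                num_pots = (int)std::floor(2.0 * beamwidths);
                pot_len  = (int)std::lround(data_len / beamwidths);
            }
            if (pot_len > 40960)                    // over the PoT length limit: skipped
                continue;
            // each chirp/fft pair searches FFTLen-1 bins, e.g. 31 * 47 = 1457 at VLAR
            long count = (long)num_pots * (fftlen - 1) * num_cfft[i];
            std::printf("  FFTLen %5d: %9ld PoTs of length %6d\n", fftlen, count, pot_len);
        }
    }
    return 0;
}
--- End code ---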
In another thread there's the testing Raistmer did at my request, showing how much time is saved on VLAR work when the 32768 and 16384 PoT lengths are not done, and when the 8192 length is skipped as well. Although the counts for those lengths are relatively small, they seem to account for most of the GPU crunch time. What I don't know is whether the relationship is relatively smooth or whether time increases in steps as certain sizes are exceeded.
Raistmer, could you provide the raw data from which your chart was derived? I'd like to be able to correlate things better.
Joe
Raistmer:
--- Quote from: Richard Haselgrove on 18 Apr 2009, 09:33:08 am ---
Also, another thought. Several people (myself included) have observed and reported that VHAR tasks don't scale well on multi-core hosts: there's a big penalty for running 8 x VHAR on my dual Xeon (memory bus saturation, even with quad-channel FB-DIMMs, we think). Your efficiency cross-over point may be different if you measure with all CPU cores saturated with VHAR work - which would likely be the case after re-branding.
--- End quote ---
Yes, multi-core (as well as multi-GPU) considerations can complicate this picture a lot.
I consider it a first-level approach, while the current one (any AR goes anywhere) is a zero-level one. Moreover, the situation on an HT-capable host can be different too (compared with a true multi-core host).
I will try to measure CPU/elapsed times while all cores are busy with VHARs.
For the record: all measurements in the first post were made with BOINC disabled. All cores were idle except the one running the test app.
Raistmer:
--- Quote from: Josef W. Segur on 18 Apr 2009, 05:08:33 pm ---
Raistmer, could you provide the raw data from which your chart was derived? I'd like to be able to correlate things better.
Joe
--- End quote ---
Sure, will mail it to you.