Let's compare the relative performance of MultiBeam and AstroPulse on different hardware.
The comparison should not include any credit-based metric (because credit awarding is screwed up and no longer reflects performance, IMO).
So I chose another metric: the sum of elapsed times for the PGv7 task set divided by the elapsed time for the Clean01 task.
Such a number doesn't mean anything for a single device, but by comparing these numbers between devices one can say which device is better suited for which application.
The bigger the number, the relatively faster AstroPulse processing is on that device.
IMHO it would be interesting to compare such numbers between different CPUs and GPUs to understand what is better to crunch on what, without bringing credits or RAC into consideration.
[ADDON: of course it's quite a simplified metric, so it gives only a rough estimate; only quite big differences in numbers between devices are meaningful and can lead to any conclusions. For now this metric omits AP blanking, relative CPU usage of GPU apps, the change in performance on a fully loaded multicore CPU, and so on. For each application the best available app will be used with current production settings (not the best possible, but the one chosen as best at the given time).]
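To make the metric fully explicit, here is a minimal Python sketch of the calculation (the function and argument names are mine, purely illustrative):

# Minimal sketch of the metric described above (names are illustrative only).
#
# mb7_elapsed: elapsed times, in seconds, of the four PGv7 reference tasks
#              (PG0009, PG0395, PG0444, PG1327) processed by the MB7 app.
# ap6_elapsed: elapsed time, in seconds, of the Clean_01LC.wu task processed
#              by the AP6 app on the same device.
# skip_vlar:   if True, drop the VLAR task (PG0009, assumed first in the list);
#              this gives the number shown in parentheses later in the post.
def ap_vs_mb_number(mb7_elapsed, ap6_elapsed, skip_vlar=False):
    times = mb7_elapsed[1:] if skip_vlar else mb7_elapsed
    return sum(times) / ap6_elapsed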
So let's start with my Intel Core2 Q9450 host:
MB7 data:
Core2 Q9450, idle:
App Name Task name AR/blanking% CPU time Elapsed ffa_block ffa_block_fetch unroll hp use_sleep skip_ffa_precompute sbs
AKv8c_r1973_winx86_SSSE3xjs PG0009_v7.wu 0.008955 529.826 532.221 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0395_v7.wu 0.394768 475.148 477.385 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0444_v7.wu 0.444184 408.255 410.520 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG1327_v7.wu 1.326684 412.389 414.608 -1 -1 -1 0 0 0 -1
So, the MB7 number will be: 1834.734
AP6 data:
WU : Clean_01LC.wu
AP6_win_x86_SSE_CPU_r1797.exe
Elapsed 306.450 secs
CPU 304.155 secs
So, the AP6 number will be: 306.450
And the relative AP6-vs-MB7 performance number (in the chosen metric) for the Q9450 will be: 5.987 (4.25). The number in parentheses is the same metric with the VLAR task (PG0009) excluded.
Now this number should be compared with a similarly acquired number for another device to acquire any meaning.
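As a worked example (plain arithmetic on the elapsed times listed above, following the sketch's logic):

# Q9450 elapsed times from the tables above, in seconds.
pg_elapsed = [532.221, 477.385, 410.520, 414.608]   # PG0009, PG0395, PG0444, PG1327
clean01_elapsed = 306.450

print(sum(pg_elapsed) / clean01_elapsed)        # ~5.987 : all four tasks (1834.734 / 306.450)
print(sum(pg_elapsed[1:]) / clean01_elapsed)    # ~4.25  : VLAR task PG0009 excluded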
And similar numbers for the HD6950 I use in that host:
App Name Task name AR/blanking% CPU time Elapsed ffa_block ffa_block_fetch unroll hp use_sleep skip_ffa_precompute sbs
MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843 PG0009_v7.wu 0.008955 32.526 125.850 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843 PG0395_v7.wu 0.394768 27.597 72.738 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843 PG0444_v7.wu 0.444184 29.079 71.293 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843 PG1327_v7.wu 1.326684 19.266 71.931 -1 -1 -1 0 0 0 -1
WU : Clean_01LC.wu
AP6_win_x86_SSE2_OpenCL_ATI_r2058.exe -unroll 16 -ffa_block 8192 -ffa_block_fetch 8192 :
Elapsed 19.822 secs
CPU 5.897 secs
So the HD6950 number will be: 17.244 (10.895).
Because no VLAR work comes to GPUs on SETI main now, the metric without the PG0009 task reflects real relative performance better (and that's the number given in parentheses).
So, the Q9450 CPU scores 4.25 and the HD6950 GPU scores 10.9, more than twice as much. That is, AstroPulse tasks are relatively better to run on this ATi GPU than on this Intel CPU.
If one were to configure such a host for best SETI project performance, one would compute only MB7 tasks on this CPU, leaving this GPU for AP6 tasks only.
AMD C-60 APU:
CPU part:
C-60, idle:
App Name Task name AR/blanking% CPU time Elapsed ffa_block ffa_block_fetch unroll hp use_sleep skip_ffa_precompute sbs
AKv8c_r1973_winx86_SSSE3xjs PG0009_v7.wu 0.008955 2106.762 2115.052 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0395_v7.wu 0.394768 1889.391 1896.355 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0444_v7.wu 0.444184 1650.896 1658.018 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG1327_v7.wu 1.326684 1964.115 1975.775 -1 -1 -1 0 0 0 -1
WU : Clean_01LC.wu
AP6_win_x86_SSE_CPU_r1797.exe -verbose :
Elapsed 1911.643 secs
CPU 1895.631 secs
So, the AMD C-60 CPU part number is: 4.0 (2.89).
For the Intel Q9450 it was: 5.987 (4.25).
Comparing these two CPUs with each other, one can say that the C-60 (CPU part) is much less suitable for AstroPulse than Intel's Core2 quad (even with the optimized AP6 CPU app; stock results would be even worse). This is an already known fact, and the most probable explanation is L2 cache size: the CPU AP6 app is cache-hungry.
Now C-60 GPU part:
MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC PG0009_v7.wu 0.008955 111.244 1190.611 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC PG0395_v7.wu 0.394768 91.853 967.816 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC PG0444_v7.wu 0.444184 93.320 904.281 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC PG1327_v7.wu 1.326684 60.341 927.759 -1 -1 -1 0 0 0 -1
AP6_win_x86_SSE2_OpenCL_ATI_r2058.exe -unroll 4 -ffa_block 4096 -ffa_block_fetch 4096 :
Elapsed 298.819 secs
CPU 14.711 secs
And the C-60 GPU part number is: 13.35 (9.37).
It's definitely better to crunch AP only on the GPU part of this APU, leaving the CPU either idle or busy with CPU MB7.
Comparing this GPU with the discrete HD6950, which scored 17.244 (10.895): the balance shifts toward MB7 for the C-60 GPU. As it should, because currently OpenCL GPU MB7 performance is limited mostly by work size: some searches are just too small to adequately load a GPU with a big number of CUs. On the other hand, the C-60 has a low CU count, so it is loaded better by MB7 work, and the difference in performance between MB7 and AP6 becomes smaller.
So, high-end ATi GPUs prefer AP6 tasks to a greater degree than low-end ones do.
(to be continued: Intel and NV GPUs next...)
And now let's move to another host, an Intel i5-3470 (Ivy Bridge) APU with a discrete GSO9600 NV GPU installed.
CPU part:
AKv8c_r1973_winx86_SSSE3xjs PG0009_v7.wu 0.008955 272.159 274.237 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0395_v7.wu 0.394768 261.114 263.201 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG0444_v7.wu 0.444184 223.705 226.044 -1 -1 -1 0 0 0 -1
AKv8c_r1973_winx86_SSSE3xjs PG1327_v7.wu 1.326684 230.569 232.651 -1 -1 -1 0 0 0 -1
AP6_win_x86_SSE_CPU_r1797.exe -verbose :
Elapsed 162.830 secs
CPU 160.650 secs
So, the Ivy Bridge CPU part number is: 6.12 (4.43).
Comparing with the old Core2 quad, which scored 5.987 (4.25): the values are very close, most probably within the range of error and definitely inside the systematic error bounds of this method.
AP is not good on the CPU, even on Ivy Bridge.
GPU part:
MB7_win_x86_SSE_OpenCL_Intel_r2061 PG0009_v7.wu 0.008955 17.503 562.534 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_Intel_r2061 PG0395_v7.wu 0.394768 15.772 390.116 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_Intel_r2061 PG0444_v7.wu 0.444184 19.157 360.820 -1 -1 -1 0 0 0 -1
MB7_win_x86_SSE_OpenCL_Intel_r2061 PG1327_v7.wu 1.326684 16.942 528.949 -1 -1 -1 0 0 0 -1
AP6_win_x86_SSE2_OpenCL_Intel_r2058.exe -verbose :
Elapsed 242.775 secs
CPU 3.619 secs
The Ivy Bridge GPU part scores: 7.59 (5.27).
It's natural to compare this entry-level GPU with the C-60 GPU part, which scored 13.35 (9.37).
Surprise! The Intel GPU is much more tolerant of the MB7 app than ATi GPUs are! AP here is almost as bad as on the CPU part. So, one could use both OpenCL apps on this GPU to keep it busy all the time.
And finally let's consider an NV GPU, a pre-Fermi one: the GSO9600.
Lunatics_x41zc_win32_cuda23 PG0009_v7.wu 0.008955 11.450 473.929 -1 -1 -1 0 0 0 -1
Lunatics_x41zc_win32_cuda23 PG0395_v7.wu 0.394768 11.840 153.582 -1 -1 -1 0 0 0 -1
Lunatics_x41zc_win32_cuda23 PG0444_v7.wu 0.444184 11.248 137.389 -1 -1 -1 0 0 0 -1
Lunatics_x41zc_win32_cuda23 PG1327_v7.wu 1.326684 15.725 193.752 -1 -1 -1 0 0 0 -1
WU : Clean_01LC.wu
AP6_win_x86_SSE2_OpenCL_NV_r2058.exe -verbose :
Elapsed 209.530 secs
CPU 1.591 secs
The GSO9600 scores: 4.58 (2.31).
First of all, it's really bad on VLAR: the two scores differ hugely. And both are lower than the Intel CPU ones.
So, better to avoid loading this GPU with AP6 tasks; leave such devices for MB7 CUDA tasks instead (no matter what your RAC tells you).
It would be interesting if some Fermi or Kepler class NV GPU owner continued this research and posted bench results for one of those devices.
Also, it would be interesting to compare the GCN generation of ATi GPUs with older ones.
And finally, it would be interesting to get some data from one of the current top AMD CPUs too.
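To make the cross-device comparison easier to eyeball, here is a tiny sketch that only collects the numbers already measured above and ranks the devices by the no-VLAR metric:

# AP-vs-MB numbers from this thread, as (with VLAR, without VLAR) pairs.
scores = {
    "HD6950 (discrete ATi GPU)": (17.244, 10.895),
    "C-60 GPU part":             (13.35,  9.37),
    "Ivy Bridge GPU part":       (7.59,   5.27),
    "i5-3470 CPU part":          (6.12,   4.43),
    "Core2 Q9450 CPU":           (5.987,  4.25),
    "GSO9600 (NV GPU)":          (4.58,   2.31),
    "C-60 CPU part":             (4.0,    2.89),
}

# The higher the no-VLAR number, the more the device relatively prefers AstroPulse.
for device, (full, no_vlar) in sorted(scores.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{device:28s} {no_vlar:6.2f}   (with VLAR: {full:7.3f})")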