Seti@Home optimized science apps and information


Title: What is best hardware for what SETI application
Post by: Raistmer on 19 Nov 2013, 06:03:46 pm
Let's compare the relative performance of MultiBeam and AstroPulse on different hardware.
The comparison should not include any credit-based metric (because credit awarding is broken and no longer reflects performance, IMO).

So I chose another metric for comparison: the sum of elapsed times for the PGv7 task set divided by the elapsed time for the Clean01 task.
Such a number means nothing for a single device by itself, but by comparing these numbers between devices one can say which device is better suited for which application.
The bigger the number, the relatively faster AstroPulse processing is on that device.
IMHO it would be interesting to compare such numbers between different CPUs and GPUs to understand what is best crunched where, without involving credits or RAC.

[ADDON: of course this is a quite simplified metric, so it gives only a rough estimate; only a fairly big difference in the numbers for different devices can mean anything and lead to conclusions. For now this metric omits AP blanking, the relative CPU usage of GPU apps, the change in performance on a fully loaded multi-core CPU, and so on. For each app the best available build is used with current production settings (not the best possible, but the one chosen as best at the given time).]
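
For clarity, here is a minimal sketch of this metric in Python (the function and argument names are mine, not anything taken from the apps themselves):

Code:
def ap_vs_mb_score(mb_elapsed, ap_elapsed, vlar_index=0):
    """AP6-vs-MB7 score: sum of MB7 elapsed times divided by AP6 elapsed time.

    mb_elapsed -- elapsed times (seconds) for the PGv7 tasks, with the
                  VLAR task (PG0009) at position vlar_index
    ap_elapsed -- elapsed time (seconds) for the Clean_01LC AP6 task
    Returns (score, score_without_vlar); the second value is the one
    shown in parentheses in the posts below.
    """
    total = sum(mb_elapsed)
    without_vlar = total - mb_elapsed[vlar_index]
    return total / ap_elapsed, without_vlar / ap_elapsed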


So let's start with my Intel Core2 Q9450 host:

MB7 data:
 Core2 Q9450, idle:
App Name   Task name   AR/blanking%   CPU time   Elapsed   ffa_block   ffa_block_fetch   unroll   hp   use_sleep   skip_ffa_precompute   sbs
AKv8c_r1973_winx86_SSSE3xjs    PG0009_v7.wu    0.008955   529.826   532.221   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG0395_v7.wu    0.394768   475.148   477.385   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG0444_v7.wu    0.444184   408.255   410.520   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG1327_v7.wu    1.326684   412.389   414.608   -1   -1   -1   0   0   0   -1

So, the MB7 number (sum of the four elapsed times) will be: 1834.734

AP6 data:
WU : Clean_01LC.wu
AP6_win_x86_SSE_CPU_r1797.exe
  Elapsed 306.450 secs
      CPU 304.155 secs

So, the AP6 number will be: 306.450

And the relative AP6-vs-MB7 performance number (in the chosen metric) for the Q9450 will be: 5.987 (4.25)
Now this number should be compared with a similarly acquired number for another device to acquire any meaning.
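
As a sanity check, plugging the Q9450 times from the tables above into the sketch reproduces the published pair of scores:

Code:
print(ap_vs_mb_score([532.221, 477.385, 410.520, 414.608], 306.450))
# -> approximately (5.987, 4.250)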

And similar numbers for the HD6950 I use in that host:
App Name   Task name   AR/blanking%   CPU time   Elapsed   ffa_block   ffa_block_fetch   unroll   hp   use_sleep   skip_ffa_precompute   sbs
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843    PG0009_v7.wu    0.008955   32.526   125.850   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843    PG0395_v7.wu    0.394768   27.597   72.738   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843    PG0444_v7.wu    0.444184   29.079   71.293   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r1843    PG1327_v7.wu    1.326684   19.266   71.931   -1   -1   -1   0   0   0   -1

WU : Clean_01LC.wu
AP6_win_x86_SSE2_OpenCL_ATI_r2058.exe -unroll 16 -ffa_block 8192 -ffa_block_fetch 8192 :
  Elapsed 19.822 secs
      CPU 5.897 secs

So the HD6950 number will be: 17.244 (10.895)

Since no VLAR tasks come to GPUs on SETI main now, the metric without the PG0009 task reflects real relative performance better (this number is given in parentheses).

So, the CPU Q9450 scores 4.25 and the GPU HD6950 scores 10.9, more than twice as much.
That is, AstroPulse tasks are relatively better done on this ATi GPU than on this Intel CPU. If one were configuring this host for best SETI project performance, one would compute only MB7 tasks on such a CPU, leaving such a GPU only for AP6 tasks.


AMD C-60 APU:
CPU part:
C-60, idle:
App Name   Task name   AR/blanking%   CPU time   Elapsed   ffa_block   ffa_block_fetch   unroll   hp   use_sleep   skip_ffa_precompute   sbs
AKv8c_r1973_winx86_SSSE3xjs    PG0009_v7.wu    0.008955   2106.762   2115.052   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG0395_v7.wu    0.394768   1889.391   1896.355   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG0444_v7.wu    0.444184   1650.896   1658.018   -1   -1   -1   0   0   0   -1
AKv8c_r1973_winx86_SSSE3xjs    PG1327_v7.wu    1.326684   1964.115   1975.775   -1   -1   -1   0   0   0   -1

WU : Clean_01LC.wu
AP6_win_x86_SSE_CPU_r1797.exe -verbose  :
  Elapsed 1911.643 secs
      CPU 1895.631 secs

So, the AMD C-60 CPU part number is: 4.0 (2.89)
For the Intel Q9450 it was: 5.987 (4.25)

Comparing these two CPUs with each other, one can say that the C-60 (CPU part) is much less suitable for AstroPulse than Intel's Core2 quad (even with the opt AP6 CPU app; for stock the result would be even worse). It's an already known fact, and the most probable explanation is the size of the L2 cache: the CPU AP6 is a cache-hungry app.

Now C-60 GPU part:

 MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC    PG0009_v7.wu    0.008955   111.244   1190.611   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC    PG0395_v7.wu    0.394768   91.853   967.816   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC    PG0444_v7.wu    0.444184   93.320   904.281   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_ATi_HD5_r2033_ZC    PG1327_v7.wu    1.326684   60.341   927.759   -1   -1   -1   0   0   0   -1

AP6_win_x86_SSE2_OpenCL_ATI_r2058.exe -unroll 4  -ffa_block 4096 -ffa_block_fetch 4096 :
  Elapsed 298.819 secs
      CPU 14.711 secs

And the C-60 GPU part number is: 13.35 (9.37)
It's definitely better to crunch AP only on the GPU part of this APU, leaving the CPU either idle or busy with CPU MB7.
Comparing this GPU with the discrete HD6950, which has 17.244 (10.895): MB7's relative standing on the C-60 GPU has improved. As it should, because currently OpenCL GPU MB7 performance is limited mostly by work size: some searches are just too small to adequately load a GPU with a big number of CUs. On the other hand, the C-60 has a low CU count, so it is loaded better with MB7 work, and the difference in performance between MB7 and AP6 becomes smaller.
So, high-end ATi GPUs prefer AP6 tasks to a greater degree than low-end ones do.

(to be continued: Intel and NV GPUs next...)

And now let's move to another host, with an Intel i5-3470 (Ivy Bridge) APU and a discrete GSO9600 NV GPU installed.

CPU part:
 AKv8c_r1973_winx86_SSSE3xjs    PG0009_v7.wu    0.008955   272.159   274.237   -1   -1   -1   0   0   0   -1
 AKv8c_r1973_winx86_SSSE3xjs    PG0395_v7.wu    0.394768   261.114   263.201   -1   -1   -1   0   0   0   -1
 AKv8c_r1973_winx86_SSSE3xjs    PG0444_v7.wu    0.444184   223.705   226.044   -1   -1   -1   0   0   0   -1
 AKv8c_r1973_winx86_SSSE3xjs    PG1327_v7.wu    1.326684   230.569   232.651   -1   -1   -1   0   0   0   -1

AP6_win_x86_SSE_CPU_r1797.exe -verbose  :
  Elapsed 162.830 secs
      CPU 160.650 secs

So, the Ivy Bridge CPU part number is: 6.12 (4.43)
Comparing with the old Core2 quad, which has 5.987 (4.25):
The values are very close, most probably within the range of error and definitely inside the systematic error bounds of this method. AP is not good on CPUs, even on Ivy Bridge.


GPU part:
 MB7_win_x86_SSE_OpenCL_Intel_r2061    PG0009_v7.wu    0.008955   17.503   562.534   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_Intel_r2061    PG0395_v7.wu    0.394768   15.772   390.116   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_Intel_r2061    PG0444_v7.wu    0.444184   19.157   360.820   -1   -1   -1   0   0   0   -1
 MB7_win_x86_SSE_OpenCL_Intel_r2061    PG1327_v7.wu    1.326684   16.942   528.949   -1   -1   -1   0   0   0   -1

AP6_win_x86_SSE2_OpenCL_Intel_r2058.exe -verbose  :
  Elapsed 242.775 secs
      CPU 3.619 secs

The Ivy Bridge GPU part scores: 7.59 (5.27)
It's natural to compare this entry-level GPU with the C-60 GPU part, which has 13.35 (9.37).
Surprise! The Intel GPU is much more tolerant of the MB7 app than the ATi GPUs are! AP here is almost as bad as on the CPU part. So, one could use both OpenCL apps on this GPU to keep it busy all the time.

And finally, let's consider an NV (pre-Fermi) GPU, the GSO9600:
 Lunatics_x41zc_win32_cuda23    PG0009_v7.wu    0.008955   11.450   473.929   -1   -1   -1   0   0   0   -1
 Lunatics_x41zc_win32_cuda23    PG0395_v7.wu    0.394768   11.840   153.582   -1   -1   -1   0   0   0   -1
 Lunatics_x41zc_win32_cuda23    PG0444_v7.wu    0.444184   11.248   137.389   -1   -1   -1   0   0   0   -1
 Lunatics_x41zc_win32_cuda23    PG1327_v7.wu    1.326684   15.725   193.752   -1   -1   -1   0   0   0   -1

WU : Clean_01LC.wu
AP6_win_x86_SSE2_OpenCL_NV_r2058.exe -verbose  :
  Elapsed 209.530 secs
      CPU 1.591 secs

The GSO9600 scores: 4.58 (2.31)
First of all, it's really bad on VLAR; the two scores differ hugely. And both are lower than the Intel CPU ones.
So it's better to avoid loading this GPU with AP6 tasks; leave such devices for MB7 CUDA tasks instead (no matter what your RAC tells you ;) )
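
For convenience, here is the same sketch applied to every device measured in this thread, re-using the ap_vs_mb_score function from above; the elapsed times are copied from the posts, and the device labels are mine:

Code:
# Four MB7 PGv7 elapsed times (VLAR first) and the Clean_01LC AP6 elapsed time
devices = {
    "Q9450 (CPU)":      ([532.221, 477.385, 410.520, 414.608], 306.450),
    "HD6950 (GPU)":     ([125.850, 72.738, 71.293, 71.931], 19.822),
    "C-60 CPU part":    ([2115.052, 1896.355, 1658.018, 1975.775], 1911.643),
    "C-60 GPU part":    ([1190.611, 967.816, 904.281, 927.759], 298.819),
    "i5-3470 CPU part": ([274.237, 263.201, 226.044, 232.651], 162.830),
    "i5-3470 GPU part": ([562.534, 390.116, 360.820, 528.949], 242.775),
    "GSO9600 (GPU)":    ([473.929, 153.582, 137.389, 193.752], 209.530),
}
for name, (mb, ap) in devices.items():
    score, score_no_vlar = ap_vs_mb_score(mb, ap)
    print("%-18s %6.2f (%.2f)" % (name, score, score_no_vlar))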

It would be interesting if some owner of a Fermi- or Kepler-class NV GPU continued this research and posted bench results for one of those devices.
Also, it would be interesting to compare the GCN generation of ATi GPUs with the older ones.
And finally, it would be interesting to get some data from one of the current top AMD CPUs too.