Crunch3r has released four new seti@home cuda 3.0 apps
here.
I checked the speed of the apps and the validity of their results. The tests were done on
this machine.
First of all, if you haven't seen it, NVIDIA has released a
Fermi compatibility guide, which says:
My application uses the CUDA Runtime API with CUDA Toolkit 2.1, 2.2, or 2.3.
How can I confirm that my application is ready to run on Fermi?
Answer: CUDA applications built using the CUDA Toolkit versions 2.1 through
2.3 are compatible with Fermi as long as they are built to include PTX versions of
their kernels. NVIDIA Driver versions 195.xx or newer allow the application to use
the PTX JIT code path. To test that PTX JIT is working for your application, you
can do the following:
- Go to the NVIDIA website, and install the latest R195 driver.
- Set the system environment flag CUDA_FORCE_PTX_JIT=1
- Launch your application.
When starting a CUDA application for the first time with the above environment
flag, the CUDA driver will JIT compile the PTX for each CUDA kernel that is used
into native CUBIN code. The generated CUBIN for the target GPU architecture is
cached by the CUDA driver. This cache persists across system shutdown/restart
events.
If this test passes, then your application is ready for Fermi.
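As a rough illustration of the steps quoted above, the flag can also be set for a single process instead of system-wide. This is only a sketch: the helper names `ptx_jit_env` and `run_with_ptx_jit` are made up for this example, and the command passed in stands for whatever CUDA application is being tested.

```python
import os
import subprocess


def ptx_jit_env(base_env=None):
    """Return a copy of the environment with CUDA_FORCE_PTX_JIT=1 added."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_FORCE_PTX_JIT"] = "1"  # flag name taken from the guide quoted above
    return env


def run_with_ptx_jit(argv):
    """Launch argv with the flag set and return its exit code."""
    # argv is a placeholder for the real application command line.
    return subprocess.run(argv, env=ptx_jit_env()).returncode
```

Setting the flag per process keeps the rest of the system on the normal code path, which makes it easier to compare runs with and without JIT.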
So I've done the following tests:
Cuda 2.2 app
No CUDA_FORCE_PTX_JIT flag:
cuda 2.3 libs OK
cuda 3 libs OK
CUDA_FORCE_PTX_JIT flag enabled:
cuda 2.3 libs OK
cuda 3 libs FAIL
All cuda 3 apps
No CUDA_FORCE_PTX_JIT flag:
cuda 3 libs OK
CUDA_FORCE_PTX_JIT flag enabled:
cuda 3 libs FAIL
The combination of the CUDA_FORCE_PTX_JIT flag and the cuda 3 libraries produces garbage results (every result overflows), no matter which app was used.
Now to the tests.
I used 10 different normal workunits. Instead of their names, I list each one's AR (angle range) along with what was found: S=Spikes, P=Pulses, T=Triplets and G=Gaussians.
I used the time utility to get accurate running times. A bit of explanation:
real: elapsed real (wall clock) time used by the process
user: Total number of CPU-seconds that the process used directly (in user mode)
sys: Total number of CPU-seconds used by the system on behalf of the process (in kernel mode) e.g., executing system calls
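For readers who want to collect the same three numbers from a script, here is a minimal sketch of what the time utility reports. It assumes a Unix-like system (the `resource` module is not available on Windows), and `time_command` is a name invented for this example.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (real, user, sys) in seconds for the child process."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # waits for the child to finish
    real = time.perf_counter() - start  # wall clock time
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime  # CPU seconds in user mode
    system = after.ru_stime - before.ru_stime  # CPU seconds in kernel mode
    return real, user, system


if __name__ == "__main__":
    # A small stand-in workload; replace with the real application command.
    real, user, system = time_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"real {real:.2f} sec  user {user:.2f} sec  sys {system:.2f} sec")
```

For a GPU app, real is expected to be much larger than user plus sys, because the process spends most of its time waiting for the graphics card.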
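Percentages were calculated against a baseline of the cuda 2.2 app with cuda 2.3 libraries
and no CUDA_FORCE_PTX_JIT flag, using the formula 100*(cuda2.2-app)/cuda2.2, so a negative value means the tested combination took more time than the baseline. The exception is the VLAR workunit, where AK_V8 was used as the baseline.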
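The percentage formula can be written out as a small function. This is a sketch for checking the figures by hand; the function name `percent_vs_base` is made up for this example.

```python
def percent_vs_base(base_sec, app_sec):
    """Return 100*(base-app)/base; negative means the app took longer than the base."""
    if base_sec == 0:
        raise ValueError("base time must be non-zero")
    return 100.0 * (base_sec - app_sec) / base_sec


if __name__ == "__main__":
    # Real times for AR 0.230341: baseline run vs. the same app with jit enabled.
    print(f"{percent_vs_base(4956.77, 4965.56):.2f}%")
    # VLAR workunit: AK_V8 on the CPU as base vs. the CUDA 3 app.
    print(f"{percent_vs_base(8886.5, 10700.64):.2f}%")
```

With the numbers from the tables, the two calls give -0.18% and -20.41%, matching the listed values.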
VLAR workunit: AR 0.011923 0S 7P 1T 0G
Columns: app, real (% vs AK_V8), user (% vs AK_V8), sys (% vs AK_V8)
AK_V8 ssse3 64bit 8886.5 sec 8876.21 sec 1.31 sec
CUDA 3 10700.64 sec -20.41% 52.13 sec 99.41% 5.82 sec -344.27%
cuda30 10701.57 sec -20.43% 52.77 sec 99.41% 5.86 sec -347.33%
cuda30_v0.2 10701.18 sec -20.42% 52.55 sec 99.41% 5.73 sec -337.40%
CUDA 2.2 app, CUDA 2.3 libs, no jit
Columns: workunit, real, user, sys
AR 0.230341 0S 1P 0T 18G 4956.77 sec 73.5 sec 11.71 sec
AR 0.265939 2S 0P 0T 5G 4021.79 sec 68.58 sec 10.6 sec
AR 0.309898 29S 1P 0T 0G 936.14 sec 41.94 sec 3.31 sec
AR 0.386085 3S 1P 0T 1G 2756.04 sec 48.42 sec 19.01 sec
AR 0.396978 6S 1P 2T 0G 2675.23 sec 58.76 sec 8.08 sec
AR 0.409679 2S 0P 0T 0G 2435.57 sec 57.46 sec 7.61 sec
AR 0.437709 1S 0P 1T 0G 2231.96 sec 55.85 sec 6.73 sec
AR 0.510893 4S 0P 0T 0G 2024.16 sec 53.12 sec 6.3 sec
AR 0.942199 8S 0P 0T 0G 1201.43 sec 46.57 sec 4.52 sec
Total 23239.09 sec 504.2 sec 77.87 sec
CUDA 2.2 app, CUDA 2.3 libs, jit enabled
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 4965.56 sec -0.18% 82.1 sec -11.70% 12.24 sec -4.53%
AR 0.265939 2S 0P 0T 5G 4030.33 sec -0.21% 77.52 sec -13.04% 10.8 sec -1.89%
AR 0.309898 29S 1P 0T 0G 945.22 sec -0.97% 51.26 sec -22.22% 3.4 sec -2.72%
AR 0.386085 3S 1P 0T 1G 2763.89 sec -0.28% 68.13 sec -40.71% 8.22 sec 56.76%
AR 0.396978 6S 1P 2T 0G 2684.3 sec -0.34% 68.25 sec -16.15% 8.24 sec -1.98%
AR 0.409679 2S 0P 0T 0G 2443.44 sec -0.32% 65.97 sec -14.81% 7.67 sec -0.79%
AR 0.437709 1S 0P 1T 0G 2240.33 sec -0.38% 64.65 sec -15.76% 7.08 sec -5.20%
AR 0.510893 4S 0P 0T 0G 2033.17 sec -0.45% 62.74 sec -18.11% 6.6 sec -4.76%
AR 0.942199 8S 0P 0T 0G 1209.93 sec -0.71% 55.3 sec -18.75% 4.85 sec -7.30%
Total 23316.17 sec -0.33% 595.92 sec -18.19% 69.1 sec 11.26%
CUDA 2.2 app, CUDA 3 libs, no jit
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 5042.03 sec -1.72% 72 sec 2.04% 11.98 sec -2.31%
AR 0.265939 2S 0P 0T 5G 4102.33 sec -2.00% 67.36 sec 1.78% 10.7 sec -0.94%
AR 0.309898 29S 1P 0T 0G 973.15 sec -3.95% 41.91 sec 0.07% 3.18 sec 3.93%
AR 0.386085 3S 1P 0T 1G 2823.52 sec -2.45% 58.44 sec -20.69% 7.99 sec 57.97%
AR 0.396978 6S 1P 2T 0G 2742.01 sec -2.50% 58.02 sec 1.26% 7.88 sec 2.48%
AR 0.409679 2S 0P 0T 0G 2501.22 sec -2.70% 56.48 sec 1.71% 7.38 sec 3.02%
AR 0.437709 1S 0P 1T 0G 2295.95 sec -2.87% 54.51 sec 2.40% 6.97 sec -3.57%
AR 0.510893 4S 0P 0T 0G 2084.92 sec -3.00% 51.32 sec 3.39% 7.26 sec -15.24%
AR 0.942199 8S 0P 0T 0G 1252.34 sec -4.24% 45.24 sec 2.86% 4.42 sec 2.21%
Total 23817.47 sec -2.49% 505.28 sec -0.21% 67.76 sec 12.98%
CUDA 3 app, CUDA 3 libs, no jit
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 5282.36 sec -6.57% 72.33 sec 1.59% 11.7 sec 0.09%
AR 0.265939 2S 0P 0T 5G 4323.51 sec -7.50% 67.5 sec 1.57% 10.54 sec 0.57%
AR 0.309898 29S 1P 0T 0G 1023.86 sec -9.37% 42.41 sec -1.12% 3.24 sec 2.11%
AR 0.386085 3S 1P 0T 1G 3007.54 sec -9.13% 58.44 sec -20.69% 7.96 sec 58.13%
AR 0.396978 6S 1P 2T 0G 2921.9 sec -9.22% 58.06 sec 1.19% 7.71 sec 4.58%
AR 0.409679 2S 0P 0T 0G 2678.24 sec -9.96% 56.37 sec 1.90% 7.34 sec 3.55%
AR 0.437709 1S 0P 1T 0G 2459.33 sec -10.19% 55.56 sec 0.52% 6.89 sec -2.38%
AR 0.510893 4S 0P 0T 0G 2234.59 sec -10.40% 52 sec 2.11% 6.18 sec 1.90%
AR 0.942199 8S 0P 0T 0G 1331.61 sec -10.84% 45.4 sec 2.51% 4.46 sec 1.33%
Total 25262.94 sec -8.71% 508.07 sec -0.77% 66.02 sec 15.22%
CUDA 3 vlarkill app, CUDA 3 libs, no jit
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 5282.05 sec -6.56% 71.86 sec 2.23% 11.86 sec -1.28%
AR 0.265939 2S 0P 0T 5G 4321.99 sec -7.46% 66.95 sec 2.38% 10.26 sec 3.21%
AR 0.309898 29S 1P 0T 0G 1024.01 sec -9.39% 42.57 sec -1.50% 3.16 sec 4.53%
AR 0.386085 3S 1P 0T 1G 3007.05 sec -9.11% 58.21 sec -20.22% 7.71 sec 59.44%
AR 0.396978 6S 1P 2T 0G 2923.24 sec -9.27% 58.72 sec 0.07% 7.8 sec 3.47%
AR 0.409679 2S 0P 0T 0G 2678.58 sec -9.98% 56.52 sec 1.64% 7.45 sec 2.10%
AR 0.437709 1S 0P 1T 0G 2458.97 sec -10.17% 55.2 sec 1.16% 7 sec -4.01%
AR 0.510893 4S 0P 0T 0G 2233.06 sec -10.32% 52.49 sec 1.19% 5.7 sec 9.52%
AR 0.942199 8S 0P 0T 0G 1331.66 sec -10.84% 45.38 sec 2.56% 4.43 sec 1.99%
Total 25260.61 sec -8.70% 507.9 sec -0.73% 65.37 sec 16.05%
cuda30 app, CUDA 3 libs, no jit
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 5282.88 sec -6.58% 72.34 sec 1.58% 11.7 sec 0.09%
AR 0.265939 2S 0P 0T 5G 4322.89 sec -7.49% 67.25 sec 1.94% 10.55 sec 0.47%
AR 0.309898 29S 1P 0T 0G 1023.61 sec -9.34% 42.21 sec -0.64% 3.22 sec 2.72%
AR 0.386085 3S 1P 0T 1G 3008.17 sec -9.15% 58.68 sec -21.19% 8.14 sec 57.18%
AR 0.396978 6S 1P 2T 0G 2921.98 sec -9.22% 57.88 sec 1.50% 7.67 sec 5.07%
AR 0.409679 2S 0P 0T 0G 2678.14 sec -9.96% 56.3 sec 2.02% 7.35 sec 3.42%
AR 0.437709 1S 0P 1T 0G 2458.46 sec -10.15% 55.11 sec 1.32% 7.06 sec -4.90%
AR 0.510893 4S 0P 0T 0G 2235.67 sec -10.45% 52.51 sec 1.15% 6.36 sec -0.95%
AR 0.942199 8S 0P 0T 0G 1332.55 sec -10.91% 46.04 sec 1.14% 4.38 sec 3.10%
Total 25264.35 sec -8.71% 508.32 sec -0.82% 66.43 sec 14.69%
cuda30_v0.2 app, CUDA 3 libs, no jit
Columns: workunit, real (% vs baseline), user (% vs baseline), sys (% vs baseline)
AR 0.230341 0S 1P 0T 18G 5282.94 sec -6.58% 72.91 sec 0.80% 11.91 sec -1.71%
AR 0.265939 2S 0P 0T 5G 4322.73 sec -7.48% 67.48 sec 1.60% 10.58 sec 0.19%
AR 0.309898 29S 1P 0T 0G 1023.14 sec -9.29% 41.73 sec 0.50% 3.19 sec 3.63%
AR 0.386085 3S 1P 0T 1G 3008.15 sec -9.15% 58.85 sec -21.54% 7.9 sec 58.44%
AR 0.396978 6S 1P 2T 0G 2922.16 sec -9.23% 57.98 sec 1.33% 7.8 sec 3.47%
AR 0.409679 2S 0P 0T 0G 2678.51 sec -9.97% 56.49 sec 1.69% 7.52 sec 1.18%
AR 0.437709 1S 0P 1T 0G 2458.38 sec -10.14% 55.24 sec 1.09% 6.86 sec -1.93%
AR 0.510893 4S 0P 0T 0G 2235.82 sec -10.46% 52.67 sec 0.85% 6.3 sec 0.00%
AR 0.942199 8S 0P 0T 0G 1332.55 sec -10.91% 43.17 sec 7.30% 7.38 sec -63.27%
Total 25264.38 sec -8.72% 506.52 sec -0.46% 69.44 sec 10.83%
All results, in any permutation, were strongly similar.
So the fastest combination was the cuda 2.2 app with cuda 2.3 libs and no jit flag, at least on this cuda 2.x graphics card.
I guess the combination of a cuda 3 app with cuda 3 libs will show its real face only on a Fermi-based graphics card.
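The ranking behind that conclusion can be reproduced from the Total rows of the tables. A small sketch follows: the real-time totals are copied from the tables above, while the dictionary name and the helper `rank_by_real_time` are invented for this example.

```python
# "real" totals in seconds, copied from the Total rows of the tables.
TOTAL_REAL_SEC = {
    "cuda 2.2 app, cuda 2.3 libs, no jit": 23239.09,
    "cuda 2.2 app, cuda 2.3 libs, jit": 23316.17,
    "cuda 2.2 app, cuda 3 libs, no jit": 23817.47,
    "cuda 3 app, cuda 3 libs, no jit": 25262.94,
    "cuda 3 vlarkill app, cuda 3 libs, no jit": 25260.61,
    "cuda30 app, cuda 3 libs, no jit": 25264.35,
    "cuda30_v0.2 app, cuda 3 libs, no jit": 25264.38,
}


def rank_by_real_time(totals):
    """Return (name, seconds) pairs sorted from fastest to slowest."""
    return sorted(totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, seconds in rank_by_real_time(TOTAL_REAL_SEC):
        print(f"{seconds:10.2f} sec  {name}")
```

The four cuda 3 apps land within a few seconds of each other over nearly seven hours of total run time, so on this card the choice between them makes no practical difference.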