Seti@Home optimized science apps and information
Optimized Seti@Home apps => Discussion Forum => Topic started by: Pepi on 15 Sep 2011, 04:32:31 pm
-
Since there are coders here who understand coding and CUDA very well, I'd like to ask whether any of you could spend a little of your time looking at the source code of this app and optimizing it for better performance.
For now CUDA LLR is on average 2-3 times faster than one core of an average CPU (3 GHz) on my GTX 560 Ti. Since you guys have increased the speed of the optimized S@H app many times over, I assume you could get a similar speed increase in this app as well.
So for now I will wait and see whether anyone even answers this post :)
Thanks for reading.
-
PrimeGrid ? Is that the project you're referring to ? Another user/developer (Heinz) has expressed interest in that one. If he's interested in looking into optimisation for that I'll be happy to give him some hints / guidance.
Jason
-
Collatz would be nice as well, maybe figure out what needs to be added to bring into compliance with the BOINC api.
I just wish I understood coding.
-
I just wish I understood coding.
Doesn't stop the rest of us :D What level are the mental barriers at ?
-
PrimeGrid ? Is that the project you're referring to ? Another user/developer (Heinz) has expressed interest in that one. If he's interested in looking into optimisation for that I'll be happy to give him some hints / guidance.
Jason
Yes, PrimeGrid is that project. Since the source code is very small, I think you will find any "problems" really fast. The only "problem" is that the app is 64 bit :(
You need to have cufft64_32_16.dll and cudart64_32_16.dll in the same directory as llrcuda.exe (attached as llrcuda.rar).
The source code is also attached, in the second archive.
An example command is: llrcuda.exe -d -q"46157*2^698207+1"
Thanks for any help Jason!
-
Alright, I'll find out from Heinz what the score is. With Seti almost completely Broken there are other things on my plate that take precedence, but Heinz should really be able to manage at least boincApi updates + some Cuda optimisation.
-
Collatz would be nice as well, maybe figure out what needs to be added to bring into compliance with the BOINC api.
I just wish I understood coding.
Yes, Collatz needs a CUDA 4 app, but it looks like CUDA 3.1, CUDA 3.2 and CUDA 4 will all be approximately the same speed...
-
Yes, Collatz needs a CUDA 4 app, but it looks like CUDA 3.1, CUDA 3.2 and CUDA 4 will all be approximately the same speed...
Cuda 4.0 is immature, so has some issues, so I've been sticking with Cuda 3.2 until Cuda 4.1 comes out. x64 isn't a problem for any of us, only a question of whether it's worth bothering with apps that don't execute on CPU... 64 bit Cuda is slightly slower due to cost of larger pointers.
@Heinz: As you already PM'd me about PG before, consider looking at PrimeGrid & Collatz & I will point you to the right directions.
-
So can you bring the app down to 32 bit? Any 32-bit app can be executed on a 64-bit system, and it is easy to get the 32-bit version of the CUDA DLLs. And if you say that 64 bit is slower, just converting to 32 bit will increase speed :)
-
Probably. Depends if they actually used enough (V)RAM to justify[or require] a 64 bit address space. If they did that would be silly, since RAM is slower than computation these days, but anything is possible.
I won't be rushing into anything right now (as mentioned), but certainly since Heinz is interested in PG, and I know Heinz has the right tools etc, then probably it'll get looked at.
Jason
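A rough back-of-the-envelope check of the address-space question, as a minimal sketch only. It assumes the work data is held as double-precision arrays of the FFT length that llrCUDA reports in its console output (1048576 is the largest length shown in this thread); the count of 8 work buffers is a hypothetical allowance, not something taken from the llrCUDA source.

```python
# Rough device-memory estimate for FFT work buffers (illustrative only).
BYTES_PER_DOUBLE = 8          # IEEE 754 double precision
ADDRESS_SPACE_32BIT = 2**32   # 4 GiB limit of a 32-bit address space


def buffer_bytes(fft_length, buffers=1):
    """Bytes needed for `buffers` arrays of `fft_length` doubles."""
    return fft_length * BYTES_PER_DOUBLE * buffers


# FFT length taken from the llrCUDA output posted in this thread.
fft_length = 1048576
# Hypothetical allowance of 8 work buffers; the real count is not known here.
total = buffer_bytes(fft_length, buffers=8)

print(total // 2**20, "MiB needed,", ADDRESS_SPACE_32BIT // 2**20, "MiB addressable")
```

Under these assumptions the working set is far below what a 32-bit address space can hold, which is consistent with the guess that a 32-bit build should be feasible.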
-
Thanks :)
-
I'm pretty sure llrcuda is 64-bit due to the sheer size of the numbers they are dealing with. Even then, the app cannot deal with mersenne primes of the size that would qualify for the next EFF Cooperative Computing Award (100,000,000 digits).
Also, I believe that given a relatively recent CPU and GPU, it is still much slower on the GPU. It is *very*heavily* bound in cufft.
It's been more than a few months since I've looked that way. I had been meaning to look more closely at llrcuda before I dropped off the face of the planet a couple months ago...
The main developer for llrcuda is "msft" on mersenneforum.org. Someone else is doing the boinc-ification...
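To put the 100,000,000-digit figure into terms of an exponent, here is a small sketch using the standard digit-count formula for 2^p - 1. The function names are made up for illustration, and the result is only the smallest exponent by size; an actual Mersenne candidate additionally needs a prime exponent.

```python
import math


def mersenne_digits(p):
    """Number of decimal digits of 2**p - 1 (valid for p >= 1)."""
    # 2**p is never a power of 10, so 2**p - 1 has as many digits as 2**p.
    return math.floor(p * math.log10(2)) + 1


def smallest_exponent(digits):
    """Smallest p such that 2**p - 1 has at least `digits` decimal digits."""
    p = math.ceil((digits - 1) / math.log10(2))
    # Guard against floating-point rounding right at the boundary.
    while mersenne_digits(p) < digits:
        p += 1
    while p > 1 and mersenne_digits(p - 1) >= digits:
        p -= 1
    return p


print(smallest_exponent(100_000_000))
```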
-
I know that the GPU cannot be faster on every project that BOINC runs. Even now the GPU is faster than the CPU on some prime searches, but compared to other BOINC projects the advantage is minimal. Once you take the power consumption of GPU and CPU into account, the CPU is still the better and cheaper way to find primes (and it works on all prime projects).
As for 32 or 64 bit: the CPU also has to deal with big numbers, but all the applications are 32 bit; even a 64-bit host gets a 32-bit app. So it looks like a 32-bit CUDA app would be OK?
-
Let's see what Heinz says. He's a busy man & has multiple exploded computers right now, but I think his interest does lie there & he will happily take advice & technology from here & apply it to other projects, if it helps spread 'not-breaking-seti-ness' (technical term).
-
Hi,
I run PrimeGrid and already compiled the ppsieve source with ICC some time ago.
I have not tried llrcuda so far.
A year ago I compiled a Collatz client with ICC too.
I did this to study the source and to test CUDA and ICC.
If I get some support from Jason I can do the BOINC-part changes, as has already been done in SETI.
I'm very busy, so I will do it as soon as I have time.
heinz
-
In case someone would like to have a closer look at llr:
I have now found the llr download area (http://pgllr.mine.nu/software/LLR/)
heinz
-
I could compile llr with VS2008 and CUDA 4.0:
1>llrcuda_win64 - 0 error(s), 336 warning(s)
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========
The use of cutil.h and cutil_inline.h in the llr project is a bit problematic under CUDA 4.0, since cutil is no longer part of CUDA (since 4.0) (http://blog.cuvilib.com/2011/03/09/nvidia-cuda-4-0-tips-and-issues/)
-
Well done Heinz. If you plan for boinc lib updates first (to fix exit conditions) then optimisation I can give more hints as time goes on.
Jason
-
I ran a short test with the original llrCUDA (not my compiled version) on an i3 with a GT540M
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"9999*2^458051+1" -d
Starting Proth prime test of 9999*2^458051+1
Using complex irrational base DWT, FFT length = 65536, a = 5
9999*2^458051+1 is prime! Time : 487.041 sec.. Time per bit: 1.060 ms.
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
Starting Lucas Lehmer Riesel prime test of 1000065*2^390927-1
Using real irrational base DWT, FFT length = 131072
V1 = 5 ; Computing U0...
V1 = 5 ; Computing U0...done.
Starting Lucas-Lehmer loop...
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 20000 / 390927 [5.11%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 30000 / 390927 [7.67%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 40000 / 390927 [10.23%]. Time per iteration : 1
...
...
1000065*2^390927-1, iteration : 190000 / 390927 [48.60%]. Time per iteration :
Iter: 192128/390926, ERROR: ROUND OFF (0.4675197601) > 0.4
Continuing from last save file.
Resuming LLR test of 1000065*2^390927-1 at iteration 2 [0.00%]
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 20000 / 390927 [5.11%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 30000 / 390927 [7.67%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 40000 / 390927 [10.23%]. Time per iteration : 1
..
..
1000065*2^390927-1, iteration : 380000 / 390927 [97.20%]. Time per iteration :
1000065*2^390927-1, iteration : 390000 / 390927 [99.76%]. Time per iteration :
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
Starting Proth prime test of 313*2^1012240+1
Using complex irrational base DWT, FFT length = 131072, a = 3
313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"192971*2^4998058-1" -d
Starting Lucas Lehmer Riesel prime test of 192971*2^4998058-1
Using real irrational base DWT, FFT length = 1048576
V1 = 4 ; Computing U0...
V1 = 4 ; Computing U0...done.
Starting Lucas-Lehmer loop...
192971*2^4998058-1, iteration : 10000 / 4998058 [0.20%]. Time per iteration : 2
192971*2^4998058-1, iteration : 20000 / 4998058 [0.40%]. Time per iteration : 1
192971*2^4998058-1, iteration : 30000 / 4998058 [0.60%]. Time per iteration : 1
192971*2^4998058-1, iteration : 40000 / 4998058 [0.80%]. Time per iteration : 1
...
...
192971*2^4998058-1, iteration : 2500000 / 4998058 [50.01%]. Time per iteration
192971*2^4998058-1, iteration : 2510000 / 4998058 [50.21%]. Time per iteration
192971*2^4998058-1, iteration : 2520000 / 4998058 [50.41%]. Time per iteration
192971*2^4998058-1, iteration : 2530000 / 4998058 [50.61%]. Time per iteration
...
...
192971*2^4998058-1, iteration : 4970000 / 4998058 [99.43%]. Time per iteration
192971*2^4998058-1, iteration : 4980000 / 4998058 [99.63%]. Time per iteration
192971*2^4998058-1, iteration : 4990000 / 4998058 [99.83%]. Time per iteration
192971*2^4998058-1 is not prime. LLR Res64: DBBFCB63CFBA6EA2 Time : 71172.972sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"3*2^7033641+1" -d
Starting Proth prime test of 3*2^7033641+1
Using complex irrational base DWT, FFT length = 1048576, a = 5
3*2^7033641+1, bit: 90000 / 7033642 [1.27%]. Time per bit: 14.932 ms.
..
3*2^7033641+1, bit: 2590000 / 7033642 [36.82%]. Time per bit: 14.932 ms.
3*2^7033641+1, bit: 2770000 / 7033642 [39.38%]. Time per bit: 14.931 ms.
3*2^7033641+1, bit: 4700000 / 7033642 [66.82%]. Time per bit: 14.932 ms. (20 hours)
...
3*2^7033641+1 is not prime. Proth RES64: 4DDC768A04467D4E Time : 105090.700 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uhh, one has been running for about 19 hours,
the next one seems to be a long runner too, the pre-calculation says ~30 hours... I will see it through to the end.
Remark: GPU temperature increased from 70 to 79 °C
Done now, it was a longer test...
I will rerun the first two tasks to see whether there are differences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I will make a modified batch file for speed-testing variants
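The ~30 hour pre-calculation can be reproduced from the figures in the log above: the Proth test of 3*2^7033641+1 reports 7033642 bits at a steady 14.932 ms per bit. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def estimate_seconds(bits, ms_per_bit):
    """Projected run time in seconds for `bits` steps at `ms_per_bit` each."""
    return bits * ms_per_bit / 1000.0


# Values copied from the llrCUDA output for 3*2^7033641+1.
bits = 7033642
ms_per_bit = 14.932

est = estimate_seconds(bits, ms_per_bit)
print(f"{est:.0f} s = {est / 3600:.1f} h")
```

The projection lands within about a minute of the 105090.700 sec. the run actually reported, so the per-bit time is a reliable way to predict how long a test will take.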
-
Well done Heinz. If you plan for boinc lib updates first (to fix exit conditions) then optimisation I can give more hints as time goes on.
Jason
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?
-
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?
First step, you'll need to look at building some updated Boinc libs from the Boinc trunk, then making them fit into the application, which will require some app updates to the files referenced in Boinc so that they are compatible and can use the newer GPU-related features.
-
The use of cutil.h and cutil_inline.h in the llr project is a bit problematic under CUDA 4.0, since cutil is no longer part of CUDA (since 4.0) (http://blog.cuvilib.com/2011/03/09/nvidia-cuda-4-0-tips-and-issues/)
It doesn't use CUTIL for much, anyway. It doesn't take much to remove this dependency. (So far as I've seen, most projects that use CUTIL only use cutilSafeCall() and/or cufftSafeCall()).
-
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?
First step, you'll need to look at building some updated Boinc libs from the Boinc trunk, then making them fit into the application, which will require some app updates to the files referenced in Boinc so that they are compatible and can use the newer GPU-related features.
C:\I\SC\pg\Ken-g6-PSieve-CUDA-a17a696_heinz\boinc
At revision: 24231
One or more files are in a conflicted state.
-
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.
I'm concerned, you're getting the wrong results here: "1000065*2^390927-1" should be prime! "313*2^1012240+1" should return 5FA128A9BECBCDD3.
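A check like this can be automated when rerunning tests. Below is a minimal sketch that parses the final result line in the format llrCUDA prints in this thread and compares it with the reference values quoted above. The regular expression is inferred only from the output lines posted here, so it is an assumption about the format rather than a specification of it, and the helper names are hypothetical.

```python
import re

# Pattern inferred from the result lines posted in this thread (an assumption).
RESULT_RE = re.compile(
    r"^(?P<number>\S+) is (?P<verdict>prime!|not prime\.)"
    r"(?:\s+\w+ R[Ee][Ss]64: (?P<res64>[0-9A-F]{16}))?"
)


def parse_result(line):
    """Return (number, is_prime, res64 or None) from a final result line."""
    m = RESULT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised result line: {line!r}")
    return m.group("number"), m.group("verdict") == "prime!", m.group("res64")


# Reference values as quoted in this thread.
EXPECTED = {
    "1000065*2^390927-1": (True, None),
    "313*2^1012240+1": (False, "5FA128A9BECBCDD3"),
}


def matches_reference(line):
    """True if the parsed result agrees with the reference for that number."""
    number, is_prime, res64 = parse_result(line)
    return EXPECTED[number] == (is_prime, res64)


first_run = "313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec."
print(matches_reference(first_run))
```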
-
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.
I'm concerned, you're getting the wrong results here: "1000065*2^390927-1" should be prime! "313*2^1012240+1" should return 5FA128A9BECBCDD3.
I ran the original downloaded llrCUDA.exe, though. I will rerun those two (once the whole test has finished) to see if I get your result.
If not, we have a problem there.
heinz
-
Rerun of those two, this time with right results
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
Starting Lucas Lehmer Riesel prime test of 1000065*2^390927-1
Using real irrational base DWT, FFT length = 131072
V1 = 5 ; Computing U0...
V1 = 5 ; Computing U0...done.
Starting Lucas-Lehmer loop...
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
...
1000065*2^390927-1, iteration : 390000 / 390927 [99.76%]. Time per iteration :
1000065*2^390927-1 is prime! Time : 701.721 sec.
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
Starting Proth prime test of 313*2^1012240+1
Using complex irrational base DWT, FFT length = 131072, a = 3
313*2^1012240+1 is not prime. Proth RES64: 5FA128A9BECBCDD3 Time : 1894.073 sec.
-
Is this some new build or "old" one? :)
-
Is this some new build or "old" one? :)
Likely work in progress, so not even alpha yet Pepi ;)
-
Alpha or Beta, everything shows progress :)
Hip hip huray :)
It was small step for me, but big step for my GPU :)
-
On a slightly related note, I've been working with CUDALucas a bit recently, as the current devs of it over at mersenneforum.org had completely broken it as far as Linux support goes.
It is only for testing Mersenne primes, and is limited by memory to testing numbers up to around 2^290000000-1, which would currently take about 245 days on a GTX460 (an exponent which, if my calculations are correct, would take about 19 years on a 2GHz single core CPU). The next Mersenne to win an EFF Cooperative Computing Award would be around 2^336000000-1.
My fork (only builds on 64-bit Linux currently): https://github.com/ah42/CUDALucas
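For a sense of scale, the figures above (245 days on the GTX460, about 19 years on the CPU, exponent around 290,000,000) can be turned into a speedup and a per-iteration time. This is only a sketch of the arithmetic: it relies on a Lucas-Lehmer test of 2^p - 1 needing p - 2 squarings, approximated here as p, and the function names are made up.

```python
DAYS_PER_YEAR = 365.25
SECONDS_PER_DAY = 86_400


def speedup(cpu_years, gpu_days):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_years * DAYS_PER_YEAR / gpu_days


def ms_per_iteration(total_days, exponent):
    """Average milliseconds per squaring, taking roughly `exponent` iterations."""
    return total_days * SECONDS_PER_DAY * 1000 / exponent


# Figures as stated in the post above.
print(f"speedup: about {speedup(19, 245):.0f}x")
print(f"GPU time per iteration: about {ms_per_iteration(245, 290_000_000):.0f} ms")
```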
-
... It is only for testing mersenne primes, and limited by memory to only being able to test primes up to around 2^290000000-1... which would currently take about 245 days on a GTX460 (an exponent, which if my calculations are correct, would take about 19 years on a 2GHz single core CPU. The next Mersenne to win an EFF Cooperative Computing Award would be around 2^336000000-1) ...
Ouch... Too late in the night for counting all those exponent digits... It is moving somewhat when you need exponents to sensibly express the exponents!
Quite a nice speedup there for the GPU to CPU comparison ;D
Aside: I hit a RAC of over 200k on a test run with Boinc-GIMPS on my GTS450.
Happy fast crunchin',
Martin