Seti@Home optimized science apps and information

Optimized Seti@Home apps => Discussion Forum => Topic started by: Pepi on 15 Sep 2011, 04:32:31 pm

Title: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 04:32:31 pm: Since there are coders here who understand coding and CUDA very well, I'd like to ask whether any of you could spend a little time looking at the source code of this app and optimizing it for better performance.
For now, CUDA LLR is on average 2-3 times faster than one core of an average CPU (3 GHz) on my GTX 560 Ti. Since you guys have sped up the optimized S@H app many times over, I assume you could achieve a similar speedup in this app.
So for now I will wait to see whether anyone answers this post :)
Thanks for reading.
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 04:38:27 pm: PrimeGrid ? Is that the project you're referring to ? Another user/developer (Heinz) has expressed interest in that one. If he's interested in looking into optimisation for that I'll be happy to give him some hints / guidance.

Jason
Title: Re: CUDA for prime number search
Post by: arkayn on 15 Sep 2011, 04:41:56 pm: Collatz would be nice as well, maybe figure out what needs to be added to bring it into compliance with the BOINC API.

I just wish I understood coding.
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 04:55:40 pm: Quote from: arkayn on 15 Sep 2011, 04:41:56 pm
I just wish I understood coding.

Doesn't stop the rest of us :D What level are the mental barriers at ?
Title: Re: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 04:56:49 pm: Quote from: Jason G on 15 Sep 2011, 04:38:27 pm
PrimeGrid ? Is that the project you're referring to ? Another user/developer (Heinz) has expressed interest in that one. If he's interested in looking into optimisation for that I'll be happy to give him some hints / guidance.

Jason

Yes, PrimeGrid is that project. Since the source code is very small, I think you will find any "problems" really fast. The only "problem" is that the app is 64 bit :(
You need to have cufft64_32_16.dll and cudart64_32_16.dll in the same directory as llrcuda.exe (attached as llrcuda.rar).
The source code is also attached, in the second archive.
An example command is: llrcuda.exe -d -q"46157*2^698207+1"

Thanks for any help Jason!
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 05:00:45 pm: Alright, I'll find out from Heinz what the score is. With Seti almost completely Broken there are other things on my plate that take precedence, but Heinz should really be able to manage at least boincApi updates + some Cuda optimisation.
Title: Re: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 05:01:49 pm: Quote from: arkayn on 15 Sep 2011, 04:41:56 pm
Collatz would be nice as well, maybe figure out what needs to be added to bring it into compliance with the BOINC API.

I just wish I understood coding.

Yes, Collatz needs a CUDA 4 app, but it looks like CUDA 3.1, CUDA 3.2 and CUDA 4 will run at approximately the same speed...
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 05:04:55 pm: Quote from: Pepi on 15 Sep 2011, 05:01:49 pm
Yes, Collatz needs a CUDA 4 app, but it looks like CUDA 3.1, CUDA 3.2 and CUDA 4 will run at approximately the same speed...

Cuda 4.0 is immature, so has some issues, so I've been sticking with Cuda 3.2 until Cuda 4.1 comes out. x64 isn't a problem for any of us, only a question of whether it's worth bothering with apps that don't execute on CPU... 64 bit Cuda is slightly slower due to cost of larger pointers.

@Heinz: As you already PM'd me about PG before, consider looking at PrimeGrid & Collatz & I will point you to the right directions.
Title: Re: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 05:07:14 pm: So can you build the app as 32 bit? Any 32-bit app can run on a 64-bit system, and it is easy to get the 32-bit versions of the CUDA DLLs. And if you say that 64 bit is slower, just converting to 32 bit will increase speed :)
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 05:13:12 pm: Probably. Depends if they actually used enough (V)RAM to justify[or require] a 64 bit address space. If they did that would be silly, since RAM is slower than computation these days, but anything is possible.

I won't be rushing into anything right now (as mentioned), but certainly since Heinz is interested in PG, and I know Heinz has the right tools etc, then probably it'll get looked at.

Jason
Title: Re: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 05:14:23 pm: Thanks :)
Title: Re: CUDA for prime number search
Post by: aaronhaviland on 15 Sep 2011, 06:44:37 pm: I'm pretty sure llrcuda is 64-bit due to the sheer size of the numbers they are dealing with. Even then, the app cannot deal with mersenne primes of the size that would qualify for the next EFF Cooperative Computing Award (100,000,000 digits).

Also, I believe that given a relatively recent CPU and GPU, it is still much slower on the GPU. It is *very*heavily* bound in cufft.

It's been more than a few months since I've looked that way. I had been meaning to look more closely at llrcuda before I dropped off the face of the planet a couple months ago...

The main developer for llrcuda is "msft" on mersenneforum.org. Someone else is doing the boinc-ification...
Title: Re: CUDA for prime number search
Post by: Pepi on 15 Sep 2011, 07:17:30 pm: I know that a GPU cannot be faster on every project that BOINC runs. Besides, the GPU is even now faster than the CPU on some prime searches, but compared to other BOINC projects the advantage is minimal. When you take the power consumption of GPU and CPU into account, the CPU is still the better and cheaper way to find primes (and it works on all prime projects).

As for 32 or 64 bit: the CPU also has to deal with big numbers, but all the applications are 32 bit; even 64-bit hosts get a 32-bit app. So it looks like a 32-bit CUDA app would be OK?
Title: Re: CUDA for prime number search
Post by: Jason G on 15 Sep 2011, 07:21:37 pm: Let's see what Heinz says. He's a busy man & has multiple exploded computers right now, but I think his interest does lie there & he will happily take advice & technology from here & apply it to other projects, if it helps spread 'not-breaking-seti-ness' (technical term).
Title: Re: CUDA for prime number search
Post by: _heinz on 16 Sep 2011, 06:56:23 pm: Hi,
I run PrimeGrid and already compiled the ppsieve source with ICC some time ago.
I have not tried llrcuda yet.
A year ago I compiled a Collatz client with ICC too.
I did this to study the source and to test CUDA and ICC.
If I get some support from Jason, I can make the BOINC-part changes the way it has already been done in Seti.
I'm very busy, so I will do it as soon as I have time.

heinz
Title: Re: CUDA for prime number search
Post by: _heinz on 17 Sep 2011, 02:51:38 pm: In case someone would like to take a closer look at llr:
I have now found the llr download area (http://pgllr.mine.nu/software/LLR/)

heinz
Title: Re: CUDA for prime number search
Post by: _heinz on 17 Sep 2011, 05:54:42 pm: I was able to compile llr with VS2008 and CUDA 4.0:
1>llrcuda_win64 - 0 error(s), 336 warning(s)
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

Using cutil.h and cutil_inline.h in the llr project under CUDA 4.0 is a bit problematic: cutil is no longer part of CUDA (since 4.0) (http://blog.cuvilib.com/2011/03/09/nvidia-cuda-4-0-tips-and-issues/)
Title: Re: CUDA for prime number search
Post by: Jason G on 17 Sep 2011, 06:11:42 pm: Well done Heinz. If you plan for boinc lib updates first (to fix exit conditions) then optimisation I can give more hints as time goes on.

Jason
Title: Re: CUDA for prime number search
Post by: _heinz on 17 Sep 2011, 07:23:07 pm: I ran a short test with the original llrCUDA (not my compiled version) on an i3 with a GT540M
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"9999*2^458051+1" -d
Starting Proth prime test of 9999*2^458051+1
Using complex irrational base DWT, FFT length = 65536, a = 5

9999*2^458051+1 is prime! Time : 487.041 sec.. Time per bit: 1.060 ms.

C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d

Starting Lucas Lehmer Riesel prime test of 1000065*2^390927-1
Using real irrational base DWT, FFT length = 131072
V1 = 5 ; Computing U0...
V1 = 5 ; Computing U0...done.
Starting Lucas-Lehmer loop...
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 20000 / 390927 [5.11%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 30000 / 390927 [7.67%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 40000 / 390927 [10.23%]. Time per iteration : 1
...
...
1000065*2^390927-1, iteration : 190000 / 390927 [48.60%]. Time per iteration :
Iter: 192128/390926, ERROR: ROUND OFF (0.4675197601) > 0.4
Continuing from last save file.
Resuming LLR test of 1000065*2^390927-1 at iteration 2 [0.00%]
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 20000 / 390927 [5.11%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 30000 / 390927 [7.67%]. Time per iteration : 1.
1000065*2^390927-1, iteration : 40000 / 390927 [10.23%]. Time per iteration : 1
..
..
1000065*2^390927-1, iteration : 380000 / 390927 [97.20%]. Time per iteration :
1000065*2^390927-1, iteration : 390000 / 390927 [99.76%]. Time per iteration :
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
Starting Proth prime test of 313*2^1012240+1
Using complex irrational base DWT, FFT length = 131072, a = 3

313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"192971*2^4998058-1" -d

Starting Lucas Lehmer Riesel prime test of 192971*2^4998058-1
Using real irrational base DWT, FFT length = 1048576
V1 = 4 ; Computing U0...
V1 = 4 ; Computing U0...done.
Starting Lucas-Lehmer loop...
192971*2^4998058-1, iteration : 10000 / 4998058 [0.20%]. Time per iteration : 2
192971*2^4998058-1, iteration : 20000 / 4998058 [0.40%]. Time per iteration : 1
192971*2^4998058-1, iteration : 30000 / 4998058 [0.60%]. Time per iteration : 1
192971*2^4998058-1, iteration : 40000 / 4998058 [0.80%]. Time per iteration : 1
...
...
192971*2^4998058-1, iteration : 2500000 / 4998058 [50.01%]. Time per iteration
192971*2^4998058-1, iteration : 2510000 / 4998058 [50.21%]. Time per iteration
192971*2^4998058-1, iteration : 2520000 / 4998058 [50.41%]. Time per iteration
192971*2^4998058-1, iteration : 2530000 / 4998058 [50.61%]. Time per iteration
...
...
192971*2^4998058-1, iteration : 4970000 / 4998058 [99.43%]. Time per iteration
192971*2^4998058-1, iteration : 4980000 / 4998058 [99.63%]. Time per iteration
192971*2^4998058-1, iteration : 4990000 / 4998058 [99.83%]. Time per iteration
192971*2^4998058-1 is not prime. LLR Res64: DBBFCB63CFBA6EA2 Time : 71172.972sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"3*2^7033641+1" -d
Starting Proth prime test of 3*2^7033641+1
Using complex irrational base DWT, FFT length = 1048576, a = 5

3*2^7033641+1, bit: 90000 / 7033642 [1.27%]. Time per bit: 14.932 ms.
..
3*2^7033641+1, bit: 2590000 / 7033642 [36.82%]. Time per bit: 14.932 ms.
3*2^7033641+1, bit: 2770000 / 7033642 [39.38%]. Time per bit: 14.931 ms.
3*2^7033641+1, bit: 4700000 / 7033642 [66.82%]. Time per bit: 14.932 ms. (20 hours)
...
3*2^7033641+1 is not prime. Proth RES64: 4DDC768A04467D4E Time : 105090.700 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Uhh, one has been running for about 19 hours;
the next one seems to be a long runner too, the pre-calculation says ~30 hours... I will see it through to the end.
Remark: GPU temperature rose from 70 to 79 °C.
Finished now; it was a longer test...
I will rerun the first two tasks to see the differences.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Will make a modified batchfile for speed-testing variants
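
For readers following the log above, the Lucas-Lehmer-Riesel loop can be sketched at toy scale in Python. This is a minimal illustration and not llrCUDA's code: it uses plain big-integer arithmetic instead of an FFT/DWT, it only covers multipliers k that are odd and not divisible by 3 (the case where the starting parameter V1 = 4 works, as in the 192971*2^4998058-1 run), and the function names are made up for this example.

```python
def lucas_v(index, p, modulus):
    """Return V_index(p, 1) mod modulus, with V_0 = 2, V_1 = p, V_{m+1} = p*V_m - V_{m-1}."""
    low, high = 2 % modulus, p % modulus  # the pair (V_0, V_1)
    for bit in bin(index)[2:]:
        if bit == "1":
            # (V_m, V_{m+1}) -> (V_{2m+1}, V_{2m+2})
            low, high = (low * high - p) % modulus, (high * high - 2) % modulus
        else:
            # (V_m, V_{m+1}) -> (V_{2m}, V_{2m+1})
            low, high = (low * low - 2) % modulus, (low * high - p) % modulus
    return low


def llr_is_prime(k, n):
    """Riesel's test for N = k*2^n - 1, restricted to odd k with 3 not dividing k."""
    if k % 2 == 0 or k % 3 == 0 or n < 3 or k >= (1 << n):
        raise ValueError("this sketch needs odd k, 3 not dividing k, n >= 3 and k < 2^n")
    N = k * (1 << n) - 1
    u = lucas_v(k, 4, N)       # "Computing U0" in the log, here with V1 = 4
    for _ in range(n - 2):     # the "Lucas-Lehmer loop"
        u = (u * u - 2) % N
    return u == 0


if __name__ == "__main__":
    print(llr_is_prime(5, 4))   # N = 79
    print(llr_is_prime(1, 11))  # N = 2047 = 23 * 89
```

When k is divisible by 3 (as with 1000065), V1 = 4 is not valid and a different starting value has to be chosen, which is presumably why that run reports V1 = 5. The sketch simply rejects such k instead of guessing at the selection rule.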
Title: Re: CUDA for prime number search
Post by: _heinz on 18 Sep 2011, 10:22:45 am: Quote from: Jason G on 17 Sep 2011, 06:11:42 pm
Well done Heinz. If you plan for boinc lib updates first (to fix exit conditions) then optimisation I can give more hints as time goes on.

Jason
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?
Title: Re: CUDA for prime number search
Post by: Jason G on 18 Sep 2011, 10:28:28 am: Quote from: _heinz on 18 Sep 2011, 10:22:45 am
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?

First step, you'll need to look at building some updated Boinc libs from the Boinc trunk, then making them fit into the application, which will require some app updates of files referenced in Boinc to be compatible, and to use the newer GPU related features.
Title: Re: CUDA for prime number search
Post by: aaronhaviland on 18 Sep 2011, 10:30:18 am: Quote from: _heinz on 17 Sep 2011, 05:54:42 pm
Using cutil.h and cutil_inline.h in the llr project under CUDA 4.0 is a bit problematic: cutil is no longer part of CUDA (since 4.0) (http://blog.cuvilib.com/2011/03/09/nvidia-cuda-4-0-tips-and-issues/)

It doesn't use CUTIL for much, anyway. It doesn't take much to remove this dependency. (So far as I've seen, most projects that use CUTIL only use cutilSafeCall() and/or cufftSafeCall()).
Title: Re: CUDA for prime number search
Post by: _heinz on 18 Sep 2011, 11:28:38 am: Quote from: Jason G on 18 Sep 2011, 10:28:28 am
Quote from: _heinz on 18 Sep 2011, 10:22:45 am
Hi Jason
It's a good idea to make the boinc lib updates now...
some hints, links ?

First step, you'll need to look at building some updated Boinc libs from the Boinc trunk, then making them fit into the application, which will require some app updates of files referenced in Boinc to be compatible, and to use the newer GPU related features.

C:\I\SC\pg\Ken-g6-PSieve-CUDA-a17a696_heinz\boinc
At revision: 24231
One or more files are in a conflicted state.
Title: Re: CUDA for prime number search
Post by: aaronhaviland on 18 Sep 2011, 09:58:41 pm: Quote from: _heinz on 17 Sep 2011, 07:23:07 pm
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.

I'm concerned, you're getting the wrong results here: "1000065*2^390927-1" should be prime! "313*2^1012240+1" should return 5FA128A9BECBCDD3.
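
Some background on the `ERROR: ROUND OFF (0.4675197601) > 0.4` line that the first run of 1000065*2^390927-1 logged before finishing. Treat this as a hedged sketch and not a description of llrCUDA's internals: FFT-based multiplication works in floating point, so every output coefficient should land very close to an integer, and the distance to the nearest integer is a cheap sanity check on the result. The Python below squares a number with a small hand-written FFT and reports that worst-case distance. The names `fft` and `fft_square` and the 8-bit digit size are choices made for this illustration only.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT (unscaled); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for i in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * i / n) * odd[i]
        out[i] = even[i] + twiddle
        out[i + n // 2] = even[i] - twiddle
    return out


def fft_square(x, digit_bits=8):
    """Square a non-negative integer via FFT convolution.

    Returns (result, max_roundoff), where max_roundoff is the largest distance
    between a floating-point coefficient and the integer it was rounded to.
    """
    base = 1 << digit_bits
    digits, remaining = [], x
    while remaining:
        digits.append(remaining % base)
        remaining //= base
    size = 1
    while size < 2 * len(digits):   # room for the full, non-cyclic product
        size *= 2
    padded = digits + [0] * (size - len(digits))
    spectrum = fft(padded)
    conv = fft([s * s for s in spectrum], invert=True)
    result, max_roundoff = 0, 0.0
    for i, c in enumerate(conv):
        value = c.real / size       # the inverse transform needs the 1/size scaling
        nearest = round(value)
        max_roundoff = max(max_roundoff, abs(value - nearest))
        result += nearest << (digit_bits * i)
    return result, max_roundoff


if __name__ == "__main__":
    x = 3 ** 200
    square, err = fft_square(x)
    print(square == x * x, err)
```

With small inputs the reported error is tiny. It grows with transform length and with the number of bits packed into each digit, and once it gets close to 0.5 the rounding step can pick the wrong integer, so a result produced after such a warning deserves a second run, as was done here.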
Title: Re: CUDA for prime number search
Post by: _heinz on 19 Sep 2011, 02:42:14 am: Quote from: aaronhaviland on 18 Sep 2011, 09:58:41 pm
Quote from: _heinz on 17 Sep 2011, 07:23:07 pm
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d
1000065*2^390927-1 is not prime. LLR Res64: 5704E082C8671874 Time : 721.315 sec.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
313*2^1012240+1 is not prime. Proth RES64: A3FC31A0497414EE Time : 1949.425 sec.

I'm concerned, you're getting the wrong results here: "1000065*2^390927-1" should be prime! "313*2^1012240+1" should return 5FA128A9BECBCDD3.
Although I ran the original downloaded llrCUDA.exe, I will rerun those two (once the whole test ends) to see if I get your result.
If not we have a problem there.
heinz
Title: Re: CUDA for prime number search
Post by: _heinz on 20 Sep 2011, 03:51:54 am: Rerun of those two, this time with the right results

C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"1000065*2^390927-1" -d

Starting Lucas Lehmer Riesel prime test of 1000065*2^390927-1
Using real irrational base DWT, FFT length = 131072
V1 = 5 ; Computing U0...
V1 = 5 ; Computing U0...done.
Starting Lucas-Lehmer loop...
1000065*2^390927-1, iteration : 10000 / 390927 [2.55%]. Time per iteration : 1.
...
1000065*2^390927-1, iteration : 390000 / 390927 [99.76%]. Time per iteration :
1000065*2^390927-1 is prime! Time : 701.721 sec.

C:\I\llrCUDA.0.60-win64\llrCUDA.0.60-win64>llrCUDA.exe -q"313*2^1012240+1" -d
Starting Proth prime test of 313*2^1012240+1
Using complex irrational base DWT, FFT length = 131072, a = 3

313*2^1012240+1 is not prime. Proth RES64: 5FA128A9BECBCDD3 Time : 1894.073 sec.
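
The `a = 3` and `a = 5` values printed by the Proth runs are consistent with the usual way of running Proth's test: pick a small base a whose Jacobi symbol against N = k*2^n+1 is -1, then check that a^((N-1)/2) is congruent to -1 mod N. A minimal Python sketch of that idea follows. It assumes odd k with k < 2^n, uses Python's built-in big integers instead of a DWT, and I have not checked that llrCUDA selects its base in exactly this way; the function names are hypothetical.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd positive n."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                      # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def proth_base(k, n):
    """Smallest a >= 2 with (a/N) = -1, or None if some a shares a factor with N."""
    N = k * (1 << n) + 1
    a = 2
    while True:
        symbol = jacobi(a, N)
        if symbol == -1:
            return a
        if symbol == 0:                  # gcd(a, N) > 1 with a < N: N is composite
            return None
        a += 1


def proth_is_prime(k, n):
    """Proth's test for N = k*2^n + 1 with k odd and k < 2^n."""
    if k % 2 == 0 or k >= (1 << n):
        raise ValueError("this sketch needs odd k with k < 2^n")
    N = k * (1 << n) + 1
    a = proth_base(k, n)
    if a is None:
        return False
    return pow(a, (N - 1) // 2, N) == N - 1


if __name__ == "__main__":
    print(proth_base(313, 1012240), proth_base(3, 7033641))
    print(proth_is_prime(3, 2))          # N = 13
```

Choosing the base only needs a few cheap reductions of N modulo small numbers, so `proth_base` is quick even for the million-bit candidates in this thread. The modular exponentiation is the expensive part, and that is the work the GPU's FFT is doing in the runs above.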
Title: Re: CUDA for prime number search
Post by: Pepi on 20 Sep 2011, 12:39:36 pm: Is this some new build or "old" one? :)
Title: Re: CUDA for prime number search
Post by: Jason G on 20 Sep 2011, 12:45:17 pm: Quote from: Pepi on 20 Sep 2011, 12:39:36 pm
Is this some new build or "old" one? :)

Likely work in progress, so not even alpha yet Pepi ;)
Title: Re: CUDA for prime number search
Post by: Pepi on 20 Sep 2011, 12:54:47 pm: Alpha or Beta, everything shows progress :)
Hip hip hooray :)
It was a small step for me, but a big step for my GPU :)
Title: Re: CUDA for prime number search
Post by: aaronhaviland on 10 Oct 2011, 07:09:05 pm: On a slightly related note, I've been working with CUDALucas a bit recently, as the current devs of it over at mersenneforum.org had completely broken it as far as Linux support.
It is only for testing mersenne primes, and limited by memory to only being able to test primes up to around 2^290000000-1... which would currently take about 245 days on a GTX460 (an exponent, which if my calculations are correct, would take about 19 years on a 2GHz single core CPU. The next Mersenne to win an EFF Cooperative Computing Award would be around 2^336000000-1)

My fork (only builds on 64-bit Linux currently): https://github.com/ah42/CUDALucas
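
To put the exponents above in terms of decimal digits: 2^p - 1 has floor(p*log10(2)) + 1 digits, so the smallest exponent that reaches 100,000,000 digits is about 332.2 million, in the same ballpark as the figure quoted for the next EFF award. A small Python sketch of that arithmetic (the function names are mine, and double-precision logarithms are assumed to be accurate enough at this scale):

```python
import math

LOG10_2 = math.log10(2)


def mersenne_digits(p):
    """Number of decimal digits of 2^p - 1."""
    return math.floor(p * LOG10_2) + 1


def min_exponent_for_digits(digits):
    """Smallest p (not necessarily prime) such that 2^p - 1 has at least `digits` digits."""
    return math.ceil((digits - 1) / LOG10_2)


if __name__ == "__main__":
    print(mersenne_digits(290_000_000))          # size of the largest testable number
    print(min_exponent_for_digits(100_000_000))  # threshold for the 100M-digit award
```

The second call gives 332,192,807. The exponent of an actual candidate also has to be prime, so the first eligible exponent is somewhat above that threshold.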
Title: Re: CUDA for prime number search
Post by: ML1 on 10 Oct 2011, 08:48:59 pm: Quote from: aaronhaviland on 10 Oct 2011, 07:09:05 pm
... It is only for testing mersenne primes, and limited by memory to only being able to test primes up to around 2^290000000-1... which would currently take about 245 days on a GTX460 (an exponent, which if my calculations are correct, would take about 19 years on a 2GHz single core CPU. The next Mersenne to win an EFF Cooperative Computing Award would be around 2^336000000-1) ...
Ouch... Too late in the night for counting all those exponent digits... It is moving somewhat when you need exponents to sensibly express the exponents!

Quite a nice speedup there for the GPU to CPU comparison ;D

Aside: I hit a RAC of over 200k on a test run with Boinc-GIMPS on my GTS450.

Happy fast crunchin',
Martin