Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Devaster on 19 Dec 2007, 01:38:23 pm

Title: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 19 Dec 2007, 01:38:23 pm
Now I am thinking about how best to parallelize the pulse find. In the standard code the pulses are calculated serially, by calling the function from the main analysis loop.

What happens if I do something like this: I take the loop that finds pulses for a given FFT size and run it as NumPoints/fftlen threads?

I think this would be a nice parallelization. But there is one extreme - for FFT sizes bigger than 4096 the number of parallel threads drops from 256 down to 8, so there may be a performance bottleneck, or GPU utilization would be very low ...
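Something like this as a very rough sketch (only an assumption of how it could look - the kernel and the names d_power and NumPoints are placeholders, not the real client code, and the power data is assumed to be on the GPU already):

Code: [Select]
// rough sketch only - names (d_power, NumPoints, find_pulse_kernel) are placeholders,
// not the real client code; assumes the power data is already in GPU memory
__global__ void find_pulse_kernel(const float *power, int fftlen, int num_chunks)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per fft chunk
    if (chunk >= num_chunks) return;                     // guard for the partial last block
    const float *my_chunk = power + chunk * fftlen;
    // ... fold and test this chunk for pulses, write the best candidate to global memory ...
}

// host side: NumPoints / fftlen chunks, e.g. 1M / 4096 = 256 parallel threads
// int num_chunks = NumPoints / fftlen;
// int threads = 64, blocks = (num_chunks + threads - 1) / threads;
// find_pulse_kernel<<<blocks, threads>>>(d_power, fftlen, num_chunks);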

I must test this in the next days ... see ya!
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 19 Dec 2007, 02:17:40 pm
About the pulse find - I think it would be better to write all the kernels manually instead of having them generated automatically; loop unrolling could be used there too - better performance ....
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: popandbob on 20 Dec 2007, 12:25:34 am
To follow up on my last question..

Once all is programmed in CUDA will the CPU usage still be 100%? I know that Folding@home's ATI GPU client is... but I believe that's due to them not using CUDA...

~BoB
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 20 Dec 2007, 07:59:49 am
I don't know. There will still be some parts that will run on the CPU ....
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: popandbob on 20 Dec 2007, 11:11:22 pm
Thanks for the reply Devaster. I do hope CPU usage won't be at 100%, because then at least we would have something that Folding@home doesn't... a GPU app that can run alongside CPU apps (i.e. you don't have to reserve a core for the GPU app).

~BoB
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 21 Dec 2007, 11:53:33 am
But from my observations that 100% CPU usage is only an "empty loop" - waiting for the driver to respond. At home, when I run some pure GPU code from the CUDA SDK, I haven't seen any significant slowdown ....
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: abachler on 22 Dec 2007, 02:56:56 am
You are probably better off processing at least part of the WU on the CPU, so that it stays busy while the GPU is processing the rest.   As for the FFT taking so long in RM: yes, due to the nature of the FFT algorithm it is difficult to implement on a GPU without killing performance, but never fear, there is a workaround :)  Then again, since the CPU is idle, you should process a separate WU on the CPU while the GPU is processing the other.  I think ultimately the BOINC client will have to take care of recognizing when it should start multiple clients, including for the GPU, and to only use one client per GPU.
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: roisen.dubh on 30 Dec 2007, 01:48:49 am
From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPU chirp the data and then send it to the CPU for the FFTs?

Or I could be completely mistaken
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Jason G on 30 Dec 2007, 01:55:54 am
From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPU chirp the data and then send it to the CPU for the FFTs?
From vague memory, when I did some profiling on my P4s [may or may not be relevant to GPU processing, I don't know], from most intensive to slightly less intensive:
    Pulse folding/finding, sheer moving of data about the place, then chirping, then iFFTs & FFTs, then Gaussian fitting.  Each of these varies by angle range and task content.

[Baseline Smoothing showed up somewhere too, but I don't remember how expensive that was... lower down on the list I think]

I remember at the time thinking these processing tasks each seemed to use a more even proportion of the total processing time than I would have expected. [Something like each major inner function taking around 4 to 11% of total execution time.]


Jason
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 30 Dec 2007, 10:24:31 am
Yes, I know that the pulse find is the most time-consuming operation, but I must begin with something easy - FFT, power spectrum, data chirp ....
When you take a look at the pulse find code, it is more complex than find spike, for example, and I am not yet good enough to easily convert/rewrite that code to a parallel architecture ...
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Jason G on 30 Dec 2007, 10:54:15 am
One thing about the inner loops of the pulse folding, and the chirping routines as well, is that there are a few different very well hand-vectorised versions in there. Though I don't know much about GPU programming at all, I'd imagine they'd need a similar kind of loop iteration independence, blocking etc. to take advantage of the parallelism. So it may actually help you to examine some of the SSE/SSE2 optimised/vectorised code rather than the standard C code; as a wild guess on my part, some of it might translate almost straight to GPU code. It definitely wouldn't be the fastest or most suitable form for the chip, but it may be closer to what you need than the stock code, at least in concept.
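As a loose illustration of that iteration independence (simplified maths and placeholder names, not the actual client code), a chirp-style inner loop maps almost one-to-one onto a one-thread-per-sample CUDA kernel:

Code: [Select]
// illustration only: the chirp multiply has no dependence between iterations,
// so a vectorised CPU loop and a one-thread-per-sample kernel share the same shape
__global__ void chirp_kernel(const float2 *in, float2 *out,
                             float chirp_rate, float recip_sample_rate, int n)
{
    const float PI = 3.14159265f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t   = i * recip_sample_rate;
    float ang = PI * chirp_rate * t * t;     // quadratic phase of the (de)chirp
    float s, c;
    sincosf(ang, &s, &c);
    out[i].x = in[i].x * c - in[i].y * s;    // complex multiply by e^(j*ang)
    out[i].y = in[i].x * s + in[i].y * c;
}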

Let us know if you need help with understanding some of the SSE2 code and/or intrinsics used etc...

Just a thought.

Jason
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 31 Dec 2007, 11:14:48 am
Now I am working on the find spike code. The way I have done it is this:
In the original SETI code all steps are called sequentially for every fft chunk in the main analysis loop. I call the analysis functions for all chunks at one time ...
Original SETI code:
Code: [Select]
// main analysis - top analysis loop over chirp/fft pairs
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    ...
    // inner loop over fft chunks
    for (ifft = 0; ifft < NumFfts; ifft++) {
        fft calc;
        find spike;
        ...
    }
}
With CUDA and its thread model I can create as many threads as there are fft chunks and run them on the GPU - this eliminates the inner loop over fft chunks .... so the code looks like:
Code: [Select]
// main analysis - top analysis loop over chirp/fft pairs
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    ...
    fft calc;    // for all chunks at one time
    find spike;  // for all chunks at one time
    ...
}
Imagine it as having a CPU that can run 128k find spikes at one time and return only the best spike, reporting it as a result if it is bigger than the spike threshold ....
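A minimal sketch of that mapping (placeholder names, not the real client code): the grid is NumFfts blocks of fftlen threads, so every chunk's power spectrum comes out of a single launch instead of the inner loop, and the spike search then scans the power array per chunk:

Code: [Select]
// rough sketch, placeholder names: gridDim.x = NumFfts (one block per fft chunk),
// blockDim.x = fftlen (one thread per bin); for fft lengths larger than the block
// size limit each thread would have to handle several bins instead
__global__ void power_spectrum_kernel(const float2 *fft_out, float *power)
{
    int chunk = blockIdx.x;                  // which fft chunk
    int bin   = threadIdx.x;                 // which bin inside that chunk
    int i     = chunk * blockDim.x + bin;    // flat index into the batched FFT output
    float2 v  = fft_out[i];
    power[i]  = v.x * v.x + v.y * v.y;       // spike search then scans power[] per chunk
}

// host side: power_spectrum_kernel<<<NumFfts, fftlen>>>(d_fft, d_power);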
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Josef W. Segur on 31 Dec 2007, 03:41:02 pm
As long as the logic can report the same first 30 spikes for an overflow, that seems excellent.
                                                          Joe
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Vyper on 02 Jan 2008, 06:06:53 am
I remember in the good old assembly days when you could program the CPU to do other things while the other hardware was running; when the hardware was done it generated an interrupt, so the code would jump to a specific place, fetch what the hardware had just produced (or start the next part), and then go back to what it had been doing before?!

I wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could find pulse run ahead of the FFT and vice versa?

Kind Regards Vyper
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Josef W. Segur on 02 Jan 2008, 04:33:41 pm
...
I wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could find pulse run ahead of the FFT and vice versa?

The basic code is of course linear because it is written for a single worker thread, but there are high level loops which could be modified to distribute to other processors. Here's a quick overview of processing:

1. Read data from the WU, convert to floating point, baseline smooth. This is done once per startup and produces an 8 MiB array.

2. Dechirp the above array into another same size array. This is done at a lot of incremental chirp rates from zero through +/- 100 Hz/sec. It loops between 37193 and 108194 times.

3. Do FFTs on the dechirped array to produce narrower frequency bands for analysis. The original array has 9765.625 Hz bandwidth; we analyze at bandwidths ranging from 1220.7 Hz down to 0.0745 Hz. At zero chirp all 15 FFT lengths are used, while at quite a few other chirps only one FFT length is used, so this would be an awkward place to try to parallelize on that basis. However, each FFT length is used multiple times; for instance length 8 is used 128K times and those can be done in parallel (see the batched FFT sketch after this overview).

4. Convert the FFT output to PowerSpectrum data and analyze for Spikes, Gaussians, Triplets, and Pulses. If the telescope moved more than one beam width during recording of the work, for Triplets and Pulses the data is divided into chunks with just one beam width worth of data.

Basically the data has to be organized before it can be analyzed, but there are opportunities to split the processing into parallel paths.
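As a hedged illustration of step 3 using the CUFFT library (not the actual client code), a single batched plan can run all 128K of the length-8 transforms of the dechirped 1M-sample array in one call:

Code: [Select]
#include <cufft.h>

// sketch: one batched CUFFT plan does every length-8 transform of the
// dechirped array (1M complex samples -> 128K transforms) in a single call
void fft_all_chunks(cufftComplex *d_data, cufftComplex *d_out)
{
    cufftHandle plan;
    int fftlen = 8;
    int batch  = (1024 * 1024) / fftlen;            // 128K transforms
    cufftPlan1d(&plan, fftlen, CUFFT_C2C, batch);   // one plan covers the whole batch
    cufftExecC2C(plan, d_data, d_out, CUFFT_FORWARD);
    cufftDestroy(plan);
}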
                                                          Joe
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: tfp on 03 Jan 2008, 01:28:49 pm
Just a quick question: is there a reason why the data is converted to FP first and then all of the work is done?
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: guido.man on 03 Jan 2008, 06:17:29 pm
That is a good question. Is there a real need to convert to floating point, or could all calculations be done in binary and converted at the end if need be?
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Gecko_R7 on 07 Jan 2008, 01:12:50 am
Mimo & Jason,
Interesting threads from '06 at GPGPU.org pertaining to FFTs.
I actually followed them at the time and just came across them again while cleaning out old links.
Maybe you've read them already, but if not:

http://www.gpgpu.org/forums/viewtopic.php?t=2284

http://www.gpgpu.org/forums/viewtopic.php?t=2021

Cheers!
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Jason G on 07 Jan 2008, 01:51:09 am
LOL, thanks Gecko. I like some of the confusion about precision and representation in the first one, as the discussion mirrors many of my own questions when I first started exploring (and there are still many things for me to work out yet).

Jason

[ Side note, attempted clarification: As far as I've been able to understand, we are mostly dealing with 32-bit single floats paired in a complex representation, derived from 1-bit samples in the native telescope data. That format relates to the telescope hardware arrangement for minimising noise, maximising sensitivity, and saving recording capacity (something like a dipole antenna arrangement with advantageous geometric characteristics). I'd imagine the 1-bit pair representation, used for storage, is missing elements needed for the signal processing that are implicit in those geometric relationships, meaning the data needs to be expanded (decompressed) as the first step, using the known relationships.

In my limited understanding, having the full complex representation available extends the effective Nyquist bandwidth limit of the system from n/2 (for real-only samples) to +/- n/2 (for a complex signal), effectively improving the overall sensitivity of the search while making better use of the original telescope hardware, and having other processing implications for several stages in the system. That hopefully answers "Why use 32-bit complex pairs (totalling 64 bits) instead of 64-bit double floats, which would occupy the same storage space?"

I gather those aspects of the specific telescope setup have been refined over decades to improve the sensitivity etc. ]
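A purely illustrative sketch of that expansion step (the bit packing shown is an assumption for illustration only; the real recorder/splitter format may differ), unpacking each 1-bit real / 1-bit imaginary pair to a +/-1.0 complex float:

Code: [Select]
// purely illustrative - the bit packing shown is an assumption, not the real data format;
// each recorded sample is taken as a 1-bit real / 1-bit imaginary pair expanded to +/-1.0f
__global__ void unpack_samples(const unsigned char *packed, float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int byte  = i / 4;                          // 4 complex samples per byte (2 bits each)
    int shift = (i % 4) * 2;
    int bits  = (packed[byte] >> shift) & 0x3;
    out[i].x  = (bits & 0x1) ? 1.0f : -1.0f;    // real part from bit 0
    out[i].y  = (bits & 0x2) ? 1.0f : -1.0f;    // imaginary part from bit 1
}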

Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 22 Jan 2008, 03:19:49 am
Hi

After running the latest SETI code in the GPU profiler, it shows some interesting things:

1. The most time-consuming thing (47%) is not the FFT, but find spike at 128k size. Why?

2. Increasing the core speed from 400 MHz to 580 MHz has an enormous effect on performance (>900 sec versus 700 sec on an 8500GT with 2 multiprocessors), but increasing the memory clock from 450 to 550 MHz does nothing for performance. Why?



1. After better analysis I have found the reason: the find spike code is massively divergent - the GPU cannot use any branch prediction and MUST run every direction of the divergent code (the whole if-then-else construction; the result from the untaken direction is discarded). CPUs can predict the direction of the code and prefetch the needed instructions/data, and so can avoid running unnecessary code. GPUs, due to their strictly parallel architecture, cannot skip parts of the code the way CPUs do without a massive impact on performance - all threads in a warp (the lowest hardware thread unit, 32 threads) must execute the same code, and if only one thread in the warp gives a relevant, needed result, the other 31 threads are only ballast - a massive performance hit.
   The way to solve this is to use reduction operations: after each compare I reduce the count of active threads to half ... I have used this operation in find best spike - the classic find spike at 128k is called about 120 times and takes 47%, but the reductive find spike (with sizes varying from 8 to 128k) is called about 4000 times and takes only 2% of the time spent on the GPU. This method also has better read/write coherency - reads are done from different memory banks and shared memory is better utilized.
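A sketch of that halving reduction (placeholder names, fixed 256-thread blocks, not the actual client kernel): every compare step cuts the active thread count in half instead of letting one long serial scan's if/else dominate each warp:

Code: [Select]
// sketch of the reductive best-spike search, assuming 256 threads per block;
// placeholder names, not the actual client kernel
__global__ void best_spike_reduce(const float *power, float *best, int *best_bin, int len)
{
    __shared__ float s_val[256];
    __shared__ int   s_idx[256];
    int tid = threadIdx.x;
    const float *chunk = power + blockIdx.x * len;   // one block per fft chunk

    // each thread first scans its strided share of the chunk serially
    float v = 0.0f; int vi = 0;
    for (int i = tid; i < len; i += blockDim.x)
        if (chunk[i] > v) { v = chunk[i]; vi = i; }
    s_val[tid] = v; s_idx[tid] = vi;
    __syncthreads();

    // tree reduction: the number of active threads halves at every step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && s_val[tid + s] > s_val[tid]) {
            s_val[tid] = s_val[tid + s];
            s_idx[tid] = s_idx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) { best[blockIdx.x] = s_val[0]; best_bin[blockIdx.x] = s_idx[0]; }
}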


2. The SETI code is compute bound, not memory bound - so a higher core speed and more multiprocessors give better performance ....
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: roisen.dubh on 27 Jan 2008, 09:47:47 pm
What temps are you guys seeing while running the app, and once a fully working app is created, how hard do you predict it will be to get the code running on NVidia's next series of GPUs?
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: popandbob on 28 Jan 2008, 01:01:55 am
What temps are you guys seeing while running the app, and once a fully working app is created, how hard do you predict it will be to get the code running on NVidia's next series of GPUs?

Temps are around normal for a 3D game... (~70 deg. C)
As long as they support CUDA, which they do, it will work.

~BoB
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: riha on 18 Feb 2008, 10:56:31 am
Sorry if this is not the correct thread to post in, but what happened to this thread: http://lunatics.kwsn.net/windows/gpu-crunching-question.180.html

I found it and downloaded the software, which seems to work except for the final -9 error. I posted an answer in that thread too, but then discovered that nothing has happened in it for half a year.

Is this the replacement thread, or is there another replacement thread? What about the software, is it still maintained?
Title: Re: Some thinking and theoretic discussion about seti client on GPU
Post by: Devaster on 12 Mar 2008, 09:57:06 am
After experimenting with the PoT population I have found this: the PoT population is FASTER on the CPU than on the GPU - for the GPU the data size and compute intensity are too small ... so I am moving on to the Gaussian and pulse functions ....