Author Topic: Some thinking and theoretic discussion about seti client on GPU (Read 26217 times)

Devaster · « **on:** 19 Dec 2007, 01:38:23 pm »

Now I am thinking about how best to parallelize the pulse find. In the standard code, pulses are calculated serially, by calling a function in the main analysis loop.

What happens if I do something like this: I'll take the loop that finds pulses for a given FFT size and run its iterations in NumPoints/fftlen threads.

I think this would be a nice parallelization. But there is one extreme case: as the FFT size grows from 4096 upward, the number of parallel threads drops from 256 to 8. Maybe there will be a performance bottleneck, or GPU utilization will be very low ...
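
A rough sketch of that arithmetic, in case it helps to see the numbers. It assumes NumPoints is 1,048,576 (1M) points, which is consistent with 4096 giving 256 threads and 128K giving 8; the function and macro names are made up for illustration and are not from the SETI source.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed workunit size: 1M points (an assumption, not taken from the code). */
#define NUM_POINTS (1024L * 1024L)

/* Number of independent FFT chunks, i.e. candidate parallel threads,
   for one FFT length. fft_len must divide NUM_POINTS evenly. */
long chunks_for_fft_len(long fft_len)
{
    assert(fft_len > 0 && NUM_POINTS % fft_len == 0);
    return NUM_POINTS / fft_len;
}

/* Print the chunk count for every power-of-two FFT length from 8 to 128K. */
void print_chunk_table(void)
{
    for (long fft_len = 8; fft_len <= 131072; fft_len *= 2)
        printf("fft_len %6ld -> %6ld chunks\n",
               fft_len, chunks_for_fft_len(fft_len));
}
```

With so few chunks at the long FFT lengths, one thread per chunk would probably leave most of the GPU idle, so those lengths may need a different split.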

i must test this on next day ... see ya!

Devaster · « **Reply #1 on:** 19 Dec 2007, 02:17:40 pm »

About pulse find: I think it would be better to write all the kernels manually rather than have them generated automatically. Loop unrolling could be used too, for better performance ....
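
For anyone unfamiliar with the term, here is a minimal sketch of manual loop unrolling in plain C. It is a generic summation, not code from the pulse finder, and the name `sum_unrolled4` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of n floats with the loop body unrolled four times by hand.
   n must be a multiple of 4 to keep the sketch short. */
float sum_unrolled4(const float *x, size_t n)
{
    assert(x != NULL && n % 4 == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        /* Four independent accumulators: fewer loop tests and branches,
           and no dependency between the four additions. */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that the summation order differs from a simple loop, so floating-point results can differ slightly in the last bits.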

popandbob · « **Reply #2 on:** 20 Dec 2007, 12:25:34 am »

To follow up on my last question..

Once all is programmed in CUDA will the CPU usage still be 100%? I know that Folding@home's ATI GPU client is... but I believe that's due to them not using CUDA...

~BoB

Devaster · « **Reply #3 on:** 20 Dec 2007, 07:59:49 am »

I don't know. There will still be some parts that run on the CPU ....

popandbob · « **Reply #4 on:** 20 Dec 2007, 11:11:22 pm »

Thanks for the reply, Devaster. I do hope CPU usage won't be at 100%, because then at least we would have something that Folding@home doesn't: a GPU app that can run alongside CPU apps (i.e. you don't have to reserve a core for the GPU app).

~BoB

Devaster · « **Reply #5 on:** 21 Dec 2007, 11:53:33 am »

But from my observations, that 100% CPU usage is only an "empty loop" waiting for the driver to respond. At home, when I run some pure GPU code from the CUDA SDK, I haven't seen any significant slowdown ....

abachler · « **Reply #6 on:** 22 Dec 2007, 02:56:56 am »

You are probably better off processing at least part of the WU on the CPU, so that it stays busy while the GPU is processing the rest. As for the FFT taking so long in RM: yes, due to the nature of the FFT algorithm, it is difficult to implement it on a GPU without killing performance, but never fear, there is a workaround.

Then again, since the CPU is idle, you should process a separate WU on the CPU while the GPU is processing the other. I think ultimately the BOINC client will have to take care of recognizing when it should start multiple clients, including one for the GPU, and of using only one client per GPU.

roisen.dubh · « **Reply #7 on:** 30 Dec 2007, 01:48:49 am »

From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPUs chirp the data and then send it to the CPU for the FFTs?

Or I could be completely mistaken.

Jason G · « **Reply #8 on:** 30 Dec 2007, 01:55:54 am »

Quote from: roisen.dubh on 30 Dec 2007, 01:48:49 am

From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPUs chirp the data and then send it to the CPU for the FFTs?

From vague memory of some profiling I did on my P4s [may or may not be relevant to GPU processing, I don't know], from most intensive to slightly less intensive:
Pulse Folding/Finding, sheer moving of data about the place, then Chirping, then iFFTs & FFTs, then Gauss fitting. Each of these varies by angle range and task content.

[Baseline Smoothing showed up somewhere too, but I don't remember how expensive that was... lower down on the list, I think]

I remember thinking at the time that these processing tasks each used a more even proportion of the total processing time than I would have expected. [Something like each major inner function taking around 4 to 11% of total execution time]

Jason

Devaster · « **Reply #9 on:** 30 Dec 2007, 10:24:31 am »

Yes, I know that pulse find is the most time-consuming operation, but I must begin with something easy: FFT, power spectrum, data chirp ....
When you take a look at the pulse find code, it is more complex than find spike, for example, and for now I am not good enough to easily convert/rewrite the code for a parallel architecture ...

Jason G · « **Reply #10 on:** 30 Dec 2007, 10:54:15 am »

One thing about the inner loops of the pulse folding, and of the chirping routines too, is that there are a few different, very well hand-vectorised versions in there. Though I don't know much about GPU programming at all, I'd imagine GPU code would need a similar kind of loop-iteration independence, blocking, etc. to take advantage of the parallelism. So it may actually help you to examine some of the SSE/SSE2 optimised/vectorised code rather than the standard C code. As a wild guess on my part, some of it may be possible to translate almost straight to GPU code. Though that would definitely not be the fastest or most suitable form for the chip, it may be closer to what you need than the stock code, at least in concept.

Let us know if you need help with understanding some of the SSE2 code and/or intrinsics used etc...

Just a thought.

Jason

Devaster · « **Reply #11 on:** 31 Dec 2007, 11:14:48 am »

Now I am working on the find-spikes code. The approach I have used is this:
In the original SETI code, all steps are called sequentially for every FFT chunk in the main analysis loop. I call the analysis functions for all chunks at once ...
Original SETI code:

Code:

// main analysis: top-level loop
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    // ...
    // inner loop over the FFT chunks
    for (ifft = 0; ifft < NumFfts; ifft++) {
        // FFT calculation for this chunk
        // find spikes in this chunk
        // ...
    }
}

With CUDA and its thread model, I can create as many threads as there are FFT chunks and run them on the GPU. This eliminates the inner loop over the FFT chunks, so the code looks like:

Code:

// main analysis: top-level loop
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    // ...
    // FFT calculation: all chunks at once
    // find spikes: all chunks at once
    // ...
}

Imagine it as a CPU that can run 128K find-spike calls at once and return only the best spike, plus a result spike if it is bigger than the spike threshold ....
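
To make the idea concrete, here is a simplified sketch in plain C of the per-chunk work. It assumes the power spectrum of all chunks sits back to back in one array and treats "best spike" as simply the strongest bin; the real find-spikes code does more than this, and `spike_t`, `best_spike_in_chunk` and `find_spikes_all_chunks` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t bin;    /* index of the strongest bin within the chunk */
    float  power;  /* power in that bin */
} spike_t;

/* Strongest bin of one chunk of fft_len power values. */
spike_t best_spike_in_chunk(const float *power, size_t fft_len)
{
    assert(power != NULL && fft_len > 0);
    spike_t best = { 0, power[0] };
    for (size_t i = 1; i < fft_len; i++) {
        if (power[i] > best.power) {
            best.bin = i;
            best.power = power[i];
        }
    }
    return best;
}

/* power holds num_ffts chunks of fft_len bins each, back to back.
   No iteration touches another chunk's data, so on a GPU each
   iteration could become one thread instead of one loop pass. */
void find_spikes_all_chunks(const float *power, size_t num_ffts,
                            size_t fft_len, spike_t *best)
{
    assert(best != NULL);
    for (size_t ifft = 0; ifft < num_ffts; ifft++)
        best[ifft] = best_spike_in_chunk(power + ifft * fft_len, fft_len);
}
```

The loop in `find_spikes_all_chunks` is the part that disappears on the GPU; the per-chunk function is what each thread would run.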

Josef W. Segur · « **Reply #12 on:** 31 Dec 2007, 03:41:02 pm »

As long as the logic can report the same first 30 spikes for an overflow, that seems excellent.
Joe
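
One way to keep the reported spikes identical to the serial run is to let the parallel step only fill a per-chunk result array, and then walk that array in chunk order on the host. The sketch below assumes "first" means ascending chunk index within one pass and uses the limit of 30 from the post above; the names are hypothetical and the real client's reporting logic may differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REPORTED 30  /* overflow limit mentioned in the post above */

/* Walk the per-chunk best powers in chunk order, as the serial loop
   would, and record the indices of chunks whose best power exceeds
   threshold. Stops once MAX_REPORTED entries have been recorded.
   reported must have room for MAX_REPORTED entries.
   Returns the number of entries written. */
size_t report_in_serial_order(const float *best_power, size_t num_ffts,
                              float threshold, size_t *reported)
{
    assert(best_power != NULL && reported != NULL);
    size_t count = 0;
    for (size_t ifft = 0; ifft < num_ffts && count < MAX_REPORTED; ifft++) {
        if (best_power[ifft] > threshold)
            reported[count++] = ifft;
    }
    return count;
}
```

Because the ordering is decided after the parallel step, the order in which GPU threads finish does not affect which 30 are reported.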

Vyper · « **Reply #13 on:** 02 Jan 2008, 06:06:53 am »

I remember from the good old assembly days that you could program the CPU to do other things while the other hardware was running. When the hardware was done, it generated an interrupt, so the code would jump to a specific place, fetch what the hardware had just done or start the next part, and then go back to what it was doing before.

Wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could findpulse run ahead of the FFT and vice versa?

Kind Regards Vyper

Josef W. Segur · « **Reply #14 on:** 02 Jan 2008, 04:33:41 pm »

Quote from: Vyper on 02 Jan 2008, 06:06:53 am

...
Wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could findpulse run ahead of the FFT and vice versa?

The basic code is of course linear because it is written for a single worker thread, but there are high level loops which could be modified to distribute to other processors. Here's a quick overview of processing:

1. Read data from the WU, convert to floating point, baseline smooth. This is done once per startup and produces an 8 MiB array.

2. Dechirp the above array into another same size array. This is done at a lot of incremental chirp rates from zero through +/- 100 Hz/sec. It loops between 37193 and 108194 times.

3. Do FFTs on the dechirped array to produce narrower frequency bands for analysis. The original array has 9765.625 Hz. bandwidth, we analyze at bandwidths ranging from 1220.7 Hz. to 0.0745 Hz. At zero chirp all 15 FFT lengths are used, at quite a few other chirps only one FFT length is used, so this would be an awkward place to try to parallelize on that basis. However, each FFT length is used multiple times; for instance length 8 is used 128K times and those can be done in parallel.

4. Convert the FFT output to PowerSpectrum data and analyze for Spikes, Gaussians, Triplets, and Pulses. If the telescope moved more than one beam width during recording of the work, for Triplets and Pulses the data is divided into chunks with just one beam width worth of data.
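
The figures in step 3 can be checked with a little arithmetic. The sketch below assumes the FFT lengths are the powers of two from 8 to 131072 (128K); that range is inferred from the quoted bandwidths and the count of 15, and the function names are only illustrative.

```c
#include <assert.h>

#define SUBBAND_HZ 9765.625  /* bandwidth of the original array, from step 3 */

/* Width in Hz of one frequency bin after an FFT of fft_len points. */
double bin_bandwidth_hz(long fft_len)
{
    assert(fft_len > 0);
    return SUBBAND_HZ / (double)fft_len;
}

/* Number of power-of-two FFT lengths from min_len to max_len inclusive. */
int count_fft_lengths(long min_len, long max_len)
{
    assert(min_len > 0 && min_len <= max_len);
    int n = 0;
    for (long len = min_len; len <= max_len; len *= 2)
        n++;
    return n;
}
```

Length 8 gives 9765.625 / 8 = 1220.703125 Hz and length 131072 gives about 0.0745 Hz, matching the quoted range, and there are 15 lengths between them.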

Basically the data has to be organized before it can be analyzed, but there are opportunities to split the processing into parallel paths.
Joe
