Some thinking and theoretical discussion about the SETI client on GPU
Jason G:
One thing about the inner loops of the pulse folding, and the chirping routines as well, is that there are a few different, very well hand-vectorised versions in there. Though I don't know much about GPU programming at all, I'd imagine they'd need a similar kind of loop-iteration independence, blocking, etc. to take advantage of the parallelism. So it may actually help you to examine some of the SSE/SSE2 optimised/vectorised code rather than the standard C code; as a wild guess on my part, some of it may translate almost straight to GPU code. It definitely won't be the fastest or most suitable for the chip, but it may be closer to what you need than the stock code, at least in concept.
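As a rough illustration of that iteration-independence point (a made-up example, not seti code): a loop whose iterations don't depend on each other - the same property the hand-vectorised SSE versions exploit - maps almost directly onto a CUDA kernel with one thread per element.

--- Code: ---// Hypothetical example: power of each complex bin, computed
// independently per element - the same independence the SSE loops rely on.
__global__ void power_spectrum(const float2* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y;   // |z|^2
}
--- End code ---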
Let us know if you need help with understanding some of the SSE2 code and/or intrinsics used etc...
Just a thought.
Jason
Devaster:
Now I am working on the find-spikes code. The approach I have used is this:
In the original SETI code, all steps are called sequentially for every FFT chunk inside the main analysis loop. I call the analysis functions for all chunks at one time...
Original SETI code:
--- Code: ---// main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    ...
    for (ifft = 0; ifft < NumFfts; ifft++) {   // inner loop for fft chunks
        fft calc;
        find spike;
        ...
    }
}
--- End code ---
With CUDA and its thread model I can create as many threads as there are FFT chunks and run them on the GPU - this eliminates the inner loop over FFT chunks... so the code looks like:
--- Code: ---// main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++) {
    ...
    fft calc;    // for all chunks at one time
    find spike;  // for all chunks at one time
    ...
}
--- End code ---
Imagine it as if you had a CPU that could run 128K find-spike searches at one time, returning only the best spike from each and reporting it as a result only if it is bigger than the spike threshold...
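A minimal sketch of that idea (the names here - the power layout, SpikeResult, the launch shape - are made up for illustration, not the actual client code): one CUDA block per FFT chunk, each doing a shared-memory reduction to keep only its strongest bin.

--- Code: ---#define THREADS 256

struct SpikeResult { float power; int bin; };

// One block per FFT chunk: each block scans its fftlen bins and keeps only
// the strongest one ("find spike - for all chunks at one time").
__global__ void find_spike_per_chunk(const float* __restrict__ power,
                                     int fftlen,
                                     SpikeResult* __restrict__ best)
{
    __shared__ float smax[THREADS];
    __shared__ int   sbin[THREADS];

    const float* chunk = power + (size_t)blockIdx.x * fftlen;

    // each thread scans a strided subset of this chunk's bins
    float m  = -1.0f;
    int   mi = 0;
    for (int i = threadIdx.x; i < fftlen; i += blockDim.x) {
        float p = chunk[i];
        if (p > m) { m = p; mi = i; }
    }
    smax[threadIdx.x] = m;
    sbin[threadIdx.x] = mi;
    __syncthreads();

    // tree reduction down to the block-wide maximum
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && smax[threadIdx.x + s] > smax[threadIdx.x]) {
            smax[threadIdx.x] = smax[threadIdx.x + s];
            sbin[threadIdx.x] = sbin[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        best[blockIdx.x].power = smax[0];
        best[blockIdx.x].bin   = sbin[0];
    }
}

// launch: one block per chunk, e.g.
// find_spike_per_chunk<<<NumFfts, THREADS>>>(d_power, fftlen, d_best);
--- End code ---

The host (or a tiny follow-up kernel) would then compare each chunk's best power against the spike threshold and report only those that exceed it.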
Josef W. Segur:
As long as the logic can report the same first 30 spikes for an overflow, that seems excellent.
Joe
Vyper:
I remember in the good old assembly days, when you could program the CPU so that it could do other things while the other hardware was running; when the hardware was done it generated an interrupt, so the code would jump to a specific place, fetch what the hardware had just produced (or start the next part), and then go back to whatever it was doing before.
Wonder if the s@h code is linear?
By that I mean: does the WU need to be processed in a specific order, or could findpulse run ahead of the FFT and vice versa?
Kind Regards Vyper
Josef W. Segur:
--- Quote from: Vyper on 02 Jan 2008, 06:06:53 am ---...
Wonder if the s@h code is linear?
By that I mean: does the WU need to be processed in a specific order, or could findpulse run ahead of the FFT and vice versa?
--- End quote ---
The basic code is of course linear because it is written for a single worker thread, but there are high-level loops which could be modified to distribute work to other processors. Here's a quick overview of the processing:
1. Read data from the WU, convert it to floating point, and baseline-smooth it. This is done once per startup and produces an 8 MiB array.
2. Dechirp the above array into another array of the same size. This is done at many incremental chirp rates from zero through +/- 100 Hz/sec; it loops between 37193 and 108194 times.
3. Do FFTs on the dechirped array to produce narrower frequency bands for analysis. The original array has 9765.625 Hz bandwidth; we analyze at bandwidths ranging from 1220.7 Hz down to 0.0745 Hz. At zero chirp all 15 FFT lengths are used, but at quite a few other chirps only one FFT length is used, so this would be an awkward place to try to parallelize on that basis. However, each FFT length is used multiple times; for instance, length 8 is used 128K times, and those can be done in parallel (see the sketch after this overview).
4. Convert the FFT output to PowerSpectrum data and analyze for Spikes, Gaussians, Triplets, and Pulses. If the telescope moved more than one beam width during recording of the work, the data for Triplets and Pulses is divided into chunks each holding one beam width's worth of data.
Basically the data has to be organized before it can be analyzed, but there are opportunities to split the processing into parallel paths.
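As a rough sketch of the batching idea in step 3 (hypothetical buffer names, plan parameters chosen only for illustration): cuFFT can transform all 128K length-8 chunks of the dechirped array in one call instead of looping over them.

--- Code: ---#include <cufft.h>

// Hypothetical sketch: batch all 128K length-8 FFTs of the ~1M-sample
// dechirped array into one cuFFT call instead of looping over chunks.
void fft_all_chunks_len8(cufftComplex* d_dechirped, cufftComplex* d_output)
{
    const int fftlen  = 8;
    const int nchunks = 1024 * 1024 / fftlen;   // 128K chunks in a 1M-sample array

    cufftHandle plan;
    cufftPlan1d(&plan, fftlen, CUFFT_C2C, nchunks);    // one plan, nchunks batches

    cufftExecC2C(plan, d_dechirped, d_output, CUFFT_FORWARD);

    cufftDestroy(plan);
}
--- End code ---

The spike search can then run over the resulting power spectra with one block per chunk, as in the earlier sketch in this thread.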
Joe