From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPUs chirp the data and then send it to the CPU for the FFTs?
//main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++)
...
    for (ifft = 0; ifft < NumFfts; ifft++) //inner loop for fft chunks
        fft calc;
        find spike
        . . .
//main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++)
...
    fft calc; - for all chunks at one time
    find spike - for all chunks at one time
    . . .
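To make the difference between the two loop layouts above a bit more concrete, here is a rough sketch in C. It is not the actual s@h code: `fake_fft_power`, `find_spike`, `analyse_per_chunk` and `analyse_batched` are made-up names, the "FFT" is only a stand-in that squares each sample, and the outer `icfft` loop is left out. The point is just that both layouts visit the same chunks and give the same answer; the batched one needs scratch space for all chunks at once, which is the shape a GPU would probably prefer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-chunk FFT: it only squares each sample
   into a "power" value. A real client would call an FFT routine here. */
static void fake_fft_power(const float *in, float *power, size_t len) {
    for (size_t i = 0; i < len; i++) {
        power[i] = in[i] * in[i];
    }
}

/* Hypothetical spike finder: returns the largest power value in one chunk. */
static float find_spike(const float *power, size_t len) {
    float best = 0.0f;
    for (size_t i = 0; i < len; i++) {
        if (power[i] > best) {
            best = power[i];
        }
    }
    return best;
}

/* Layout 1: FFT and spike search are interleaved, one chunk at a time.
   scratch only needs room for fft_len floats. */
float analyse_per_chunk(const float *data, float *scratch,
                        size_t num_ffts, size_t fft_len) {
    assert(fft_len > 0);
    float best = 0.0f;
    for (size_t ifft = 0; ifft < num_ffts; ifft++) {
        fake_fft_power(data + ifft * fft_len, scratch, fft_len);
        float s = find_spike(scratch, fft_len);
        if (s > best) {
            best = s;
        }
    }
    return best;
}

/* Layout 2: all FFTs first, then all spike searches.
   scratch must hold num_ffts * fft_len floats. */
float analyse_batched(const float *data, float *scratch,
                      size_t num_ffts, size_t fft_len) {
    assert(fft_len > 0);
    for (size_t ifft = 0; ifft < num_ffts; ifft++) {
        fake_fft_power(data + ifft * fft_len,
                       scratch + ifft * fft_len, fft_len);
    }
    float best = 0.0f;
    for (size_t ifft = 0; ifft < num_ffts; ifft++) {
        float s = find_spike(scratch + ifft * fft_len, fft_len);
        if (s > best) {
            best = s;
        }
    }
    return best;
}
```

In the batched version each of the two inner loops is independent across chunks, so in principle each could be handed off as one big job instead of many small ones. Whether that actually helps depends on memory and transfer costs, which this sketch does not model.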
...Wonder if the s@h code is linear? By that I mean: do you need to process the WU in a specific order, or could findpulse run ahead of the fft and vice versa?