Recent Driver Cuda-safe Project List


Jason G:
Hi All,

Sticky DownClocks ?

It's come to my attention that there are Cuda-enabled Boinc projects around that still cause 'sticky downclocks' on newer technology drivers, by terminating processes mid kernel execution (non-threadsafe behaviour, see tech stuff below) during snooze, exit &/or error-out scenarios.

I'll start a list here indicating what is 'safe' for Cuda 270.xx+ drivers, & what isn't:

Avoid killing Cuda or OpenCL applications via task manager unless you really need to, and expect a reboot will be needed for most newer cards (with newer drivers) if you do, even with applications that use a threadsafe strategy:

Project Name    Stock App ThreadsafeExit?           3rdParty App ThreadSafeExit?
-------------   ---------------------------------   ----------------------------
Collatz         No                                  N/A
Einstein@Home   No (reportedly being updated)       ?
GPUGrid         No (fall/Autumn update)             ?
PrimeGrid       No                                  ?
Seti@Home       No (version updates in progress)    Yes (Later Lunatics builds)

Last updated: 4th September, 2011

Technical Background: Cuda & OpenCL compute languages, and the underlying infrastructure they depend on, including drivers & hardware, have moved to heavy use of 'asynchronous execution' constructs. The Windows API TerminateProcess() function, as used in outdated BoincApi for Snooze/Exit, or when killing via task manager or similar, immediately halts execution & frees the resources. Too bad if the graphics hardware is in the middle of writing to a memory area (asynchronously), and the buffer is freed from under it... These errors tend to force drivers into a failsafe mode that only a reboot can remedy at this time. It's possible future drivers may be hardened somewhat against some forms of abuse, but it seems unlikely this particular application-induced situation can be easily prevented by means other than making the applications behave in a more threadsafe fashion.
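For developers, here is a minimal sketch of what 'a more threadsafe fashion' could look like: the app only honours a snooze/exit request between chunks of GPU work, and releases the device once nothing is in flight. Everything named here (FakeDevice, quit_requested, process, simulate_quit_after) is hypothetical and only stands in for real device calls; it is not the BOINC API or any project's actual code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a GPU. In a real CUDA app, run_chunk() would launch
// a kernel and then wait with cudaDeviceSynchronize(), and release() would call
// cudaDeviceReset() (cudaThreadExit() on Cuda 3.2).
struct FakeDevice {
    int chunks_done = 0;
    bool work_in_flight = false;
    bool released_cleanly = false;

    void run_chunk() {
        work_in_flight = true;    // asynchronous work has been queued
        ++chunks_done;
        work_in_flight = false;   // stand-in for waiting until the device is idle
    }

    void release() {
        // Records whether the device was idle at the moment it was released.
        released_cleanly = !work_in_flight;
    }
};

// Set by whatever delivers the snooze/exit request (assumed; not the BOINC API).
std::atomic<bool> quit_requested{false};

// Processes up to total_chunks, checking the quit flag only between chunks so
// the device is never released with work still in flight.
// simulate_quit_after is a test hook that raises the flag after that many chunks.
int process(FakeDevice& dev, int total_chunks, int simulate_quit_after) {
    for (int i = 0; i < total_chunks; ++i) {
        if (quit_requested.load()) break;
        dev.run_chunk();
        if (dev.chunks_done == simulate_quit_after) quit_requested.store(true);
    }
    dev.release();
    return dev.chunks_done;
}
```

The point of the design is that the exit request is only a flag: the app itself decides when it is safe to stop, rather than having its buffers freed from under the hardware by TerminateProcess().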

Jason

Update: 18th November 2011.
How to determine if your Cuda-enabled Boinc project science application might be using non-threadsafe exit code that can induce 'sticky downclocks' on any kind of snooze/completion/exit:
- Ensure a task for the project is processing 'normally'
- Suspend all inactive tasks (so new ones won't start)
- Monitor GPU clock rate in something like GPU-Z, nVidia Inspector or similar
- Repeatedly snooze then resume the running task. A threadsafe snooze shutdown should always return to full normal clocks when resumed.

Update: March 2nd 2012
Unfortunately, the message about sticky downclocks caused by non-threadsafe application exit behaviour apparently hasn't been reaching users enough. The Cuda 4.1 Toolkit release notes text file contains a fairly concise description of the issues at hand, suitable for developers, relating directly to later drivers & proper application termination.

--- Quote ---* The CUDA driver creates worker threads on all platforms, and this can cause issues at process cleanup in some multithreaded applications on all supported operating systems.
On Linux, for example, if an application spawns multiple host pthreads, calls into CUDART, and then exits all user-spawned threads with pthread_exit(), the process may never terminate. Driver threads will not automatically exit once the user's threads have gone down.
The proper solution is to either:
(1) call cudaDeviceReset()* on all used devices before termination of host threads, or,
(2) trigger process termination directly (i.e, with exit()) rather than relying on the process to die after only user-spawned threads have been individually exited.
--- End quote ---
*note that the Cuda 3.2 equivalent of cudaDeviceReset() is cudaThreadExit()

arkayn:
Collatz No N/A

Jason G:

--- Quote from: arkayn on 04 Sep 2011, 06:46:52 am ---Collatz No N/A

--- End quote ---
Thanks, Updated.

[Edit:] added GPU Grid as well.

Jason G:
Updated first post with:

--- Quote ---Update: 18th November 2011.
How to determine if your Cuda-enabled Boinc project science application might be using non-threadsafe exit code that can induce 'sticky downclocks' on any kind of snooze/completion/exit:
- Ensure a task for the project is processing 'normally'
- Suspend all inactive tasks (so new ones won't start)
- Monitor GPU clock rate in something like GPU-Z, nVidia Inspector or similar
- Repeatedly snooze then resume the running task. A threadsafe snooze shutdown should always return to full normal clocks when resumed.
--- End quote ---

Richard Haselgrove:
I asked GPUGrid specifically about this in September.

This reply 5 October:

--- Quote from: GDF ---Probably. We have to do it a bit differently, but yes.
--- End quote ---

Then again, 7 November:

--- Quote from: GDF ---We will probably postpone this new application in 2012 to focus on upgrading the server and the new AMD alpha application.
--- End quote ---

Edit - their attitude was "our app is compiled using CUDA3.1 - so nobody needs to use CUDA 4.x drivers yet". Thus, as we've seen elsewhere, they are ignoring that part of the userbase who might wish to update drivers to suit other projects, or even non-BOINC uses of their GPUs.

Edit2 - to be fair, that last comment came from skgiven, who is a moderator/tester - not from GDF, who is a developer/scientist. So it's not necessarily indicative of the developers' attitudes.
