Fighting the CUDA bug

efmer (fred):
On a number of computers, I get CUDA WUs in the waiting state. That in itself is not a problem, but because some of them are kept in memory, it becomes a serious one.
It causes CUDA to go into fallback mode, or even freezes XP completely.

I've come up with 2 solutions.

1) Automatically restart the system when the GPU temperature goes below a set value. This often works.
2) Automatically restart the system when more than a set number of GPU exe instances are running. This can only work when your card can hold an extra CUDA task in memory without crashing the system. Mine can hold 7 of them, so I set this value to > 4, which allows 2 extra in memory.
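
The two restart rules above can be sketched as a small decision function. This is only an illustration, not the code of the actual tool: the names `WatchdogConfig`, `count_gpu_tasks` and `restart_reason`, the executable name `cuda_app.exe`, and the 50 C threshold are all made up for the example. Reading the temperature and the process list is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WatchdogConfig:
    # Both values are hypothetical; pick ones that suit your own card.
    min_gpu_temp_c: float = 50.0  # rule 1: restart below this temperature
    max_gpu_tasks: int = 4        # rule 2: restart above this task count


def count_gpu_tasks(process_names: Iterable[str], gpu_exe: str) -> int:
    """Count running processes whose name matches the GPU executable."""
    target = gpu_exe.lower()
    return sum(1 for name in process_names if name.lower() == target)


def restart_reason(gpu_temp_c: float, gpu_task_count: int,
                   cfg: WatchdogConfig) -> Optional[str]:
    """Return why a restart is wanted, or None when everything looks fine."""
    if gpu_temp_c < cfg.min_gpu_temp_c:
        return "GPU temperature below threshold"
    if gpu_task_count > cfg.max_gpu_tasks:
        return "too many GPU tasks in memory"
    return None


if __name__ == "__main__":
    cfg = WatchdogConfig()
    running = ["boinc.exe", "cuda_app.exe", "cuda_app.exe"]
    tasks = count_gpu_tasks(running, "cuda_app.exe")
    print(restart_reason(62.0, tasks, cfg))
```

Keeping the decision separate from the code that reads the sensors makes it easy to try different thresholds without touching the rest.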


I have 6.6.36 installed but I think this is a problem in earlier versions as well.

If anyone has the same problems and wants to do some testing:
http://efmer.eu/boinc/ This version has 1) implemented.
2) is in beta testing. If anyone wants to test it, let me know.

Raistmer:
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?

efmer (fred):

--- Quote from: Raistmer on 30 Jun 2009, 06:32:53 am ---Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?


--- End quote ---
It may, but better safe than sorry...
I think the programs keep on running, sort of, or have crashed, without BOINC knowing. I haven't tested that, as it mostly happens when I'm not around.
And does it reallocate all the GPU memory in this case? I hope they fix this problem, but it has been around for so long, and I believe they think they fixed it. Something like: the rules are OK, so this can't happen.

Raistmer:
It's BOINC's problem. It should never leave GPU apps in memory...

Try restarting only BOINC (as an experiment). Even if the CUDA MB app is still running, it should exit after ~30 seconds with zero status (OK) and a "no heartbeat" message. Then it can be restarted from its checkpoint.
If BOINC is restarted sooner (and it should be), no additional apps will be launched (more exactly, they will exit immediately with a "can't acquire lock" message).
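
To make the two mechanisms above concrete, here is a rough sketch of a heartbeat timeout and a per-slot lock. It is not BOINC's real code or API: `HeartbeatMonitor`, `SlotLock` and the way time is passed in are invented for the example, and only the ~30 second timeout comes from the description above.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # roughly the timeout mentioned above


class HeartbeatMonitor:
    """Tracks the last heartbeat the app received from the client."""

    def __init__(self, now: float, timeout_s: float = HEARTBEAT_TIMEOUT_S):
        self.last_beat = now
        self.timeout_s = timeout_s

    def beat(self, now: float) -> None:
        # Called each time a heartbeat arrives.
        self.last_beat = now

    def expired(self, now: float) -> bool:
        # True once no heartbeat has arrived for longer than the timeout;
        # at that point the app would exit cleanly (status 0).
        return now - self.last_beat > self.timeout_s


class SlotLock:
    """At most one app instance may own a slot at a time."""

    def __init__(self) -> None:
        self.owner = None

    def acquire(self, pid: int) -> bool:
        # A second instance fails here and should exit immediately.
        if self.owner is not None and self.owner != pid:
            return False
        self.owner = pid
        return True

    def release(self, pid: int) -> None:
        if self.owner == pid:
            self.owner = None


if __name__ == "__main__":
    hb = HeartbeatMonitor(now=0.0)
    print(hb.expired(now=10.0), hb.expired(now=31.0))
    lock = SlotLock()
    print(lock.acquire(100), lock.acquire(200))
```

In this picture, an orphaned app gives up on its own once the heartbeat expires, and an app started while the old one still holds the slot backs off instead of running twice.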

An OS reboot is too wasteful, so many CPU cycles lost... ;) (A BOINC restart is too, but it is still better than a task crash or CPU fallback, of course.)

popandbob:
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application hasn't reached a checkpoint, it will be left in memory regardless of what settings have been set.
Bob
