Author Topic: Fighting the CUDA bug  (Read 14597 times)

Offline efmer (fred)

  • Alpha Tester
  • Knight o' The Round Table
  • ***
  • Posts: 147
    • efmer
Fighting the CUDA bug
« on: 29 Jun 2009, 03:44:33 am »
On a number of computers I get CUDA WUs stuck in the waiting state. That by itself is not a problem, but because some of them are kept in memory, it becomes a serious one.
It causes CUDA to go into fallback mode, or even a total freeze of XP.

I've come up with 2 solutions.

1) Automatically restart the system when the GPU temperature goes below a set value. This works a lot of the time.
2) Automatically restart the system when the number of running GPU exes exceeds a set value. This can only work when your card can hold an extra CUDA task in memory without crashing the system... Mine can hold 7 of them, so I set this value to > 4, that is, 2 extra in memory are allowed.
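
The two restart rules above can be sketched roughly as follows. This is only an illustration of the decision logic, not the actual TThrottle code: the names `RestartPolicy`, `min_gpu_temp_c`, `max_gpu_tasks` and `should_restart` are hypothetical, and reading the real GPU temperature or counting CUDA processes is left out.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    # Hypothetical settings, not taken from any real tool.
    min_gpu_temp_c: float  # rule 1: a GPU colder than this is assumed to be idle/stuck
    max_gpu_tasks: int     # rule 2: more CUDA exes than this in memory is too many


def should_restart(policy: RestartPolicy, gpu_temp_c: float, gpu_task_count: int) -> bool:
    """Return True when either rule says a restart is due."""
    too_cold = gpu_temp_c < policy.min_gpu_temp_c
    too_many = gpu_task_count > policy.max_gpu_tasks
    return too_cold or too_many
```

With `max_gpu_tasks=4`, a count of 4 is still tolerated and a count of 5 triggers the restart, which matches the "> 4" setting described in the post.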


I have 6.6.36 installed but I think this is a problem in earlier versions as well.

If anyone has the same problems and wants to do some testing:
http://efmer.eu/boinc/ This version has 1) implemented.
2) is in beta testing; if anyone wants to test it, let me know.
TThrottle Keep your temperatures controlled.
BoincTasks The best way to view BOINC

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: Fighting the CUDA bug
« Reply #1 on: 30 Jun 2009, 06:32:53 am »
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?

Offline efmer (fred)

  • Alpha Tester
  • Knight o' The Round Table
  • ***
  • Posts: 147
    • efmer
Re: Fighting the CUDA bug
« Reply #2 on: 30 Jun 2009, 06:45:15 am »
Quote from: Raistmer
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?

It may, but better to be safe...
I think the programs keep on running, sort of, or have crashed, without BOINC knowing. I haven't tested that, as it mostly happens when I'm not around.
And does it reallocate all the GPU memory in this case? I hope they fix this problem, but it has been around sooooo long, and I believe they think they fixed it. Something like "the rules are OK, so this can't happen".

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: Fighting the CUDA bug
« Reply #3 on: 30 Jun 2009, 07:02:43 am »
It's BOINC's problem. It should never leave GPU apps in memory...

Try restarting only BOINC (as an experiment). Even if the CUDA MB app is still running, it should exit after ~30 seconds with zero status (OK) and a "no heartbeat" message. Then it can be restarted from its checkpoint.
If BOINC is restarted sooner (and it should be), no additional apps will be launched (more exactly, they will exit immediately with a "can't acquire lock" message).
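
The heartbeat behaviour described above can be illustrated with a small sketch. It is not the BOINC API: `HeartbeatWatch` and its methods are hypothetical names, and the 30-second timeout is simply taken from the "~30 seconds" figure mentioned in this post.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # assumption: based on the "~30 seconds" quoted above


class HeartbeatWatch:
    """Tracks the last heartbeat the app received from the client (hypothetical helper)."""

    def __init__(self, now: float, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.last_heartbeat = now
        self.timeout_s = timeout_s

    def beat(self, now: float) -> None:
        # Called whenever a heartbeat from the client arrives.
        self.last_heartbeat = now

    def should_exit(self, now: float) -> bool:
        # True once the client has been silent for longer than the timeout;
        # the app would then exit cleanly and later resume from its checkpoint.
        return now - self.last_heartbeat > self.timeout_s
```

Timestamps are passed in explicitly so the logic can be checked without waiting in real time; a real app would feed it something like `time.monotonic()`.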

An OS reboot is too wasteful - so many CPU cycles lost... ;) (A BOINC restart is too, but it is still better than a task crash or CPU fallback, of course.)

popandbob

  • Guest
Re: Fighting the CUDA bug
« Reply #4 on: 30 Jun 2009, 03:36:01 pm »
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
Bob

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: Fighting the CUDA bug
« Reply #5 on: 30 Jun 2009, 03:37:15 pm »
Quote from: popandbob
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
Bob

CUDA MB does checkpoint. So not the case unfortunately...

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: Fighting the CUDA bug
« Reply #6 on: 30 Jun 2009, 04:26:25 pm »
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
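
The change described above can be summarised in a few lines. This is a simplified model of the behaviour as reported in this thread, not BOINC's source code: the function name and both parameters are hypothetical, and it assumes a CUDA app that has already checkpointed is removed from memory on preemption in both versions.

```python
def cuda_app_stays_in_memory(has_checkpointed: bool, new_behaviour: bool) -> bool:
    """Model of what happens to a preempted CUDA app (simplified, per this thread)."""
    if new_behaviour:
        # After the checked-in change: always cleaned out, checkpoint or not.
        return False
    # Previous/current behaviour: left in memory if it never reached a checkpoint.
    # Assumption: once it has checkpointed, it is removed.
    return not has_checkpointed
```

Under the old rule the only problematic case is `has_checkpointed=False`, which is exactly the app preempted early in its start-up.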

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: Fighting the CUDA bug
« Reply #7 on: 30 Jun 2009, 04:30:29 pm »
Quote from: Richard Haselgrove
so application developers, get your checkpointing code working early on in the development process.

Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Offline sunu

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 771
Re: Fighting the CUDA bug
« Reply #8 on: 30 Jun 2009, 04:31:46 pm »
Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

Yes, currently, if the CUDA app is preempted in the first 30 sec or so of its initialisation on the CPU, it is left in memory, no matter what settings you've got.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: Fighting the CUDA bug
« Reply #9 on: 30 Jun 2009, 04:35:43 pm »
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

That's no joke.  I had this in mind for multithreaded apps, triggered by Alex's treatment of spike finding code on Macs.  Goodbye 80% of BoincAPI if the tasks can be fast enough to not need to bother checkpointing.

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: Fighting the CUDA bug
« Reply #10 on: 30 Jun 2009, 04:38:30 pm »
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: Fighting the CUDA bug
« Reply #11 on: 30 Jun 2009, 04:45:04 pm »
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Quote from: Richard Haselgrove
Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).

Hehe, no-no-no. Since Artifical Realm closed (w/o any explanation, sadly), it's SETI-only here for now ;)
Because many users now run many BOINC projects, optimizing another project's app would help speed up SETI too, but that is too peripheral a target ;D

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: Fighting the CUDA bug
« Reply #12 on: 30 Jun 2009, 04:52:32 pm »
Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

DA has also 'checked in' another GPU related change after your question today.

Changeset 18531

Claggy

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: Fighting the CUDA bug
« Reply #13 on: 30 Jun 2009, 05:03:31 pm »

Quote from: Claggy
DA has also 'checked in' another GPU related change after your question today.

Changeset 18531

Claggy

And he's had to fix his own typos

Changeset 18533

Band-aid time, I reckon - but it's a [tacit] acknowledgement of the FIFO bug.....

Offline efmer (fred)

  • Alpha Tester
  • Knight o' The Round Table
  • ***
  • Posts: 147
    • efmer
Re: Fighting the CUDA bug
« Reply #14 on: 01 Jul 2009, 06:04:42 am »
I had about 6 reboots last night. ;D Today I caught one just in time to see something, because it happens really quickly. I exited the BOINC manager and checked "Stop running science applications".
And that did indeed close them, so they have not crashed.
Starting BOINC again and everything works again... for the time being, that is.

 
