Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: efmer (fred) on 29 Jun 2009, 03:44:33 am

Title: Fighting the CUDA bug
Post by: efmer (fred) on 29 Jun 2009, 03:44:33 am
On a number of computers I get waiting CUDA WUs. That in itself is not a problem, but because some of them are kept in memory it becomes a serious one.
It causes CUDA to go into fallback mode, or even a total freeze of XP.

I've come up with 2 solutions.

1) Automatically restart the system when the GPU temperature goes below a set value. This works a lot of the time.
2) Automatically restart the system when more than a set number of GPU exes are running. This can only work when your card can hold an extra CUDA task in memory without crashing the system... Mine can hold 7 of them, so I set this value to > 4; that is, 2 extra in memory are allowed.
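
The two triggers above can be sketched as a small decision function. This is only an illustration in Python, not the code of the tool at efmer.eu. The function name, the parameter names and the 50 degree threshold are assumptions; only the limit of 4 running executables comes from the post. Reading a low temperature as "the GPU has stopped crunching" is also an interpretation.

```python
from typing import Optional


def should_restart(gpu_temp_c: float, gpu_task_count: int,
                   min_temp_c: float = 50.0, max_tasks: int = 4) -> Optional[str]:
    """Return the reason for a restart, or None if everything looks normal.

    min_temp_c is a hypothetical threshold; max_tasks=4 is the value
    mentioned in the post ("> 4").
    """
    # Trigger 1: the GPU temperature has dropped below the set value.
    if gpu_temp_c < min_temp_c:
        return "temperature"
    # Trigger 2: more GPU executables are running than the set value allows.
    if gpu_task_count > max_tasks:
        return "task count"
    return None
```

With these values, four running executables are tolerated and a fifth triggers the restart, e.g. should_restart(70.0, 5) returns "task count".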


I have 6.6.36 installed but I think this is a problem in earlier versions as well.

If anyone has the same problems and wants to do some testing:
http://efmer.eu/boinc/ This version has 1) implemented.
2) is in beta testing; if anyone wants to test it, let me know.
Title: Re: Fighting the CUDA bug
Post by: Raistmer on 30 Jun 2009, 06:32:53 am
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?
Title: Re: Fighting the CUDA bug
Post by: efmer (fred) on 30 Jun 2009, 06:45:15 am
Quote from: Raistmer
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?

It may, but better to be safe...
I think the programs keep on running, sort of, or have crashed, without BOINC knowing. I haven't tested that, as it mostly happens when I'm not around.
And does it reallocate all the GPU memory in this case? I hope they fix this problem, but it has been around sooooo long, and I believe they think they fixed it. Something like: the rules are OK, so this can't happen.
Title: Re: Fighting the CUDA bug
Post by: Raistmer on 30 Jun 2009, 07:02:43 am
It's BOINC's problem. It should never leave GPU apps in memory...

Try restarting only BOINC (as an experiment). Even if the CUDA MB app is still running, it should exit after ~30 seconds with zero status (OK) and a "no heartbeat" message. Then it can be restarted from its checkpoint.
If BOINC is restarted sooner (and it should be), no additional apps will be launched (more exactly, they will exit immediately with a "can't acquire lock" message).

An OS reboot is too wasteful - so many CPU cycles lost... ;) (A BOINC restart is too, but it is still better than a task crash or CPU fallback, of course.)
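
The two safeguards described above, an app that exits when it stops hearing from the client and a lock that keeps a second copy from starting, can be sketched roughly as follows. This is a Python illustration of the general technique, not BOINC's actual API. The class name, the function name and the lock-file approach are assumptions, and the 30-second timeout is taken from the "~30 seconds" in the post.

```python
import os

HEARTBEAT_TIMEOUT_S = 30.0  # "~30 seconds" in the post; exact value assumed


class HeartbeatWatch:
    """Tracks when the client was last heard from (hypothetical helper)."""

    def __init__(self, now: float) -> None:
        self.last_beat = now

    def beat(self, now: float) -> None:
        # Called whenever a heartbeat from the client arrives.
        self.last_beat = now

    def expired(self, now: float) -> bool:
        # True once no heartbeat has arrived for longer than the timeout;
        # a worker would then exit cleanly and resume from its checkpoint later.
        return now - self.last_beat > HEARTBEAT_TIMEOUT_S


def try_lock(path: str) -> bool:
    """Create a lock file exclusively; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance holds the lock, so exit immediately
    os.close(fd)
    return True
```

A worker loop would call expired(time.monotonic()) periodically and exit with status 0 when it returns True, while a newly launched copy would exit at once if try_lock returns False.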
Title: Re: Fighting the CUDA bug
Post by: popandbob on 30 Jun 2009, 03:36:01 pm
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
Bob
Title: Re: Fighting the CUDA bug
Post by: Raistmer on 30 Jun 2009, 03:37:15 pm
Quote from: popandbob
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
Bob

CUDA MB does checkpoint. So that is not the case, unfortunately...
Title: Re: Fighting the CUDA bug
Post by: Richard Haselgrove on 30 Jun 2009, 04:26:25 pm
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
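
The before/after behaviour described above can be written down as a toy model. This is only a reading of the post in Python, not BOINC source code; the function and parameter names are hypothetical, and it assumes that a CUDA task which has already checkpointed is removed from memory on preemption.

```python
def stays_in_memory(has_checkpointed: bool, client_has_fix: bool) -> bool:
    """Toy model: is a preempted CUDA task left in memory?"""
    if client_has_fix:
        # After the checked-in change: always cleaned out, checkpoint or not.
        return False
    # Before the change: a task preempted before its first checkpoint stays.
    return not has_checkpointed
```

Under this model the only case that leaves a task behind is stays_in_memory(False, False), a task preempted before its first checkpoint on a client without the change.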
Title: Re: Fighting the CUDA bug
Post by: Raistmer on 30 Jun 2009, 04:30:29 pm
Quote from: Richard Haselgrove
so application developers, get your checkpointing code working early on in the development process.

Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D
Title: Re: Fighting the CUDA bug
Post by: sunu on 30 Jun 2009, 04:31:46 pm
Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

Yes, currently, if the CUDA app is preempted in the first 30 sec or so of its initialisation on the CPU, it is left in memory, no matter what settings you've got.
Title: Re: Fighting the CUDA bug
Post by: Jason G on 30 Jun 2009, 04:35:43 pm
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

That's no joke.  I had this in mind for multithreaded apps, triggered by Alex's treatment of spike finding code on Macs.  Goodbye 80% of BoincAPI if the tasks can be fast enough to not need to bother checkpointing.
Title: Re: Fighting the CUDA bug
Post by: Richard Haselgrove on 30 Jun 2009, 04:38:30 pm
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).
Title: Re: Fighting the CUDA bug
Post by: Raistmer on 30 Jun 2009, 04:45:04 pm
Quote from: Richard Haselgrove
Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).

Hehe, no-no-no, as Artifical Realm closed (w/o any explanation, sadly) it is SETI-only here for now ;)
Because many users now run many BOINC projects, optimizing another project's app would help speed up SETI too, but that is too roundabout a way to target ;D
Title: Re: Fighting the CUDA bug
Post by: Claggy on 30 Jun 2009, 04:52:32 pm
Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

DA has also 'checked in' another GPU related change after your question today.

Changeset 18531 (http://boinc.berkeley.edu/trac/changeset/18531)

Claggy
Title: Re: Fighting the CUDA bug
Post by: Richard Haselgrove on 30 Jun 2009, 05:03:31 pm

Quote from: Claggy
DA has also 'checked in' another GPU related change after your question today.

Changeset 18531 (http://boinc.berkeley.edu/trac/changeset/18531)

Claggy

And he's had to fix his own typos

Changeset 18533 (http://boinc.berkeley.edu/trac/changeset/18533)

Band-aid time, I reckon - but it's a [tacit] acknowledgement of the FIFO bug.....
Title: Re: Fighting the CUDA bug
Post by: efmer (fred) on 01 Jul 2009, 06:04:42 am
I had about 6 reboots last night. ;D Today I caught one just in time to see something, because it happens really quickly. I exited BOINC Manager with "Stop running science applications" checked.
And that did indeed close them, so they had not crashed.
Starting BOINC again and everything works again... for the time being, that is.
Title: Re: Fighting the CUDA bug
Post by: kevin6912 on 06 Jul 2009, 10:20:55 am
I found that allowing a GPU task with 1 second of elapsed time (more than likely suspended in memory) to run until its progress is greater than 0 will remove the GPU task from memory when it is suspended again.

Regards,
Kevin

I created this Windows cmd file to count GPU tasks:
@echo off
rem Count the SETI@home CUDA processes that are currently running.
rem setlocal keeps the variables below from leaking into the calling shell.
setlocal
echo %date% %time%
set findpgmexe=setiathome_6.08_windows_intelx86__cuda.exe
set findpgm=%findpgmexe:.exe=%
set tasklistboinc=tasklist /fi "imagename eq %findpgmexe%"

rem Show the matching processes and keep a copy of the listing in a log file.
%tasklistboinc%
%tasklistboinc% > ".\boinc-find-%findpgm%.txt"

rem The tasklist table shows only the first 25 characters of the image name,
rem so match on that prefix. The count is also saved to a log file.
find /c "%findpgmexe:~0,25%" ".\boinc-find-%findpgm%.txt" > ".\boinc-count-%findpgm%.txt"

rem When find /c reads from a pipe it prints only the number of matching lines.
for /f %%G in ('%tasklistboinc% ^| find /c "%findpgmexe:~0,25%"') do set boincgputaskcnt=%%G
echo.
echo GPU task count %boincgputaskcnt%
endlocal
pause
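
For anyone who would rather feed the count into another program, the same idea can be sketched in Python. This is a hedged equivalent of the cmd file above, not something posted in the thread. The function names are hypothetical, and like the cmd file it assumes tasklist shows only the first 25 characters of the image name.

```python
import subprocess

IMAGE_NAME = "setiathome_6.08_windows_intelx86__cuda.exe"


def count_tasks(tasklist_output: str, image_name: str = IMAGE_NAME) -> int:
    """Count the lines of tasklist output that mention image_name."""
    # Match on the first 25 characters, as the cmd file does with find /c.
    prefix = image_name[:25]
    return sum(1 for line in tasklist_output.splitlines() if prefix in line)


def running_gpu_tasks(image_name: str = IMAGE_NAME) -> int:
    """Windows only: run tasklist with the same filter as the cmd file."""
    result = subprocess.run(
        ["tasklist", "/fi", f"imagename eq {image_name}"],
        capture_output=True, text=True, check=True)
    return count_tasks(result.stdout, image_name)
```

count_tasks works on any captured tasklist text, so it can be tried without a running BOINC client; running_gpu_tasks() needs Windows.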
Title: Re: Fighting the CUDA bug
Post by: efmer (fred) on 06 Jul 2009, 10:44:04 am
At the moment there is not enough work to check anything, but a lot of tasks that had run longer than 1 second caused the problem.
They did not report an elapsed time but were a couple of % to 20% or so done, so well beyond the 1 second.
After a restart these WUs finished OK. And deleting... with these shortages of 100+ WU is not... the best idea.
 
Title: Re: Fighting the CUDA bug
Post by: Claggy on 06 Jul 2009, 03:09:04 pm
Ronw's been busy today! Timeline Boinc Trac (http://boinc.berkeley.edu/trac/timeline)

Claggy

boinc_6.6.37_windows_intelx86.exe (http://boinc.berkeley.edu/dl/boinc_6.6.37_windows_intelx86.exe)

boinc_6.6.37_windows_x86_64.exe (http://boinc.berkeley.edu/dl/boinc_6.6.37_windows_x86_64.exe)