Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Raistmer on 24 Mar 2009, 05:26:59 pm

Title: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 24 Mar 2009, 05:26:59 pm
This V11 CUDA MB mod is intended to monitor CUDA MB apps in a multi-GPU environment and restart frozen apps.
The V11 CPU part is still being written, so for now you can use this app with the V10 CPU parts (you need to rename the executable file to MB_6.08_mod_CUDA_V10.exe to use it with the existing V10 pack).

If your CUDA MB works without freezing, you SHOULD NOT use this build. It's a special build just for those who have experienced app freezes.

If you are interested in this mod and have a host with freezing CUDA MB apps, please post your host config here:
CPU (and its SSE level) /OS/number of CUDA devices/GPU models.

This build looks for the file projects/setiathome.berkeley.edu/number_of_GPUs and derives the number of available CUDA devices from it. So be sure you have used the V10 pack before and know what number should be in that file.
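As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not the mod's actual code: the function name `read_gpu_count`, the fallback value, and the assumption that the file holds a single integer are all hypothetical, since the post only says the file contains a number.

```python
from pathlib import Path


def read_gpu_count(project_dir, default=1):
    """Return the device count stored in number_of_GPUs, or `default`.

    Assumption: the file holds one positive integer as plain text.
    """
    path = Path(project_dir) / "number_of_GPUs"
    try:
        text = path.read_text().strip()
    except OSError:
        # File missing or unreadable: fall back to the default.
        return default
    try:
        count = int(text.split()[0])
    except (ValueError, IndexError):
        # Empty file or non-numeric content: fall back to the default.
        return default
    return count if count > 0 else default
```

Calling `read_gpu_count("projects/setiathome.berkeley.edu")` from the BOINC data folder would then return the stored number, or 1 if the file is missing or malformed.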

The current mod version is completely untested. It may run into problems with security rights (it needs the right to kill another process) under Vista, and possibly not only under Vista, and so on.
Consider it experimental software and don't download it if you are just a "consumer" type of user.
And please make a backup of your BOINC data folder and disable BOINC network access before you start experimenting. There are enough trashed caches already...

ADDON: A CPU counterpart for SSSE3-enabled CPUs has been added. It should return zero when exiting because the GPU part was killed, and on some other errors, which allows the task to restart from the last checkpoint.


[attachment deleted by admin]
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Vyper on 24 Mar 2009, 06:15:04 pm
Cool, that was fast.. I'll give it a shot when Thumper is back up and running.

But for now I haven't got a single locked WU.. Yet!!!

Kind regards Vyper
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 24 Mar 2009, 07:46:32 pm
Raistmer...........

This computer's GPU hangs up at least once every day.  The file that is attached is a cpuz text dump on that computer while it is crunching.

One GPU, NVIDIA GeForce 9800 GT (1023MB) driver: 18208.

Hope this info helps solve the problem.

[attachment deleted by admin]
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: chronek on 25 Mar 2009, 12:46:50 pm
I didn't have any GPU hangs - GF 8600M GT - NVIDIA beta CUDA driver for notebooks, 181.22
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Marius on 25 Mar 2009, 05:17:16 pm
From time to time my GPU freezes. It also has some computation errors.

I installed both applications (and stopped the freeze-script). The first GPU task has just finished, so it looks good (like usual!). Just 80 units more before my seti@enhanced queue is empty.

Thanks,
Marius

[attachment deleted by admin]
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 26 Mar 2009, 01:18:51 pm
Raistmer...........

Can I use this build on my one computer that always freezes on the GPU?  That computer has only one GPU.  Also what is the number_of_GPUs file and do I need it?
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 26 Mar 2009, 01:34:57 pm
Quote from: Geek@Play
> Raistmer...........
>
> Can I use this build on my one computer that always freezes on the GPU?  That computer has only one GPU.  Also what is the number_of_GPUs file and do I need it?

Not yet.
I am adding detection capability to the teamed AK_v8 build too. Then it will look for the GPU app as well.
The current build has a few errors that I caught today (I was able to run this app on a dual-GPU host). Will post an improved one soon.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 27 Mar 2009, 12:12:51 am
I may have to stop crunching CUDA on this computer.  Today it has not gone more than 1 hour before CUDA freezes up.  CPU's keep on going.  I tried clocking the CPU and memory back to stock speeds earlier today but still it locks up.  I have no more ideas.

Anybody else have any ideas?  Tomorrow I will swap the NVIDIA card with an identical one from another computer and see what happens.  Perhaps it's a hardware problem on the NVIDIA card and this might help to narrow the problem down.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: chronek on 27 Mar 2009, 02:36:16 am
Are you using the NVIDIA CUDA drivers? (http://www.nvidia.com/object/cuda_get.html) This is my 4th day of GPU crunching - no errors or hangs
(but with Mark's app_info & BOINC 6.6.17)
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 27 Mar 2009, 02:48:18 am
Quote from: Geek@Play
> I may have to stop crunching CUDA on this computer.  Today it has not gone more than 1 hour before CUDA freezes up.  CPU's keep on going.  I tried clocking the CPU and memory back to stock speeds earlier today but still it locks up.  I have no more ideas.
>
> Anybody else have any ideas?  Tomorrow I will swap the NVIDIA card with an identical one from another computer and see what happens.  Perhaps it's a hardware problem on the NVIDIA card and this might help to narrow the problem down.

I tend to think it's a GPU hardware problem indeed.
On my dual-GPU host CUDA MB always hangs on only one of the boards (there are an 8500GT and a 9400GT, but only the 9400GT freezes). I swapped the cards to make the 9400GT primary instead of the 8500GT - the freezing continued.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 27 Mar 2009, 12:29:37 pm
Who said "Computers are fun"?

This morning I swapped video cards around on the problem machine and another machine.  Brought both machines back online and guess what............they BOTH have been running without problems for the last 3 hours.

[edit] Now 6 hours running and no problems.

[edit] GPU froze again.  I have applied to EVGA for an RMA to have the GPU replaced.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Vyper on 29 Mar 2009, 06:43:06 am
Geek:

I don't think that it's related to the GPU board itself. I think it's some sort of program stall occurring in the background rather than the GPU itself locking up, so I would be surprised if EVGA validates your RMA because you can't run S@H on your GPU.

There are tons of these lockups that people are having on the NVIDIA CUDA board, with exactly the same kind of problems that S@H is having. The error seems to differ between various hardware and settings.

I told Raistmer over ICQ that I had a lot of these issues too, and nothing I did solved anything: different cufft and cudart versions (CUDA 2.0 & CUDA 2.1), different Raistmer VLAR-kill apps v5 through v10. No GPU downclock or overclock etc. could make CUDA S@H run properly. Then lastly I did two other things:

1. On my Phenom system I raised the HT bus multiplier from 7 to 8 (running at 2 GHz on my Phenom II instead of 1.75 GHz)
2. I overclocked the RAM on my GPU from the standard 999 MHz to 1100 MHz..

Voila.. I had approx. 1900 WUs in stock when the MB outage occurred, so I haven't been able to verify this further, but NONE of my WUs got stuck after that change for the next 37 hours until the cache dried out, and that has never ever happened before.
So it seems that CUDA behaves differently on different hardware and has a somewhat buggy interrupt routine that seems to make the CUDA app stall if timing conditions aren't met.

I just wanted to share this before we all start accusing the GPU hardware itself of being faulty. After all, running "programs" on a GPU is quite new, and I think this won't go away until the hardware manufacturers make a smarter CPU/GPU implementation (CUDA), or perhaps a new way of wait-state signaling so the GPU won't time out waiting for data that the GPU memory can't deliver (gpumem)...

My statements above are "pure speculation" and I don't want people thinking that this is it!! I just want to isolate it, and until the S@H servers are up again at full scale I won't be able to confirm further that my GPUs have stopped locking up in my own case.

But!! We S@H users who invest plenty of cash to run GPUs 24/7 for S@home do exist, and the problem is random.

This quick and dirty fix that Raistmer incorporated IS a necessity for all of us who have the lockup issue. A fix like this for stalled GPU programs is needed with high priority and should be incorporated into BOINC's main program: it should monitor "cuda" executables, with options to try pausing and unpausing stuck work, and if the work doesn't respond to pause/resume after three retries, terminate it so we can start on the next work. This problem could reach beyond S@H, perhaps to GPUGrid or others too.
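The pause/resume-then-terminate policy proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not BOINC code and not Raistmer's mod: the function name `handle_stalled_task` and its four callback parameters are hypothetical stand-ins for whatever a real client would use to check progress and control the process.

```python
def handle_stalled_task(is_responding, pause, resume, terminate, max_retries=3):
    """Try to revive a stuck task; terminate it if it never responds.

    All four callables are hypothetical hooks supplied by the caller:
    is_responding() -> bool, pause(), resume(), terminate().
    Returns "recovered" or "terminated".
    """
    for _ in range(max_retries):
        pause()
        resume()
        if is_responding():
            # The task made progress again after a pause/resume cycle.
            return "recovered"
    # Still stuck after max_retries cycles: give up on this task
    # so the next one can start.
    terminate()
    return "terminated"
```

Keeping the process control behind callbacks leaves open how a stall is detected (for example, by watching whether reported progress changes), which the thread does not specify.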

My 2 cents on this issue.. Thoughts, people?!?

Kind Regards Vyper
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Richard Haselgrove on 29 Mar 2009, 08:14:04 am
Another 2 cents - or tuppence, in my case.

It may be just luck, but I don't think I've seen this stalling/freezing behaviour at all. I've been running three cards 24/7 for the last two months, so I must have done something of the order of 15,000 WUs in all: if any problems were to show up, I think I'd have seen them by now.

The rigs are all similar, and built specifically with crunching in mind: Intel Quads on low-end Foxconn motherboards, with Zotac CUDA cards (my local trade counter specialises in Foxconn/Zotac, so they were the cheapest and easiest to pick up locally). The faster box has a Q9300, dual-channel DDR2 RAM @ 800 MHz, and a 9800GTX+: the other two are slower, with Q6600s, 667MHz RAM, and a 9800GT each. Everything runs at manufacturer's rated speed (which in the case of the GTX+ means a slight overclock above typical nVidia speeds). OS is Windows XP 32-bit, nVidia drivers are 181.20 (January nVidia recommended release), and - usually - the stock Berkeley v6.08 CUDA release. [I sometimes run a burst of VLAR-kill if I need my cache recycled quickly for some new test run]

I started with BOINC v6.4.5: didn't bother with any of the early v6.6's (read too many bad things here!), but bit the bullet and upgraded to v6.6.14, which was the first to properly support both CUDA and optimised apps at the same time. It's a big upgrade, and has to be done carefully, but the effort is worth it. I'm now up to v6.6.18, where I'll probably stick: it's working well, and has kept my caches filled exactly to the Plimsoll line through the current server shortages.

So I think I'm with Vyper on this one: there isn't one single, simple cause of the stalling/freezing that people report. Subtle timing issues between GPU cards and motherboards may well be a contributory factor.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 29 Mar 2009, 08:24:24 am
Agreed with both.
The reason for the freezing is still unclear. The first question is how to avoid performance loss when freezing occurs. The next question is what causes the freezing and whether it is possible to "unfreeze" a host once and for all.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: theoldman on 29 Mar 2009, 04:50:09 pm
Quote from: Raistmer
> I tend to think it's a GPU hardware problem indeed.
> On my dual-GPU host CUDA MB always hangs on only one of the boards (there are an 8500GT and a 9400GT, but only the 9400GT freezes). I swapped the cards to make the 9400GT primary instead of the 8500GT - the freezing continued.

Maybe this will help somehow..
I have an 8600GT at home and a 9600GSO at work. The 8600GT has no freezes at all, or maybe 1-2 a month (which would go unseen, since the whole family uses my PC).
At work the 9600GSO freezes 1-2 times a week, or even more.
I'm not 100% sure, but I'm 99% sure the video driver at home and at work is the same version - 181.22. The BOINC version and optimised packages are identical too.
The only difference is the hardware. Neither video card is overclocked or modified in any other way (I only replaced the coolers with bigger ones :) ). Both computers have P4/3 GHz processors, 2 GB of memory, and motherboards of the same brand (Epox) with almost identical characteristics and BIOSes. The OS is Win XP SP3.
The only significant difference is in the video cards.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 29 Mar 2009, 10:12:55 pm
I still believe that it's a GPU problem.  That very same GPU has been freezing all weekend while the original computer and the GPU I swapped into it have been running without problems.  In short, the freezing problem stayed with a specific GPU even when it was swapped into a different computer.

Hope to hear from EVGA tomorrow but have received email confirming the RMA was requested by me.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Geek@Play on 30 Mar 2009, 05:41:57 am
EVGA approved the RMA and I will be returning the GPU to them today.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Vyper on 30 Mar 2009, 09:09:48 am
Ok cool. Actually it would be wonderful if it is that issue you have been having from the beginning so you can get rid of it.

Please keep us updated in this case, because if it works for you then a lot of manufacturers could face some hefty RMA from us s@h people :)

Kind regards Vyper
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Vyper on 05 Apr 2009, 03:57:28 am
Well, I reported earlier that I did something that solved the freezing app issue..

After S@H got back up and started distributing WUs again two days ago I haven't had one single lousy cuda.exe hang whatsoever...
Frikking amazing but also somewhat disturbing..

So regarding the lockup version of this executable.. I can't test anything if I can't reproduce the problem that I was having..

I hope that others out there don't need to test this version, because really, who wants to deal with bugs that you need a workaround for anyway..

If my issues change and the cuda.exe starts hanging again I will definitely beta test this app and report further, but as I said, my computer needs to start locking up again first...

Kind regards Vyper
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 05 Apr 2009, 07:15:27 am
Fine, I'm happy that the problem disappeared without using such a watchdog version (it slows down processing, though only very slightly, but ...).
I have used it for more than a week on my dual-GPU host and have noticed no more hangs. Maybe it helps, maybe there were simply no hangups - I didn't look in every stderr to find out whether there were restarts or not.
No invalid results from that host - that's enough for me for now.
If someone encounters a bug, they can report it. I'm busy with AP and newly received ATI cards now.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Zeus Fab3r on 09 Apr 2009, 06:21:32 am
Quote from: Raistmer
> Fine, I'm happy that the problem disappeared without using such a watchdog version (it slows down processing, though only very slightly, but ...).
> I have used it for more than a week on my dual-GPU host and have noticed no more hangs. Maybe it helps, maybe there were simply no hangups - I didn't look in every stderr to find out whether there were restarts or not.
> No invalid results from that host - that's enough for me for now.
> If someone encounters a bug, they can report it. I'm busy with AP and newly received ATI cards now.

Finally a solution that actually works  ;D
3 days, 24/7 w/o single hangup and counting. Great job !
Still have a question. Can I run this cross_watch mod alongside opt AP?

Thanks again Raistmer.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 09 Apr 2009, 04:25:37 pm
Quote from: Zeus Fab3r
> Still have a question. Can I run this cross_watch mod alongside opt AP?

As long as at least 2 GPU apps are running too. BOINC can't correctly schedule other apps alongside the team MB pack. But you could try BOINC 6.6.20 (the recommended version now) with this cross-watch build without the CPU teampack part. Not sure what result it will give, but maybe it will work OK.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: chelski on 18 Apr 2009, 01:36:31 pm
I'm using MB_6.08_mod_CUDA_V11_def_func_FFTW_ESTIMATE_update in a V9 structure (e.g. without a number_of_cpus) in order to run AP on CPUs and MB on GPUS and still get the occasional freeze at 0%.  Is there a workaround for this structure - e.g. cross watch not between CPU-MB app and GPU-MB app but between AP and GPU-MB app to solve the freezing issue?

Please correct me if I'm wrong about how this thing works... things have really moved on rather fast and it is not that clear anymore which app / build to use for which setup.  Thanks

Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Borgholio on 19 Apr 2009, 02:13:07 pm
Currently running the non-team V9 pack on Boinc 6.6.20 and it's working fine aside from the occasional CUDA app freeze.  I have a task set under Windows to run benchmarks every half hour which frees up most stuck tasks but from time to time a task will freeze that requires a restart of BOINC.  If I were to download V11, would I need to upgrade from V9 to V10 first, or could I simply extract the CUDA app from V11 and modify my app-info to look at the new file instead?
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 19 Apr 2009, 04:52:01 pm
Quote from: chelski
> I'm using MB_6.08_mod_CUDA_V11_def_func_FFTW_ESTIMATE_update in a V9 structure (e.g. without a number_of_cpus) in order to run AP on CPUs and MB on GPUS and still get the occasional freeze at 0%.  Is there a workaround for this structure - e.g. cross watch not between CPU-MB app and GPU-MB app but between AP and GPU-MB app to solve the freezing issue?
>
> Please correct me if I'm wrong about how this thing works... things have really moved on rather fast and it is not that clear anymore which app / build to use for which setup.  Thanks


No, there is no such mod. But it's possible.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 19 Apr 2009, 04:54:16 pm
Quote from: Borgholio
> Currently running the non-team V9 pack on Boinc 6.6.20 and it's working fine aside from the occasional CUDA app freeze.  I have a task set under Windows to run benchmarks every half hour which frees up most stuck tasks but from time to time a task will freeze that requires a restart of BOINC.  If I were to download V11, would I need to upgrade from V9 to V10 first, or could I simply extract the CUDA app from V11 and modify my app-info to look at the new file instead?

You need the correct name of the executable file listed in app_info. Moreover, the modded Vx CPU files will call CUDA MB executables only by fixed names (different for different versions). So the filename must match.
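One way to check the filename match described above is to compare the names referenced in app_info.xml against the files actually present in the project folder. The Python sketch below is only an illustration: the helper `unmatched_files` is hypothetical, and the sample XML is a cut-down example built around the MB_6.08_mod_CUDA_V10.exe name mentioned earlier in the thread, not a complete or authoritative app_info.

```python
import xml.etree.ElementTree as ET

# Illustrative, cut-down app_info content (an assumption, not a full file).
SAMPLE_APP_INFO = """
<app_info>
  <app><name>setiathome_enhanced</name></app>
  <file_info>
    <name>MB_6.08_mod_CUDA_V10.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <file_ref>
      <file_name>MB_6.08_mod_CUDA_V10.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
"""


def unmatched_files(app_info_xml, files_on_disk):
    """Return sorted file names that app_info references but the folder lacks."""
    root = ET.fromstring(app_info_xml)
    referenced = set()
    # Collect names from both <file_info> and <file_ref> entries.
    for tag in ("file_info/name", "app_version/file_ref/file_name"):
        for node in root.findall(tag):
            if node.text and node.text.strip():
                referenced.add(node.text.strip())
    # Exact, case-sensitive comparison against the folder listing.
    return sorted(referenced - set(files_on_disk))
```

In practice `files_on_disk` could come from `os.listdir()` on the project folder. An empty result means every referenced name has a matching file. Note that this does not check the fixed names the modded CPU apps expect, which are not listed in the thread.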
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Borgholio on 19 Apr 2009, 07:13:26 pm
So would it be easier then to simply upgrade to v11 completely?
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 19 Apr 2009, 11:07:55 pm
Quote from: Borgholio
> So would it be easier then to simply upgrade to v11 completely?

Sure. The recommended method now is to upgrade to BOINC 6.6.20 and use its scheduling abilities instead of the "teamed" pack.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Borgholio on 20 Apr 2009, 12:07:05 am

Quote from: Raistmer
> Sure. The recommended method now is to upgrade to BOINC 6.6.20 and use its scheduling abilities instead of the "teamed" pack.

Already running 6.6.20.  :)  Just wondering about the cuda frozen task kill.
Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: HSchmirPo on 22 Apr 2009, 01:38:40 pm

Quote from: Raistmer
> Sure. The recommended method now is to upgrade to BOINC 6.6.20 and use its scheduling abilities instead of the "teamed" pack.

Quote from: Borgholio
> Already running 6.6.20.  :)  Just wondering about the cuda frozen task kill.

The same here: updated to 6.6.20 and frozen task. :-[
Should we combine 6.6.20 with v11? :-\

Title: Re: V11 of CUDA MB mod - attempt to restart freezed apps
Post by: Raistmer on 22 Apr 2009, 02:05:49 pm

Quote from: Raistmer
> Sure. The recommended method now is to upgrade to BOINC 6.6.20 and use its scheduling abilities instead of the "teamed" pack.

Quote from: Borgholio
> Already running 6.6.20.  :)  Just wondering about the cuda frozen task kill.

Quote from: HSchmirPo
> The same here: updated to 6.6.20 and frozen task. :-[
> Should we combine 6.6.20 with v11? :-\

Yes, you could try (it's experimental software, so "could" and "would" but not "should" ;) ).
But you need a correct Number_of_GPUs file in place (as with the teamed pack). No teamed CPU AK_v8 apps are needed, only this file.