Author Topic: V11 of CUDA MB mod - attempt to restart freezed apps  (Read 24008 times)

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
V11 of CUDA MB mod - attempt to restart freezed apps
« on: 24 Mar 2009, 05:26:59 pm »
This V11 CUDA MB mod is intended for monitoring CUDA MB apps in a multi-GPU environment and restarting frozen apps.
The V11 CPU part is still being written, so you can use this app with the V10 CPU parts (you need to rename the executable file to MB_6.08_mod_CUDA_V10.exe to use it with the existing V10 pack).

If your CUDA MB works without freezing, you SHOULD NOT use this build; it's a special build just for those who have experienced app freezes.

To everyone who is interested in this mod and has a host with freezing CUDA MB apps: please post your host config here:
CPU (and its SSE level) / OS / number of CUDA devices / GPU models.

This build will look for the projects/setiathome.berkeley.edu/number_of_GPUs file and derive the number of available CUDA devices from it. So be sure you have used the V10 pack before and know what number should be in that file.
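
A minimal sketch of that lookup, for illustration only. It assumes the file holds a single integer as plain text (the file format is not described in this post), and the function name `read_gpu_count` and the fallback value are hypothetical, not taken from the mod itself:

```python
from pathlib import Path


def read_gpu_count(boinc_data_dir, default=1):
    """Return the device count stored in number_of_GPUs, or `default`.

    Assumption: the file contains one integer in plain text.
    """
    path = (Path(boinc_data_dir) / "projects"
            / "setiathome.berkeley.edu" / "number_of_GPUs")
    try:
        count = int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a number: fall back.
        return default
    # A zero or negative count makes no sense, so fall back as well.
    return count if count > 0 else default
```

Falling back to a single device when the file is missing or malformed is just one possible choice; the real build may behave differently.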

The current mod version is completely untested; it may run into problems with security rights (it needs the right to kill another process) under Vista, and perhaps not only under Vista, and so on.
Consider it experimental software and don't download it if you are just a "consumer" type of user.
And please make a backup of the BOINC data folder and disable BOINC network access before you start experimenting. There are enough trashed caches already...

ADDON: CPU counterpart for SSSE3-enabled CPUs added. It should return zero when it exits because the GPU part was killed, and on some other errors, which will allow the task to restart from the last checkpoint.


[attachment deleted by admin]
« Last Edit: 25 Mar 2009, 05:00:26 am by Raistmer »

Offline Vyper

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 376
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #1 on: 24 Mar 2009, 06:15:04 pm »
Cool, that was fast.. I'll give it a shot when Thumper is back up and running.

But for now I haven't got a single locked WU.. Yet!!!

Kind regards Vyper

Offline Geek@Play

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 330
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #2 on: 24 Mar 2009, 07:46:32 pm »
Raistmer...........

This computer's GPU hangs up at least once every day.  The file that is attached is a cpuz text dump on that computer while it is crunching.

One GPU, NVIDIA GeForce 9800 GT (1023MB) driver: 18208.

Hope this info helps solve the problem.

[attachment deleted by admin]
Boinc....Boinc....Boinc....Boinc

chronek

  • Guest
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #3 on: 25 Mar 2009, 12:46:50 pm »
I didn't have any GPU hangs - GF 8600M GT - NVIDIA beta CUDA driver for notebooks, 181.22

Offline Marius

  • Knight o' The Realm
  • **
  • Posts: 84
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #4 on: 25 Mar 2009, 05:17:16 pm »
From time to time my GPU freezes. It also has some computation errors.

I installed both applications (and stopped the freeze-script). The first GPU task has just finished, so it looks good (like usual!). Just 80 units more before my seti@enhanced queue is empty.

Thanks,
Marius

[attachment deleted by admin]

Offline Geek@Play

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 330
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #5 on: 26 Mar 2009, 01:18:51 pm »
Raistmer...........

Can I use this build on my one computer that always freezes on the GPU?  That computer has only one GPU.  Also, what is the number_of_GPUs file and do I need it?
Boinc....Boinc....Boinc....Boinc

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #6 on: 26 Mar 2009, 01:34:57 pm »
Raistmer...........

Can I use this build on my one computer that always freezes on the GPU?  That computer has only one GPU.  Also, what is the number_of_GPUs file and do I need it?
Not yet.
I'm adding detection capability to the teamed AK_v8 build too. Then it will look for the GPU app as well.
The current build has a few errors that I caught today (I was able to run this app on a dual-GPU host). Will post an enhanced one soon.

Offline Geek@Play

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 330
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #7 on: 27 Mar 2009, 12:12:51 am »
I may have to stop crunching CUDA on this computer.  Today it has not gone more than 1 hour before CUDA freezes up.  The CPUs keep on going.  I tried clocking the CPU and memory back to stock speeds earlier today but it still locks up.  I have no more ideas.

Anybody else have any ideas?  Tomorrow I will swap the NVIDIA card with an identical one from another computer and see what happens.  Perhaps it's a hardware problem on the NVIDIA card and this might help to narrow the problem down.
« Last Edit: 27 Mar 2009, 12:33:58 am by Geek@Play »
Boinc....Boinc....Boinc....Boinc

chronek

  • Guest
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #8 on: 27 Mar 2009, 02:36:16 am »
Are you using the NVIDIA CUDA drivers? (http://www.nvidia.com/object/cuda_get.html) This is my 4th day of GPU crunching - no errors or hangs
(but with Mark's ap_info & BOINC 6.6.17)
« Last Edit: 27 Mar 2009, 02:40:39 am by chronek »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #9 on: 27 Mar 2009, 02:48:18 am »
I may have to stop crunching CUDA on this computer.  Today it has not gone more than 1 hour before CUDA freezes up.  The CPUs keep on going.  I tried clocking the CPU and memory back to stock speeds earlier today but it still locks up.  I have no more ideas.

Anybody else have any ideas?  Tomorrow I will swap the NVIDIA card with an identical one from another computer and see what happens.  Perhaps it's a hardware problem on the NVIDIA card and this might help to narrow the problem down.
I tend to think it's a GPU hardware problem indeed.
On my dual-GPU host, CUDA MB always hangs on only one of the boards (there are an 8500GT and a 9400GT, but only the 9400GT freezes). I swapped the cards to make the 9400GT primary instead of the 8500GT - the freezing continued.

Offline Geek@Play

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 330
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #10 on: 27 Mar 2009, 12:29:37 pm »
Who said "Computers are fun"?

This morning I swapped video cards around on the problem machine and another machine.  Brought both machines back online and guess what............they BOTH have been running without problems for the last 3 hours.

[edit] Now 6 hours running and no problems.

[edit] GPU froze again.  I have applied to EVGA for an RMA to have the GPU replaced.
« Last Edit: 27 Mar 2009, 08:24:12 pm by Geek@Play »
Boinc....Boinc....Boinc....Boinc

Offline Vyper

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 376
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #11 on: 29 Mar 2009, 06:43:06 am »
Geek:

I don't think that it's related to the GPU board itself. I think it's some sort of program stall occurring in the background, not the GPU itself locking up, so if EVGA validates your RMA because you can't run S@H on your GPU, I would be surprised.

There are tons of these lockups that people are reporting on the NVIDIA CUDA board, with problems exactly like the ones S@H is having. The error seems to differ between various hardware and settings.

I talked to Raistmer over ICQ about having a lot of these issues too, and nothing I did solved anything: different cufft, cudart (CUDA 2.0 & CUDA 2.1), different Raistmer VLAR-kill apps v5 through v10. GPU OC, downclock, overclock etc. couldn't make CUDA S@H run properly. Then lastly I did two other things:

1. On my Phenom system I raised the HT bus multiplier from 7 to 8 (running at 2 GHz on my Phenom II instead of 1.75 GHz)
2. I overclocked the RAM on my GPU from the standard 999 MHz to 1100 MHz.
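
As a quick sanity check on item 1: assuming the HT link speed is simply the reference clock times the multiplier (an assumption, not something stated in this post), both figures point to the same reference clock, so only the multiplier changed. The function name is hypothetical:

```python
def ht_reference_mhz(link_mhz, multiplier):
    """Reference clock implied by a link speed and its multiplier.

    Assumes link speed = reference clock * multiplier.
    """
    return link_mhz / multiplier


ref_before = ht_reference_mhz(1750, 7)  # 1.75 GHz at multiplier 7
ref_after = ht_reference_mhz(2000, 8)   # 2 GHz at multiplier 8
```

Both values come out at 250 MHz under that assumption.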

Voila.. I had approx. 1900 WUs in stock, and then the MB outage occurred, so I haven't been able to verify this further, but NONE of my WUs got stuck after that change for the next 37 hours until the cache dried out, and that has never ever happened before.
So it seems that CUDA behaves differently on different hardware and has a somewhat buggy interrupt routine that seems to cause the CUDA app to stall if timing conditions aren't met.

I just wanted to share this before we all start accusing the GPU hardware itself of being faulty. After all, running "programs" on a GPU is quite new, and I think this won't be settled until the hardware manufacturers perhaps make a smarter CPU/GPU implementation (CUDA), or perhaps a new way of wait-state signaling so the GPU won't time out waiting for data that the GPU memory can't deliver (gpumem)...

My statements above are "pure speculation" and I don't want people thinking that this is it!! I just want to isolate it, and until the S@H servers are up again at full scale I won't be able to further confirm that my GPUs aren't locking up for the moment in my own case.

But!! We S@H users who invest plenty of cash to run GPUs 24/7 for S@home do exist, and the problem is random.

This quick and dirty fix that Raistmer incorporated IS a necessity for all of us who have the lockup issue. A fix for GPU program stalls is needed as a high priority and should be incorporated into BOINC's main program: it should monitor "cuda" executables, with options to try to pause and unpause stuck work, and if the work doesn't respond to pause/resume after three retries, terminate it so we can start on the next work. This problem could reach further than just S@H, perhaps to GPUGRID or others too.
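
The pause/unpause-then-terminate policy described above could be sketched roughly like this. It is only an illustration of the idea, not BOINC code and not Raistmer's mod: the `task` object with its `is_stuck()`, `pause()`, `resume()` and `terminate()` methods, the function name and the settle time are all hypothetical, and how to decide that a task is stuck is left open:

```python
import time


def nudge_or_kill(task, retries=3, settle_seconds=5.0, sleep=time.sleep):
    """Try pause/resume up to `retries` times; terminate if still stuck.

    `task` is any object providing is_stuck(), pause(), resume() and
    terminate() (hypothetical interface). Returns "running" if the
    task is progressing or recovered, "terminated" otherwise.
    """
    if not task.is_stuck():
        return "running"
    for _ in range(retries):
        task.pause()
        task.resume()
        # Give the app a moment to make progress before re-checking.
        sleep(settle_seconds)
        if not task.is_stuck():
            return "running"
    # Still stuck after all retries: kill it so the next work can start.
    task.terminate()
    return "terminated"
```

Passing `sleep` as a parameter just makes the sketch easy to try out without real waiting.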

My 2 cents on this issue.. Thoughts, people?!?

Kind Regards Vyper
« Last Edit: 29 Mar 2009, 07:05:08 am by Vyper »

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #12 on: 29 Mar 2009, 08:14:04 am »
Another 2 cents - or tuppence, in my case.

It may be just luck, but I don't think I've seen this stalling/freezing behaviour at all. I've been running three cards 24/7 for the last two months, so I must have done something of the order of 15,000 WUs in all: if any problems were to show up, I think I'd have seen them by now.

The rigs are all similar, and built specifically with crunching in mind: Intel Quads on low-end Foxconn motherboards, with Zotac CUDA cards (my local trade counter specialises in Foxconn/Zotac, so they were the cheapest and easiest to pick up locally). The faster box has a Q9300, dual-channel DDR2 RAM @ 800 MHz, and a 9800GTX+: the other two are slower, with Q6600s, 667MHz RAM, and a 9800GT each. Everything runs at manufacturer's rated speed (which in the case of the GTX+ means a slight overclock above typical nVidia speeds). OS is Windows XP 32-bit, nVidia drivers are 181.20 (January nVidia recommended release), and - usually - the stock Berkeley v6.08 CUDA release. [I sometimes run a burst of VLAR-kill if I need my cache recycled quickly for some new test run]

I started with BOINC v6.4.5: didn't bother with any of the early v6.6's (read too many bad things here!), but bit the bullet and upgraded to v6.6.14, which was the first to properly support both CUDA and optimised apps at the same time. It's a big upgrade, and has to be done carefully, but the effort is worth it. I'm now up to v6.6.18, where I'll probably stick: it's working well, and has kept my caches filled exactly to the Plimsoll line through the current server shortages.

So I think I'm with Vyper on this one: there isn't one single, simple cause of the stalling/freezing that people report. Subtle timing issues between GPU cards and motherboards may well be a contributory factor.

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #13 on: 29 Mar 2009, 08:24:24 am »
Agreed with both.
The reason for the freezing is still unclear. The first question is how to avoid performance loss if freezing exists. The next question is what causes the freezing and whether it is possible to "unfreeze" a host once and for all.

theoldman

  • Guest
Re: V11 of CUDA MB mod - attempt to restart freezed apps
« Reply #14 on: 29 Mar 2009, 04:50:09 pm »
I tend to think it's a GPU hardware problem indeed.
On my dual-GPU host, CUDA MB always hangs on only one of the boards (there are an 8500GT and a 9400GT, but only the 9400GT freezes). I swapped the cards to make the 9400GT primary instead of the 8500GT - the freezing continued.

Maybe this will help somehow..
I have an 8600GT at home and a 9600GSO at work. The 8600GT has no freezes at all, or maybe 1-2 a month (unseen - the whole family uses my PC).
At work the 9600GSO freezes 1-2 times a week, or even more.
I'm not 100% sure, but 99% sure that the video driver at home and at work is the same version - 181.22. The BOINC version and optimised packages are identical too.
The only difference is the hardware. Neither video card is overclocked or modified in any other way (I only swapped the coolers for bigger ones :) ). Both computers have P4/3 GHz processors, 2 GB of memory, and motherboards of the same brand (Epox) with almost identical characteristics and BIOSes. The OS is Win XP SP3.
The only significant difference is in the video cards.
« Last Edit: 29 Mar 2009, 04:58:56 pm by theoldman »

 
