
V11 of CUDA MB mod - attempt to restart freezed apps


Geek@Play:
Who said "Computers are fun"?

This morning I swapped video cards between the problem machine and another machine. Brought both machines back online and guess what... they BOTH have been running without problems for the last 3 hours.

[edit] Now 6 hours running and no problems.

[edit] GPU froze again.  I have applied to EVGA for an RMA to have the GPU replaced.

Vyper:
Geek:

I don't think it's related to the GPU board itself. I think some sort of program stall occurs in the background, rather than the GPU itself locking up, so I would be surprised if EVGA validates your RMA just because you can't run S@H on your GPU.

There are tons of these lockups being reported on the nvidia CUDA board, with problems exactly like the ones S@H is having. The error seems to differ between various hardware and settings.

I told Raistmer over ICQ that I had a lot of these issues too, and nothing I did solved anything: different cufft and cudart versions (CUDA 2.0 & CUDA 2.1), different Raistmer VLAR-kill apps v5 through v10. Neither downclocking nor overclocking the GPU could make CUDA S@H run properly. Then lastly I did two other things:

1. On my Phenom system I raised the HT bus multiplier from 7 to 8 (running at 2 GHz on my Phenom II instead of 1.75 GHz).
2. I overclocked the RAM on the GPU from the standard 999 MHz to 1100 MHz.

Voila.. I had approx. 1900 WUs in stock, and then the MB outage occurred, so I haven't been able to verify this further, but NONE of my WUs got stuck after that change for the next 37 hours until the cache dried out, and that has never ever happened before.
So it seems that CUDA behaves differently on different hardware, and has a somewhat buggy interrupt routine that seems to cause the CUDA app to stall if timing conditions aren't met.

I just wanted to share this before we all start accusing the GPU hardware itself of being faulty. After all, running "programs" on a GPU is quite new, and I think this won't be solved until the hardware manufacturers perhaps make a smarter CPU/GPU implementation (CUDA), or perhaps a new way of wait-state signaling so the GPU won't time out waiting for data that the GPU memory can't deliver (gpumem)...

My statements above are "pure speculation" and I don't want people thinking that this is it!! I just want to isolate it, and until the S@H servers are up again at full scale I won't be able to further confirm that my GPUs aren't locking up for the moment in my own case.

But!! We S@H users who invest plenty of cash to run GPUs 24/7 for S@home do exist, and the problem is random.

This quick and dirty fix that Raistmer incorporated IS a necessity for all of us who have the lockup issue. A fix for stalled GPU programs is needed with high priority, and it should be incorporated into BOINC's main program: it should monitor "CUDA" executables and have options to try pausing and unpausing stuck work, and if the work doesn't respond to pause/resume after three retries, terminate it so we can start on the next work. This problem could reach further than S@H, perhaps to GPUGRID and others too.
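
To make the proposal a bit more concrete, here is a minimal sketch in Python of the retry policy described above: poll a task, try a pause/resume cycle when it shows no progress, and terminate it once three cycles in a row have not helped. This is not Raistmer's mod and not anything BOINC provides. `WatchState`, `check_task`, `FakeTask` and the task methods `progress()`, `suspend()`, `resume()` and `terminate()` are all hypothetical names invented for the illustration, and "no increase in reported progress between polls" is only an assumed way of detecting a stall.

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # "three retries", as suggested in the post


@dataclass
class WatchState:
    """What the watchdog remembers about one task between polls."""
    last_progress: float = -1.0
    retries: int = 0


def check_task(task, state, max_retries=MAX_RETRIES):
    """Run one polling pass over a task and return the action taken.

    `task` is a hypothetical object exposing progress(), suspend(),
    resume() and terminate(); it is not a real BOINC interface.
    """
    current = task.progress()
    if current > state.last_progress:
        # Work advanced since the last poll: not stuck, reset the counter.
        state.last_progress = current
        state.retries = 0
        return "ok"
    if state.retries >= max_retries:
        # Pause/resume did not help max_retries times in a row: give up
        # so the next work unit can start.
        task.terminate()
        return "terminated"
    # No progress: try to shake the task loose with a pause/resume cycle.
    task.suspend()
    task.resume()
    state.retries += 1
    return "retried"


class FakeTask:
    """Stand-in for a real task, used only to exercise the logic."""

    def __init__(self, readings):
        self._readings = iter(readings)  # progress values, one per poll
        self.cycles = 0                  # completed pause/resume cycles
        self.terminated = False

    def progress(self):
        return next(self._readings)

    def suspend(self):
        pass

    def resume(self):
        self.cycles += 1

    def terminate(self):
        self.terminated = True


if __name__ == "__main__":
    stuck = FakeTask([0.1, 0.1, 0.1, 0.1, 0.1])
    state = WatchState()
    print([check_task(stuck, state) for _ in range(5)])
```

With the stuck fake task above, the five polls give "ok" once, "retried" three times and then "terminated". A real monitor would call something like `check_task` on a timer, with a poll interval long enough that a healthy work unit always reports some progress in between.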

My 2 cents on this issue.. Thoughts, people?!?

Kind Regards Vyper

Richard Haselgrove:
Another 2 cents - or tuppence, in my case.

It may be just luck, but I don't think I've seen this stalling/freezing behaviour at all. I've been running three cards 24/7 for the last two months, so I must have done something of the order of 15,000 WUs in all: if any problems were to show up, I think I'd have seen them by now.

The rigs are all similar, and built specifically with crunching in mind: Intel Quads on low-end Foxconn motherboards, with Zotac CUDA cards (my local trade counter specialises in Foxconn/Zotac, so they were the cheapest and easiest to pick up locally). The faster box has a Q9300, dual-channel DDR2 RAM @ 800 MHz, and a 9800GTX+: the other two are slower, with Q6600s, 667MHz RAM, and a 9800GT each. Everything runs at manufacturer's rated speed (which in the case of the GTX+ means a slight overclock above typical nVidia speeds). OS is Windows XP 32-bit, nVidia drivers are 181.20 (January nVidia recommended release), and - usually - the stock Berkeley v6.08 CUDA release. [I sometimes run a burst of VLAR-kill if I need my cache recycled quickly for some new test run]

I started with BOINC v6.4.5: didn't bother with any of the early v6.6's (read too many bad things here!), but bit the bullet and upgraded to v6.6.14, which was the first to properly support both CUDA and optimised apps at the same time. It's a big upgrade, and has to be done carefully, but the effort is worth it. I'm now up to v6.6.18, where I'll probably stick: it's working well, and has kept my caches filled exactly to the Plimsoll line through the current server shortages.

So I think I'm with Vyper on this one: there isn't one single, simple cause of the stalling/freezing that people report. Subtle timing issues between GPU cards and motherboards may well be a contributory factor.

Raistmer:
Agreed with both.
The reason for the freezing is still unclear. The first question is how to avoid performance loss when freezing occurs. The next question: what causes the freezing, and is it possible to "unfreeze" the host once and for all?

theoldman:

--- Quote from: Raistmer on 27 Mar 2009, 02:48:18 am ---I tend to think it's a GPU hardware problem indeed.
On my dual-GPU host CUDA MB always hangs on only one of the boards (there are an 8500GT and a 9400GT, but only the 9400GT freezes). I swapped the cards to make the 9400GT primary instead of the 8500GT - freezing continued.

--- End quote ---

Maybe this will help somehow..
I have an 8600GT at home and a 9600GSO at work. The 8600GT has no freezes at all, or maybe 1-2 a month (unseen - the whole family uses my PC).
At work the 9600GSO freezes 1-2 times a week, or even more.
I'm not 100% sure, but I'm 99% sure the video driver at home and at work is the same version - 181.22. BOINC version and optimised packages are identical too.
The only difference is the hardware. Neither video card is overclocked or modified in any other way (I only swapped the coolers for bigger ones :) ) Both computers have P4/3GHz processors, 2 GB memory, and the same brand of motherboard (Epox) with almost identical characteristics and BIOSes. OS is Win XP SP3.
The only significant difference is in the video cards.
