Seti@Home optimized science apps and information
Optimized Seti@Home apps => Windows => Topic started by: Gizbar on 30 Oct 2009, 04:45:58 pm
-
Help required please. I'm slowly tearing out what little hair I have left as I can't get Cuda working properly on my new Win 7 system.
I have:- Asus M4A79T deluxe, 4Gb corsair DDR3 ram, AMD955 clocked up to 3.6Ghz at stock voltage, and a 9800GTX+ on Win7-64 bit release candidate, with Lunatics v0.1.
I installed a new GTX260 (a Gigabyte super overclock 896Mb, running at 680Mhz core, 1500Mhz shaders, and 2500Mhz ram). Gigabyte say they cherry-pick the GPUs to run at this speed. I only got it because the EVGA I ordered went out of stock and they agreed to do this one at the same price. It seemed to be working fine (I tried a couple of benchmarks, but don't have many to hand) until I tried to update to Win7-64 bit Home premium full version, and Lunatics v0.2 64bit.
Started off with a complete blank hdd, and installed all my stuff. Tried to save all my work, as discussed on Seti forum, and thought I'd succeeded. Installed Boinc 6.6.41, which I know now seems to be a bit flaky so uninstalled and put 6.6.38 on there. Installed Lunatic v0.2 and it started processing. Also running Nvidia 191.07, and cuda 2.3 dlls
The problem comes when it tries to hand over one Cuda to another. It finishes one unit, and then freezes the screen completely when trying to take up another. The mouse will move for a while, but none of the buttons on windows will work and then the mouse freezes too. It doesn't fail like this on every work unit, but I can't trust it to leave it be anymore. I proved the theory about the wu handover by following it on SysInternals processxp. I've even used EVGA precision to underclock the card slightly, but it doesn't stop it freezing up.
Sorry for the essay, but I'm trying to give as much information as possible...
Any ideas?
regards, Gizbar.
-
The most useful info is still missing: a link to your host!
-
Ok, I can do that, it is host number 5000538.
But I don't know how I can provide you any details of the ones that are failing. (but see below for one example!)
What seems to be happening is when they fail, they go back to the last checkpoint after rebooting, but Boinc is not finishing the work unit off, it's starting a new one from the cache.
One of the failing workunits is this one:- 02se09ab.28065.22971.12.10.166_1 - got to 03:04 and 26.551%
Hope that helps a bit more.
It was running for a few months on Win7-64bit with the 9800GTX+ on Lunatics v0.1
regards, Gizbar.
-
If you want people to help you, why not make their work easier and provide the complete URL to the host?
-
Ok, here it is.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5000538
Hope that helps.
Have now tried disabling aero, but no joy.
Running out of ideas because it doesn't seem to be the card. Just had a play with Stalker: Clear Sky just to see if it would crash, but never in the game.
regards, Gizbar.
-
You don't mention your power supply. Your new GPU will draw significantly more power than the old one - could that be the problem?
-
Hi Richard, thanks for the reply.
I don't think so. I have a Corsair 620w power supply, and I've used a power supply guide at http://extreme.outervision.com/psucalculatorlite.jsp which is telling me I should be drawing no more than about 80% of its rating.
I've done some more testing today with 3dmark06 and used GPU-Z to log the temperatures too. It hasn't crashed once and hasn't been over 74c. While running the Cuda app, the temperature was averaging 72c, which I don't think is too excessive for the GTX260 IIRC. CPU on its own is running fine, ran 4 instances of Prime95 to test it, never got above 51-52c.
It only seems to have a problem switching from one Cuda wu to the next, AFAICT. This leaves the wu partially done, as the crash makes it go back to the last save point in the wu, which could be as little as 3%. That doesn't bother me in itself, but it does bother me that it doesn't restart that wu, and goes onto another instead. I now have several sitting like this.
I just don't know why. I'll get a new PSU if I have to, but nothing else seems to be failing, and I thought that running the benchmark at top quality for an hour or more would stress the component just as much. Maybe it's because they're all running together? I'll try that now. My other system only has a 500w OCZ power supply in it, and is running flawlessly with the 9800GTX+ in it.
regards, Gizbar.
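
For anyone wanting to check the PSU headroom figure, here is a minimal sketch of the arithmetic. The 620 W rating and the "about 80%" ceiling come from the post above; the function name is hypothetical, and no actual system draw was measured in this thread.

```python
def psu_load_fraction(estimated_draw_watts: float, psu_rating_watts: float) -> float:
    """Return the estimated system draw as a fraction of the PSU's rated output."""
    if psu_rating_watts <= 0:
        raise ValueError("PSU rating must be positive")
    return estimated_draw_watts / psu_rating_watts


PSU_RATING_W = 620   # Corsair 620 W unit mentioned in the post
LOAD_CEILING = 0.80  # "no more than about 80%" reported by the calculator

# Highest draw that still stays within the quoted ceiling.
max_draw_w = PSU_RATING_W * LOAD_CEILING
print(f"80% of {PSU_RATING_W} W is about {max_draw_w:.0f} W")
```

Note that a calculator estimate like this says nothing about how the load is split across the 12 V rails, which is the figure that matters most for a graphics card.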
-
Another thing you should check is whether the motherboard can handle the current your card draws from the bus. This is also a significant amount of current.
I have a board that crashed while starting a WU but sometimes it ran for a day or so.
The power supply was large enough, but the regulator on the board was not up to the task.
-
Thanks Fred. Not sure how to check that one out, but the board is only a few months old and Asus make a big thing about the no. of phases they have and the stable supply to everything on it. There are no other cards running on any bus apart from that one. I'm going to do a few more tests with 3dmark vantage and then possibly try the non-vlarkill version of the cuda app, to see if that makes a difference. I've not had to sort a problem like this before with Seti, normally any problems are my own doing, and I've had to pay the price for doing it, and get help sorting it out, normally from the Seti boards.
regards, Gizbar.
-
You only tried this on Win 7 X64? I don't think the drivers are quite finished. I'm still waiting for a driver that runs my 2 GTX 295 without crashing all the CUDA tasks and making my system feel like a 386 system from way, way back.
The same system works just fine on XP 64.
And you may have a defective card, I had more than 100% defects.. ;D before I got some stable cards.
-
It's only a couple of days ago that I installed Win7-64 Home premium, but I've tried so much since then. I think I had the card running on Win7-64 RC, but I'm not 100% sure.
Don't think the card is defective, is there any other GPU test I can try? Something a bit more scientific?
I've just had 3dmark06 demo running with 4 Prime95 tasks running in the background and there were no crashes after 3 or 4 loops. The only problem I had was Prime95 didn't exit cleanly and stopped working as I tried to exit. I'm just about to try the V12 non-Vlarkill app and see what happens then.
Never used XP64 only 32bit. I may try WinXP again after this, if the non-Vlarkill app doesn't work. If it does, I'll try that for a few days, and use reschedule 1.9 to force Vlar to CPU.
The only other thing I can think of is that the PSU can't cope, as Richard suggested. I don't think so, but I'm not sure how I can log the voltages while processing to test it. I have Everest v5, but I don't know if voltages can be logged through it.
Don't want to give up on Win7 easily though, as I've found it pretty good and stable so far.
regards, Gizbar.
-
New update:- Reinstalled Windows XPPro-32 bit and Lunatics with Boinc 6.6.38, and Nvidia 191.07 with cuda 2.3 dll's.
Running GPU tests now to see if that solves the problem.
If it does, I'll be cheesed off after paying for the Win7 version, especially after it all seemed to be working on Win7-64 bit RC.
I'll keep you all posted...
regards, Gizbar.
P.S. Thanx to all who have chipped in with ideas and suggestions. It really is appreciated.
G.
-
New new update! (Does that make any sense?)
Been running without any crashes so far (only 1/2 an hour though, so far), and all seems to be well.
Using GPU-Z to log temps, not above 75c, and SysInternals process explorer to see what's happening (although I can't find a log option for that).
What I have noticed though, is that the cuda wu's are randomly stopping, and starting another one. I actually missed it stop, then when I looked again it had stopped one at approx 76% and had started another.
Then I actually saw it stop the current one at about 35%, and then go back to the previous one and finish it successfully.
I am now confused even more!
Any ideas on what is happening?
All on host number http://setiathome.berkeley.edu/show_host_detail.php?hostid=5000538
I have virtually nothing on this machine as I'm trying to prove that it seems to be either the 64bit version of lunatics that is causing the problem or the shift to Win7-64. Maybe I could try the 32bit version of Lunatics on the 64bit system?
regards, Gizbar.
-
The stopping is quite normal, but you need at least BOINC 6.6.38.
And it doesn't matter if the system is 32 or 64 bit.
If the WUs finish and are validated properly, everything is fine. Sometimes you don't want to know why BOINC is stopping a WU; it's just how the scheduler works. ;D
I've seen dozens of them at times.
-
Thanks for the info, Fred.
I'm just happy at the moment that it's running successfully without crashing. Never noticed it swapping wu before that's all. If it's normal behaviour I can live with it. My RAC is taking a dive with all this mucking around going on, but I need to find out what is going on.
As posted earlier I went back to version 6.6.38 from version 6.6.41, which I notice has now been pulled from Boinc downloads for Windows systems. Have you tried the 6.10.17 version yet?
Want to get it running properly before messing with another new Boinc version.
regards, Gizbar.
-
I will wait some time because 6.6.38 works without any problems for me. It's mainly for AMD/ATI users.
-
Gizbar,
I'd make sure you have Boinc 6.6.37 /.38 minimum, or go to 6.10.17, which is very stable now,
6.10.x was originally only getting ATI support, but has had a lot of other fixes / enhancements since,
the one i like the most is the 'Show active Task' only button, cuts down on a lot of the traffic between
the boinc client and boinc manager.
I'm running it on both my Desktop and Laptop no problem, along with the new Beta 195.39 drivers,
again no problem there either, I'd try Boinc 6.10.17 first, then later with the new drivers,
and have a look if there are any later chipset drivers available as well.
The problem with Boinc 6.6.36 is that if you have a largish cache, and you ask for Seti work
and if you get lots of shorties, then Boinc will go EDF (earliest deadline first) on the GPU, start a shortie on the GPU,
it might complete some of it before switching to another since that one is under worse deadline pressure than the first shortie,
that's O.K in itself, but the problem is that if Boinc switches GPU Wu's before the first wu has checkpointed,
then Boinc doesn't free up the GPU memory, and every GPU wu after that runs in CPU fallback mode,
taking a whole core, meaning you now have more CPU tasks than cores,
(I don't think that's your problem though)
6.6.37 fixes the problem of GPU tasks going into CPU fallback mode, but not the problem of the actual switching,
6.10.17 fixes the problem with switching GPU tasks, GPU tasks now run not quite FIFO order,
they run in received order by date/time, and subdivided into report deadline order.
Claggy
Edited and added more thoughts
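
The two orderings described above can be illustrated with a small sketch. It assumes a simplified task record with only a received time and a report deadline; the task names and dates are made up, and this is not BOINC's actual scheduler code, only a toy model of "earliest deadline first" versus "received order, subdivided into report deadline order".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Task:
    name: str           # hypothetical label, not a real workunit name
    received: datetime  # when the client downloaded the task
    deadline: datetime  # report deadline


def edf_order(tasks):
    """Earliest deadline first: the task under most deadline pressure runs first."""
    return sorted(tasks, key=lambda t: t.deadline)


def received_then_deadline_order(tasks):
    """Received order by date/time, with ties broken by report deadline."""
    return sorted(tasks, key=lambda t: (t.received, t.deadline))


tasks = [
    Task("long_a", datetime(2009, 10, 28, 9, 0), datetime(2009, 11, 20)),
    Task("shortie_b", datetime(2009, 10, 30, 12, 0), datetime(2009, 11, 6)),
    Task("shortie_c", datetime(2009, 10, 30, 12, 0), datetime(2009, 11, 5)),
]

print([t.name for t in edf_order(tasks)])
print([t.name for t in received_then_deadline_order(tasks)])
```

Under EDF a newly arrived shortie with a tight deadline jumps ahead of older work, which is what produces the mid-task switching; in the second ordering the older task keeps its place and only tasks received together are sorted by deadline.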
-
Thanks, but the rapid pullback of the previous release got me a bit scared. ;D
-
Thanks for the replies.
I've just heard from MarkJ on the Seti forum and he has explained that this could happen with the earlier versions of Boinc numbered 6.6.xx, and has been resolved in some of the later versions and suggested I upgrade to 6.10.17 as well. It has to do with the 'Task Switching Interval', which would let Boinc start a new task instead of running to completion. Please be aware I'm just relaying the information...
I think I'm proving that the card is stable on XPPro-32, it has been running for at least 2.5 hours now without a glitch, freeze, or crash. I'll leave it a bit longer and then start to try to install Win7-64HP again and see where that gets me to.
regards, Gizbar.
-
I tried the latest beta driver from nVidia and got a kernel error after a few minutes. Haven't seen one of those for some time. They still have work to do.
-
Hi Fred, I don't tend to mess around with the Beta drivers much. I only used 191.07 because that was an official release and whql'd by microsoft.
I'm happy that the card is now stable and running well on XPPro-32. Had it running for about 6.5 hours now (got waylaid by the last Grand Prix of the season, lol!) and it hasn't crashed or frozen or glitched once.
Now to try it on Win7-64HP again, and see if I can do a better job of installing it all than I did last time.
Might be offline for a bit.
regards, Gizbar.
-
New update! The story so far...
Ran successfully on XPPro-32 for approx 6.5 hours, so think I pretty much proved the card was stable.
Installed Win7-64HP on system. Told me an update had failed, and to re-run setup, choosing 'Get latest installation updates from internet'.
Did this and then installed outstanding updates. Then installed my preferred AV, which in my case is Avast.
Installed Nvidia drivers 191.07 for Win7-64.
Installed new partition just to install Boinc on. (Thanx Brodo!)
Then copied my installation of Boinc to the partition and installed 6.10.17. Installed Lunatics v0.2 32bit and cuda 2.3 dll's. Didn't work too well first off, so spent some time getting it to reread configs and data etc. Did work the first time with XPPro-32, but was still using 6.6.38 then.
Then ran reschedule to get some GPU work as no work is coming from the servers.
Currently have run 3-4 wu's without failing so far... fingers xxxxxd.
Installed Firefox and update to be able to post this. Don't like IE, only use it when I have to.
Haven't installed anything else yet, even my preferred firewall. Just relying on Windows one, the router, and my AV at the mo.
Testing In Progress..... (but looking promising so far!)
regards, Gizbar.
-
Ok. Still not perfect.
Had a blue screen overnight (1st blue screen ever on Win7!) but hadn't stopped it rebooting after an error, and can't tell what it's failed on yet. But it did go for over 4.5 hours this time which is unprecedented in this story so far.
Will carry on troubleshooting, but it might mean a voltage boost for something along the line. Have to work today, so won't be about 'til later.
regards, Gizbar.
-
Gizbar
Monitor the CPU temps. Win 7 seems to be a bit more sensitive to over-temperature, in my experience...
-
Update:- 12 more hours on, no crashes. Run 4 wu + 1 cuda units all day, and everything still cooking on gas when I got home. Still playing a wait and see game, but many thanks to all suggestions, ideas, and solutions.
@Pappa - Thanks for the suggestion, but I was monitoring temps pretty closely anyway. Room temp was about 22c (72f) while testing, cpu was sitting around 50-52c (o/c from 3.2 to 3.6) that's approx 122-126f, and the GPU was going from approx 73-78c (163-173f in old money). Was using Coretemp64 for cpu, and GPU-Z for Gfx card. Have to clean the dust out of the filters every 2-3 weeks anyway. It makes a big difference, and does stop a lot of it getting into the pc. I've got an Antec 902 case and it does have good airflow. I can increase it as well if need be, as it's not running full speed yet. I've had a lot of trouble with overheating in the past, and the Antec 902 and another Antec 300 have been the best cases I've ever had. Not saying they're perfect, but the best I've had, and whatever cases I get in the future will always have a 'blowhole' fan to exhaust hot air out of the top of the case.
regards, Gizbar.
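
The Celsius-to-Fahrenheit figures quoted above can be checked with the standard conversion formula. This is only a sketch; the readings are the ones given in the post and the dictionary layout is made up for illustration.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Readings quoted in the post, in degrees Celsius.
readings_c = {
    "room": [22],
    "cpu": [50, 52],
    "gpu": [73, 78],
}

for label, values in readings_c.items():
    converted = ", ".join(f"{celsius_to_fahrenheit(v):.1f}" for v in values)
    print(f"{label}: {converted} F")
```

The results land within a degree of the approximate values given in the post.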
-
24 hours on, no blue screens, no crashes, and am slowly re-installing all my programs. Everything seems ok now.
regards Gizbar.
-
I have 5 systems running Win7 x64 with BOINC 6.10.25 (64 bit). I am using Nvidia 195.62 drivers. All seem fine. I was running 191.07 drivers before without any issues. I don't OC my hardware as they run hot enough as it is.
If you don't want to live quite on the bleeding edge use 6.10.18 which is fairly good and won't preempt the cuda tasks. There is a 64 bit multi-beam app in the downloads area and you'll find its faster than the 32 bit app for CPU work. The cuda app is still 32 bit.
-
Thanks for the info, MarkJ. I have a stack of work ready to report/upload due to the enforced power outage at Berkeley that started yesterday. I don't want to trash my upload queue by upgrading it shoddily. How can I upgrade without disturbing all this work?
regards, Gizbar.
-
Boinc Site is now back up after the power outage, and even though I've selected 'all versions' on the home page, it is still only giving me version 6.10.18 as the recommended version and 6.10.24 as the development version... where, oh where is 6.10.25? :)
-
http://boinc.berkeley.edu/dl/?C=M;O=D
Claggy
-
Thanks Claggy! ;D
All I have to do is run my queue down and then start upgrading. Have already gone to Nvidia 195.62 current recommended driver. Off to bed now as have to be getting up for work in 6 hours... Boo!
regards Gizbar.
-
For non-Fermi cards the 195.x driver has worse performance than 191.x.
-
Thanks Raistmer. I didn't know that. I saw that it was the current Nvidia whql version and installed it. I'll see how it's doing, but I can quite easily roll back to the previous driver if need be.
regards, Gizbar.