Seti@Home optimized science apps and information
Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: _heinz on 18 Jun 2011, 01:34:50 pm
-
Hi all,
if you have problems and errors with x38g please post here.
heinz
~~~~~~~~~~~~~
<core_client_version>6.12.26</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GT 540M, 961 MiB, regsPerBlock 32768
computeCap 2.1, multiProcs 2
clockRate = 1500000
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GT 540M is okay
SETI@home using CUDA accelerated device GeForce GT 540M
Priority of process raised successfully
Priority of worker thread raised successfully
Cuda Active: Plenty of total Global VRAM (>300MiB).
All early cuFft plans postponed, to parallel with first chirp.
) _ _ _)_ o _ _
(__ (_( ) ) (_( (_ ( (_ (
not bad for a human... _)
Multibeam x38g Preview, Cuda 3.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 2.592522
CUFFT error in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_fft.cu' in line 125.
Cuda error 'cudaFree(dev_PowerSpectrumSumMax)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 522 : unknown error.
Cuda error 'cudaFree(dev_outputposition)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 524 : unknown error.
Cuda error 'cudaFree(dev_flagged)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 526 : unknown error.
Cuda error 'cudaFree(dev_NormMaxPower)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 528 : unknown error.
Cuda error 'cudaFree(dev_PoTPrefixSum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 530 : unknown error.
Cuda error 'cudaFree(dev_PoT)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 532 : unknown error.
Cuda error 'cudaFree(dev_GaussFitResults)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 534 : unknown error.
Cuda error 'cudaFree(dev_t_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 536 : unknown error.
Cuda error 'cudaFree(dev_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 538 : unknown error.
Cuda error 'cudaFree(dev_WorkData)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 540 : unknown error.
Cuda error 'cudaFree(dev_flag)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 542 : unknown error.
Cuda error 'cudaFree(dev_cx_ChirpDataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 546 : unknown error.
Cuda error 'cudaFree(dev_cx_DataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 548 : unknown error.
Cuda sync'd & freed.
</stderr_txt>
]]>
-
How many of these have you had Heinz ?
CUFFT error in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_fft.cu' in line 125.
Looks similar to something that crops up on my GTX260 from time to time. Too early in my investigation to say for sure what causes it, as it only seems to happen sometimes & only on certain GPUs. I'm currently digging at the chirp directly preceding those calls, and haven't come across an issue that could cause it there, but I'll keep my eyes out.
-
So far it works perfectly and very fast.
On XP with a GTX 460 (266.58) I think this is 3 minutes faster than x32f.
Thanks! The 460's & 560ti's are showing to be very nice cards, looks like the choice for replacing my 260 if I need to. I believe they can do more yet.
-
Oops,
http://setiathome.berkeley.edu/forum_thread.php?id=63429
-
Thanks,
I've been working via PM with Slavac ( http://setiathome.berkeley.edu/show_user.php?userid=9475661 , 2 x 560ti's ) to isolate how much of the problem might relate to something fixable in the application, and how much to something else ( i.e. drivers &/or hardware ).
He is reporting no downclocks with the new app & drivers, but still some of the FFT errors, usually a few or more per day, as with my P4 with GTX 260. That is why I suspect a deeper issue with either the drivers, library or SDK, but am trying to isolate it to something more specific & find out if there is anything in the surrounding code that could be antagonising the issue.
Nothing yet, but I'll keep searching & probably update the app or make other recommendations if I find a way to avoid them on those cards (including my 260).
-
I'm not sure how widespread the problem is, but I have been noticing 560Tis showing up in my results as giving -9s where I and another finished clean. Hadn't looked closely enough to figure out if it was just one or two of them, or if they were spread out over the whole type. Hope you find the problem as it looks like a really nice card otherwise. After reading that thread it seems like they have found a pretty good workaround.
-
Yeah upping the voltage slightly is a likely solution for those -9's, especially when running several tasks at once. Probably the manufacturers kept the voltage down to meet a power spec or something.
I'll likely know more in a few days, but it does look like there are timing sensitivity issues as well, possibly to do with memory controller load. I'm testing a build on my P4/GTX260 now that both gives more descriptive error messages, and guts & replaces a bunch of code inherited from nVidia dev work preceding the FFTs that fail. That 260 has steadily exhibited the issue, so has proven useful for exploring how to make the app hardier.
Prior to the code-gutting exercise, I did a quick test to look for the error code:
...
Multibeam x39 Preview, Cuda 3.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 2.592398
...
A FFT launch failed (try 1), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 2), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 3), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 4), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 5), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 6), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 7), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 8), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 9), code CUFFT_EXEC_FAILED = 0x6
A FFT launch failed (try 10), code CUFFT_EXEC_FAILED = 0x6
CUFFT error in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_fft.cu' in line 81. code CUFFT_EXEC_FAILED = 0x6
Googling CUFFT_EXEC_FAILED reveals common issues with this in the past on various cards, especially on GPUGrid, mostly with GTX 260's, so I have replaced the x39 on that host with a 'special' different x39 build that looks the same but changes a lot of code before the FFTs. I'll be watching that for a day or so, then determine whether there indeed was some crankiness in the drivers etc, or whether careful code leading up to the FFTs resolves the issue.
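The ten "try 1..10" lines in the log above suggest a simple retry-then-fail shape. Here is a runnable sketch of that pattern only, with hypothetical names and the real cufftExecC2C() launch stubbed out so it fails a few times; it is not the actual x39 source:

```cpp
#include <cstdio>

// Hypothetical error codes mirroring cuFFT's cufftResult values.
enum FftResult { FFT_SUCCESS = 0x0, FFT_EXEC_FAILED = 0x6 };

// Stand-in for cufftExecC2C(); in the real app this launches the FFT on the
// GPU. Here it fails a fixed number of times so the retry loop is exercised.
static int g_failures_remaining = 3;
FftResult launch_fft() {
    if (g_failures_remaining > 0) { --g_failures_remaining; return FFT_EXEC_FAILED; }
    return FFT_SUCCESS;
}

// Retry an FFT launch up to max_tries times, logging each failure the way
// the diagnostic x39 build does, before giving up with a hard error.
bool exec_fft_with_retry(int max_tries) {
    for (int attempt = 1; attempt <= max_tries; ++attempt) {
        FftResult r = launch_fft();
        if (r == FFT_SUCCESS) return true;
        std::printf("A FFT launch failed (try %d), code CUFFT_EXEC_FAILED = 0x%x\n",
                    attempt, r);
    }
    std::printf("CUFFT error: giving up after %d tries\n", max_tries);
    return false;
}
```

A transient glitch clears within the retries; a persistent one, as in the log above, exhausts all tries and surfaces as the hard CUFFT error.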
No more errors yet, touch wood, on the 260 since I changed from 'ordinary x39 with extra descriptive errors' to 'x39 with replaced code before FFTs'. The latest errors with x39 visible (for the moment) for that host are from the first kind of x39, so are part of attempting to track things down. I'll be looking for any after 18 Jun 2011 | 16:52:11 UTC for further diagnosis/investigation.
Jason
-
Hi Jason,
on my Atom/ION, x38g still runs on the CPU... it is at 10% after 5 hours, so ~50 hours runtime at the end.
Should I abort it ?
Maybe there is not enough memory available while initialising the GPU.
Some standalone tests with test WUs are necessary.
Have BOINC 6.10.60
NVIDIA GPU 0: ION (driver version 27533, CUDA version 4000, compute capability 1.1, 64MB, 35 GFLOPS peak)
And since I installed the driver, it shows the wrong value of 64MB
heinz
-
If I may write some words :)
It looks like the 560 Ti is a bit of a power-hungry beast, and some manufacturers think the card will operate well with lower voltage settings. In the case of SETI and other BOINC projects that is not true. I had the issue with my Gigabyte 560 Ti too, and until I raised the voltage to 1.0375V I was unable to get stable operation. With this voltage I can crunch 24/7 without any problems. So before you blame the app, SDK or drivers, do one of two things: either downclock your GPU to 820 MHz (stock frequency) or give the GPU more voltage, not below 1.025V. And then do your testing.
-
Hi Jason,
on my Atom/ION, x38g still runs on the CPU... it is at 10% after 5 hours, so ~50 hours runtime at the end.
Should I abort it ?
We know it has more memory than that, so stop Boinc, put in a driver that reports properly & reboot. The app should pick up properly when it sees enough RAM. That mechanism will change in the x39 series to say 'go away' if there isn't enough total, and do a Boinc temporary exit (a newer BoincApi feature; I'll have to update the BoincApi in use to have access to it).
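The planned x39 gate described above (refuse the GPU outright, deferring via a Boinc temporary exit, when total VRAM is insufficient) reduces to a simple decision. A sketch with hypothetical names, using a ~300 MiB threshold taken from the "Plenty of total Global VRAM (>300MiB)" message in the x38g stderr; in the real app the reported figure would come from cudaGetDeviceProperties() and the deferral from the newer BoincApi's boinc_temporary_exit():

```cpp
#include <cstddef>

enum GpuDecision {
    USE_GPU,   // enough total VRAM reported: run on the GPU
    DEFER,     // under-reporting driver etc.: temporary exit, retry later
};

// Hypothetical gate: the real figure would come from
// cudaDeviceProp::totalGlobalMem. The ~300 MiB floor matches the
// "Plenty of total Global VRAM (>300MiB)" message in the x38g stderr.
GpuDecision check_vram(std::size_t reported_bytes) {
    const std::size_t required = 300u * 1024u * 1024u;   // ~300 MiB
    if (reported_bytes >= required) return USE_GPU;
    return DEFER;  // x39 plan: boinc_temporary_exit() instead of CPU fallback
}
```

Under this gate, heinz's ION reporting 67108864 bytes (64 MiB) would land in the DEFER branch rather than silently falling back to ~50-hour CPU processing as x38g does.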
-
If I may write some words :)
Thanks Pepi, yup we tracked down the need for a voltage tweak for those in posts & it makes sense.
I'm now poking at a different kind of error that occurs on some GPUs *sometimes*. As per heinz's first report of an FFT error, and noting that my 260 sees the same, I have modified some code & the errors seem to have gone away on mine. That's the last major 'niggle' I've found under V6 operation so far, and my P4 with GTX 260 seems to have come good for ~24 hours of operation, which I'm keeping an eye on to see if it is really solved.
Jason
-
Hey Jason, we validated. You got the canonical result after the fourth guy got reported. :o
http://setiathome.berkeley.edu/workunit.php?wuid=757762089
In case anyone is wondering why I posted this, as far as I know it's the first where two of us were running the new installer. Jason G was the first and I came in third. We wondered why it didn't decide then to validate instead of sending out to another wingman.
-
Woohoo!, Yay me ;D
-
Hi Jason,
after 30 hours the ION ended the wu resultid=1956143930 (http://setiathome.berkeley.edu/result.php?resultid=1956143930)
have a look at it: -177 (0xffffffffffffff4f)
heinz
-
Yeah that underreporting of VRAM is a problem for you for sure:
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: ION, 64 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 2
clockRate = 1200000
setiathome_CUDA: device 1 not have enough available global memory. Only found 67108864
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device cannot be used
SETI@home NOT using CUDA, falling back on host CPU processing
It actually did what it's supposed to, which is a surprise, as I'm still probing at the memory initialisation sequence. Resolving why your ION only reports 64MiB should solve your issue.
Jason
-
Yeah that underreporting of VRAM is a problem for you for sure:
It actually did what it's supposed to, which is a surprise, as I'm still probing at the memory initialisation sequence. Resolving why your ION only reports 64MiB should solve your issue.
Jason
Version: 275.33 WHQL
Release date: 2011.06.01
the latest WHQL driver (http://www.nvidia.de/object/win7-winvista-32bit-275.33-whql-driver-de.html) and the driver before show wrong values, so I must go further back with the driver installation.
Isn't it a shame that the latest WHQL driver has such an error again. Blame nvidia. :'(
Or is it BOINC that shows the wrong value ?
heinz
-
Isn't it a shame that the latest WHQL driver has such an error again. Blame nvidia. :'(
Is that the verde drivers Heinz ? I will update my ION2 (Not currently running Boinc) & see what that says.
-
Isn't it a shame that the latest WHQL driver has such an error again. Blame nvidia. :'(
Is that the verde drivers Heinz ? I will update my ION2 (Not currently running Boinc) & see what that says.
It is not the Verde driver.
Verde is here --> http://www.nvidia.de/object/notebook-win7-winvista-275.33-whql-driver-de.html
As far as I know Verde is still for laptops.
My R3600 is not a laptop
heinz
-
My R3600 is not a laptop
try it anyway :D
[Edit:] downloading onto my netbook now
For the desktop listing there also seems to be a newer beta, will check out the release notes for that
Update: with verde 275.33, Boinc shows 434MiB VRAM on my ION2, so a bit less than the previous driver (That said 444MiB).
Could there be some BIOS aperture size or similar setting for you Heinz, that is limiting the reported memory ?
-
The verde driver did not install.
No compatible hardware found.
Now the latest, 275.33, is installed; in the NVIDIA control panel "Autosearch Updates" is ticked.
20.06.2011 18:24:20 NVIDIA GPU 0: ION (driver version 27533, CUDA version 4000, compute capability 1.1, 64MB, 35 GFLOPS peak)
Hmm... until now I have never looked in the BIOS of the R3600.
heinz
modify:
BOINC 6.10.60 shows:
28.04.2011 12:45:52 NVIDIA GPU 0: ION (driver version 27051, CUDA version 4000, compute capability 1.1, 306MB, 35 GFLOPS peak)
anyhow, a curious 306 MB ?
but this was the last working version (270.51 beta)
and this shows ~250
On the ION Boinc shows the driver
07.03.2011 16:36:18 NVIDIA GPU 0: ION (driver version 27032, CUDA version 4000, compute capability 1.1, 242MB, 35 GFLOPS peak)
-
Hi Jason,
Querying for a CUDA device is quite different with Optimus technology,
see http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_Developer_Guide_for_Optimus_Platforms.pdf
page 3
have a look !
heinz
-
hi Jason,
I installed latest nvidia beta driver on my laptop, 275.50-notebook-win7-winvista-64bit-international-beta
BOINC 6.12.26 shows:
20.06.2011 22:23:06 | | NVIDIA GPU 0: GeForce GT 540M (driver version unknown, CUDA version 4000, compute capability 2.1, 962MB, 172 GFLOPS peak)
275.33-notebook-win7-winvista-64bit-international-whql has the same issue.
In general, there must be something wrong in the driver detection of BOINC.
Still, I can run PrimeGrid.
heinz
-
... there must be something wrong in the driver detection of BOINC.
Good possibility. At some stage (when I'm bored) I'll look at how they detect that, to see if it uses driver APIs etc properly.
-
... there must be something wrong in the driver detection of BOINC.
Good possibility. At some stage (when I'm bored) I'll look at how they detect that, to see if it uses driver APIs etc properly.
I wonder if the driver is now reporting the real amount of RAM the ION has, and no longer reporting the extra system RAM the BIOS settings add,
Claggy
-
I wonder if the driver is now reporting the real amount of RAM the ION has, and no longer reporting the extra system RAM the BIOS settings add,
That would seem to break the WDDM driver model, which basically says you get what you're given. I would have thought issues on one ION should appear on another. I haven't looked whether there is a BIOS setting for mine, as it has 512MiB dedicated & Windows 7, so the system-shared amount is determined by TurboCache functionality. I'll probably do that.
As I recall, you're good at working out complicated stuff like NewCredit :D . You could, if you're bored at some stage, go through the document at http://msdn.microsoft.com/en-us/windows/hardware/gg487348.aspx to see if there's anything related, or especially anything I missed that I might need to know when working out the difference from the XP driver model (regarding the performance jump at various drivers & the XP-WDDM performance difference with older, simpler application code, etc.)
[Edit:] checked my ION Netbook, no video related settings there at all, oh well
Jason
-
Hi Jason,
with my ION R3600 I'm going back to driver 270.32
21.06.2011 10:06:40 NVIDIA GPU 0: ION (driver version 27032, CUDA version 4000, compute capability 1.1, 242MB, 35 GFLOPS peak)
If I look it up with "AIDA64 Extreme Edition" it shows:
Information list / Value
Video Adapter Properties
Device description NVIDIA ION
Adapter series ION
BIOS version 62.79.63.0.1
Chip type ION
DAC type Integrated RAMDAC
Driver date 20.02.2011
Driver version 8.17.12.7032 - nVIDIA ForceWare 270.32
Driver provider NVIDIA
Memory size 256 MB
Installed driver
nvd3dum 8.17.12.7032 - nVIDIA ForceWare 270.32
nvwgf2um 8.17.12.7032
nvwgf2um 8.17.12.7032
Video Adapter Manufacturer
Company name NVIDIA Corporation
Product information http://www.nvidia.com/page/products.html
Driver download http://www.nvidia.com/content/drivers/drivers.asp
Driver update http://www.aida64.com/driver-updates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now I will try x38g again
edit:
x38g is running on the GPU now. I'll report the result as soon as it's ready.
heinz
-
What we learn from this:
270.32 is not a WDDM driver, and BOINC shows 242MB.
The newer WDDM drivers are not detected properly by BOINC, and BOINC does not show correct VRAM values on any of my systems (ION R3600, and i3 with GeForce GT 540M).
BOINC on the i3 shows:
20.06.2011 22:23:06 | | NVIDIA GPU 0: GeForce GT 540M (driver version unknown, CUDA version 4000, compute capability 2.1, 962MB, 172 GFLOPS peak)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
"AIDA64 Extreme Edition" shows on my I3:
Information list / Value
GPU Properties
Graphics card nVIDIA GeForce GT 540M (Medion)
GPU codename GF108M
PCI devices 10DE-0DF4 / 17C0-10E2 (Rev A1)
Transistors 585 million
Process technology 40 nm
Die size 114 mm2
Bus type PCI Express 2.0 x16 @ x16
Memory size 1 GB
GPU clock (geometric domain) 750 MHz
GPU clock (shader domain) 1500 MHz
RAMDAC clock 400 MHz
Pixel pipelines 8
Texture mapping units 16
Unified shaders 96 (v5.0)
DirectX hardware support DirectX v11
Pixel fill rate 6000 MPixel/s
Texel fill rate 12000 MTexel/s
Memory Bus Properties
Bus type DDR3
Bus width 128 bit
Real clock 450 MHz (DDR)
Effective clock 900 MHz
Bandwidth 14.1 GB/s
Utilisation
GPU 99%
Memory controller 0%
Video engine 0%
nVIDIA ForceWare Clocks
Standard 2D GPU: 50 MHz, Shader: 101 MHz, Memory: 135 MHz
Low-Power 3D GPU: 202 MHz, Shader: 405 MHz, Memory: 324 MHz
Performance 3D GPU: 750 MHz, Shader: 1500 MHz, Memory: 900 MHz
GPU Manufacturer
Company name NVIDIA Corporation
Product information http://www.nvidia.com/page/products.html
Driver download http://www.nvidia.com/content/drivers/drivers.asp
Driver update http://www.aida64.com/driver-updates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
hi Jason,
I installed latest nvidia beta driver on my laptop, 275.50-notebook-win7-winvista-64bit-international-beta
BOINC 6.12.26 shows:
20.06.2011 22:23:06 | | NVIDIA GPU 0: GeForce GT 540M (driver version unknown, CUDA version 4000, compute capability 2.1, 962MB, 172 GFLOPS peak)
275.33-notebook-win7-winvista-64bit-international-whql has the same issue.
In general, there must be something wrong in the driver detection of BOINC.
Still, I can run PrimeGrid.
heinz
My desktop motherboard has a GPU as well. I use the motherboard GPU for everyday work, and the GTX 560 Ti only for crunching. When nothing is attached to the 560 Ti I get the same message as you: driver version unknown.
But the 560 Ti works without problems in SETI.
-
Hi Jason,
all 3 wu's are done now on the ION (driver 270.32)
hostid=5510631 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5510631)
resultid=1956143932 (http://setiathome.berkeley.edu/result.php?resultid=1956143932)
resultid=1956143934 (http://setiathome.berkeley.edu/result.php?resultid=1956143934)
resultid=1956143936 (http://setiathome.berkeley.edu/result.php?resultid=1956143936)
Cuda Active: All 15 paranoid early cuFft plans succeeded.
What does it mean ? That all 15 can be used ?
Wondering about the following:
<core_client_version>6.10.58</core_client_version>
reports:
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
Flopcounter: 42371081878.117691
Spike count: 30
Pulse count: 0
Triplet count: 0
Gaussian count: 0
called boinc_finish
~~~~~~~~~~~~~~~~~~~~~~~~~~~
x38g reports:
Multibeam x38g Preview, Cuda 3.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 2.720454
Flopcounter: 10431952929543.820000
Spike count: 2
Pulse count: 0
Triplet count: 1
Gaussian count: 0
Worker preemptively acknowledging a normal exit.->
called boinc_finish
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
So no validation will happen for me.
heinz
-
As far as I can see, your results look O.K.
Your wingman is using the Fermi app and produces -9 errors.
So the unit will be sent to a third host.
-
Mike beat me to it. You have the same wingman on all three of those work units. He is running a 560TI and is apparently throwing out bad -9 results. Hope the next in line does better. You should get credit no problem on those.
-
Got one invalid result http://setiathome.berkeley.edu/workunit.php?wuid=761506607 Not much to it, I found a pulse the other two wingmen didn't. It was when I was running the 0.38e flavor. Thought I would mention it just in case. It's the only invalid result I've got so far.
-
Just keep an eye on it perryjay.
-
Got one invalid result http://setiathome.berkeley.edu/workunit.php?wuid=761506607 Not much to it, I found a pulse the other two wingmen didn't. It was when I was running the 0.38e flavor. Thought I would mention it just in case. It's the only invalid result I've got so far.
Yep, as mentioned on main, it looks like the single, likely low-power, pulse that you found where the others didn't would simply be due to the inaccurate old nVidia app chirp. So it fits the expected pattern. In science terms yours is 'more correct' of course, and would likely have matched a CPU app wingman strongly, but being ganged up on by 2 older apps that way is going to happen during the transition period.
Jason
-
Got one invalid result http://setiathome.berkeley.edu/workunit.php?wuid=761506607 Not much to it, I found a pulse the other two wingmen didn't. It was when I was running the 0.38e flavor. Thought I would mention it just in case. It's the only invalid result I've got so far.
Yep, as mentioned on main, it looks like the single, likely low-power, pulse that you found where the others didn't would simply be due to the inaccurate old nVidia app chirp. So it fits the expected pattern. In science terms yours is 'more correct' of course, and would likely have matched a CPU app wingman strongly, but being ganged up on by 2 older apps that way is going to happen during the transition period.
Jason
The one reported pulse doesn't fully explain the invalid judgement, since "weakly similar" merely needs half the signals to match. The task was VHAR, so there should have been a best_gaussian with all zero values, that's a gimme match. The reported pulse would be repeated as best_pulse, and if the difference were due to it being only a tiny bit above threshold that best_pulse should match the others close enough. And finally there would be a best_spike. IOW 1 dodgy pulse could have easily had 3 acceptable best_* signals to yield weakly similar. To get invalid 3 of the 4 must not have found a match in the other results.
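The arithmetic above reduces to a tiny predicate. A sketch with hypothetical names, and a deliberate simplification of the real SETI@home validator, which compares the signals' parameters rather than just counting matches:

```cpp
// "Weakly similar" rule as described above: a result is accepted when at
// least half of the signals compared find a match in the wingman's result.
bool weakly_similar(int signals_compared, int signals_matched) {
    // at least half must match (2 of 4, 3 of 5, ...)
    return 2 * signals_matched >= signals_compared;
}
```

For the VHAR task discussed, with 4 best_* signals in play, matching 3 of 4 would have validated despite the one dodgy pulse; to come out invalid, at most 1 of the 4 can have matched.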
OTOH, we have no way of knowing the result file didn't get corrupted server-side or something like that. However, I'd expect some indication from other users of similar problems in that case. It's a puzzle which cannot be solved now, just watch to see if it happens again with x38g.
The one on http://setiathome.berkeley.edu/workunit.php?wuid=762393888 is a loss as far as analysis goes, there's no stderr information from x38g.
Joe
-
OTOH, we have no way of knowing the result file didn't get corrupted server-side or something like that. However, I'd expect some indication from other users of similar problems in that case. It's a puzzle which cannot be solved now, just watch to see if it happens again with x38g.
Hmmm, the missing stderr information to me indicates a few possibilities. Either the improved exit code is not functioning as designed (due to system specific issue or other problem in the code itself), there is a communication issue of some sort (I suppose the server load could have some part there), or indeed the server itself lost that information. I've seen no indication that result files wouldn't follow the same behaviour as stderr contents.
I'm finding that as the cuda app issues get rarer, they do get harder to diagnose when they appear. One thing that is noticeable is that users are finding their errors & inconclusives more quickly now that the web pages display in categorised form ;D
-
I noticed the missing stderr not only for my result but also one of the others as well. I didn't think it would do you much good that way but from Jason's comment I guess I should have mentioned it here too.
-
Well, I woke up this morning to another downclocking. I had noticed last night a general sluggishness to my computer but decided not to reboot. Guess I should have. I tend to leave everything running when I quit the computer so I would guess it just built up until something had to give. I don't think it was downclocked for very long so I didn't lose too much. After a reboot everything is back and running good.
EDIT I spoke too soon. It down clocked again. I've rebooted again and it is back up to where it is supposed to be. Guess I will see if it will hold this time. Gotta go cut the grass so I will be away for about an hour. Hope it doesn't go down in that length of time.
-
Can you catch a task name that's in progress when it does it next time ? When the result is uploaded we could then see if the stderr says anything useful.
-
This is one that took forever, not sure if it's the one you want. http://setiathome.berkeley.edu/workunit.php?wuid=765100017
Here's another one that completed and validated http://setiathome.berkeley.edu/workunit.php?wuid=765100083
It seemed to affect my CPU times too, but that is hard to tell for sure. This one http://setiathome.berkeley.edu/workunit.php?wuid=766957670 seemed to be way too long. I was finishing a couple of APs at the time it happened, so I don't have many CPU tasks done. Since the APs were within an hour or so of completion it didn't affect their runtime by much, and I don't know exactly what time it happened the first time.
-
Thanks,
Clearly those runtimes indicate something freaked out. Despite that, there's no visible indication in stderr apart from the excessive runtime on the task report itself, which means I'll need to instrument every kernel launch to find out what's happening. That will take a few days to go through the whole code; then, if you're agreeable, I'll drag you into the dev area to pin down the exact point(s) of downclock. I'll do so by using a build instrumented to check for kernel errors and subsequently print the brand new, presumably downclocked, clock speed after the point of failure.
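Per-launch instrumentation of this kind is commonly done with a check macro placed after each kernel launch. A sketch of the shape described, with the GPU-side calls stubbed so the pattern itself is runnable; in a real CUDA build the stubs would be cudaGetLastError()/cudaGetErrorString() plus a driver clock query, and all names here are hypothetical:

```cpp
#include <cstdio>

// Stubs standing in for cudaGetLastError()/cudaGetErrorString() and a
// driver clock query, so the instrumentation pattern is runnable here.
static int  g_last_error = 0;                        // 0 == success
int         last_error()  { int e = g_last_error; g_last_error = 0; return e; }
const char* error_string(int e) { return e ? "unknown error" : "no error"; }
int         current_core_clock_mhz() { return 900; } // hypothetical query

// Check after every kernel launch: on failure, report where in the code it
// happened and the (possibly downclocked) core clock at that moment.
#define LAUNCH_CHECK(tag) do {                                             \
        int err_ = last_error();                                           \
        if (err_) std::printf("Kernel '%s' failed: %s, core now %d MHz\n", \
                              (tag), error_string(err_),                   \
                              current_core_clock_mhz());                   \
    } while (0)
```

A call such as `LAUNCH_CHECK("find_pulse");` after each launch would then pinpoint the failing kernel and print the new clock rate if a downclock-related failure occurs.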
Can you confirm (once again) that these are 'sticky downclocks' requiring a reboot to clear ?
Jason
-
:o The, the dev area???? Can I bring a gun?
Yes, they needed a reboot. Well, at least the first one did. I just went ahead and rebooted when I saw the one today. Figured it was the easiest way to get going again fast. So far this time everything is running okay again now.
-
:o The, the dev area???? Can I bring a gun?
Yes, they needed a reboot. Well, at least the first one did. I just went ahead and rebooted when I saw the one today. Figured it was the easiest way to get going again fast. So far this time everything is running okay again now.
OK, but we won't wait for it to downclock again to try something.
Please swap in the attached build, deliberately dialled back slightly for diagnostic purposes, while I spend the next few days instrumenting the code. If this one doesn't initiate downclocks on the card in the meantime, then it'll add some possibilities to the investigation, directing me to optimise a particular piece of code I've been hesitant to touch so far (that part remained stock until this dialled-back build).
(x39c, dialled back build attached for diagnostic purposes)
[Edit:] Old build removed. Please use the updated x39d build at:
http://lunatics.kwsn.net/12-gpu-crunching/x38g-reports.msg39407.html#msg39407
-
FYI: there is an easy way to swap in builds if you're confused by the app_info.
-
Okay, just to be sure, where does it go? Do I just replace all instances of <file_name>Lunatics_x38g_win32_cuda32.exe</file_name> in the app info or do I need to put it somewhere else?
Easy way? What's that? I've never seen such a thing. Nothing is easy for a n00b like me! ;D
-
Easy way? What's that? I've never seen such a thing. Nothing is easy for a n00b like me! ;D
Easy way:
- Stop Boinc
- Drop the new exe, unzipped, into the project folder
- edit the MBCuda.aistub file in notepad
- use the edit->replace function to replace all occurrences of x38g with x39c ,
- [change the counts too if desired]
- save & exit notepad
- run the aimerge.cmd batch file that resides in the project directory
- start Boinc & check task manager that x39c runs.
[Edit:] added mention of counts
-
Gawd I'm dumb. First I put the zipped file in, then I forgot the .exe at the end. Okay, now it's running the 39c build.
-
Gawd I'm dumb. First I put the zipped file in, then I forgot the .exe at the end. Okay, now it's running the 39c build.
Cheers. If something happens with that one we should hopefully get a little more info... If not, then it points straight to the code I dialled back for refinement. Either way, I'll be going through the whole lot, making things at least print the revised clock rate & location in the code if something detectable happens.
-
I can't say how long I'm going to have to run this. Like I said, I hadn't rebooted for awhile and everything had started to slow down before it down clocked. I'll just let it run and see.
-
Just noticed something Perryjay: The stderr task output indicates a core clock of 900MHz. Firstly, is that correct ? and what core voltage is that set to ? (assuming you have a monitor/OC tool such as MSI afterburner installed)
Jason
-
Yes, I'm OCed to 900/1800/1804 I have CPUID Hardware Monitor. The only voltage I find with that is VINO 1.11v. Is that what you mean? I can go looking for MSI Afterburner if you want.
-
Nah that's fine, thanks. Just mostly wanted to see if the clock was reporting correctly. Yeah 1.11V sounds like the core, and should be fine at 900MHz for that card, but it helps to have some reference if something turns up down the road.
Jason.
-
Well, I got Afterburner but it doesn't show current voltage in the little window. Guess you have to move the slide to show anything and that I am not going to do. ;D
Oh, sorry bout not mentioning the over clock. I have mentioned it so many times before I just figured you knew. Dumb move on my part again!
-
LoL, give us a screenshot before moving any sliders, if you could :D
-
Jason, does this new build do anything about clearing up the -12 issue? I just found this WU marked invalid, too many bugs. http://setiathome.berkeley.edu/workunit.php?wuid=764389127 I was the only one running the V 0.38g and the only one to complete it without getting -12.
-
Jason, does this new build do anything about clearing up the -12 issue? I just found this WU marked invalid, too many bugs. http://setiathome.berkeley.edu/workunit.php?wuid=764389127 I was the only one running the V 0.38g and the only one to complete it without getting -12.
AFAIK x38g and x32f have the same improvements relative to triplet handling, they allow 1 more than stock before committing suicide. Of the 3 triplets found, two must have been in the same array; stock fails on that, x3xx Lunatics doesn't.
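The allowance described above can be reduced to a small guard. A sketch with hypothetical names and reservation sizes (the real per-array reservations in the app differ):

```cpp
// Each power-of-two array reserves room for a fixed number of triplets.
// Stock errors out as soon as that reservation would overflow; the x3xx
// Lunatics builds tolerate one extra before giving the -12 style exit.
int triplet_overflow_check(int found, int reserved, bool lunatics_x3xx) {
    int limit = reserved + (lunatics_x3xx ? 1 : 0);
    return (found <= limit) ? 0 : -12;   // 0 = OK, -12 = overflow error
}
```

With a hypothetical reservation of 1 per array, two triplets landing in the same array fails on stock but passes on x3xx, matching the case described.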
Joe
-
Jason, does this new build do anything about clearing up the -12 issue? I just found this WU marked invalid, too many bugs. http://setiathome.berkeley.edu/workunit.php?wuid=764389127 I was the only one running the V 0.38g and the only one to complete it without getting -12.
AFAIK x38g and x32f have the same improvements relative to triplet handling, they allow 1 more than stock before committing suicide. Of the 3 triplets found, two must have been in the same array; stock fails on that, x3xx Lunatics doesn't.
Joe
Yep, that's Joe's extension, which continues to serve very well. Ultimately, I have a goal of converging CPU & GPU results as much as possible/practical/reasonable, even though the nature of floating point arithmetic & the hardware it is executed on pretty much guarantees some amount of variation when different algorithms are used for the same set of computations. That means several of the GPU kernels will end up being reengineered to some degree, and in the meantime can expose cross-platform limitations that were instilled in the original CPU code as well (as with spikes accuracy), due to no one having foreseen that vastly different hardware would one day be trying to match results.
This kind of juggling is proving to have annoying side effects for the interim period, though my hope is that when seti@home V7.x is released, the intermediate pain will have proved worthwhile, even if there are still wrinkles to iron out.
One thing to keep in mind, with classical control system redundancy techniques like this used in 'real' systems like aircraft, is that the redundancy is usually specified to have different authors & hardware manufacturers, and that they must agree within accepted variation. With the inconclusives & subsequent reissues we are seeing even between results that look pretty much the same by all externally visible features, we are seeing that validation mechanism 'working' as it should.
My current standing is that we are seeing legacy application limitations in combination with new hardware variations add up to 'a circus' of marginally close answers. I feel that the base design change decisions for legacy work & the intent to converge cross platform results moving into V7 will prove the right direction, though I am also certain that some new architectures present further difficulties yet to be divined.
Jason
-
Jason, in case you miss it in the NC forum, I've decided to go back to two at a time. Not long after I posted over there I started getting sluggish again. No down clock but everything running very slow. I shut down Firefox but no change, so I also shut down Thunderbird. Still nothing so I shut down SETI and closed BOINC manager. When I restarted BM and SETI everything came back to normal. I let it run for awhile with no problem but I get the feeling my little 450 doesn't like running three work units at a time 24/7. It seems to like to take a little break every now and again. I'll see how it likes two at a time again and let you know how it goes.
-
OK, no worries. Responded over there. If it happens with 2 as well we might have to dig at that too, though it's probably just related to things that need to be done next anyway.
-
Well over 24 hours now and everything is going along great. Guess I was just pushing the limit by running three at a time.
-
Well over 24 hours now and everything is going along great. Guess I was just pushing the limit by running three at a time.
OK. Keep an eye on things when you can. With things running a bit more smoothly, I am currently starting a rewrite of the problem pulsefinds once and for all (i.e. VLAR & display lag related). That's going to take time & care, but at least the experience garnered so far should see things get a lot better from this point, in terms of both reliability & performance.
Jason
-
For those following this thread & using the x39c diagnostic, please update to the attached build with some added diagnostic info printed on errors.
[Removed old build]
-
Got it Jason, now if only SETI would cooperate. I've just started getting the can't connect to server message when trying to upload. I hope it's Hurricane Electric working on the problem. But anyway, another day of no problems, seems dropping back to two tasks has cured my problem.
Okay, finally got some reports through. Here's the link to one of the validated WUs I finished on x39d just in case you wanted to look at it. http://setiathome.berkeley.edu/result.php?resultid=1967582734
-
In the meantime, I've noticed Cuda 'freaking out' here on the 480 with newer builds ... but only when FireFox is Running... weird. No Errors, but certainly seems to stick in some funky lag-mode.
I'm trying your solution of stepping down from 3 to 2 tasks. If that helps I'd take it as an indication that the loading presented by the newer builds is indeed substantially higher overall. I may have to retest which number of tasks gives the most throughput here, as 2 task loading seems to be >95% now. I wasn't expecting that to change until later when I get a bit more optimisation in... Will see.
[Edit:] Stepping down to 2 seems to have helped here too, will keep an eye on it for a while.
Jason
-
...No Errors, but certainly seems to stick in some funky lag-mode.
If you mean it seems to stick for a little while, I'm seeing that too. Mine seems to stick at around 96 to 98% and hold for somewhere around 30 seconds to a minute then pick up and run on to completion. I haven't tried it with firefox closed and I don't know exactly how often this happens.
-
If you mean it seems to stick for a little while, I'm seeing that too.
Yeah that, & only with firefox running, when I run 3 tasks at once. All fine so far with 2 tasks at once, but will periodically try to induce the behaviour.
I've just now upgraded to the newer Beta drivers (just to throw a confusing change into the mix). I will satisfy myself that all is operating normally with 2 tasks & heavy firefox usage, then try reproduce the behaviour with 3 tasks running. If it doesn't reoccur I'll pin it on something to do with 275.33 under heavy load, if it does then the increased load of updated firefox & newer apps.
[Update:] Back up to 3 tasks at once with the 275.50 beta drivers. No sign of weirdness yet, will thrash firefox tabs periodically & see what happens.
[Update2:] That didn't take long. Poking at firefox for 5 minutes switching between tabs repeatedly did induce the behaviour. Going back down to 2 to watch that setting again. It looks like we're creating a slightly heftier GPU load :D Oh well.
Jason
-
Running 39d on my 460, luckily it is not my main surfing computer and mostly just crunches.
-
In the meantime, I've noticed Cuda 'freaking out' here on the 480 with newer builds ... but only when FireFox is Running... weird. No Errors, but certainly seems to stick in some funky lag-mode.
I've noticed the same with Firefox, but only with either 4 or 5. The 3.6.x versions don't seem to have any effect on the builds.
If I try to open Firefox with the builds running (from around the mid-x38 builds), Firefox will hang for around 10 seconds then open, and the applications will slow for around 20 seconds, with GPU utilisation dropping from 93%+ down to ~87-88%.
Then as soon as I close Firefox, all is good again. Haven't seen this behaviour with either IE9 or Chrome as yet.
As I've been testing Raistmer's NV r521 build I thought it could be related to that, so wasn't exactly sure what could have been causing this.
-
I think we should petition MS & Mozilla to stop trying to use our precious GPU resources :P
-
I think we should petition MS & Mozilla to stop trying to use our precious GPU resources :P
So inconsiderate of them ;D
-
I think we should petition MS & Mozilla to stop trying to use our precious GPU resources :P
Heresy! There's no more important use of computing resources than web browsing!
I don't know which browser started the "hardware acceleration" thing, but the best we can hope for is that they provide an option to turn it off.
Joe
-
Ahah!,
firefox's built in about:support (http://about:support) page shows:
GPU Accelerated Windows1/1 Direct3D 10
on mine, now to find out how to disable that rubbish...
-
I think we should petition MS & Mozilla to stop trying to use our precious GPU resources :P
Heresy! There's no more important use of computing resources than web browsing!
I don't know which browser started the "hardware acceleration" thing, but the best we can hope for is that they provide an option to turn it off.
Joe
Agreed.
That's one of the reasons I stick with Firefox 3.
-
Found it. In Firefox 5.0 the setting is on the advanced options page. Unticking 'Use hardware acceleration when available' has resulted in about:support now showing:
GPU Accelerated Windows0/1
Cranking the 480 back up to 3 tasks :D
[Edit:] Found an equivalent looking setting in IE9's advanced options as well. ticked "Use Software Rendering Instead of GPU rendering" & restarted the browser as directed by the fine print. Hah! eat cpu cycles browsers :P
-
[Edit:] Found an equivalent looking setting in IE9's advanced options as well. ticked "Use Software Rendering Instead of GPU rendering" & restarted the browser as directed by the fine print. Hah! eat cpu cycles browsers :P
I switched that off on my laptop's 128MB 8400M GS almost as soon as IE9 came out, as it made the desktop very laggy when Collatz was running.
I was going to post today asking if Firefox 4 & 5 had a similar option, but an afternoon snooze got in the way ::)
Claggy
-
Okay, switched off and back to three for me too. Be interesting to see if I can hold up at this rate.
So much for that idea. Noticed my internet slowing, then heard my fans slowing down. Checked SETI and saw the to-completion time rising instead of falling. Checked EVGA Precision and saw my temp and fan speed were down, but it did not downclock. I went ahead and shut down the SETI client and BM, switched back to two at a time, and things are running smoothly again. This poor little GTS 450 1GB just can't handle three at a time.
One more little note, I did not shut down Firefox. I just made my changes to SETI and started it back up. Firefox is running better now too.
-
This poor little GTS 450 1GB just can't handle three at a time....
Oh! The penny has dropped. I've seen a VRAM utilisation blowout here & 3 tasks seems to be using way too much. Over 1.4 GiB VRAM used :o ::) That was unintentional & likely you'll be able to go back to 3 once I figure out what has happened there (& fix it). No way should we be using that much per task, and indeed a 1 GB card won't accommodate 3 strangely greedy instances.
-
While it ran, it ran good. I was only losing about a minute and a half over two at a time by running three. That's running shorties, I'm in the middle of the shorty storm right now. It would really be great if you find the problem and get us going again.
-
While it ran, it ran good. I was only losing about a minute and a half over two at a time by running three. That's running shorties, I'm in the middle of the shorty storm right now. It would really be great if you find the problem and get us going again.
Oh I'll find it alright :D There's some V7 issues to resolve as well, but I am a stickler for trying to shrink memory footprints, simply because I prefer computation over RAM. RAM's Slow ;) The chances of this weird build running on 256MiB cards is currently zero ;D
-
Here's what happened when I tried to run 3 at a time
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GTX 465, 993 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 11
clockRate = 1500000
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 465 is okay
SETI@home using CUDA accelerated device GeForce GTX 465
Priority of process raised successfully
Priority of worker thread raised successfully
Cuda Active: Plenty of total Global VRAM (>300MiB).
All early cuFft plans postponed, to parallel with first chirp.
) _ _ _)_ o _ _
(__ (_( ) ) (_( (_ ( (_ (
not bad for a human... _)
Multibeam x39d Preview, Cuda 3.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 2.589599
VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes
VRAM: cudaMalloc((void**) &dev_cx_ChirpDataArray, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 17825792bytes
VRAM: cudaMalloc((void**) &dev_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 17825800bytes
VRAM: cudaMalloc((void**) &dev_WorkData, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 27262984bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrum, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 31457288bytes
VRAM: cudaMalloc((void**) &dev_t_PowerSpectrum, 1048584x 4bytes = 1048608bytes, offs256=0, rtotal= 32505896bytes
VRAM: cudaMalloc((void**) &dev_GaussFitResults, 1048576x 16bytes = 16777216bytes, offs256=0, rtotal= 49283112bytes
VRAM: cudaMalloc((void**) &dev_PoT, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 55574568bytes
VRAM: cudaMalloc((void**) &dev_PoTPrefixSum, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 61866024bytes
VRAM: cudaMalloc((void**) &dev_NormMaxPower, 16384x 4bytes = 65536bytes, offs256=0, rtotal= 61931560bytes
VRAM: cudaMalloc((void**) &dev_flagged, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 66125864bytes
VRAM: cudaMalloc((void**) &dev_outputposition, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 70320168bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrumSumMax, 262144x 12bytes = 3145728bytes, offs256=0, rtotal= 73465896bytes
VRAM: cudaMallocArray( &dev_gauss_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=176, rtotal= 73474088bytes
VRAM: cudaMallocArray( &dev_null_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=72, rtotal= 73482280bytes
VRAM: cudaMalloc((void**) &dev_find_pulse_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 73482288bytes
VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 7864324bytes, offs256=0, rtotal= 81346612bytes
Thread call stack limit is: 1k
CudaThreadSetLimit() returned code
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00529977 read attempt to address 0x00000002
Engaging BOINC Windows Runtime Debugger...
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GTX 465, 993 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 11
clockRate = 1500000
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 465 is okay
SETI@home using CUDA accelerated device GeForce GTX 465
Priority of process raised successfully
Priority of worker thread raised successfully
Cuda Active: Plenty of total Global VRAM (>300MiB).
All early cuFft plans postponed, to parallel with first chirp.
) _ _ _)_ o _ _
(__ (_( ) ) (_( (_ ( (_ (
not bad for a human... _)
Multibeam x39d Preview, Cuda 3.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 2.589599
VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes
VRAM: cudaMalloc((void**) &dev_cx_ChirpDataArray, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 17825792bytes
VRAM: cudaMalloc((void**) &dev_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 17825800bytes
VRAM: cudaMalloc((void**) &dev_WorkData, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 27262984bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrum, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 31457288bytes
VRAM: cudaMalloc((void**) &dev_t_PowerSpectrum, 1048584x 4bytes = 1048608bytes, offs256=0, rtotal= 32505896bytes
VRAM: cudaMalloc((void**) &dev_GaussFitResults, 1048576x 16bytes = 16777216bytes, offs256=0, rtotal= 49283112bytes
VRAM: cudaMalloc((void**) &dev_PoT, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 55574568bytes
VRAM: cudaMalloc((void**) &dev_PoTPrefixSum, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 61866024bytes
VRAM: cudaMalloc((void**) &dev_NormMaxPower, 16384x 4bytes = 65536bytes, offs256=0, rtotal= 61931560bytes
VRAM: cudaMalloc((void**) &dev_flagged, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 66125864bytes
VRAM: cudaMalloc((void**) &dev_outputposition, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 70320168bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrumSumMax, 262144x 12bytes = 3145728bytes, offs256=0, rtotal= 73465896bytes
VRAM: cudaMallocArray( &dev_gauss_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=176, rtotal= 73474088bytes
VRAM: cudaMallocArray( &dev_null_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=72, rtotal= 73482280bytes
VRAM: cudaMalloc((void**) &dev_find_pulse_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 73482288bytes
VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 7864324bytes, offs256=0, rtotal= 81346612bytes
Thread call stack limit is: 1k
Cuda Thread Limit was adjusted to 10k
boinc_exit(): requesting safe worker shutdown ->
Worker Acknowledging exit request, spinning-> boinc_exit(): received safe worker shutdown acknowledge ->
changed it back to 2 at a time and the task picked up and looks like it will complete successfully
MSI was reading 925 MiB with the 3rd task running (or trying to) & Boinc reports my card as having 993MB
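For anyone wanting to sanity-check footprints from dumps like the one above: the running `rtotal=` figure on the last VRAM line is the tracked per-task allocation, and it can be pulled out with a few lines of editor-supplied (hypothetical) parsing:

```python
import re

# Two representative VRAM lines from the stderr dump above (first & last)
stderr = """\
VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes
VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 7864324bytes, offs256=0, rtotal= 81346612bytes
"""

def per_task_bytes(log):
    # the running total on the last VRAM line is the task's tracked footprint
    totals = [int(m) for m in re.findall(r"rtotal=\s*(\d+)bytes", log)]
    return totals[-1] if totals else 0

tracked = per_task_bytes(stderr)
print(round(tracked / 2**20, 1), "MiB tracked per task")
```

The tracked total here is only ~78 MiB per task, far below the ~300 MiB per task implied by MSI's 925 MiB reading across 3 tasks, so most of the blowout evidently sat in allocations outside this accounting (cuFFT plans, context, and so on), consistent with the fix that followed.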
-
A Hah! Thanks! Will do some tests here :)
-
Try this one with 3 tasks, perryjay & ghost ( x39e attached, reduced footprint back to roughly normal, I hope)
[attachment deleted by admin]
-
Try this one with 3 tasks, perryjay & ghost ( x39e attached, reduced footprint back to roughly normal, I hope)
That's got it ;D
Now able to run 3 tasks at a time, with memory usage at 801MB, which is the same as x39d used running 2 tasks.
-
Sweet. Note to self: Lack of beer induces ID: 10t errors
-
I just checked my 460 and it showed that I had used up to 740 MB and is at 710 MB right now with 2 at a time.
Changing over to x39e now.
[edit]Looks like it is down to 516 MB now[/edit]
-
I'm here. Memory usage 826MB, GPU usage 94-99%, temp ~69 degrees. Fan sounds quieter but running at 70%. We'll see how it goes.
Okay, after just a few minutes little has changed. Temp has gone up to 72 degrees, and memory usage has gone down to 817MB. GPU usage has gone to 92 to 97%, staying right around 95% mostly. Here's hoping.
-
Checking in. Looks like I was a touch too late to snag the latest x39 build for testing.
Hoping this helps a bit, my 560ti's have been giving me fits.
-
Not too late. Have just been purposely burying test builds in this thread to limit distribution while still getting some wider testing. Will PM you a link to the newest (x39e) on Seti Main, in the post a few back. [Done, relayed that x39e should be more helpful in isolating any further problem at least, so fingers crossed it shows something obvious]
-
Hi Jason,
took x39e now for seti main on my GT540M (1GB), but get no work till now.
As soon as I have work, i will post again.
heinz
-
Cheers Heinz. If your error pops up as before then that'll be good for further diagnosis. If it doesn't well that'll be good too.
-
Hi Jason !
This 39e you posted is the fastest app on my system. It can do 4 WUs in parallel without any problems, but it gets stuck when the first of the four is finished and a new one needs to start :(
On the other hand, it works with much less memory usage than any of the previous releases. Now, as always, I will crunch at least 100 WUs to see how this app works.
Good work!!!
(http://i53.tinypic.com/2ev55vm.jpg)
-
This 39e you posted is fastest app on my system.
That actually surprises me, since I dialled back some things for diagnosis & refinement, and am not focussed on speed at this time. Gradually refining/fixing things I suppose may help 'real-world', as opposed to laboratory bench, speed as well, so I'll keep that in mind as things go further forward.
The memory footprint may yet end up anywhere from a little to considerably smaller. I'm not sure at this stage. 4 at a time is getting pretty eager though :D
Jason
-
I don't know what you are doing, but you're doing it well :) (whatever you do with this app) :)
-
Going for a fourth http://setiathome.berkeley.edu/workunit.php?wuid=766762437
I agreed with another running x38g while a stock 6.03 found an extra gaussian. The x38g was first, I was third. Shouldn't I have validated him?
But anyway, made it through the night with no problems at all to report. As to the comment about this being the fastest app yet, could it just be that it seems to load faster and we don't have that snag near the end of the WU anymore? Those two give us about a minute's advantage right there at least.
-
Going for a fourth http://setiathome.berkeley.edu/workunit.php?wuid=766762437
I agreed with another running x38g while a stock 6.03 found an extra gaussian. The x38g was first, I was third. Shouldn't I have validated him?
Yes, x38g on a GTX 460 and x39d on a GTS 450 really ought to be so close that an inconclusive comparison is nearly impossible. IMO the tiny likelihood of one of the reported or "best" signals being at a critical level should be much rarer than necessary to explain the number of inconclusives that are happening even between stock and the x3[8|9] builds.
Edit: Attaching the WU for that particular case. I have no way of comparing x38g to x39e unless someone else tests. I could do a CPU test, but won't unless CUDA testing seems to indicate it's needed.
Joe
-
As far as I'm concerned, x39d & e are different on those particular cards to x38g, and it's that family of 'newer' cards that brought us into the x39 diagnostic builds, trying to locate a specific issue with those GPUs (& some others).
My current suspicions are along the lines that x38g & earlier builds, on certain cards & drivers, can have some silent failures that, while not necessarily manifesting in obvious reportable count differences, can certainly lead to differences in the best signals.
With regard to the likelihood that some such hidden error exists: with x38g it's possible, while with x39d highly unlikely. In other words, while the computation codepaths are basically the same, the driver version & kernel reliability across GPUs is not, which is why we are running 'diagnostic' builds & not optimising for performance at this point.
Jason
-
Hey boss, just wanted to let you know, Raistmer, Claggy and Ghost made me do it!!! They ganged up on me! ::)
Only kidding, but I am running Raistmer's new app for APs on NVidia GPUs. Ghost said it was running okay on his with two MBs running at a time so I guess I will find out if three at a time will work. Haven't got any work for it yet but I'll let you know how it goes.
-
OK. If it pinches all the CPU from the Cuda app, starving it out, blame Raistmer.
-
Going for a fourth http://setiathome.berkeley.edu/workunit.php?wuid=766762437
I agreed with another running x38g while a stock 6.03 found an extra gaussian. The x38g was first, I was third. Shouldn't I have validated him?
Yes, x38g on a GTX 460 and x39d on a GTS 450 really ought to be so close that an inconclusive comparison is nearly impossible. IMO the tiny likelihood of one of the reported or "best" signals being at a critical level should be much rarer than necessary to explain the number of inconclusives that are happening even between stock and the x3[8|9] builds.
Edit: Attaching the WU for that particular case. I have no way of comparing x38g to x39e unless someone else tests. I could do a CPU test, but won't unless CUDA testing seems to indicate it's needed.
Joe
Here's a benchrun comparing x39e to x32f and x38g,
Edit: x39e was Weakly similar against x32f, but Strongly similar, Q= 99.96% against x38g
Edit 2: Did an x39d run too; x39d was Weakly similar against x32f, but Strongly similar, Q= 99.97% against x38g
Claggy
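For readers new to the 'Strongly/Weakly similar, Q=' jargon: the offline compare tools score how well two results' signal lists match. A toy editor's sketch of the idea follows; the tolerances and the strong/weak cutoff are invented here, and the real rescmp logic is considerably more involved:

```python
def compare(signals_a, signals_b, tol=1e-4):
    """Toy cross-validation: fraction of signals in list A that match a
    partner in list B within `tol` on (frequency, power), reported as a
    percentage Q. Cutoff and tolerance values are purely illustrative."""
    matched = 0
    for fa, pa in signals_a:
        if any(abs(fa - fb) < tol and abs(pa - pb) / max(pb, 1e-30) < tol
               for fb, pb in signals_b):
            matched += 1
    q = 100.0 * matched / max(len(signals_a), 1)
    verdict = "Strongly similar" if q > 99.0 else "Weakly similar"
    return q, verdict
```

Two identical lists score Q=100%; a result whose best gaussian drifted, as in the x32f comparisons above, drops below the 'strong' cutoff even when the raw signal counts agree.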
-
Looks like the chirp difference to me altered the best gaussian, as opposed to more recent x39 changes.
I'll run that one on AKv8b for a double precision CPU chirp reference comparison.
(Barring the mentioned reliability issues we're looking for, x38g & x39d/e should have matched one another in this case)
-
...
(Barring the mentioned reliability issues we're looking for, x38g & x39d/e should have matched one another in this case)
Claggy's x38g and x39e results did agree on the best_gaussian (and everything else), so that can't explain why Perryjay's result didn't get strongly similar against Phud's.
I expect the x32f best_gaussian (which was one of the reported gaussians) is more likely to match CPU results, simply because it has a considerable history of few inconclusives.
Joe
-
I expect the x32f best_gaussian (which was one of the reported gaussians) is more likely to match CPU results, simply because it has a considerable history of few inconclusives.
Well, testing that theory grabbing a AKv8b result to add to the collection (That's taking a while :D).
As no 'direct' Gaussian search modifications were made in x32f through x39e, I currently call the x38g chirp & some other kernels 'unstable' on certain cards under as yet undetermined conditions. If it turns out to be something simpler, then I'll be happy with that.
I haven't looked at the spikes' proximity to threshold, but given the known 6.03 limitations (which should show in my AKv8b result if a factor) then I think the 3-way circus on the live runs might go something like this:
x38g Vs 6.03 disagrees by spikes, with possible suspect chirp in x38g presenting effects
x39d Vs 6.03 disagrees by spikes
x38g Vs x39d, possible suspect x38g chirp (reliability)
So far we have seen mismatched gaussians between AKv8 & x32f with a full length test task from your FG set. I'm putting forward that the accuracy of those in the x38g one is repaired to match CPU by the chirp, but that instability created an issue in the live result not seen under bench, and that the majority of the remaining disagreement comes from the spikes.
-
I've taken out my GTX460 and fitted my 9800GTX+ and i'm in the process of running a bench comparing x39d and x39e against x32f and x38g,
Claggy
-
OK, apart from the fix for the VRAM blowout, x39e is identical to x39d, so you could shorten your test by one build if you wanted, though I suppose the extra run couldn't hurt to see if remaining stability issues show up, even though none seem to under bench (the frustrating part :))
-
------------
Running app : AK_v8b_win_x64_SSSE3x.exe -verb -nog
with WU : 27fe11ac.12560.9065.8.10.100.wu
Started at : 05:48:23.796
Ended at : 07:24:09.576
5745.740 secs Elapsed
5179.405 secs CPU time
Result : stored as ref for validation.
------------
Running app : Lunatics_x39e_win32_cuda32.exe -verb -nog
with WU : 27fe11ac.12560.9065.8.10.100.wu
Started at : 07:24:12.637
Ended at : 07:30:50.101
397.415 secs Elapsed
50.638 secs CPU time
Speedup : 99.02%
Ratio : 102.28 x
ref-AK_v8b_win_x64_SSSE3x.exe-27fe11ac.12560.9065.8.10.100.wu.res:-
Result : Strongly similar, Q= 99.74%
Attaching bench & result files for manual comparisons.... [Done, analysing]
-
Here is x39a, x39d, and x39e V6. I have been running 3 wu's at a time with x39a live without issues. I have only had one invalid wu, out of thousands. Driver 275.33
Steve
-
Here is x39a, x39d, and x39e V6. I have been running 3 wu's at a time with x39a live without issues. I have only had one invalid wu, out of thousands. Driver 275.33
Steve
Thanks Steve, yeah it's these 460's (and some others) that appear to be sensitive to something I'm abusing. We seem to be home 'n hosed with the 480s
-
That's what I gathered by reading the threads, but I wanted to throw in a test or two myself. Is there any particular app you would like me to run live, or is there any other comparison you would like me to run?
Steve
PS. I did back down my BCLK one click and eliminated my AP invalids.
-
x39e all the way ;D
-
Done!
Steve
-
Manual 27fe11ac.12560.9065.8.10.100 result cross comparison under bench conditions
Claggy's x39e (GTX460) Vs my x39e(GTX480) under bench conditions: Strongly similar, Q= 99.95%
My x39e (GTX 480) Vs AKv8bx64SSSE3x: Result : Strongly similar, Q= 99.74%
Claggy's x38g (GTX 460) Vs x39e(My GTX 480): Result : Strongly similar, Q= 99.97%
Claggy's x32f (GTX 460) Vs AKv8b( My e8400): Weakly similar. (Bodgy Best Gaussian)
Claggy's x32f (GTX 460) Vs x39e(My GTX 480): Weakly similar. (Bodgy Best Gaussian)
Tentative analysis: The known CPU app spikes issues are playing no part here. The bodgy best Gaussian is in x32f due to the inaccurate stock nVidia chirp (~48 bit precision emulated floating point); I've kept complete documentation on how I fixed that chirp in the alpha ivory beer tower. x32f & the stock code it came from are crap with highly chirp sensitive signals.
On the live runs, the x38g result likely didn't match the x39d result purely due to the known stability issues we are here to resolve on that class of card.
AKv8b found here:
Spike count: 8
Pulse count: 4
Triplet count: 0
Gaussian count: 2
with the live run CPU 6.03 guy finding:
Spike count: 8
Pulse count: 4
Triplet count: 0
Gaussian count: 3
Now a stock cuda 6.08 wingman has rocked up, STILL INCONCLUSIVE....LoL... ;D
Spike count: 8
Pulse count: 4
Triplet count: 0
Gaussian count: 2
IMO, all the results are broken in some way apart from the offline AKv8b ones & the x39d/e results.
Jason
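A quick illustration of why the '~48 bit precision emulated' chirp matters, as an editor's sketch with representative numbers rather than the app's actual code: the chirp phase pi*c*t^2 reaches the order of 10^6 radians over a ~107 second recording, so plain single precision loses a sizeable fraction of a radian:

```python
import numpy as np

fs = 9765.625            # MB subband sample rate, Hz
n = 1 << 20              # 1M samples, ~107 s of data
c = 50.0                 # representative chirp rate, Hz/s (illustrative)

t64 = np.arange(n, dtype=np.float64) / fs
phase64 = np.pi * c * t64 * t64                       # float64 reference

t32 = t64.astype(np.float32)
phase32 = (np.float32(np.pi * c) * t32 * t32).astype(np.float64)

# phase climbs to ~1.8e6 rad; float32's ~1e-7 relative error is then a
# sizeable fraction of a radian, wrecking sin/cos of the phase
err = np.abs(np.sin(phase32) - np.sin(phase64)).max()
print(err)
```

Since that error is a visible fraction of the chirped waveform itself, any single-precision shortcut in the chirp feeds straight into chirp-sensitive signals like the bodgy gaussian above, hence the double-or-emulated-precision fix.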
-
Here's the results of the 9800GTX+ run, all apps Strongly similar,
Claggy
-
Well, unfortunately the 9800 result has the stock-x32f-style bodgy gaussian against AKv8b so there will be something to look at deeper with the pre-Fermis (That gaussian drift against CPU results has been there for a long time though)
-
And here I thought I was just going to show something kinda interesting. I didn't know I was gonna start all this! :o
-
I thought it was the 560's having problems, I have not seen any problems from my little 460.
-
I thought it was the 560's having problems, I have not seen any problems from my little 460.
I think the 560's just have bigger problems,
Claggy
-
I thought it was the 560's having problems, I have not seen any problems from my little 460.
I think the 560's just have bigger problems,
and not all 460's & 560's are created equal either, whereas 480's like mine & Steve's are nVidia reference & likely as close to identical as they come, so relatively predictable. It's something that should not reflect in the results when the code is 100% 'right'.
Since some of the newer code in play is specifically written to exploit superscalar instruction level parallelism for maximum bandwidth on compute capability 2.1, i.e. 48 instead of 32 Cuda cores per multiprocessor with an extra warp scheduler, if something's being pushed to the limits execution-configuration-wise then it's going to show on those cards with harder constraints first.
Jason
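The per-multiprocessor core counts Jason mentions follow directly from the compute capability, which the stderr dumps in this thread also print. A small lookup tying the two together:

```python
# CUDA cores per multiprocessor by compute capability (Tesla/Fermi era)
CORES_PER_SM = {(1, 0): 8, (1, 1): 8, (1, 2): 8, (1, 3): 8,
                (2, 0): 32, (2, 1): 48}

def total_cores(cc, multiprocs):
    """Total CUDA cores from the computeCap / multiProcs stderr lines."""
    return CORES_PER_SM[cc] * multiprocs

# figures from the stderr dumps earlier in the thread
print(total_cores((2, 1), 2))    # GT 540M: CC 2.1, 2 multiprocessors
print(total_cores((2, 0), 11))   # GTX 465: CC 2.0, 11 multiprocessors
```

So the GT 540M's 2 multiprocessors at CC 2.1 give 96 cores against the GTX 465's 352, and kernels tuned for the 2.1 layout (48 cores, extra warp scheduler) hit their limits on different cards at different points.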
-
Hmm, I noticed the "Bodgy Best Gaussian" occurred at chirp rate ~63.9958 and the presumably correct one at ~79.3237. So while running the WU with stock 6.95 I periodically checked state.sah to see what would turn up near those rates. Before the first there was a fairly weak "best" captured at chirp -15.88619 with a score of 0.125732, then the bodgy with a score of 1.293129, and the final one had a score of 1.305961. I don't know how to evaluate the 0.98% difference between bodgy and final scores.
AK_v8b_win_SSSE3x.exe has just gotten to bodgy and calculates its score as 1.293091 which is close though I'd rather the first difference were in the 6th significant digit than the 5th.
Joe
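The 0.98% figure can be checked directly from the scores Joe quotes; this is plain arithmetic on the state.sah numbers above:

```python
bodgy = 1.293129   # "best" gaussian score at chirp ~63.9958 (stock 6.95)
final = 1.305961   # final best gaussian score at chirp ~79.3237

rel = (final - bodgy) / final * 100
print(f"{rel:.2f}%")                        # the ~0.98% gap Joe quotes

cpu = 1.293091                              # AK_v8b's score for the bodgy one
print(f"{abs(bodgy - cpu) / bodgy:.1e}")    # agreement only to the 5th digit
```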
-
AK_v8b_win_SSSE3x.exe has just gotten to bodgy and calculates its score as 1.293091 which is close though I'd rather the first difference were in the 6th significant digit than the 5th.
If you are too, I'm content to call this one "remaining chirp annoyances, with added yet to be divined nVidia Gaussfit implementation vagaries"
-
I don't know how to evaluate the 0.98% difference between bodgy and final scores.
Just a naive thought on how that could interact with slight chirp differences: If you take a fairly 'phat' gaussian (wider bandwidth bin leakage or similar effect) then stride it in slightly different angles, you will get a slightly different fit (shape) for very similar (but not the same) peak... could there be some substantial aliasing & could [recommending to the project] some windowed transforms improve that SNR (controlling 'lobing' in the frequency domain)?
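Jason's windowing suggestion is the textbook cure for leakage: a rectangular window's spectral sidelobes decay slowly, so an off-bin signal smears across the spectrum, while a Hann-type window trades main-lobe width for much lower sidelobes. A small numpy demonstration with a deliberately off-bin tone (editor's illustration, not project code):

```python
import numpy as np

n = 1024
k = 100.25                          # tone deliberately between FFT bins
x = np.sin(2 * np.pi * k * np.arange(n) / n)

def leakage_db(sig):
    """Peak-normalised power at bins well away from the tone."""
    spec = np.abs(np.fft.rfft(sig))
    far = np.delete(spec, np.arange(90, 111))  # drop main-lobe region
    return 20 * np.log10(far.max() / spec.max())

rect = leakage_db(x)                # rectangular (no) window
hann = leakage_db(x * np.hanning(n))
print(f"rect: {rect:.0f} dB, hann: {hann:.0f} dB")
```

The far-bin leakage drops dramatically with the Hann window, which is the 'lobing' control being alluded to, at the cost of a wider main lobe and hence some frequency resolution.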
-
Here's another one you guys might like to ponder http://setiathome.berkeley.edu/workunit.php?wuid=766762437 it gives a 6.03, 6.08, x38g, and my x39e plus another it has just been sent out to running optimized Linux. He should be reporting in soon.
-
Same one I reckon :D
-
May want to see this one too. http://setiathome.berkeley.edu/workunit.php?wuid=771931438 The first three to run it got -12s; I completed it, and a 6.03 completed with the same count I had. It still got sent out to another.
Am I finding what you guys are looking for or wasting your time? If these don't help let me know more of what I should look for. ;)
Here's a poor guy with a new GTX 590 throwing a bunch of -9s. http://setiathome.berkeley.edu/show_host_detail.php?hostid=6016350 He's showing 122 invalids with only 36 valid results. He's running x32f.
-
AK_v8b_win_SSSE3x.exe has just gotten to bodgy and calculates its score as 1.293091 which is close though I'd rather the first difference were in the 6th significant digit than the 5th.
If you are too, I'm content to call this one "remaining chirp annoyances, with added yet to be divined nVidia Gaussfit implementation vagaries"
Agreed, and I hope Crunch3r's 64 bit Linux build will resolve it. Otherwise that WU will end up in the very rare "Too many success results" category.
...
Am I finding what you guys are looking for or wasting your time? If these don't help let me know more of what I should look for. ;)
...
You're doing fine, we just wish the project were fully funded and had several technicians available to look into these cases. 8)
Joe
-
Got the same count as this 460 running x38g but still went to inconclusive.. http://setiathome.berkeley.edu/workunit.php?wuid=771572553
-
Got the same count as this 460 running x38g but still went to inconclusive.. http://setiathome.berkeley.edu/workunit.php?wuid=771572553
Yeah, at first glance that looks to me like x38g failing invisibly on something. As we've seen, they are numerically the same under bench conditions, even with marginal results like the previous chirp/gaussfit weirdo, yet no evidence appears in the output that anything went wrong.
I think x38g pushes the pulsefinding a touch too hard for some cards like the 460 & yours, causing some undocumented driver/kernel launch failures later in the process. x39e has that wound back a notch, and extra hardened launches with print & hard error outs before & after every Cuda call. That should make any repeat of what happens in x38g more obvious & descriptive. IOW: discount the x38g result as possibly bad on that wingman, and we'll have to thrash out x39e for problems so a more general update can be provided.
An interesting thing, if possible red herring, is that a 480 (running stock 6.10 cuda_fermi) failed with a Cuda incorrect function on that task (again without a lot of explanation). I know from my own & Steve's 480's that errors on these cards are exceedingly rare with all current builds .... So it's possible there is something funky going on with certain tasks as well.
Jason
-
Duh, right, me good x38g bad, got it! ;)
-
Hi Jason,
I've now switched to x39e on Seti main with my GT540M (1GB), but have received no work so far.
As soon as I have work, I will post again.
heinz
I got 5 tasks.
2 are conclusive,
one against SETI@home Enhanced v6.10 (cuda_fermi)
the other against SETI@home Enhanced v6.03
for 3 I have to wait.
hostid=6023152 (http://setiathome.berkeley.edu/results.php?hostid=6023152)
The only issue: every time a Seti WU starts, the machine downclocks my GT540M again.
I run it together with primegrid.
To get my OC'd values (750/900/1500) back I must restart the machine. EVGA Precision is not able to set the frequency again once it has fallen.
Still, a restart helps.
Standard clock is (672/900/1344)
30.06.2011 20:24:17 | | NVIDIA GPU 0: GeForce GT 540M (driver version unknown, CUDA version 4000, compute capability 2.1, 962MB, 172 GFLOPS peak)
BOINC 6.12.26(x64)
I have installed latest 275.50-notebook-win7-winvista-64bit-international-beta
-
...
The only issue: every time a Seti WU starts, the machine downclocks my GT540M again.
I run it together with primegrid....
OK heinz, I think that no other project has the application fixes for newer Cuda drivers yet. If you suspend primegrid for a while & verify you see no more downclocks, then if you want to run that project I'd suggest either going back to a Cuda 3.2 driver (until primegrid fixes their application) or encouraging them to apply the boincapi fixes needed in an application update.
Jason
-
...
The only issue: every time a Seti WU starts, the machine downclocks my GT540M again.
I run it together with primegrid....
OK heinz, I think that no other project has the application fixes for newer Cuda drivers yet. If you suspend primegrid for a while & verify you see no more downclocks, then if you want to run that project I'd suggest either going back to a Cuda 3.2 driver (until primegrid fixes their application) or encouraging them to apply the boincapi fixes needed in an application update.
Jason
Where can I download the boincapi fixes? If I get them, I can compile a new pg version to see whether that fixes the issue.
thanks
-
Where can I download the boincapi fixes? If I get them, I can compile a new pg version to see whether that fixes the issue.
thanks
pretty sure I gave you that info ... Looking...
[Edit:] found it:
http://lunatics.kwsn.net/5-windows/re-ap-blanking-experiment.msg39031.html#msg39031
-
May want to see this one too. http://setiathome.berkeley.edu/workunit.php?wuid=771931438 First three to run it got -12s I completed and a 6.03 completed with the same count I had. Still got sent out to another.
Last man finished. Three of us got validated. Last man also ran it as a 6.03.
-
Well, I'm back up & running with a spare 450W PSU myself, & had to put in the old 9600GSO to get operational again. The 750W one driving the GTX 480 seems to have bitten the dust, so it's out of action until I can RMA it.
Oh well, looks like crunching, development & everything else will be in slow motion for the time being ::)
-
Ouch!!!!
I am currently using my AMD Quad as my primary desktop since the iMac is just getting old and crash happy. I use my Q8200 machine as the music supplier.
-
What, you don't have a dozen or so extras laying around? Sorry to hear that Jason, hope you get going again real soon. Things are still running good here, no problems to report. I have got my first AP on GPU but will be awhile before I get to it. Hope nothing smokes here by trying to run it. I haven't changed any settings so it will have to fend for itself when it starts.
-
Hi Jason,
I'm now running Seti alone. Looks like the downclocking is gone.
One result looks very curious --> http://setiathome.berkeley.edu/result.php?resultid=1976573029
<core_client_version>6.12.26</core_client_version>
<![CDATA[
<stderr_txt>
</stderr_txt>
]]>
~~~~~~
nothing in stderr ?
-
Yeah, I'm convinced that's Boinc server or communications related somehow, & have seen empty stderr before with different applications. The newer exit code seems to have reduced the appearances of ones where stderr is truncated, but the whole thing missing isn't one I've been able to pin down to app or boinc client side yet. One day I would like to think of a way to locate the exact point in the system where the stderr contents (or other parts) go missing, whether it's somewhere on our end, the server, or somewhere in between.
Jason
-
What, you don't have a dozen or so extras laying around? ...
LoL, yeah I'm running on the spare now, which has been a marathon juggling exercise:
- 750W PSU apparently died,
- unplugged all hardware, including the GTX 480, for a test with the spare 300W PSU (success)
- checked the 750W again ... no go,
- harvested the 450W PSU from the lounge machine, replacing it with the 300W
- removed the GTX 260 from the lounge machine, replacing it with the old 9600GSO
- fiddled with drivers on that for ages to make sure it's running & crunching OK (seems to be)
- installed the 450W & another 9600GSO I had lying around in the main machine to get it operational
- got that crunching OK & went to sleep
Still to come: see if the 450W can manage to drive the GTX 260 harvested from the lounge machine; it's not likely it would manage the 480. Lots of tasks to crunch, & the 9600GSO may have trouble beating deadlines, LoL.
Then I'll see if the supplier will RMA the 750W PSU. Sheesh, best PSU I ever had (Seasonic X-750), runs stone cold & only lasts a year? ... oh well...
-
Sheesh, best PSU I ever had (Seasonic X-750), runs stone cold & only last a year? ... oh well...
Humm. Not what I wanted to hear. :o
A few weeks back I put together an i7-950 with the same exact PS...
-
Sheesh, best PSU I ever had (Seasonic X-750), runs stone cold & only last a year? ... oh well...
Humm. Not what I wanted to hear. :o
A few weeks back I put together an i7-950 with the same exact PS...
It's been an awesome PSU, never any sign of stress, or even getting warm. Hopefully just a freak one-off.
-
It looks like a fuse problem in the PSU, not the PSU itself :) But since you won't open the PSU (you'd lose the guarantee), you will never know :)
You can always look at the bright side of life :) What if all the other components had died, but not the PSU? That would be far more damage.
P.S. A little advice:
Don't mess with a lower-quality 450W PSU and a GTX 260. It is a power-hungry beast, and either the PSU or the GPU will be damaged if there is insufficient power.
Yesterday I finally got more than 120 WUs, so I can crunch normally for at least 24 hours :)
-
P.S. A little advice:
Don't mess with a lower-quality 450W PSU and a GTX 260. It is a power-hungry beast, and either the PSU or the GPU will be damaged if there is insufficient power.
LoL, I agree, but I've been trying to kill this GTX 260 & PSU (turns out, now that I look, it's a Thermaltake 470W, so not complete crap, but not enough for the 260) for a long time, so that I can justify getting a newer one for the machine it was in ;D...
Yesterday I finally got more than 120 WUs, so I can crunch normally for at least 24 hours :)
Besides, I have ~2400 tasks on this machine; if something dies trying to whittle that down I'll call it a noble sacrifice & give them a decent burial. I'll take your advice to heart & run it underclocked though ;)
The stupid part is that I have the skills & tools to fix the 750W unit, but, as you say, don't want to open it *sigh*
-
You're trying to kill a GTX 260?
It is a piece of cake: take 12V, put one cable (+ or -, it is irrelevant) on any gold contact of the PCI Express edge, and with the other make contact with all the other gold contacts.
That will kill it immediately :)
That is how I have killed some computer parts.
A PSU is hard to kill; it shuts itself down on any voltage irregularity :)
And for all of those WUs, make a backup of both BOINC directories regularly :) And put it on some other hard disc :)
-
The stupid part is that I have the skills & tools to fix the 750W unit, but, as you say, don't want to open it *sigh*
I hear you on that! If mine went belly up, I'd have a hard time not opening the case, and digging into it. I'm really sorry your supply died like that. You are right in that these supplies are the best you have ever owned. The 1200 Watt version I have is better than anything I have ever seen.
As far as x39e, or any of the other builds I have run live, I haven't experienced a single problem. These 480s are crunching away like jet engines, and have held up perfectly, even at an extreme overclock: 871 MHz vs 700 MHz stock.
Steve
-
As you know, I killed the second 1000W PSU with my V8 Xeon. Now in summer it is not possible to run the machine with 3 air-cooled 470/570s; a few days ago it was already 36 degrees Celsius outside, and the house where I live has no air conditioning. Still, a watercooled system and perhaps a 1200W PSU should be able to run the whole year continuously.
I must wait till late autumn to repair the machine. :'(
heinz
-
The stupid part is that I have the skills & tools to fix the 750W unit, but, as you say, don't want to open it *sigh*
My problem is I'm too impatient to wait for an RMA and don't have that many spare parts laying around that I can swap out. If something goes out on me I'm usually in it trying to fix it within an hour. That's about all the time I can manage to keep my hands out of it. :-X
-
i3, GT540M
Today I got a unknown error (http://setiathome.berkeley.edu/result.php?resultid=1977850597)
Preemptively acknowledging a safe Exit on error->
SETI@home error 1 Unknown error
(cudaAcc_CalcChirpData_kernel_sm13<<<grid, block>>>(cudaAcc_NumDataPoints, 0.5*chirp_rate, recip_sample_rate, dev_cx_DataArray, dev_cx_ChirpDataArray))
File: c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_CalcChirpData_sm13.cu
Line: 89
-
Thanks Heinz, I thought I was the only one to get those :D I may do some further profiling & investigation into the performance of that chirp... The indication of "Error on launch" is enough to say 'something broke here' and I have a couple of ideas that might eliminate those.
How many tasks do you currently run at once on that GPU ?
Please keep an eye out if it happens again.
Jason
-
How many tasks do you currently run at once on that GPU ?
Jason
I was running two tasks.
At the moment I have no work; several downloads are hanging around... I must be patient to get some tasks.
heinz
-
I've only got two errors, both -12s. This one is interesting though http://setiathome.berkeley.edu/workunit.php?wuid=770546468 Take a look at the one that did complete it. 23 triplets found??
-
I've only got two errors, both -12s. This one is interesting though http://setiathome.berkeley.edu/workunit.php?wuid=770546468 Take a look at the one that did complete it. 23 triplets found??
Sure! If ||| is one triplet, then |||| is 2 triplets, and ||||| is 5 triplets. Add a few more |'s in the same 'PulsePoT' and you get to 23 pretty quickly, which is bigger than the point where the nVidia code commits suicide.
I've yet to hear a good explanation of why ET might not think that's a good way to catch our attention, so I do want to rewrite that to eliminate the -12s. It's only relatively recently that I've felt my Cuda experience is getting to the point where I can consider that particular rewrite (among others), so reengineering it to match the CPU is on the list, though much lower down than some other issues.
Jason
-
The approach used in the OpenCL->CUDA build should never experience this problem. Maybe it's worth incorporating it in the "pure CUDA" build too.
-
We've got Jason sorted for a new PSU.
-
Great news Slavic.
Get shopping Jason!! ;D
-
The approach used in the OpenCL->CUDA build should never experience this problem. Maybe it's worth incorporating it in the "pure CUDA" build too.
Yeah, when I rewrite that I'll take yours into consideration & also apply the 'max bandwidth' approach that's been working well for me so far. As mentioned, I do want that issue gone, and as errors go it's becoming more common as hardware gets faster & the work noisier.
We've got Jason sorted for a new PSU.
Wow looks like I end up with enough to get the 'big one'. :) I'll let everyone know the good news on main & issue thanks as well.
-
Not unexpected, http://setiathome.berkeley.edu/workunit.php?wuid=766762437 which we looked at pretty closely ended with all 5 getting credit. The stock CUDA 6.08 got canonical so was strongly similar to CRUNCH3R's 6.01 CUDA for 64 bit Linux.
Joe
-
Thanks for keeping track of that one. Yep looks like we need to at least take a look at the gaussians then. I'm not happy with the outwardly clean looking 6.03 run being marginalised. While we can discount the legacy Cuda builds' accuracy for a mixed bag of reasons (& the spikes in 6.03 for that matter), and the x38g one for stability & the unexplored gaussfit code, it's going to be the marginal cases that'll show us where to look. They are going to get stranger I suppose.
Jason
-
X39e's up and running on two 560ti's. We'll see how she does.
-
Hi Jason,
I got one -9 result overflow: resultid=1981104197 (http://setiathome.berkeley.edu/result.php?resultid=1981104197)
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
I thought this was solved?
-
Hi Jason,
I got one -9 result overflow: resultid=1981104197 (http://setiathome.berkeley.edu/result.php?resultid=1981104197)
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
I thought this was solved?
That's quite normal, as long as your wingman finds too many signals too.
You were probably thinking of the error that happens when there are too many triplets.
Claggy
-
x38 was giving me quite a few invalids, thus killing my RAC. x39 seems to have killed the invalids issue, though I'm getting a few errors. After a few days of run time I'll post what errors I'm getting.
The x39 build also seems to have reduced the frequency of the downclocks I was experiencing.
Jason, should I be running a particular driver with the 560s?
-
According to Martin of Martin's lighthouse...
266.66 is the first release with 560 Ti support.
If that's any help. ::)
-
The 275.xx drivers should be a little faster, at least.
-
Got another invalid http://setiathome.berkeley.edu/workunit.php?wuid=765094268 I found 11 spikes; the other two only found three.
Oh, FYI, I've had a couple of driver restarts so I've cut back just a bit on my over clock. I'm now at 883/1766/1804. I'm also only running two at a time because of running Raistmer's app for the NV AP.
-
Got another invalid http://setiathome.berkeley.edu/workunit.php?wuid=765094268 I found 11 spikes; the other two only found three.
Nothing weird there, just 2 CPU apps with missed spikes due to inaccuracy ganging up on you. Looks like x39e is going well so far on that one.
-
So far so good. The invalids I was getting with 38 seem to be gone completely.
-
Here's one that might be of interest http://setiathome.berkeley.edu/workunit.php?wuid=763291020
-
Here's one that might be of interest http://setiathome.berkeley.edu/workunit.php?wuid=763291020
LoL. poor old x32f, he served us well.
-
Almost a month of run-time without any issues. Good work ;)
-
Is there any truth to the rumor that 39e runs at a lower RAC than 38?
-
Here's a pretty one... http://setiathome.berkeley.edu/workunit.php?wuid=762509253
-
Is there any truth to the rumor that 39e runs at a lower RAC than 38?
I believe it runs just a little bit slower than 38, but that is because it is trying to find some problems some of us were having. I don't know how much that is going to affect your RAC, since there are so many variables. With the new credit system you really can't tell, since it depends on your wingman's time too.
-
Got my first invalid on the 39 build. What I find interesting is that 38 was throwing all sorts of invalids at me, 39 is almost entirely invalid free. Lovely.
http://setiathome.berkeley.edu/result.php?resultid=1974204985
-
Is there any truth to the rumor that 39e runs at a lower RAC than 38?
I believe it runs just a little bit slower than 38, but that is because it is trying to find some problems some of us were having. I don't know how much that is going to affect your RAC, since there are so many variables. With the new credit system you really can't tell, since it depends on your wingman's time too.
It seems to be running 500-700 RAC slower on my machine than 38, though IMO that's such a small figure that it's not worth even mentioning. I was hoping to get some ammunition to belay the "39's horrible for your RAC, don't download it" garbage seen at SETI@Home.
-
Got my first invalid on the 39 build. What I find interesting is that 38 was throwing all sorts of invalids at me, 39 is almost entirely invalid free. Lovely.
http://setiathome.berkeley.edu/result.php?resultid=1974204985
You picked the wrong work unit. Your invalid was http://setiathome.berkeley.edu/result.php?resultid=1974204984 Looks like you had two gang up on you by not finding those 8 gaussians you found.
-
Got my first invalid on the 39 build. What I find interesting is that 38 was throwing all sorts of invalids at me, 39 is almost entirely invalid free. Lovely.
http://setiathome.berkeley.edu/result.php?resultid=1974204985
You picked the wrong work unit. Your invalid was http://setiathome.berkeley.edu/result.php?resultid=1974204984 Looks like you had two gang up on you by not finding those 8 gaussians you found.
Crap how do I link WU's properly?
-
You were almost there but you must have copied DCappello's work unit location by mistake. He was the second man on that string. I just open the link I want to copy and copy what's in the address bar. http://setiathome.berkeley.edu/workunit.php?wuid=772982715 gives you all the wingmen on that work unit.
-
You were almost there but you must have copied DCappello's work unit location by mistake. He was the second man on that string. I just open the link I want to copy and copy what's in the address bar. http://setiathome.berkeley.edu/workunit.php?wuid=772982715 gives you all the wingmen on that work unit.
Crud sorry about that. Pardon the new guy.
-
Here are a couple of errors I came upon with x38g on my 8600 GTS (256MB of memory, driver 266.58):
Find triplets return flags indicate an error (value: 1)
Last Cuda error code indicates: Success - No errors.
Cuda sync'd & freed.
Preemptively acknowledging a safe Exit on error->
SETI@home error -12 Unknown error
cudaAcc_find_triplets doesn't support more than MAX_TRIPLETS_ABOVE_THRESHOLD numBinsAboveThreshold in find_triplets_kernel
File: c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu
Line: 301
full details http://setiathome.berkeley.edu/result.php?resultid=1961527829
and
Cuda error 'find_triplets_kernel' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 276 : unknown error.
Unknown Error.
Cuda error 'cudaMemcpy(&flags, dev_flag, sizeof(*dev_flag), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 287 : unknown error.
Unknown Error.
Cuda error 'cudaMemset(dev_find_pulse_flag, 0, sizeof(*dev_find_pulse_flag))' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 1606 : unknown error.
Cuda error 'cudaMemcpy(&flags, dev_find_pulse_flag, sizeof(*dev_find_pulse_flag), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 1614 : unknown error.
Cuda error 'cudaMemcpy(PulseResults, dev_PulseResults, 4 * (cudaAcc_NumDataPoints / AdvanceBy + 1) * sizeof(*dev_PulseResults), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 1626 : unknown error.
Cuda error 'cudaAcc_transpose' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_transpose.cu' in line 73 : unknown error.
Cuda error 'cudaMemcpy(best_PoT, dev_tmp_pot, max_nb_of_elems * sizeof(float), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_pulsefind.cu' in line 1629 : unknown error.
Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_summax.cu' in line 234 : unknown error.
Cuda error 'cudaMemcpy(dev_PoT, dev_PowerSpectrum, cudaAcc_NumDataPoints * sizeof(*dev_PowerSpectrum), cudaMemcpyDeviceToDevice)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 482 : unknown error.
Cuda error 'NormalizePoT_kernel' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 499 : unknown error.
Cuda error 'cudaMemset(dev_flag, 0, sizeof(*dev_flag))' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 502 : unknown error.
Cuda error 'GaussFit_kernel' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 509 : unknown error.
Cuda error 'GaussFit_kernel' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 509 : unknown error.
Cuda error 'cudaMemcpy(&flags, dev_flag, sizeof(*dev_flag), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 513 : unknown error.
Cuda error 'cudaMemcpy(GaussFitResults, dev_GaussFitResults, cudaAcc_NumDataPoints * sizeof(*dev_GaussFitResults), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 524 : unknown error.
Cuda error 'cudaMemcpy(tmp_PoT, dev_NormMaxPower, ul_FftLength * sizeof(*dev_NormMaxPower), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 525 : unknown error.
Cuda error 'cudaAcc_transpose' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_transpose.cu' in line 73 : unknown error.
Cuda error 'cudaMemcpy(best_PoT, dev_t_PowerSpectrum, cudaAcc_NumDataPoints * sizeof(*dev_t_PowerSpectrum), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_gaussfit.cu' in line 532 : unknown error.
Cuda error 'cudaAcc_CalcChirpData_kernel2' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_CalcChirpData.cu' in line 113 : unknown error.
Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_summax.cu' in line 234 : unknown error.
Cuda error 'cudaFree(dev_PowerSpectrumSumMax)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 522 : unknown error.
Cuda error 'cudaFree(dev_outputposition)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 524 : unknown error.
Cuda error 'cudaFree(dev_flagged)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 526 : unknown error.
Cuda error 'cudaFree(dev_NormMaxPower)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 528 : unknown error.
Cuda error 'cudaFree(dev_PoTPrefixSum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 530 : unknown error.
Cuda error 'cudaFree(dev_PoT)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 532 : unknown error.
Cuda error 'cudaFree(dev_GaussFitResults)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 534 : unknown error.
Cuda error 'cudaFree(dev_t_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 536 : unknown error.
Cuda error 'cudaFree(dev_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 538 : unknown error.
Cuda error 'cudaFree(dev_WorkData)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 540 : unknown error.
Cuda error 'cudaFree(dev_flag)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 542 : unknown error.
Cuda error 'cudaFree(dev_sample_rate)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 544 : unknown error.
Cuda error 'cudaFree(dev_cx_ChirpDataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 546 : unknown error.
Cuda error 'cudaFree(dev_cx_DataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 548 : unknown error.
Cuda sync'd & freed.
Preemptively acknowledging a safe Exit on error->
SETI@Home Informational message -9 result_overflow
http://setiathome.berkeley.edu/result.php?resultid=1961505285
-
I have an interesting problem on a new system.
Any unit that runs longer than exactly 7 minutes and 40 seconds errors out like this:
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E
This is a "new" system: QX9650, Gigabyte EX38-DS5 MB, 4GB memory, 2 x GTX 580s, XP_32. The 266.58 drivers were upgraded to 275.33 without improvement, a new package was downloaded with the .exe and dll files refreshed, and BOINC was allowed more memory and disk space (just in case). Still no fix.
Shorties and higher-AR units get through OK. It's only when a unit hits the magic 7:40 barrier that the problem occurs. Unfortunately I haven't been able to download enough "long" units to see if this occurs with only one card installed.
Link to the errors page (http://setiathome.berkeley.edu/results.php?hostid=6099973&offset=0&show_names=0&state=5&appid=)
-
It's this part of the error that's important:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Seems that Boinc thinks these tasks should be running a lot quicker than they are, and times them out with the above error.
Have you tried Fred's Rescheduler to adjust the RSC_FPOPS_BOUND on your tasks? That should solve the problem for you.
Here's a link to Fred's site (http://www.efmer.eu/forum_tt/index.php?topic=428.0)
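For reference, what the Rescheduler adjusts lives in the <workunit> entries of client_state.xml (edit only while BOINC is stopped). The client aborts with "Maximum elapsed time exceeded" once elapsed time passes roughly rsc_fpops_bound divided by the device's benchmarked flops, so raising the bound gives long-running tasks room to finish. The task name and figures below are made up for illustration:

```xml
<workunit>
    <name><!-- task name here --></name>
    <rsc_fpops_est>35000000000000.000000</rsc_fpops_est>
    <!-- raise the bound (e.g. 100x the estimate) so slow tasks aren't aborted -->
    <rsc_fpops_bound>3500000000000000.000000</rsc_fpops_bound>
</workunit>
```

The Rescheduler does the same edit automatically, which is far safer than hand-editing.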
-
Duuh - of course THAT's what a 177 error is :P
Should have woken up, but I'd been chasing hardware problems in another box all day and wasn't thinking. It's a while since I had one.
Thanks Ghost
T.A.
-
I made two x38gs fight it out over this one.. http://setiathome.berkeley.edu/workunit.php?wuid=774709022 I ran mine on my CPU. :D
Another x39e and I didn't quite match on this one. Went out to a third running it as v5.28. http://setiathome.berkeley.edu/workunit.php?wuid=772279650
-
Here are a couple of errors I came upon X38g on my 8600 GTS (256Mb of memory, driver 266.58):
...
full details http://setiathome.berkeley.edu/result.php?resultid=1961527829
Looks like the regular, now rarer, -12. Future work will get rid of those entirely; they (the two kinds of -12) originate as hard-coded limitations in the original nVidia-supplied stock code.
http://setiathome.berkeley.edu/result.php?resultid=1961505285
Looks like a genuine failure of some sort, though x38g doesn't have as helpful output when those occur. x39e is somewhere in this thread... I'll look for that, or make a newer variant available as appropriate as soon as a few more things are understood about the newer build behaviour.
Jason
[Edit:] here's the link to the post containing the 7zipped main exe (only), meant to be added to an existing x38g installation, with suitable app_info edits (or aistub/aimerge).
http://lunatics.kwsn.net/12-gpu-crunching/x38g-reports.msg39472.html#msg39472
-
Just wanted to let everyone see what is happening with one of my errors... http://setiathome.berkeley.edu/workunit.php?wuid=770546468 5 -12s 1 -9 overflow.
Will it ever end? ;D
Oh, my other error shows 2 -9s and 2 -12s.
-
Will it ever end? ;D
Yes it will... The -12's will end altogether, as soon as the V7 autocorrelation code is stable, then I get time to look at & rewrite the triplet kernels (probably using a combination of Raistmer's opencl approach & my max bandwidth kernels)
Legacy builds disappear with the transition to V7, and newer builds are hoped to converge CPU & GPU results cross platform... So that puts an end to another bunch of weak similarity issues.
Long road... But I think when Joe's work, the project's polishing & all the builds are V7 compatible, then we could be looking for new sets of problems to solve... Or better yet, optimising again instead of troubleshooting & bugfixing.
We'd just better ask these hardware manufacturers to stop devising new stuff so we can catch up for a while ;)
-
Oh, it seems like since I cut back on my over clock I have been getting fewer inconclusives. We may have been chasing a problem on my side. Most all of my incons right now are caused by a wingman's 450 throwing out -9s. I PMd him but I don't know if it will do any good.
-
... Most all of my incons right now are caused by a wingman's 450 throwing out -9s. I PMd him but I don't know if it will do any good.
What's he running ?
-
Appears to be stock. He also has a 430 in that rig that is turning in good results. http://setiathome.berkeley.edu/result.php?resultid=1986202415
I have been teamed up with him on quite a few work units.
-
I've just come across this invalid result from one of my wingmen:
http://setiathome.berkeley.edu/result.php?resultid=1973794071
Claggy
-
I've just come across this invalid result from one of my wingmen:
http://setiathome.berkeley.edu/result.php?resultid=1973794071
Claggy
Well at least newer builds don't end up in that unhelpful cascade when things go awry. It looks like his original failure was arbitrarily in the preceding power-spectrum, which was already fairly hardened by x38g. Looking at his error list he may have a few other issues going on, with -177s, Cufft failures & possibly driver crashes.
-
Here's a nice one http://setiathome.berkeley.edu/workunit.php?wuid=777762756 the invalid one is running a GTX 570 with the old v12 mod. Will we lose these guys when we go to the new V7?
-
Will we lose these guys when we go to the new V7?
Yep.
-
Great, then maybe we can tell them to upgrade when they finally visit the forums to ask why they aren't getting any work! Oh, and the work units from my buddy with the bad 450 have started validating for me and kicking him out as invalid. Judging from the dozen or so times I was paired with him he must be kicking out hundreds if not thousands of those -9s. No reply from him to my PM. (Imagine that! ::) )
-
Woke up this morning to two -1 errors. http://setiathome.berkeley.edu/workunit.php?wuid=780336597 http://setiathome.berkeley.edu/workunit.php?wuid=779923870 . Don't know what happened but looks like my computer rebooted overnight. EVGA precision has a habit of starting after BOINC Manager has started so it doesn't catch the over clock. I have to restart BM and client to get it right again.
Well, something is happening. I just downclocked again. I'm going to set things back to .5/.5 and try again.
FYI, I picked up and ran an AP on my GPU last night. That could have been the problem. Everything is back up to speed after rebooting but my MB tasks on GPU are running high priority. That should settle down after a couple run here soon. I also have a few APs waiting so we shall see what happens.
-
FYI, I picked up and ran an AP on my GPU last night. That could have been the problem. Everything is back up to speed after rebooting but my MB tasks on GPU are running high priority. That should settle down after a couple run here soon. I also have a few APs waiting so we shall see what happens.
Something to watch: I have no idea if Raistmer included any boincApi fixes for modern Cuda 4.0 drivers, or if they'd be needed under OpenCL running as they most definitely are under Cuda. One way to find out would be to repeatedly exit Boinc (shutting down the AP while in progress, or just snooze/unsnooze etc.) & see if it triggers a sticky downclock or not. If so, then you'll just have to slap Raistmer around a bit to fix it. [I mean ask nicely... :D]
Jason
-
I'm over in his thread too. Seems it tried to start two APs at once a bit ago. After 10 minutes I had to suspend the second one as it hadn't even started yet. It was strange because even though I changed the count to .5 I had not changed the number of iterations. The only thing I can think of is the fact it was guesstimating the TTC as over 1600 hours and that just overrode everything else. Once I finish this first AP things should drop drastically and I will see what happens.
-
FYI, I picked up and ran an AP on my GPU last night. That could have been the problem. Everything is back up to speed after rebooting but my MB tasks on GPU are running high priority. That should settle down after a couple run here soon. I also have a few APs waiting so we shall see what happens.
Something to watch: I have no idea if Raistmer included any boincApi fixes for modern Cuda 4.0 drivers, or if they'd be needed under OpenCL running as they most definitely are under Cuda. One way to find out would be to repeatedly exit Boinc (shutting down the AP while in progress, or just snooze/unsnooze etc.) & see if it triggers a sticky downclock or not. If so, then you'll just have to slap Raistmer around a bit to fix it. [I mean ask nicely... :D]
Jason
Raistmer hasn't spoken about doing any api changes on any of his apps for Cuda 4 drivers. His apps also consume large amounts of CPU time when running with Cuda 4 drivers (when running on their own);
if there are other apps running, like CPU apps, they manage to claw back some of that CPU time, but I think the elapsed time of the OpenCL tasks suffers.
Claggy
-
Raistmer hasn't spoken about doing any api changes on any of his apps for Cuda 4 drivers. His apps also consume large amounts of CPU time when running with Cuda 4 drivers (when running on their own);
if there are other apps running, like CPU apps, they manage to claw back some of that CPU time, but I think the elapsed time of the OpenCL tasks suffers.
Hmm, Well most of the Cuda 4 driver issues with actual cuda code, so far, have turned out to be a consequence of the underlying OS/driver changes, and are remedied by changing boincApi to behave a bit more threadsafe [to get back some stability], and taking into consideration the underlying memory model changes [to get back some performance+ a bit extra in some cases].
I'd expect eventually they'd apply to OpenCL as well, but didn't expect that yet. Oh well, at least before it becomes a real problem for Raistmer on Ati/OpenCL, we should have most of the kinks ironed out & techniques improved on the Cuda side.
[Edit:] do newer Ati drivers 'appear' to be getting slower as well for old-school coding techniques ?
-
[Edit:] do newer Ati drivers 'appear' to be getting slower as well for old-school coding techniques ?
When the setup is working properly, no.
But there are some circumstances where some GPUs don't work well on a given setup.
Will try that out on my son's PC as soon as I find some time.
-
Raistmer hasn't spoken about doing any api changes on any of his apps for Cuda 4 drivers, his apps also consume large amounts of CPU time
[Edit:] do newer Ati drivers 'appear' to be getting slower as well for old-school coding techniques ?
Don't know, I don't actively run any native ATI CAL projects. I used to run Collatz, but don't anymore. I've found SDK 2.4 speeds up OpenCL work on my HD5770,
but Raistmer has found slowdowns/instability with Cat 11.6/SDK 2.4 on his HD69** when using his OpenCL apps.
Claggy
-
OK, thanks both, they could be at some intermediate stage with those too, making things look 'weird'. On the Cuda side I'll test/check a few things to do with the 'unified memory model' while ironing out some of the V7 code.
Jason
-
Well, I've decided to give Claggy's idea a try. I've set mine to .51 for the APs and .49 for the MB CUDA tasks. I hope it works for me. Right now I'm running two MB tasks so it will have to wait until I get some more AP work.
-
Raistmer hasn't spoken about doing any api changes on any of his apps for Cuda 4 drivers, his apps also consume large amounts of CPU time
[Edit:] do newer Ati drivers 'appear' to be getting slower as well for old-school coding techniques ?
I've grabbed a few Collatz_mini WUs; with Cat 11.6/SDK 2.4 my first WUs completed in 6 min 15 secs & 6 min 12 secs, down from an average of ~6 min 31 secs, which would have been done with Cat 10.xx over 6 months ago.
Claggy
-
Wish I could be more help, I'm still dealing with downclocking issues.
-
Starting the thread on NC should be of help. I've noticed a number of my wingmen with the 560Ti throwing out -9s and other bad results giving them an invalid or error. It would be good to see if it is something in the card itself causing it.
-
Here's a strange one http://setiathome.berkeley.edu/workunit.php?wuid=771323155 Check out computer 5257703. He's showing he has two GTS460s but all his GPU work is coming up no GPUs found.
Okay, looked a little closer, he's running driver version 258.96 with the x32f app. I tried to PM him but we will see if he responds.
-
Here's a strange one http://setiathome.berkeley.edu/workunit.php?wuid=771323155 Check out computer 5257703. He's showing he has two GTS460s but all his GPU work is coming up no GPUs found.
He probably installed Boinc as a service (protected application)
-
That's possible too. I didn't think of that.
I hope it's just the driver. He's running an I7 and cranking out a lot of good work on it. With those two 460s running he'll take off like a house afire. ;D
another interesting one http://setiathome.berkeley.edu/workunit.php?wuid=763277874 they tried three times to prove me wrong when I didn't match that first -9 and still refused to give me canonical. ;D
Another inconclusive one http://setiathome.berkeley.edu/workunit.php?wuid=782642742 He's running a 470 with x38g. He found 20 spikes, I found 12. Looking at his work I'd say I'll win.
-
another interesting one http://setiathome.berkeley.edu/workunit.php?wuid=763277874 they tried three times to prove me wrong when I didn't match that first -9 and still refused to give me canonical. ;D
Nice, the Cuda23 ones really dropped the ball with that task, the one that processed successfully missed a reportable gaussian the CPU apps & yourself picked up. That pretty much correlates with the chirp + gaussian weirdnesses we've been seeing, and need to investigate & understand more deeply. It is logical that the canonical was chosen as the AKv8 one, as it's likely 'in between' the 6.03 & yours, so representative. I also take it as a sign that x38g was moving in the right direction to make better matches to CPU apps on some types of results, but there is still work to do.
Another one inconclusive http://setiathome.berkeley.edu/workunit.php?wuid=782642742 He's running a 470 with v38g. He found 20 spikes I found 12. Looking at his work I'd say I'll win.
He could have power or cooling issues or anything, hard to say. Once some of the more urgent issues are solved, I'll start playing with nVidia APi to see if I can extract thermal / power / clock info . Decent info along those lines could paint a clearer picture in some cases, if some info could be printed to stderr.
Jason
-
Decent info along those lines could paint a clearer picture in some cases, if some info could be printed to stderr.
Will this be coming out in a paperback edition or just hardcover? ::)
-
Will this be coming out in a paperback edition or just hardcover? ::)
LoL, I was thinking of something along the lines of "Your GPU appears to be broken"
-
Here's a strange one http://setiathome.berkeley.edu/workunit.php?wuid=771323155 Check out computer 5257703. He's showing he has two GTS460s but all his GPU work is coming up no GPUs found.
He probably installed Boinc as a service (protected application)
I thought in that case BOINC doesn't see the GPUs at all, though Windows does?
Joe
-
Here's an oldie but a goody http://setiathome.berkeley.edu/workunit.php?wuid=730523283 I know the first one was showing a couple of no heartbeats but it looks like it finished and had the same count as everybody else. Wonder why he didn't get any credit?
-
I thought in that case BOINC doesn't see the GPUs at all, though Windows does?
Joe
I would have to check to be sure on that, but I believe everything appears normal under enumeration, except once in the application you can't initialise the selected device (cudaSetDevice() ) due to permissions or the device already being allocated to another (presumably the user) session somehow. If that is really the case, then the approaching move to the boinc_temporary_exit() feature should help out, indefinitely stalling the work.
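As a rough sketch of that stall-and-retry pattern (Python for illustration only; `init_device` and its callbacks stand in for `cudaSetDevice()` and `boinc_temporary_exit()`, and the 300 second delay is an invented example, not anything from the real apps):

```python
RETRY_DELAY_SECS = 300  # invented backoff interval

def init_device(set_device, dev, temporary_exit):
    """set_device stands in for cudaSetDevice(); returns True on success.
    temporary_exit stands in for boinc_temporary_exit(delay_secs)."""
    if not set_device(dev):
        # Device enumerated fine but can't be initialised (e.g. BOINC
        # installed as a service, GPU held by the user's session):
        # put the task back and retry later instead of erroring out.
        temporary_exit(RETRY_DELAY_SECS)
        return False
    return True

# Simulate the service-install case: enumeration OK, init refused.
events = []
init_device(lambda dev: False, 0, lambda delay: events.append(delay))
print(events)  # [300]
```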
Jason
-
Here's an oldie but a goody http://setiathome.berkeley.edu/workunit.php?wuid=730523283 I know the first one was showing a couple of no heartbeats but it looks like it finished and had the same count as everybody else. Wonder why he didn't get any credit?
The count itself has value mostly for approximation & guessing some things are about right, & other aesthetic qualities. Even the later 295's that processed successfully & became canonical with x38g look to be running on the edge. It has error results eerily matching the 560Ti's with insufficient core voltage &/or cooling (yes, at this stage it appears the 560Ti issues have been isolated as mostly attributable to those two primary factors).
The likelihood the original 6.03 result is broken is very high, given that a flaky looking 295 weakly matched your result... the final 6.03 that resolves the quorum sits 'the other side' of the 295 result from you. That x38g & x38e didn't perfectly match one another in this case is at first surprising until you include the multiple stability influencing factors that could be at play ... Just keep your own temps down & ensure sufficient core voltage etc, so you aren't 'the bad guy' :)
[Edit:] I've just had a brainwave that it may be helpful to add some indication of the number of reportable signals close to threshold, as we did with some astropulse bench testing a while back. I'll give it some thought.
-
I'm glad you got a handle on the 560Ti problem. Is there something you can do from this end or can you get the word out on how to fix it on the users end? I know I see a lot of my inconclusives coming from 560Tis so it sure would be nice for everybody concerned.
-
I'm glad you got a handle on the 560Ti problem. Is there something you can do from this end or can you get the word out on how to fix it on the users end? I know I see a lot of my inconclusives coming from 560Tis so it sure would be nice for everybody concerned.
I'm currently giving it some deep thought. There are special nVidia developer tools available through which I may be able to get temperatures & possibly voltages & clock rates. In future I could print lots of explanation to stderr & go into a temporary exit, or at least a failsafe mode of some sort, when things look really obviously bad. I think that after a long period of careful design, under certain more obvious circumstances it should be possible to choose either a hard error out, to induce reissue & avoid contaminating the science database, or, under some other known conditions, a temporary exit for some short period with a retry at some predetermined interval. We'll see; the 560Ti situation certainly raises these questions, and is no doubt a result of stock units being pushed far beyond reference nVidia specs.
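Purely hypothetically, the decision logic being mulled over might look something like this sketch (all thresholds, readings and names are invented for illustration, not from any real build):

```python
# Three possible outcomes for a health check, per the discussion above:
# hard error (forces reissue, protects the science database), temporary
# exit (back off and retry later), or keep running.
HARD_ERROR, TEMPORARY_EXIT, KEEP_RUNNING = "hard_error", "temp_exit", "run"

def failsafe_action(temp_c, expected_clock_mhz, actual_clock_mhz):
    if temp_c is not None and temp_c >= 100:
        # obviously cooking: error out so the task is reissued elsewhere
        return HARD_ERROR
    if actual_clock_mhz < 0.6 * expected_clock_mhz:
        # looks like a sticky downclock: stall and retry in a while
        return TEMPORARY_EXIT
    return KEEP_RUNNING

print(failsafe_action(71, 1800, 1800))   # healthy readings
print(failsafe_action(71, 1800, 900))    # half-speed: back off
print(failsafe_action(104, 1800, 1800))  # thermal runaway: reissue
```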
Jason
-
Got an invalid. I found 8 pulses the other two guys didn't. http://setiathome.berkeley.edu/workunit.php?wuid=781226722
-
Got an invalid. I found 8 pulses the other two guys didn't. http://setiathome.berkeley.edu/workunit.php?wuid=781226722
I have no immediate explanation for that one. Got a copy of the task by chance ?
[Edit:] I'll try get my updated offline bench suite updated & into public downloads at some point. Still wrestling with the fallout from juggling the new PSU etc, but should be under control soon.
-
No, sorry, I just found it on my tasks page.
-
No, sorry, I just found it on my tasks page.
Grab a copy while it's still there.
http://boinc2.ssl.berkeley.edu/sah/download_fanout/a4/08mr11ai.9455.10865.12.10.23
I'll get some easy up to date bench setup organised tomorrow or so, and also run here to see if 8 pulses turn up that shouldn't, and manually see how close they are to threshold.
Jason
-
http://setiathome.berkeley.edu/forum_thread.php?id=64837&nowrap=true#1129324 Is this known behavior? Does x38g indeed run slower on 26x.xx drivers?
-
http://setiathome.berkeley.edu/forum_thread.php?id=64837&nowrap=true#1129324 Is this known behavior? Does x38g indeed run slower on 26x.xx drivers?
The 275 drivers are a bit better with some kernels as I gradually apply some of the newer techniques. That will apply to different cards to different degrees, so it becomes a your mileage may vary issue as usual, until more of the kernels get 'upgraded' and full asynch operation is enabled down the line.
[Edit:] Like with perryjay's 8Xtra pulses task I just finished benching, the gradual accumulation of improvements adds up to quite a lot under the 275.50 beta; so much that I don't even care what old drivers do anymore....
Quick timetable
WU : 8XtraPulses_08mr11ai.9455.10865.12.10.23.wu
Lunatics_x32f_win32_cuda30_preview.exe :
Elapsed 494.422 secs
CPU 77.423 secs
Lunatics_x39f_win32_cuda32.exe :
Elapsed 407.459 secs, speedup: 17.59% ratio: 1.21
CPU 53.430 secs, speedup: 30.99% ratio: 1.45
Still investigating this particular task to see if perryjay broke it, or something else is going on....
[Edit2:] Bad news perryjay :( You broke that one somehow... I get agreement with your wingmen under bench:
Spike count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 0
Now we just have to figure out what could have gone wrong with yours....
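For reference, the speedup and ratio figures in the quick timetable above are just simple arithmetic on the elapsed/CPU times:

```python
# How the timetable's speedup/ratio figures are derived (nothing
# app-specific, just arithmetic on the two timings).

def speedup_pct(old_secs, new_secs):
    return round(100.0 * (old_secs - new_secs) / old_secs, 2)

def ratio(old_secs, new_secs):
    return round(old_secs / new_secs, 2)

# Elapsed: x32f 494.422 s vs x39f 407.459 s
print(speedup_pct(494.422, 407.459), ratio(494.422, 407.459))  # 17.59 1.21
# CPU: 77.423 s vs 53.430 s
print(speedup_pct(77.423, 53.430), ratio(77.423, 53.430))      # 30.99 1.45
```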
-
http://setiathome.berkeley.edu/forum_thread.php?id=64837&nowrap=true#1129324 Is this known behavior? Does x38g indeed run slower on 26x.xx drivers?
Yes, the x38 & x39 series are slower on the 267.xx drivers compared to the 275.50 drivers by some margin.
I can do a shortie in ~250 seconds on the 275.50 drivers, but this shoots up to around 290-300 seconds on the 267.xx drivers
-
Can I slap ghost? I had a great post all ready and when I hit post it gave me the message about a new post and ate my post!!! >:(
-
Can I slap ghost? I had a great post all ready and when I hit post it gave me the message about a new post and ate my post!!! >:(
LoL, for next time, when that happens you can usually use your browser's back button, & copy to the clipboard, then paste it into a new post again ;)
Slapping ghost could be difficult & potentially messy with all that ectoplasm, but fun to watch ;D
-
Now, what was I saying? Something about not worrying about a couple of bad WUs as long as they help someone else. Oh yeah, and I have tried going back to driver 267.59 to see how it does. I don't see much difference from the 275.33 driver I was using. I went back because I thought I was having trouble getting Raistmer's app and yours to play nice together, but it looks like WU series # 13mr11ag 13143.8656.xxxx is what is giving me trouble. Some are running in around 25 minutes and some are taking an hour and 25 minutes. They are throwing my time to completion all over the place.
-
... They are throwing my time to completion all over the place.
Hmmm, maybe it's time to think about reviving/extending my modified BOINC with per-application DCFs....
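A rough sketch of the idea (illustrative only; the rise-fast/decay-slow update rule here is an assumed scheme, not BOINC's actual code): keep a separate duration correction factor per application, so one app's wild runtimes don't skew every estimate.

```python
class PerAppDCF:
    """Toy per-application duration correction factor (DCF) table."""

    def __init__(self):
        self.dcf = {}  # app name -> correction factor, default 1.0

    def estimate(self, app, raw_estimate_secs):
        # corrected time-to-completion = raw estimate * that app's DCF
        return raw_estimate_secs * self.dcf.get(app, 1.0)

    def update(self, app, raw_estimate_secs, actual_secs):
        observed = actual_secs / raw_estimate_secs
        old = self.dcf.get(app, 1.0)
        if observed > old:
            # runtime blew out: jump straight up to the observed ratio
            self.dcf[app] = observed
        else:
            # runtime improved: decay slowly to avoid oscillation
            self.dcf[app] = 0.9 * old + 0.1 * observed

d = PerAppDCF()
d.update("multibeam", 1500, 5100)            # a "25 min" WU that took 85 min
print(round(d.estimate("multibeam", 1500)))  # -> 5100
d.update("astropulse", 30000, 27000)         # AP estimate stays untouched by MB
print(round(d.estimate("astropulse", 30000)))
```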
-
Can I slap ghost? I had a great post all ready and when I hit post it gave me the message about a new post and ate my post!!! >:(
LoL ;D
-
This is one of the long ones http://setiathome.berkeley.edu/result.php?resultid=2000429739 and this is one of the short ones http://setiathome.berkeley.edu/result.php?resultid=2000429745 No idea what is happening to the run time.
Oh, and all I had to do was hit the post button again. I hit my back button without checking to make sure the post had gone through.
-
This is one of the long ones http://setiathome.berkeley.edu/result.php?resultid=2000429739 and this is one of the short ones http://setiathome.berkeley.edu/result.php?resultid=2000429745 No idea what is happening to the run time.
Were you running the longer one alongside Raistmer's OpenCL APs ? You could try using Fred's priority thingy, or Process Lasso or similar, to jack up the priority on the Cuda app. Doing other stuff on the machine ?
You get any sticky downclocks still? How are the temperatures etc ?
Jason
-
Temp is 69C, no sticky downclocks, and I was running Raistmer's app at the same time for both of them. I'm only running one MB and one AP on my GPU, so it would finish like the short one then do the long one. That, or a couple of short ones in a row then a long one. I've mostly been changing the unroll on his app trying to find a sweet spot, but it doesn't seem to matter as to how fast or slow the MBs are.
-
I've mostly been changing the unroll on his app trying to find a sweet spot but it doesn't seem to matter as to how fast or slow the MBs are.
But it can matter. Different unrolls take different amounts of GPU memory. So, at the least, different memory layouts for the CUDA app. And with higher unrolls it can run short of memory...
-
I've been keeping an eye on the times as I make the changes. If I change while a WU is running, then I watch how fast it was running compared to what it does after the change. I've also made the change before the WU runs and compared it to another run before the change. The longer WUs seemed to pause for a minute or two even though the elapsed time kept climbing. It didn't seem to have anything to do with what the AP app was doing. This was without me doing any piddling around.
-
This one finally decided it had enough. http://setiathome.berkeley.edu/workunit.php?wuid=771323139
This one is interesting to me. http://setiathome.berkeley.edu/workunit.php?wuid=766554061 The first guy is running 32f but the third guy takes anonymous platform seriously. I can't figure out what version he is running.
-
This one is interesting to me. http://setiathome.berkeley.edu/workunit.php?wuid=766554061 The first guy is running 32f but the third guy takes anonymous platform seriously. I can't figure out what version he is running.
That is Stock 6.08, but running under Anonymous Platform,
Claggy
-
Another invalid for me. http://setiathome.berkeley.edu/workunit.php?wuid=781130909
-
Another invalid for me. http://setiathome.berkeley.edu/workunit.php?wuid=781130909
Hmm, you overflowed on that with pulses for no outwardly obvious reason. Watch those temperatures ;). I've decided to put some level of monitoring of that within future applications, so you'll be caught red handed cooking your GPU :D
-
Couple of things going on around that time. I believe I was still running my higher over clock and playing with Raistmer's app. Possibly running one of his and two of yours, or some such as that. Temps have always held pretty much to a safe range so I'm not too worried about that. Figure I pretty much did that invalid myself by fiddling around.
-
Cheers. As long as we have some idea what went on with complete evident failure like that ;). The 560ti's market penetration, coupled with the large number running them on the knife edge without realising as such, has me considering ideas for monitoring/control. We'll see.
Jason
[Next Day:] For reference, moved x39e Diagnostic build to a special category
Cuda diagnostic Builds (http://lunatics.kwsn.net/index.php?module=Downloads;catd=47) located under GPU apps in public downloads.
-
Here's a nice one.. http://setiathome.berkeley.edu/workunit.php?wuid=772821765 All that time and then get screwed by a 4.43 client. >:( Oh well.
And then there's this one.. http://setiathome.berkeley.edu/workunit.php?wuid=778238977 -12s galore !!!!
-
And then there's this one.. http://setiathome.berkeley.edu/workunit.php?wuid=778238977 -12s galore !!!!
hehehe, yeah all three stockers choked, oh well.
-
Another invalid http://setiathome.berkeley.edu/workunit.php?wuid=781127263
-
Another invalid http://setiathome.berkeley.edu/workunit.php?wuid=781127263
Hmm, stiffed on pulses. It would be good to grab that one for an offline run as well & see if it was your fault. Not a chance to look for it myself just now; if it's still there a bit later I will.
Jason
[Later:] A quick look I couldn't find it. What are your average GPU temperatures ?
-
Right now sitting at 69C. High seldom goes above 71C.
-
One more invalid. http://setiathome.berkeley.edu/workunit.php?wuid=781608186 I found two pulses the other guys didn't.
-
One more invalid. http://setiathome.berkeley.edu/workunit.php?wuid=781608186 I found two pulses the other guys didn't.
Its on your end then perry.
One of the wingmen running 0.38g.
Still overclocked ?
-
Still overclocked ?
Course he is, LoL. I'm going to fit temperature monitoring into the x40 series just for perryjay.... ;)
-
Still over clocked but cut back from the 900/1800 to 883/1766. Temps haven't been a problem. I'm at 73 right now but usually lower than that. Still, that's well within limits.
-
Still over clocked but cut back from the 900/1800 to 883/1766. Temps haven't been a problem. I'm at 73 right now but usually lower than that. Still, that's well within limits.
How's it hold up under OCCT 1 hour artefact scan at max complexity ?
-
Okay Aussie, what's that in English? ::) I'll have to google for OCCT whatsit scan and get back to you.
Ewwww, pretty, how do I work it? I guess I'll have to shut down BOINC while I run it huh? Okay, off to try to figure out how I can break it.
-
Okay Aussie, what's that in English? ::) I'll have to google for OCCT whatsit scan and get back to you.
http://www.ocbase.com/perestroika_en/index.php?Download
[Edit:] Here's the settings to use... Don't forget to abort if it starts to get really hot ;)
-
Figures you would come up with a different one than I found. I downloaded the one from EVGA. http://www.evga.com/articles/00530/Default.asp I'll try running it first then go get yours.
-
Similar purpose/usefulness. OCCT on max complexity just seems more hardcore to me..
-
Well, that was interesting. I ran the EVGA scanner for an hour. Afterward I checked its log file. According to it I am running an ATI 6850, and I had artifacts out the ying yang. It also downclocked me way below anything that should still be running. I don't think I'll try that again.
And..... I tried to get your version but it tells me my DirectX 9 is not up to date, and it blocks out the GPU OCCT and power supply tests. I've got DirectX 11.
-
And..... tried to get your version but it tells me my directx9 is not up to date and it blocks out GPU OCCT and power supply tests. I've got directx 11.
DirectX in your system will need an update. DirectX 9.0c update was published 18th April 2011.
http://www.microsoft.com/download/en/details.aspx?id=35
-
Well, that didn't take as long as I thought it would. I don't think I ever had DX9 on this machine. Hope it doesn't mess with anything else on here.
-
I tried OCCT on my laptop.
The "Monitoring" part did not open on my i3;
seems the program is missing some of the latest chipsets.
heinz
-
I ran it for 30 minutes at the default 0 shader complexity the first time and got a ton of errors. Then I remembered one of the FAQs that said EVGA Precision would cause errors, so I stopped it and ran it again for your 6 minute test with the shader complexity set to 8, with no errors.
https://picasaweb.google.com/lh/photo/uFQ_UCQC5Ra7BCZtozmVBSbSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/eG8e5yUOsUAFkZ-fg-BOGybSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/vTDakos6ogUKQLJp0zHSPCbSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/hU2Qv8QKDHix7nBTAEnPXibSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/9RYN9UQs5luTfj2CjnyhISbSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/0sLNi2Pdejctj_P4nYzlAibSZ_Aup0-RSRejz0fueJU?feat=directlink
https://picasaweb.google.com/lh/photo/yi4S9r03UuULaEICDGRMMibSZ_Aup0-RSRejz0fueJU?feat=directlink
-
OK,
looks like some optimised code can push things harder than previously thought for durations too short to detect with monitoring tools. That would explain a lot coupled with overoptimistic factory or user OC's, insufficient cooling or PSU. For now I'd advise backing off any OC if you see invalids not directly attributable to being stiffed by legacy wingmen apps. In the long run I may need to detect signs of instability, and devise a more purpose built stability check.
Stable for 6 mins OCCT on max complexity is good. An hour run would be more thorough, but probably get pretty warm.
Jason
-
Hope I did those links right. After a half hour on the first run it only got up to 85c so temp doesn't seem to be a problem.
-
This one took awhile to decide. http://setiathome.berkeley.edu/workunit.php?wuid=763346130
-
Starting to get errors on gtx 570:
http://setiathome.berkeley.edu/result.php?resultid=2027743944
http://setiathome.berkeley.edu/result.php?resultid=2027727112
http://setiathome.berkeley.edu/result.php?resultid=2027722678
http://setiathome.berkeley.edu/result.php?resultid=2027713834
http://setiathome.berkeley.edu/result.php?resultid=2027666731
http://setiathome.berkeley.edu/result.php?resultid=2027666700
CUFFT error in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_fft.cu' in line 125.
Cuda error 'cudaFree(dev_PowerSpectrumSumMax)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 522 : unknown error.
Cuda error 'cudaFree(dev_outputposition)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 524 : unknown error.
Cuda error 'cudaFree(dev_flagged)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 526 : unknown error.
Cuda error 'cudaFree(dev_NormMaxPower)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 528 : unknown error.
Cuda error 'cudaFree(dev_PoTPrefixSum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 530 : unknown error.
Cuda error 'cudaFree(dev_PoT)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 532 : unknown error.
Cuda error 'cudaFree(dev_GaussFitResults)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 534 : unknown error.
Cuda error 'cudaFree(dev_t_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 536 : unknown error.
Cuda error 'cudaFree(dev_PowerSpectrum)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 538 : unknown error.
Cuda error 'cudaFree(dev_WorkData)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 540 : unknown error.
Cuda error 'cudaFree(dev_flag)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 542 : unknown error.
Cuda error 'cudaFree(dev_cx_ChirpDataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 546 : unknown error.
Cuda error 'cudaFree(dev_cx_DataArray)' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 548 : unknown error.
Cuda sync'd & freed.
-
Eeh gads. If that's started recently then something's come unstuck. Please reboot & report temperatures. I can see your driver is 280.19. If something let go, I'd like to find out what.
Jason
-
It all started yesterday - GPU load was 99% but temp was under 40C. I checked with GPU-Z, CUDA-Z and nVidia Inspector whether the GPU had down-clocked itself, but it hadn't. Rebooted, changed drivers, OC'd with EVGA Precision, OC'd through the BIOS, ran OCCT, ran the EVGA OC Scanner - none of it pointed to a single error. Right now load temps are @65C-70C. I also changed the cuda count from .31/.34 to .50/.50 but still got an error on MB, though surprisingly no error on AP.
-
Please Ditch AP, aborting tasks as necessary, then after a reboot report to me x38g behaviour only. Thanks.
-
Please Ditch AP, aborting tasks as necessary, then after a reboot report to me x38g behaviour only. Thanks.
After ditching all AP tasks, x38g is running smoothly.
-
After ditching all AP tasks, x38g is running smoothly.
Thanks for the information, and I apologise for the curtness of the recommendation. For the AP app, current advice being given is to run either Cuda 4.0 Drivers, or the Beta AP app, but not both together, as apparently it does not include any of the (boincApi) fixes for newer drivers, and the underlying OS/driver changes they address.
Jason
-
Here's one that is interesting. Must say they tried their best to prove me wrong but finally gave up and gave me credit. http://setiathome.berkeley.edu/workunit.php?wuid=783092275
-
Here's one that is interesting. Must say they tried their best to prove me wrong but finally gave up and gave me credit. http://setiathome.berkeley.edu/workunit.php?wuid=783092275
You're also canonical, not that surprisingly with all those apps erroring out.
Jason
-
Using 275.33/38g & 39f I have a downclocking problem with one particular card. There is no problem using 191.07/6.09.
It's one of 2 EVGA GTX285 FTW's in the same box, the other card behaves itself perfectly. I have confirmed it's the card by swapping cards between the sockets.
Jason, as there are two identical cards, one which plays up and one which doesn't, are there any tests you would like me to run on the cards to help track down the reason for the downclocking issue in general ?
T.A.
-
One thing that has become apparent lately is that some of the optimised code may be pushing harder than some OC scanning tools. It is the 560Tis that came under scrutiny first for being close to the edge from the factory, and there seem to be some per-manufacturer &/or per-silicon differences.
As a result of the ongoing examination, I may need to take some of the 'hottest' running Cuda kernels, and make some more targeted scanning tool out of it. In the meantime I suggest see what happens if you back the suspect card right down to reference clocks. If that still doesn't help there could be further issues.
If it turns out I am pushing harder than whatever the factories use to bin parts for factory OC models, then I may have to look at some sort of backoff throttle.
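The backoff throttle idea can be sketched as a simple downward search: start at the factory OC, step the core clock down until a stability probe passes, and never go below reference. This is only an illustration; `find_stable_clock` and the `is_stable` callback are hypothetical names, with the probe standing in for an actual run of the hot kernels compared against a known-good result.

```python
def find_stable_clock(factory_mhz, reference_mhz, is_stable, step_mhz=5):
    """Step the core clock down from the factory OC until the
    stability probe passes, bottoming out at the reference clock.
    `is_stable` is a hypothetical callback: run the hot kernels at
    that clock and compare output against a known-good reference."""
    clock = factory_mhz
    while clock > reference_mhz and not is_stable(clock):
        clock -= step_mhz
    return max(clock, reference_mhz)

# Example with the GTX 285 numbers from this thread: factory OC 725 MHz,
# nVidia reference 648 MHz, and a fake probe that only passes at <= 700 MHz.
print(find_stable_clock(725, 648, lambda mhz: mhz <= 700))  # -> 700
```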
Jason
-
On closer examination, there is a possibility it's related to certain WU's or series of WU's.
The FTW cards are factory OC'ed to 725MHz; I have backed them off to 715MHz and it still drops to half speed with the same degree of randomness.
Two WU's in question are here (http://setiathome.berkeley.edu/result.php?resultid=2026291800) and here (http://setiathome.berkeley.edu/result.php?resultid=2026307997)
T.A.
-
Two WU's in question are here (http://setiathome.berkeley.edu/result.php?resultid=2026291800) and here (http://setiathome.berkeley.edu/result.php?resultid=2026307997)
Thanks, no indication of a cause in those, so we'll keep looking.
What are the temps like going at flat chat?
What's in question is the stability of my code (exhibiting apparent task dependency) versus factory OC's, i.e. clocks set higher than the nVidia reference.
nVidia reference clocks for GTX 285 are:
Core: 648 MHz
Shader: 1476 MHz
Mem: 1242 MHz
Still crook at those?
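For context, the FTW factory clock sits a fair way above those reference numbers; a quick check of the margin:

```python
factory_core = 725    # EVGA GTX 285 FTW factory OC, MHz (from this thread)
reference_core = 648  # nVidia reference core clock, MHz

delta = factory_core - reference_core
oc_percent = delta / reference_core * 100
print(f"Factory OC is {delta} MHz ({oc_percent:.1f}%) above reference")
# -> Factory OC is 77 MHz (11.9%) above reference
```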
-
There are some further clues in your errored task list that I'm looking at, tracing some code. Back later with some beer to fuel a further analysis.
There's still quite a few things to eliminate from suspicion, but we'll isolate what's going on eventually.
Jason
-
Hi Jason - Here (http://setiathome.berkeley.edu/result.php?resultid=2027976105) and here (http://setiathome.berkeley.edu/result.php?resultid=2027976084) are a couple more units for your perusal. For comparison, THIS (http://setiathome.berkeley.edu/result.php?resultid=2027924538) is a "good" unit from the same card.
In a short run of about 12 hours, reducing the card to 648MHz showed no downclocking errors (it was dropping to approx half speed before). I've put it back up to 702MHz, which GPU-Z claims is the "stock" speed, and will report back tomorrow. (The ~70MHz downclock from the EVGA factory spec was just too irritating to handle :-)
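One way to catch this kind of intermittent drop to half speed is to poll the current core clock and log any reading below the expected value. A minimal sketch, assuming a driver and nvidia-smi version that support the `--query-gpu` interface (not all do, particularly older ones); `watch_core_clock` is a hypothetical helper, not part of any existing tool.

```python
import subprocess
import time

def parse_mhz(s):
    """Parse nvidia-smi clock output like '702 MHz' into an int."""
    return int(s.strip().split()[0])

def watch_core_clock(expected_mhz, interval_s=5):
    """Poll the current graphics clock and flag any downclock below
    the expected speed. Assumes `nvidia-smi --query-gpu` is available."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=clocks.gr",
             "--format=csv,noheader"], text=True)
        mhz = parse_mhz(out)
        if mhz < expected_mhz:
            print(f"Downclock detected: {mhz} MHz < {expected_mhz} MHz")
        time.sleep(interval_s)
```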
T.A.
-
Thanks,
It gives me some ammunition to approach things properly with the new 560ti in the other room, which I have yet to put under any crunching or test pieces. I intend to use it to help isolate what's going on, attempting to replicate what some others see.
If I'm pushing some code portions 'too hard' (that is, harder than what the factories use to determine stable OC's or bin parts), I'll just have to back those off, making them optional via advanced user settings somehow. (There could be a lot of them, so some sort of configuration file would probably be needed, along with stress tests to determine viable settings, as well as potentially some monitoring & failsafes.)
It's likely to end up being a complicated tradeoff, whether to run faster code at a reduced clock rate, or slower code at potentially unstable factory settings, but the most stable config would have to be the default.
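The per-kernel configuration file mentioned above might look like simple `name = factor` lines. Everything here is an assumption for illustration: the format, the kernel names, and the `load_throttle_config` helper are all hypothetical, not part of the actual app.

```python
def load_throttle_config(text):
    """Parse a hypothetical per-kernel throttle config of the form
    'kernel_name = factor', where 1.0 means the fastest variant and
    smaller values mean backed-off variants. Blank lines and '#'
    comments are ignored."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = float(value)
    return settings

example = """
# hypothetical per-kernel backoff settings
find_triplets = 1.0
chirp_fft     = 0.85  # run the slower, cooler variant
"""
print(load_throttle_config(example))
# -> {'find_triplets': 1.0, 'chirp_fft': 0.85}
```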
Jason
-
Got an error http://setiathome.berkeley.edu/workunit.php?wuid=801331122 Found a triplet thrice. Sounds like a song title or something. :D My original wingman hasn't got to it yet and the new work hasn't gone out yet, so I don't know if it was something I did wrong or not. Since it's the first one of those I've seen in a while, I doubt it's me.
-
LoL. It looks like he's running stock, so he will find it twice before exploding (assuming your result was all in order). It's probably just an extraterrestrial intergalactic cruiseliner sending an SOS distress beacon in Morse code. We don't want those anyway; we're looking for extraterrestrial intelligence, not shuffleboarders.
-
Got it, no shuffleboarders. If I got it thrice does that mean all my wingman will get is S, O ?
-
Got it, no shuffleboarders. If I got it thrice does that mean all my wingman will get is S, O ?
LoL, probably more like the S plus the first 'dah' of the O, since 4 tones would make 2 triplet detections already if you think about it, and 5 tones could make 3 triplets. Won't be long until I remove this limit anyway.
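The tone arithmetic above follows from each window of 3 adjacent tones counting as one triplet, so n consecutive tones contain n - 2 overlapping triplets. A toy check (the function name is mine, not the app's):

```python
def overlapping_triplets(n_tones):
    """Number of overlapping triplet detections that n consecutive
    tones can produce (each window of 3 adjacent tones is one)."""
    return max(n_tones - 2, 0)

for n in (3, 4, 5):
    print(n, "tones ->", overlapping_triplets(n), "triplets")
# 4 tones -> 2 triplets and 5 tones -> 3, matching the post above.
```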
-
It's probably just an extraterrestrial intergalactic cruiseliner sending an SOS distress beacon in morse code. we don't want those anyway, we're looking for extraterrestrial intelligence, not shuffleboarders.
Nah, the cruiseliners look like 2035680302 (http://setiathome.berkeley.edu/result.php?resultid=2035680302) - overflow on 23 gaussians after 75 minutes. It's the wake, you know.
-
What stupid credits :(
2.66 credits for 1200 seconds of work
http://setiathome.berkeley.edu/workunit.php?wuid=799356547
-
Oh there's more fun to come yet :D
-
I found out what caused those results: it looks like my 280.26 drivers are broken, so the GPU miscalculates badly. There was one result finished in 14 seconds that got the same credit, with the same angle range in the WU.
But this http://setiathome.berkeley.edu/workunit.php?wuid=800021504 is a mystery: is my GTX 560 Ti right, or his GTX 295? :) Both use the same app :)
-
His 295 looks to be cooking (invalids popping onto his list).
-
His 295 looks to be cooking (invalids popping onto his list).
Yes, you are right: all the inconclusives are from the same host with a GTX 295. (I can "see" the user of that card being happy, because his card crunches so fast.) :))
-
Just to let you know I'm still around. I thought this one was interesting: http://setiathome.berkeley.edu/workunit.php?wuid=742974732 Not much else to report; everything has been working well except for the wonderful changes Dr. A made. ::)