Seti@Home optimized science apps and information
Optimized Seti@Home apps => Discussion Forum => Topic started by: sunu on 02 Jun 2010, 05:52:39 pm
-
... and valid results get thrown out of the window...
See this workunit http://setiathome.berkeley.edu/workunit.php?wuid=609263674
My result was the only valid result from all that garbage. And was marked as invalid :(
-
Hm.... looks like our hope that an incorrect overflow gives quite random pulses was not fulfilled :(
So such a disrupted GPU state is even more dangerous for the project than was thought before!
-
Wish I hadn't seen that. I've got a couple of pendings waiting for a match that look a lot like that.
-
Here's one on my host: workunit.php?wuid=618018953 (http://setiathome.berkeley.edu/workunit.php?wuid=618018953) :(
Claggy
-
Just found this one today http://setiathome.berkeley.edu/workunit.php?wuid=618796380
-
Unfortunately there are many. Yesterday I had another one, but today it must have been deleted from the database and I can't post the link.
-
I think they are trying to keep it a secret!! Looks like as soon as we post a link to one of them they erase it from the database. ;D Must be a conspiracy.
-
Could people spotting / reporting this problem please check and report the hardware involved?
I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs.
But we need robust reports from reliable witnesses....
-
...I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs....
From the few I saw, that was the case (470 & 480 wingmen trashing & validating against one another). I do suspect that was the 'errors with 2.3' situation that Eric alluded to a while back (around the Fermi release IIRC), which might suggest some sort of flag raised somewhere that let him know, like the noisy WUs figure that used to show somewhere (?). Further conjecturing (& hoping), some double -9 intercept may be in place, explaining the rapid result removal sooner than the normal 24 hour assimilation/deletion period. If that's the case, I hope they put those through again for reprocessing.
-
Eric's comment (he actually said cuda24, which we're taking to be a typing error) was made on 20 May - well, actually late afternoon 19 May in his time zone - in the course of a conversation with David and me about Fermi issues at Beta. It came just after the corrected Fermi 6.10 application version was loaded for stock download at Beta.
The 'noisy WUs' figure is still showing on the Science status page (http://setiathome.berkeley.edu/sci_status.html). Since it's "science", I assume it's driven off the validated results transferred from the BOINC to the science database. Historically, it's been 'about 5%'. Last time I looked, it was down to 1.2%, which I took as a compliment to the Radar removal team. Now, it's showing as 4.8%, which probably reflects the scale of the "pseudo -9" problem.
I'm coming to the conclusion that nobody saw this one coming. I certainly hadn't thought about it until this afternoon, and yet I've been working closely with David / Eric / Jason on BOINC+Fermi issues. Even when I told David (much to his surprise) that the Fermi card wouldn't run the cuda23 app at Beta (during the quota overflow discussion), the penny didn't drop that the situation was already building up at Main.
I have now suggested - on boinc_dev, which is the wrong mailing list, but the only one we've got in the absence of an official seti_technical channel - that 6.10_fermi should be installed as a stock application at Main. I think that's the only sensible way to rescue the situation.
Let's hope that no eager young project puppy runs into the lab this afternoon and loads a pristine box of tapes.....
-
Could people spotting / reporting this problem please check and report the hardware involved?
I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs.
But we need robust reports from reliable witnesses....
Well, some of the reported workunits have a fermi card involved while others don't. The workunit from the first post here had 3-4 corrupted results, only one, if I remember correctly, was from a fermi with a cuda23 app. The other workunit I mentioned above had two 2xx cards involved with massive amounts of corrupted -9 results.
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
-
Sorry, I didn't notice what my wingmen were running but I'm pretty sure the WUs were showing as 6.09s. I'm running a GT9500 on my vista X86 machine.
edit: Forgot to add I'm running the renamed cudart32_30_14 and cufft32_30_14 DLLs with the 197.45 driver.
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
Unfortunately, searching at Beta probably won't turn up many errors, because I've been leaning on David to get them fixed, and this particular problem (issuing work associated with a non-Fermi app, to a Fermi-equipped host) should no longer happen at Beta. There are just a few still visible on 12316 (http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=12316&offset=0&show_names=0&state=4).
All that's left is - why am I still trapped by yesterday's quota?
Max tasks per day 153
Number of tasks today 273
- and why is credit so erratic?
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
[offtopic]
Credit granting on beta is absolutely screwed. If you look at the granting for AP tasks it's even more obvious
[/offtopic]
And on topic: because the first WU listed in this thread had no 2 Fermi GPUs, this problem is not only Fermi-related (unfortunately). Looks like _any_ invalid overflow has some probability of being validated :(
-
I had a look for hosts matched with my E8500 / 9800GTX+ / HD5700 that were producing inconclusive/Invalid work:
GeneralFrost hostid=5356245 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5356245) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
Arles hostid=5355863 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5355863) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Balmer hostid=5384948 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5384948) [2] NVIDIA GeForce GTX 480 (1503MB) driver: 19775
djwhu hostid=5424576 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5424576) NVIDIA GeForce GTX 480 (1503MB) driver: 19775
Andrew Bazhaw hostid=5423129 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5423129) NVIDIA GeForce GTX 480 (1503MB) driver: 19775
Ollie hostid=5371034 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5371034) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
smithwr3 hostid=5293938 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) [2] NVIDIA GeForce GTX 480 (1493MB) driver: 25715
Chris hostid=5423967 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5423967) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Anonymous hostid=4946291 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4946291) [2] NVIDIA GeForce GTX 275 (895MB) driver: 19745
D. McQueen hostid=4846359 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4846359) NVIDIA GeForce GTX 260 (895MB) driver: 19713
Rory Isenberg hostid=5255297 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5255297) NVIDIA GeForce GTX 260 (877MB) driver: 19745
NEG hostid=1931164 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=1931164) [2] NVIDIA GeForce 9600 GT (495MB) driver: 19745
Michael Sangs hostid=5354486 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5354486) [3] NVIDIA GeForce GTX 295 (895MB) driver: 19107
Tim Lee hostid=5301365 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5301365) [2] NVIDIA GeForce 9800 GTX/9800 GTX+ (1024MB) driver: 19621
and there were a few more non Fermi hosts,
Claggy >:(
Edit: another five Fermi's:
Bittkau hostid=5336843 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5336843) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
William hostid=5414447 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5414447) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
Anonymous hostid=5419662 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419662) NVIDIA GeForce GTX 470 (1248MB) driver: 19745
Setiman hostid=5227589 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5227589) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
basti84 hostid=5391741 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5391741) NVIDIA GeForce GTX 470 (1248MB) driver: 25715
edit 2: added more Fermi:
Aaron Danbury hostid=5373696 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5373696) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Anonymous hostid=5025277 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5025277) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
Edit 3: added more Fermi:
My9t5Talon hostid=5419671 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419671) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
simi_id hostid=5419256 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419256) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
-
...
Claggy >:(
I think we should send David A. around to tell them off! :o
-
And the most important thing I see in this list - they are not only FERMI GPUs!
But there can still be 2 independent problems:
1) a corrupted GPU state on pre-FERMI GPUs that produces random pulses, and 2) a programmatic error in the early CUDA app that produces non-random but invalid pulses on FERMI GPUs.
The second one will always pass into the database if 2 FERMIs with broken apps meet each other.
But the first one most probably should not pass the Validator (our database Guardian in some sense became a Blind Guardian ;D ;D ;D )
The problem with the broken app for FERMI is resolvable. But whether 1) will let invalid results go into the database too or not - that's a hard question that needs some more evidence IMO. If yes.... ::)
-
Of the Fermi cards (the first 8 in Claggy's list), seven are running the stock v6.09_cuda23 application.
Just one - smithwr3 - is using an app_info, and he's got one of Raistmer's v12 builds. I don't know enough about the std_err to be able to tell whether it's the special one he did for Fermi, and since smithwr3 hasn't posted in the forums since he was struggling with a Mac almost 4 years ago, there's not much to go on.
-
whether it's the special one he did for Fermi, and since smithwr3 hasn't posted in the forums since he was struggling with a Mac almost 4 years ago, there's not much to go on.
Very low probability he could have taken that version. And even so, IMO Jason already showed that the initial CUDA MB code has a programmatic error that is "silent" on pre-FERMI GPUs but leads to invalid computations on FERMI. I used the same codebase, just rebuilt the app with the new SDK. That is, V12 is in no way FERMI compatible.
-
Just checked my validation inconclusive and found three or four where my wingman turned in .01. I should be ok on most because the third wingmen are running on their CPUs. I followed out the .01 wingmen and all seem to be running either a 470 or 480. They are also getting a lot of .01 credit claims validated. Again, tracing out their wingmen, they are also running 470/480s. I wonder just how much we are missing because the third man turns in good also. :o
-
Of the Fermi cards (the first 8 in Claggy's list), seven are running the stock v6.09_cuda23 application.
And of the five additional Fermis that Claggy has listed, every one is running stock v6.09_cuda23.
Raistmer is absolutely right to say that there are two distinct problems:
1) Random state corruption of older cards
2) Fermi cards running incompatible applications
The point is, the second problem could be solved at a stroke by deploying the v6.10 Fermi app which has been tested - and has passed the test - at Beta.
That's an incredibly easy solution, and would remove, on Claggy's figures, a hugely significant part of the problem.
The random failures would remain, to be dealt with as we understand the problem further. But that problem has existed for months, without reaching critical mass. If we remove the Fermi co-validators, it should remain insignificant: but the rise of the Fermi means we can't ignore problem #2 any longer.
-
Sure, I thought last night's project maintenance would bring 6.10 to SETI main - still not?
-
No sign of it. But something's going on: looking at Claggy's list, only one (djwhu, 5424576 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5424576)) is still blowing away significant numbers of WUs - and incidentally confirming that mid-AR WUs suffer the same fate. The latest addition (Aaron Danbury, 5373696 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5373696)) has done a few, but with the new work supply, it would normally have many more. So it seems that they may have put some sort of limiter into the system, but it's not obvious what.
David has got the message (off-list response), but hasn't got a reply from Eric yet.
And the problem is about to get worse - Fermi GTX 465s have landed in the shops, and are already being discounted: I was offered one for £215.99 in a mailshot. Won't interest the hard-core crunchers, but will certainly attract a few into the fit-and-forget segment.
-
Worth mentioning: I've had no work for the fermi application since yesterday.
All WUs coming in are for CPU only.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I think it is blocked till the situation is solved.
:)
-
Worth mentioning: I've had no work for the fermi application since yesterday.
All WUs coming in are for CPU only.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I think it is blocked till the situation is solved.
:)
A hard but probably correct decision on Berkeley's side. I hope they will be able to solve this soon, because all they need is to follow Richard's suggestion about 6.10 on main.
It doesn't look very hard to do, actually.
-
It's not blocked, because I've got four new tasks today. But I think it may be very severely throttled.
Raistmer, people on the main board are still recommending your V12b_FERMI. Has anybody actually tested it, and posted any results? If not, I've just downloaded a copy and I'll run a bench when the 470 is free (GPUGrid task due to finish in ~2 hours). If it doesn't work, as you suggested last night, I suggest you remove the download archive.
-
It's not blocked, because I've got four new tasks today. But I think it may be very severely throttled.
Raistmer, people on the main board are still recommending your V12b_FERMI. Has anybody actually tested it, and posted any results? If not, I've just downloaded a copy and I'll run a bench when the 470 is free (GPUGrid task due to finish in ~2 hours). If it doesn't work, as you suggested last night, I suggest you remove the download archive.
AFAIK Todd Hebert tested it and found it incompatible with FERMI.
It was at the beginning of the corresponding thread.
Later this info was somehow modified... So better to do a short test and close this topic completely. Surely I will remove it if it is indeed not compatible.
-
Ran the bench with a range of Joe's full-length WUs. One of them worked, but three failures: given the ARs involved, I think this needs withdrawing, pronto. Could you remove the archive, please, and get the Mods to lock your thread after a suitable explanation?
While it was running, I looked through Todd Hebert's posts. He says a couple of times that he got the files from you, and even posts an app_info for MB_6.09_CUDA_V12b_FERMI.exe (message 990059 (http://setiathome.berkeley.edu/forum_thread.php?id=59666&nowrap=true#990059)). But I don't see anywhere where he posts, or even describes, a test result advising a change of direction. Yet other people, like ScimanStev, describe downloading files from Todd which - it turns out - have the stock app included.
I can't say I'm very impressed by the integrity of this process.
-
Ran the bench with a range of Joe's full-length WUs. One of them worked, but three failures: given the ARs involved, I think this needs withdrawing, pronto. Could you remove the archive, please, and get the Mods to lock your thread after a suitable explanation?
While it was running, I looked through Todd Hebert's posts. He says a couple of times that he got the files from you, and even posts an app_info for MB_6.09_CUDA_V12b_FERMI.exe (message 990059 (http://setiathome.berkeley.edu/forum_thread.php?id=59666&nowrap=true#990059)). But I don't see anywhere where he posts, or even describes, a test result advising a change of direction. Yet other people, like ScimanStev, describe downloading files from Todd which - it turns out - have the stock app included.
I can't say I'm very impressed by the integrity of this process.
Ok, I will recommend using the stock 6.10 from beta then.
EDIT: done.
-
Thanks. I've now set it to run live on the main project - host 2901600 (http://setiathome.berkeley.edu/results.php?hostid=2901600). No problem fetching work - just a shortlist for the moment, because (a) I run a short cache, and (b) DCF hasn't settled yet - still estimating three hours!
All those pseudo -9s that we started this thread with will have driven DCF way low. I think we may have encountered another of BOINC's safety features - IIRC BOINC cuts down on work fetch if DCF ever gets into 'insane' territory, either high or low. There's a lot of very sound engineering practice in the original BOINC design, but I fear we're in danger of losing it with all these hurried, on-the-fly bodges to cope with evolving technologies like GPUs.
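For illustration, here's a minimal sketch of how a run of pseudo -9 overflows drags DCF down. The update rule and constants here are assumptions for the sketch - the real client's DCF bookkeeping and 'insane' thresholds vary by version:

```python
def update_dcf(dcf, estimated_secs, actual_secs):
    """Toy model of duration-correction-factor updates (assumption:
    jump up immediately on long tasks, drift down slowly on short ones)."""
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        return ratio                  # a long task raises DCF at once
    return dcf + 0.1 * (ratio - dcf)  # short tasks pull it down gradually

# Tasks estimated at an hour but 'finishing' in 30 seconds as pseudo -9
# overflows drag DCF far below 1:
dcf = 1.0
for _ in range(20):
    dcf = update_dcf(dcf, estimated_secs=3600, actual_secs=30)
print(dcf)  # well under 0.2 - plausibly 'insane' territory
```

The asymmetry (fast up, slow down) is the interesting part: one genuine full-length task snaps DCF back, but a stream of instant overflows poisons it for a long while.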
-
Yeah, life is too fast to properly think about it, and BOINC hasn't escaped this :) But some block with a fast reaction time to stop invalid overflows would be a good thing IMO.
They damage the project in too many ways.
-
Already got a wingmate to add to Claggy's list:
Pieter hostid=5431046 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5431046) NVIDIA GeForce 9800 GT (1005MB) driver: 19745
Host created today, downloaded 564 tasks, got two of them to validate at 0.01 credits, I'm too depressed to look-see how many pages-full he's wasted.
-
Already got a wingmate to add to Claggy's list:
Pieter hostid=5431046 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5431046) NVIDIA GeForce 9800 GT (1005MB) driver: 19745
Host created today, downloaded 564 tasks, got two of them to validate at 0.01 credits, I'm too depressed to look-see how many pages-full he's wasted.
220 pending, 2 validated, all teensie claims.
One of the two "valid" is good evidence, text captures attached as WU619984348.7z, also attaching text captures for paired 4xx case noted by Sutaru as WU619465291.7z.
Joe
-
A cuda_fermi application, v6.10, was loaded about 30 minutes ago. No-one will have any WUs yet, of course, because the splitters haven't been restarted.
I'd prefer not to test the stock download process myself if I can avoid it, because I'm rigged with an app_info and still have some VLARs waiting for optimised CPU handling. But if we could keep an eye on Claggy's list, and see if the Fermis start producing valid work, that would be good news.
-
Please check md5 (binary equivalent) of exe against beta ones
-
Please check md5 (binary equivalent) of exe against beta ones
E448A1489782723161EFAF99B9494661
in both cases. Binary FC says the same, too.
So this will be the one which describes itself as 6.09 in stderr, then ;D
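For anyone wanting to repeat that check, a chunked MD5 comparison takes only a few lines. This is just a convenience sketch (the file names would be passed on the command line), not anything project-supplied:

```python
import hashlib
import sys

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in 1 MiB chunks so large exes aren't
    loaded into memory whole. Returns an uppercase hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest().upper()

if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. python md5cmp.py main_copy.exe beta_copy.exe
    same = md5_of(sys.argv[1]) == md5_of(sys.argv[2])
    print("identical" if same else "different")
```

An identical MD5 plus a binary file compare, as Richard did, is about as strong an "it's the same build" check as you can get without the project publishing signed hashes.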
-
LoL, thanks. :D
-
Berkeley switched off all data distribution; there's a message on the front page. --->
"We are experiencing a problem such that some GPU platforms are quickly overflowing on all workunits that they receive. Rather than burn through a great deal of data that we would have to redistribute, we are turning off data distribution until we get this debugged."
08.06.2010 09:27:38 SETI@home update requested by user
08.06.2010 09:27:41 SETI@home Fetching scheduler list
08.06.2010 09:27:43 SETI@home Master file download succeeded
08.06.2010 09:27:48 SETI@home Sending scheduler request: Requested by user.
08.06.2010 09:27:48 SETI@home Reporting 6 completed tasks, requesting new tasks for CPU and GPU
08.06.2010 09:27:51 Project communication failed: attempting access to reference site
08.06.2010 09:27:51 SETI@home Scheduler request failed: Couldn't connect to server
08.06.2010 09:27:53 Internet access OK - project servers may be temporarily down.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
that's good, so they are working on it.
heinz
-
Eric reports in the 'Scheduler problems' (http://setiathome.berkeley.edu/forum_thread.php?id=60248) News thread:
"We're having difficulty getting a new scheduler running that handles cuda_fermi applications properly. We'll be down until we get it sorted out."
Claggy
-
This may be considered a case of "be careful what you wish for" - I sent Eric an email just as the lab opened yesterday, drawing attention to the scale of the problem and the list in this thread. Probably not what he wanted to read on a Monday morning, and deploying a new scheduler probably wasn't his plan for the day either.
But it had to be done - shame it didn't go smoothly.
-
LoL, I did the same and received an answer that they were working on it right now (about the time 6.10 was spotted on main). Interesting, what was wrong with 6.10 ?...
-
LoL, I did the same and received an answer that they were working on it right now (about the time 6.10 was spotted on main). Interesting, what was wrong with 6.10 ?...
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
...I sent Eric an email just as the lab opened yesterday, drawing attention to the scale of the problem and the list in this thread.
...
User "korpela" has not been active in the last 24 hours, though this thread is visible to all and he might have viewed it as a guest.
Joe
-
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
I've not seen anything like that, and my GTX 470 has tried them all (6.09, 6.08, v12b). They just fail in their various ways, and move on to the next task. It's very different from the 'sporadic error state' on older GPUs, where the failure persists from task to task until reboot.
-
I started going through my resends again this morning, no new Fermi's, just these hosts:
Sigurd G.Schinke hostid=5372764 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5372764) NVIDIA GeForce GTX 260 (881MB) driver: 19713
Marc Jarry hostid=4247889 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4247889) NVIDIA GeForce 9600 GT (511MB) driver: 19745
BabelAbu hostid=5374194 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5374194) NVIDIA GeForce GTX 260 (1792MB) driver: 19732
malycc hostid=5386713 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5386713) NVIDIA GeForce 9500 GT (511MB) driver: 19745
The Beef hostid=5289552 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5289552) [2] NVIDIA GeForce GTX 295 (896MB) driver: 19562
Anonymous hostid=5049618 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5049618) [2] NVIDIA GeForce GTX 295 (895MB) driver: 19038
k.pieschl hostid=3192436 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=3192436) NVIDIA GeForce 9800 GT (1023MB) driver: 18634
Claggy
-
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
I've not seen anything like that, and my GTX 470 has tried them all (6.09, 6.08, v12b). They just fail in their various ways, and move on to the next task. It's very different from the 'sporadic error state' on older GPUs, where the failure persists from task to task until reboot.
Thanks, I had missed that very important difference. Now that the project is flowing data again perhaps we'll be able to figure out what they had to modify.
Joe
-
Well, this should be interesting. All the work units I've downloaded since we came back up have been showing as "not in DB" on my SETI website tasks page instead of anonymous platform. I mentioned it in a post on the NC forum and Skildude posted that he is getting it too. Mine are showing up as 6.03s and 6.08s on my Manager's task list, though. It will be a while before I get to them, so I hope they will run ok.
-
Well, this should be interesting. All the work units I've downloaded since we came back up have been showing as "not in DB" on my SETI website tasks page instead of anonymous platform. I mentioned it in a post on the NC forum and Skildude posted that he is getting it too. Mine are showing up as 6.03s and 6.08s on my Manager's task list, though. It will be a while before I get to them, so I hope they will run ok.
Yes, I'm getting that too. On another note, another valid result got invalidated by two fermi cards using the wrong app. Link below, and a copy printed.
http://setiathome.berkeley.edu/workunit.php?wuid=615573439
-
Same here. Anonymous platform replaced with the new (in my case even translated to "Нет в ДБ", i.e. "Not in DB") message. For CPU-assigned tasks too.
-
On my machine all looks normal -
I get 6.03 WUs and one 6.10 (cuda fermi)
~~~~~~~~~~~~~~~~~~~~~~~~~
On my results page I see 10 entries not in DB
-
Much more disturbing message:
Your app_info.xml doesn't have a usable version of Seti@home enhanced.
Reached daily quota of 100 tasks
Well, until yesterday, berkeley liked my app_info.xml.
-
They're obviously making rather wider changes to the work scheduling process, incorporating some of the improvements which have been tested at SETI Beta recently. But it doesn't seem to be going very smoothly.
There's not supposed to be any change to the app_info format, so I suggest you treat that as a server error: don't mess around with app_info in a vain attempt to get it working!
There's no work available today anyway, so I suggest we all just sit back and watch until the dust settles.
-
We've got again those validate errors with no apparent reason http://setiathome.berkeley.edu/workunit.php?wuid=559969383
-
We've got again those validate errors with no apparent reason http://setiathome.berkeley.edu/workunit.php?wuid=559969383
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
-
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
No, it was invalidated after that June 12 (or maybe 16) result. Before that date it wasn't in my invalid list.
-
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
No, it got invalidated after that June 12 (or maybe 16) result. Before that date it wasn't in my invalid list.
It wouldn't even have been looked at until 12 June. When that second 'success' report came in, the validator will have tried to find the uploaded result file that should have preceded the 7 February report - and presumably it wasn't there.
Without a result file, that one could never be validated: presumably the file associated with the 12 June report was findable, but there was nothing to compare it with, hence 'inconclusive' and 'pending'.
-
Why wasn't it there?
-
I had a look at the list of hosts i posted earlier in this thread,
some of the Fermi's are still trying to do Cuda23 work and don't have the Fermi app,
then there's smithwr3, who's still using the optimised V12 app.
Claggy
Edit: I retract that, i was talking out of my A*se, they had done Cuda23 work at the beginning of June, and haven't done any since.
-
Got a false Invalid task today, wingmen are smithwr3 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) and Fabio Chimienti (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5440744), both running V12 still, both produced -9's, :(
wuid=646519895 (http://setiathome.berkeley.edu/workunit.php?wuid=646519895)
Claggy
-
That smithwr has 18,000 tasks on his list. It's scary thinking how many of those are returning -9.
-
That smithwr has 18,000 tasks on his list. It's scary thinking how many of those are returning -9.
Quota system doesn't work very well, does it?
Every invalid task/error resets his quota to 100, but with Seti's *8 Multiplier, he still gets 800 tasks a day,
I think the quota should go a lot lower than 100, or take the Multiplier away.
Claggy
-
If all his results are invalid, his quota should go to 1 eventually, i.e. 8 GPU tasks per day.
[BTW, it looks like the very same case the new quota system should protect against.
Even if he returns good CPU results - the GPU should be inhibited.
If not - the current quota system is still flawed.
]
[And the GPU multiplier should be removed indeed. If the GPU works fine it will have no effective limits w/o any multiplier, but if it's broken this multiplier just multiplies trashed tasks... ]
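As a toy model of the behaviour described above (assuming, purely for the sketch, that a valid result doubles the daily quota up to the cap and an invalid one halves it - the real BOINC bookkeeping is more involved):

```python
MAX_QUOTA = 100       # per-processor daily cap discussed in this thread
GPU_MULTIPLIER = 8    # Seti's *8 GPU multiplier mentioned by Claggy

def adjust_quota(quota, result_ok):
    """Simplified quota feedback: reward good results, punish bad ones.
    (Assumption: double on valid, halve on invalid, floor of 1.)"""
    if result_ok:
        return min(MAX_QUOTA, quota * 2)
    return max(1, quota // 2)

# A host returning only invalid results hits the floor of 1 within a
# handful of days, but the *8 multiplier still hands it 8 GPU tasks/day:
quota = MAX_QUOTA
for _ in range(10):
    quota = adjust_quota(quota, result_ok=False)
print(quota * GPU_MULTIPLIER)  # → 8
```

The sketch also shows Joe's later point: the penalty is delayed, so a host that goes bad with thousands of tasks already cached trashes them all before the quota ever bites.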
-
Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(
-
Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(
1223 atm. Put something big in the offset, then work your way up or down by say 100 - or larger depending on expectations. If you go too high it'll say there are no tasks to display.
It's handy for moving to the oldest pending/in progress task too.
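That offset trick is really a binary search over the results pages. A sketch, where `has_tasks_at(offset)` is a hypothetical helper standing in for "fetch results.php at that offset and see whether any rows come back":

```python
def count_tasks(has_tasks_at, hi=1 << 20):
    """Binary-search for the smallest offset that returns no tasks;
    that offset equals the total task count. `has_tasks_at` is any
    callable taking an offset and returning True if rows exist there."""
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if has_tasks_at(mid):
            lo = mid + 1   # tasks exist here, the count is higher
        else:
            hi = mid       # empty page, the count is at or below mid
    return lo

# With a pretend list of 1223 invalids (the count quoted above):
print(count_tasks(lambda off: off < 1223))  # → 1223
```

About 21 page fetches instead of 62 pages of 20, which matters when the server is as loaded as it was that week.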
-
The smithwr3 host 5293938 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) several minutes ago had 147 "valid" tasks (http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=140&show_names=0&state=3), not all of those are false overflows, as v12 apparently does not do that on VHAR or VLAR tasks and the host is also doing CPU work. It also has 1224 invalid tasks (http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=1220&show_names=0&state=4) which probably are all false overflows, though I certainly didn't check more than a tiny sample.
The BOINC quota mechanism is, and always has been, only capable of protecting against totally bad processing, and even so the protection is delayed; a host which goes bad with thousands of tasks already cached is penalized too late to save those tasks.
Is there a chance the "Notices" in 6.11+ might be considered obvious enough that the servers could send a Notice that the host appears to have failed along with a command which causes BOINC to not start any more tasks for the project until the user has read the Notice and believes the problem fixed? I don't know if the BOINC devs would consider something like that, discussions about quota and related things on the boinc_dev list seem to always end inconclusively.
Joe
-
Could I ask all Lunatics to review http://setiathome.berkeley.edu/forum_thread.php?id=62573 ?
The old problem of false -9 overflows, caused by outdated applications running on Fermi GPUs, is still with us, and still polluting the database with junk science.
Back in June when this thread was started, the major problem was the stock applications. We got the project to clean up its act, and the stock apps are now working properly for the 'set-and-forget' types.
Which means that the remaining problems are attributable, almost exclusively, to third-party applications: V12 vlar-autokill, in the linked thread.
As I've written in that thread, vlar-autokill was a fit and proper app in its time, and I have no criticism of Raistmer for releasing it. But it's now an embarrassment.
So, how can we cork the beastie back into its bottle? Unfortunately, I can't think of a way - short of the nuclear option of the project blocking all anonymous platform apps. Maybe some bespoke programming could be put into the scheduler to selectively block known bad app/hardware pairings, but I can't imagine project staff being happy about diverting scarce development time into doing that.
But I think there are two things we ought to consider doing.
The first is to be much harder on the ill-informed message board "advisors" - people like Sutaru and skildude - who advocate optimised applications as the cure-all for everything, but consistently fail to pass on the associated responsibility for understanding and long-term monitoring.
And secondly, how about building a 'suicide pill' into the apps themselves? Maybe in the first place for Beta apps - nobody should run a Beta app for longer than, say, one month: and if they are still actively testing after that time, an enforced re-install isn't too much of a problem.
The trouble is that I don't think that anything short of a physical block (suicide pills are common for trialware) will catch the sort of users I've linked in that thread - no message board activity, no team membership. And I'm sorry, but NO: I'm not going to start sending out unsolicited PMs and emails.
The thing that worries me most of all is that I can't see users like that coming here and collecting optimised apps on their own (even though they would at least see the warnings if they did). I'm beginning to wonder how many re-hosting websites there might be out there - overclockers, BOINC team sites, that sort of thing - which might be distributing Lunatics apps with no 'best practice' advice whatsoever.
Postscript: while previewing, I saw that the previous post in this thread concerned the very same host 5293938 that also featured today. So that's over four months the problem has continued unchecked.
-
Perhaps redefining/updating to a new version/planclass, disabling work for all existing ones, is an option: forcing a stock app update, and a matching planclass update for newer opt apps. I don't know enough about the BOINC app distribution mechanism to know if & how that would work.
I wouldn't mind developing an autoupdater for future production releases here. There will be those that still circumvent both the stock & opt updates anyway - some 'legitimately', some to defiantly run what they want. The science process itself needs to catch these with the validation & quota mechanisms (and the subsequent science process, of course), since user-specific configuration & 'jiggering' might be considered as having similar destructive potential to anything from a cosmic-ray bit-flip to a massive hardware failure. That goes for any app, not just GPUs. I'm sure there are brand-name machines that just shouldn't crunch at all, and people who just should not be allowed near computers. Unfortunately we're not the PC police, though maybe we should be ;)
Promoting the use of outdated known-buggy builds, old drivers & outright 'jiggering' has gone on in the past. Especially when directed toward inexperienced users I've always found it more than a bit frustrating, and I had to put a stop to it on one specific occasion I saw it here. In that particular instance a massive argument ensued & only ended with me banning the user to think about it, which sadly escalated the argument, forcing the Admin's hand (not mine) to permanently delete the user's account. Along with security concerns, that also resulted, in part, in the tightening of beta participation requirements & restrictions here to a more select group.
While we aren't the computer police, we don't have to put up with bad advice here; we can do our best to correct faulty advice where we spot it, and try to come up with ways to encourage doing the right thing. Unfortunately, in the case of problems inherited from the Fermi incompatibility, I don't see many ways to encourage that other than simply making newer releases better, more widely compatible, AND faster - which is proving to be quite a long road.
Jason
-
I agree - we are not the computer police, nor M$, and sometimes our development time is scarce too, btw ;)
Effectively disabling malfunctioning participants is BOINC's (I repeat, BOINC's, not project staff's) prerogative. We need a framework for doing common things with it, not just the bloatware that new BOINC versions increasingly resemble. I seriously think sometimes of writing a Perl script to process all tasks in a directory without BOINC, launching BOINC only for network communications.
If plan class/version limits are not effective, new means should be integrated into BOINC, IMO. It's impossible to create an app that will work on every piece of not-yet-existing hardware where some idiots would like to run it. I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU when it gives errors almost constantly - maybe people just never look at their results page?...
About the "suicide pill" - if it's implemented as a library with an easy enough interface, I'm ready to include it in my builds. But I have no intention of developing such a thing myself.
P.S. And about bad advice... It's a real problem, IMHO, but recently I've just grown tired of arguing with bad advice. I just try to give a more correct answer to the original poster without debating the uneducated but vocal ones. Life is short....
-
About "suicide pill" - if it's implemented as library with enough easy to use interface I'm ready to include it in my builds. But have no intentions to develop such thing.
I'll 'consider' it as possibility for future release, though it doesn't, of course, solve either the outdated release, or the intentionally 'jiggered' environments, so I'm approaching it (the whole idea) with some scepticism (probably a good thing).
I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU when it gives errors almost constantly - maybe people just never look at their results page?...
Yes, partly that. And now add to that certain people espousing overriding the stock app with stock Cuda23 (via app_info), insisting that's the fastest... got it? (Logical conclusions evolving from that may include that v12 VLARkill would be a good idea to use on Fermi... IT ISN'T, for anyone reading - don't do it!)
-
BTW, if I recall right, there was a similar problem on the Einstein project with one of Akosf's (not sure I reproduced the nickname right, Akos) opt builds. It failed to process new data correctly.
The same problem also appeared at least once on MW, again with a 3rd-party app.
What did the project admins do in those cases? Maybe the SETI project can learn from it?
-
What did the project admins do in those cases? Maybe the SETI project can learn from it?
Don't know. Some of the responsibility lies with the 3rd-party developers, IMO of course, and as we take pains to improve validation & stability at every step, we endeavour to meet those responsibilities. But let's face it: if erroneous app results can get through 'the system', then so can hardware faults, bad configurations, and sheer vandalism. The 'science' catches that, not the validators or mySQL queries (unless relational databases have become more sophisticated than I remember, it's a reduced database, not a knowledgebase ;) Those are project staff, plus the NTPCKR & RFI systems, to handle that further on.)
For our purposes here, I think we need to devalue the integrity of a single detection in our minds, when the reality is that it would go through a whole persistence & re-observation process before publication of any WoW-ness. We don't need another WoW-like signal; we have one of those, and it has proven insufficient to scientifically confirm the presence of an extraterrestrial civilisation.
Jason
-
BTW, there is another possible cause of such long-unmaintained host setups.
Sometimes a host can escape the grasp of the initial installer. I have such a host in my fleet, for example. It still produces correct data, but even if it goes wrong, I won't be able to do anything about it.
Perhaps host deletion/blocking from the SETI web site should be added....
-
Perhaps host deletion/blocking from the SETI web site should be added....
Joking: an automated system to contact the ISP, requesting they connect 240 Volts down the cable? ;D
More seriously, the updated version/planclass described earlier should cut work off from that host, though as mentioned I don't know the practicalities of that.
-
About "suicide pill" - if it's implemented as library with enough easy to use interface I'm ready to include it in my builds. But have no intentions to develop such thing.
When I was asked to produce a 'trial' version of one of my programs, many years ago, I found some shareware which could be applied to the completed, compiled .exe file - a sort of wrapper. That, of course, meant that the trial version of the program was identical to the paid-for version, and the wrapper was completely independent of the development environment used to produce it. But IIRC, it popped up a nag screen saying how many more days it could be used - not advisable for an app with no other UI ::). There were other restriction policies available, but that's the only one I used.
It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?
-
It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?
OK, a dumb idea along those lines from me: induce an update cycle & reset the project (via boinccmd.exe or a similar mechanism) if not current.
[Later:] Something looking potentially relevant to today's discussion... NTPCKR seems to be configured to ignore spikes (my guess, nothing more, from the command line).
+ <daemon>
+ <host>maul</host>
+ <cmd>ntpckr.x86_64 -nospikes -mod 4 0 -hpsd -dayscool 5 -summarize -projectdir /home/boincadm/projects/sah</cmd>
+ <output>ntpckr1.log</output>
+ <pid_file>ntpckr1.pid</pid_file>
+ <disabled>1</disabled>
+ </daemon>
...
+ <task>
+ <host>maul</host>
+ <cmd>update_candidate_counts.x86_64 -projectdir /home/boincadm/projects/sah</cmd>
+ <output>update_candidate_counts.log</output>
+ <period>1 hours</period>
+ <disabled>0</disabled>
+ </task>
-
I thought they said they had something to catch false valids. I guess that entry in ntpckr about nospikes is what I was thinking about. I try to make sure the people I recommend optimized apps to understand the risks involved, but I do forget occasionally. I also ask that they check their work and come back if they have any problems. I also send PMs now and then, but as Richard said in the SETI thread, I don't like to do that. Another problem is that so many of the problem machines belong to anonymous users, or have email notification turned off so they never see they have a PM. I guess it comes down to: we all do what we can, and if that is not enough, so be it - we tried.
-
I thought they said they had something to catch false valids. I guess that entry in ntpckr about nospikes is what I was thinking about. .... I guess it comes down to: we all do what we can, and if that is not enough, so be it - we tried.
Exactly. We can do our best within 'reasonable' efforts, but there will always be those situations & personalities that escape or intentionally avoid the 'right thing'. It is really the ultimate duty of the project in question, and as Raistmer indicates the Boinc framework itself, to ensure the integrity of any results is adequate to support the claims made in any published announcement or material.
I ask people to step back & take a look for a minute. This is part of the science of distributed computing, and a worthy challenge: to make sure we are doing what we can, and that the system can be robustified to adequately handle as many possibilities as we can going forward. We want, as developers, to strive toward perfection, whatever that is, but it is not a realistic goal to use absolute measures. The universe is NOT digital.
Jason
-
The universe is NOT digital.
Jason
Just can't restrain myself ;D : only if we aren't all living on the 13th Floor... ;) (http://www.imdb.com/title/tt0139809/plotsummary)
-
The universe is NOT digital.
Jason
Just can't restrain myself ;D : only if we aren't all living on the 13th Floor... ;) (http://www.imdb.com/title/tt0139809/plotsummary)
LoL, it reminds me of when 'The Matrix' came out. Many presented theories suggesting religious & scientific connections. I proposed it was a movie made to make money & was lambasted for that ::) oh well...
-
Hehe, the second and, especially, the third parts - definitely ;)
[
And I was "killed" by their idea of using humans as an energy source - quite a dumb idea IMHO. Why not read good books and take some ideas from there - Dan Simmons' "Endymion" for example, where the AI used human brains much more cleverly IMHO :)
]
-
Hehe, the second and, especially, the third parts - definitely ;)
[
And I was "killed" by their idea of using humans as an energy source - quite a dumb idea IMHO. Why not read good books and take some ideas from there - Dan Simmons' "Endymion" for example, where the AI used human brains much more cleverly IMHO :)
]
There's probably some element of my opinions that could be connected with sci-fi writing - more precisely, some Asimov-style 'anarchy' or 'fate' through statistical inevitability or chaos (Hari Seldon style). I still find Doc E.E. Smith's notions of overcoming the laws of nature appealing, allowing us to throw stars & planets about like billiard balls should the need arise (you never know when you might need to throw a planet around), but I don't see the two ideas as completely mutually exclusive.
Jason
-
I think we should write an error file, something like this, in our project app.
With its help we can count the errors and avoid malfunctions.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/* Crash counter sketch: bump a counter in a sentinel file at startup,
   delete the file on a clean exit. A crash leaves the counter behind,
   so repeated crashes accumulate until MAX_ERR triggers a bail-out. */
#include <stdio.h>

#ifndef MAX_ERR
#define MAX_ERR 20
#endif

#define ERR_FILE "errorfile"

int main(void)
{
    int errcount = 0;
    FILE *f = fopen(ERR_FILE, "r");
    if (f) {                /* error file exists: earlier runs didn't exit cleanly */
        fscanf(f, "%d", &errcount);
        fclose(f);
    }
    errcount++;             /* count this run now, so a crash is already counted */
    f = fopen(ERR_FILE, "w");
    if (f) { fprintf(f, "%d", errcount); fclose(f); }

    if (errcount >= MAX_ERR) {
        /* here we can do what we want to do:
           delete the .exe, reset the project, reset the machine... */
        return 1;
    }

    /* .... normal program code ....
       .... crash or no crash .... */

    /* no crash occurred, so we reach the normal end: */
    remove(ERR_FILE);       /* clear the counter on a clean exit */
    return 0;               /* --> next job */
}
heinz
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
This really makes it more complex to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (by BOINC?) and reset the project on them.
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
This really makes it more complex to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (by BOINC?) and reset the project on them.
Resetting the project on such a host won't help; the app_info and optimised apps are still kept in this situation - only a detach will get rid of optimised apps.
Can the project force hosts to detach? I've had all my WUs reported as detached (when the server was under stress and issuing ghosts),
but the apps and app_info were still there,
Claggy
-
At the moment, toward development of future applications, part of my goal is some robustification. While this won't address the existing zombie hosts being discussed here, higher performance would be a fairly strong incentive for at least some of those to update. So while existing X-series builds don't have the familiar Fermi issues, some safeguards can be put in place to ensure a repeat of this scenario, or a similar one, cannot occur again. I feel that development in that direction would be more rewarding than an added kill switch or other disabling mechanisms, yet would still handle extreme cases of misconfiguration/incorrect installation or other hardware or driver failures. (The x33 prototype already has the primitive, limited-effectiveness CPU fall-back reinstated with the slowest possible code, but I've yet to see it actually fall back, so some trigger would be needed to test this part...)
Just bouncing things around for now. How about, as an example for addressing the '-9 overflow' scenario directly:
- An overflow on spikes is found in any single CFFT pair, or a sequence of them
- That overflow is to be treated with suspicion: enter a fail-safe sequence
- The fail-safe sequence records some indicators to stderr, reinitialises the Cuda context entirely, and reprocesses the same CFFT sequence
- If this reprocessing of the data now yields no overflow, then processing can continue, on the presumption that we have recovered from some catastrophic driver error or other issue (some forms of recovery won't apply under XPDM, but will happily recover on WDDM)
- If reprocessing the data yields another overflow, then it could be a genuine overflow, or a catastrophic/unrecoverable lower-level failure... reprocess with generic (slow) CPU code, indicating via stderr that there are possibly significant issues going on (if the CPU code did not indicate an overflow).
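The sequence above might be sketched like this - a Python toy model, not the real CUDA app; `gpu` and `cpu` stand in for the actual CFFT-pair processing, and the overflow threshold of 30 signals is an assumption for the sketch:

```python
MAX_SIGNALS = 30  # assumed -9 overflow threshold, for illustration only

def process_with_failsafe(cfft_pair, gpu, cpu):
    """Process one CFFT pair, treating any overflow with suspicion."""
    signals = gpu(cfft_pair, reinit=False)
    if len(signals) <= MAX_SIGNALS:
        return signals              # no overflow: accept the GPU result
    # Suspicious overflow: reinitialise the GPU context and reprocess.
    signals = gpu(cfft_pair, reinit=True)
    if len(signals) <= MAX_SIGNALS:
        return signals              # recovered from a transient driver failure
    # Still overflowing: either a genuine overflow or an unrecoverable
    # GPU fault - reprocess with slow, trusted CPU code and take its verdict.
    return cpu(cfft_pair)
```

The key property is that the slow path is only ever taken on an overflow, so healthy hosts pay nothing.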
I don't know about others' feelings on the issues, but to me this kind of fail-safe behaviour, effectively reverting to an alternative fail-safe / recovery sequence, sits much better with me than other alternatives so far. I am reasonably certain that most kernel hard failures, along with many problem issues could be handled with similar techniques, and that it won't really be 'that hard' to implement them. (Though much more thorough & useful than the original CPU-fall-back sequence triggered by memory allocation problems)
Just thoughts for now, but I thought I'd throw this into the current discussion before we decide to venture down development avenues that might be somewhat tangential to longer term development goals, or interest.
Jason
-
I'd be fully in favour of this sort of "self-validation", or at least an internal sanity check on the result.
Only two caveats:
* Can we predict the possible failure modes on next-generation (and subsequent) hardware? A nice healthy crash or abend is fine: BOINC copes with that with no problems. Even complete garbage at all ARs wouldn't be too bad. But nobody expected or predicted the Fermi 'works at some ARs, garbage at others' scenario, which is the most poisonous of all (the few good results allow continuous quota for processing, with the majority going to waste). Can we presume that the next fault will manifest with excessive spikes?
* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.
On a positive note, one of the faulty hosts that Joe found overnight has been corrected (now running stock). Reading between the lines, I suspect that a discussion on the team SETI.Germany forum may have helped. Those alternative information channels could be helpful with the immediate problem, pending technical resolution.
-
* Can we predict the possible failure modes on next-generation (and subsequent) hardware?
Yes. We do so using the real-world example we have as a template, which is a worst-case total kernel failure with no detected error codes or other problems to indicate that it failed, other than data corruption. That's a rather extreme case, made by a quite special convergence of tenuously related & unique conditions involving dubious coding practice and technology changes, and Cuda 3.0 onwards adds multiple mechanisms to prevent such a convergence again... though since it happened once, we use this worst case - relatively hard to detect but easy to handle - as the template.
* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.
Not an inconsiderable amount of time, but IMO less work than built-in self-destruct mechanisms etc., taking a holistic view of the maintenance efforts involved in both directions. Basically, every Cuda invocation that exists gets recovery & redundancy through 'different vendor' hardware & code, aerospace-engineering style. Since recovery mechanisms exist via the drivers to varying degrees, the fallback/redundancy with proven code handles all failures from those, which in turn can generate a rare hard error (rather than reporting success with corrupted data).
-
Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.
One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.
-
Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.
One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.
Hi Perryjay,
Either app, stock cuda_fermi or x32f, should be fine. Could be dealing with immature drivers or, IMO more likely, overeager OC; time will tell. Yes, the more I think about it, falling back to the slowest, most reliable & proven code possible for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so I'll gear up in that direction as I move toward adding the performance improvements we've already isolated.
Jason
-
Yes, the more I think about it, falling back to the slowest, most reliable & proven code possible for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so I'll gear up in that direction as I move toward adding the performance improvements we've already isolated.
Jason
I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower than even the stock app, since the work will be done twice? Unless I didn't understand well.
-
Sunu,
as I understand it, a bad -9 overflow only runs a few seconds. What is being talked about is falling back to the CPU to try running it again, just like those that give out-of-memory messages. Though I am probably wrong about that. It will only affect -9s, and will keep a faulty machine from sending in hundreds of them. Those of us with clean-running machines shouldn't have any problem with this approach.
-
I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower than even the stock app, since the work will be done twice? Unless I didn't understand well.
Lol, no, I wouldn't bother going to the effort if it was going to make regular crunching slower ;). I would of course just throw a hard error code instead (which likewise avoids contaminating the results, but damages quota & wastes crunch time in another way).
For the most part we're really talking about properly handling situations that shouldn't ever occur on properly configured, intact hardware. The genuine -9s are the exception, for which at most the one whole CFFT pair where the overflow appeared, rather than the whole task, would be reprocessed (fractions of a second, rather than hundreds of seconds).
-
Who's keeping the list?
[Edit: Oh sorry, it's all already there ::) I missed the latest list over the holidays... - still wondering about the 6.02 though]
I think I found two more hosts with V12 on a GTX460 after a complaint of inconclusives against GPU on NC (http://setiathome.berkeley.edu/forum_thread.php?id=62698).
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5305178
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5257703
pulling some more from the database, probably duplicates from when we last checked.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938
5472266
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5149058
Also host 5508489 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5508489) is running '6.02' ?????? http://setiathome.berkeley.edu/result.php?resultid=1766879380
And doing inconclusives against x32f - just found another host with 6.02. Ouch.
Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...
-
Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.
Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.
-
Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...
Hehe :)
Usually we all stop looking for inconclusives right after an app release... And maybe that's a very bad practice :)
-
And more seriously - we have some fancy statistics from the SETI servers, but a few very important pieces are missing completely.
For example, counters that describe inconclusive and invalid rates per host per app version.
If we had those, we could do app "profiling" at quite a different level of quality.
-
Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.
Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.
Yes, thanks Richard - I saw your reply there; that's when I amended my post here.
'That build' has a problem then - there were quite a few CPU-to-GPU inconclusives over multiple hosts showing up with 6.02 on CPU - crosschecking.
OK, it's difficult to say what it's valid against, with results being purged so quickly atm, but hosts with this build have difficulties against 6.09 and x32f - I've seen valids against V12 :P
Also valids against 6.09 ::). I should have opened a new thread...
-
Isn't that what we're already talking about in http://lunatics.kwsn.net/gpu-crunching/08jn10ad-4151-19449-3-10-56-test-case.0.html ? (development area link, not available to all)
-
If that's stock 6.03 with a dodgy stderr showing the wrong version number... maybe?
Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the reported signals match - so something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner :(
-
If that's stock 6.03 with a dodgy stderr showing the wrong version number... maybe?
Yes, Richard recalled correctly; you need to look a few lines above where it says "Application version SETI@home Enhanced v6.03" to know the actual version number.
IIRC the only difference between 6.02 and 6.03 was an SSE folding variant which had to be commented out because it sometimes crashed.
Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the reported signals match - so something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner :(
Yes, even when running the intended software, the CUDA cards sometimes produce false result_overflow cases. For that matter, some CPU processing does too, though that's fairly rare. I'll attach an archive with text copies of a WU page and its five task detail pages which is mind-boggling and illustrative of the weird things which can happen.
Most inconclusives get resolved with a correct result being assimilated. This thread is about cases which are exceptions to that rule, plus cases where both of the first two results are almost certainly wrong but agree.
The only thing the Validator looks for in stderr is "result_overflow" and that's only used to set a flag when the canonical result is assimilated. Aside from that, stderr could be a quote from Nietzsche and it would make no difference to validation. It's some details of the signals in the uploaded result file which are checked by the Validator.
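As a footnote, the stderr handling described above amounts to a single substring check - a minimal sketch, where the function name is mine and not from the actual validator source:

```python
def canonical_overflow_flag(stderr_text):
    """The validator's only use of stderr, per the description above:
    note whether the canonical result reported an overflow. Everything
    else in stderr is ignored for validation purposes."""
    return "result_overflow" in stderr_text

print(canonical_overflow_flag("a quote from Nietzsche"))  # → False
```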
Joe