Seti@Home optimized science apps and information
Optimized Seti@Home apps => Discussion Forum => Topic started by: sunu on 02 Jun 2010, 05:52:39 pm
-
... and valid results get thrown out of the window...
See this workunit http://setiathome.berkeley.edu/workunit.php?wuid=609263674
My result was the only valid result from all that garbage. And was marked as invalid :(
-
Hm.... looks like our hope that an incorrect overflow gives quite random pulses was not fulfilled :(
So such a disrupted GPU state is even more dangerous for the project than was thought before!
-
Wish I hadn't seen that. I've got a couple of pendings waiting for a match that look a lot like that.
-
Here's one on my host: workunit.php?wuid=618018953 (http://setiathome.berkeley.edu/workunit.php?wuid=618018953) :(
Claggy
-
Just found this one today http://setiathome.berkeley.edu/workunit.php?wuid=618796380
-
Unfortunately there are many. Yesterday I had another one, but today it must have been deleted from the database and I can't post the link.
-
I think they are trying to keep it a secret!! Looks like as soon as we post a link to one of them they erase it from the database. ;D Must be a conspiracy.
-
Could people spotting / reporting this problem please check and report the hardware involved?
I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs.
But we need robust reports from reliable witnesses....
-
...I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs....
From the few I saw, that was the case (470 & 480 wingmen trashing & validating against one another). I do suspect that was the 'errors with 2.3' situation that Eric alluded to a while back (around the Fermi release IIRC), which might suggest some sort of flag raised somewhere that let him know, like the noisy WUs figure that used to show somewhere (?). Further conjecturing (& hoping), some double -9 intercept may be in place, explaining the rapid result removal sooner than the normal 24 hour assimilation/deletion period. If that's the case, I hope they put those through again for reprocessing.
-
Eric's comment (he actually said cuda24, which we're taking to be a typing error) was made on 20 May - well, actually late afternoon 19 May in his time zone - in the course of a conversation with David and me about Fermi issues at Beta. It came just after the corrected Fermi 6.10 application version was loaded for stock download at Beta.
The 'noisy WUs' figure is still showing on the Science status page (http://setiathome.berkeley.edu/sci_status.html). Since it's "science", I assume it's driven off the validated results transferred from the BOINC to the science database. Historically, it's been 'about 5%'. Last time I looked, it was down to 1.2%, which I took as a compliment to the Radar removal team. Now, it's showing as 4.8%, which probably reflects the scale of the "pseudo -9" problem.
I'm coming to the conclusion that nobody saw this one coming. I certainly hadn't thought about it until this afternoon, and yet I've been working closely with David / Eric / Jason on BOINC+Fermi issues. Even when I told David (much to his surprise) that the Fermi card wouldn't run the cuda23 app at Beta (during the quota overflow discussion), the penny didn't drop that the situation was already building up at Main.
I have now suggested - on boinc_dev, which is the wrong mailing list, but the only one we've got in the absence of an official seti_technical channel - that 6.10_fermi should be installed as a stock application at Main. I think that's the only sensible way to rescue the situation.
Let's hope that no eager young project puppy runs into the lab this afternoon and loads a pristine box of tapes.....
-
Could people spotting / reporting this problem please check and report the hardware involved?
I have a horrible feeling that people who just throw a Fermi card into a host and attach, are being issued with the stock Cuda23 application and immediately start trashing WUs.
But we need robust reports from reliable witnesses....
Well, some of the reported workunits have a fermi card involved while others don't. The workunit from the first post here had 3-4 corrupted results, only one, if I remember correctly, was from a fermi with a cuda23 app. The other workunit I mentioned above had two 2xx cards involved with massive amounts of corrupted -9 results.
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
-
Sorry, I didn't notice what my wingmen were running but I'm pretty sure the WUs were showing as 6.09s. I'm running a GT9500 on my vista X86 machine.
edit: Forgot to add I'm running the renamed cudart32_30_14 and cufft32_30_14 DLLs with the 197.45 driver.
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
Unfortunately, searching at Beta probably won't turn up many errors, because I've been leaning on David to get them fixed, and this particular problem (issuing work associated with a non-Fermi app, to a Fermi-equipped host) should no longer happen at Beta. There are just a few still visible on 12316 (http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=12316&offset=0&show_names=0&state=4).
All that's left is - why am I still trapped by yesterday's quota?
Max tasks per day 153
Number of tasks today 273
- and why is credit so erratic?
-
Suggestion: make copies of the WU and Task detail pages before BOINC purges them. Even better would be to find examples at SETI Beta where purging is disabled, and probably file deletion too. I started looking there, but no luck in finding any. Besides I got distracted by some of the nonsensical credit granting, one of Tetsuji's hosts recently did a set of reissued 0.448 tasks on 6.09 CUDA 23 with claims of 94.12 and grants ranging from 5.87 to 52.01 :o
Joe
[offtopic]
Credit granting on beta is absolutely screwed. If you look at the granting for AP tasks it's even more obvious
[/offtopic]
And on topic: because the first WU listed in this thread had no 2 Fermi GPUs, this problem is not only Fermi-related (unfortunately). Looks like _any_ invalid overflow has some probability of being validated :(
-
I had a look for hosts matched with my E8500 / 9800GTX+ / HD5700 that were producing inconclusive/Invalid work:
GeneralFrost hostid=5356245 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5356245) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
Arles hostid=5355863 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5355863) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Balmer hostid=5384948 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5384948) [2] NVIDIA GeForce GTX 480 (1503MB) driver: 19775
djwhu hostid=5424576 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5424576) NVIDIA GeForce GTX 480 (1503MB) driver: 19775
Andrew Bazhaw hostid=5423129 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5423129) NVIDIA GeForce GTX 480 (1503MB) driver: 19775
Ollie hostid=5371034 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5371034) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
smithwr3 hostid=5293938 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) [2] NVIDIA GeForce GTX 480 (1493MB) driver: 25715
Chris hostid=5423967 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5423967) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Anonymous hostid=4946291 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4946291) [2] NVIDIA GeForce GTX 275 (895MB) driver: 19745
D. McQueen hostid=4846359 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4846359) NVIDIA GeForce GTX 260 (895MB) driver: 19713
Rory Isenberg hostid=5255297 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5255297) NVIDIA GeForce GTX 260 (877MB) driver: 19745
NEG hostid=1931164 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=1931164) [2] NVIDIA GeForce 9600 GT (495MB) driver: 19745
Michael Sangs hostid=5354486 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5354486) [3] NVIDIA GeForce GTX 295 (895MB) driver: 19107
Tim Lee hostid=5301365 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5301365) [2] NVIDIA GeForce 9800 GTX/9800 GTX+ (1024MB) driver: 19621
and there were a few more non Fermi hosts,
Claggy >:(
Edit: another five Fermi's:
Bittkau hostid=5336843 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5336843) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
William hostid=5414447 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5414447) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
Anonymous hostid=5419662 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419662) NVIDIA GeForce GTX 470 (1248MB) driver: 19745
Setiman hostid=5227589 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5227589) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
basti84 hostid=5391741 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5391741) NVIDIA GeForce GTX 470 (1248MB) driver: 25715
edit 2: added more Fermi:
Aaron Danbury hostid=5373696 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5373696) NVIDIA GeForce GTX 480 (1503MB) driver: 25715
Anonymous hostid=5025277 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5025277) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
Edit 3: added more Fermi:
My9t5Talon hostid=5419671 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419671) NVIDIA GeForce GTX 470 (1248MB) driver: 19775
simi_id hostid=5419256 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5419256) NVIDIA GeForce GTX 480 (1503MB) driver: 19741
-
...
Claggy >:(
I think we should send David A. around to tell them off! :o
-
And the most important thing I see in this list - they are not only FERMI GPUs!
But there can still be 2 independent problems:
1) a corrupted GPU state on pre-FERMI GPUs that produces random pulses, and 2) a programmatic error in the early CUDA app that produces non-random but invalid pulses on FERMI GPUs.
The second one will always pass into the database if 2 FERMIs with broken apps meet each other.
But the first one most probably should not pass the Validator (our database Guardian in some sense became a Blind Guardian ;D ;D ;D )
The problem with the broken app for FERMI is resolvable. But whether 1) will let invalid results go into the database too or not - that's a hard question that needs some more evidence IMO. If yes.... ::)
-
Of the Fermi cards (the first 8 in Claggy's list), seven are running the stock v6.09_cuda23 application.
Just one - smithwr3 - is using an app_info, and he's got one of Raistmer's v12 builds. I don't know enough about the std_err to be able to tell whether it's the special one he did for Fermi, and since smithwr3 hasn't posted in the forums since he was struggling with a Mac almost 4 years ago, there's not much to go on.
-
whether it's the special one he did for Fermi, and since smithwr3 hasn't posted in the forums since he was struggling with a Mac almost 4 years ago, there's not much to go on.
Very low probability he could have taken that version. And even so, IMO Jason already showed that the initial CUDA MB code has a programmatic error that is "silent" on pre-FERMI GPUs but leads to invalid computations on FERMI. I used the same codebase, just rebuilt the app with the new SDK. That is, V12 is in no way FERMI compatible.
-
Just checked my validation inconclusive and found three or four where my wingman turned in .01. I should be ok on most because the third wingmen are running on their CPUs. I followed out the .01 wingmen and all seem to be running either a 470 or 480. They are also getting a lot of .01 credit claims validated. Again, tracing out their wingmen, they are also running 470/480s. I wonder just how much we are missing because the third man turns in good also. :o
-
Of the Fermi cards (the first 8 in Claggy's list), seven are running the stock v6.09_cuda23 application.
And of the five additional Fermis that Claggy has listed, every one is running stock v6.09_cuda23.
Raistmer is absolutely right to say that there are two distinct problems:
1) Random state corruption of older cards
2) Fermi cards running incompatible applications
The point is, the second problem could be solved at a stroke by deploying the v6.10 Fermi app which has been tested - and has passed the test - at Beta.
That's an incredibly easy solution, and would remove, on Claggy's figures, a hugely significant part of the problem.
The random failures would remain, to be dealt with as we understand the problem further. But that problem has existed for months, without reaching critical mass. If we remove the Fermi co-validators, it should remain insignificant: but the rise of the Fermi means we can't ignore problem #2 any longer.
-
Sure, I thought last night's project maintenance would bring 6.10 to SETI main - still not?
-
No sign of it. But something's going on: looking at Claggy's list, only one (djwhu, 5424576 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5424576)) is still blowing away significant numbers of WUs - and incidentally confirming that mid-AR WUs suffer the same fate. The latest addition (Aaron Danbury, 5373696 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5373696)) has done a few, but with the new work supply, it would normally have many more. So it seems that they may have put some sort of limiter into the system, but it's not obvious what.
David has got the message (off-list response), but hasn't got a reply from Eric yet.
And the problem is about to get worse - Fermi GTX 465s have landed in the shops, and are already being discounted: I was offered one for £215.99 in a mailshot. Won't interest the hard-core crunchers, but will certainly attract a few into the fit-and-forget segment.
-
Worth mentioning: I've had no work for the fermi application since yesterday.
All WUs coming in are for CPU only.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I think it is blocked till the situation is solved.
:)
-
Worth mentioning: I've had no work for the fermi application since yesterday.
All WUs coming in are for CPU only.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I think it is blocked till the situation is solved.
:)
A hard but probably correct decision on Berkeley's side. I hope they will be able to solve this soon, because all they need is to follow Richard's suggestion about 6.10 on main.
It doesn't look very hard to do, actually.
-
It's not blocked, because I've got four new tasks today. But I think it may be very severely throttled.
Raistmer, people on the main board are still recommending your V12b_FERMI. Has anybody actually tested it, and posted any results? If not, I've just downloaded a copy and I'll run a bench when the 470 is free (GPUGrid task due to finish in ~2 hours). If it doesn't work, as you suggested last night, I suggest you remove the download archive.
-
It's not blocked, because I've got four new tasks today. But I think it may be very severely throttled.
Raistmer, people on the main board are still recommending your V12b_FERMI. Has anybody actually tested it, and posted any results? If not, I've just downloaded a copy and I'll run a bench when the 470 is free (GPUGrid task due to finish in ~2 hours). If it doesn't work, as you suggested last night, I suggest you remove the download archive.
AFAIK Todd Hebert tested it and found it incompatible with FERMI.
It was at the beginning of the corresponding thread.
Later this info was somehow modified... So better to do a short test and close this topic completely. Surely I will remove it if it is indeed not compatible.
-
Ran the bench with a range of Joe's full-length WUs. One of them worked, but three failures: given the ARs involved, I think this needs withdrawing, pronto. Could you remove the archive, please, and get the Mods to lock your thread after a suitable explanation?
While it was running, I looked through Todd Hebert's posts. He says a couple of times that he got the files from you, and even posts an app_info for MB_6.09_CUDA_V12b_FERMI.exe (message 990059 (http://setiathome.berkeley.edu/forum_thread.php?id=59666&nowrap=true#990059)). But I don't see anywhere where he posts, or even describes, a test result advising a change of direction. Yet other people, like ScimanStev, describe downloading files from Todd which - it turns out - have the stock app included.
I can't say I'm very impressed by the integrity of this process.
-
Ran the bench with a range of Joe's full-length WUs. One of them worked, but three failures: given the ARs involved, I think this needs withdrawing, pronto. Could you remove the archive, please, and get the Mods to lock your thread after a suitable explanation?
While it was running, I looked through Todd Hebert's posts. He says a couple of times that he got the files from you, and even posts an app_info for MB_6.09_CUDA_V12b_FERMI.exe (message 990059 (http://setiathome.berkeley.edu/forum_thread.php?id=59666&nowrap=true#990059)). But I don't see anywhere where he posts, or even describes, a test result advising a change of direction. Yet other people, like ScimanStev, describe downloading files from Todd which - it turns out - have the stock app included.
I can't say I'm very impressed by the integrity of this process.
Ok, I will recommend using the stock 6.10 from beta then.
EDIT: done.
-
Thanks. I've now set it to run live on the main project - host 2901600 (http://setiathome.berkeley.edu/results.php?hostid=2901600). No problem fetching work - just a shortlist for the moment, because (a) I run a short cache, and (b) DCF hasn't settled yet - still estimating three hours!
All those pseudo -9s that we started this thread with will have driven DCF way low. I think we may have encountered another of BOINC's safety features - IIRC BOINC cuts down on work fetch if DCF ever gets into 'insane' territory, either high or low. There's a lot of very sound engineering practice in the original BOINC design, but I fear we're in danger of losing it with all these hurried, on-the-fly bodges to cope with evolving technologies like GPUs.
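For illustration, here's a minimal sketch of how a run of pseudo -9 overflows drags DCF down. The update rule and constants here are assumptions for the sketch - the real client's DCF bookkeeping and 'insane' thresholds vary by version:

```python
def update_dcf(dcf, estimated_secs, actual_secs):
    """Toy model of duration-correction-factor updates (assumption:
    jump up immediately on long tasks, drift down slowly on short ones)."""
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        return ratio                  # a long task raises DCF at once
    return dcf + 0.1 * (ratio - dcf)  # short tasks pull it down gradually

# Tasks estimated at an hour but 'finishing' in 30 seconds as pseudo -9
# overflows drag DCF far below 1:
dcf = 1.0
for _ in range(20):
    dcf = update_dcf(dcf, estimated_secs=3600, actual_secs=30)
print(dcf)  # well under 0.2 - plausibly 'insane' territory
```

The asymmetry (fast up, slow down) is the interesting part: one genuine full-length task snaps DCF back, but a stream of instant overflows poisons it for a long while.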
-
Yeah, life is too fast to properly think about it, and BOINC hasn't escaped this :) But some block with a fast reaction time to stop invalid overflows would be a good thing IMO.
They damage the project in too many ways.
-
Already got a wingmate to add to Claggy's list:
Pieter hostid=5431046 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5431046) NVIDIA GeForce 9800 GT (1005MB) driver: 19745
Host created today, downloaded 564 tasks, got two of them to validate at 0.01 credits, I'm too depressed to look-see how many pages-full he's wasted.
-
Already got a wingmate to add to Claggy's list:
Pieter hostid=5431046 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5431046) NVIDIA GeForce 9800 GT (1005MB) driver: 19745
Host created today, downloaded 564 tasks, got two of them to validate at 0.01 credits, I'm too depressed to look-see how many pages-full he's wasted.
220 pending, 2 validated, all teensie claims.
One of the two "valid" is good evidence, text captures attached as WU619984348.7z, also attaching text captures for paired 4xx case noted by Sutaru as WU619465291.7z.
Joe
-
A cuda_fermi application, v6.10, was loaded about 30 minutes ago. No-one will have any WUs yet, of course, because the splitters haven't been restarted.
I'd prefer not to test the stock download process myself if I can avoid it, because I'm rigged with an app_info and still have some VLARs waiting for optimised CPU handling. But if we could keep an eye on Claggy's list, and see if the Fermis start producing valid work, that would be good news.
-
Please check md5 (binary equivalent) of exe against beta ones
-
Please check md5 (binary equivalent) of exe against beta ones
E448A1489782723161EFAF99B9494661
in both cases. Binary FC says the same, too.
So this will be the one which describes itself as 6.09 in stderr, then ;D
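For anyone wanting to repeat that check, a chunked MD5 comparison takes only a few lines. This is just a convenience sketch (the file names would be passed on the command line), not anything project-supplied:

```python
import hashlib
import sys

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in 1 MiB chunks so large exes aren't
    loaded into memory whole. Returns an uppercase hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest().upper()

if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. python md5cmp.py main_copy.exe beta_copy.exe
    same = md5_of(sys.argv[1]) == md5_of(sys.argv[2])
    print("identical" if same else "different")
```

An identical MD5 plus a binary file compare, as Richard did, is about as strong an "it's the same build" check as you can get without the project publishing signed hashes.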
-
LoL, thanks. :D
-
Berkeley switched off all data distribution; there's a message on the front page. --->
"We are experiencing a problem such that some GPU platforms are quickly overflowing on all workunits that they receive. Rather than burn through a great deal of data that we would have to redistribute, we are turning off data distribution until we get this debugged."
08.06.2010 09:27:38 SETI@home update requested by user
08.06.2010 09:27:41 SETI@home Fetching scheduler list
08.06.2010 09:27:43 SETI@home Master file download succeeded
08.06.2010 09:27:48 SETI@home Sending scheduler request: Requested by user.
08.06.2010 09:27:48 SETI@home Reporting 6 completed tasks, requesting new tasks for CPU and GPU
08.06.2010 09:27:51 Project communication failed: attempting access to reference site
08.06.2010 09:27:51 SETI@home Scheduler request failed: Couldn't connect to server
08.06.2010 09:27:53 Internet access OK - project servers may be temporarily down.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
that's good, so they are working on it.
heinz
-
Eric reports in the 'Scheduler problems' (http://setiathome.berkeley.edu/forum_thread.php?id=60248) News thread:
"We're having difficulty getting a new scheduler running that handles cuda_fermi applications properly. We'll be down until we get it sorted out."
Claggy
-
This may be considered a case of "be careful what you wish for" - I sent Eric an email just as the lab opened yesterday, drawing attention to the scale of the problem and the list in this thread. Probably not what he wanted to read on a Monday morning, and deploying a new scheduler probably wasn't his plan for the day either.
But it had to be done - shame it didn't go smoothly.
-
LoL, I did the same and received an answer that they were working on it right now (about the time 6.10 was spotted on main). Interesting, what was wrong with 6.10 ?...
-
LoL, I did the same and received an answer that they were working on it right now (about the time 6.10 was spotted on main). Interesting, what was wrong with 6.10 ?...
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
...I sent Eric an email just as the lab opened yesterday, drawing attention to the scale of the problem and the list in this thread.
...
User "korpela" has not been active in the last 24 hours, though this thread is visible to all and he might have viewed it as a guest.
Joe
-
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
I've not seen anything like that, and my GTX 470 has tried them all (6.09, 6.08, v12b). They just fail in their various ways, and move on to the next task. It's very different from the 'sporadic error state' on older GPUs, where the failure persists from task to task until reboot.
-
I started going through my resends again this morning, no new Fermi's, just these hosts:
Sigurd G.Schinke hostid=5372764 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5372764) NVIDIA GeForce GTX 260 (881MB) driver: 19713
Marc Jarry hostid=4247889 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=4247889) NVIDIA GeForce 9600 GT (511MB) driver: 19745
BabelAbu hostid=5374194 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5374194) NVIDIA GeForce GTX 260 (1792MB) driver: 19732
malycc hostid=5386713 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5386713) NVIDIA GeForce 9500 GT (511MB) driver: 19745
The Beef hostid=5289552 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5289552) [2] NVIDIA GeForce GTX 295 (896MB) driver: 19562
Anonymous hostid=5049618 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5049618) [2] NVIDIA GeForce GTX 295 (895MB) driver: 19038
k.pieschl hostid=3192436 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=3192436) NVIDIA GeForce 9800 GT (1023MB) driver: 18634
Claggy
-
Perhaps a host which got in a bad state trying to run 6.08 on a GTX 4xx will need a reboot to clear the problem. If so, automatically updating a host to 6.10 wouldn't help.
I've not seen anything like that, and my GTX 470 has tried them all (6.09, 6.08, v12b). They just fail in their various ways, and move on to the next task. It's very different from the 'sporadic error state' on older GPUs, where the failure persists from task to task until reboot.
Thanks, I had missed that very important difference. Now that the project is flowing data again perhaps we'll be able to figure out what they had to modify.
Joe
-
Well, this should be interesting. All the work units I've downloaded since we came back up have been showing as "not in DB" on my SETI website tasks page instead of anonymous platform. I mentioned it in a post on the NC forum and Skildude posted that he is getting it too. Mine are showing up as 6.03s and 6.08s on my Manager's task list, though. It will be a while before I get to them, so I hope they will run ok.
-
Well, this should be interesting. All the work units I've downloaded since we came back up have been showing as "not in DB" on my SETI website tasks page instead of anonymous platform. I mentioned it in a post on the NC forum and Skildude posted that he is getting it too. Mine are showing up as 6.03s and 6.08s on my Manager's task list, though. It will be a while before I get to them, so I hope they will run ok.
Yes, I'm getting that too. On another note, another valid result got invalidated by two fermi cards using the wrong app. Link below, and a copy printed.
http://setiathome.berkeley.edu/workunit.php?wuid=615573439
-
Same here. Anonymous platform replaced with the new (in my case even translated to "Нет в ДБ", i.e. "Not in DB") message. For CPU-assigned tasks too.
-
On my machine all looks normal -
I get 6.03 WUs and one 6.10 (cuda fermi)
~~~~~~~~~~~~~~~~~~~~~~~~~
On my results page I see 10 entries not in DB
-
Much more disturbing message:
Your app_info.xml doesn't have a usable version of Seti@home enhanced.
Reached daily quota of 100 tasks
Well, until yesterday, berkeley liked my app_info.xml.
-
They're obviously making rather wider changes to the work scheduling process, incorporating some of the improvements which have been tested at SETI Beta recently. But it doesn't seem to be going very smoothly.
There's not supposed to be any change to the app_info format, so I suggest you treat that as a server error: don't mess around with app_info in a vain attempt to get it working!
There's no work available today anyway, so I suggest we all just sit back and watch until the dust settles.
-
We've got again those validate errors with no apparent reason http://setiathome.berkeley.edu/workunit.php?wuid=559969383
-
We've got again those validate errors with no apparent reason http://setiathome.berkeley.edu/workunit.php?wuid=559969383
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
-
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
No, it was invalidated after that June 12 (or maybe 16) result. Before that date it wasn't in my invalid list.
-
The reason is it was reported on 7 February. I don't see how you can call that "again": I would accept "still". It's those AWOL wingmates who stopped any work happening from January to April that we should be concerned about.
No, it got invalidated after that June 12 (or maybe 16) result. Before that date it wasn't in my invalid list.
It wouldn't even have been looked at until 12 June. When that second 'success' report came in, the validator will have tried to find the uploaded result file that should have preceded the 7 February report - and presumably it wasn't there.
Without a result file, that one could never be validated: presumably the file associated with the 12 June report was findable, but there was nothing to compare it with, hence 'inconclusive' and 'pending'.
-
Why wasn't it there?
-
I had a look at the list of hosts i posted earlier in this thread,
some of the Fermi's are still trying to do Cuda23 work and don't have the Fermi app,
then there's smithwr3, who's still using the optimised V12 app.
Claggy
Edit: I retract that, i was talking out of my A*se, they had done Cuda23 work at the beginning of June, and haven't done any since.
-
Got a false Invalid task today, wingmen are smithwr3 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) and Fabio Chimienti (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5440744), both running V12 still, both produced -9's, :(
wuid=646519895 (http://setiathome.berkeley.edu/workunit.php?wuid=646519895)
Claggy
-
That smithwr has 18,000 tasks on his list. It's scary thinking how many of those are returning -9.
-
That smithwr has 18,000 tasks on his list. It's scary thinking how many of those are returning -9.
Quota system doesn't work very well, does it?
Every invalid task/error resets his quota to 100, but with Seti's *8 Multiplier, he still gets 800 tasks a day,
I think the quota should go a lot lower than 100, or take the Multiplier away.
Claggy
-
If all his results are invalid, his quota should go to 1 eventually, i.e. 8 GPU tasks per day.
[BTW, it looks like the very same case the new quota system should protect against.
Even if he returns good CPU results - the GPU should be inhibited.
If not - the current quota system is still flawed.
]
[And the GPU multiplier should be removed indeed. If the GPU works fine it will have no effective limits w/o any multiplier, but if it's broken this multiplier just multiplies trashed tasks... ]
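As a toy model of the behaviour described above (assuming, purely for the sketch, that a valid result doubles the daily quota up to the cap and an invalid one halves it - the real BOINC bookkeeping is more involved):

```python
MAX_QUOTA = 100       # per-processor daily cap discussed in this thread
GPU_MULTIPLIER = 8    # Seti's *8 GPU multiplier mentioned by Claggy

def adjust_quota(quota, result_ok):
    """Simplified quota feedback: reward good results, punish bad ones.
    (Assumption: double on valid, halve on invalid, floor of 1.)"""
    if result_ok:
        return min(MAX_QUOTA, quota * 2)
    return max(1, quota // 2)

# A host returning only invalid results hits the floor of 1 within a
# handful of days, but the *8 multiplier still hands it 8 GPU tasks/day:
quota = MAX_QUOTA
for _ in range(10):
    quota = adjust_quota(quota, result_ok=False)
print(quota * GPU_MULTIPLIER)  # → 8
```

The sketch also shows Joe's later point: the penalty is delayed, so a host that goes bad with thousands of tasks already cached trashes them all before the quota ever bites.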
-
Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(
-
Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(
1223 atm. Put something big in the offset, then work your way up or down by say 100 - or larger depending on expectations. If you go too high it'll say there are no tasks to display.
It's handy for moving to the oldest pending/in progress task too.
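That offset trick is really a binary search over the results pages. A sketch, where `has_tasks_at(offset)` is a hypothetical helper standing in for "fetch results.php at that offset and see whether any rows come back":

```python
def count_tasks(has_tasks_at, hi=1 << 20):
    """Binary-search for the smallest offset that returns no tasks;
    that offset equals the total task count. `has_tasks_at` is any
    callable taking an offset and returning True if rows exist there."""
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if has_tasks_at(mid):
            lo = mid + 1   # tasks exist here, the count is higher
        else:
            hi = mid       # empty page, the count is at or below mid
    return lo

# With a pretend list of 1223 invalids (the count quoted above):
print(count_tasks(lambda off: off < 1223))  # → 1223
```

About 21 page fetches instead of 62 pages of 20, which matters when the server is as loaded as it was that week.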
-
The smithwr3 host 5293938 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938) several minutes ago had 147 "valid" tasks (http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=140&show_names=0&state=3), not all of those are false overflows, as v12 apparently does not do that on VHAR or VLAR tasks and the host is also doing CPU work. It also has 1224 invalid tasks (http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=1220&show_names=0&state=4) which probably are all false overflows, though I certainly didn't check more than a tiny sample.
The BOINC quota mechanism is, and always has been, only capable of protecting against totally bad processing, and even so the protection is delayed; a host which goes bad with thousands of tasks already cached is penalized too late to save those tasks.
Is there a chance the "Notices" in 6.11+ might be considered obvious enough that the servers could send a Notice that the host appears to have failed along with a command which causes BOINC to not start any more tasks for the project until the user has read the Notice and believes the problem fixed? I don't know if the BOINC devs would consider something like that, discussions about quota and related things on the boinc_dev list seem to always end inconclusively.
Joe
-
Could I ask all Lunatics to review http://setiathome.berkeley.edu/forum_thread.php?id=62573 ?
The old problem of false -9 overflows, caused by outdated applications running on Fermi GPUs, is still with us, and still polluting the database with junk science.
Back in June when this thread was started, the major problem was the stock applications. We got the project to clean up its act, and the stock apps are now working properly for the 'set-and-forget' types.
Which means that the remaining problems are attributable, almost exclusively, to third-party applications: V12 vlar-autokill, in the linked thread.
As I've written in that thread, vlar-autokill was a fit and proper app in its time, and I have no criticism of Raistmer for releasing it. But it's now an embarrassment.
So, how can we cork the beastie back into its bottle? Unfortunately, I can't think of a way - short of the nuclear option of the project blocking all anonymous platform apps. Maybe some bespoke programming could be put into the scheduler to selectively block known bad app/hardware pairings, but I can't imagine project staff being happy about diverting scarce development time into doing that.
But I think there are two things we ought to consider doing.
The first is to be much harder on the ill-informed message board "advisors" - people like Sutaru and skildude - who advocate optimised applications as the cure-all for everything, but consistently fail to pass on the associated responsibility for understanding and long-term monitoring.
And secondly, how about building a 'suicide pill' into the apps themselves? Maybe in the first place for Beta apps - nobody should run a Beta app for longer than, say, one month: and if they are still actively testing after that time, an enforced re-install isn't too much of a problem.
The trouble is that I don't think that anything short of a physical block (suicide pills are common for trialware) will catch the sort of users I've linked in that thread - no message board activity, no team membership. And I'm sorry, but NO: I'm not going to start sending out unsolicited PMs and emails.
The thing that worries me most of all is that I can't see users like that coming here and collecting optimised apps on their own (even though they would at least see the warnings if they did). I'm beginning to wonder how many re-hosting websites there might be out there - overclockers, BOINC team sites, that sort of thing - which might be distributing Lunatics apps with no 'best practice' advice whatsoever.
Postscript: while previewing, I saw that the previous post in this thread concerned the very same host 5293938 that also featured today. So that's over four months the problem has continued unchecked.
-
Perhaps redefining/updating to a new version/planclass, disabling work for all existing ones, is an option: forcing a stock app update, and a matching planclass update for newer opt apps. I don't know enough about the BOINC app distribution mechanism to know if & how that would work.
I wouldn't mind developing an autoupdater for future production releases here. There will be those that still circumvent both the stock & opt updates anyway - some 'legitimately', some to defiantly run what they want. The science process itself needs to catch these with the validation & quota mechanisms (and the subsequent science process, of course), since user-specific configuration & 'jiggering' might be considered as having similar destructive potential to anything from a cosmic-ray bit-flip to a massive hardware failure. That goes for any app, not just GPUs. I'm sure there are brand-name machines that just shouldn't crunch at all, and people who just should not be allowed near computers. Unfortunately we're not the PC police, though maybe we should be ;)
Promoting the use of outdated known-buggy builds, old drivers & outright 'jiggering' has gone on in the past. Especially when directed toward inexperienced users I've always found it more than a bit frustrating, and I had to put a stop to it on one specific occasion I saw it here. In that particular instance a massive argument ensued & only ended with me banning the user to think about it, which sadly escalated the argument, forcing the Admin's hand (not mine) to permanently delete the user's account. Along with security concerns, that also resulted, in part, in the tightening of beta participation requirements & restrictions here to a more select group.
While we aren't the computer police, we don't have to put up with bad advice here; we can do our best to correct faulty advice where we spot it, and try to come up with ways to encourage doing the right thing. Unfortunately, in the case of problems inherited from the Fermi incompatibility, I don't see many ways to encourage that other than simply making newer releases better, more widely compatible, AND faster - which is proving to be quite a long road.
Jason
-
I agree - we are not the computer police, nor M$, and sometimes our development time is scarce too, btw ;)
Effectively disabling malfunctioning participants is BOINC's (I repeat, BOINC's, not project staff's) prerogative. We need a framework for doing common things with it, not just the bloatware that new BOINC versions increasingly resemble. I seriously think sometimes of writing a Perl script to process all tasks in a directory without BOINC, launching BOINC only for network communications.
If plan class/version limits are not effective, new means should be integrated into BOINC, IMO. It's impossible to create an app that will work on every piece of not-yet-existing hardware where some idiots would like to run it. I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU when it gives errors almost constantly - maybe people just never look at their results page?...
About the "suicide pill" - if it's implemented as a library with an easy enough interface, I'm ready to include it in my builds. But I have no intention of developing such a thing myself.
P.S. And about bad advice... It's a real problem, IMHO, but recently I've just grown tired of arguing with bad advice. I just try to give a more correct answer to the original poster without debating the uneducated but vocal ones. Life is short....
-
About "suicide pill" - if it's implemented as library with enough easy to use interface I'm ready to include it in my builds. But have no intentions to develop such thing.
I'll 'consider' it as possibility for future release, though it doesn't, of course, solve either the outdated release, or the intentionally 'jiggered' environments, so I'm approaching it (the whole idea) with some scepticism (probably a good thing).
I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU when it gives errors almost constantly - maybe people just never look at their results page?...
Yes, partly that. And now add to that certain people espousing overriding the stock app with stock Cuda23 (via app_info), insisting that's the fastest... got it? (Logical conclusions evolving from that may include that v12 VLARkill would be a good idea to use on Fermi... IT ISN'T, for anyone reading - don't do it!)
-
BTW, if I recall right, there was a similar problem on the Einstein project with one of Akosf's (not sure I reproduced the nickname right, Akos) opt builds. It failed to process new data correctly.
The same problem also appeared at least once on MW, again with a 3rd-party app.
What did the project admins do in those cases? Maybe the SETI project can learn from it?
-
What did the project admins do in those cases? Maybe the SETI project can learn from it?
Don't know. Some of the responsibility lies with the 3rd-party developers, IMO of course, and as we take pains to improve validation & stability at every step, we endeavour to meet those responsibilities. But let's face it: if erroneous app results can get through 'the system', then so can hardware faults, bad configurations, and sheer vandalism. The 'science' catches that, not the validators or mySQL queries (unless relational databases have become more sophisticated than I remember, it's a reduced database, not a knowledgebase ;) Those are project staff, plus the NTPCKR & RFI systems, to handle that further on.)
For our purposes here, I think we need to devalue the integrity of a single detection in our minds, when the reality is that it would go through a whole persistence & re-observation process before publication of any WoW-ness. We don't need another WoW-like signal; we have one of those, and it has proven insufficient to scientifically confirm the presence of an extraterrestrial civilisation.
Jason
-
BTW, there is another possible cause of such long-unmaintained host setups.
Sometimes a host can escape the grasp of the initial installer. I have such a host in my fleet, for example. It still produces correct data, but even if it goes wrong, I won't be able to do anything about it.
Perhaps host deletion/blocking from the SETI web site should be added....
-
Perhaps host deletion/blocking from the SETI web site should be added....
Joking: an automated system to contact the ISP, requesting they connect 240 Volts down the cable? ;D
More seriously, the updated version/planclass described earlier should cut work off from that host, though as mentioned I don't know the practicalities of that.
-
About "suicide pill" - if it's implemented as library with enough easy to use interface I'm ready to include it in my builds. But have no intentions to develop such thing.
When I was asked to produce a 'trial' version of one of my programs, many years ago, I found some shareware which could be applied to the completed, compiled .exe file - a sort of wrapper. That, of course, meant that the trial version of the program was identical to the paid-for version, and the wrapper was completely independent of the development environment used to produce it. But IIRC, it popped up a nag screen saying how many more days it could be used - not advisable for an app with no other UI ::). There were other restriction policies available, but that's the only one I used.
It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?
-
It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?
OK, a dumb idea along those lines from me: induce an update cycle & reset the project (via boinccmd.exe or a similar mechanism) if not current.
[Later:] Something looking potentially relevant to today's discussion... NTPCKR seems to be configured to ignore spikes (my guess, nothing more, from the command line).
+ <daemon>
+ <host>maul</host>
+ <cmd>ntpckr.x86_64 -nospikes -mod 4 0 -hpsd -dayscool 5 -summarize -projectdir /home/boincadm/projects/sah</cmd>
+ <output>ntpckr1.log</output>
+ <pid_file>ntpckr1.pid</pid_file>
+ <disabled>1</disabled>
+ </daemon>
...
+ <task>
+ <host>maul</host>
+ <cmd>update_candidate_counts.x86_64 -projectdir /home/boincadm/projects/sah</cmd>
+ <output>update_candidate_counts.log</output>
+ <period>1 hours</period>
+ <disabled>0</disabled>
+ </task>
-
I thought they said they had something to catch false valids. I guess that entry in ntpckr about nospikes is what I was thinking about. I try to make sure the people I recommend optimized apps to understand the risks involved, but I do forget occasionally. I also ask that they check their work and come back if they have any problems. I also send PMs now and then, but as Richard said in the SETI thread, I don't like to do that. Another problem is that so many of the problem machines belong to anonymous users, or have email notification turned off so they never see they have a PM. I guess it comes down to: we all do what we can, and if that is not enough, so be it - we tried.
-
I thought they said they had something to catch false valids. I guess that entry in ntpckr about nospikes is what I was thinking about. .... I guess it comes down to: we all do what we can, and if that is not enough, so be it - we tried.
Exactly. We can do our best within 'reasonable' efforts, but there will always be those situations & personalities that escape or intentionally avoid the 'right thing'. It is really the ultimate duty of the project in question, and as Raistmer indicates the Boinc framework itself, to ensure the integrity of any results is adequate to support the claims made in any published announcement or material.
I ask people to step back & take a look for a minute. This is part of the science of distributed computing, and a worthy challenge: to make sure we are doing what we can, and that the system can be robustified to adequately handle as many possibilities as we can going forward. We want, as developers, to strive toward perfection, whatever that is, but it is not a realistic goal to use absolute measures. The universe is NOT digital.
Jason
-
The universe is NOT digital.
Jason
Just can't restrain myself ;D : only if we aren't all living on the 13th Floor... ;) (http://www.imdb.com/title/tt0139809/plotsummary)
-
The universe is NOT digital.
Jason
Just can't restrain myself ;D : only if we aren't all living on the 13th Floor... ;) (http://www.imdb.com/title/tt0139809/plotsummary)
LoL, it reminds me of when 'The Matrix' came out. Many presented theories suggesting religious & scientific connections. I proposed it was a movie made to make money & was lambasted for that ::) oh well...
-
Hehe, the second and, especially, the third parts - definitely ;)
[
And I was "killed" by their idea of using humans as an energy source - quite a dumb idea IMHO. Why not read good books and take some ideas from there - Dan Simmons' "Endymion" for example, where the AI used human brains much more cleverly IMHO :)
]
-
Hehe, the second and, especially, the third parts - definitely ;)
[
And I was "killed" by their idea of using humans as an energy source - quite a dumb idea IMHO. Why not read good books and take some ideas from there - Dan Simmons' "Endymion" for example, where the AI used human brains much more cleverly IMHO :)
]
There's probably some element of my opinions that could be connected with sci-fi writing - more precisely, some Asimov-style 'anarchy' or 'fate' through statistical inevitability or chaos (Hari Seldon style). I still find Doc E.E. Smith's notions of overcoming the laws of nature appealing, allowing us to throw stars & planets about like billiard balls should the need arise (you never know when you might need to throw a planet around), but I don't see the two ideas as completely mutually exclusive.
Jason
-
I think we should write an error file, something like this, in our project app.
With its help we can count the errors and avoid malfunctions.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/* Crash counter sketch: bump a counter in a sentinel file at startup,
   delete the file on a clean exit. A crash leaves the counter behind,
   so repeated crashes accumulate until MAX_ERR triggers a bail-out. */
#include <stdio.h>

#ifndef MAX_ERR
#define MAX_ERR 20
#endif

#define ERR_FILE "errorfile"

int main(void)
{
    int errcount = 0;
    FILE *f = fopen(ERR_FILE, "r");
    if (f) {                /* error file exists: earlier runs didn't exit cleanly */
        fscanf(f, "%d", &errcount);
        fclose(f);
    }
    errcount++;             /* count this run now, so a crash is already counted */
    f = fopen(ERR_FILE, "w");
    if (f) { fprintf(f, "%d", errcount); fclose(f); }

    if (errcount >= MAX_ERR) {
        /* here we can do what we want to do:
           delete the .exe, reset the project, reset the machine... */
        return 1;
    }

    /* .... normal program code ....
       .... crash or no crash .... */

    /* no crash occurred, so we reach the normal end: */
    remove(ERR_FILE);       /* clear the counter on a clean exit */
    return 0;               /* --> next job */
}
heinz
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
This really makes it more complex to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (by BOINC?) and reset the project on them.
-
Heinz, the whole problem with the old app and Fermi is that the app doesn't crash - it just produces a trash result.
This really makes it more complex to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (by BOINC?) and reset the project on them.
Resetting the project on such a host won't help; the app_info and optimised apps are still kept in this situation - only a detach will get rid of optimised apps.
Can the project force hosts to detach? I've had all my WUs reported as detached (when the server was under stress and issuing ghosts),
but the apps and app_info were still there,
Claggy
-
At the moment, toward development of future applications, part of my goal is some robustification. While this won't address the existing zombie hosts being discussed here, higher performance would be a fairly strong incentive for at least some of those to update. So while existing X-series builds don't have the familiar Fermi issues, some safeguards can be put in place to ensure a repeat of this scenario, or a similar one, cannot occur again. I feel that development in that direction would be more rewarding than an added kill switch or other disabling mechanisms, yet would still handle extreme cases of misconfiguration/incorrect installation or other hardware or driver failures. (The x33 prototype already has the primitive, limited-effectiveness CPU fall-back reinstated with the slowest possible code, but I've yet to see it actually fall back, so some trigger would be needed to test this part...)
Just bouncing things around for now. How about, as an example for addressing the '-9 overflow' scenario directly:
- An overflow on spikes is found in any single CFFT pair, or a sequence of them
- That overflow is to be treated with suspicion: enter a fail-safe sequence
- The fail-safe sequence records some indicators to stderr, reinitialises the Cuda context entirely, and reprocesses the same CFFT sequence
- If this reprocessing of the data now yields no overflow, then processing can continue, on the presumption that we have recovered from some catastrophic driver error or other issue (some forms of recovery won't apply under XPDM, but will happily recover on WDDM)
- If reprocessing the data yields another overflow, then it could be a genuine overflow, or a catastrophic/unrecoverable lower-level failure... reprocess with generic (slow) CPU code, indicating via stderr that there are possibly significant issues going on (if the CPU code did not indicate an overflow).
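The sequence above might be sketched like this - a Python toy model, not the real CUDA app; `gpu` and `cpu` stand in for the actual CFFT-pair processing, and the overflow threshold of 30 signals is an assumption for the sketch:

```python
MAX_SIGNALS = 30  # assumed -9 overflow threshold, for illustration only

def process_with_failsafe(cfft_pair, gpu, cpu):
    """Process one CFFT pair, treating any overflow with suspicion."""
    signals = gpu(cfft_pair, reinit=False)
    if len(signals) <= MAX_SIGNALS:
        return signals              # no overflow: accept the GPU result
    # Suspicious overflow: reinitialise the GPU context and reprocess.
    signals = gpu(cfft_pair, reinit=True)
    if len(signals) <= MAX_SIGNALS:
        return signals              # recovered from a transient driver failure
    # Still overflowing: either a genuine overflow or an unrecoverable
    # GPU fault - reprocess with slow, trusted CPU code and take its verdict.
    return cpu(cfft_pair)
```

The key property is that the slow path is only ever taken on an overflow, so healthy hosts pay nothing.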
I don't know about others' feelings on the issues, but to me this kind of fail-safe behaviour, effectively reverting to an alternative fail-safe / recovery sequence, sits much better with me than other alternatives so far. I am reasonably certain that most kernel hard failures, along with many problem issues could be handled with similar techniques, and that it won't really be 'that hard' to implement them. (Though much more thorough & useful than the original CPU-fall-back sequence triggered by memory allocation problems)
Just thoughts for now, but I thought I'd throw this into the current discussion before we decide to venture down development avenues that might be somewhat tangential to longer term development goals, or interest.
Jason
-
I'd be fully in favour of this sort of "self-validation", or at least an internal sanity check on the result.
Only two caveats:
* Can we predict the possible failure modes on next-generation (and subsequent) hardware? A nice healthy crash or abend is fine: BOINC copes with that with no problems. Even complete garbage at all ARs wouldn't be too bad. But nobody expected or predicted the Fermi 'works at some ARs, garbage at others' scenario, which is the most poisonous of all (the few good results allow continuous quota for processing, with the majority going to waste). Can we presume that the next fault will manifest with excessive spikes?
* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.
On a positive note, one of the faulty hosts that Joe found overnight has been corrected (now running stock). Reading between the lines, I suspect that a discussion on the team SETI.Germany forum may have helped. Those alternative information channels could be helpful with the immediate problem, pending technical resolution.
-
* Can we predict the possible failure modes on next-generation (and subsequent) hardware?
Yes. We do so using the real-world example we have as a template, which is a worst-case total kernel failure with no detected error codes or other problems to indicate that it failed, other than data corruption. That's a rather extreme case, made by a quite special convergence of tenuously related & unique conditions involving dubious coding practice and technology changes, and Cuda 3.0 onwards adds multiple mechanisms to prevent such a convergence again... though since it happened once, we use this worst case - relatively hard to detect but easy to handle - as the template.
* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.
Not an inconsiderable amount of time, but IMO less work than built-in self-destruct mechanisms etc., taking a holistic view of the maintenance efforts involved in both directions. Basically, every Cuda invocation that exists gets recovery & redundancy through 'different vendor' hardware & code, aerospace-engineering style. Since recovery mechanisms exist via the drivers to varying degrees, the fallback/redundancy with proven code handles all failures from those, which in turn can generate a rare hard error (rather than reporting success with corrupted data).
-
Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.
One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.
-
Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.
One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.
Hi Perryjay,
Either app, stock cuda_fermi or x32f, should be fine. Could be dealing with immature drivers or, IMO more likely, overeager OC; time will tell. Yes, the more I think about it, falling back to the slowest, most reliable & proven code possible for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so I'll gear up in that direction as I move toward adding the performance improvements we've already isolated.
Jason
-
Yes, the more I think about it, falling back to the slowest, most reliable & proven code possible for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so I'll gear up in that direction as I move toward adding the performance improvements we've already isolated.
Jason
I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower than even the stock app, since the work will be done twice? Unless I didn't understand well.
-
Sunu,
as I understand it, a bad -9 overflow only runs a few seconds. What is being talked about is falling back to the CPU to try running it again, just like those that give out-of-memory messages. Though I am probably wrong about that. It will only affect -9s, and will keep a faulty machine from sending in hundreds of them. Those of us with clean-running machines shouldn't have any problem with this approach.
-
I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower than even the stock app, since the work will be done twice? Unless I didn't understand well.
Lol, no, I wouldn't bother going to the effort if it was going to make regular crunching slower ;). I would of course just throw a hard error code instead (which likewise avoids contaminating the results, but damages quota & wastes crunch time in another way).
For the most part we're really talking about properly handling situations that shouldn't ever occur on properly configured, intact hardware. The genuine -9s are the exception, for which at most the one whole CFFT pair where the overflow appeared, rather than the whole task, would be reprocessed (fractions of a second, rather than hundreds of seconds).
-
Who's keeping the list?
[Edit: Oh sorry, it's all already there ::) I missed the latest list over the holidays... - still wondering about the 6.02 though]
I think I found two more hosts with V12 on a GTX460 after a complaint of inconclusives against GPU on NC (http://setiathome.berkeley.edu/forum_thread.php?id=62698).
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5305178
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5257703
pulling some more from the database, probably duplicates from when we last checked.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938
5472266
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5149058
Also host 5508489 (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5508489) is running '6.02' ?????? http://setiathome.berkeley.edu/result.php?resultid=1766879380
And doing inconclusives against x32f - just found another host with 6.02. Ouch.
Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...
-
Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.
Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.
-
Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...
Hehe :)
Usually we all stop looking for inconclusives right after an app release... And maybe that's a very bad practice :)
-
And more seriously - we have some fancy statistics from the SETI servers, but a few very important pieces are missing completely.
For example, counters that describe inconclusive and invalid rates per host per app version.
If we had those, we could do app "profiling" at quite a different level of quality.
-
Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.
Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.
Yes, thanks Richard - I saw your reply there; that's when I amended my post here.
'That build' has a problem then - there were quite a few CPU-to-GPU inconclusives over multiple hosts showing up with 6.02 on CPU - crosschecking.
OK, it's difficult to say what it's valid against, with results being purged so quickly atm, but hosts with this build have difficulties against 6.09 and x32f - I've seen valids against V12 :P
Also valids against 6.09 ::). I should have opened a new thread...
-
Isn't that what we're already talking about in http://lunatics.kwsn.net/gpu-crunching/08jn10ad-4151-19449-3-10-56-test-case.0.html ? (development area link, not available to all)
-
If that's stock 6.03 with a dodgy stderr showing the wrong version number... maybe?
Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the reported signals match - so something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner :(
-
If that's stock 6.03 with a dodgy stderr showing the wrong version number... maybe?
Yes, Richard recalled correctly; you need to look a few lines above where it says "Application version SETI@home Enhanced v6.03" to know the actual version number.
IIRC the only difference between 6.02 and 6.03 was an SSE folding variant which had to be commented out because it sometimes crashed.
Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the reported signals match - so something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner :(
Yes, even when running the intended software, the CUDA cards sometimes produce false result_overflow cases. For that matter, some CPU processing does too, though that's fairly rare. I'll attach an archive with text copies of a WU page and its five task detail pages which is mind-boggling and illustrative of the weird things which can happen.
Most inconclusives get resolved with a correct result being assimilated. This thread is about cases which are exceptions to that rule, plus cases where both of the first two results are almost certainly wrong but agree.
The only thing the Validator looks for in stderr is "result_overflow" and that's only used to set a flag when the canonical result is assimilated. Aside from that, stderr could be a quote from Nietzsche and it would make no difference to validation. It's some details of the signals in the uploaded result file which are checked by the Validator.
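As a footnote, the stderr handling described above amounts to a single substring check - a minimal sketch, where the function name is mine and not from the actual validator source:

```python
def canonical_overflow_flag(stderr_text):
    """The validator's only use of stderr, per the description above:
    note whether the canonical result reported an overflow. Everything
    else in stderr is ignored for validation purposes."""
    return "result_overflow" in stderr_text

print(canonical_overflow_flag("a quote from Nietzsche"))  # → False
```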
Joe