Author Topic: When corrupted results get validated... (Read 59875 times)

Claggy · « **Reply #60 on:** 23 Aug 2010, 04:18:17 pm »

Quote from: perryjay on 23 Aug 2010, 04:09:18 pm

That smithwr has 18,000 tasks on his list. It's scary thinking how many of those are returning -9.

Quota system doesn't work very well does it?

Every invalid task/error resets his quota to 100, but with Seti's *8 Multiplier, he still gets 800 tasks a day.

I think the quota should go a lot lower than 100, or take the Multiplier away.

Claggy

Raistmer · « **Reply #61 on:** 23 Aug 2010, 04:41:58 pm »

If all his results are invalid his quota should go to 1 eventually, i.e. 8 GPU tasks per day.
[BTW, this looks like the very case the new quota system should protect against.
Even if he returns good CPU results, the GPU should be inhibited.
If not, the current quota system is still flawed.
]
[And the GPU multiplier should indeed be removed. If the GPU works fine it will have no effective limits w/o any multiplier, but if it is broken this multiplier just multiplies trashed tasks... ]
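
The quota arithmetic in the two posts above can be sketched in a few lines. This is only an illustration that takes the figures quoted in the thread at face value (a quota of 100 after a reset, a floor of 1, and a GPU multiplier of 8); the function name and parameters are made up for the example and are not BOINC's actual scheduler code.

```python
def effective_daily_limit(base_quota: int, gpu_multiplier: int = 8) -> int:
    """Tasks per day a host may fetch: per-host quota times the project multiplier.

    Hypothetical helper for illustration; not part of BOINC.
    """
    return base_quota * gpu_multiplier


# Quota reset to 100 with the *8 multiplier, as in Claggy's post.
print(effective_daily_limit(100))
# Quota decayed all the way to 1, as in Raistmer's post.
print(effective_daily_limit(1))
# Same floor with the multiplier taken away.
print(effective_daily_limit(1, gpu_multiplier=1))
```

With the multiplier in place, even a host at the minimum quota can still trash eight tasks a day, which is the point both posters are making.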

perryjay · « **Reply #62 on:** 23 Aug 2010, 06:54:59 pm »

Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(

Miep · « **Reply #63 on:** 23 Aug 2010, 07:02:09 pm »

Quote from: perryjay on 23 Aug 2010, 06:54:59 pm

Here's his invalid list.. http://setiathome.berkeley.edu/results.php?hostid=5293938&offset=0&show_names=0&state=4 Anyone want to try to count them? I got four pages into it and decided not! :-(

1223 atm. Put something big in offset, then work your way up or down by, say, 100, or more depending on expectations. If you go too high it'll tell you there are no tasks to display.
Handy for moving to the oldest pending/in-progress task too.
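
The offset trick above is essentially a search for the largest offset that still shows a task. A minimal sketch of that idea, assuming a probe `page_has_tasks(offset)` that returns True while the results page at that offset still lists at least one task (the probe and the function name are hypothetical; here the probe is simulated rather than fetching the real page):

```python
from typing import Callable


def count_tasks(page_has_tasks: Callable[[int], bool]) -> int:
    """Count tasks using only 'does this offset still show a task?' probes."""
    if not page_has_tasks(0):
        return 0
    # Jump ahead in doubling steps until the page comes back empty.
    hi = 1
    while page_has_tasks(hi):
        hi *= 2
    lo = hi // 2  # known to show a task; hi is known to be empty
    # Narrow the gap by halving, instead of stepping by a fixed 100.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if page_has_tasks(mid):
            lo = mid
        else:
            hi = mid
    return lo + 1  # offsets are zero-based


# Simulated host with 1223 invalid tasks, the figure quoted above.
simulated_probe = lambda offset: offset < 1223
print(count_tasks(simulated_probe))
```

Doubling and then halving needs only a couple of dozen page loads for a list of this size, versus paging through it 20 tasks at a time.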

Josef W. Segur · « **Reply #64 on:** 23 Aug 2010, 07:10:20 pm »

The smithwr3 host 5293938 several minutes ago had 147 "valid" tasks, not all of those are false overflows, as v12 apparently does not do that on VHAR or VLAR tasks and the host is also doing CPU work. It also has 1224 invalid tasks which probably are all false overflows, though I certainly didn't check more than a tiny sample.

The BOINC quota mechanism is, and always has been, only capable of protecting against totally bad processing, and even so the protection is delayed; a host which goes bad with thousands of tasks already cached is penalized too late to save those tasks.

Is there a chance the "Notices" in 6.11+ might be considered obvious enough that the servers could send a Notice that the host appears to have failed along with a command which causes BOINC to not start any more tasks for the project until the user has read the Notice and believes the problem fixed? I don't know if the BOINC devs would consider something like that, discussions about quota and related things on the boinc_dev list seem to always end inconclusively.
Joe

Richard Haselgrove · « **Reply #65 on:** 30 Dec 2010, 10:29:31 am »

Could I ask all Lunatics to review http://setiathome.berkeley.edu/forum_thread.php?id=62573 ?

The old problem of false -9 overflows, caused by outdated applications running on Fermi GPUs, is still with us, and still polluting the database with junk science.

Back in June when this thread was started, the major problem was the stock applications. We got the project to clean up its act, and the stock apps are now working properly for the 'set-and-forget' types.

Which means that the remaining problems are attributable, almost exclusively, to third-party applications: V12 vlar-autokill, in the linked thread.

As I've written in that thread, vlar-autokill was a fit and proper app in its time, and I have no criticism of Raistmer for releasing it. But it's now an embarrassment.

So, how can we cork the beastie back into its bottle? Unfortunately, I can't think of a way - short of the nuclear option of the project blocking all anonymous platform apps. Maybe some bespoke programming could be put into the scheduler to selectively block known bad app/hardware pairings, but I can't imagine project staff being happy about diverting scarce development time into doing that.

But I think there are two things we ought to consider doing.

The first is to be much harder on the ill-informed message board "advisors" - people like Sutaru and skildude - who advocate optimised applications as the cure-all for everything, but consistently fail to pass on the associated responsibility for understanding and long-term monitoring.

And secondly, how about building a 'suicide pill' into the apps themselves? Maybe in the first place for Beta apps - nobody should run a Beta app for longer than, say, one month: and if they are still actively testing after that time, an enforced re-install isn't too much of a problem.
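
As a rough illustration of the 'suicide pill' idea, the check itself could be as small as comparing a build date baked into the app against the current date. This is only a sketch under assumptions: the 30-day limit stands in for "say, one month", and `build_expired` and `MAX_AGE` are hypothetical names, not anything in the existing apps.

```python
from datetime import date, timedelta

# Assumed lifetime for a Beta build ("say, one month" in the post above).
MAX_AGE = timedelta(days=30)


def build_expired(build_date: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """Return True once the build is older than its allowed lifetime."""
    return today - build_date > max_age


# A Beta built on 1 Dec 2010 is still usable on 30 Dec 2010 (29 days old)...
print(build_expired(date(2010, 12, 1), date(2010, 12, 30)))
# ...but refuses to run by 15 Jan 2011 (45 days old).
print(build_expired(date(2010, 12, 1), date(2011, 1, 15)))
```

An app using such a check would exit before starting any task once it returns True, which forces the re-install. It relies on the host clock, so it would not stop a user who deliberately sets the date back.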

The trouble is that I don't think that anything short of a physical block (suicide pills are common for trialware) will catch the sort of users I've linked in that thread - no message board activity, no team membership. And I'm sorry, but NO: I'm not going to start sending out unsolicited PMs and emails.

The thing that worries me most of all is that I can't see users like that coming here and collecting optimised apps on their own (even though they would at least see the warnings if they did). I'm beginning to wonder how many re-hosting websites there might be out there - overclockers, BOINC team sites, that sort of thing - which might be distributing Lunatics apps with no 'best practice' advice whatsoever.

Postscript: while previewing, I saw that the previous post in this thread concerned the very same host 5293938 that also featured today. So that's over four months the problem has continued unchecked.

Jason G · « **Reply #66 on:** 30 Dec 2010, 11:49:43 am »

Perhaps redefining/updating to a new version/planclass, disabling work for all existing ones is an option. Forcing a stock app update, and matching planclass update for newer opt apps. I don't know enough about that Boinc app distribution mechanism to know if & how that would work.

I wouldn't mind developing an autoupdater for future production releases here. There will be those that will still circumvent both the stock & opt update anyway, some 'legitimately', some to defiantly run what they want anyway. The science process itself needs to catch these with the validation & quota mechanisms (and subsequent science process of course), since user specific configuration & 'jiggering' might be considered as having similar destructive potential as anywhere from a cosmic ray bit-flip to a massive hardware failure. That goes for any app, not just GPUs. I'm sure there are brand machines that just shouldn't crunch at all, people that just should not be allowed near computers. Unfortunately we're not the PC police, though maybe we should be.

Promoting use of outdated known buggy builds, old drivers & outright 'jiggering' has gone on in the past. Especially when directed toward inexperienced users I've always found it more than a bit frustrating, and had to put a stop to it on one specific occasion I've seen it here. In one particular instance massive argument ensued & only ended with me banning the user to think about it, which sadly escalated the argument, forcing the Admins' hand (not mine) to permanently delete the user's account. Along with security concerns, that also resulted, in part, in the tightening of beta participation requirements & restrictions here to a more select group.

While we aren't the computer police, we don't have to put up with bad advice here, and can do our best to correct faulty advice where we spot it, and try to come up with ways to encourage doing the right thing. Unfortunately in the case of problems inherited from the Fermi incompatibility, I don't see a lot of ways to encourage that other than simply making newer releases better, more widely compatible, AND faster, which is proving to be quite a long road.

Jason

Raistmer · « **Reply #67 on:** 30 Dec 2010, 12:36:22 pm »

I agree, we are not the computer police, not M$, and sometimes our development time is scarce too, btw.

Effectively disabling malfunctioning participants is BOINC's (I repeat, BOINC's, not project staff's) prerogative. We need a framework for doing common things with it, not just bloatware, which new BOINC versions resemble more and more. I sometimes seriously think about writing a Perl script to process all tasks in a directory w/o BOINC and launch it only for network communications.

If plan class/version limits are not effective, new means should be integrated into BOINC, IMO. It's impossible to create an app that will work on every piece of hardware that doesn't even exist yet where some idiots would like to run it. I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU if it gives errors almost constantly; maybe people just never look at the results page?...

About the "suicide pill": if it's implemented as a library with an easy enough interface, I'm ready to include it in my builds. But I have no intention of developing such a thing.

P.S. And about bad advice... It's a real problem, IMHO, but recently I've just grown tired of arguing with bad advice. I just try to give a more correct answer to the original poster w/o getting into discussions with the uneducated but active ones. Life is short....

Jason G · « **Reply #68 on:** 30 Dec 2010, 12:43:00 pm »

Quote from: Raistmer on 30 Dec 2010, 12:36:22 pm

About the "suicide pill": if it's implemented as a library with an easy enough interface, I'm ready to include it in my builds. But I have no intention of developing such a thing.

I'll 'consider' it as a possibility for a future release, though it doesn't, of course, solve either the outdated release, or the intentionally 'jiggered' environments, so I'm approaching it (the whole idea) with some scepticism (probably a good thing).

Quote

I truly can't understand how someone can use a non-Fermi-compatible app on a Fermi GPU if it gives errors almost constantly; maybe people just never look at the results page?...

Yes, partly that. And now add to that certain people espousing overriding the stock app with stock Cuda23 (via app_info), insisting that's the fastest... got it? (Problem: logical conclusions evolving from that may include that v12 VLArKill would be a good idea to use on Fermi... IT ISN'T, for anyone reading; don't do it!)

Raistmer · « **Reply #69 on:** 30 Dec 2010, 12:57:50 pm »

BTW, if I recall right, there was a similar problem with the Einstein project and one of Akosf's (not sure I reproduced the nickname right, Akos) opt builds. It failed to process new data correctly.
The same problem also appeared at least once on MW, again with a 3rd-party app.
What did they (project admins) do in those cases? Maybe the SETI project can learn from it?

Jason G · « **Reply #70 on:** 30 Dec 2010, 01:11:57 pm »

Quote from: Raistmer on 30 Dec 2010, 12:57:50 pm

What did they (project admins) do in those cases? Maybe the SETI project can learn from it?

Don't know. Some point of responsibility lies with the 3rd party developers, IMO of course, and as we take pains to improve validation & stability at every step, we endeavour to meet those responsibilities. But let's face it, if erroneous app results can get through 'the system', then so can hardware faults, bad configurations, and sheer vandalism. The 'science' catches that, not validators or MySQL queries (unless relational databases have become more sophisticated than I remember, it's a reduced database, not a knowledgebase. Those are project staff, NTPCKR & RFI systems to handle that further on.)

For our purposes here, I think we need to devalue the integrity of a single detection in our minds, when the reality says it would go through a whole persistence & re-observation process before publication of any WoWness. We don't need another WoW-like signal; we have one of those and it has proven insufficient to scientifically confirm the presence of an extraterrestrial civilisation.

Jason

Raistmer · « **Reply #71 on:** 30 Dec 2010, 01:13:56 pm »

BTW, there is another possible explanation for such a long-unmaintained host setup.
Sometimes a host can escape the grasp of whoever installed it. I have such a host in my fleet, for example. It still produces correct data, but even if it goes wrong, I will not be able to do anything with it.
Perhaps host deletion/blocking from the SETI web site should be added....

Jason G · « **Reply #72 on:** 30 Dec 2010, 01:16:12 pm »

Quote from: Raistmer on 30 Dec 2010, 01:13:56 pm

Perhaps host deletion/blocking from the SETI web site should be added....

Joking: an automated system to contact the ISP, requesting they connect 240 volts down the cable?

More seriously, updated version/planclass described earlier should cut work off from that host, though as mentioned I don't know the practicalities of that.

Richard Haselgrove · « **Reply #73 on:** 30 Dec 2010, 01:32:52 pm »

Quote from: Raistmer on 30 Dec 2010, 12:36:22 pm

About the "suicide pill": if it's implemented as a library with an easy enough interface, I'm ready to include it in my builds. But I have no intention of developing such a thing.

When I was asked to produce a 'trial' version of one of my programs, many years ago, I found some shareware which could be applied to the completed, compiled .exe file - a sort of wrapper. That, of course, meant that the trial version of the program was identical to the paid-for version, and the wrapper was completely independent of the development environment used to produce it. But IIRC, it popped up a nag screen saying how many more days it could be used - not advisable for an app with no other UI. There were other restriction policies available, but that's the only one I used.

It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?

Jason G · « **Reply #74 on:** 30 Dec 2010, 01:42:55 pm »

Quote from: Richard Haselgrove on 30 Dec 2010, 01:32:52 pm

It's only installed on a now-retired development machine - I could fire that up and retrieve it if anyone wants. Similarly, newer/better products should be available?

OK, a dumb idea along those lines from me. Induce an update cycle & reset the project (via boinccmd.exe or similar mechanism) if not current.

[Later:] Something looking potentially relevant to today's discussion... NTPCKR seems to be configured to ignore spikes (my guess, nothing more, from the command line).

Quote

+ <daemon>
+ <host>maul</host>
+ <cmd>ntpckr.x86_64 -nospikes -mod 4 0 -hpsd -dayscool 5 -summarize -projectdir /home/boincadm/projects/sah</cmd>
+ <output>ntpckr1.log</output>
+ <pid_file>ntpckr1.pid</pid_file>
+ <disabled>1</disabled>
+ </daemon>
...
<task>
+ <host>maul</host>
+ <cmd>update_candidate_counts.x86_64 -projectdir /home/boincadm/projects/sah</cmd>
+ <output>update_candidate_counts.log</output>
+ <period>1 hours</period>
+ <disabled>0</disabled>
+ </task>
