To help alleviate panic, here's the gist of our general plan:

1. Get thumper back up and running with a three-way root mirror. If all goes well, this will be done enough sometime tomorrow (Wednesday), i.e. we'll have a two-way root mirror and let the third drive sync up in the background while we bring the system up. Then, during next week's outage, we'll do more drive swapping to install grub and finish the resync on that third drive. Splitting/assimilating will be completely off for all projects until thumper is back up.

2. As soon as thumper is back up (tomorrow?) we can turn splitting/assimilating on for AP and get to work on the pulse table reconfiguration (which we can only do if the system/database is up). The plan, in simplest terms: create new database chunks, copy the current pulse table into these new chunks, then drop the old table and rename the new one (a rough sketch follows below). We estimate at least 24 hours for that.

So if we time things right we may be fully functional before the end of Thursday, maybe Friday. However, considering the lost time this morning and the usual unexpected hurdles that crop up, I'm giving it a week, if only to keep expectations realistic while leaving room for potential pleasant surprises.

- Matt
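For the curious, the pulse table move boils down to something like the sketch below. All the names are hypothetical (the real pulse schema has many more columns, and I'm glossing over the dbspace sizing), so treat it as the shape of the plan rather than the exact commands - a bit of Python handing the SQL to Informix's dbaccess:

    import subprocess, tempfile

    # Rough sketch of the pulse table move; the table, dbspace, and
    # database names here are made up. The new table is spread across
    # freshly created dbspaces so it starts with a clean, much larger
    # set of extents.
    SQL = """
    CREATE TABLE pulse_new (
        id    INT8,
        power FLOAT
        -- ...the real pulse schema has many more columns...
    ) FRAGMENT BY ROUND ROBIN IN pulse_dbs1, pulse_dbs2;

    INSERT INTO pulse_new SELECT * FROM pulse;  -- the 24-hour part

    DROP TABLE pulse;
    RENAME TABLE pulse_new TO pulse;
    """

    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(SQL)
    subprocess.run(["dbaccess", "sah_science", f.name], check=True)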
Mmm-kay. So where are we at with the science database...?

The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the KVM monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we'd found the problem, only to reboot and (five minutes later) find we were wrong, then having to reboot again off of DVD (taking another five minutes).

Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was enumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.

Well, whatever. It's been a two-day-long game like a demented version of Towers of Hanoi: swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. In hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of Informix configuration which also lives on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again), and it will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will still try to get Astropulse running in some form later on today/tonight.

Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple of hours ago, but we did discuss how, when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the sudden influx of free CPU time. We joked about how these thumper issues strangely coincided with their arrival last week.

Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw (the gist is sketched below).

- Matt
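For those wondering what that means in practice: the standard trick is to FFT both the data and a known radar pattern, multiply one transform by the complex conjugate of the other, and inverse-FFT - peaks in the result mark the offsets where the pattern lines up with the data. The real code uses fftw in C; here's a minimal sketch of the same idea in Python/numpy:

    import numpy as np

    # FFT-based cross-correlation: peaks in the output mark the lags
    # where the known radar pattern matches the data, i.e. the
    # stretches we'd want to blank.
    def radar_xcorr(data, pattern):
        n = len(data) + len(pattern) - 1
        nfft = 1 << (n - 1).bit_length()   # round up to a power of two
        d = np.fft.rfft(data, nfft)
        p = np.fft.rfft(pattern, nfft)
        return np.fft.irfft(d * np.conj(p), nfft)[:n]

Real radar blanking has to cope with varying pulse timing and much messier data, so this is just the kernel of the idea.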
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three-drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs, since those have been backed up for a week.

Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of dbspaces. This is basically all you can do when you run out of extents: unload the table, then reload it into new dbspaces (sketched below). I roughly estimate the unload will take at least 24 more hours.

Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter, if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But running only Astropulse for a day actually helped push a lot of AP workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (it seems like even the one splitter is sensitive to the current heavy load on thumper).

Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.

- Matt
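The dump itself is nothing exotic - dbaccess has UNLOAD/LOAD statements for exactly this. Roughly, with hypothetical file/database names, and with the reload target being a table created in the new dbspaces as in the earlier sketch:

    import subprocess, tempfile

    # Dump the pulse table to a flat file. Note that UNLOAD (and LOAD,
    # for the reload later) are dbaccess statements, not server-side
    # SQL. Names and paths are hypothetical.
    SQL = """
    UNLOAD TO '/scratch/pulse.unl' DELIMITER '|'
        SELECT * FROM pulse;
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(SQL)
    subprocess.run(["dbaccess", "sah_science", f.name], check=True)

    # Once pulse_new exists in the new dbspaces, the reload is the
    # mirror image:
    #   LOAD FROM '/scratch/pulse.unl' DELIMITER '|' INSERT INTO pulse_new;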
Does Astropulse count as regular SETI credit? Thank you.