To help alleviate panic, here's the gist of our general plan:

1. Get thumper back up and running with a three-way root mirror. If all goes well, this will be done enough sometime tomorrow (Wednesday), i.e. we'll have a two-way root mirror and let the third drive sync up in the background while we bring the system up. Then, during next week's outage, we'll do more drive swapping to install grub and finish the resync on that third drive. Splitting/assimilating will be completely off for all projects until thumper is back up.

2. As soon as thumper is back up (tomorrow?) we can turn splitting/assimilating on for AP and get to work on the pulse table reconfiguration (which we can only do if the system/database is up). The plan, in simplest terms: create new database chunks, copy the current pulse table into these new chunks, then drop the old table and rename the new one (a rough sketch follows below). We estimate at least 24 hours for that.

So if we time things right we may be fully functional before the end of Thursday, maybe Friday. However, considering the lost time this morning and the usual unexpected hurdles that crop up, I'm giving it a week, if only to keep expectations realistic while leaving room for potential pleasant surprises.

- Matt
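For the curious, the pulse table move boils down to something like the sketch below. All the names are hypothetical (the real pulse schema has many more columns, and I'm glossing over the dbspace sizing), so treat it as the shape of the plan rather than the exact commands - a bit of Python handing the SQL to Informix's dbaccess:

    import subprocess, tempfile

    # Rough sketch of the pulse table move; the table, dbspace, and
    # database names here are made up. The new table is spread across
    # freshly created dbspaces so it starts with a clean, much larger
    # set of extents.
    SQL = """
    CREATE TABLE pulse_new (
        id    INT8,
        power FLOAT
        -- ...the real pulse schema has many more columns...
    ) FRAGMENT BY ROUND ROBIN IN pulse_dbs1, pulse_dbs2;

    INSERT INTO pulse_new SELECT * FROM pulse;  -- the 24-hour part

    DROP TABLE pulse;
    RENAME TABLE pulse_new TO pulse;
    """

    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(SQL)
    subprocess.run(["dbaccess", "sah_science", f.name], check=True)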
Mmm-kay. So where are we at with the science database...?

The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the KVM monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we'd found the problem, only to reboot and (five minutes later) find we were wrong, then having to reboot again off of DVD (taking another five minutes).

Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was enumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.

Well, whatever. It's been a two-day-long game like a demented version of Towers of Hanoi: swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. In hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of Informix configuration which also lives on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again), and it will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will still try to get Astropulse running in some form later on today/tonight.

Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple of hours ago, but we did discuss how, when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the sudden influx of free CPU time. We joked about how these thumper issues strangely coincided with their arrival last week.

Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw (the gist is sketched below).

- Matt
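For those wondering what that means in practice: the standard trick is to FFT both the data and a known radar pattern, multiply one transform by the complex conjugate of the other, and inverse-FFT - peaks in the result mark the offsets where the pattern lines up with the data. The real code uses fftw in C; here's a minimal sketch of the same idea in Python/numpy:

    import numpy as np

    # FFT-based cross-correlation: peaks in the output mark the lags
    # where the known radar pattern matches the data, i.e. the
    # stretches we'd want to blank.
    def radar_xcorr(data, pattern):
        n = len(data) + len(pattern) - 1
        nfft = 1 << (n - 1).bit_length()   # round up to a power of two
        d = np.fft.rfft(data, nfft)
        p = np.fft.rfft(pattern, nfft)
        return np.fft.irfft(d * np.conj(p), nfft)[:n]

Real radar blanking has to cope with varying pulse timing and much messier data, so this is just the kernel of the idea.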
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three-drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs, since those have been backed up for a week.

Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of dbspaces. This is basically all you can do when you run out of extents: unload the table, then reload it into new dbspaces (sketched below). I roughly estimate the unload will take at least 24 more hours.

Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter, if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But running only Astropulse for a day actually helped push a lot of AP workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (it seems like even the one splitter is sensitive to the current heavy load on thumper).

Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.

- Matt
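The dump itself is nothing exotic - dbaccess has UNLOAD/LOAD statements for exactly this. Roughly, with hypothetical file/database names, and with the reload target being a table created in the new dbspaces as in the earlier sketch:

    import subprocess, tempfile

    # Dump the pulse table to a flat file. Note that UNLOAD (and LOAD,
    # for the reload later) are dbaccess statements, not server-side
    # SQL. Names and paths are hypothetical.
    SQL = """
    UNLOAD TO '/scratch/pulse.unl' DELIMITER '|'
        SELECT * FROM pulse;
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(SQL)
    subprocess.run(["dbaccess", "sah_science", f.name], check=True)

    # Once pulse_new exists in the new dbspaces, the reload is the
    # mirror image:
    #   LOAD FROM '/scratch/pulse.unl' DELIMITER '|' INSERT INTO pulse_new;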
Does Astropulse count as regular SETI credit? Thank you.