The length of downtime had nothing to do with us not having backups -- in fact we had backups right down to the last minute we were online. It had everything to do with hardware failure and response times in first diagnosing and then rectifying the problem at the DC end of the equation, and I can assure you that we are taking this up with our provider. [snip]
...you can't restore data if you have nothing to restore it to...
Disaster recovery is a thorny topic and a troublesome thing to implement. Few will spend the time or money on a fail-over system, given the cost and complexity involved. It takes this kind of problem to motivate people and show the value of a fail-over solution.
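For what it's worth, the health-check half of fail-over is the easy part; it's the surrounding infrastructure that costs the money. Here's a minimal sketch in Python just to show the idea. Everything in it is a made-up assumption (the endpoint URLs, the failure threshold, the poll interval), and a real setup would flip DNS or a load balancer rather than a local variable:

```python
"""Minimal fail-over health-check sketch (hypothetical endpoints)."""
import urllib.request
import urllib.error
import time

# Hypothetical endpoints -- substitute your real primary/standby hosts.
PRIMARY = "https://primary.example.com/health"
STANDBY = "https://standby.example.com/health"
FAIL_THRESHOLD = 3          # consecutive failures before failing over
CHECK_INTERVAL_SECS = 30    # how often to probe the primary

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def monitor() -> None:
    failures = 0
    active = PRIMARY
    while True:
        if is_healthy(PRIMARY):
            failures = 0
            active = PRIMARY
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD and active != STANDBY:
                active = STANDBY
                # A real system would update DNS or the load balancer here.
                print(f"failing over to {STANDBY}")
        print(f"active endpoint: {active}")
        time.sleep(CHECK_INTERVAL_SECS)

if __name__ == "__main__":
    monitor()
```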
This appears a classic case where it takes a series of failures to expose the shortcomings of the infrastructure (the data center). It sounds like the core issue is that the data center was not quick to identify or resolve its hardware problems and, from what you wrote, didn't have a ready solution. Added to that, the site's management did not plan for, or expect, the data center to let them down.
The good news is that the backups worked (HURRAY!), so little or nothing was lost but time, and the episode gave the site's management the opportunity to see where the recovery scheme could be improved.
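One cheap improvement, since "you can't restore data if you have nothing to restore it to" cuts both ways: periodically prove the backups restore at all, before an outage forces the question. A rough sketch of a scheduled test-restore, assuming a gzipped tarball at a hypothetical path:

```python
"""Backup sanity-check sketch: confirm an archive actually restores."""
import tarfile
import tempfile

BACKUP_PATH = "/backups/site-latest.tar.gz"  # hypothetical location

def test_restore(path: str) -> int:
    """Extract the archive to a throwaway directory; return file count."""
    with tempfile.TemporaryDirectory() as scratch:
        # Only point this at archives you created yourself; tar members
        # with hostile paths are a known hazard of extractall().
        with tarfile.open(path, "r:gz") as archive:
            members = archive.getmembers()
            archive.extractall(scratch)  # fails loudly if the backup is corrupt
            return len(members)

if __name__ == "__main__":
    count = test_restore(BACKUP_PATH)
    print(f"restored {count} entries OK from {BACKUP_PATH}")
```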
Bravo on the diligence, and on getting the site back up and running in short order!