Luminous Landscape Forum

Site & Board Matters => About This Site => Topic started by: Christopher Sanderson on December 03, 2010, 12:00:36 pm

Title: LuLa Server down for almost 24hrs!
Post by: Christopher Sanderson on December 03, 2010, 12:00:36 pm
Our server hosting company 'The Planet' in Texas had a major meltdown with our server yesterday - Dec2.

As you are probably aware, the site has now been in and out of service for 24 hours. One problem cascaded into another and basically the whole server OS & site has had to be re-built from back-ups. There does not appear to be any data loss thus far. The problems are almost sorted out but work continues.  Stay tuned.

And of course, our apologies!
Title: Re: LuLa Server down for 24hrs!
Post by: Christopher Sanderson on December 03, 2010, 12:10:48 pm
The main Home page still appears inaccessible. 'Our very best' are working on the problem to get the site back to normal

Basically it seems that what occurred was a hardware problem that corrupted the server OS. This was initially mis-diagnosed as an OS failure and of course as soon as a re-install happened, the OS was once again corrupted by the faulty hardware....

I believe new machine parts are now in place and the copying of data has begun and continues...
Title: Re: LuLa Server down for almost 24hrs!
Post by: John.Murray on December 03, 2010, 12:51:59 pm
Glad to see you folks up and running :)
Title: Re: LuLa Server down for almost 24hrs!
Post by: Christopher Sanderson on December 03, 2010, 01:17:29 pm
All seems pretty much back to normal now. If readers find any problems, please let us know.
Title: Re: LuLa Server down for almost 24hrs!
Post by: digitaldog on December 03, 2010, 01:35:10 pm
I was going through LuLa withdrawals, glad to see you back up.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Christoph C. Feldhaim on December 03, 2010, 01:36:26 pm
Cold Turkey ......
Title: Re: LuLa Server down for almost 24hrs!
Post by: rothberg on December 03, 2010, 01:54:11 pm
Hooray for the IT guys!

Title: Re: LuLa Server down for almost 24hrs!
Post by: michael on December 03, 2010, 02:07:09 pm
Hooray for the IT guys!


You bet!

Mark Guertin and Vincent Roman, our IT team, both worked for 24 hours straight to get the server back up and running properly. They deserve more than praise.

The folks at the server farm, The Planet, not so much. :'(

In our eleven years online this is the worst outage that we've had. Previously it was never for more than an hour or two.

This failure was a combination of hardware and software, but fortunately our database was well backed up (all 25GB of it).

Time to start planning our new recovery strategy for the next time. With computers, there's always a next time.

Michael

Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 02:12:01 pm
Indeed.

I was beginning to think maybe the CIA mistook LULA for Julian Assange. :-)

Glad it's all back to normal.
Title: Re: LuLa Server down for almost 24hrs!
Post by: fredjeang on December 03, 2010, 02:41:14 pm
I thought that was one of those solar eruptions again that isolated Canada from the rest of the world.

Mark Guertin and Vincent Roman deserve a big siesta and congrats for fixing.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Stephen Starkman on December 03, 2010, 03:55:33 pm
Wow. I was thinking maybe you had decided to host WikiLeaks. That would have explained it. :)

Good to see you back.

Stephen
Title: Re: LuLa Server down for almost 24hrs!
Post by: digitaldog on December 03, 2010, 03:59:00 pm
Maybe a conspiracy theory amiss, but a number of sites I visit were/are down last 24 hours (Photo.net is down now). Is it me? Or someone has something against photographers?
Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 04:07:03 pm
Maybe a conspiracy theory amiss, but a number of sites I visit were/are down last 24 hours (Photo.net is down now). Is it me? Or someone has something against photographers?

Just because you're paranoid, doesn't mean they aren't out to get you! :-)
Title: Re: LuLa Server down for almost 24hrs!
Post by: digitaldog on December 03, 2010, 04:08:01 pm
Just because you're paranoid, doesn't mean they aren't out to get you! :-)

My site is next? Crap! And I was only worried about full body scanners.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 04:09:21 pm
My site is next? Crap! And I was only worried about full body scanners.

Well, you've got your priorities right - I mean what are they going to see that they haven't seen before? :-)
Title: Re: LuLa Server down for almost 24hrs!
Post by: EduPerez on December 03, 2010, 04:39:03 pm
Well, you've got your priorities right - I mean what are they going to see that they haven't seen before? :-)

Perhaps everybody has seen yours, but they have not seen mine!
Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 04:42:11 pm
Nope! I haven't traveled through a US airport in quite a long while - thank goodness! :-)
Title: Re: LuLa Server down for almost 24hrs!
Post by: digitaldog on December 03, 2010, 04:42:43 pm
Perhaps everybody has seen yours, but they have not seen mine!

So you’ve seen Mark’s film? Mark Wahlberg‘s got nothing on our Mark!

http://www.nerve.com/archived/blogs/the-ten-greatest-prosthetics-in-movie-history-part-1
Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 04:49:13 pm
Well Andrew, we'll never know because the best evidence has been EXPUNGED - did you see that? "This video is no longer available because the YouTube account associated with this video has been terminated". Now is THAT something to be paranoid about! You see they ARE out to get us after all!
Title: Re: LuLa Server down for almost 24hrs!
Post by: Eric Myrvaagnes on December 03, 2010, 05:14:24 pm
I thought that was one of those solar eruptions again that isolated Canada from the rest of the world.

Mark Guertin and Vincent Roman deserve a big siesta and congrats for fixing.
+1000!

I needed my fix so bad I was ready to head for the ER.

OooooH, Thank you Mark and Vincent, and have a great nap!!!

Eric
Title: Re: LuLa Server down for almost 24hrs!
Post by: Rob C on December 03, 2010, 05:22:49 pm
It's sun spots.

My own website needs me to go to Weebly to access it in case I want to make alterations or check traffic: I couldn't get in.

Worse, with this one down, there will probably have been no traffic!

;-)   or, alternatively, ;-(

Nonetheless, thanks to you guys in Mission Control for getting us out of the warp.

Rob C
Title: Re: LuLa Server down for almost 24hrs!
Post by: Mark D Segal on December 03, 2010, 05:25:45 pm
It's sun spots.


Aw shucks, ya mean there's a scientific explanation? That spoils all the fun.
Title: Re: LuLa Server down for almost 24hrs!
Post by: bobtowery on December 03, 2010, 06:25:26 pm
Well, there was this picture in one of the WikiLeaks documents.  But really, I think it is just a case of mistaken identity?

Title: Re: LuLa Server down for almost 24hrs!
Post by: michael on December 03, 2010, 07:03:48 pm
The fact that a lot of other sites had problems yesterday may indeed be related. Our server is maintained at a large hosting farm in Texas (The Planet). They host thousands of servers and sites, and the problems may have manifested across a lot of machines.

Vincent and Mark are both catching up on their sleep, so I won't have a full post mortem until later in the weekend. If I learn anything relevant I'll post it here.

Michael
Title: Re: LuLa Server down for almost 24hrs!
Post by: Slobodan Blagojevic on December 03, 2010, 07:26:10 pm
All seems pretty much back to normal now. If readers find any problems, please let us know.

My avatar is missing and it seems impossible to attach a new one. Also noticed some other members' avatars missing too.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Eric Myrvaagnes on December 03, 2010, 09:20:37 pm
My avatar is missing and it seems impossible to attach a new one. Also noticed some other members' avatars missing too.
So it was all a plot by the folks at Wikileaks to steal LuLa avatars!
If yours is missing, watch for it to appear soon in the NY Times.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Justan on December 04, 2010, 10:19:52 am
Time to start planning our new recovery strategy for the next time. With computers, there's always a next time.

Michael



If you were interested in installing fail-over capability, there are a number of ways to mirror SQL databases and any related software. The goal would be to have a 2nd site, managed by a different group. The on-line site would send regular or real-time updates to the 2nd site. In the event that the primary site goes off line, all that’s needed is to change your dns values so that they point to the 2nd site and you’re back up and running in a few minutes.

It’s not the most trivial of tasks to establish, but offers many advantages and isn't all that expensive to implement or maintain. Do a Google search on “how to mirror sql servers.”
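
The DNS switch described in this post needs something to decide when the primary site is actually down. What follows is only a rough sketch of such a check, not anything the site is known to use: the function names, the URL and the addresses are all hypothetical placeholders, and the DNS update itself is left out because it depends entirely on the DNS provider.

```shell
#!/bin/sh
# Sketch only: names, URL and addresses are hypothetical placeholders.

# Succeeds if the URL answers with a non-error HTTP status within 10 seconds.
site_is_up() {
    curl --silent --fail --max-time 10 --output /dev/null "$1"
}

# Prints the address DNS should point at: the primary while it answers,
# otherwise the standby.
# Usage: pick_target URL PRIMARY_ADDR STANDBY_ADDR
pick_target() {
    if site_is_up "$1"; then
        echo "$2"
    else
        echo "$3"
    fi
}

# Example call (placeholder values):
# pick_target "http://www.example.com/" 192.0.2.10 198.51.100.20
```

How quickly visitors reach the 2nd site after the change depends on the TTL of the DNS record, so "a few minutes" assumes the TTL was set low beforehand.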
Title: Re: LuLa Server down for almost 24hrs!
Post by: Christopher Sanderson on December 04, 2010, 10:56:02 am
Yes, this is already 'in the works' - but thanks for the suggestion!
Title: Re: LuLa Server down for almost 24hrs!
Post by: mguertin on December 04, 2010, 04:02:45 pm
My avatar is missing and it seems impossible to attach a new one. Also noticed some other members' avatars missing too.

I'm not sure why some went missing but I will further investigate this.  They all appear to exist so it might be a permissions problem.  You should now be able to upload avatars again, there was a missing PHP module that is now installed.
Title: Re: LuLa Server down for almost 24hrs!
Post by: ErikKaffehr on December 04, 2010, 04:38:48 pm
Congratulations on handling an unexpected problem in reasonable time!

Best regards
Erik

Ps. I have worked with a Mr. Merik Guertin of L3 Maps, no relative of yours?



I'm not sure why some went missing but I will further investigate this.  They all appear to exist so it might be a permissions problem.  You should now be able to upload avatars again, there was a missing PHP module that is now installed.
Title: Re: LuLa Server down for almost 24hrs!
Post by: mguertin on December 04, 2010, 04:41:35 pm
Congratulations on handling an unexpected problem in reasonable time!

Best regards
Erik

Ps. I have worked with a Mr. Merik Guertin of L3 Maps, no relative of yours?


Thanks Erik.  Nope, no relation.

Mark
Title: Re: LuLa Server down for almost 24hrs!
Post by: K.C. on December 04, 2010, 06:15:30 pm
As an IT professional for 25+ years I'm trying to understand why a site, with the level of demand this one has, is being run on a single box and maintained by a couple of guys. No matter how competent you may be that's an old school approach.

If you're using your own server in a colo then you really need to be running RAID and the colo should have another box ready to hot swap to. With all due respect, 24 hrs down time and the need for a manual rebuild is pretty amateur with the options you have available to you.

At the very least write a script and ftp it off site several times a day.

# Date stamp for the archive name
DATE=$(date +%Y_%m_%d)

# Dump SQL data
/usr/bin/mysqldump -uUSER -pPASS --all-databases --opt -l --result-file=/backup/mysql/mysqldump.sql

# Compress sql dump
tar zcf /backup/$DATE.tar.gz /backup/mysql

# EMAIL TO MAILBOX (done before the upload, because -DD below removes the local archive)
uuencode /backup/$DATE.tar.gz Some_Hosting_SQL_Dbases.$DATE.tar.gz | mail -s "Some Hosting SQL Database Backup" recipient@domain.com

# UPLOAD TO FTP (DD deletes on successful upload)
ncftpput -f ftplogin.cfg -DD /remote_path /backup/$DATE.tar.gz
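
A script like the one above ships whatever tar produced, even if the dump failed part-way. One possible safeguard, offered as a sketch and not as part of the original script, is to check the archive before sending it anywhere. The function name `verify_archive` is made up for this example, and it assumes the archive is a gzip-compressed tar containing at least one `.sql` file.

```shell
#!/bin/sh
# Sketch only: verify_archive is a hypothetical helper, not from the thread.
# Succeeds only if the archive exists, is non-empty, passes gzip's integrity
# check, and lists at least one .sql file.
verify_archive() {
    archive="$1"
    [ -s "$archive" ] || return 1
    gzip -t "$archive" 2>/dev/null || return 1
    tar tzf "$archive" 2>/dev/null | grep -q '\.sql$'
}

# Example call (placeholder path):
# verify_archive /backup/archive.tar.gz || echo "backup archive looks broken" >&2
```

This only shows the archive is readable and contains a dump file. It does not prove the dump restores cleanly, which still needs an occasional test restore.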


Title: Re: LuLa Server down for almost 24hrs!
Post by: mguertin on December 04, 2010, 07:34:36 pm
As an IT professional for 25+ years I'm trying to understand why a site, with the level of demand this one has, is being run on a single box and maintained by a couple of guys. No matter how competent you may be that's an old school approach.

If you're using your own server in a colo then you really need to be running RAID and the colo should have another box ready to hot swap to. With all due respect, 24 hrs down time and the need for a manual rebuild is pretty amateur with the options you have available to you.

<snip>



K.C.:

The length of downtime had nothing to do with us not having backups -- in fact we had backups right down to the last minute we were online.  It had everything to do with hardware failure and response times in first diagnosing and then rectifying the problem at the DC end of the equation, and I can assure you that we are taking this up with our provider.  Also I'm not really sure where you get the idea that we performed a manual rebuild of the server or data.  As stated we had full and complete backups of anything remotely considered essential right up to the last minute the old server was online to work from and we restored from these backups. 

A very small portion of the actual downtime was required for the actual data restore.  We also have offsite backups but had we made the decision to take that route it's likely that it would have ultimately taken longer at the end of the day than it did to wait the (unacceptably long) time it took the DC to get its act together and get us back onto functional hardware.  RAID would not have helped us in this situation -- this was not a hard drive failure -- and in fact had we had RAID to deal with for the hardware changeover it likely would have slowed the process down yet again.  I'm also not really sure how you think that having more people maintaining the site and server would have sped up the process (at least on our end of things), you can't restore data if you have nothing to restore it to...

Lastly I have to say that while emailing uuencoded data is an interesting backup approach, for a dataset the size we are talking about here it wouldn't even be remotely feasible.

Rest assured there are plans underway that will make sure this type of a failure won't require this kind of turnaround time again.

Mark
Title: Re: LuLa Server down for almost 24hrs!
Post by: K.C. on December 04, 2010, 07:51:38 pm
Mark, you describe a much different picture than the thread led me to believe was the case.

Sounds like a familiar scenario. You don't realize the competency, or lack thereof, of the people you're relying on until the worst case happens. Time for a new host/colo.

Emailing gigs of data unsecured is common. You dump tables in random order. Nobody sniffing it can get enough info at once for it to be useful.

Title: Re: LuLa Server down for almost 24hrs!
Post by: mguertin on December 04, 2010, 08:56:26 pm
It's not the unsecured data part of the emailing that bothers me as much as the size of said emails ;)
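
To put rough numbers on the size concern: uuencode writes each 45 bytes of input as a 62-byte line (one length character, 60 encoded characters and a newline), so the encoded data is about 38% larger than the input. The sketch below applies that to the 25GB database mentioned earlier in the thread. The 10 MB per-message limit is purely an assumption for illustration, and compression is ignored because the thread gives no compression ratio for the dump.

```shell
#!/bin/sh
# Sketch only: the 10 MB per-message limit is an assumption, not from the thread.

# Size in MB after uuencoding: every 45 input bytes become a 62-byte line.
uuencoded_mb() {
    echo $(( $1 * 62 / 45 ))
}

# Number of messages needed for an encoded size and a per-message limit,
# rounded up.
messages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

db_mb=$(( 25 * 1024 ))                 # the 25GB database mentioned in the thread
encoded_mb=$(uuencoded_mb "$db_mb")
limit_mb=10                            # assumed per-message limit
echo "encoded size: ${encoded_mb} MB"
echo "messages at ${limit_mb} MB each: $(messages_needed "$encoded_mb" "$limit_mb")"
```

Uncompressed, that works out to roughly 34 GB of mail per backup run, or about 3,500 messages at the assumed limit. A compressed dump would be smaller, but by an amount the thread does not say.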
Title: Re: LuLa Server down for almost 24hrs!
Post by: Christoph C. Feldhaim on December 05, 2010, 07:00:21 am
I'd put such a server in a virtualized movable environment, like vmware or Xen.
Just my 0.02
Title: Re: LuLa Server down for almost 24hrs!
Post by: Justan on December 05, 2010, 08:54:52 am
my idea is: how to automatically and constantly backup files in a server with zero intervention (automatized) from a HD?

And I'd like an automatized backup that runs all that in case of crash. Is that possible? (I'm not tech as you can see)

Look into the osql utility program or its newer incarnation, the sqlcmd util.

There are some ftp programs that will do as you wish and which have their own scheduler, or you can use the ftp command line and use the system scheduler, at least in windows boxes.
Title: Re: LuLa Server down for almost 24hrs!
Post by: Justan on December 05, 2010, 09:17:15 am

The length of downtime had nothing to do with us not having backups -- in fact we had backups right down to the last minute we were online.  It had everything to do with hardware failure and response times in first diagnosing and then rectifying the problem at the DC end of the equation, and I can assure you that we are taking this up with our provider.  [snip]

...you can't restore data if you have nothing to restore it to...


Disaster recovery is a thorny topic and a troublesome thing to implement. Few will spend the time or $$ to implement a fail-over system due to cost and complexity. It takes this kind of problem to motivate and show the value of a fail-over solution.

This appears a classic case where it takes a series of failures to identify the nature of the infrastructure’s (the data center) shortcomings. It sounds like the core issue is that the data center was not quick to identify or resolve their hardware problems ( :o ) and from what you wrote, didn't have a ready solution ( :o :o). And added to that, the site’s management did not plan for or expect the data center to let them down. ( :o )

The good news is that the backups worked ( HURRAY ;D ;D ;D) so little or nothing was lost but time, and gave the site’s management the opportunity to see where the recovery scheme could be improved.

Bravo on the diligence and getting the site up and running in short order!
Title: Re: LuLa Server down for almost 24hrs!
Post by: Craig Arnold on December 05, 2010, 09:17:56 am
I'd put such a server in a virtualized movable environment, like vmware or Xen.
Just my 0.02

Yup, downtime would likely have been zero (if the hardware was failing with a detectable failure, just vMotion it automatically) or down to a few minutes at most if you needed to spin up a new instance.

Check out something like the Rackspace Cloud web hosting solutions (there are of course other providers too - but starting with Rackspace gives you an idea of what is possible). No single point of failure anywhere. Essentially infinitely scalable too.