Server down
Incident Report for FRUITION

We're Sorry for the Downtime

Fruition's server admin, its response team, and all team members sincerely apologize for this evening's outage. This incident likely caused extra stress for website owners because of the upcoming ecommerce shopping season. There is never a good time for an outage, and one right before a major shopping event certainly causes additional stress because of the fear of lost holiday shopping revenue.

Assurances

We want to assure you that we are fully aware of the stakes involved with hosting and take all outages seriously. This particular server, with this outage included, has over 99.9% uptime for this year. We take that high uptime history seriously and strive to never have downtime.
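For readers who want to check the uptime figure, here is a rough sketch of the arithmetic. It assumes a 365-day year and uses the "Identified" and "Resolved" posting times from the updates at the end of this report as stand-ins for the start and end of the outage; the real outage window may differ slightly, and the function names are ours for illustration only.

```python
from datetime import datetime, timedelta


def downtime_budget(uptime_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed in `period` while still meeting `uptime_pct`."""
    return period * (1 - uptime_pct / 100)


def uptime_pct_after(downtime: timedelta, period: timedelta) -> float:
    """Uptime percentage over `period` given a total amount of downtime."""
    return 100 * (1 - downtime / period)


# Assumption: posting times of the "Identified" and "Resolved" updates (MST)
# approximate when the outage began and ended.
identified = datetime(2014, 11, 25, 16, 49)
resolved = datetime(2014, 11, 25, 21, 45)
outage = resolved - identified

year = timedelta(days=365)  # assumption: a full 365-day year
budget = downtime_budget(99.9, year)

print("Outage duration:", outage)
print("Downtime allowed at 99.9% over a year:", budget)
print("Uptime if this were the only outage: %.3f%%" % uptime_pct_after(outage, year))
```

Under these assumptions the outage lasted a little under five hours, which fits inside the roughly 8.76 hours of downtime that a 99.9% yearly figure allows.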

Timing

At Fruition we've been hosting websites for 15 years, and we know things break, configurations go bad, and unknown issues pop up. All too often these issues pop up at inopportune times.

The outage tonight coincided with a major hardware upgrade that was scheduled for just 30 minutes after this incident started. That caused some confusion for customers who were anticipating the planned downtime. We had plenty of hands available to both complete the planned upgrade and fix the down server, so the planned maintenance proceeded.

The planned maintenance was necessary for a major hardware upgrade to improve the future scalability, security, and stability of Fruition's hosting infrastructure. The upgrade was completed as scheduled and on time.

Cause of Downtime

The initial cause of this downtime was an errant reboot. The server then failed to boot because of a corrupted boot sector. After the boot sector was repaired, additional issues were discovered with the filesystem and disks. Those issues were fixed, which took a significant amount of time because of the size of the storage and the automated disk checks that run when issues appear, to ensure data integrity. After the scans completed, the server was rebooted and came back up without problems. After the reboot, individual websites were checked and confirmed to be fully operational. The server is now fully operational.

Worst Case Scenario (which did not happen)

Had the server not rebooted after the filesystem and boot fixes, it would have been restored using an Idera Bare Metal recovery disk. Here, the issue was local and not related to a specific geographic datacenter. Thus, a new server would have been turned on and the restore would have begun. The full restore would have taken ~8 hours. Again, with the hardware upgrade, which was unrelated to this outage, future downtime and restores should be shorter.

Future Downtime

Part of the planned upgrade was to migrate servers to a more redundant setup. Some downtime will be scheduled in the future to facilitate that move.

Again, everyone here at Fruition appreciates your business, and we sincerely apologize for this evening's downtime.

Posted Nov 25, 2014 - 22:33 MST

Resolved
This incident has been resolved.
Posted Nov 25, 2014 - 21:45 MST
Update
A disk was previously added incorrectly as ext2 instead of ext3. This triggered quotacheck on boot. This is being fixed now, as is a filesystem issue. We believe all issues have been identified and we are nearing a resolution. The final processes take 30-40 minutes to complete.
Posted Nov 25, 2014 - 20:54 MST
Update
GRUB and kernel issues have been repaired. We are now rebooting again to check the repair. Filesystem issues were detected on boot and need to be addressed before public connectivity is restored to the server. This process usually takes 30-60 minutes.
Posted Nov 25, 2014 - 19:53 MST
Update
The kernel and boot loader are corrupted. Repairs are ongoing.
Posted Nov 25, 2014 - 19:40 MST
Identified
We are investigating a server outage on a co-located server in Dallas.
Posted Nov 25, 2014 - 16:49 MST