Category: "Chicago DC"

Emergency Hypervisor Maintenance

November 16, 2014 at 4:54 PM

We are currently experiencing an issue on one of our VPS hypervisors that requires immediate maintenance. This will not impact all VPS customers; however, a handful will see intermittent service during this time. We expect this issue to be resolved within 15-30 minutes. We are very sorry for the inconvenience. Please look to this post for further updates as the maintenance proceeds.

Update: Service has been restored for all VMs directly affected by the maintenance this evening. We are going through all VMs to ensure everyone is back online.

Incoming DDoS Attack - Helios (11/1/2014)

November 1, 2014 at 5:14 PM

Our admin Geeks have been dealing with a number of DDoS attacks targeting the Helios system that began last night. The DDoS attacks have caused significant downtime for users on the primary shared IP of Helios. We do have a solution, and most customers on Helios have received e-mails pertaining to this. Let's look at how we handle a DDoS attack.

DDoS Mitigation

These are the steps we take to mitigate incoming and frequent DDoS attacks:

Step 1: Nullroute. 

Our first line of defense against DDoS is to immediately halt incoming traffic to the attacked IP address(es). This is called a "nullroute." Our upstream provider nullroutes the attacked IP as soon as they see an attack and see that it meets specific criteria (namely, if it will impact other users significantly).

A nullroute does two things: it prevents other IPs in our own network from going down due to the incoming attack, and it stops traffic from being routed to the attacked IP, so the attacker can't reach it. This means they can't attack anything until our IP is back online. We leave the IP offline for a little while, and bring it back up once the attacker has lost interest. Oftentimes this is the end of the attacks, but sometimes they pick back up later.
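To make the idea concrete, here is a minimal sketch of what a nullroute does to a routing decision. It is a toy model only, not our upstream provider's actual configuration: the routing table, the "shared-hosting-switch" next hop, and the 198.51.100.x documentation addresses are all assumptions made up for illustration.

```python
import ipaddress

# Hypothetical in-memory routing table of (network, next_hop) pairs.
# A next_hop of None stands for a nullroute: matching traffic is discarded.
ROUTES = [
    (ipaddress.ip_network("198.51.100.0/24"), "shared-hosting-switch"),
]


def nullroute(routes, ip):
    """Return a new table with a /32 discard entry for the attacked IP."""
    host = ipaddress.ip_network(f"{ip}/32")
    return routes + [(host, None)]


def lookup(routes, ip):
    """Longest-prefix match; returns the next hop, or None if traffic is dropped."""
    addr = ipaddress.ip_address(ip)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None  # no route at all, so the traffic is dropped as well
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop


table = nullroute(ROUTES, "198.51.100.10")
print(lookup(table, "198.51.100.10"))  # attacked IP: traffic is discarded
print(lookup(table, "198.51.100.11"))  # neighbouring IP: still routed normally
```

Because the /32 entry is more specific than the /24, only the attacked address is affected; every other IP in the range keeps its normal route.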

Step 2: Identify the attacked website.

Incoming DDoS attacks always have a target. The attacker has a grudge with a particular domain on the server, or maybe they just don't like the server. The underlying reason is rarely known; most attackers never reach out and let us know why.

If we can identify the attacked website, we can segregate the associated traffic to a different IP address. This way, we can ultimately just nullroute the IP hosting the targeted site instead of the IP hosting a few hundred websites. This limits service interruption to one account, and keeps other customers online during DDoS attacks. If the DDoS attacks never subside, then the account holder can choose to employ DDoS mitigation services to keep their site online through an attack. Mitigation services can be very costly.

Step 3: Account IP Dispersion.

If we cannot identify the attacked website, we resort to something we call Account IP Dispersion. You may be familiar with your account IP address. The same IP address that hosts your account may also host anywhere from one to a few hundred websites. We try to bring that number down as far as possible by dispersing accounts among a dozen or more new IPs, which relies on a bit of custom code we built here at GeekStorage. In short, the code scans each account on the attacked IP to determine whether the domains hosted on that account are using our nameservers. If all of the account's domains point to our nameservers, then we can change the hosted IP reliably and without downtime. We do this by using both the old and new IPs to host the website while the IP transitions (which can take up to 24 hours).
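The selection logic described above can be sketched roughly as follows. This is not our production code: the nameserver hostnames, account names, domains, and IP pool are hypothetical, the NS records are passed in as a plain dictionary rather than looked up over DNS, and round-robin assignment is just one simple way to spread accounts evenly.

```python
from itertools import cycle

# Hypothetical nameserver hostnames, used only for this illustration.
OUR_NAMESERVERS = {"ns1.example.net", "ns2.example.net"}


def uses_our_nameservers(domains, ns_records):
    """True only if every domain on the account delegates solely to our nameservers."""
    return all(
        bool(ns_records.get(d)) and set(ns_records[d]) <= OUR_NAMESERVERS
        for d in domains
    )


def disperse(accounts, ns_records, new_ips):
    """Assign movable accounts round-robin to new IPs.

    Returns (plan, left_behind): plan maps account -> new IP, and left_behind
    lists accounts with at least one domain on outside nameservers.
    """
    plan, left_behind = {}, []
    pool = cycle(new_ips)
    for account, domains in accounts.items():
        if uses_our_nameservers(domains, ns_records):
            plan[account] = next(pool)
        else:
            left_behind.append(account)
    return plan, left_behind


accounts = {
    "alice": ["alice.example"],
    "bob": ["bob.example", "bob-shop.example"],
    "carol": ["carol.example"],
}
ns_records = {
    "alice.example": ["ns1.example.net", "ns2.example.net"],
    "bob.example": ["ns1.example.net"],
    "bob-shop.example": ["ns.thirdparty.example"],  # outside nameserver
    "carol.example": ["ns2.example.net", "ns1.example.net"],
}
plan, left_behind = disperse(accounts, ns_records, ["203.0.113.10", "203.0.113.11"])
print(plan)         # alice and carol each get a new IP
print(left_behind)  # bob stays put: one of his domains uses outside DNS
```

Accounts in the plan can be moved safely because we control their DNS records; while the change propagates, both the old and the new IP keep serving the site.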

The first Helios account IP dispersion is complete; no IP is host to more than 30 accounts (excluding the IP holding domains that do not point to our nameservers). Within 24 hours, if attacks continue, we will narrow down which account is being attacked to a subset of 30, and disperse those accounts once again until we determine the attacked website. Fewer accounts will be impacted by the DDoS attacks with each iteration of the account IP dispersion until ultimately only one account sees downtime.
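As a rough illustration of how quickly repeated dispersion narrows things down, the sketch below counts how many accounts still share an IP with the target after each round. The starting count of 300 accounts and the pool of 12 new IPs per round are assumed figures chosen for the example, not actual Helios numbers, and an even split across IPs is assumed.

```python
import math


def exposure_per_round(accounts_on_ip, ips_per_round):
    """List how many accounts share the attacked IP before and after each round."""
    if ips_per_round < 2:
        raise ValueError("need at least two IPs to disperse accounts")
    exposed = accounts_on_ip
    history = [exposed]
    while exposed > 1:
        # Spread the exposed accounts evenly; the target lands on one of the IPs.
        exposed = math.ceil(exposed / ips_per_round)
        history.append(exposed)
    return history


print(exposure_per_round(300, 12))  # [300, 25, 3, 1]
```

Under these assumptions, three rounds are enough to isolate a single account, with each round waiting on the IP transition described above.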

Helios Status

The Helios server is currently 100% online. It has seen four downtime-causing DDoS attacks over the last 24 hours, but our Geeks are hard at work keeping an eye on incoming traffic and preparing for the next step should the attacks continue. We hope to post a quick resolution to this issue within the next 24 hours.

If you are concerned about additional downtime, there are a couple of things to keep in mind:

  • Migration to a new server is almost as effective as migration to a new IP. Most customers are already on a new IP. The likelihood of additional downtime for any particular account has been reduced by about 85% by the account IP dispersion. There is also a small chance that your site is the target, in which case a migration will not help.
  • Enabling CloudFlare via cPanel can improve your site speed and reliability. This holds true whether your account is being affected by DDoS attacks or otherwise.
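One way to read the "about 85%" figure in the first point above: if the attacker's target is equally likely to be any account, your chance of sharing an IP with it shrinks in proportion to the number of accounts on your IP. The sketch below works through that arithmetic. The starting count of 200 accounts is purely hypothetical (the original number of accounts on the shared IP is not stated here); only the cap of 30 accounts per IP comes from this post.

```python
def downtime_risk_reduction(accounts_before, accounts_after):
    """Fractional drop in the chance of sharing an IP with the attacked account."""
    if not 0 < accounts_after <= accounts_before:
        raise ValueError("expected 0 < accounts_after <= accounts_before")
    return 1 - accounts_after / accounts_before


# 200 is an assumed starting point; 30 is the post-dispersion cap per IP.
reduction = downtime_risk_reduction(200, 30)
print(f"{reduction:.0%}")  # 85%
```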

We hope this sheds some light on the actions we are taking to improve availability on the Helios server despite recent DDoS attacks. If you have any questions for our support team, please don't hesitate to ask. We are available 24/7 via our support desk or at [email protected].

Emergency Maintenance - Atlas Server (8/27/2014)

August 27, 2014 at 8:19 PM

We have identified an issue on the Atlas server that requires immediate attention and will initiate downtime shortly. Downtime should be less than 30 minutes, and there should be no other adverse effects from this maintenance.

This maintenance has been completed, and service accessibility and optimal speeds should now be restored on the Atlas server.

Goliath Issue (Resolved)

July 31, 2014 at 8:25 AM

The Goliath server crashed this morning. We are looking into the cause and working towards a resolution at this time. We will bring Goliath back online as soon as possible; however, we do not yet have an ETA for resolution. Further updates will be posted here as they become available.

Update 9:30AM CDT: Goliath is running a filesystem check (FSCK) and should be back online within 15-30 minutes.
Update 9:36AM CDT: Goliath is back online at this time and services should again be accessible.
Final Update 9:52AM CDT: We have not identified a specific cause of the crash; however, the most likely cause is an I/O error. As such, we are tweaking configurations to help ensure this does not occur again, and enabling additional logging to monitor the server more closely for some time (all RAID arrays are intact and optimal, and an FSCK completed successfully as well). We apologize for the downtime and would like to thank everyone for their patience this morning.

Scheduled Maintenance - OnApp Platform - 06/15/2014 - 06/16/2014 (Completed)

June 11, 2014 at 12:25 PM

Start Date: 06/15/2014

Start Time: 09:00 PM Central Time

Estimated End Date: 06/16/2014

Estimated End Time: 02:00 AM Central Time

Service: OnApp Platform Upgrade

Type of Work: Software Upgrade and Testing

Impact of Work: Intermittent availability of VPS servers while the upgrade and testing are carried out during the maintenance window

This notification is to inform you that on Sunday, June 15th, 2014, beginning at 09:00 PM Central Time, with an estimated ending on Monday, June 16th, 2014, at 2:00 AM Central Time, the OnApp Platform will have intermittent inaccessibility during a planned software upgrade and necessary testing. All support tickets will be answered as usual. We do ask that you refrain from submitting a ticket regarding your VPS during this maintenance window as any issue(s) you encounter are likely a direct result of this maintenance.

Update #1 - 9:58 PM Central Time, 06/15/2014

The start of the maintenance has been delayed until 10 PM Central Time. We still hope to have the maintenance fully completed by the 2 AM Central Time estimated ending.

Update #2 - 2:17 PM Central Time, 06/16/2014

We are wrapping up some maintenance work on the backend, but otherwise the maintenance has been completed successfully. Most VPS experienced less than 5 minutes of downtime (if any) and there were only a handful that experienced a slightly longer downtime, so we're thankful this maintenance went quite smoothly overall.

We will need to perform about 1 more hour of maintenance on the platform at 11 PM Central Time 06/16/2014. This is a pretty minor maintenance operation and we don't expect more than 5 minutes of downtime (if any) for all VPS on the OnApp platform.