Outage Reports From MLH Prime Spring 2016 Finale

This past weekend at the MLH Prime Spring 2016 Finale we had two major outages: (1) the internet went out for a large portion of Saturday and (2) the food was late or not sufficient to feed everyone at the hackathon.

In this post we’re going to talk about what actually caused these issues, how they were resolved during the hackathon, and what we’ll be doing in the future to make sure they don’t happen again. We believe in transparency, and it is important for all of you, our community, to understand what goes on behind the scenes when outages occur and how we deal with them, both immediately and in the future.

As hackers ourselves, we understand how difficult it is to hack without sufficient internet or sustenance, and we apologize to our attendees for not meeting the standard of excellence that we know you expect from Major League Hacking events. Our entire team admired our hackers’ and sponsors’ ability to adapt to and make the best of a bad situation, from the impromptu hacker scavenger hunt organized by Mario at Dell to the hackers who pivoted to do superbly impressive offline hacks. We thank you for working with us and for understanding that while mistakes happen, we hold ourselves to a higher standard than was shown at this event and will be implementing processes to prevent these outages from recurring.

Internet Outage Summary and Impact

On Saturday, August 6, there was a full internet outage beginning at around 12:30pm in the midst of opening ceremonies. Over the next 10 hours, the MLH team and our networking vendor worked diligently to both diagnose and rectify the underlying problem. The steps that were taken are described below, along with a clear picture of how we may avoid this in the future.

Timeline

  • 12:30 PM – The networking team on site began investigating as soon as the outage started. The first problem they detected was packet loss when pinging the default gateway. They proceeded to explore the network configuration and connected devices to ensure that no second router or other device was interfering with or mimicking the default gateway.
  • 2:30 PM – After exhausting that possibility, the networking team began to investigate potential equipment failures. It was determined that all equipment was running as expected: the Wi-Fi signals at the event were still strong and supporting all of the necessary load, and the wired LAN was functioning normally.
  • 4:00 PM – The networking team decided to swap out their DHCP server to rule out the possibility that it was misconfigured or simply broken.
  • 5:30 PM – MLH staff began working on a contingency plan in case the Internet did not come back online.  The plan was to purchase enough mobile hotspots to support the hackers who did not have devices that were capable of tethering.
  • 7:00 PM – After ruling out the DHCP server, the networking team re-tested the internet handoff from a test device and detected no issues with a direct connection. However, under load, the network continued to drop packets and fail.
  • 8:30 PM – After a conversation with the upstream network bandwidth provider, it was decided that the only remaining possibility was an issue in the upstream equipment. The CEO of our upstream provider took a break from his vacation, drove to the venue, and swapped us over to different upstream equipment. This corrected the underlying issue at around 9:30pm.
  • 9:30 PM – After the issue was corrected, the networking team immediately reconfigured the network and set up a new NAT server for the connection. Wired and wireless internet were back up and running by around 10:30pm on Saturday.
  • 10:30 PM – Normal Internet access resumed.
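The first diagnostic step in the timeline — measuring packet loss when pinging the default gateway — can be sketched in a few lines. This is purely an illustration, not the tooling the networking team actually used; the loss threshold is an assumption chosen for the example.

```python
# Hypothetical sketch of the first diagnostic step: quantifying packet
# loss to the default gateway. The 2% "healthy" threshold is an
# illustrative assumption, not a value from the event.

def packet_loss_pct(replies):
    """Percentage of probes that received no reply.

    `replies` is a list of booleans, one per echo request,
    True if a reply came back before the timeout.
    """
    if not replies:
        return 0.0
    lost = sum(1 for ok in replies if not ok)
    return 100.0 * lost / len(replies)

def gateway_healthy(replies, max_loss_pct=2.0):
    """Consider the link healthy if loss stays under the threshold."""
    return packet_loss_pct(replies) <= max_loss_pct

# Example: 3 of 10 probes lost -> 30% loss, clearly unhealthy.
print(packet_loss_pct([True] * 7 + [False] * 3))  # 30.0
print(gateway_healthy([True] * 7 + [False] * 3))  # False
```

In practice the booleans would come from real ICMP probes (e.g. parsing `ping` output), but sustained loss like this on the very first hop is exactly what pointed the team at the gateway and, ultimately, the upstream equipment.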

We are confident that the networking team and upstream provider took the proper debugging steps to exhaust each possibility. However, the MLH team is responsible for putting preventive and preparatory measures in place to ensure that as many potential issues as possible are detected before outages occur. This includes making sure our vendors take these steps as well.

In the future, we will be establishing new systems to prevent these types of issues:

  • All networking providers will be asked to do a user load test in addition to the already required bandwidth test of wireless and wired internet before attendees arrive to ensure that there are no load issues or upstream bandwidth provider issues.
  • We will be implementing a survey that all upstream providers and networking vendors must answer before we employ their services. This will include additional vetting of upstream providers and their equipment, as well as the equipment provided by our networking vendor.
  • We will be preparing backup solutions (such as MLH-provided hotspots) in the event that the internet has an unexpected outage.
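The load test described above differs from a plain bandwidth test in that it simulates many concurrent clients rather than a single fast one. A minimal sketch of the idea, assuming a pluggable `probe` function, a simulated client count, and a success-rate threshold that are all illustrative choices:

```python
# Hypothetical sketch of a user load test: rather than only measuring
# raw bandwidth from one machine, simulate many concurrent clients and
# check that the success rate holds up under load. The client count,
# worker pool size, and 99% threshold are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

def load_test(probe, n_clients=500, min_success_rate=0.99):
    """Run `probe` once per simulated client, in parallel.

    `probe` should return True on a successful request (e.g. an HTTP
    GET that completes within a deadline) and False otherwise.
    Returns (passed, observed_success_rate).
    """
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(lambda _: probe(), range(n_clients)))
    success_rate = sum(results) / len(results)
    return success_rate >= min_success_rate, success_rate

# Example with a stub probe that always succeeds:
ok, rate = load_test(lambda: True, n_clients=100)
print(ok, rate)  # True 1.0
```

A real run would point `probe` at an external site through the venue's uplink; the failure mode at MLH Prime — a direct connection that tested clean but a network that dropped packets under load — is precisely what a single-client bandwidth test cannot catch and a test like this can.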

Networking setups vary with each venue, event, and provider, but we must require that all our vendors comply with our own internal screening procedures, which will detect more hackathon edge cases and prepare both us and them for the unique environment of these events. It is also our responsibility to provide backup services in case our primary vendor has an outage.

Food Outage Summary and Impact

During lunch and dinner on August 6 and breakfast on August 7, we had two major issues:

  • There was not enough food to feed all attendees at the same time
  • Food arrived late, leading to a low-quality experience for attendees and overall schedule interruptions

We understand that hackers expect the most from our events and that having sufficient, healthy, and on-time food is integral to that experience.

The root cause of this was that the MLH team did not sufficiently qualify our caterers for the event. We did not properly assess whether our vendors were prepared for the amount of food and the service requirements of a 500-person event. Hackathons are an unusual environment, especially for a catering company, and it is our responsibility to ensure that our vendors are prepared for the environment and the requirements of feeding hackathon attendees in a time-sensitive manner.

While we were able to supplement our catering orders on the fly with secondary food providers when our staff realized meals would be late or insufficient, we did not have those backup options prepared before the event began.

In the future, we will be preparing two things to prevent these issues from recurring:

  • All MLH events will have a primary contracted catering company that can provide healthy and sufficient food for each meal, and a secondary backup food option such as a quick-service bagel, pizza, or sandwich/salad shop nearby that we are confident could provide food for all of our attendees on short notice if an outage occurs
  • We will be expanding our in-depth screening questionnaire for catering companies to better prepare them for our environment and better qualify if a vendor is a good fit for a hackathon

There is a lot of variability in services provided by catering companies, and our responsibility to our hackers and partners is to make sure any vendor we work with has the resources they need to be adequately prepared for our events. When unexpected outages do occur, it is our responsibility to have a clear and well-defined backup plan to make sure the experience is not severely impacted by an outage.

The MLH team apologizes for the impact that these outages had on all of the parties at MLH Prime. We can’t wait for the next MLH Prime, and we will be working hard to implement steps that ensure higher quality service in the future.

Please reach out to us at prime@mlh.io if you have any further questions or concerns.

– Swift & the MLH Team