2011-09 internet outage

From Hypertwins Community
Revision as of 17:36, 25 May 2012 by Woozle (talk | contribs) (→‎Anonymous Cast: User3)

Anonymous Cast

  • BizA: a small business customer of mine; they have an office with 3-4 Windows desktops and 2 servers (WinServer and Julia)
  • People:
    • BizBoss: the business owner/founder
    • BizMgrPast: the former office manager
    • BizMgrNow: the office manager as of this event
    • User3: the third member of the office team, who has been with the business since it started
  • Computers:
    • BossPC: the desktop computer used by BizBoss
    • WinServer: a Win2k server with surprisingly current specs, despite being at least 5 years old
    • Julia: BizA's Linux server (which I purchased and installed a year or two earlier)
  • WinServDom: The Windows domain served internally by WinServer

Overview

This is a technical account of the internet outage at BizA which began on Thursday 2011-09-08 and took several days to resolve. The purpose of this document is primarily to have a record of why this problem took so long to resolve, both in calendar days and in terms of the many hours of work needed after basic internet service was restored.

This document was prepared solely on my own time and at my own expense.

Causes

  • The initial cause of the outage was a bad DSL modem.
  • The replacement modem took the Frontier technician about two hours to configure. This appears to be due at least in part to misleading configuration information given to the technician, compounded by unclear communication between Frontier and BizA and, to some extent, by the fact that BizA did not relay Frontier's service upgrade notices to me.
  • The cause of further problems after the modem was installed and working was twofold:
    • the new modem was not a drop-in replacement for the old one, requiring additional configuration to restore the network to its original state
    • the new modem had absolutely no documentation or help files available for it, requiring two calls to Frontier's HSI support group to get configuration instructions

Final Status

Everything is now working except certain outgoing emails from MS Outlook on BossPC (see Appendix A for details). I have installed Mozilla Thunderbird as a replacement for Outlook. QuickBooks at first insisted on using Outlook for emailing invoices, but I was able to reconfigure it through the Windows Registry.

Prevention

BizA should have a secondary internet connection to use as a backup. BizBoss confirmed with me that this seemed like a good idea. Given the frequency of internet outages on Frontier, the reduction in recovery time (calendar days and technician-hours) will probably more than pay for the cost of an additional internet connection (which should be less than $50/month).
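As a rough check on the "more than pay for" claim, the sketch below works out how many technician-hours per year a backup connection would have to save in order to cover its own cost. The $50/month figure is the upper estimate given above; the hourly rate is a purely hypothetical placeholder and not a figure from this incident.

```python
def break_even_hours_per_year(monthly_cost, hourly_rate):
    """Technician-hours per year that must be saved to cover the backup line."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return (monthly_cost * 12) / hourly_rate

# $50/month is the upper estimate from the text; $60/hour is a hypothetical rate.
hours = break_even_hours_per_year(monthly_cost=50, hourly_rate=60)
print(f"Break-even: {hours:.1f} technician-hours per year")
```

This counts only technician time. It leaves out the cost to the business of days without email, which would lower the break-even point further.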

At various times over the past few years, I had discussed with BizBoss and BizMgrPast the idea of replacing Frontier's internet with another service, and generally gotten positive responses to this. More recently, BizMgrNow had agreed to research the existing Frontier service to see what BizA was paying for it.

Up until this incident, however, we had been talking about replacing the service, possibly including phones. Two things have now become clear:

  • It makes sense to have multiple internet connections, so that vital internet services (especially email) can be brought back up as quickly as possible
  • We do not want a phone service that requires functional internet, at least while we have only one internet service connection.
    • Unfortunately, due to monopolistic regulatory circumstances in the telecomm industry, wired phone service in this area is only available from Frontier. So we will need to maintain Frontier as a phone provider, though it may make sense to purchase some additional internet-based phone services.
    • After we have set up a backup internet service and have some experience switching services when one goes out (at least 2-3 years), we can revisit the idea of switching to internet-based phones if doing so would create a significant expense reduction.

Chronological Summary

  • On the morning of 9/8, BizBoss alerted me to what turned out to be a complete lack of internet service. On site at BizA I eliminated BizA's LAN as a possible problem and then called Frontier tech support.
  • The technician arrived late on the morning of 9/9, installed a new DSL modem, and spent approximately 2 hours getting the modem properly configured. After that, I re-pointed the necessary domains and temporarily configured desktops to work while the domain changes were still propagating (which takes up to 4 hours).
  • On 9/10, it became apparent that the new modem was not satisfactory as configured. I solved the problem with a replacement modem from Intrex.
  • On 9/11, the new modem was no longer working the way it had been working the day before. After trying several different things, I worked out with Frontier tech support that the best solution was to reconfigure the new modem to be more like the old one -- disabling its routing feature and using a standalone router instead (as before). The rest of the day was spent essentially configuring the network back to where it had been.

FAQ

Q: Why was the internet down?
A: The DSL modem died shortly after midnight on the night of 9/7 (that is, early on 9/8). This modem may have been dying for some time, which would explain the poor internet connectivity apparent for several weeks before this incident. (See also Appendix B.)
Q: Why did it take so long to restore the internet?
A: A combination of things:
  • Frontier's failure to clearly communicate essential technical details to us:
    (a) the fact that the service upgrade would mean changing to a static IP address
    (b) the fact that they were going to do the upgrade without any further warning
    (c) the fact that the upgrade had in fact been accomplished
  • Frontier's failure to clearly communicate essential technical details to its technicians (essentially the same list as above)
  • To some extent, the fact that I had not been made aware of the impending upgrade before it happened; had I known, I would have been better prepared for any consequences and more likely to have my facts straight.
Q: Why did it take so long to get everything working again after internet was restored?
A: The following circumstances were the main contributors:
  • The new modem was not (initially) a drop-in replacement for the old one, requiring extensive reconfiguration.
  • The new modem had absolutely no documentation available for it, online or off, requiring extra guesswork, experimentation, and two additional calls to tech support.
  • My initial solution to the problem (replacement modem) seemed to work but then unexpectedly failed, requiring a new solution

Sequential Narrative

Thursday 9/8

On the morning of 9/8, BizBoss called to alert me that she was unable to send or receive emails using their primary domain (Julia serves email for that domain). This ultimately turned out to be because the internet was down. I discovered that a scheduled email had arrived from Julia at midnight, but an hourly check-in at 12:45 a.m. was missed.

Check-in log showing the outage:
2011-09-07 22:45:58	50.105.7.13	Julia
2011-09-07 23:45:58	50.105.7.13	Julia
2011-09-09 12:02:29	50.52.160.120	Julia
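
The gap in this log can also be picked out mechanically. The following is a minimal sketch, assuming the log is tab-separated as shown and that check-ins are expected hourly (per the text); the five-minute slack and the function names are illustrative assumptions.

```python
from datetime import datetime, timedelta

# The three check-in lines quoted above (timestamp, public IP, host).
LOG = """\
2011-09-07 22:45:58\t50.105.7.13\tJulia
2011-09-07 23:45:58\t50.105.7.13\tJulia
2011-09-09 12:02:29\t50.52.160.120\tJulia
"""

def parse(log):
    """Yield (timestamp, ip, host) tuples from tab-separated check-in lines."""
    for line in log.splitlines():
        if not line.strip():
            continue
        stamp, ip, host = line.split("\t")
        yield datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), ip, host

def find_gaps(entries, expected=timedelta(hours=1), slack=timedelta(minutes=5)):
    """Return (last_seen, next_seen, gap, ip_changed) for each overlong gap."""
    entries = list(entries)
    gaps = []
    for (t0, ip0, _), (t1, ip1, _) in zip(entries, entries[1:]):
        if t1 - t0 > expected + slack:
            gaps.append((t0, t1, t1 - t0, ip0 != ip1))
    return gaps

for last_seen, next_seen, gap, ip_changed in find_gaps(parse(LOG)):
    print(f"silent from {last_seen} to {next_seen} ({gap}); IP changed: {ip_changed}")
```

For these three lines it reports a single gap of a little over 36 hours and flags that the public IP address was different when check-ins resumed.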

I spent some time at BizA making sure that the problem was not with the LAN and then called Frontier tech support, who promised to send out a technician probably between 1 and 5 p.m. I gave them my cellphone number to call so that I could meet the technician there as soon as possible.

At this point, the question of whether the DSL service used a static or dynamic IP address came up in passing; this became significant the next day, but at the time seemed like a settled matter.

Our address clearly had been dynamic, because we had several times received a new IP address after prolonged outages (presumably long enough for the network router's IP address lease to expire while it was powered down after the UPS's battery was exhausted).

The information I got from Frontier on 9/8 was that their paperwork said it was static, but also that this was actually an error. The first tech ("Bob") said that the records showed a static IP, but the second tech -- "George", presumably at the HSI support group in WV -- said that it was in fact dynamic, and that the paperwork for the change hadn't "caught up" with the account.

Before I called Frontier, BizMgrNow made a point of showing me emails from Frontier indicating that a service upgrade (to 3 Mbps downstream) was imminent; it was not clear, however, whether the upgrade had already happened. One email seemed to indicate that the new service was in effect, but there was no email specifically stating when the change was supposed to happen or that it had definitely already occurred.

Friday 9/9

The technician did not arrive until late morning the next day (9/9). (He also did not call ahead, but since I was already there this did not contribute to any delays.) He almost immediately installed a new DSL modem, but then had a great deal of difficulty getting it to communicate with Frontier's service.

This is the point at which the nature of BizA's IP address became an issue. I relayed to him several times the information -- which I thought at the time was correct -- that it was dynamic, and explained about the paperwork being wrong. My guess is that he wasted considerable time trying to get the modem to work in that configuration, and of course it wouldn't.

So I was relaying this information to the tech -- and it turned out to be wrong, because the Frontier account upgrade also included a static IP address, as BizMgrNow finally discovered in yet another email from Frontier (while diligently going through Frontier emails trying to find any clues to clarify the situation).

By this time, unfortunately, the technician had already spent well over an hour working on the problem and had contacted his backup support at Frontier, who was able to state definitively that the IP is now static. The tech had the modem configured within a few minutes after that, I confirmed that we were able to access the internet from the network, and he left.

After that, it was necessary for me to reconfigure the network so all the various services would work again. This involved the following items:

  • updating domains (mainly the email server subdomain) to point to the new IP address
  • learning my way around the new modem, which was not a drop-in replacement for the old one -- it included router functionality
  • trying to find documentation for the new modem, and discovering that there simply isn't any
  • using the new modem's router to route incoming requests to Julia so that email would work: SMTP, IMAP, HTTP, SSH
  • trying different configurations in different combinations -- many of which required rebooting of various devices and generally took several minutes to set up -- to see what worked best

After two hours or so of exploring menus, figuring out where various settings were, and trying various configurations, email still worked only if I pointed email clients at the server by its LAN name (Julia) rather than the domain name. That could easily have been due to propagation delay on the domain name, which BlueHost refuses to set any lower than 4 hours.
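
The reasoning behind waiting out the propagation delay can be sketched as simple arithmetic: a resolver that cached the old address just before the change may keep serving it for the full delay. The 4-hour figure is from the text; the time of the change used here is a made-up example.

```python
from datetime import datetime, timedelta

def stale_until(changed_at, ttl):
    """Latest time a cache that fetched the old record just before the change may still serve it."""
    return changed_at + ttl

# The 4-hour delay is from the text; the 2 p.m. change time is a made-up example.
changed_at = datetime(2011, 9, 9, 14, 0)
print(stale_until(changed_at, timedelta(hours=4)))
```

Until that time has passed, a failure to reach the server by domain name does not by itself show that the router is misconfigured.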

On top of that, the router did not seem to have any way of assigning a fixed internal IP address to machines on the network. This is standard router functionality: if you're going to have the router point a service port at a particular address, it's nice if the router doesn't go assigning that address to a different machine later on. That happened several times as machines lost their old address leases and were assigned addresses in the new router's address space, which was inexplicably set to a nonstandard range. It required further rebooting when WinServer got moved to a new address and suddenly nobody could see the WinServDom domain.

Thinking that people might want to continue using email, I set up all 3 email clients to point directly to Julia rather than the domain, tested everyone's email, and planned to come back the next day after the domain name had finished propagating.

Saturday 9/10

I returned to BizA at 9 a.m. (as soon as possible after our regular trip to the Farmers' Market) and determined that the router was refusing to route self-referential domain-name requests from within the network.

Let me explain the term "self-referential domain-name requests" a little more clearly.

From within the network, the email server is known as "Julia", and it has a local IP address (192.168.1.something -- see wikipedia:network address translation). Anyone on the network can talk directly to Julia.

From outside the network, you can't make requests directly to machines inside the network; you have to talk to the firewall and ask it for a particular service (where each service has an assigned port -- so sending a request to a particular port is the same as asking for a particular service).

If the firewall has been configured to allow requests for that service to go through, then it will forward them to the machine it has been told to forward them to. On the BizA LAN, the only machine which handles any outside services is Julia, who serves HTTP, IMAP, SMTP, SSH, and Webmin.
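
The forwarding behavior described above amounts to a lookup table keyed by port. Here is a minimal sketch, assuming Julia's services run on their usual default ports (the text does not say which ports were actually used) and using a hypothetical LAN address in place of "192.168.1.something".

```python
# Standard default ports for the services Julia offers to the outside world.
SERVICE_PORTS = {"SSH": 22, "SMTP": 25, "HTTP": 80, "IMAP": 143, "Webmin": 10000}

# Hypothetical LAN address standing in for Julia's "192.168.1.something".
JULIA = "192.168.1.10"

# The firewall's forwarding table: public port -> (internal host, internal port).
FORWARDS = {port: (JULIA, port) for port in SERVICE_PORTS.values()}

def handle_inbound(port):
    """Return where an outside request to `port` is delivered, or None if dropped."""
    return FORWARDS.get(port)

print(handle_inbound(25))    # SMTP is forwarded to Julia
print(handle_inbound(3389))  # no rule for this port, so the request is dropped
```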

From outside the network, however, you cannot refer to "julia"; you have to use either an IP address or a domain name; the best practice is to use a domain name, since IP addresses can change, so I had pointed "office.[BizA].com" at the BizA LAN's internet address, and configured all outside devices (iPhones, laptops, users' home PCs) to use that domain when fetching email.

While I could have configured inside devices (desktops on the BizA network) to refer to "Julia" instead of office.[BizA].com, doing so causes problems.

MS Outlook and some other email programs give warning messages if they are configured to send a password insecurely over the network, so I had set up these programs to use encrypted connections (the risk of interception was minimal on a small local network, but it's better to be safe than to spend a lot of time trying to defeat an overzealous security precaution). Encrypted connections require SSL, and SSL requires a certificate, and Outlook tends to complain if it doesn't recognize the certificate-signing authority (self-signed is not acceptable), so earlier I had purchased a certificate for office.[BizA].com from a known issuing authority and installed it on Julia.

So we were kind of stuck with using office.[BizA].com if we didn't want to get certificate errors.
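
The certificate constraint can be illustrated with a deliberately simplified name check. This is not how Outlook or any real TLS library validates certificates (those also verify the signing chain, expiry dates, and more); it only shows why the LAN name fails where the domain name succeeds. The name "office.example.com" is a stand-in for office.[BizA].com.

```python
def name_matches(cert_name, requested):
    """Simplified hostname check: exact match, or a single leading '*.' wildcard label."""
    cert_name, requested = cert_name.lower(), requested.lower()
    if cert_name.startswith("*."):
        # A wildcard covers exactly one extra label: *.example.com matches a.example.com.
        head, _, tail = requested.partition(".")
        return bool(head) and tail == cert_name[2:]
    return cert_name == requested

CERT_NAME = "office.example.com"  # stand-in for office.[BizA].com

print(name_matches(CERT_NAME, "office.example.com"))  # name on the certificate
print(name_matches(CERT_NAME, "julia"))               # LAN name, not on the certificate
```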

The new router was doing just fine handling requests from outside the network -- external users could send and receive email through Julia.

The problem I discovered on the morning of 9/10 was that the new router was not correctly routing such requests from inside the network; it seemed to be blocking them, as they would simply hang indefinitely.
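
One common explanation for this symptom is a router that does not support "NAT loopback" (also called hairpinning), meaning it will not turn a request from inside the LAN for its own public address back toward the internal server. I could not confirm that this was the cause here, so the toy model below is only a sketch of that hypothesis, with hypothetical addresses and default ports.

```python
PUBLIC_IP = "203.0.113.5"   # hypothetical public address (documentation range)
JULIA = "192.168.1.10"      # hypothetical LAN address for Julia
FORWARDED_PORTS = {22, 25, 80, 143}

def deliver(from_inside, dest_ip, port, hairpin_supported):
    """Return the LAN host that receives the request, or None if it goes nowhere."""
    if dest_ip == JULIA:
        # Only LAN clients can address Julia directly.
        return JULIA if from_inside else None
    if dest_ip == PUBLIC_IP and port in FORWARDED_PORTS:
        if from_inside and not hairpin_supported:
            return None  # request for our own public address is never looped back
        return JULIA
    return None

print(deliver(False, PUBLIC_IP, 25, hairpin_supported=False))  # outside user: reaches Julia
print(deliver(True, PUBLIC_IP, 25, hairpin_supported=False))   # inside user via domain: lost
print(deliver(True, JULIA, 25, hairpin_supported=False))       # inside user via LAN name: works
```

In this model the three cases line up with what was observed: outside users were fine, inside users worked by LAN name, and inside users going through the domain name got no answer.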

I spent considerable time trying to figure out what combination of settings might get around this problem, and then decided to go get a new DSL modem just so I would have one with documentation (and a brand whose configuration screens I was more familiar with, at that). It took an hour to get this $30 replacement modem because Intrex in Durham did not have any, and I had to drive to the Chapel Hill Intrex.

At this point, I had three different devices -- Frontier modem, TP-Link modem, standalone router -- which might be configured to do what I wanted in several different ways:

  1. Frontier modem provides routing (other 2 devices redundant): won't work due to self-referential routing problem... unless I can figure out how to fix that
  2. Frontier modem in "dumb bridge" mode, standalone router (TP-Link modem redundant)
  3. TP-Link modem provides routing (other 2 devices redundant)
  4. TP-Link modem in "dumb bridge" mode, standalone router (Frontier modem redundant)

After running a lot of tests and trying several combinations, I settled on #3, which seemed to be working well. This took many hours.

I also fixed some network printing problems User3 was having, and then went home.

Sunday 9/11

At some point during the night, the new TP-Link modem started blocking all inbound signals. It took me some time to establish that this was what was happening, and I called Frontier tech support to verify that they weren't somehow doing the blocking.

They confirmed that they were not, and we discussed possible solutions to the problem; they suggested that they could walk me through reconfiguring the new modem into "dumb bridge" mode. I didn't feel that I could make the necessary changes right then, as this would require rebooting the modem and disrupting the network, and User3 was doing some work on her computer which might be interrupted by that. So I confirmed the basic information I would need in order to accomplish the reconfiguration and told Frontier I would call them back if I was unable to figure out how to do it.

As it turned out, I was unable to do it, so I called them back and they walked me through the necessary steps, which took about 5 minutes. I also confirmed the basic information I would need in order to configure the standalone router to work with this new modem setup.

After that, it was necessary to configure the standalone router:

  • for static IP, with the appropriate public address, gateway, and DNS, and rebooting
  • to serve DHCP to the other devices on the LAN
  • to reserve a fixed address for Julia
  • to forward the appropriate ports to Julia
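
The four settings above depend on one another: a port forward is only reliable if its target has a reserved address, and that address has to sit on the LAN. The sketch below records such a configuration and checks it for consistency. Every address, the field names, and the `problems` helper are hypothetical illustrations, not the actual BizA settings or any router's real interface.

```python
import ipaddress
from dataclasses import dataclass, field

@dataclass
class RouterConfig:
    wan_address: str          # static public address from the ISP
    wan_gateway: str
    dns_servers: list
    lan_network: str          # e.g. "192.168.1.0/24"
    dhcp_pool: tuple          # (first, last) dynamically assigned addresses
    reservations: dict        # host name -> fixed LAN address
    forwards: dict = field(default_factory=dict)  # port -> host name

def problems(cfg):
    """Return a list of human-readable problems; an empty list means the config is consistent."""
    found = []
    lan = ipaddress.ip_network(cfg.lan_network)
    if not cfg.dns_servers:
        found.append("no DNS servers set")
    for addr in cfg.dhcp_pool:
        if ipaddress.ip_address(addr) not in lan:
            found.append(f"DHCP pool address {addr} is outside {cfg.lan_network}")
    for host, addr in cfg.reservations.items():
        if ipaddress.ip_address(addr) not in lan:
            found.append(f"{host}: {addr} is outside {cfg.lan_network}")
    for port, host in cfg.forwards.items():
        if host not in cfg.reservations:
            found.append(f"port {port} forwards to {host}, which has no fixed address")
    return found

cfg = RouterConfig(
    wan_address="203.0.113.5", wan_gateway="203.0.113.1",
    dns_servers=["203.0.113.53"], lan_network="192.168.1.0/24",
    dhcp_pool=("192.168.1.100", "192.168.1.199"),
    reservations={"julia": "192.168.1.10"},
    forwards={22: "julia", 25: "julia", 80: "julia", 143: "julia"},
)
print(problems(cfg))
```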

And then:

  • WinServer had to be rebooted along with all the desktop machines so that they would all be using the same gateway again and could see the WinServDom domain.
  • Various services (especially email) needed to be tested from both inside and outside the network.
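
A first-pass reachability test of this kind can be scripted. The sketch below only checks whether each port accepts a TCP connection; it does not confirm that a login or a mail delivery actually succeeds. The port numbers are the usual defaults and the host name in the comment is a placeholder, both assumptions rather than details from this network.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_services(host, ports):
    """Map each service name to whether its port accepted a connection."""
    return {name: port_open(host, port) for name, port in ports.items()}

# Default ports for the services Julia forwards; adjust if the server uses others.
PORTS = {"SSH": 22, "SMTP": 25, "HTTP": 80, "IMAP": 143}

# Example (hypothetical host name): run once from a LAN desktop, once from outside.
# print(check_services("office.example.com", PORTS))
```

Running the same check from a desktop on the LAN and from a machine outside it separates inside-only failures, like the one on 9/10, from outside-only failures, like the one on 9/11.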

A problem arose with BossPC, as explained in Appendix A; I resolved the problem by installing Mozilla Thunderbird, which is consistent with my long-term plan of eliminating Microsoft products as much as possible and Outlook in particular.

Wednesday 9/14

BizBoss reported that the invoice-emailing function in QuickBooks was insisting on using Outlook for its emails, which was (of course) not working very well. I was able to reconfigure QuickBooks to use its internal emailer, which seemed to be able to send email to outside addresses. (I sent one test invoice to myself, and BizBoss sent another test invoice to a customer; at the time, however, BizMgrNow was not able to reach the customer to confirm that the invoice was received.)

Appendixes

A: BossPC email mystery

On BossPC alone, Outlook was able to send email to [BizA].com but not to any address outside the network. A single email addressed to staff@[BizA].com, test@hypertwins.org and woozalia@gmail.com, would arrive at the first but not the others.

BizMgrPC, which also runs Outlook, did not have this problem.

As of 9/17, I have a theory about this which I will test the next time I am on-site; it has to do with how Outlook is authenticating with the SMTP server. I thought I had compared settings between BossPC and BizMgrPC, but I may have overlooked a detail.

B: alternative theory for outage

Although the tech said the modem definitely was not working, it seems to me that if the upgrade was completed, the modem would need to be reconfigured in order to keep working -- so it doesn't seem unreasonable to think that the upgrade actually caused the outage.

Arguments in favor:

  • the upgrade was completed at around the same calendar-time as the internet service went out
  • the timing seems awfully close for this to be a coincidence

Arguments against:

  • the on-site technician said that the modem was definitely dead, and that the upgrade did not cause the problem
  • the outage happened in the middle of the night; surely an upgrade would take place during business hours
    • ...unless the modem still had a lease on the old IP address for something like 8-24 hours (24 hour IP leases are common)