2010-09-01 crisis

From Hypertwins Community

Latest revision as of 21:47, 28 June 2015

Anonymous Cast

  • BizA: a small business customer of mine since 2007; they have an office with 3-4 Windows desktops and 2 servers (WinServer and LinServer)
  • People:
    • BizBoss (BB): the business owner/founder
    • BizAdmin (BA): the business owner's main assistant, a family member
    • BizMgr: the office manager
  • Computers:
    • BossPC: BizBoss's WinXP desktop computer; installed before 2007
    • AdminPC: BizAdmin's reconditioned WinXP desktop which I purchased and installed last year
    • WinServer: a Win2k server with surprisingly current specs, despite being at least 5 years old; installed before 2007
    • LinServer: BizA's Linux server, which I purchased and installed in late 2009
  • Windows networking stuff:
    • WinDom: The Windows domain served on the LAN by WinServer
    • SambaWG: The Windows workgroup served on the LAN by LinServer

Non-Anonymous Cast

Overview

This is an anonymized version of a document I gave to BizBoss and BizAdmin in the aftermath of the event it details. As far as I can determine, I have omitted any details that might allow anyone who was not working there at the time to identify the business involved.

The remainder of the document is essentially as I gave it to them, aside from anonymization.

Introduction

This is a detailed account of how the crisis of 9/1 occurred, and what actions I took towards fixing it. The purpose of this document is twofold:

  • so that you can hopefully avoid similar incidents in the future
  • so that you can properly evaluate my actions in responding to it

I encourage you to share this document with anyone who may have sufficient knowledge or experience to evaluate the problems I ran into and how I handled them. More generally, you are welcome to distribute it by whatever means and to whomever you wish.

I stand behind my work, and I stand behind my statement that I made no obvious mistakes in handling this situation. I welcome rational, informed criticism of my actions in dealing with this crisis, and I will be happy to answer questions about it from anyone.

Causes

The extent and duration of this incident are largely due to two things:

  1. Multiple problems happening simultaneously (see "Sequence of Events (Summary)" and "Issues"), which triggered further concurrent problems; the conflicting priorities and the large number of possible resolution paths interfered with each other
  2. Poor communication (see "Communication")

Also, in at least three cases, my own reluctance to make significant changes prematurely kept me from taking steps which might have prevented this problem. See "Mea Culpa", below.

Final Status

My understanding of things is that the only remaining problem is that PayClock is not running. I will not be able to diagnose that problem, much less fix it, without access to BossPC.

BizAdmin has mentioned verbally that there are apparently other remaining problems -- "everything is all messed up" -- but has not described any particular problem to me in sufficient detail to permit a solution. The one specific problem BA did mention was dissatisfaction with the new printer, but I am not yet clear on what the problem is, and I certainly can't address it without this understanding.

There is also a certain amount of cleanup and maintenance needed; the longer this is deferred, the more likely it will lead to new problems.

Sequence of Events (Summary)

  • On 9/1, I became aware of three problems which are discussed at length in the sections below named "email", "printing", and "BossPC/network". The first was solved on 9/1 (although due to propagation delay it took until 9/2 a.m. for the solution to completely take effect), the second on 9/2, and the third on 9/8.
  • On 9/2, I became aware of the problem described in the QuickBooks section below. This was temporarily resolved on that same day; permanent resolution required solving the BossPC/network problem, but has not been effected because BizBoss is nervous that any further work on BossPC may make things worse.
  • On 9/6 (Labor Day), I first became aware of the possible importance of PayClock, which became a problem on 9/7 as described in its section. This problem has not yet been resolved for the same reason the QuickBooks problem has not been permanently resolved.

See "Sequential Narrative" below for more details in chronological order.

The next section is a breakdown of the incident by technical issue, which should make the thought processes easier to follow.

Issues

email

problem

Email was not working.

causes

Verizon/Frontier went offline for several hours (starting at about 4 a.m.), and when it came back it was with a different IP address -- so all of the email programs configured to access the email server (LinServer) via internet domains were not able to talk to it, and LinServer was not able to receive email sent to those domains.

The cause of Verizon/Frontier going offline was that BizA had not paid their bill (conversation with BizAdmin, a.m. 9/27).

This problem does not seem to have caused any other problems, though it was a distracting nuisance and complicated the diagnosis of the other problems which were first noted on the same day.

remedy

It unfortunately took several hours to determine the cause of the problem, mainly because I was trying to diagnose three different problems at once (see "printing" and "BossPC/network"), all of which were difficult and presented multiple avenues of inquiry (is it the mail server? is there something wrong with the physical network? is there something wrong with the Windows domain server?). It might also have pointed me in the right direction sooner if someone had mentioned that the internet had gone out (there was a mention of the phones going out, but not the internet; at the time, I only determined the latter by looking at server logs).

The solution was simple: determine the new IP address using tools already in place, and update the domains to point to the new address. The server was able to receive email and communicate with some email clients within a few minutes, and was back to normal by the next morning.
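The fix described above amounts to comparing the address each domain currently resolves to with the office's new public address. A minimal Python sketch of that check follows, assuming the new public IP is already known (for example, read from the router). The function name `stale_records` is hypothetical, the addresses and host names are documentation placeholders rather than BizA's real ones, and actually changing the records still happens through each registrar's own control panel, which is not shown.

```python
import socket
from typing import Callable, Dict, List


def stale_records(current_ip: str, hostnames: List[str],
                  resolve: Callable[[str], str] = socket.gethostbyname) -> Dict[str, str]:
    """Return {hostname: published_ip} for every name that does not point at current_ip.

    `resolve` defaults to the system resolver; pass a stand-in to test offline.
    """
    stale = {}
    for name in hostnames:
        try:
            published = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            published = "unresolvable"
        if published != current_ip:
            stale[name] = published
    return stale


# Hypothetical usage with placeholder names:
# stale_records("203.0.113.9", ["office.example.com", "mail.example.net"])
```

Because `socket.gethostbyname` goes through the local resolver, a name that was just updated may still show the old address here until cached entries expire, which is the same lag described in the notes for this issue.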

final status

This issue was partly resolved by end-of-day 9/1, and completely resolved by the morning of 9/2.

notes

Redirecting BizA-alt-domain.net was almost instantaneous, but for some reason "office.BizA.com" (registered via EarthLink) did not propagate out until the next morning -- which is strange, because you would think a major backbone provider like EarthLink would cause faster propagation than a smaller operation like BulkRegister. This may be an indication that we should move domain registration over to BulkRegister in order to avoid lag in future events where the IP address changes, or it may have been a fluke.

Better yet would be to switch to an ISP which provides a static IP address, so that the IP address never changes, avoiding the whole problem.

I have recommended this in the past, and received a generally positive response -- although there is a question about whether the ISP needs to be the same company as the phone service provider (I don't see why, except to save small amounts of money which would probably be dwarfed by the extra amounts spent on support as long as we stay with Frontier). I also have some concern over whether anything but wired phone service would be adequate, which would mean that we would need to stay with Frontier for phone service even if we change to another provider for internet. These are all issues which need to be discussed in some detail before there can be any forward motion.

printing

problem

Nobody was able to print.

cause

Obviously, with BossPC being off the network, nobody else could use its shared printer, which was the only one in the office. (BizBoss was able to print using BossPC, but this was a moot point due to other problems with BossPC -- see BossPC/network.)

remedy

Although fixing BossPC's network access was the favored solution, I began looking for other solutions when this proved to be a difficult problem in its own right (see BossPC/network). Ultimately, setting up another laser printer which was already in the office seemed like the most cost-effective solution, and also addressed a long-standing to-do item (changing the configuration so that printing did not depend on BossPC's proper functioning).

final status

A temporary resolution was in place by end-of-day 9/2, and a permanent solution was in place by end-of-day 9/3, with the caveat that faxlaser still needs a tray for receiving printouts; possibly it arrived with one, but since I didn't take faxlaser out of the box or set it up originally, I do not know where the tray was left. If it is unlikely to be found, I can probably obtain one on eBay or via some other means.

narrative

For many years now (at least since 2007 when I first started working for BizA -- it was one of the first ongoing issues mentioned to me), the office's only printer has been a laserjet connected via parallel port to BossPC. (I will call this printer "oldlaser" for short.) This means that when BossPC goes off the network, nobody can print.

As soon as I discovered this situation, I suggested that the printer should be moved and attached to a central server which would not be as likely to experience network interruptions; my understanding was that there was general agreement to this plan.

What with general busyness and other crises, it took until late 2009 before I was able to purchase LinServer (intended to meet this and other needs) and set it up. As of September 1, actually moving oldlaser's data connection from BossPC to LinServer had not yet risen to the top of the priority list, though it probably would have happened within a few more weeks barring any major crises.

Unfortunately, when I took a closer look at oldlaser on 9/1, it turned out to require a parallel port connection rather than a USB connection as I had assumed. Although it still would have been possible to make it independent of BossPC, the parallel interface made the task much more difficult: parallel cables cannot be longer than about 12 feet (or possibly 20 feet, which might have been long enough -- but still would have been awkward), and both the cables and support for them are increasingly difficult to find (e.g. Intrex's online catalog lists "parallel cables" as a category, but does not actually list any such cables as being available).

A networked printer hub might have been a solution, but my understanding is that this would have cost almost as much as a new printer.

Given that monochrome laser printers are now easily available for around $100, I was considering buying another laser printer as a solution (at the time, oldlaser's printouts had black streaks on them, and for all I knew it was going to be up to me to diagnose this problem) -- when it occurred to me to take a closer look at the fax/copier over by the window, which turned out to be a laser printer with a USB interface. (I'll call that printer faxlaser.)

I needed an extra-long USB cable since faxlaser was/is some distance from LinServer. By the time I reached this point, it was quite late in the day (remember that I was trying to fix the printer issue in the background while trying various things to fix the other problems), and Intrex had already closed by the time I headed home that night, so I bought the cable on my way over on 9/2. Setting it up to be available on the network was a matter of about 10 minutes, but then I ran into the fact that its driver CD didn't seem to have been put into the collection of driver CDs in the server closet, nor was it anywhere else I could find.

(Missing software CDs and manuals have been a recurring theme in this incident. Also, to the best of my recollection, faxlaser first appeared sometime in the past year, but I do not know who purchased it or set it up; it was done completely without my involvement. If I had been involved, I would have recommended installing the paper tray -- which now appears to be lost -- and also setting it up as a backup printer right away, before a crisis hit and while the driver CD's location was still known.)

Finding the necessary drivers online was a fairly minor task as things go, but it still required extra time and attention which might have been better spent on the other issues.

I set up BizAdmin's desktop, BizMgr's desktop, and the Staffing desktop to print to the new printer, and considered the problem solved.

After that, BizAdmin reported that BA was having trouble printing; I rebooted AdminPC and it seemed to work just fine -- and as far as I knew until 9/27, that was the only printing problem encountered with faxlaser. On 9/27, BizAdmin mentioned having further problems with it, but did not describe them and did not want me to do anything about it.

BossPC/network

Of the three concurrent problems I first became aware of on 9/1, this was by far the most intractable, and it was also indirectly the cause of at least three additional problems ("printing", "QuickBooks", and "PayClock").

problem

BossPC was refusing to see the network in any capacity -- internet or local. This prevented email, web browsing, printer sharing (see "printing" above), and the use of any applications which shared data on the network including QuickBooks (see "QuickBooks" below).

This also prevented BizBoss from using BB's Windows-domain-based login (to the WinDom domain on WinServer), where BB's usual desktop configuration was stored. My kluge to allow BB to log in at all presented BB with an unfamiliar desktop, which led to further problems (especially "PayClock").

causes

The ultimate cause of this problem turned out to be McAfee Firewall, which had inexplicably decided to block all network access (possibly due to an automatic update). Uninstalling McAfee completely restored access to the network, and with a small amount of further tinkering re-enabled BizBoss's domain-based login.

remedies

Before I disabled McAfee and discovered that this was the problem, I tried a number of solutions including the following:

  • booted Linux liveCD to verify that the problem was not hardware (Linux had no trouble seeing the Windows network)
  • attempted several times to run an anti-virus scan, but it kept taking too long while I needed to be trying other things (couldn't reboot while scan was in progress). Eventually ran it overnight, but it didn't find anything.
  • un-joining and re-joining the WinDom domain (refused to even see WinDom, could not re-join)
  • uninstalled drivers for old wireless card no longer present
  • uninstalled driver for LAN card
    • This was the first thing that showed any sign of progress -- I was able to see the internet after this, and could ping local machines, but could not access them via Windows Networking or join WinDom
  • uninstalled/reinstalled TCP/IP networking
  • purchased new hard drive and began setting up fresh system which would hopefully not have the same networking problem -- but then realized I did not have the install CD for the version of QuickBooks BizBoss was using. Since the data files are not backward-compatible, I would not be able to give BB access to BB's existing QuickBooks files on a new system no matter how well it worked otherwise; abandoned this project and reverted to original hard drive.
  • installed spare ethernet card, in case the problem was some obscure hardware problem that was only triggered by Windows drivers
  • tried several different configurations of the Samba server running on LinServer, in case it was clashing with WinServer's configuration
  • closely inspected WinServer's security configuration, to make sure it wasn't blocking BossPC for some unknown reason
  • rolled back BossPC's configuration to a date when it was known to be working
  • uninstalled any piece of software which I was sure either (a) I could reinstall if BizBoss needed it, or (b) was definitely unnecessary.
  • finally, a suggestion on a software forum led me to try disabling McAfee. It seemed a very long shot at the time.

One thing I didn't dare try was un-joining any of the other desktops from the WinDom domain (as a test to see if the problem was with WinServer or with BossPC), in case they too refused to rejoin. There were too many things to try and not enough time to go home and borrow an XP system to use, especially since I wasn't sure how fruitful that line of inquiry would be. If I had been thinking more clearly, I might have remembered the old Staff PC in the closet, and used it as a guinea-pig, which I could have done while keeping other tasks running concurrently.

After uninstalling all of McAfee (something I had been tempted to do some time ago -- I now wish I had, but caution prevailed at the time), I enabled Windows's built-in firewall and installed a free anti-virus program I'm familiar with (ClamAV for Windows). It promptly found an infected file and removed it. (Possibly this was a file McAfee had found earlier and quarantined.)

Although BossPC was able to rejoin WinDom at this point (late 9/7), allowing it to see WinServer and access shared network files, I was not able to log in as BizBoss/WinDom.com. On impulse (9/8 a.m., after giving up on PayClock), I tried logging in as Administrator/WinDom.com -- which did work. On further impulse, I tried BizBoss/WinDom.com again -- which worked this time, inexplicably (although what's more inexplicable is why it didn't work the first time).

final status

This was partly solved 9/7 evening and completely fixed shortly before 10:46 on 9/8.

obstacles

I also bought a new hard drive and tried setting up a completely new system on it. This went okay until I got to QuickBooks. The latest QuickBooks version for which I have the CDs is QuickBooks 2003 -- and as far as I knew, that was the version installed on BossPC, so I installed it on the new hard drive. The installation itself went fine, but when I tried to load the BizA data files, QB complained that they were corrupted. This worried me a great deal (had the files become corrupted somehow?) until I finally noticed that the QB on BossPC (original drive) was not 2003 but "QuickBooks Premier Edition 2009" -- so of course the files were not compatible.

This is another instance where communication failures cost time and made a solution impossible, as I should have (a) been aware that 2009 was now the version in use, and (b) had access to its install CD. Had I been informed of its purchase and installation, I would have made sure to make a copy of the CD and put it in a safe place.

Without access to the current QuickBooks install CD, then, it was not possible to set up a new system drive to get around the network problem. BizAdmin asked if I could just go out and buy a new copy -- but the 2009 version is no longer available in stores, having been replaced by the 2010 equivalent (minimum of $250 at Costco). This version presumably uses a newer, non-backward-compatible file format, so I was hesitant to purchase it -- and, as it turned out, it wouldn't have helped prevent the really big problem which I didn't yet know about, i.e. PayClock.

It probably would have solved the problem as I understood it on 9/5 -- but then BizBoss could not have used the new system for payroll since it did not include PayClock, which I only became aware of on 9/6 and for which the install CDs have yet to be found. PayClock is generally not available off-the-shelf and in any case is quite expensive (the upgrade alone is $400), so the only reasonable option appears to be obtaining a replacement CD from the manufacturer, unless BizBoss or BizAdmin can locate the install CD.

It should also be noted that the intractability of this problem cost an enormous amount of time spent trying different things to figure out what the problem was. A sample sequence from my notes:

  • uninstalled Intel Pro/100 driver (LAN card)
  • rebooted
  • internet magically started working
  • (installed Java updates that kept nagging to be installed)
  • won't connect to WinDom -- maybe I just need to rejoin WinDom domain now?
    • no -- "My Computer" claims it is already joined
  • unjoined WinDom domain; joined "SambaWG" workgroup (on LinServer)
  • rebooted as "Administrator/BossPC" (fortunately same password as "Administrator/WinDom")
  • refuses to join WinDom
  • uninstalled LAN card driver (Intel Pro/100) again
  • powered down & rebooted
  • Windows finds LAN card driver, reinstalls without asking... ok...
  • ping LinServer ok
  • NET VIEW command does not work (times out)
  • ping WinServer ok
  • Google search (for the error message when trying to rejoin WinDom) seems to indicate that the domain server (WinServer, in this case) needs to be entered as the client's DNS server, but no other machine on the network is configured that way
    • tried it anyway, and several other tests suggested by it
  • did a point-by-point comparison with TCP/IP settings on AdminPC
  • tried several variations in case it was a fluke that those settings worked elsewhere
  • information on Microsoft tech forum indicated that the problem may be due to a virus
    • ran McAfee virus scan, which took longer than half an hour and prevented me from rebooting while it was running (thus hindering other tests I was wanting to try)
  • with extreme reluctance, rebooted WinServer (hoping to force browser election, because there were some indications that this might be part of the problem) -- nothing bad happened, but nothing good either

QuickBooks

problem

QuickBooks was not loading properly on BossPC.

cause

The root cause of this problem was the "BossPC/network" problem. Because BossPC was unable to access the Windows network, it could not load the QuickBooks data files, which are located on WinServer.

remedy

Since the networking problem was proving intractable and there was a deadline looming, I:

  • logged BossPC in locally (instead of to the WinDom domain) as Administrator. (I was lucky that the password was the same as Administrator/WinDom.com; I don't think it had been documented by whoever set up the machine. The password for BossPC/WinDom.com did not work on the local machine.)
  • copied the QB data file from the network to BossPC using a USB stick, and then reconfigured QB to load from there.

I spent all my available time over the 3-day weekend of 9/4-9/6 trying to fix the networking problem so that QB could be restored to its proper configuration and all would be well, but to no avail; see "BossPC/network" above.

final status

An emergency fix was implemented on 9/2; this fix is still in place because I have not had access to BossPC since payroll was completed. This is a bad situation because QuickBooks data is currently not being backed up to the network -- if BossPC's hard drive goes down, all QuickBooks work since 9/1 will be lost.
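Until QuickBooks is pointed back at WinServer, even a simple scheduled copy of the local data file to a network share would narrow that exposure. The sketch below is a minimal illustration, assuming a hypothetical local file name and share path (neither is taken from BizA's actual setup); it would ideally run while QuickBooks is closed, since copying a file that is open for writing can capture it mid-update. Scheduling it (for example with Windows Task Scheduler) is not shown.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_copy(src: Path, dest_dir: Path, now: Optional[datetime] = None) -> Path:
    """Copy src into dest_dir under a timestamped name and return the new path.

    Timestamped names mean earlier backups are never overwritten.
    """
    now = now or datetime.now()
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"{src.stem}-{now:%Y%m%d-%H%M%S}{src.suffix}"
    shutil.copy2(src, target)  # copy2 also preserves the file's modification time
    return target


# Hypothetical usage (paths are placeholders, not BizA's real ones):
# backup_copy(Path(r"C:\QBData\company.QBW"), Path(r"\\WinServer\backups\quickbooks"))
```

Keeping every timestamped copy trades disk space for the ability to go back to a known-good version; old copies would still need pruning by hand or by a separate rule.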

PayClock

problem

This issue arose in three phases:

  • Phase 1:
    • Because BossPC could not access the network on 9/6, BizBoss's usual login wasn't working -- so I had to log BB into the local machine instead (see "QuickBooks" above).
    • Because the local machine login did not have BizBoss's usual shortcuts, BB was having some difficulty accessing the PayClock application.
  • Phase 2:
    • After considerable rummaging around (and attempting to back things up locally, which ended up taking too long), I found what seemed to be the PayClock data files. These, however, turned out to be from 2004.
    • After further rummaging around on 9/7, I found data files which seemed to be current.
    • Unfortunately, these files were missing the last 2 weeks of payroll data. Said data did not appear to be on the PayClock device either.
  • Phase 3:
    • It later emerged that the PayClock device does not seem to be working anymore. I was not aware of this when I was working on BossPC, and I have been asked not to look at it yet.

The missing data remain missing, and I am not able to look at the problem with the PayClock device because BizBoss is worried that any further access to BossPC will make things worse.

causes

The triggering cause of this problem was the "BossPC/network" problem, which meant that the desktop shortcuts via which BizBoss normally accesses PayClock were not available. During the process of trying to fix this problem, someone (not me, to the best of my knowledge and recollection) moved some of PayClock's system and data files into places where it couldn't find them, which may have in turn caused the data loss.

Without a better understanding of PayClock, I can't determine what might have happened to the lost two weeks of payroll data. Without being able to examine PayClock's setup, I can't begin to determine the reason for its failure to work properly beginning 9/8.

Some details which contributed to the problem:

  • There were two copies of PayClock on BossPC, one from 2004
  • I was not aware that I needed to be supporting PayClock until it was broken
  • The PayClock data had never, as far as I could tell, been backed up until I did so (local backup 9/6, network backup 9/7)
  • The installation CDs were not available, and I never did get any clear indication about whether BizBoss thought BB could find them
  • The manual turned out to be on the computer, but I didn't know that when I was first trying to fix it

Also, PayClock itself contributed to the problem somewhat, by its lack of diagnostic and logging features. Any such mission-critical software should have features of that sort in order to help prevent exactly this sort of disaster. In particular, there should be a log somewhere of every time-clock event that is entered -- where did it come from and when was it added to the database -- preferably mirrored as a text file. This would have made the search for missing data much more likely to bear fruit.
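As an illustration of the kind of log meant here, the following Python sketch appends each time-clock event, together with where it came from and when it was recorded, to a plain-text CSV file that is only ever appended to. This is not PayClock's format or API (PayClock does not appear to offer anything like it); `log_punch` and the field names are hypothetical.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical field names for each logged time-clock event.
FIELDS = ["employee_id", "punch_time", "source", "recorded_at"]


def log_punch(log_path: Path, employee_id: str, punch_time: datetime,
              source: str, recorded_at: datetime) -> None:
    """Append one time-clock event to a CSV text log; existing lines are never rewritten."""
    new_file = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()  # header only when the log is first created
        writer.writerow({
            "employee_id": employee_id,
            "punch_time": punch_time.isoformat(),
            "source": source,            # e.g. which device or import the event came from
            "recorded_at": recorded_at.isoformat(),
        })
```

Because the file is plain text and append-only, a sync that misplaces records in the database would still leave a readable trail of every punch that had arrived.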

Also, it is absolutely idiotic that the database format is proprietary and completely non-interoperable with any existing database tools. ODBC interface? Dump to raw SQL? Dump to CSV? Any of those options would have made a data merge possible, so the fact that their tech support says it is impossible makes it clear to me that this application's data is quite deliberately sealed up tight. This is taking "customer lock-in" much too far.
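To show why a plain export would have been enough, here is a minimal Python sketch of the merge itself, assuming two hypothetical CSV exports with `employee_id` and `punch_time` columns (say, one from the current database and one recovered from an older copy). Punches count as duplicates only when both fields match exactly, and timestamps are assumed to share one ISO-8601 format, so sorting the text also sorts them chronologically.

```python
import csv
import io
from typing import Iterable, List, Tuple

Punch = Tuple[str, str]  # (employee_id, punch_time as ISO-8601 text)


def read_punches(text: str) -> List[Punch]:
    """Parse a CSV export that has 'employee_id' and 'punch_time' columns."""
    return [(r["employee_id"], r["punch_time"]) for r in csv.DictReader(io.StringIO(text))]


def merge_punches(*sources: Iterable[Punch]) -> List[Punch]:
    """Union several punch lists, dropping exact duplicates, ordered by time then employee."""
    merged = set()
    for src in sources:
        merged.update(src)
    return sorted(merged, key=lambda p: (p[1], p[0]))
```

A set union is the simplest possible reconciliation; a real merge would probably also want to flag near-duplicates (the same employee punching twice within a minute) for a human to review.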

final status

My last understandings from direct observation were:

  • the PayClock server was running (task tray icon showed no errors), so employees should be able to clock in as usual
  • the missing data might still be findable, but not in time to be of practical assistance in doing payroll
  • the PayClock client was running properly, and usable for new data (after subsequent sync)

I was told verbally by BizBoss, however, that the server has not been working and employees have not been able to clock in. I have not been offered the opportunity to investigate.

This software appears to be rather temperamental and difficult to support, and a replacement should be sought.

narrative

As of the morning of 9/6, I had only very peripheral awareness of the existence of PayClock, and no awareness that it was part of the payroll process. I had spent the weekend trying to fix the BossPC/network problem, with no luck, and left the evening of 9/5 with the understanding that the priority was to be sure QuickBooks was working -- which it was.

I did notice the PayClock icon in the desktop task tray, and noticed that some of the things I had tried caused it to show an error status -- but I made sure it was running properly (as determined by the status message "PayClock server is running" or similar) before I left.

My first knowledge that PayClock was a critical component for payroll was in a phone conversation with BizAdmin on 9/6. I believe I said that I did not know anything about PayClock as far as supporting it, but that as far as I could tell it was working properly.

On the morning of 9/7, I received a call from BizA (I don't remember who specifically -- probably BizAdmin) saying that PayClock was not, in fact, working. When I arrived, BizBoss said BB was having difficulty finding the icons (shortcuts) BB normally uses to load the program. I did some searching and found where some of the files were, but had to leave for an appointment before I could rearrange things properly.

(Part of the problem here was that my first impulse was to make a backup copy before doing anything -- but there was over a gigabyte of data, which was going to take well over half an hour to back up, more time than I had available. After waiting far too long in the hope that it would finish, I had to stop the backup in order to reboot as part of my diagnosis process.)

At that time, the PayClock icon in the system tray indicated that the server was still working.

When I returned about an hour and a half later, BizBoss said BB had rearranged things a bit and been able to get partway in, but it still wasn't right... and the system tray indicated that the server was not working. Some exploration of the manual and settings revealed that the problem was that the PayClock server daemon was expecting to see the data files in c:\PAYCLOCK -- and they weren't there.

After further exploration, I determined that there were two copies of PayClock -- one installed in the desktop folder ("c:\Documents and Settings\Desktop\") and one in "c:\Program Files\".

Having discovered earlier that making a copy took too long to be practical, I simply moved one of the copies (possibly it was the desktop copy -- going on the theory that this was the one BizBoss had been using) into c:\PAYCLOCK, and rebooted. The server came back online.

There was then some difficulty with getting into PayClock, as it is password-protected and none of BizBoss's passwords seemed to work. (None of these passwords, of course, were in any of the password documentation I was given.) After further reading of the PayClock manual (which took some time), I was able to get in by using the default password. The data we were then looking at turned out to be very old, however -- 2004.

Before we determined that, however, BizBoss told PayClock to sync with the clock-in device. It may have been at this point that the missing two weeks of payroll data were removed from the device and either lost or stored somewhere that we have not been able to find.

(In later investigation, I found that the PayClock application would not easily allow me to look at dates later than 2004; I had to scroll through the calendar month-by-month to get to August 2010, but I did not see any 2010 data in the 2004 dataset.)

So I moved c:\PAYCLOCK to a safe location and tried the other copy -- which proved to have 2010 data in it. Thinking the problem solved (BizBoss had left some time ago), I went home -- but the next day (9/8), of course, I got a call saying that there were two weeks of payroll data missing.

technical notes

PayClock appears to be a client-server application, with both client and server running on BossPC. The server connects to the clock-in device via a serial cable connected to BossPC and plugged into what look like ethernet ports in the wall. The device is also plugged into ethernet ports in its wall. I do not know if the ports in question were specially wired to handle serial communications instead of ethernet, but this setup does seem rather dubious.

File locations: Since PayClock's server daemon was clearly expecting the data files to be in c:\PAYCLOCK, and the daemon indicated it was running fine when I first left on 9/6, the files must have been in c:\PAYCLOCK at that time. Since they were no longer there when I got back, I can only conclude that BizBoss moved them. If my hypothesis is correct that the data was lost (or misplaced) when BizBoss synchronized the clock-in device while logged in to the 2004 data, then the fact that BB moved the PayClock folder to the desktop folder and then synchronized before being sure we had everything straightened out was the primary cause of the data loss. I mention this only because I have been told repeatedly by BizAdmin that "everything was working until you started messing around with things", and it seems necessary to show that this is a gross misrepresentation.

My preliminary recommendations:

  1. install the PayClock server on WinServer, so the payroll process isn't tied to a specific machine
  2. replace the serial-based clock-in device with an ethernet-based one, if this is not too expensive
    • this would allow the wall ethernet ports to be used for their intended purpose and should make support a little easier
    • that said, the clock-in device is physically quite close to WinServer, so it should be possible to wire it directly to WinServer without going through the wall ports at all

Further study of PayClock seems warranted, however, and it should be fixed in its current installation (if possible) before trying out a new configuration.

Communication

I see communication problems in the following areas:

  1. nature and priority of problems needing fixing
    • Saying "everything is messed up" does not tell me enough about the problem for me to fix it.
  2. existence of software in need of future support
  3. installation of new software and devices (QuickBooks 2009, faxlaser)

Possible remedies for item #1:

  • when one or more technical issues have been reported and not yet solved, an email from BizA confirming the list of open issues and stating their priorities
  • a regular (daily? weekly?) email from me listing all the open issues I am aware of, and their status
    • daily if issues are urgent, weekly for less-urgent ones
  • when I am not aware of any open issues, I could drop by the shop regularly (daily? weekly?) at some predictable time to ask in person whether there are any computer issues or concerns

No solution is going to work, however, if we don't agree on what it should be.

I do not know what the solution is for #2 and #3, except to suggest strongly that someone at least send me an email whenever a new piece of software or hardware is installed.

If I am properly informed when new computer-related items appear, I can evaluate them for possible liabilities or benefits at leisure, when there is time to solve problems or rearrange things to the best advantage, rather than during a crisis. I can also help to ensure that auxiliary components like CDs, manuals, paper trays, extra cables, etc. are not lost.

Mea Culpa

I've identified the following things I could have done differently which might have improved the outcome.

  1. pre-emptively made an inventory of all applications on BossPC, rather than just asking what applications BB depends on (which I did in 2009 or earlier)
    • Likely outcome: This probably would have identified the importance of PayClock well over a year ago, and I could have made sure we had at least one available copy of the installation CD. Hopefully I also would have taken some time to study the application so that I would know where to start if it malfunctioned. I might also have recommended installing the server component on WinServer, where it would have been less likely to malfunction.
    • Why I didn't: a combination of unnecessary restraint and being distracted by other things.
  2. spent more time over the weekend of 9/4-9/6 trying to solve the BossPC/network problem
    • Likely outcome: Probably no change, though I might have solved it.
    • Why I didn't: I was running out of ideas and needed to get some distance from the problem before being able to make any headway. I also didn't want to continue racking up hours to no good end, especially if I wasn't sure that BizA would want to pay for them.
    • Best outcome: If I had succeeded, of course, this would probably have prevented the PayClock problem from happening.
  3. set up the old Staff PC as a test machine (prior to 9/7)
    • Likely outcome: I would probably have determined that there was nothing wrong with the WinDom domain controller, which would have narrowed the field of inquiry down sufficiently that I might have thought to try turning off the firewall (McAfee), which would have identified the problem and allowed me to fix it -- thus probably preventing the PayClock problem from happening.
    • Why I didn't: I didn't think of it. Facepalm.
  4. made a complete backup of BossPC
    • Best outcome: would have allowed my deduction that PayClock was moved from c:\PAYCLOCK to be confirmed or falsified -- which would have been useful forensically, but wouldn't have fixed anything. So not terribly useful.
    • Why I didn't: didn't seem useful enough to justify the time.
  5. attempted to move the printer earlier
    • Best outcome: The "printing" issue would not have arisen on 9/1, so I would have had one less major thing to focus on and the other problems might have gotten resolved sooner -- possibly soon enough to prevent the PayClock disaster.
    • Why I didn't: reluctance to meddle prematurely
  6. uninstalled McAfee earlier
    • Likely outcome: if I had tried this before 9/7, I could have restored BizBoss's desktop and the PayClock disaster probably would not have occurred.
    • Best outcome: no BossPC/network problem, hence only email would have been an issue on 9/1, and everything would have been resolved by 9/2.
    • Why I didn't: no real justification for doing it, except a gut feeling that McAfee tends to cause problems like that; reluctance to meddle prematurely.

Sequential Narrative

2010-09-01 (Wed)

[Another office worker] called my cellphone around noon to report that nobody could print because BossPC was off the network. She mentioned something in passing about email, but said she thought it was working now.

Unfortunately I couldn't get over to BizA quickly from where I was; I got there at 3:45.

Upon arriving, BizAdmin told me that the email was not working either. So now there were three major urgent issues (email, printing, and BossPC/network).

After considerable experimentation, I finally noticed that the office's internet-facing IP address had changed. This meant that email programs configured to reach the server via its internet domain name could no longer find it, and that email addressed (or forwarded) to BizA.com could not be delivered. Inspection of log files (from an automated process I had set up on LinServer) showed that internet service had gone down a little after 4 a.m. and returned some time after 10 a.m.
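The outage window above came from reading LinServer's log by eye. The following minimal Python sketch shows roughly how such a check could be automated. It assumes a hypothetical log format of one line per periodic connectivity check, for example `2000-01-01 04:05 down`. The real LinServer log format is not described in this document, and `outage_windows` is a made-up name.

```python
from datetime import datetime

def outage_windows(lines):
    """Return (first_down, first_up_after) pairs from a connectivity-check log.

    Assumed (hypothetical) line format: 'YYYY-MM-DD HH:MM up' or '... down'.
    An outage still in progress at the end of the log gets None as its end.
    """
    windows = []
    down_since = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        date_part, time_part, status = line.split()
        stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M")
        if status == "down" and down_since is None:
            down_since = stamp  # first failed check of this outage
        elif status == "up" and down_since is not None:
            windows.append((down_since, stamp))  # first good check afterwards
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))  # log ends mid-outage
    return windows
```

With periodic checks, each reported start is only the first failed check. The outage actually began somewhere between the previous successful check and that one, so a log like this can only give times to within one check interval ("a little after 4 a.m.").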

I was not able to make much headway with BossPC/network. Uninstalling the LAN card's drivers did seem to restore basic TCP/IP connectivity, but BossPC still couldn't see anything on the Windows network (although it could ping the other computers). I booted a Linux live CD on BossPC and was able to access the Windows network that way, so there was nothing obviously wrong with the card.

Determined the best solution for the printing problem, which required an extra-long USB cable to connect LinServer to faxlaser.

I had to go home to get the necessary information to update the domain pointers, and out to Intrex to get a cable. It was late by the time I reached this phase, and Intrex was closed, so I got the cable on my way back over the next afternoon (which was the earliest I could be there for any substantial length of time).

2010-09-02 (Thu)

It was brought to my attention at this point that BizBoss needed to use QuickBooks. Since I couldn't see any quick way to get BossPC back on the network (BossPC/network), I worked out a way for BossPC to log in without accessing the network and used a USB stick to copy the QuickBooks data from WinServer over to BossPC's hard drive.

Bought the necessary cables to connect faxlaser to LinServer. Ran cable through ceiling and around corner, connected to printer, verified printing from LinServer, hunted for Windows driver CD (Linux comes with drivers for most printers).

The remainder of work time was spent diagnosing the BossPC/network problem, including about half the items in the "remedies" list.

2010-09-03 (Fri)

On this day I mostly had to leave things alone while BizBoss used QuickBooks (if I'm understanding my notes correctly). I installed a longer USB cable for faxlaser so that the existing cable would not be in the way (and prone to being pulled out) and re-tested the printer from a workstation just to be sure.

Asked BizBoss when BB would next need to use QuickBooks; BB said Tuesday. The word "payroll" was first mentioned at this point as part of what BB needed QB for, but "PayClock" was not mentioned.

My understanding of priorities at this point, then, was:

  • Top Priority: make sure QuickBooks is still working, one way or another, by the morning of Tuesday 9/7
  • Second Priority: try very, very hard to fix BossPC/network so that BizBoss can have BB's desktop back to normal and QuickBooks files can go back on the network (where they are regularly backed up to a separate physical drive) instead of being local copies

2010-09-04 (Sat)

Purchases:

  • new hard drive for BossPC, in order to set up a completely new system
  • longer USB cable: the other one was slightly too short and consequently too exposed to be stable

Spent many, many hours trying to fix BossPC/network -- about 1/3 of the items in the "remedies" list.

2010-09-05 (Sun)

Spent more hours trying to fix BossPC/network, including the rest of the "remedies" list and fresh installation of Windows on new hard drive. The installation worked, but was not a suitable replacement because of the absence of the QuickBooks 2009 install CD (and, as it later turned out, the PayClock install CD as well).

After noticing which version of QuickBooks was installed on BossPC, I felt kind of "set up" at this point. There is no way I could have performed the most certain fix -- i.e. rebuilding BossPC's system on a new drive -- without the install CD. I had not been informed of the purchase. This communication failure cost the additional time it took to go down the blind alley of setting up a new system, plus the time spent looking for the install CD, asking people if they knew where it was, and so forth.

Inventoried software on BossPC. Removed any software that might have been contributing to the network issue (BossPC/network), except where I wasn't sure whether it was needed or whether it could easily be reinstalled.

At BizMgr's suggestion, I spent some time digging through BizBoss's desk and cabinets trying to find any of the needed install CDs. I did find install CDs for several other things which could be useful on other occasions, but not the ones needed right now. I collected the install CDs I did find and put them in the server closet, where all CDs should be kept until there's a better place.

2010-09-06 (Mon)

BizAdmin called me at home to make sure everything was ready to go for payroll on Tuesday. BA mentioned PayClock. I said that as far as I knew it was running (because of the task-tray icon), but that I didn't know anything about PayClock because nobody had mentioned it before.

Spent some more time wrestling with the network. Sent out email describing the situation and mentioning that the install CDs for QuickBooks and PayClock were both going to be necessary for getting BizBoss back up and running in the longer term (i.e. beyond just doing payroll).

Felt further "set up" at this point, as even if we had found the QuickBooks install CD, there is no way I could have set up a new system with PayClock on it without actually having the PayClock install CD -- which is not something you can buy at the store. Further, QuickBooks was something I had been told was important -- but PayClock had never been mentioned, to the best of my recollection, until today, so there is no way I could have known to include it in my thinking.

The absence of the install CDs for either of these vital applications essentially made it impossible for me to rebuild BossPC. A rebuild is the usual fallback when a difficult problem like this arises and a quick solution is needed. My inability to resolve this problem by Tuesday 9/7 was, ultimately, entirely due to this failure of communication. It is fairly common for a Windows system to become munged beyond the point of recovery, in which case a rebuild is the only option. It is not possible to rebuild a system without the installation CDs for all needed applications.

2010-09-07 (Tue)

Someone called to report that PayClock was not running properly. When I arrived, the problem turned out to be not PayClock as such, but the fact that the usual icons were not in place. That in turn was because of the login issue resulting from the BossPC/network problem, which hadn't been resolved because I didn't have the install CDs to set up a new system.

I first tried to back up the existing PayClock data before doing anything else; since there is well over a gigabyte of data, this took longer than the time available. After about 20 minutes of trying other things while waiting for the backup to complete, I finally had to stop the backup so that I could try some things that required rebooting before I had to leave for a prior appointment.

I did manage to find a PayClock installation and showed BizBoss how to invoke it, but there was some problem and BB couldn't use it. I noted that the PayClock icon in the task tray still showed that the server was running, and left for my appointment.

When I returned, I found that BizBoss had been rearranging things. To the best of my recollection, the PayClock client application could not be loaded at this point, and the task-tray icon was showing a red "error" indicator. Only after some further rearranging and extensive investigation was I able to load it (and restart the server). At that point it showed what turned out to be old data (the "2004 data").

(I actually found two different installations of PayClock. Neither was where the server expected it to be, based on the server's configuration settings. The inevitable conclusion is that the files had been moved between the time when the status indicator said OK and the time when it showed an error. The installation BizBoss had found was in a folder on the Windows desktop, so I first tried moving that one back into place. After rebooting, the server reported it was working again and I was able to load the client applications, so I thought I had solved the problem.)

BizBoss was working over on BizMgr's desktop while I researched the problem; when I got it to load, BB came over to look at it -- at which point BB noted that the date was very old, so this was not the data needed.

Soon after this, BizBoss tried (I did not suggest it) to sync PayClock with the timeclock device. A status bar came up and seemed to show that data had successfully been transferred from the device to BossPC, but no new data appeared in PayClock.

If the "sync" process is what I think it is, i.e. offloading the data from the device to the PC, it is probably at this point that the "lost" two weeks of data became lost -- through being removed from the device and placed somewhere inaccessible on the PC -- though it's unclear why it didn't show up in the 2004 data and where it might now be found.

At this point, I moved the "2004" PayClock installation to another folder and moved the other installation into its place, at which point much newer data appeared -- up through mid-August 2010. Once again, I thought I had solved the problem, and moved on (I think BizBoss had left the office, so I couldn't check it with BB right away).

Sometime during all this, BizAdmin said that BA's printing wasn't working; I rebooted BA's desktop, and it resumed working.

After (I thought) fixing the PayClock problem, my continued search for solutions to the BossPC/network problem came across a suggestion that it could be due to the firewall. The firewall on BossPC was "McAfee Internet Security", so I deactivated that -- and suddenly BossPC could see the Windows network.

At that point I removed everything McAfee -- "McAfee Internet Security" (not really necessary) and "McAfee VirusScan Enterprise" (which I replaced with free software I trust more) -- and rejoined the WinDom domain.

Although BossPC could now see files on WinServer, it still wouldn't let me log in as BizBoss/WinDom. By this time I was late getting home to help put the kids to bed, so I left at that point, thinking that I could figure out this final issue on Wednesday after BizBoss had had a chance to do payroll.

I did note that the PayClock server was once again reporting in for duty, though it had been giving errors when I was trying various things previously.

2010-09-08 (Wed)

BizAdmin called to say that payroll data was missing from PayClock. I went over to BizA and called Lathem tech support (PayClock is made by Lathem), but they said there's no way to move data between databases. I didn't think to ask if there was any way to figure out where the data had gone before they hung up, and they never did answer my email inquiry. The tech support guy seemed kind of unhelpful and unsympathetic, so I'd be inclined to find a different time clock solution if at all possible.

BizAdmin seemed to be very much under the impression that the current mess was my fault (face-to-face conversation). I said that I didn't think so, but that maybe now wasn't the best time to discuss it.

Spent more time investigating PayClock. I figured out how to re-activate the license file, which had somehow gotten detached from the more current PayClock installation (it kept warning that it would go inactive in 30 days). I moved over the data from the other PayClock installation, which appeared to contain data from within the past two weeks.

I called BizBoss to report this. BB was initially elated, but then gave me a specific employee and date to check. That check showed that the only new data was some that they had entered by hand, so no, the missing data had not been recovered.

(PayClock really ought to keep a log of every time-clock event that is entered: where it came from and when it was added to the database. That information should be part of any such mission-critical software, and would be helpful in a wide variety of situations. I have to say that this program appears to be temperamental and difficult to support.)

Did a little more investigation to see if the data files themselves offered any clue, but they appear to be in a proprietary binary format. Gave up on PayClock at this point, as it didn't seem likely that I could recover the data -- and time was moving on, and BizBoss had indicated (in the phone conversation) that BB could just as easily talk to Lathem directly about trying to recover the data, if that seemed like the best option, or else enter the data by hand.

Immediately after that, I tried one more time to log in as BizBoss/WinDom -- no luck. On a whim, I tried Administrator/WinDom -- success! On another whim, I tried BizBoss/WinDom again -- this time with success.

Appendix

Hardware

LinServer

LinServer is a Linux server I set up in 2009; it lives in the server closet. In early 2010, when Verizon handed off their customers to Frontier, I made LinServer the server for all BizA.com email. It receives incoming email directly. However, Verizon/Frontier blocks outgoing SMTP, so LinServer must hand off outgoing email to Frontier's servers for delivery. It also provides IMAP email delivery for in-house and remote usage.

LinServer also includes a web server which is largely unused. The applications it serves include a password-protected wiki, onto which I plan to move all BizA technical documentation as time permits. They also include a web client for accessing email in the event that other solutions fail. (I had originally thought this would be a better solution than MS Outlook, but so far this idea has been rejected.)

LinServer backs up all of WinServer's work folders nightly. On my to-do list is to set up less frequent backups over the internet to an off-site location, in case of fire or other disaster at BizA.

LinServer runs a Samba server, which provides Windows Networking capability. I currently have it configured for the less-secure "share"-type connections. "Domain"-type security could be enabled without much difficulty at this point, possibly making LinServer a drop-in replacement for WinServer.

WinServer

WinServer is a Windows 2000 Server which controls the WinDom domain into which all Windows workstations at BizA log in. It lives in the server closet next to LinServer. Work folders from all workstations are backed up onto it every night, and it also includes a number of shared data folders. (These used to also include Outlook emails in PST files, but email is now stored in IMAP folders on LinServer.) All of WinServer's working data is backed up nightly onto LinServer.

WinServer was set up by someone long before I arrived. I was given no documents on how it is configured, and I have been working to replace it with a server whose configuration I have been documenting.

Nobody seems to know where the Windows 2000 Server installation CDs are. If this machine goes down, it would have to be replaced by something else, or the CDs would have to be ordered (they are probably no longer available off the shelf locally), incurring several days' delay. Making LinServer available as a possible emergency backup for WinServer has consequently been a background priority for some time.

Loose Ends

Missing install CDs include:

  • PayClock Pro
  • QuickBooks Premier Edition 2009
  • Windows XP Professional
  • Windows 2000 (standard)
  • Windows 2000 Server
  • drivers for faxlaser

Cleanup needed:

  • Fix PayClock.
  • Make sure PayClock data is being regularly backed up. (It currently isn't. I manually backed it up onto the network on 9/8. There is well over a gigabyte of data.)
  • Move QuickBooks data back to server, where it can resume being backed up regularly (it currently isn't being backed up at all, unless BizBoss is backing it up manually); reconfigure BossPC to load from that location.
  • Remove spare hard drive from BossPC (it's not hooked up).
  • Remove extra LAN card from BossPC; verify that everything is still working.
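For the PayClock backup item above, here is a minimal sketch of a dated folder copy in Python. The function name `backup_folder` and any destination path (for example, a backup share on WinServer) are hypothetical. Copying well over a gigabyte took too long on 9/7, so a job like this is probably better run unattended overnight (for instance from Windows Task Scheduler) than on demand. It is also probably safest to run it while nobody is syncing PayClock with the clock-in device, since copying files that are being written to may produce an inconsistent copy.

```python
import shutil
from datetime import date
from pathlib import Path

def backup_folder(src, dest_root, today=None):
    """Copy the whole src folder to dest_root/<folder name>-YYYY-MM-DD.

    Returns the path of the new copy. Raises FileExistsError instead of
    overwriting a copy that already exists for the same day.
    """
    src = Path(src)
    today = today or date.today()
    dest = Path(dest_root) / f"{src.name}-{today.isoformat()}"
    if dest.exists():
        raise FileExistsError(f"backup already exists: {dest}")
    # copytree copies files and subfolders recursively and creates
    # dest (including any missing parent folders) as it goes.
    shutil.copytree(src, dest)
    return dest
```

Called on BossPC as `backup_folder(r"C:\PAYCLOCK", r"\\WinServer\backups")` (the share name is hypothetical), it would create a folder such as `PAYCLOCK-2010-09-08` on the share. Dated folders keep earlier copies intact, which matters here: a "2004 data" mix-up like the one above could otherwise overwrite the only good copy.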

Proposal to prevent problems with BossPC from precipitating major crises (also avoids messing with BossPC until we have a working swap-in):

  • Set up completely new PC (call it BossPC2) with all necessary software
  • Have BizBoss test-drive it in place
  • If satisfactory, install it at BizBoss's desk and remove old BossPC (call this BossPC1)
  • Do cleanup as described above on BossPC1
  • Do belated maintenance on BossPC1 (SpinRite, disk defrag)
  • Keep BossPC1 ready as emergency replacement for BossPC2