Back in the 80s, I was involved in the installation of an environmental monitoring and control system for a military base in Florida. The system used a CATV coaxial distribution system to send data to and from the many buildings we were controlling. We developed a 1 Mb/s modem -- in the early 80s there were few other options. The heart of the system was a redundant pair of PDP-11 computers at the central control facility.
After the system was installed, we returned to California, expecting that the task manager and an able assistant would take over at the base. They were both excellent programmers and had developed much of the system's software. But they had a problem -- the system occasionally crashed. After three months or so of studying core dumps, the guys were not making any progress in solving the software problem. So they sent me back to see if I could help solve the mystery.
I set up a relatively primitive logic analyzer to stop on the loop address where the computer ended up when it crashed. I let it run overnight. Sure enough, the next morning we had captured an event. What seemed to be happening was that we were receiving an illegal message from the link's modem, and this message was used to create a vector that sent the computer astray. I was assured that this was impossible because the modem's hardware checked a 32-bit CRC code to confirm the data was valid before passing the data to the computers.
I reviewed the CRC detection and interrupt hardware design and found only one minor hardware deficiency: it didn't inhibit the receiver's input, as it was specified to do, while valid data was being passed to the computer. It was considered a "minor problem" because the link had a good signal-to-noise ratio.
Next, I found that the PDP-11 was given a hardware interrupt to read the input data, and it took 100 ns to respond to the interrupt, which occurred every few milliseconds. What I found was that occasionally lightning struck with enough local intensity -- and at just the right instant -- to create an extra data bit and clock pulse in the 100 ns interrupt window. This extra clock shifted the validated data word just "one little bit" before the computers read the data. This corrupted word was used in turn to create a vector that sent the computers to an incorrect address, where they ended up in empty instruction space and then in an infinite loop. With this insight, the problem was soon fixed and the crashing was eliminated.
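The failure sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original hardware logic: zlib's CRC-32 stands in for the modem's 32-bit CRC (the actual polynomial is not given), the word is 16 bits wide to match a PDP-11 word, the stray clock is assumed to shift the register left, and the function names are hypothetical.

```python
import zlib

WORD_BITS = 16  # PDP-11 word size; the modem's real register width is an assumption


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 32-bit CRC (zlib's CRC-32 stands in for the modem's unknown polynomial)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Return True if the trailing 4 bytes match the CRC of the payload."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


def extra_clock(word: int, stray_bit: int) -> int:
    """Shift the latched word one position left and clock in a stray bit.

    The shift direction is an assumption made for illustration.
    """
    return ((word << 1) | (stray_bit & 1)) & ((1 << WORD_BITS) - 1)


def simulate(word: int, stray_bit: int):
    """Model one received word hit by a noise pulse after CRC validation."""
    frame = frame_with_crc(word.to_bytes(2, "big"))
    validated = crc_ok(frame)  # the hardware check passes at this point
    latched = int.from_bytes(frame[:2], "big")
    # The receiver input is not inhibited, so a noise pulse can still
    # clock the register before the CPU services the interrupt.
    read_by_cpu = extra_clock(latched, stray_bit)
    # Re-checking the word the CPU actually read would expose the corruption.
    recheck = crc_ok(read_by_cpu.to_bytes(2, "big") + frame[2:])
    return validated, read_by_cpu, recheck


if __name__ == "__main__":
    ok, value, still_ok = simulate(0x0104, stray_bit=1)
    print(ok, hex(value), still_ok)
```

In this sketch the CRC check passes on the word as received, yet the value the CPU reads is different, which is why the core dumps pointed at software while the data had in fact been altered after validation.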
Jim Gundersen worked for Hughes Aircraft (which became part of Boeing Aircraft) for 40 years. He was the principal designer of the DC-10 Entertainment and Service System in 1969. This is an interesting system because of a probable first: the large-scale application of MOS integrated circuits in a single system. The success of the program rested greatly on the application of self-aligned MOS gate technology, which was changing the MOS industry.
Tell us your experience in solving a knotty engineering problem. Send stories to Rob Spiegel for Sherlock Ohms.