System Lockup Stumps Engineer

DN Staff

July 22, 2014


I started working for IBM Havant in Britain even before the results of my university degree were announced. I was very green, and the real world of banking system computers was a considerable shock to me. Suddenly I was thrown into the deep end of hardware, firmware, and operating system engineering, and despite excellent theoretical classes provided by IBM, the acquisition of practical experience was mostly "trial by fire."

In those days IBM guarded proprietary information quite jealously in a tiered system where the laboratories in the US had unfettered access to information, while engineers in satellite markets had access to only the most rudimentary diagnostic tools. In between these extremes IBM provided secondary support centers, which were key facilities scattered around the various markets that had access to almost all proprietary information. IBM Havant incorporated a secondary support center for banking systems. Remember the very first ATMs? Yes, those were our babies, along with the computer networks behind them.

My group in Havant was responsible for the EMEA countries (Europe, the Middle East, and Africa), although in practical terms we mostly operated in Europe. Why? Because those were the markets where our products were most severely stressed, tested, and (inevitably) broken. Certain markets, such as Austria, Denmark, Germany, and Norway, were especially ingenious in their use of our products, and hence they encountered the most problems.

This particular problem had stumped just about everyone because it occurred only when the system was restarted in the mornings. As a part of the system "bring-up," the core computers would start accepting input from the various attached terminals. Right in the middle of this process the entire system would suddenly lock up. Unfortunately, the freeze happened long before the most rudimentary diagnostic tools had time to start, so all we really had was a trace of startup communications between the terminals and the core system. Since the terminals were typically unmanned at startup, the traces didn't seem particularly relevant and indeed showed no hint of unusual protocol exchanges.

The investigation dragged on for a couple of days. Unfortunately, the problem didn't trigger on every restart, and once the system was up and running there wasn't any hint of what might have caused it. The customer was naturally growing impatient, since the escalation through the IBM support layers had already dragged on for some time before we were called in. A different type of pressure was coming at me from the senior IBM engineers for the domestic market. They felt that, with access to the same proprietary information, they could resolve the issue without the help of a 22-year-old rookie. I needed a solution fast, and the machines were telling me nothing.

Under siege from all quarters I decided to do what any engineer would do. I went out for lunch with the only sympathetic character around -- a junior computer technician about my age. Over lunch I admired his company Mercedes, and we discussed what work had been performed on the system prior to my arrival. I was looking for something "off-kilter" that might provide a clue. Twenty minutes into the meal, I had the critical information I needed.

As a part of the installation process, the customer had decided to delete an extraneous key from the terminals. The key was used in certain markets, but not this one. IBM had provided a delete kit for the terminals, and by chance the technician had with him the residuals from such a kit. I wanted to look at the instructions for the delete operation. However, what caught my eye immediately was a small, innocuous-looking square plate, maybe a half-inch across at most. It appeared to have an adhesive material on one side. But what was it and why was it there?

This is what had happened. IBM keyboards back in that era utilized a press-to-break contact. Until you pressed a key, the corresponding contact (either electrical or field effect) was actually in the closed position. To delete the key properly, you had to affix the adhesive-backed plate across the contacts on the circuit board where the key had been. Otherwise the keyboard would think the key was being held down by the operator and thus transmit the character code for the key nonstop. During normal operations the character code was unused and filtered out. However, during startup the firmware filters were inactive, and the same keystroke from every terminal was transmitted up to the core system nonstop. Since the keystroke was not part of any recognized protocol, the diagnostic trace had omitted it.

Unfortunately, if enough terminals were powered on ahead of starting the core systems, the relatively primitive boot routines could be overwhelmed, resulting in a system freeze. The immediate solution was to start the core system with terminals powered down. Technicians subsequently went to each terminal in turn and installed the missing plate. In the long term, I believe steps were taken to make the startup process more resilient. I handed the problem back to the country support team and headed for the airport. My first solo trip was a success.
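The overload mechanism can be illustrated with a toy simulation. This is only a sketch under invented assumptions, not a model of the actual IBM boot routines: the function name `boot_survives`, the queue capacity, the per-tick handling rate, and the tick count are all hypothetical. It shows why the lockup depended on how many terminals happened to be powered on before the core system started.

```python
from collections import deque


def boot_survives(terminals_on, queue_capacity=64, handled_per_tick=8, ticks=100):
    """Toy model: True if the boot input queue never overflows.

    All parameters are hypothetical illustration values, not IBM figures.
    """
    queue = deque()
    for _ in range(ticks):
        # Each powered-on terminal with a missing plate repeats the
        # deleted key's character code once per tick.
        queue.extend(["STUCK_KEY"] * terminals_on)
        # The boot routine can only drain a fixed number of inputs per tick.
        for _ in range(min(handled_per_tick, len(queue))):
            queue.popleft()
        # An overflowing queue stands in for the system lockup.
        if len(queue) > queue_capacity:
            return False
    return True


if __name__ == "__main__":
    for n in (0, 8, 9, 20):
        print(n, "terminals on:", "boots" if boot_survives(n) else "locks up")
```

With these made-up numbers, a few live terminals are harmless, while anything above the drain rate eventually fills the queue. That matches the observation that the fault did not trigger on every restart, and that starting the core system with the terminals powered down avoided it.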

On reflection, what we had seen, back in 1980, was in effect the very first denial-of-service attack, albeit a self-inflicted one!

Alastair Stell graduated with a degree in electrical and electronic engineering from UCL in Britain in 1978. Since then he has used engineering as a passport to travel the world. He has lived in Africa, Australia, and America and has traveled extensively throughout Asia. He has worked in areas as diverse as banking, mining, telecommunications, silicon manufacture, aviation, and automotive control systems. His primary motivation in life is problem solving, and he is currently engaged on a project to replace conventional classrooms with an adaptive teaching system that uses interaction with the student to tailor an optimum approach to learning. The code name for the project is Diamond Age. He lives in Phoenix, where he restores cars and designs scalable AI systems as his primary hobbies.

Tell us your experience in solving a knotty engineering problem. Send stories to Lauren Muskett for Sherlock Ohms.
