The next morning, I came into work with plans to do some really invasive instrumentation of the Ethernet driver code. When I powered on my development system, I got the illegal instruction fault. It seemed to me, though, that the board had taken longer to fail. I hit the reset button and noted the time it took for the board to fail. I did this a few times and got a very consistent value. I powered down my system and went to get a cup of coffee. When I powered up the system again, the time before the board failed was significantly longer.
A cold board appeared to function better than a warm board. Using a can of freeze spray to cool the processor and memory chips, I could get the board to load the RTOS image from the network. The warm processor and memory chips, with their increased propagation delays, seemed to trigger the fault when the processor and Ethernet device accessed the memory at the same time. I decided to take a closer look at the hardware. I removed the evaluation board and our board from my development system and laid them side by side on my desk.
The memory, processor, and associated chips were the same. When I flipped the evaluation board over, I discovered hand rework consisting of some small capacitors and a buffer chip wired to the memory clock lines. These modifications were not reflected in the schematics for the evaluation board. There was also no mention of any memory timing issues in any of the documentation or errata notes for the processor. After several inquiries and the intervention of our division management, the division that designed the processor eventually admitted to a problem with the memory interface. We modified our boards and resumed development.
The chief lesson I learned from this experience was not to presume the scope of a problem, or allow someone else to define it, based on initial evidence. I spent a lot of time looking for a nonexistent firmware problem while the hardware team sat idle, believing the hardware was fine. This experience also reinforced for me the importance of quiet observation, which can provide subtle but important clues to solving problems. By noticing a slight change in the failure time, I was able to determine the true cause of the problem.
This entry was submitted by Jason W. Evans and edited by Rob Spiegel.
Jason W. Evans was schooled in electrical engineering at Lehigh University. He stumbled into embedded firmware development. Over the past 25 years, he has worked in the radar, telephone, secure communications, Internet networking, and industrial control industries as a developer and architect. His current interests include embedded high-performance computing and embedded vision systems.