About 15 years ago, I was developing firmware for telephone systems in a small division of a large company. The primary business of this company was the design and manufacture of processors and associated chips for the desktop and server markets. The products my division was working on were seen as providing entry into new markets for the company's chips.
One of my tasks was to port a commercial real-time operating system (RTOS) to a new processor the company had developed for the networking/telephone market. This RTOS performed a two-part boot: an initial board-resident firmware image would load a full-featured firmware image from a server on our network. Since our own boards were still in development, the initial RTOS port was done on an evaluation board supplied by the division that designed the new processor. The porting effort went smoothly, and within a couple of weeks, I had the RTOS functioning and stable on the evaluation board.
Not long after I completed the RTOS port, our boards arrived from the fabrication house. From the RTOS perspective, our board was nearly identical to the evaluation board. The only significant difference was a newer but backward-compatible version of the Ethernet network interface chip. I modified the board-resident RTOS image to give the hardware team the tools, such as memory test functions, that they needed to bring up and validate the initial batch of boards. Other than a few minor hardware design issues and fabrication problems, the board bring-up was proceeding on schedule.
That changed when we attempted to load an RTOS image from the network server. When the Ethernet network interface was activated, the processor crashed as it attempted to execute an illegal instruction. Every working board we had exhibited this failure. Pointing to the new Ethernet interface device, the lead hardware engineer was adamant that the problem was a firmware compatibility issue and not a hardware fault. I took one of the boards back to my desk, plugged it into my development chassis, and started working on the problem.
The first odd thing I noticed about the failure was the address from which the processor was fetching the illegal instruction. The address was in the data heap space from which the RTOS allocates buffers and other dynamic data structures, not in an area containing executable code. The next thing that struck me as strange was the lack of consistency in the failing address. It was not fixed but floated around within a region of the data heap. From the network traffic, I could see that the board was successfully sending and receiving Ethernet frames before the failure.
I added instrumentation to the code, which showed that the data heap region where the failure was occurring was the buffer pool used by the Ethernet driver. The driver configured the Ethernet device to transfer data directly into and out of these memory buffers. I inspected the Ethernet driver code for device compatibility problems and found none. I checked that the code was not overflowing a stack or overwriting the RTOS's bookkeeping data structures and found no problems. I increased the stack sizes and enlarged the buffer pools, with no effect.
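The instrumentation itself is not reproduced here. As a rough illustration of the idea, the following C sketch maps a faulting address to a named memory region. Everything in it is an assumption made for the example: the `mem_region_t` type, the `classify_fault_address` function, and the region names, addresses, and sizes in `example_map` are hypothetical and do not come from the RTOS or the board described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one named memory region. */
typedef struct {
    const char *name;   /* label reported by the instrumentation */
    uintptr_t   start;  /* address of the first byte in the region */
    size_t      length; /* size of the region in bytes */
} mem_region_t;

/* Made-up memory map for illustration only; real values would come
 * from the linker map and the RTOS heap configuration. */
static const mem_region_t example_map[] = {
    { "code",                 0x00100000u, 0x00080000u },
    { "ethernet buffer pool", 0x00400000u, 0x00040000u },
    { "general heap",         0x00440000u, 0x000C0000u },
    { "task stacks",          0x00500000u, 0x00020000u },
};

#define EXAMPLE_MAP_COUNT (sizeof(example_map) / sizeof(example_map[0]))

/* Return the name of the first region that contains addr,
 * or "unknown" if no region in the table contains it. */
const char *classify_fault_address(const mem_region_t *table,
                                   size_t count,
                                   uintptr_t addr)
{
    assert(count == 0 || table != NULL);

    for (size_t i = 0; i < count; i++) {
        /* Compare the offset against the length instead of computing
         * start + length, so the check cannot wrap around. */
        if (addr >= table[i].start &&
            addr - table[i].start < table[i].length) {
            return table[i].name;
        }
    }
    return "unknown";
}
```

Called from an illegal-instruction exception handler with the faulting program counter, a lookup like this would report "ethernet buffer pool" for any address inside that hypothetical pool, which is the kind of answer that narrows a floating crash address down to a single subsystem.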
The peculiar symptoms of the problem and my inability to influence those symptoms through software changes led me to suspect that this was a problem with the hardware. The hardware team insisted this could not be the case. They had literally copied the design -- from the parts used to the layout and artwork -- from the evaluation board. If the code worked on the evaluation board, the hardware folks reasoned, it should work on our board. After three days of no progress toward a solution, I was very frustrated and out of ideas.