By Don Russ
Back in the ’90s I worked for a company that designed data radios used for telemetry and control on licensed narrowband radio frequencies. The latest product was a high-quality radio built around a DSP and microcontroller combo. It was more stable and easier to produce than the previous generation because it was a single board that used SMT. We got through the initial prototype and production run without much trouble, but as the product started shipping in larger quantities, reports came in from the field that every once in a while the radios would not respond to polls. The fix was simply to cycle the power. It was further observed that the issue occurred on power-up and after power glitches. Once a radio started working, it continued working.
We grabbed a dozen radios, set up a work area, and started cycling power. The radios worked perfectly. After a week of off-and-on effort we caught a break: a tech on an alignment bench caught a radio exhibiting the issue. We used some software commands on the radio to run diagnostics, but in the process of troubleshooting the power got glitched and the radio started working. We grabbed the radio and brought it into the lab.
Even after several thousand power cycles the radio did not hang up. We got tired of torturing the poor radio and went back to work on other things, but not before putting the word out to the people in production: if a radio EVER failed to receive, call in the engineering team immediately.
It wasn’t more than half a day after the word went out that another radio was caught acting up. This time we were ready, and carts full of equipment went out to take over the poor tech’s work area in the factory. We had a plan and we executed it. We quickly narrowed the problem down to the receiver’s 8-pin ADC, because its SPI output was putting out the same conversion value over and over. No matter what went in, the same thing came out.
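The symptom described above, an ADC that returns the same conversion value no matter what goes in, is the kind of thing a simple software check can flag. The following is only a minimal sketch, not what we ran on the bench: the function name `is_stuck`, the `run_length` threshold of 32, and the idea of working from a list of already-captured readings are all assumptions made for illustration. It also assumes the input signal carries enough noise or variation that a healthy converter would not legitimately repeat one code that many times in a row.

```python
def is_stuck(samples, run_length=32):
    """Return True if the most recent `run_length` readings are identical.

    `samples` is a list of integer conversion results, oldest first.
    `run_length` is a hypothetical threshold; pick it for your signal.
    """
    if len(samples) < run_length:
        # Not enough history yet to make a call.
        return False
    tail = samples[-run_length:]
    return all(value == tail[0] for value in tail)


if __name__ == "__main__":
    healthy = [100 + (i % 7) for i in range(64)]   # readings that vary
    hung = healthy + [0x1A3] * 32                  # output freezes on one code
    print(is_stuck(healthy))  # False
    print(is_stuck(hung))     # True
```

A check like this only detects the hang. It says nothing about why the part is hung or how to recover it.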
The VP of engineering assigned me to get to the bottom of the problem. I contacted the part’s vendor and they replied with the standard “It works for everyone else” and “It must be something you are doing.” They had no interest in helping out.
My first approach was to find out what I was doing to their part that was causing the issue. The failure happened in the factory but not in the lab. What was different? Over the next few days several more failures were found in the factory with exactly the same symptoms. The interesting thing was that they came from only two of the alignment benches; the other benches were working well. I went out to the factory, examined the benches, and found that the power supplies on them were not standard: some were scavenged, some were bought and some were older than I. But the two benches that were catching the radios acting up had the same model of power supply.
Bringing the radios and supplies back to the lab, I determined that having the power ramp up at just the right rate would cause the ADCs to hang reliably. I tried several ways to get the part not to hang, but short of nasty software, GPIO ports and FET switches, I couldn’t fix it. The VP of engineering made an executive decision: “Get rid of the damn thing!” I put in a competitor’s ADC, which actually ended up lowering the cost of the product. There were no hang-ups after that.
Half a year went by before I got a query from the original vendor asking why sales had gone from thousands a month to zero. I told the story and explained that I had designed their part out. The next thing I knew, the manufacturer was sending an engineer from the design team in Ireland to visit the company.
The following week I was sitting across from the engineer, telling the saga of the stuttering ADC. When the conversation turned to solving the mystery, the engineer asked if I had any ideas as to what caused the problem. Having recently taken a course in advanced digital design for my master’s degree work, I knew exactly what the problem was: “You have a state machine with invalid states that are not covered with an exit, as well as a poor power-on reset circuit. The state machine is starting in an invalid state and bouncing between that and other invalid states. Fix the state machine and you will fix the part.”
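The diagnosis above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the vendor's actual design, which I never saw: it imagines a 3-bit state register where only five of the eight encodings are used, and it assumes that the unused encodings happen to point at one another. All names here (`VALID_STATES`, `next_state_unsafe`, `next_state_safe`, `recovers`, `stuck_power_up_states`) are hypothetical.

```python
VALID_STATES = {0, 1, 2, 3, 4}  # hypothetical: 5 of 8 encodings are used
RESET = 0


def next_state_unsafe(state):
    """Next-state logic that only defines behavior for valid states."""
    if state in VALID_STATES:
        return (state + 1) % 5
    # Assumption: the unused encodings were left undefined and the
    # resulting logic happens to cycle them among themselves.
    return {5: 6, 6: 7, 7: 5}[state]


def next_state_safe(state):
    """Same machine, but every invalid encoding exits to RESET."""
    if state in VALID_STATES:
        return (state + 1) % 5
    return RESET


def recovers(next_state, start, max_cycles=16):
    """Clock the machine and report whether it ever reaches a valid state."""
    state = start
    for _ in range(max_cycles):
        if state in VALID_STATES:
            return True
        state = next_state(state)
    return state in VALID_STATES


def stuck_power_up_states(next_state):
    """Try every possible power-up register value; list the ones that hang."""
    return [s for s in range(8) if not recovers(next_state, s)]


if __name__ == "__main__":
    print(stuck_power_up_states(next_state_unsafe))  # [5, 6, 7]
    print(stuck_power_up_states(next_state_safe))    # []
```

With a solid power-on reset the register always starts at `RESET` and the unsafe version never shows its flaw. With a marginal reset, such as one upset by a particular supply ramp rate, the register can come up in any encoding, and only the version with an exit from every invalid state is guaranteed to recover.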
She quietly started looking through the pile of papers she had brought with her and got a funny look on her face. She quickly ended the meeting and thanked me for my time. Half a year later I received a tube of parts marked “prototype,” taped to a piece of the manufacturer’s stationery with a handwritten request to try them in our product to see if they worked. I replied by email that I had no time to try them and that I was sure they worked. I then used a push pin to hang them from my bulletin board as a reminder that even chip manufacturers make mistakes. Later in my career, whenever someone says “It’s never the IC, it’s something else,” I point to the yellowed note and smile.
Don Russ currently works at Crossmatch Technologies in Palm Beach Gardens, Fla., designing biometric devices. He has worked for various companies over the years, including Lockheed Martin, Motorola and Microwave Data Systems. He has an EE degree from Binghamton University and didn’t quite make it through his master’s at RIT before moving to the warm south.