When field returns came back, we re-tested them on the production test stand against the original acceptance test criteria. The reported erratic behavior was usually confirmed, and the temperature display showed the sensor operating at the wrong temperature. When we disassembled the sensors down to the individual thermistor level, we typically found one of the four control thermistors out of spec: its resistance was too high for a given temperature.
This was perplexing, because the thermistors were expensive high-reliability (hi-rel) parts that were not supposed to fail. We removed quite a few thermistors, both failed and good, to perform failure analysis on them. Removal was tedious and required very strong solvents, because the epoxy was highly filled and very rigid. Afterward, some thermistors had cracks in the glass and some did not, and the cracking did not track the failures: some good parts were cracked and some were not, and the same was true of the failed parts. The only pattern was that the surface-mounted monitoring thermistors never failed and were never cracked.
Once we realized that this failure mode was not an isolated incident, and as more sensors came back, we needed to develop a fix to avoid a very large warranty-cost exposure on the rest of the order. Our support engineering team, of which I was a junior member, met with the customer's design engineers, who had developed and tested the sensor before outsourcing it to us. They had never seen this failure in their testing.
We looked at the startup data stream coming out of the system and saw that the failures happened very shortly after power-up, typically within a few hours. We also reviewed the manufacturing logs, lot history records, and acceptance tests for the failed sensors, and found no correlation with part lot, equipment, or assembly technician. Finally, we reviewed the post-shipment life history of each failed sensor and compared it with that of sensors that were operating properly.
Here we found a correlation: the failed sensors had generally been stored in the customer's warehouse longer than those that did not fail. We knew that FIFO (first in, first out) was not the primary inventory control technique. Each sensor was serialized and its performance characterized, and because the system required performance-matched sensor sets, the “best” spare was chosen in preference to an older one. When we asked, we were told that the warehouse was not heated for comfort and often reached near-freezing temperatures in the winter.
With this information, we attempted to develop a hypothesis for why the thermistors were failing. The general consensus among quite a few smart and experienced engineers was that the damage was occurring due to mismatched temperature coefficients of expansion (“mismatched tempcos”) between the metal sensor body, the epoxy, and the glass/ceramic thermistor. But exactly how it happened, and how to prove it, remained elusive.
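For readers unfamiliar with the term, a first-order way to see why mismatched tempcos matter is the textbook bonded-material approximation below. This is a simplification, not the analysis we performed at the time, and the symbols are illustrative: α_epoxy and α_glass stand for the expansion coefficients of the epoxy and the thermistor glass, and ΔT is the temperature swing the bonded assembly experiences (for example, cooling in an unheated warehouse). The difference in the two materials' free thermal strains has to be taken up as mechanical strain at the bond:

$$\varepsilon_{\text{mismatch}} \approx \left|\alpha_{\text{epoxy}} - \alpha_{\text{glass}}\right|\,\left|\Delta T\right|$$

The same relation applies at the metal body/epoxy interface. The stress this strain produces in the glass grows with the stiffness of the material constraining it, which is one reason a highly filled, very rigid epoxy is a plausible suspect. On its own, though, the relation does not explain the mixed cracking pattern described above.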