I always enjoy your posts, Dave. They really take on a problem and put it into real-world context without any of the hype or confusion that can cloud the issues. This is a perfect example. Maybe not any "rocket science" takeaway here, but some real sound advice for engineers on the common traps most fall into when trying to get to the root cause of a part failure or, better yet, avoiding failure to start with.
I agree with Beth. Dave, thanks for such a clear overview. The principles you discuss here seem simple and obvious in hindsight, yet somehow can be easily forgotten even by well educated and well trained pros. They parallel some of the basic electrical system troubleshooting principles I learned from one of my engineer buddies years ago, which I apply mostly to my multi-component stereo system.
@Beth: Thanks for your kind comments. You're right that none of this is rocket science; it's just rational thinking. However, when parts break, people understandably get upset. Emotions can run high, and there may be a tremendous amount of pressure. As Ann points out, under these conditions, even intelligent and highly educated individuals may start to behave irrationally. The most important thing is to stay calm and focused -- especially when others aren't.
@GlennA: Experts, in particular, are susceptible to the temptation to jump to conclusions. The more experience you have, the more likely it is that a given problem resembles one you have encountered before. But that doesn't necessarily mean it's the same problem! Sometimes experience can be just as blinding as ignorance.
@TJ McDermott: You're right that sometimes time constraints can force you into a "kitchen sink" response. However, in these cases, it may be a good idea to continue investigating even after the "kitchen sink" solution has been implemented in order to determine the real root cause. Who knows? Maybe you can make yourself look like a hero for a second time by coming up with a cost savings when you realize that 2/3 of the kitchen sink solution was unnecessary.
Dave: No doubt this could be a case of finger pointing at its finest. I think the points you made are critical for engineering teams to sit back, take a deep breath and dive into the problem rather than attack it without a plan.
One of the other key points, I think, when solving problems is not to fixate on one area or to ignore another. If you are in design, don't automatically focus on whether the part is to print and then point the finger at quality. If you are in quality, don't ignore whether the part is to print and focus only on the design.
True problem solving is a skill that takes a lot of patience and discipline. You must let the data lead you but still be open to engineering decisions and insight, while remembering that the problem is that the part is breaking. We are all in this together trying to solve the problem, not pointing fingers at who caused it.
Dave. I couldn't agree more. Thanks for a dose of sanity. I too have been part of similar investigative teams. As noted, it seems that one of the biggest issues that pops up is getting management (or the customer) to be patient while the investigation proceeds. There are no shortcuts for a good analysis.
Can I suggest one more big mistake to add to your list?
6. Quickly dismantle a failed assembly. If you have an assembly that doesn't work, it's very tempting to take it apart to see what's broken. You probably have one or two theories as to what might be broken inside. But if you dismantle it and nothing is broken, then you're in real trouble. When you re-assemble, the chances are it will work perfectly, and you've destroyed the bug you've been commissioned to identify.
Instead, before you dismantle, get every relevant bit of information you can from the failed assembly. What are the resistances and capacitances at the terminals, or what is the frictional torque to move it, or how much does it weigh, or does it rattle when shaken etc. etc. If possible, x-ray. Develop a list of failure modes that could produce the observed symptoms, and see if you can prove or disprove any before dismantling. As you dismantle, measure the torque on bolts, look for dirt or misassembled components and for parts that have moved to unexpected positions. Once the disassembly is complete, all these clues will have been lost.
@Tigertom: That's a very good point. For those of us who are materials engineers, there's a temptation not only to take apart assemblies, but to cut parts up so that we can look at the microstructure. We end up with beautiful micrographs, but the original part falls victim to the chop saw.
As you point out, it's very important to get all of the information you can before taking apart an assembly. Once you get to the component level, it's also important to get all of the information you can from non-destructive testing before proceeding to destructive testing.
More than once, I've been in the position of realizing that I wanted to check something on a part after I had already performed a destructive test on it. As Homer Simpson says: D'oh!
I liked the article as well. Currently I find myself trying to get to the bottom of a lot of failures. I find your article interesting because some of the things you suggest not to do are exactly what we are doing. Our focus tends to start with understanding how big the problem is. Not because we don't want to fix everything, but because we don't have unlimited resources and we want to get the most bang for the buck. We tend to try to get data and group the failures into different root causes. And then we do focus on whether the part is to print. Quite often the failures are caused because the part is not to print. Once the part is to print and the variability is taken out of the system, then the root cause failure of the design can be attacked and improved. However, if the parts are not capable and can't be made to print, it doesn't matter how good the design gets because you will still have problems.
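The "group the failures and size the problem" step described above can be sketched as a simple Pareto tally. This is only an illustration, not the commenter's actual process: the function name `pareto_table` and the cause labels in the sample data are hypothetical.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, largest count first.

    `causes` is an iterable with one root-cause label per failure record.
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, n in counts.most_common():
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows


if __name__ == "__main__":
    # Hypothetical failure records, one label per returned part.
    records = [
        "not to print", "not to print", "not to print", "not to print",
        "design margin", "design margin",
        "assembly error",
    ]
    for cause, n, cum_pct in pareto_table(records):
        print(f"{cause:15s} {n:3d} {cum_pct:6.1f}%")
```

Sorting by count and tracking the cumulative percentage shows which few causes account for most of the failures, which is one way to decide where limited resources go first.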
Although I haven't been involved in formal failure analysis, I have often been called to troubleshoot problems. Often the most senior person involved 'declared' what the root cause was. After I finished my troubleshooting, I had often proven that the 'expert' was wrong. Seniority doesn't automatically mean that you know all of the intricacies.
There is also the fact that a failure often has more than one root cause. Carpet bombing may be the best way to improve the system as a whole. Often you need to get the part to print, make the design better so the tolerances can be larger, and improve the tool so the part doesn't vary as much. Anyone can design a part that has no tolerances and has to be made to print +/- 0.000001. But the truly good design engineer makes a part that can be made to +/- 3 mm or bigger.
These principles also hold true for failure analysis in electronics. I worked for a semiconductor company for years as a product and test engineer and recognize most of these scenarios as having happened at one time or another. One of the most interesting places in a semiconductor plant is the F.A. lab, which is usually where customer returns are evaluated. And of course when parts start failing on the production line, the first thing everyone tries to blame is the test set; it never occurs to them that their process might have shifted...
And yes I have been in all of the above situations. My favorite is getting a part in a box and being asked "why it broke?" Only the part is fully functional....
While all of this is sound advice, I personally still keep an open mind for problems that would be fixed quickly by one of these sinful actions. Countless times I have attached 100 probes and just measured data... and voilà, ten minutes later I know the solution.
It did bite me once when the issue was not design or a problematic part but rather EMI. See, probes can make the EMI issue go away...
Product failure analysis covers two different types of products, those that have been working properly for a long time, and those that don't have a history of having worked. The failure analysis of the two types would be a bit different, at least after the start. The first question would be "did it ever work correctly?", since if it did not, then the design may be suspect. But it is also possible that the design is good but the part was not made to the design. Amazingly, not every design is produced faithfully the first time.
The conclusion, then, is that in order to correctly understand why some part failed, it is mandatory to understand just how the system including that part was supposed to work. Having an adequate understanding of a system is seldom a trivial task, but it is important. A part will fail because it was subjected to forces beyond its strength. That is the fact in a majority of instances. At that point the question becomes: was the part made to the design specification, or was the specification adequate? Again, in order to be able to answer correctly there must be an adequate understanding of the system.
Interestingly enough, sometimes the problem is caused by there not being an adequate understanding of the system from the very beginning. And I am not sure how to solve that problem.
Great Article, Dave. I found myself thinking back to many different scenarios over the years, after reading each of your points, 1 thru 5. One which loudly resonates is touched upon in both your #2 and your #5 – jumping to conclusions, and management pressure to fix it quickly. Many times, I have dealt with a manager who forced his suggestion to be the fix, without going thru the necessary trials to prove it. I preach again and again, "a sample of one doth not constitute a statistical lot".
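To put a rough number on the "sample of one" point: if a fix is tried on n parts and none fail, the highest failure rate still consistent with that result at a given confidence level is 1 - (1 - confidence)^(1/n). The sketch below assumes independent trials with a constant failure probability, and the function name `max_failure_rate` is hypothetical.

```python
def max_failure_rate(n_passes, confidence=0.95):
    """One-sided upper bound on the true failure rate after n_passes
    consecutive trials with zero failures.

    Assumes independent trials with a constant failure probability p,
    and solves (1 - p) ** n_passes = 1 - confidence for p.
    """
    if n_passes < 1:
        raise ValueError("need at least one trial")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return 1 - (1 - confidence) ** (1 / n_passes)


if __name__ == "__main__":
    for n in (1, 10, 30, 100):
        print(f"{n:4d} passing trials: failure rate could still be up to "
              f"{max_failure_rate(n):.1%}")
```

With a single passing trial at 95% confidence, the bound is 95%: one good part only rules out a fix that fails more than 95% of the time, which says almost nothing about whether the fix works.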