It was on this exact address where the processor made the final jump from upper memory to lower memory (just after switching processing modes) that the target board with emulator attached lost its mind and jumped to a totally random address to begin executing garbage instructions. Because the problem occurred so consistently on that one jump instruction (and didn’t occur with an actual processor chip installed), the customer was convinced the emulator was at fault.
We had debugged plenty of code that made the real-to-protected mode transition with no trouble, so we were doubtful of a fault with the emulator. Yet the customer’s early setup code for the transition looked correct, so we were hard pressed to blame his code and tell him the problem was somehow in that portion of his code -- particularly since it did run on the actual processor chip itself.
Finally, in desperation, we asked the customer to send us his target system so we could replicate the problem -- which we did quite easily. But we noticed one thing the customer didn’t: The switching power supply on his board (which drove the whole processor and memory subsystem, as well as several peripheral devices) seemed awfully small for the number of chips on the board. It was so small, in fact, that we went back and redid a rough calculation of his power budget and found the switching supply to be nominally about 40 percent undersized.
We then took a hard look at the power rails on the board and found that they sagged just enough momentarily to cause the processor to run at below its minimum spec for Vcc with the emulator installed. Putting the processor in where the emulator was, we saw a similar sag, only not quite as pronounced. Ground also tended to “bounce” noticeably in both cases, but less so with the actual processor than it did with the emulator.
Working back, we cross-triggered a logic analyzer and an oscilloscope and discovered the point of “power sag” was on the second major jump instruction -- exactly where the code went wild when running under emulation.
The conclusion? On that second jump instruction, the switching supply, which was already huffing and puffing, had the job of changing the value on just about every single one of its address lines from “high” to “low” all at once. That’s a lot of signals simultaneously going from Vcc to ground.
That makes perfect sense, Didymus7. With the ever increasing time-to-market pressures on design, I would imagaine this is a common problem. Each year (maybe six months), you take your product, add new competitive features, and shove it back out the door.
This was an excellent explanation of the problem and the fix. And I can tell you exactly why it happened. The root cause is most probably because the designers were not given enough time to test the design. The design completion date was not set based on proper design and testing, but more likely set due to a customer PO.
I do agree, but if the SOP was followed for developement, there is no excuse for this to happen. ISO certification works when the employees follow the deveopement paper trail. Cut and paste should be caught by the ISO plan. There is something to say about QC Controls.... It sounds like a total QC failure to me, and a slap on the hand for the engineers for not following the QC plan.
As several commenters here have pointed out, this could easily be traced to a human (i.e., personnel) problem. I suspect that these kinds of software/hardware glitches are commonplace, largely because companies too often can't manage to hang on to the design engineers who know why a product was designed that way in the first place.
A big chunk of what happens -- particularly in the modern world where "software guys" and "hardware guys" are rarely one in the same -- is that when somebody goofs up on one side (in this case a hardware guy undersizing a power supply), it's the guy on the other side of the wall who sees it first and diagnoses it through the lens of his own world view. And to a software guy, a piece of code that crashes repeatably on the same instruction every time smells like a bad tool, not a bad power supply.
At the same time, there seems to be this dividing line between the "analog circuits guy" (who in this case got the design responsibility for the switchmode power supply) and the "digital circuits guy" (who got the job of designing the processor core and attaching its peripherals). It was those two not talking with each other that produced the problem of "undersized supply" in the first place.
Where it's good to have an "old timer" around is where you encounter those problems that cross boundaries between "hardware" and "software" as well as between "analog" and "digital." Engineers like that were being minted 30+ years ago. Now they're not, and the only way they come into being is with the accumulated on-the-job experience of a few decades behind them. In theory, project managers should have enough visibility into their project's activities that they should also be able to spot cross-functional problems like this. But in the last few years there's been a trend for engineers who don't like the profession after a few years to go get the MBA and come back to the organization with a "management" hat on. And when that happens, you're about guaranteed to have projects run by people with neither the breadth or depth to solve thorny problems like this.
The good news in the United States is that at least there are a few engineers who put in the 25 years it takes to get good at the business before taking on project management responsibilities. There are countries in this world where societal pressure is such that you're perceived a failure if you're still "just an engineer" after 25 years and you haven't moved into management. And the net result there is that *nobody* in that society puts in the time to get really good.
I logged 25 years as a "pure engineer" myself before founding Focus Embedded. And the average age of employees here is 54. But we're something of an anomoly in that regard. That said, we also get all of the really hard problems that nobody else can solve because nobody has the breadth of experience to see the entire picture when a beast such as "all F's change to all 0's" raises its head.
Good points, TJ. I believe that inexperienced engineers may be part of the problem. Another is the rapid pace of new product development. So may products get updated on a schedule (think Apple). Engineers are tasked with adding new features with each release. It makes perfect sense that they would keep adding new features to the existing design without revamping the entire product design.
This is the risk companies take when they force out senior engineers for less expensive junior ones. The loss of corporate knowledge can be critical. This is not just a case of experienced vs. inexperienced, but that of knowing a product line well and remembering the reasons for design expansions.
Companies end up answering the question "Why is the design this way?" with the answer "Because that's the way we've always done it" and have no basis to support the answer.
This error is so understandable. The design works, so it gets passed down again and again. Why not? Why redesign? At a certain point developments in what the design is supposed to do outgrow the basic dependable design. This has got to be happening all the time.
From Dell / Intel® New Paradigms in Design Work Scott Hamilton, vertical market strategist for Dell Precision workstations, 5/2/2013 3
Early in my career, I worked as a draftsman and remember the days of drawing on vellum with numbered pencils and Mylar with plastic lead. This was a fun experience in the sense that I ...
I've been using workstations for more than 10 years and love finding ways to get more performance from my system. With demanding professional applications that require more power each ...
A lasting memory from my first job as an engineer in an auto assembly plant is standing on hard concrete at six in the morning, vending-machine coffee clutched in hand, listening to ...
A quick look into the merger of two powerhouse 3D printing OEMs and the new leader in rapid prototyping solutions, Stratasys. The industrial revolution is now led by 3D printing and engineers are given the opportunity to fully maximize their design capabilities, reduce their time-to-market and functionally test prototypes cheaper, faster and easier. Bruce Bradshaw, Director of Marketing in North America, will explore the large product offering and variety of materials that will help CAD designers articulate their product design with actual, physical prototypes. This broadcast will dive deep into technical information including application specific stories from real world customers and their experiences with 3D printing. 3D Printing is
To save this item to your list of favorite Design News content so you can find it later in your Profile page, click the "Save It" button next to the item.
If you found this interesting or useful, please use the links to the services below to share it with other readers. You will need a free account with each service to share an item via that service.