The Pathfinder Reboot Problem

The best-known example of priority inversion occurred early in the Mars Pathfinder mission, when the spacecraft would reset itself after some hours of operation. After reproducing the problem on Earth, the NASA/JPL team concluded that it was caused by priority inversion. On the Pathfinder, a shared information bus carried communication between the spacecraft's components, and priority inversion kept a high-priority task from running for so long that the watchdog timer reset the system.

Because the information bus was shared, access to it was regulated by a mutex. On the Pathfinder, a high-priority, frequently run Information Bus Management task moved data in and out of the information bus, while a low-priority, infrequently run meteorological task used the bus to publish its data. Sometimes the low-priority task held the mutex when the high-priority task needed it, so the high-priority task had to wait until the mutex was released. Normally this wait was short. However, during the brief period in which the low-priority task held the mutex, a medium-priority communication task, which did not use the information bus, could preempt the low-priority task and so prevent it from releasing the mutex. Thus, even though the high-priority task had a higher priority than both the communication task and the meteorological task, it could not run, because the mutex it needed was held by a task that was itself not being scheduled. When the high-priority task was blocked for too long, the watchdog timer assumed something had gone seriously wrong and reset the system.
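The sequence of events can be reproduced with a small tick-based simulation. The following is only an illustrative sketch, not the Pathfinder flight software: the task names (`meteo`, `bus`, `comm`), the priorities, the release times, the critical-section lengths, and the watchdog limit are all made-up assumptions, and the scheduler is a toy fixed-priority preemptive scheduler with a single mutex.

```python
from dataclasses import dataclass


@dataclass(eq=False)            # eq=False: tasks are compared by identity
class Task:
    name: str
    priority: int               # larger number = higher priority
    release: int                # tick at which the task becomes ready
    ops: list                   # "lock"/"unlock" take no time; "run" uses one CPU tick
    pc: int = 0                 # index of the next op to execute


def simulate(tasks, watchdog_limit, max_ticks=50):
    """Toy fixed-priority preemptive scheduler for tasks sharing one mutex.

    Assumes every "lock" is directly followed by a "run" op.
    Returns (timeline, reset_tick); reset_tick is None if the watchdog never fires.
    """
    owner = None                                  # task currently holding the mutex
    timeline = []                                 # which task had the CPU at each tick
    high = max(tasks, key=lambda t: t.priority)   # task monitored by the watchdog
    starved = 0                                   # consecutive ticks `high` was ready but did not run

    for tick in range(max_ticks):
        if all(t.pc >= len(t.ops) for t in tasks):
            break                                 # every task has finished
        ready = [t for t in tasks if t.release <= tick and t.pc < len(t.ops)]
        ran = None
        for task in sorted(ready, key=lambda t: -t.priority):
            if task.ops[task.pc] == "lock":
                if owner is not None and owner is not task:
                    continue                      # blocked on the mutex: try the next task
                owner = task
                task.pc += 1
            assert task.ops[task.pc] == "run"
            task.pc += 1                          # one tick of CPU time
            ran = task
            if task.pc < len(task.ops) and task.ops[task.pc] == "unlock":
                owner = None                      # critical section ends with this tick
                task.pc += 1
            break

        timeline.append(ran.name if ran else "idle")
        if high in ready and ran is not high:
            starved += 1
            if starved > watchdog_limit:
                return timeline, tick             # watchdog fires: system reset
        else:
            starved = 0
    return timeline, None


def pathfinder_tasks(with_comm_task):
    """Hypothetical task set loosely modelled on the description above."""
    tasks = [
        Task("meteo", 1, 0, ["lock", "run", "run", "run", "unlock"]),  # low priority
        Task("bus", 3, 1, ["lock", "run", "unlock"]),                  # high priority
    ]
    if with_comm_task:
        tasks.append(Task("comm", 2, 2, ["run"] * 6))                  # medium, no mutex
    return tasks


if __name__ == "__main__":
    print(simulate(pathfinder_tasks(with_comm_task=True), watchdog_limit=5))
    print(simulate(pathfinder_tasks(with_comm_task=False), watchdog_limit=5))
```

With these assumed numbers, the run that includes the medium-priority `comm` task ends in a watchdog reset, because `bus` stays blocked on the mutex while `comm` keeps `meteo` off the CPU. The run without `comm` completes normally: `bus` waits only for the remainder of `meteo`'s short critical section.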