Release Autonomous Systems with Confidence
At Silexica, we’re passionate about complex software solutions and about solving the challenges of developing them. Today, it is becoming more and more difficult to maintain the quality of software stacks, and software complexity will continue to increase. For mobile systems such as automated vehicles or robots, which can do serious harm when they fail, it is paramount that the software stacks are developed with rigor and thoroughness. Nobody wants to ship faulty products, especially if they could harm someone.
After investigating how modern software systems are developed and tested, it became clear to us that there’s a gap that needs filling. Given the inherent complexity of today’s software systems (and future systems even more so), traditional approaches to debugging and testing are not adequate to capture all defects. According to the 2019 EEtimes Embedded survey, some of the biggest challenges in developing and testing today’s software systems are:
1. Managing the complexity in the (software) system (many components, no good overview)
2. Debugging those systems effectively
There are so many software components involved that when an error occurs, developers often spend several days figuring out the root cause. Often there is no overview that shows the complete system state, including relevant metrics, from which you could draw conclusions about where the root cause might lie. If the system has real-time constraints, it gets even more difficult: it’s often impossible to pause the system at a certain point in time, as the system will most likely crash. Usually, you run your system several times and everything goes well. But then, during one execution, something goes wrong. If you’re lucky, you already prepared your system in some way and can start the investigation with some hint about where the root cause might lie. But how do you work your way through this data effectively? In practice, this generally means hours, or even days, spent instrumenting your system to gain more insight into the root cause of the defect, often complicated further by the need to recreate the failing scenario.
All-in-all, the time it takes to mitigate a defect can be divided into the following components:
- Realize there’s a defect
- Instrument your system and re-create the scenario leading to the defect
- Analyze the logged data and identify the defect’s root-cause
- Implement the mitigation/fix
Silexica has created a new disruptive platform that aims to reduce the first three components radically: SLX Analytics.
To illustrate the benefits of using SLX Analytics in your development lifecycle, let’s use a real-world example that we’ve encountered. Based on the Autoware.AI software stack, we’ve put SLX Analytics to the test in a typical scenario for automated driving systems. A typical system metric that you would want to monitor in an automated driving system is the software pipeline latency. In our case, we took the latency of the lidar perception pipeline that acts upon the raw lidar input data and outputs a filtered version. Suppose our system design requires that this part of the software pipeline shall never exceed 650 milliseconds (ms) duration; we can define this as a constraint in SLX Analytics and verify automatically that this constraint is not violated.
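To make the idea of a latency constraint concrete, here is a minimal Python sketch. It is not the SLX Analytics interface: the `LatencyConstraint` class, its `violations` method, and the sample values are hypothetical and only illustrate the check described above. Only the 650 ms bound comes from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LatencyConstraint:
    """Hypothetical upper bound on a pipeline latency, in milliseconds."""
    name: str
    max_ms: float

    def violations(self, latencies_ms: Sequence[float]) -> List[int]:
        """Return the indices of all samples that exceed the bound."""
        return [i for i, value in enumerate(latencies_ms) if value > self.max_ms]


# The 650 ms bound is taken from the text; the samples are made up.
lidar_constraint = LatencyConstraint(name="lidar_perception_latency", max_ms=650.0)
sample_latencies_ms = [118.0, 121.5, 640.0, 702.3, 119.9]

violating_indices = lidar_constraint.violations(sample_latencies_ms)
print(violating_indices)
```

In this sketch a sample of exactly 650 ms still passes, since the requirement is that the latency shall never exceed the bound.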
Continuous Monitoring of System Metrics
SLX Analytics uses multiple runs to gather statistical evidence regarding system-level KPIs. A specific set of such runs is called an experiment; an experiment could be executed, for example, for each nightly build of your software. From these runs, SLX Analytics aggregates the measurements and provides a summary view of your most relevant metrics. In this example, we focus on the latency of the software pipeline. Let’s assume that SLX Analytics tests our software each night using 20 runs.
In the picture, you see five experiments which were executed Monday to Friday. SLX Analytics shows you the overview of experiments by day and the statistical distribution of the measured latency in the form of box plots (the big black line in the middle represents the median, the small lines at the top/bottom represent the highest/lowest measurements). In this case, the experiment runs on Monday were fine in terms of our pipeline latency; however, on Tuesday something changed: you can see that the worst-case execution time went up. The median (shown by the big horizontal bar) is still similar to the previous day’s, but there are new outliers: this could be due to an intermittent defect in your system. The system is not violating any constraint (yet), but it could lead to a violation in the future.
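The day-over-day comparison can be sketched as follows. This is an illustration under stated assumptions, not how SLX Analytics computes its summaries: the function names, the `margin` heuristic for flagging a degraded experiment, and all latency values are made up. Each list holds the worst-case latency of one of 20 runs.

```python
import statistics
from typing import Dict, List, Optional

CONSTRAINT_MS = 650.0  # bound taken from the text


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the values a simple box plot would show for one experiment."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


def classify(summary: Dict[str, float], previous_max: Optional[float],
             margin: float = 1.5) -> str:
    """Label an experiment; the margin heuristic is an assumption."""
    if summary["max"] > CONSTRAINT_MS:
        return "violation"
    if previous_max is not None and summary["max"] > margin * previous_max:
        return "degraded"  # new outliers, but still within the constraint
    return "ok"


# Made-up worst-case latencies (ms) for 20 runs per nightly experiment.
monday = [120.0 + (i % 5) for i in range(20)]
tuesday = monday[:18] + [410.0, 455.0]
wednesday = monday[:17] + [660.0, 705.0, 810.0]

statuses: Dict[str, str] = {}
previous_max: Optional[float] = None
for day, runs in [("Monday", monday), ("Tuesday", tuesday), ("Wednesday", wednesday)]:
    summary = summarize(runs)
    statuses[day] = classify(summary, previous_max)
    previous_max = summary["max"]

print(statuses)
```

With this data the median stays the same on Tuesday while the maximum jumps, which is the pattern described above: a warning sign before any constraint is actually violated.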
By Wednesday, the defect’s effects got worse, and it now violates the constraint. We now want to see what was going on and why this violation happened.
Drill-down into your system
We have identified the Wednesday build as a violating one due to our SW-pipeline-latency constraint. SLX Analytics provides drill-down features to inspect the system state at the failing point in time. By clicking on the box plot of the failing experiment, we get the histogram view of that baseline.
The histogram shows you all executed software runs for this experiment and the measured latencies of all those runs. Here we can see that in three instances, the latency constraint was violated. Clicking once more on one of the failing instances, SLX Analytics provides you with the timeline view of that single run, showing the measured latencies during that run.
Finally, here we can see the exact point in time when the latency constraint was violated. Before drilling down further, there are even more insights this view offers. See how the latency first sits on a plateau at around 120 ms, then rises to the event leading to the violation, and then, after some up and down, settles on a lower plateau at around 15 ms? This indicates that the software pipeline does not recover after the violation: a latency this low is too short for a fully functional pipeline.
Coming back to our violating event, if you drill down even further, SLX Analytics provides you with a view of the system state at the latency violation.
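The "does the pipeline recover?" question can be approximated with a simple heuristic. The sketch below is an assumption of ours rather than a feature of SLX Analytics: it compares the median latency at the start of a run with the median at the end, and the `window` and `tolerance` parameters as well as the sample values are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def first_violation(latencies_ms: Sequence[float], max_ms: float) -> Optional[int]:
    """Return the index of the first sample above the bound, or None."""
    for index, value in enumerate(latencies_ms):
        if value > max_ms:
            return index
    return None


def recovered(latencies_ms: Sequence[float], max_ms: float,
              window: int = 5, tolerance: float = 0.5) -> Optional[bool]:
    """Heuristic: did the latency return to its initial plateau?

    Returns None if the run never violates the bound. Otherwise compares the
    median of the first `window` samples with the median of the last `window`
    samples and accepts a relative deviation of up to `tolerance`.
    """
    if first_violation(latencies_ms, max_ms) is None:
        return None
    baseline = statistics.median(latencies_ms[:window])
    settled = statistics.median(latencies_ms[-window:])
    return abs(settled - baseline) <= tolerance * baseline


# Made-up run: plateau near 120 ms, a violation, then a plateau near 15 ms.
run_ms = [120.0, 121.0, 119.0, 122.0, 120.0,
          300.0, 520.0, 700.0, 400.0, 90.0,
          16.0, 15.0, 14.0, 15.0, 16.0]

pipeline_recovered = recovered(run_ms, max_ms=650.0)
print(pipeline_recovered)
```

A result of `False` here flags a run whose latency settled far away from its original level, which is worth investigating even though the low values themselves do not violate the upper bound.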
Mitigate the defect
With this view, you can inspect your system at the failing point in time and figure out the root cause. The view can categorize and filter different layers of your software system, so you can distinguish between your application’s processes and other processes (e.g., kernel, middleware processes). This means you’re not limited to the application level: you can also analyze the middleware and the OS level, all in one analysis.
In this case, the kernel processes are colored in pink. The dotted blue line at 41.78 s marks the beginning of the failing constraint’s event chain, that is, the point at which the start event should have occurred. The dotted red line marks the end time of the failing constraint (thus, it lies 650 ms later). You can now see that after the begin marker, there is much more kernel activity than before. Without this system-level view, it would have taken a long time to figure out that there’s abnormal kernel activity, as the focus is usually on debugging the application code first. In this case, after digging a bit deeper, we figured out that the kernel started an SD card I/O task, which in turn led to a blocked system. Ultimately, we identified an erroneous configuration on our real-time system, which meant the scheduler was not scheduling the real-time tasks correctly. Without SLX Analytics, it would have easily taken some days to get to this conclusion.
With SLX Analytics, Silexica took a radically new approach to non-functional system-level testing: it unifies all relevant layers in your software stack and gives you complete insight into what’s happening in your system. There’s no more need to re-create degrading scenarios, as it allows you to drill down into the full system state in post-processing. This way, you can keep your software developers developing valuable features instead of fixing defects.
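The comparison of kernel activity before and after the begin marker can be illustrated with a small sketch. The trace format (timestamp and layer name per event) and the events themselves are hypothetical; only the 41.78 s begin marker and the 650 ms constraint come from the text, which puts the end marker at roughly 42.43 s.

```python
from typing import List, Tuple

# Hypothetical trace format: (timestamp in seconds, layer name).
Event = Tuple[float, str]

WINDOW_START_S = 41.78  # begin marker from the text
CONSTRAINT_S = 0.650    # 650 ms constraint from the text
WINDOW_END_S = WINDOW_START_S + CONSTRAINT_S  # end marker, about 42.43 s


def count_layer(events: List[Event], layer: str,
                start_s: float, end_s: float) -> int:
    """Count events of one layer with start_s <= timestamp < end_s."""
    return sum(1 for timestamp, name in events
               if name == layer and start_s <= timestamp < end_s)


# Made-up events around the violation.
events: List[Event] = [
    (41.20, "kernel"), (41.35, "application"), (41.60, "application"),
    (41.80, "kernel"), (41.90, "kernel"), (42.00, "kernel"),
    (42.10, "application"), (42.20, "kernel"), (42.40, "kernel"),
]

# Compare the constraint window with an equally long window just before it.
kernel_before = count_layer(events, "kernel",
                            WINDOW_START_S - CONSTRAINT_S, WINDOW_START_S)
kernel_during = count_layer(events, "kernel", WINDOW_START_S, WINDOW_END_S)
print(kernel_before, kernel_during)
```

Comparing two windows of equal length keeps the counts directly comparable; a sharp increase in the kernel count during the constraint window is the kind of signal that points away from the application code.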
Integrating this approach in your continuous integration (CI) pipeline gives you complete confidence in the stability of your releases and makes sure that you never again miss a release due to integration problems or unknown defects. SLX Analytics provides you with a complete overview of your most critical system metrics and automatically verifies all of them by running each release multiple times and generating statistical evidence on those metrics.
Combining these two benefits, it further acts as a substantial enabler when it comes to incremental shift-left testing as it reduces the time required for each testing and debugging cycle.
Dr. Kai Neumann has extensive experience in ADAS/AD development and validation. During his career, he was Engineering Manager for ADAS/AD Validation at ZF Friedrichshafen AG and Lead Engineer for Automated Driving Platforms at Aptiv. He is now a Product Manager at Silexica defining products that will enable robust and safe future intelligent systems. He holds a Ph.D. in Engineering from Aachen University and is based in Cologne, Germany.