Embedded Conference Bangalore Notes

I went to the Embedded Systems Conference held at the NIMHANS convention center. Some of the sessions were very useful, while others were boring and not of much use. I shall write about some of the good sessions that I attended.

Beyond C, National Instruments

Although in the beginning I thought this session would be about some new language for writing embedded code, it turned out to be a demonstration session by a couple of good speakers from National Instruments, Bangalore, who demonstrated the working of LabVIEW.

The presentation was nice, coupled with videos of people around the world showing how they have successfully used LabVIEW in areas like medicine, robotics (RoMeLa, Virginia Tech), and a speech-controlled wheelchair (Ambient).

LabVIEW is a graphical system design platform with which developers can write code as blocks and have it downloaded directly onto an FPGA. Apart from that, there are modules available like LabVIEW Real-Time, which can target an embedded OS like VxWorks, cutting down on development time.

Apart from NI there are some other vendors in this space as well, such as MathWorks, Ptolemy, and Singular.


Static Code Analysis - David Kalinsky


David Kalinsky (http://www.kalinskyassociates.com/DavidBio.html) is a PhD who works on high-availability, safety-critical systems. He gave a nice presentation on static code analysis (SCA) tools.

To start with, David said that static code analysis tools are not 100% ready. Certain vendors claim that their tools are good, but the tools still miss many important things in multithreaded applications. As an example, he said that if we have 100 threads in our application, each with 100 lines of code, it is not wise to invest in SCA tools at this point in time. However, if we have 10 threads with 10,000 lines of code each, SCA might be useful. This clearly suggests that for heavily multitasking apps, SCA is not yet ready.

Before moving into the details of static analysis, David talked about the C language and how dynamic analysis is not a completely foolproof mechanism for ensuring that the code is covered.

C/C++ compilers (if not used with warnings enabled, e.g. -Wall) will excuse a developer's use of dangerous code and happily generate assembly for it. He suggested that compilers ought to be more critical and stop such writing practices.

Talking about static analysis tools now, David said that there have been two generations of SCA tools so far, and a third generation is being built and getting better by the day.

First Generation: In England, a while back, an industry consortium defined a subset of C which they called MISRA-C. This contained the constructs that were considered safe, and it has been used in many cars (the automotive business), but the MISRA-C compiler only took care of about two-thirds of the restrictions, because the remaining one-third cannot be checked by a compiler.

Second Generation: At Bell Labs, developers went on to make lint. It worked (and still works) like a compiler, the difference being that it does not produce code but checks the code for vulnerabilities. The problem with lint is that it is very shallow when it comes to reporting bugs: its reports say things like "there MIGHT be" a problem, as if it is not sure whether there is a bug or not.

Third Generation: The third-generation tools dig deep, but there is still no 100% certainty that all the bugs will be found. The question that comes to mind is: how deep should they dig?

To understand that, we need to look at the criteria for code coverage. One criterion is what we call "line coverage": we run enough tests to ensure that each and every line of code has been executed. That is easier said than done. Even if the whole testing effort is automated, it is virtually impossible to reach 100% line coverage, and it is known that 2-5% of code is never exercised in the first place.

But David says that even 100% line coverage is not good enough. The better option is what he calls "path coverage". Path coverage is based on the criterion that whenever a branch is encountered, the tool needs to go down each path at least once: functions called within functions and so on, switch cases, if and else statements. But even in the case of path coverage, it would take years to write enough test cases for 100% coverage; still, there are scenarios, like DO-178B certification, where path coverage is required. Some other metrics are Decision Coverage, Condition Coverage, Multiple Condition Coverage, and Modified Condition/Decision Coverage (MC/DC).

Dynamic Analysis

- Observes an executable at runtime
- Very useful for catching dynamic memory corruption
- Requires writing test cases that are realistic and relevant to the code
- Valgrind is a popular open-source dynamic analysis tool

Downsides

1. You have to have software that runs first (an executable). Unit testing finds the small bugs; you have to wait until the integration testing phase for the bigger ones.

2. Analysis is slow

3. We start cutting corners, testing only the cases that we planned for, and in some cases tests are skipped.

4. The results are sometimes non-deterministic. There might be bugs that are not repeatable (they may hide in the interleaving between threads, and the test cases might not cause the interleaving that is required).

Static Analysis

- Defects are detected early
- No test cases required; smart algorithms do the work
- Analysis is fast
- Analysis can be deterministic (because the analysis is done offline)

Downsides

1. Static analysis tools don't yet understand all languages (support is mostly limited to C/C++).

2. If you have assembly code written within C (for example), that code is treated as a benign black box because the tool does not understand the language, and it starts making conservative assumptions about the code.

3. Sometimes there are false positives (reported bugs that aren't real, usually around 1%).

4. Sometimes there are false negatives (the tool will miss a real bug).

The Ada compiler is much stricter when it comes to not letting people write dangerous code, which is why Ada is still the preferred language for aerospace software.

The thing to note is that Dynamic Analysis Tools and Static Analysis Tools complement each other and both should be used.

Working of a Static Analysis tool

These are the steps that are usually taken:

1. A call graph is created from the code, makefiles, etc.

2. The functions are examined bottom-up. Every single function is looked at, and any suspicious activity is bookmarked (pointer assignments, dereferences, etc.).

3. Every function in the call tree is reviewed again, taking into account the control flow through the function and every sub-function (the control flow graph).

4. The code defects are noted and reported.

Tools

This website lists all the static analysis tools available on the face of the earth, and it includes the three which David suggested: Coverity, Klocwork, and Polyspace.


Learning from Disaster - Jack Ganssle

This was a very good and interesting session by Jack Ganssle (http://www.ganssle.com/bio.htm), a very well-known person in the field of embedded systems. Jack talked about the small things that people do not take care of in their designs, which result in huge blunders, and throughout the presentation he gave lucid examples from real-life disasters. Here are a few:

Tacoma Narrows Bridge

The USP of that bridge was that it was built at the cheapest cost, but it had some serious problems. During high winds the bridge would start resonating in a wave-like fashion (check the video on Wikipedia: http://en.wikipedia.org/wiki/Tacoma_Narrows_Bridge), which ultimately led to its destruction. The worrying part was that the person who had designed this bridge (Leon Moisseiff) had designed bridges in the past that had the same problem.

So,

1. Cheaper sometimes turns out to be more expensive. (The bridge had to be remade)

2. Management might want lower cost, but it can't cheat physics.

3. We need to learn from the past. (Now all bridge designs in the US are tested in a wind tunnel.)

4. The problem is that we keep repeating the same mistakes over and over again in the embedded world and do not share what we learn with anyone.

Clementine Lunar Failure

Clementine (http://en.wikipedia.org/wiki/Clementine_mission) failed because:

1. Schedules were tight and people were working 60-80 hours a week. Schedules can't rule, because tired people make mistakes.

2. The software that was put on the machine was not tested.

3. There were watchdogs but they were not used.

4. There was no version control system used.

Mars Exploration Rover


The rover was supposed to work for only 90 days, and it is still sending data to NASA. But initially it had a problem. When it started its work and began the drilling process, it simply stopped. The problem was that the scientific data being generated was written to the flash file system, and the file system became full. The engineers tried to free space by deleting data, but it being a FAT-type file system, the directory entries still persisted. There was a watchdog that caught the exception and would restart the rover, and every time it would encounter the same problem and restart, again and again.

The rover was to work for 90 days, but it was never tested for more than 9 days. Exception handling was awful. It seems 6 other NASA missions had the same problem (as they used the same OS). We have to learn from our mistakes and past experiences.

Ariane 5

Ariane had four great first launches. For the 5th one, everyone was so confident that a payload worth half a billion dollars was put on board. But Ariane 5 blew up about 40 seconds into the launch.

The problem was with the Inertial Navigation System. The Ariane folks changed their hardware but reused the old code. There was a place where a 64-bit value was being converted to a 16-bit integer value. An overflow exception occurred, which shut the Inertial Navigation System down, and there was no backup!

So the learnings are:

- We need to be very careful with ported code
- Never assume that the software will never fail
- Test everything.

Therac-25

This was a radiation therapy machine (http://en.wikipedia.org/wiki/Therac_25) for treating tumors which had a serious flaw: six patients were given massive overdoses of radiation, several of them fatal. There was a bug which would keep saying that the dosage had not been given even when the operator had pressed the button. So the operator would press the button again, and the system would repeat this until so much radiation had been given to the patient that it proved deadly.

There is an MIT paper about the flaw which we can read here.

- The entire code was written by one person, who then left the company. Lesson: perform code inspections.
- They were using a homegrown RTOS which had a synchronization problem. Lesson: use only tested and certified RTOSes.

There are a lot more examples in the actual slides presented by Jack, which are attached to this post.

Overall learnings

1. Do code inspections

2. Testing has to be adequate

3. Simulation is good, but in the end it is not reality. Perform testing on real systems.

4. Exception handlers are a constant source of problems. Write good handlers.

5. Watchdogs should be used. They save lives.

6. Use the methods of "Design by contract" (http://en.wikipedia.org/wiki/Design_by_contract)

7. Use a version control system

8. Be terrified of the C language:

- C (worst case): 500 bugs / KLOC
- C (automatic code generation): 12.5 bugs / KLOC
- Ada: 4.8 bugs / KLOC
- SPARK: 4 bugs / KLOC (the toolchain includes a static analysis tool)

9. Think of using MISRA-C

10. Use static analysis and dynamic analysis (lint, Valgrind)

11. Schedules can't rule as tired people make mistakes

12. Reuse, sometimes, is very difficult.

13. Be wary of financial shortcuts. Management will always want something at a low cost.

14. Conduct scientific post-mortems

15. Last but not least: "Learn from the mistakes of others"
