When purchasing a car, there are several factors to consider. Miles per gallon, horsepower, crash rating, and cargo space are all objective measurements that can be used to determine which choice best meets your needs. No such simple measurements exist for comparing two software security tools; there is simply no industry-standard "measuring stick" for security tools.
One contributing factor is the difficulty of creating a true benchmark for software security tools. The most applicable definition of a benchmark comes from Merriam-Webster: "a standardized problem or test that serves as a basis for evaluation or comparison." To illustrate some of the difficulties in creating such a test, we will look at the OWASP 1.2b Benchmark. From here on, it will be referred to as a "test suite" rather than a benchmark; the reason will be discussed after a brief introduction to the test suite itself.
The OWASP 1.2b test suite is a collection of 2,740 Java servlets covering 11 vulnerability categories and drawing on 5 Java libraries. Among these servlets, referred to as "test cases," 1,415 have been intentionally seeded with an example of a particular vulnerability; the remaining 1,325 are intended to be secure, containing no vulnerabilities. When a tool is run against the test suite, its score is determined by comparing its classification of each test case against the expected results.
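To make the scoring idea concrete, the sketch below compares a tool's findings against the expected results and computes true and false positive rates. The OWASP project's scorecard generator uses a metric along these lines (true positive rate minus false positive rate), but this is a simplified illustration, not the project's actual scorer, and the test-case names are invented for the example.

```java
import java.util.Map;

public class SuiteScore {
    // expected: test case name -> true if intentionally vulnerable
    // reported: test case name -> true if the tool flagged it
    static double[] rates(Map<String, Boolean> expected, Map<String, Boolean> reported) {
        int tp = 0, fn = 0, fp = 0, tn = 0;
        for (Map.Entry<String, Boolean> e : expected.entrySet()) {
            boolean flagged = reported.getOrDefault(e.getKey(), false);
            if (e.getValue()) { if (flagged) tp++; else fn++; }  // seeded vulnerable cases
            else              { if (flagged) fp++; else tn++; }  // intended-secure cases
        }
        double tpr = tp / (double) (tp + fn);  // fraction of seeded flaws found
        double fpr = fp / (double) (fp + tn);  // fraction of secure cases wrongly flagged
        return new double[] { tpr, fpr, tpr - fpr };
    }

    public static void main(String[] args) {
        Map<String, Boolean> expected = Map.of(
            "Case1", true,    // seeded with a vulnerability
            "Case2", true,
            "Case3", false,   // intended to be secure
            "Case4", false);
        Map<String, Boolean> reported = Map.of(
            "Case1", true,    // correct detection
            "Case3", true);   // false positive
        double[] r = rates(expected, reported);
        System.out.printf("TPR=%.2f FPR=%.2f score=%.2f%n", r[0], r[1], r[2]);
    }
}
```

Note that a single number like this rewards any tool that raises both rates together, which is part of why the raw score needs careful interpretation.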
Before discussing the implications of these results, it is important to understand the distinction between calling this collection of test cases a benchmark and calling it a test suite. By definition, a benchmark must be a basis for evaluation. By that standard, the OWASP test suite would be an acceptable benchmark if what is being evaluated is the ability of a security tool to detect instances of 11 vulnerability categories in Java servlets stemming from 5 libraries. That, however, does not accurately represent the ability of these tools to detect an ever-growing number of vulnerability types in complex, enterprise-level code. Because of this, it is most accurately referred to as a test suite rather than a benchmark.
An ideal scan of this test suite would report exactly one vulnerability, of the intended category, in each of the 1,415 vulnerable test cases. However, in an effort to create succinct and straightforward code, many typical programming conventions, such as error checking, are ignored. This produces code that is easily read and understood by auditors, at the cost of unintended vulnerabilities, which in the case of OWASP far outnumber the intended ones. These extraneous vulnerabilities complicate scoring of the test suite and undermine confidence in the results. Because of this, far more effort must be put into interpreting the results of this and similar test suites than into reading the single number produced by the scoring methods.
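To illustrate the problem, here is a made-up test case in the stripped-down style described above. The names and code are invented for this sketch, not taken from the suite: the intended finding is SQL injection via string concatenation, but the same terseness (no input validation, no error handling, no resource management) is exactly what attracts additional, unintended findings in real scans.

```java
public class SeededCase {
    // Intended vulnerability: user input concatenated straight into a SQL query.
    // In a real test case this string would be passed to a JDBC Statement; a
    // scanner might also flag the missing validation and error handling that a
    // deliberately minimal test case omits.
    static String buildQuery(String userInput) {
        return "SELECT * FROM users WHERE name = '" + userInput + "'";
    }

    public static void main(String[] args) {
        // A classic injection payload escapes the quoting entirely.
        System.out.println(buildQuery("x' OR '1'='1"));
    }
}
```

A scanner scoring this case "correctly" must report the injection and nothing else, which is a narrow target when the surrounding code is intentionally bare.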
In the graph above, "Test Cases" represents the number of test cases in each category, either "True Positive" (intended vulnerable) or "False Positive" (intended secure). "Dead Code" represents the 245 vulnerabilities inside dead code paths that are intended to test for false positives. As a design choice, SCA reports these issues, because dead code is often the result of a design mistake and could unexpectedly become active after a patch. "Reported" represents the number of issues reported by SCA.
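The shape of such a dead-code case can be sketched as follows. This is an invented example, not one of the suite's servlets: the vulnerable branch can never execute, so the suite scores any report on it as a false positive, while a tool like SCA may flag it anyway on the reasoning that the branch could be revived by a later change.

```java
public class DeadCodeCase {
    static String render(String userInput) {
        boolean featureEnabled = false;  // constant guard: the branch below is dead
        if (featureEnabled) {
            // Unencoded user input in HTML output (XSS) on an unreachable path.
            return "<div>" + userInput + "</div>";
        }
        return "<div>safe</div>";
    }
}
```

Whether flagging this is a "false positive" is thus partly a scoring convention, not a fact about the tool's accuracy.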
While the OWASP 1.2b test suite provides useful information about the performance of security tools, it should not be taken as a definitive measure of quality. This problem is not unique to the OWASP test suite, and it highlights the need for more research into objectively evaluating security tools. In the meantime, it is important to evaluate not only the results of self-proclaimed benchmarks but their quality and applicability as well. In other words, don't forget to measure the measuring stick!