On the litigation question, the Micro Focus ArcSight: Protecting Security Analytics with an Audit Quality SIEM Solution white paper discusses the acceptance of ArcSight CEF as evidence. Digital evidence is copyable, and a copy can be mathematically shown to be identical to the “original”. I put “original” in quotes because of the way computer systems log data. It is not all Syslog. It is not even all text! If digital data were held to the same standards as physical evidence, entire database systems, or Microsoft Active Directory servers, would be required as “original” evidence. And that’s before considering the likelihood of the evidence being altered merely by accessing (reading) the data on its source system.
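To make the “mathematically identical” idea concrete, here is a minimal sketch using a cryptographic hash. The log line, the host name, and the choice of SHA-256 are assumptions made for illustration; they do not come from the white paper. Matching digests are very strong evidence that two copies are byte-for-byte identical, which is the usual practical meaning of the claim.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical log line standing in for the "original" evidence.
original = b"<134>Oct 11 22:14:15 fw01 action=deny src=10.0.0.5 dst=192.0.2.10\n"
copy = bytes(original)                          # a faithful copy
tampered = original.replace(b"deny", b"allow")  # a single altered field

print(sha256_hex(original) == sha256_hex(copy))      # True
print(sha256_hex(original) == sha256_hex(tampered))  # False
```

Recording the digest at collection time is what lets a later copy be checked against it.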
Preferring raw security data for litigation purposes is moot. From everything I’ve read, including “Computer Security Log Files as Evidence,” normalized security data is equivalent to the raw security data. As I understand it, this rests on the same legal foundations that support digital evidence in computer forensics. Standard disclaimer: I am NOT a lawyer!
Potential Normalization Errors
This topic has the potential to start international holy wars. Apple vs. Microsoft? Linux vs. Windows? VI vs. EMACS? Hot Dogs are or are NOT sandwiches? Ketchup (Catsup?) on hamburgers? Yeah, that level of international holy wars! Fun. This is probably worth a post on its own.
To normalize modern security data, you must first parse it. Cutting-edge logs (think cloud!) are likely to be structured, e.g., in JSON format. Some security devices log in XML. Many security devices log in CEF or LEEF. These are fairly easy to parse, as they ultimately boil down to key-value pairs within the log structure.
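As a rough illustration of the key-value point, here is a minimal sketch of a CEF parser. It assumes any Syslog prefix has already been stripped, it does not handle every escape sequence, and the function name, the header key names, and the sample message are made up for this example.

```python
import re

# Names for the seven pipe-delimited CEF header fields (key names are this sketch's choice).
HEADER_FIELDS = [
    "version", "device_vendor", "device_product", "device_version",
    "signature_id", "name", "severity",
]
# An extension key is a word followed by "=", at the start or after whitespace.
KEY_RE = re.compile(r"(?:^|\s)(\w+)=")


def parse_cef(line: str) -> dict:
    """Parse one CEF message into header fields plus an 'extension' dict."""
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF message")
    # Split on pipes that are not escaped with a backslash.
    parts = re.split(r"(?<!\\)\|", line[len("CEF:"):], maxsplit=7)
    if len(parts) < 7:
        raise ValueError("incomplete CEF header")
    event = {k: v.replace("\\|", "|") for k, v in zip(HEADER_FIELDS, parts)}
    extension = parts[7] if len(parts) == 8 else ""
    # Each value runs from its "key=" up to the next " key=" (values may contain spaces).
    matches = list(KEY_RE.finditer(extension))
    fields = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(extension)
        fields[m.group(1)] = extension[m.end():end].strip().replace("\\=", "=")
    event["extension"] = fields
    return event


sample = ("CEF:0|Security|threatmanager|1.0|100|worm successfully stopped|10|"
          "src=10.0.0.1 dst=2.1.2.2 spt=1232 msg=Worm stopped on host")
event = parse_cef(sample)
print(event["device_vendor"], event["extension"]["src"])  # Security 10.0.0.1
```

Even this “easy” format needs care: values can contain spaces, so the extension cannot simply be split on whitespace.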
Note that I am not saying that parsing XML is trivial. Throw in a complex XML system with layered DTD files, and you can run into many problems writing your parser. Throw in CDATA or other unstructured data, or Syslog and the way many Syslog systems are typically misconfigured (do all your devices throw their logs into the same Syslog server, with almost no source information and mangled time stamps?), and parsing becomes the art of herding cats.
Throw in the fact that what one firewall vendor calls a given field, another firewall vendor calls something else, and yet another firewall vendor doesn’t even have a similar field, and mapping the data to your preferred schema can get complicated. And that’s just one device type. Throw in other device types, like IDS/IPS, or NGFW, or DLP, or web proxies, or... you get the idea. The overlap of data fields becomes smaller and your favorite schema starts growing. One of my favorite former ArcSight colleagues, Sanford Whitehouse, estimated there were around 1,500 distinct fields across all the devices that a SIEM should monitor.
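The mapping problem can be sketched in a few lines. Everything here is hypothetical: the two vendors, their native field names, and the three-field target schema are invented for illustration, and real mappings are far larger.

```python
# Hypothetical per-vendor maps from native field names to one common schema.
FIELD_MAPS = {
    "vendor_a": {"src_ip": "sourceAddress", "dst_ip": "destinationAddress", "act": "deviceAction"},
    "vendor_b": {"SourceIP": "sourceAddress", "DestIP": "destinationAddress"},  # no action field at all
}


def normalize(vendor: str, record: dict) -> dict:
    """Rename the fields a vendor map knows about; unmapped fields are dropped."""
    mapping = FIELD_MAPS[vendor]
    return {schema: record[native] for native, schema in mapping.items() if native in record}


def from_host(event: dict, ip: str) -> bool:
    """One filter that works for every vendor once events share a schema."""
    return event.get("sourceAddress") == ip


a = normalize("vendor_a", {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.10", "act": "deny"})
b = normalize("vendor_b", {"SourceIP": "10.0.0.5", "DestIP": "192.0.2.10", "Zone": "dmz"})

print(from_host(a, "10.0.0.5"), from_host(b, "10.0.0.5"))  # True True
print("deviceAction" in b, "Zone" in b)                    # False False
```

The second line of output shows both sides of the trade-off: vendor_b has nothing to put in the action field, and its Zone field has nowhere to go in this schema.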
A schema of 1,500 distinct fields is a lot. Many complain that ArcSight’s schema of over 450 fields is way too much. Add the fact that field variations in a large schema cause filter and rule conditions to expand, and it becomes so difficult to write security content that applies across multiple vendors that your solution falls apart.
The point of all this is that normalization is a non-trivial task. With a fixed, limited schema, it is often impossible to map all the data in some events completely. Even with extra custom fields, things can be left out.
The interesting part is that in some log messages, some of the data is possibly not security relevant. The result is that the parser author needs to figure out which data in that particular message must be translated and mapped into the schema. This can be a tricky process, as Ken Tidwell pointed out in the comments of my last post. I like his thought, “Normalization involves some degree of reducing the data to a least common denominator.” He made other good points, which I might expand on in the future.
But even in the early days of ArcSight, there were two ways to preserve data that simply could not fit into the schema. The first is the additional data fields. Granted, there are short-term problems with this solution. The second is the option of storing the raw message (yes, there is a raw message field in the ArcSight schema!).
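Both safety nets can be sketched together. This is an illustration of the idea only, not how ArcSight implements it: the two-field schema map, the `additionalData` and `rawEvent` key names as used here, and the sample message are assumptions.

```python
# Hypothetical, deliberately tiny schema map for illustration.
SCHEMA_MAP = {"src": "sourceAddress", "dst": "destinationAddress"}


def normalize_preserving(raw_message: str, parsed: dict) -> dict:
    """Map known fields into the schema; keep everything else plus the raw message."""
    event = {"rawEvent": raw_message, "additionalData": {}}
    for key, value in parsed.items():
        if key in SCHEMA_MAP:
            event[SCHEMA_MAP[key]] = value
        else:
            # No schema slot for this field, so park it instead of dropping it.
            event["additionalData"][key] = value
    return event


raw = "src=10.0.0.5 dst=192.0.2.10 tunnel_id=42"
parsed = dict(pair.split("=", 1) for pair in raw.split())
event = normalize_preserving(raw, parsed)

print(event["additionalData"])   # {'tunnel_id': '42'}
print(event["rawEvent"] == raw)  # True
```

Nothing is lost in this form: the unmapped field remains queryable by name, and the raw message is still there if the parse itself later turns out to be wrong.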
In general, normalization (and parsing) has many potential sources of error beyond having more security log data fields than a given schema can represent. It is possible to swap the source and destination information (human error). It’s also possible that, for one organization, a piece of data mapped to a custom schema field is less significant than one thrown into an additional data field.
Developers, parser writers, security engineers, and analysts are (typically) human, and can make mistakes. But there are more issues than parser and normalization errors.
Security device and application vendors are also not static. They patch and update their systems. They change how they log. They add, remove, or modify fields in their logs.
Analysts, security engineers, developers & parser writers learn more about systems and their logs, and what they mean, as time advances. A message that was “properly parsed” a few years or months ago may be discovered to be no longer correct today, based on insights gained from investigation of an event. This can possibly include messages generated by security vendors and output via CEF, LEEF, or even XML, JSON, or AVRO.
These can be compelling reasons to ingest only the raw log messages. But, again, you are going to need to parse and normalize that data to work with it effectively and efficiently. Sure, if you do that after you have stored the data, you must parse and normalize it again each time you work with or query it. If you do not store it normalized, you could lose historical information, such as the collection process, time stamps, and how the messages were originally parsed and normalized. If you store it normalized, there could be normalization errors in your data, but there are ways of getting around that, too.
I will get deeper into when it is best to normalize in a later post.
This is my second post on Security Data as part of my Normalized or Raw? blog series.