
What to do about so many archive file formats?

by   in Unstructured Data Analytics

There are always more Archives

Let's first learn more about what archive files are.

The most basic goal of archive files is to encapsulate folders and files into a single container, making it easier to move a collection of files around. In the early days of computing, storage was expensive and networking was slow, so compression algorithms were invented to reduce the size of data that is stored or transmitted. Many compression algorithms have been developed over time, some faster or slower than others and some compressing better or worse than others. Archive files eventually incorporated compression algorithms, resulting in greatly reduced file sizes. Despite vast technological improvements to storage and networking, data growth has arguably outpaced the infrastructure, resulting in a continued reliance on compressed archive files.

As with virtually all computing technology, archive files have evolved over time. Proprietary, open-source and hybrid solutions have been created. Let’s learn more about a few of these solutions.

In the late 1980s, PKWARE introduced and popularized the ZIP file format, initially for MS-DOS. Since then, software for creating and consuming ZIP files has been developed for virtually every popular operating system and enhanced with alternate compression algorithms and other valuable features like encryption, password protection and self-extracting files. You may be surprised to learn that ZIP files support nearly 20 compression algorithms as well as multiple encryption methods.

On Unix and Linux, archive programs like ar, shar and tar have been part of standard distributions for decades. These programs focus on collecting separate folders and files into a single file. As with ZIP, this makes it easy to copy and distribute many files. Other utilities like compress and gzip were introduced independently to compress entire tar (and similar) files, producing .tar.Z, .tar.gz and eventually .tgz.
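The two-step "archive first, compress second" idea behind .tar.gz can be sketched with Python's standard library. This is only an illustration under simple assumptions: the file names and contents are made up, and everything happens in memory rather than on disk.

```python
import gzip
import io
import tarfile


def make_tar(files: dict[str, bytes]) -> bytes:
    """Bundle in-memory files into an uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical sample content, chosen to be repetitive so it compresses well.
files = {"docs/a.txt": b"hello " * 500, "docs/b.txt": b"world " * 500}

tar_bytes = make_tar(files)           # step 1: collect files, no compression
tgz_bytes = gzip.compress(tar_bytes)  # step 2: compress the whole archive

# Reading reverses both steps; mode "r:gz" handles them in one call.
with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
    names = tar.getnames()

print(len(tar_bytes), "bytes as .tar,", len(tgz_bytes), "bytes as .tar.gz")
print(names)
```

Because the compressor sees the archive as one stream, it knows nothing about the individual files inside. This is a key difference from ZIP, where each member is compressed separately.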

macOS and its predecessors, Commodore, OS/2 and mainframes all have (or had) their own popular archive utilities and compression algorithms.

In the early days of archive files, one of the main reasons for creating alternate compression algorithms was the patent on the LZW algorithm and the accompanying threat of litigation or licensing fees. It also turns out that there is more than one way to compress data. Some algorithms compress faster and some slower; some produce smaller files than others. Thanks to ongoing innovation by computer scientists, there are now many compression algorithms available to make archive files smaller.
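To get a feel for the speed versus size trade-off, here is a small sketch that runs the same input through three compressors that ship with Python (zlib for DEFLATE, bz2 and lzma). The sample text is an arbitrary assumption, and the timings will vary by machine and by input, so treat the numbers as illustrative only.

```python
import bz2
import lzma
import time
import zlib

# Arbitrary, highly repetitive sample data (an assumption for the demo).
data = b"The quick brown fox jumps over the lazy dog. " * 5000

codecs = {
    "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Every algorithm must give back exactly the original bytes.
    assert decompress(packed) == data
    results[name] = (len(packed), elapsed)

print(f"original        {len(data):8d} bytes")
for name, (size, elapsed) in results.items():
    print(f"{name:15s} {size:8d} bytes  {elapsed * 1000:7.1f} ms")
```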

The 1980s and early 1990s introduced the ARC (from SEA), ZOO, LHarc and ARJ archivers, primarily to circumvent the LZW algorithm patents. The 1990s brought us WinRAR, Ark, 7-Zip, and PowerArchiver. Many other open-source and proprietary solutions were introduced in the following decades. And maybe to your surprise, many of these are still in wide use today. See the following Wikipedia articles (Archive File, Comparison of file archivers and List of archive formats) for more insights into the many variations of archive files.

It's fair to say that innovation, the desire to avoid litigation and licensing fees, the passage of time and the variety of operating systems have together produced dozens and dozens of archive file formats. It is reasonable to expect these drivers to continue, resulting in more compression algorithms and even more archive file formats.

There is always residual risk

Risks with archive files

The risk profile for archive files has been and continues to be something that cannot be ignored. The theft of large amounts of sensitive data is easier than ever.   An internet search for ‘zip file malware’ will return millions of results about trojans, ransomware and other forms of attacks and vulnerabilities.

The foundation for this risk lies in the inherent capabilities of archive files: they encapsulate many files into one container, they apply compression and encryption algorithms that reduce file size and obfuscate the contents, and they offer features like self-extraction and alternate data streams.

With Microsoft's May 2023 announcement that Windows 11 will add native support for RAR, gzip, 7-Zip, tar and other archive files, the use of a greater variety of archive formats will increase. Also in May 2023, Microsoft announced that many of its applications and services will scan password-protected ZIP files for malware. The risks and threats are real and plentiful.

How to address the challenges of archive file formats?

For solutions like OpenText IDOL KeyView, which are responsible for processing file formats and doing so consistently across multiple operating systems, the vast array of archive file formats and the evolving assortment of compression algorithms, encryption and other archive file features is a great challenge. The most basic processing capabilities for archive files are:

  • Distinct identification of the format
  • Complete extraction of the contents
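A common first step in format identification is to look at the leading "magic" bytes of a file. The sketch below is a deliberately simplified illustration and not how KeyView works: the function name `sniff_format` and the signature table are assumptions made for this example, and formats whose signature is not at offset 0 (tar, for instance) are left out.

```python
import bz2
import gzip
import io
import lzma
import zipfile

# Leading signature bytes for a few well-known formats.
SIGNATURES = [
    (b"Rar!\x1a\x07\x01\x00", "RAR 5"),
    (b"Rar!\x1a\x07\x00", "RAR 4"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "ZIP"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
]


def sniff_format(header: bytes) -> str:
    """Return a format name based only on the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


# Build a few small samples in memory to try the function on.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

samples = {
    "zip": zip_buf.getvalue(),
    "gzip": gzip.compress(b"hello"),
    "bzip2": bz2.compress(b"hello"),
    "xz": lzma.compress(b"hello"),
    "text": b"just some text",
}
detected = {label: sniff_format(data) for label, data in samples.items()}
print(detected)
```

Magic bytes alone only get you so far. They cannot tell a Word document from a plain ZIP file, and they are fooled by self-extracting archives that start with an executable header.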

The seemingly simple task of distinctly identifying a specific archive format (or really any file format) is complicated by the fact that many totally unrelated file formats, such as those for business intelligence (BI) and office productivity applications from both commercial vendors and open-source projects, use ZIP as their container. For the curious, use your favorite archive application (e.g. 7-Zip) to open one of these files and see what happens. Microsoft Word/PowerPoint/Excel (docx, pptx, xlsx), iWork Pages/Numbers/Keynote, OpenOffice Text/Spreadsheet/Presentation (odt, ods, odp), Power BI (PBIX) and Tableau (TWBX) are some common examples that use ZIP. To address this challenge, KeyView continues to be enhanced to recognize file formats automatically and intelligently, including those that leverage ZIP as their container. Accurate format detection allows applications to route files to the proper workflow and to extract metadata, text and sub-files as needed.
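One way to tell ZIP-based formats apart is to peek at the member names inside the container. The following is a minimal sketch of that idea, not KeyView's detection logic. The function `classify_zip_container`, the marker tables and the tiny stand-in files are all assumptions for illustration; real office documents contain many more parts.

```python
import io
import zipfile

# OpenDocument files store their type in an uncompressed "mimetype" member.
ODF_MIMETYPES = {
    "application/vnd.oasis.opendocument.text": "OpenDocument Text (odt)",
    "application/vnd.oasis.opendocument.spreadsheet": "OpenDocument Spreadsheet (ods)",
    "application/vnd.oasis.opendocument.presentation": "OpenDocument Presentation (odp)",
}
# Office Open XML packages carry "[Content_Types].xml" plus a typed folder.
OOXML_FOLDERS = {"word/": "Word (docx)", "xl/": "Excel (xlsx)", "ppt/": "PowerPoint (pptx)"}


def make_zip(members: dict[str, bytes]) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def classify_zip_container(data: bytes) -> str:
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a ZIP container"
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "mimetype" in names:
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
            if mime in ODF_MIMETYPES:
                return ODF_MIMETYPES[mime]
        if "[Content_Types].xml" in names:
            for folder, label in OOXML_FOLDERS.items():
                if any(name.startswith(folder) for name in names):
                    return label
    return "plain ZIP archive"


# Tiny stand-ins that only mimic the layout of the real formats.
docx_like = make_zip({"[Content_Types].xml": b"<Types/>", "word/document.xml": b"<doc/>"})
odt_like = make_zip({"mimetype": b"application/vnd.oasis.opendocument.text", "content.xml": b"<x/>"})
plain = make_zip({"notes.txt": b"hello"})

for sample in (docx_like, odt_like, plain, b"not an archive"):
    print(classify_zip_container(sample))
```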

Self-extracting archives, such as those from PKZIP, 7-Zip, WinRAR and others, also challenge format identification. Are they executables? Or are they archive files? Both answers are correct. But ordinary executable files usually have little information to extract, so the better answer is “archive file”, with the best answer being the “specific flavor of archive file”. This enables applications for use cases like security, enterprise search, eDiscovery and others to easily and accurately process and analyze archive files. KeyView identifies self-extracting ZIP, 7z and RAR 4 / 5 files as the appropriate archive format and allows for sub-file extraction.
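The "executable or archive?" question can be illustrated with a toy example. A self-extracting ZIP is essentially an executable stub with a ZIP archive appended, and ZIP readers locate the archive from the end of the file. In the sketch below the stub is a hypothetical placeholder ("MZ" plus padding), not a real program, and `describe` is a name invented for this example.

```python
import io
import zipfile


def make_zip(members: dict[str, bytes]) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def describe(data: bytes) -> tuple[str, list[str]]:
    """Report the most specific label we can, plus any sub-file names."""
    looks_executable = data[:2] == b"MZ"  # DOS/Windows executable header
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            kind = "self-extracting ZIP" if looks_executable else "ZIP"
            return kind, zf.namelist()
    return ("executable" if looks_executable else "unknown"), []


# Placeholder stub: a real one would be a complete extractor program.
fake_stub = b"MZ" + b"\x00" * 1022
plain_zip = make_zip({"readme.txt": b"payload"})
sfx_like = fake_stub + plain_zip

print(describe(fake_stub))  # header says executable, nothing to extract
print(describe(plain_zip))  # ordinary archive
print(describe(sfx_like))   # both at once, so report the archive flavor
```

A tool that stops at the first two bytes would call `sfx_like` an executable and never look at the files inside it.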

Extracting sub-files from an archive file requires knowledge of at least two key things:

  • the overall specifications that define the structure of a given archive file
  • the compression algorithms applied to its contents

As we learned earlier, many dozens of distinct archive formats have been created over the years, with many supporting multiple compression methods, which makes extraction hard. ZIP, the most widely used archive file format across all operating system platforms, supports nearly 20 different compression algorithms. Some compression methods are more widely used than others, but that doesn't make zstd, PPMd or WavPack any less important. One of the most popular open-source archive file formats is 7z, first implemented by the 7-Zip archiver. 7z is modular by design, supporting any compression, conversion, or encryption method. 7z files officially support 5 different compression methods (LZMA, LZMA2, Bzip2, PPMd and DEFLATE) and a few others unofficially.

While supporting all compression algorithms, and thus enabling sub-file extraction, is the goal, identifying when a compression method is not supported is also critical. This allows the calling application to clearly flag such files, in case a malicious actor is attempting to hide something. In the rare case that this situation occurs, KeyView returns error code 43 - KVError_CompressionNotSupported. For ZIP files, KeyView's most recent addition was support for the Zstandard (zstd) compression method, increasing KeyView’s sub-file extraction capabilities.
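Flagging members that use an unsupported compression method can be sketched with Python's zipfile module, which records each member's method ID from the ZIP headers. This is an illustration of the general idea only, not the KeyView API. The function `find_unsupported` and the pretend "stored only" extractor are assumptions; the method IDs in the table follow the PKWARE ZIP specification.

```python
import io
import zipfile

# A few compression method IDs defined by the ZIP specification.
METHOD_NAMES = {
    0: "stored",
    8: "deflate",
    9: "deflate64",
    12: "bzip2",
    14: "lzma",
    93: "zstd",
    97: "wavpack",
    98: "ppmd",
}


def find_unsupported(data: bytes, supported: set[int]) -> list[tuple[str, str]]:
    """List (member name, method) pairs the caller cannot decompress."""
    flagged = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.compress_type not in supported:
                label = METHOD_NAMES.get(
                    info.compress_type, f"method {info.compress_type}"
                )
                flagged.append((info.filename, label))
    return flagged


# Sample archive with one stored member and one deflated member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"plain", compress_type=zipfile.ZIP_STORED)
    zf.writestr("b.txt", b"squeezed " * 100, compress_type=zipfile.ZIP_DEFLATED)
sample = buf.getvalue()

# Pretend our extractor only understands uncompressed (stored) members.
print(find_unsupported(sample, supported={zipfile.ZIP_STORED}))
# With deflate support added, nothing is flagged.
print(find_unsupported(sample, supported={zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}))
```

The point of reporting the member name and method, rather than silently skipping the member, is that the calling application can then decide how to treat a file it could not fully inspect.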

Encrypted and password-protected archive files add another wrinkle to both the detection and sub-file extraction processes. Unfortunately, encryption and password protection are excellent ways to hide sensitive information. To combat this risk, KeyView reports encryption / password-protected status at detection time, when possible, and always when sub-file extraction is attempted. KeyView also supports the processing of password-protected files as summarized here (for the 23.2 release) and rights-managed files as summarized here (for the 23.2 release). KeyView also supports the detection of many other encrypted formats as enumerated here (for the 23.2 release).
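For ZIP specifically, encryption status can often be read from the headers without attempting extraction: bit 0 of each member's general purpose bit flag marks it as encrypted. The sketch below shows that check with Python's zipfile module. It is an illustration, not KeyView's implementation. Because the standard library cannot write encrypted ZIP files, the sample is an ordinary archive with the flag bit patched into the central directory, which is purely a test fixture trick (the member data is not actually encrypted).

```python
import io
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general purpose bit flag


def encrypted_members(data: bytes) -> list[str]:
    """Names of members whose header marks them as encrypted."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.flag_bits & ENCRYPTED_FLAG
        ]


def make_plain_sample() -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("public.txt", "nothing to see")
    return buf.getvalue()


def make_flagged_sample() -> bytes:
    # Fixture hack: set the "encrypted" bit in the central directory entry.
    raw = bytearray(make_plain_sample())
    pos = raw.find(b"PK\x01\x02")   # central directory header signature
    raw[pos + 8] |= ENCRYPTED_FLAG  # low byte of the flag field
    return bytes(raw)


print(encrypted_members(make_plain_sample()))
print(encrypted_members(make_flagged_sample()))
```

Other formats need different checks. Some, for example, can encrypt the file names as well as the contents, in which case not even a listing is available without the password.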

Some archive file formats support seemingly obscure features like alternate data streams. Alternate data streams (ADSs) are part of NTFS and other common file systems. ZIP and RAR (both v4 and v5 versions) are two examples of archive formats that allow for the capture of alternate data streams. What is important to know about alternate data streams in the context of archive files is that they are another way to hide sensitive data and malware. See this article for more insights into ADS and security vulnerabilities. KeyView mitigates these risks by supporting the detection and extraction of alternate data streams in ZIP and RAR files.

Next steps

Security and legal solutions are designed to prevent malicious actors from hiding or stealing sensitive data.  Knowledge and document management solutions are designed to share information.  These use cases require the use of the most capable file format identification and content extraction solution available - OpenText IDOL KeyView.

Beware: just because a file processing solution claims to support ZIP, RAR, 7z or any other archive file format does not mean that all forms of compression or other available features are fully supported. You can trust that OpenText IDOL KeyView is doing this hard and time-consuming work for you when it comes to handling archive files and other supported file formats.

You can learn more about OpenText IDOL KeyView.  The KeyView SDK documentation is available here.
