Why are there so many file formats?
It should not be a surprise to hear that many tens of thousands of applications have been created over the decades that computers have been around. While some of these applications are obsolete because they depend on operating systems, hardware, or both that are no longer available, some of these old applications live on. Even when we focus only on active operating systems, there are still thousands of applications to consider. Some of these applications use the same file format to store data. However, many employ a proprietary file format optimized for that application and use case. To facilitate data interchange, standards bodies have also introduced file formats. It is easy math to conclude that thousands of file formats are in active use today.
And given that new applications are continuously created, the number of file formats will continue to grow.
Some of these thousands of applications and file formats are universally used. PDF (Portable Document Format) is one obvious example; the Microsoft Office file formats (docx, pptx, xlsx) are another. In the world of images, almost everyone has heard of JPEG files. And everyone who has used a computer has surely used a Zip file.
So, some file formats are more widely used than others. However, if one looks at file formats from different perspectives like geography, operating system, industry, or use case, then the definition of popular changes.
If you are doing word processing, presentations, or spreadsheets and are located in South Korea, you are very likely using Hancom Office. In China, you are likely using Yozosoft Office or other office suites that use the Uniform Office Format. In some parts of the world, OpenOffice is more widely used than Microsoft Office. If you are an Apple macOS or iOS user, then it is likely that you use the Apple iWork suite some of the time. There are even more office suites, like SoftMaker Office, Ability Office, and JustSystems Ichitaro, to name a few.
In the world of computer-aided design (CAD) and computer-aided manufacturing (CAM), Autodesk's AutoCAD DXF and DWG are widely popular. There are also many other Autodesk products, like Revit, Inventor, and 3ds Max, each with their own native and extended file formats. In aerospace, CATIA is widely used. For 3D CAD, SketchUp is popular for interior design, industrial design, and landscape architecture. A search for '3d printing file formats' yields results like "9 Most Common 3D Printing File Types" and "3d Printing File Formats: Everything you need to know". In the CAD / CAM industry, hundreds of file formats are regularly used.
Life Sciences (including Pharmaceutical, Biotechnology and Medical Devices) and Healthcare have numerous file formats unique to these industries. For medical images, DICOM is widely used, but there are others like ECAT PET and BioRad confocal image. SAS, with its multiple formats, and SPSS are de facto standards for statistical analysis. The National Center for Biotechnology Information (NCBI) helped create the GenBank file format for storing DNA sequence data. There are easily many dozens of file formats used exclusively within Life Sciences.
Legal and law enforcement have their own collection of widely used file formats. For eDiscovery, PST, EML and MSG are household names. For evidence collection, the Expert Witness Compression format and AccessData Logical Image (AD1) are de facto standards. The Legal Electronic Data Exchange Standard (LEDES) has a collection of file formats supporting billing, budgeting and timekeeping. Another standards body (OASIS) created the XML Common Biometric Format (XCBF) for storing DNA, fingerprints, iris scans and the like. Again, there are many dozens of file formats unique to and widely used within legal firms and law enforcement.
Financial services, including banking and insurance, are no exception. Well-known applications like Quicken for personal banking and QuickBooks for bookkeeping represent data in a handful of different file formats. For data interchange, EDIFACT and X12-encoded documents and Open Financial Exchange XML are widely used in ERP, financial exchanges, international banking, global trade and insurance.
If you are familiar with GIS (Geographic Information Systems), then you know that there are multiple file types used by the ESRI platform. Behind the scenes of GPS-based navigation systems are formats like OpenStreetMap XML data and Protocolbuffer Binary Format (PBF). There are, of course, some now obsolete formats like Microsoft Streets & Trips map (EST), Microsoft Pocket Streets map (MPS) and Microsoft MapPoint (PTM). And, as in other industries, data interchange formats exist, like GPS Exchange Format. If you are familiar with self-driving vehicles and LIDAR, you may (or may not) have heard of the LAS LIDAR file. The Mapsforge Binary Map File Format (MAP) was designed for map rendering on devices with limited resources, like mobile phones. Again, many dozens of file formats have been created for GIS use cases.
One's choice of operating system often dictates what is popular and what is fringe. When it comes to archive file formats, DMG and StuffIt are common on macOS. On Windows, CAB is common. On Linux, gzip and tar are typical. PKZip (aka Zip) is used everywhere. You may have also heard of or used WinRAR RAR and 7-Zip files. If you're a Windows 11 user, you will soon become more familiar with most of these formats when native support is released. In the article "What to do about so many archive file formats", you can learn more about the challenges with handling archive file formats like ZIP and the dozens of variations.
If you've used a digital camera, scanned a document, taken a screenshot on your computer or done web page design, you know there are many file formats for images and video. JPEG, TIFF, BMP, GIF, PNG and WebP are just a few examples of raster graphics image formats. However, if you've used applications like Photoshop, GIMP or Paintshop Pro, you'd know that each of these has its own file format (PSD, XCF, PSP). In a search for simpler and smaller files for representing video, animated GIF and animated PNG were introduced and are widely used. If you are an advanced digital camera user, you may be familiar with RAW file formats. Vector graphics have their own file formats, such as SVG, WMF, EPS, and CDR. When we extend this topic to audio and video file formats, the same complexity exists with WAV, WMA, MP3, OGG, FLAC, and M4A for audio and WMV, MPG, MP4, and FLV for video, to name a few.
Again, hundreds of file formats are used to represent images, audio, and video data.
It is fair to say that what is popular or mainstream depends on what applications are common in your industry, location, and operating system and, more simply, on your use case. The thousands of applications supporting these use cases have resulted in thousands of file formats.
What are the risks?
A consequence of the thousands of applications and file formats is risk, with information theft and malware front of mind. These risks can ruin an organization's reputation, shut down operations and/or result in fines and lawsuits.
Unfortunately, it does not take much research to identify exploited vulnerabilities tied to specific file formats. Here are a few noteworthy exploitations:
- Microsoft Temporarily Disables SketchUp Support After Discovery of 117 Vulnerabilities – November 2023
- Unauthorized Access to Okta's Support Case Management System: Root Cause and Remediation – November 2023
- Critical libwebp Vulnerability Under Active Exploitation - Gets Maximum CVSS Score – September 2023
- WinRAR 0-day that uses poisoned JPG and TXT files under exploit since April – August 2023
- Using Alternate Data Streams in the Collection and Exfiltration of Data – September 2022
- 7-Zip used by attackers to prepare data for exfiltration – March 2021
- Theory Behind Hiding Zipped File Under Jpg Image – February 2009
In case you are not keeping score, there have been at least four different exploited vulnerabilities tied to file formats in 2023 alone. One can learn more about these and other risks in the National Vulnerability Database (NVD), the list of Common Vulnerabilities and Exposures (CVE), and the many publications focused on security and the IT industry.
What can be done to manage these risks?
Knowledge is power when managing the risks associated with file formats.
This starts with accurate and precise file format detection that employs multiple techniques and that:
- avoids false positives,
- recognizes that some file formats (e.g., PKZip, XML, OLE) are widely used containers,
- understands that file extensions are just a naming convention,
- and handles the other challenges inherent in file format detection.
You may be surprised to learn that some solutions to file format detection rely on the file extension - the handful of characters like .zip, .docx or .pdf appended to file names. Anyone who has sent an email in the past decade knows that renaming a file's extension is an excellent way to trick some security systems into allowing the transmission of your email. Those familiar with file formats also know that it is common for vastly different applications, and thus file formats, to share the same file extension - .DOC is used by at least ten different applications. For multiple reasons, any reliance on the file extension (aka a naming convention) is guaranteed to be wrong some percentage of the time.
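As a rough illustration of how little the extension guarantees, the sketch below compares a file name's extension with the leading bytes of the file's content. The signature table is a small, illustrative selection of well-known values, and the function name extension_matches_content is hypothetical; it is not how any particular product performs detection.

```python
import os

# Leading-byte signatures for a few well-known formats.
# Illustrative and far from exhaustive.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".zip": b"PK\x03\x04",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".gif": b"GIF8",
    ".jpg": b"\xff\xd8\xff",
}


def extension_matches_content(filename: str, data: bytes):
    """Return True or False depending on whether the content starts with the
    signature expected for the extension, or None if the extension is unknown."""
    ext = os.path.splitext(filename)[1].lower()
    expected = SIGNATURES.get(ext)
    if expected is None:
        return None
    return data.startswith(expected)


# A ZIP archive renamed to look like a PDF is reported as a mismatch.
renamed = extension_matches_content("report.pdf", b"PK\x03\x04rest-of-archive")
genuine = extension_matches_content("report.pdf", b"%PDF-1.7\n")
print(renamed, genuine)
```

A mismatch here only says that the name and the content disagree; it does not say what the file actually is.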
Better solutions use magic bytes at the beginning of a file to perform format detection. However, while a few bytes at the beginning of a file are a better signal than the file extension, magic bytes are not always sufficient. Our friend PKZip uses \x50\x4b\x03\x04 (P K ETX EOT) as its magic bytes. Unfortunately, dozens of file formats use the PKZip format as their container. Business Intelligence file formats use PKZip. Many Office Suite formats use PKZip. Some CAD formats use PKZip. Reliance on just the magic bytes will result in misidentification and extra downstream processing costs. Therefore, deeper inspection is required to uniquely identify the many file formats that use the PKZip container. XML is another widely used container format, found across industries such as life sciences, legal, financial services, and GIS, resulting in dozens of unique file formats based on the XML standard. Magic bytes are simply not unique enough to identify many file formats precisely and accurately. Identification therefore requires deeper inspection, using other unique file characteristics, to avoid the consequences of false identification.
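To make the container problem concrete, here is a minimal sketch, using only the Python standard library, that first checks the PKZip magic bytes and then looks inside the archive for member names that distinguish a few ZIP-based formats. The marker table, the helper make_zip and the function name identify_zip_based_format are assumptions made for illustration; real detection needs far more signals than this.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"

# Member names that distinguish a few ZIP-based office formats.
# Illustrative only; many other ZIP-based formats exist.
MARKER_MEMBERS = {
    "word/document.xml": "DOCX",
    "xl/workbook.xml": "XLSX",
    "ppt/presentation.xml": "PPTX",
}


def identify_zip_based_format(data: bytes) -> str:
    """Classify a byte string by looking past the ZIP magic bytes."""
    if not data.startswith(ZIP_MAGIC):
        return "not a ZIP container"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
            for member, label in MARKER_MEMBERS.items():
                if member in names:
                    return label
            # OpenDocument and EPUB files carry a "mimetype" member
            # whose content names the format.
            if "mimetype" in names:
                return archive.read("mimetype").decode("ascii", "replace").strip()
    except zipfile.BadZipFile:
        return "ZIP magic bytes, but not a readable archive"
    return "generic ZIP archive"


def make_zip(members: dict) -> bytes:
    """Build a small in-memory archive for demonstration purposes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buf.getvalue()


docx_like = make_zip({"[Content_Types].xml": "<Types/>", "word/document.xml": "<doc/>"})
plain_zip = make_zip({"notes.txt": "hello"})
print(identify_zip_based_format(docx_like))
print(identify_zip_based_format(plain_zip))
```

Both sample archives start with the same four magic bytes, yet only a look at their members tells them apart. The same idea applies to XML, where the root element and namespace, rather than the first few bytes, indicate the actual format.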
Risk mitigation does not end with precise and accurate file format detection.
Knowledge of the file format enables the extraction of the file's contents like metadata, text and sub-files. This allows for content-based inspection, classification, retrieval, and analytics within a file and over a collection of files. In the world of data loss prevention (DLP) and compliance with data privacy regulations like GDPR and CCPA, the combination of file format type and content-based inspection enables safeguarding regulated data and, more broadly, intellectual property protection. For eDiscovery, complete retrieval of responsive content is required to avoid court sanctions, to manage case strategy and to do so in a cost-effective way by performing analysis on only what's responsive. Many countries have freedom of information laws like FOIA in the US or UK requiring disclosure of information by public authorities where missing content has consequences. More broadly, any information retrieval and analytics solution is only as smart as the quality of its inputs - garbage in, garbage out (GIGO) comes to mind.
To help our technology partners manage these risks, OpenText's IDOL KeyView uses a combination of techniques to achieve accurate and precise file format identification. Depending on the file format, a combination of metadata (standard and custom), text (visible and hidden) and sub-file extraction operations are available. As of the KeyView 23.4 release, nearly 2000 file formats are supported. You can learn more about the Power of KeyView and can also access the IDOL KeyView documentation.
According to Merriam-Webster, the definition of the word fringe is "something that is marginal, additional or secondary to some activity, process or subject". In the context of file formats, it's fair to say a file format is fringe until it is not. Malicious actors continue to find new file formats as vectors for malware and information theft. The use case (geography, industry, platform) also determines what is popular and what is fringe.
We’d love to hear your thoughts on this blog. Comment below.
The OpenText Analytics & AI team