How does StarTeam use MD5 checksums to identify unique files in the vault?

Problem:

How does StarTeam use MD5 checksums to identify unique files in the vault?

Resolution:


  • Product Name: StarTeam
  • Product Version: All
  • Product Component: MD5 Checksums
  • Platform/OS Version: N/A


StarTeam uses MD5 checksums to identify files in the vault and to determine whether a file being added to a StarTeam repository already exists in the vault. This article describes how this is done in the StarTeam SDK and Server. It also describes the StarTeam server options that perform additional inspection of files, for sites that decide MD5 checksums alone are not a reliable enough test of uniqueness. The additional inspections cost more in both computation and bandwidth. Under most circumstances this is not a worthwhile tradeoff, so Borland does not recommend them.

It's important to note that no two "natural" documents have ever been found to produce the same MD5 checksum value. Furthermore, given an MD5 checksum, no one has been able to produce a "fake" document that generates that same MD5 checksum, except for well-known values generated from very short documents (a few bytes). With considerable computational effort, some researchers have been able to produce two fake documents that generate the same MD5 checksum. They've also been able to produce scenarios where two different fake documents can be prepended to a real document so that the two aggregate documents produce the same MD5 checksum value.
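To give a sense of scale for accidental matches between natural documents, the estimate below uses the standard birthday-bound approximation. It assumes MD5 output behaves like a uniformly random 128-bit value for non-adversarial input. That is an assumption, and it says nothing about deliberately constructed collisions. The function name is illustrative only.

```python
import math

MD5_BITS = 128  # an MD5 digest is 128 bits long


def accidental_collision_probability(num_files: int, bits: int = MD5_BITS) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes digests are uniformly random, i.e. no one is crafting collisions.
    """
    exponent = -num_files * (num_files - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        p = accidental_collision_probability(n)
        print(f"{n:>17,} distinct files: p ~ {p:.3e}")
```

Under this assumption, even a vault holding a billion distinct files has a vanishingly small chance of containing two that share an MD5 by accident.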

As described in an article warning against relying on MD5 checksums for web certificate authenticity, this weakness could be exploited with considerable computational effort. For example, with manufactured certificates it is possible to create a signed document that appears authentic but in fact is not. Despite the computational effort required to produce fake certificates, MD5 is no longer recommended for certificates, and stronger hash functions from the SHA-2 family (for example, SHA-256) are preferred. The fake certificate scenario does not apply to StarTeam because StarTeam does not use MD5 checksum values to sign messages.

StarTeam uses MD5 checksum values for one primary purpose: to identify duplicate content in a vault hive.

How does StarTeam use MD5 checksums

When files are added to a StarTeam repository, their MD5 checksum (hereafter referred to as "MD5") is calculated and compared to the MD5s of files already in the vault. If a matching MD5 is found (referred to in this article as an MD5 collision), the new file is assumed to be identical to the existing one, so the file already in the vault is referenced and the new file is not added to the vault. While MD5 checksums are considered sufficient for identifying unique files, additional inspection is possible when an MD5 collision is encountered while adding or checking in a file.
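The idea of storing content once per MD5 can be illustrated with a minimal sketch. This is not StarTeam code: `ToyVault` and its methods are hypothetical names, the "vault" is just an in-memory dictionary, and the digest is a plain MD5 of the raw bytes.

```python
import hashlib
from typing import Dict, Tuple


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of content as a 32-character hex string."""
    return hashlib.md5(content).hexdigest()


class ToyVault:
    """Hypothetical in-memory vault that keeps one archive per MD5."""

    def __init__(self) -> None:
        self.archives: Dict[str, bytes] = {}  # MD5 hex digest -> stored content

    def add(self, content: bytes) -> Tuple[str, bool]:
        """Return (digest, stored); stored is False when the MD5 already exists."""
        digest = md5_hex(content)
        if digest in self.archives:
            # Matching MD5: trust it and reference the existing archive.
            return digest, False
        self.archives[digest] = content
        return digest, True


if __name__ == "__main__":
    vault = ToyVault()
    print(vault.add(b"hello"))  # first add stores the content
    print(vault.add(b"hello"))  # second add only references it
```

Adding the same bytes twice leaves a single archive in the dictionary; the second call returns the same digest with `stored` set to `False`.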

StarTeam default behavior

This is the default and recommended setting for StarTeam server configurations.

  • RequireContentUpload is turned off
  • VerifyOnMD5Collision is irrelevant

For each file to be added, the SDK computes the file's canonical MD5 and sends the list of MD5s to the server. The server checks each MD5 against all hives to see if that MD5 is already stored. It then tells the SDK which MD5s are already present and which hive each one is stored in. The SDK then sends the content only for files whose MD5s are not already stored in the vault. Finally, the SDK performs the check-in operation, which simply creates the new database records and points each one at the appropriate archive file in the vault.

Using this process means that the content of files at the client is never compared to the content of files stored on the server. StarTeam trusts that when a client computes the MD5, if it matches the MD5 of a file already stored in the vault it is the same file. In short, there is no further inspection after an MD5 collision is detected.
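The default exchange described above can be sketched as follows. This is an illustration only, not StarTeam SDK or server code: `ToyServer`, `locate`, `store`, `client_add`, and the `target_hive` parameter are hypothetical names, and how the real server chooses a hive for new content is not covered by this article.

```python
import hashlib
from typing import Dict, Iterable


class ToyServer:
    """Hypothetical server holding several hives, each mapping MD5 -> content."""

    def __init__(self, hive_names: Iterable[str]) -> None:
        self.hives: Dict[str, Dict[str, bytes]] = {name: {} for name in hive_names}
        self.bytes_received = 0  # counts content uploaded by clients

    def locate(self, digests: Iterable[str]) -> Dict[str, str]:
        """Return {digest: hive name} for every digest already stored in any hive."""
        found: Dict[str, str] = {}
        for digest in digests:
            for name, archives in self.hives.items():
                if digest in archives:
                    found[digest] = name
                    break
        return found

    def store(self, digest: str, content: bytes, hive_name: str) -> None:
        self.bytes_received += len(content)
        self.hives[hive_name][digest] = content


def client_add(server: ToyServer, files: Dict[str, bytes], target_hive: str) -> Dict[str, str]:
    """Send MD5s first, upload only unknown content, return {file name: hive}."""
    digests = {name: hashlib.md5(data).hexdigest() for name, data in files.items()}
    present = server.locate(digests.values())
    placement: Dict[str, str] = {}
    for name, data in files.items():
        digest = digests[name]
        if digest not in present:
            # Only content whose MD5 is unknown to the server is transmitted.
            server.store(digest, data, target_hive)
            present[digest] = target_hive
        placement[name] = present[digest]
    return placement
```

Note that the server in this sketch never sees the bytes of a file whose MD5 it already holds, which is exactly why no content comparison can take place in the default configuration.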

Additional inspection after MD5 collision (NOT recommended by Borland)

The level of inspection after an MD5 collision is detected can be controlled using the server options "RequireContentUpload" and "VerifyOnMD5Collision". Both options are disabled by default. When enabled, they add computation and file transmission operations that have a noticeable impact on performance. Their use is not recommended, given the very small probability that a false MD5 collision will be encountered. The behavior of file adds and check-ins with regard to these options is:

  • RequireContentUpload is enabled and VerifyOnMD5Collision is off:

For each file to be added, the SDK sends the file's content to the server. The server computes each file's MD5 and searches all hives in the vault for that MD5. If it finds an existing archive with the same MD5, the server compares the lengths of the existing and new files, but it does not compare their contents. If the lengths are different, the server throws an "MD5 collision" exception. Otherwise, it discards the new content that it just received and tells the SDK that the content was already present and which hive it was found in. If the new file's MD5 was not found, the file is stored in a new archive file and the SDK is informed which hive was used. The rest of the add/check-in is then the same.

This process adds a small amount of inspection after an MD5 collision is detected, in that it compares file lengths. It is slower because file content is unconditionally sent to the server.
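A rough sketch of the server-side length check is shown below. Everything here is hypothetical: `server_add`, `MD5CollisionError`, and the dictionary standing in for the vault are illustrative names. The `digest_fn` parameter exists only so that a collision can be simulated with a deliberately weak function, since real MD5 collisions cannot be produced on demand.

```python
import hashlib
from typing import Callable, Dict


class MD5CollisionError(Exception):
    """Hypothetical stand-in for the server's "MD5 collision" exception."""


def md5_hex(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()


def server_add(
    archives: Dict[str, bytes],
    content: bytes,
    digest_fn: Callable[[bytes], str] = md5_hex,
) -> bool:
    """Return True if a new archive was stored, False if the upload was discarded."""
    digest = digest_fn(content)  # computed by the server from the uploaded bytes
    existing = archives.get(digest)
    if existing is None:
        archives[digest] = content
        return True
    if len(existing) != len(content):
        # Same digest, different length: the two files cannot be identical.
        raise MD5CollisionError(f"length mismatch for digest {digest}")
    # Same digest and same length: contents are NOT compared in this mode.
    return False
```

With a constant `digest_fn`, uploading `b"ab"` and then `b"abcd"` raises the exception, while uploading `b"ab"` and then `b"xy"` does not. Two different files of equal length still pass this check, which is the gap that byte-by-byte verification closes.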

  • RequireContentUpload and VerifyOnMD5Collision are both enabled:

For each file to be added, the SDK sends the file's content to the server. The server computes each file's MD5 and searches all hives in the vault for that MD5. If it finds an existing archive with the same MD5, the server performs a byte-by-byte comparison of the existing and new files. If they are different, the server throws an "MD5 collision" exception. Otherwise, it discards the new content that it just received and tells the SDK that the content was already present and which hive it was found in. If the new file's MD5 was not found, the file is stored in a new archive file and the SDK is informed which hive was used. The rest of the add/check-in is then the same.

This process has complete inspection after MD5 collision detection and as a result is even slower since file content is unconditionally sent to the server and duplicate MD5s are always compared byte-by-byte with existing archives.
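A byte-by-byte comparison of an uploaded file against an existing archive might look like the sketch below. It is a generic chunked file comparison, not the StarTeam server's implementation; `files_identical` and the chunk size are illustrative choices.

```python
import os


def files_identical(path_a: str, path_b: str, chunk_size: int = 64 * 1024) -> bool:
    """Compare two files byte for byte, reading them in fixed-size chunks."""
    # A cheap length check first avoids reading files that cannot match.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:
                # Both files reached end-of-file with no difference found.
                return True
```

The cost is proportional to the file size on every duplicate add or check-in, on top of the upload itself, which is consistent with the performance impact described above.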

Enabling these options is not recommended, given the very low probability of false MD5 collisions and the performance cost on every add and check-in operation. If, after weighing the potential benefits against that cost, you still choose to enable them, the following entries need to be added to the StarTeam server configuration file:





Old KB# 29341