Codepath

Checksums

A checksum is the value returned from a one-way hash algorithm. It can be used to verify the integrity of data, because modifying the data in any way will almost certainly change the value returned. For this reason, checksums are sometimes referred to as "fingerprints".
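
As a small illustration of the "fingerprint" idea, the sketch below hashes two strings that differ only in their last character, using PHP's built-in md5(). The sample strings and variable names are chosen purely for illustration.

```php
<?php
  $original = "Give me a checksum.";
  $modified = "Give me a checksum!"; // only the last character differs

  $originalSum = md5($original);
  $modifiedSum = md5($modified);

  // The two checksums do not match, which reveals the change.
  var_dump($originalSum === $modifiedSum);
  // bool(false)
```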

Checksums are often posted along with downloadable files. A user downloading the file can run the same algorithm on the file after download to confirm that it matches the file that was posted. If the file was altered, whether by a transmission error or by deliberate tampering, the checksum will no longer match.
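
The download check described above might look something like the following sketch. hash_file() and hash_equals() are PHP built-ins. The helper name checksum_matches() is hypothetical, and a temporary file stands in for the download so the example runs on its own.

```php
<?php
  // Hypothetical helper: returns true when the file at $path has the
  // expected checksum.
  function checksum_matches(string $path, string $expected, string $algo = 'sha1'): bool
  {
    $actual = hash_file($algo, $path);
    if ($actual === false) {
      return false; // file missing or unreadable
    }
    // Hex digests may be published in upper or lower case.
    return hash_equals(strtolower($expected), $actual);
  }

  // Stand-in for a downloaded file, so the sketch is self-contained.
  $path = tempnam(sys_get_temp_dir(), 'chk');
  file_put_contents($path, "Give me a checksum.");

  // Normally this value would be copied from the download page.
  $posted = sha1("Give me a checksum.");

  var_dump(checksum_matches($path, $posted)); // file is intact

  file_put_contents($path, "tampered", FILE_APPEND);
  var_dump(checksum_matches($path, $posted)); // file was changed

  unlink($path);
```

The comparison uses hash_equals() rather than ==, which avoids PHP's loose comparison rules being applied to strings that happen to look numeric.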

A checksum can also be used automatically during online communication. The sender sends the checksum of the data, then sends the data. The receiver can verify that the received data is complete and accurate by deriving its checksum and comparing it with the one the sender supplied.
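
A minimal sketch of that exchange, assuming a made-up message format in which an 8-character hexadecimal CRC-32 is placed in front of the data. The function names and the framing are hypothetical. Real protocols define their own layouts.

```php
<?php
  // Hypothetical framing: 8 hex characters of CRC-32, then the data.
  function send_with_checksum(string $data): string
  {
    return hash('crc32b', $data) . $data;
  }

  // Returns the data if its checksum matches, or null if it was damaged.
  function receive_with_checksum(string $message): ?string
  {
    $sent = substr($message, 0, 8);
    $data = substr($message, 8);
    $derived = hash('crc32b', $data);
    return hash_equals($sent, $derived) ? $data : null;
  }

  $message = send_with_checksum("Give me a checksum.");
  var_dump(receive_with_checksum($message) === "Give me a checksum.");

  // Simulate a transmission error in the last byte.
  $damaged = substr($message, 0, -1) . "?";
  var_dump(receive_with_checksum($damaged));
```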


An algorithm suitable for checksums needs to be fast and widely available, and must not require a key. It does not need to be cryptographically secure. There are three PHP functions which are suitable for checksums: crc32(), md5(), and sha1(). bcrypt is a poor fit for checksums: it is much slower than the alternatives by design, and PHP's password_hash() adds a random salt, so hashing the same data twice gives different results.

<?php
  $string = "Give me a checksum.";

  echo crc32($string);
  // 3703541059

  echo md5($string);
  // cfa5d275b53523cc6b393b4b76da2da7

  echo sha1($string);
  // b813c8d640644c451d3a45b628a6ebbe60fbb9ba
?>

Collisions

A collision occurs when two different pieces of data have the same checksum. Collisions are unavoidable when distilling data of any size down to a short, fixed-length string: there are far fewer possible checksums than possible inputs, so some inputs must share one. For example, MD5 returns 32 hexadecimal characters for a short string and also for a 1 GB file. In general, the shorter the returned hash, the fewer distinct values it can take, and therefore the more likely collisions become.
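
To make the effect of hash length concrete, the sketch below deliberately shortens MD5 output to two hexadecimal characters, leaving only 256 possible values, and searches for two inputs that collide. The truncation, the helper name first_collision(), and the "input-N" strings are assumptions made for illustration. Nobody would truncate a checksum this far in practice.

```php
<?php
  // Hypothetical helper: hash "input-0", "input-1", ... and keep only the
  // first $hexChars characters of each MD5 until two inputs collide.
  function first_collision(int $hexChars): array
  {
    $seen = [];
    for ($i = 0; ; $i++) {
      $input = "input-$i";
      $short = substr(md5($input), 0, $hexChars);
      if (isset($seen[$short])) {
        // Return the colliding pair and how many inputs were tried.
        return [$seen[$short], $input, $i + 1];
      }
      $seen[$short] = $input;
    }
  }

  // 2 hex characters = 256 possible values
  [$a, $b, $tried] = first_collision(2);
  echo "$a and $b collide after $tried inputs\n";
```

With only 256 possible values, a collision is guaranteed by the 257th input at the latest. Each extra hexadecimal character multiplies the number of possible values by 16.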

Collisions only become a practical concern for large data sets. When comparing two files (an original and a modified copy), it is highly unlikely that they will generate the same hash. However, when calculating the checksums for millions of files, it becomes much more likely that two of those files will generate the same checksum even though the input is different.
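
The growth in collision risk can be estimated with the standard "birthday" approximation. The sketch below assumes every checksum value is equally likely, which real algorithms only approximate, and the function name collision_probability() is hypothetical. A 32-bit checksum corresponds to crc32() and a 128-bit checksum to md5().

```php
<?php
  // Approximate chance that at least two of $items inputs share a checksum,
  // assuming every $bits-bit value is equally likely.
  function collision_probability(float $items, int $bits): float
  {
    $possible = 2.0 ** $bits;
    // 1 - exp(-n(n-1) / 2N), written with expm1() to stay accurate
    // when the result is very close to zero.
    return -expm1(-$items * ($items - 1) / (2 * $possible));
  }

  printf("2 files, 32-bit checksum:          %.2e\n", collision_probability(2, 32));
  printf("1,000,000 files, 32-bit checksum:  %.2e\n", collision_probability(1e6, 32));
  printf("1,000,000 files, 128-bit checksum: %.2e\n", collision_probability(1e6, 128));
```

Under these assumptions, two files almost never collide even with a 32-bit checksum, a million files almost certainly do, and moving to 128 bits makes the risk negligible again.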

This is one reason why some hash algorithms are considered unsuitable for storing passwords. It becomes too likely that more than one password will yield the same result. Imagine that an attacker is trying millions of passwords in an attempt to guess the password behind a user's stored hash. Because of collisions, the attacker does not have to guess the correct password; any other password which yields the same hash would be accepted as valid.


Checksums in Git

The Git version control system computes a SHA-1 checksum for every commit. In fact, the checksum is used as the commit identifier and is commonly referred to as "the SHA". Git's checksums cover metadata about the commit as well as its contents, including the author, the date, and the previous commit's SHA.

Git assures the integrity of the data being stored by using checksums as identifiers. If someone were to alter a commit or its metadata, the SHA used to identify it would change. It would become a different commit.

Git ensures that the historical chain of commits cannot be quietly edited either, because each SHA covers metadata about the parent commit which precedes it. Altering one commit deep in the history would create a waterfall effect: every descendant commit would need its SHA recalculated as well. The history would become a different history.
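
The waterfall effect can be sketched with a deliberately simplified model in which each commit's identifier is the SHA-1 of its parent's identifier plus its own metadata. This is not Git's actual object format. The function names, field layout, and sample commits are all hypothetical.

```php
<?php
  // Simplified model only: real Git hashes a structured commit object,
  // not this ad-hoc string.
  function commit_id(?string $parentId, string $author, string $date, string $message): string
  {
    return sha1(($parentId ?? '') . "\n" . $author . "\n" . $date . "\n" . $message);
  }

  // Build a chain of identifiers, each one depending on the previous one.
  function build_history(array $commits): array
  {
    $ids = [];
    $parent = null;
    foreach ($commits as $c) {
      $parent = commit_id($parent, $c['author'], $c['date'], $c['message']);
      $ids[] = $parent;
    }
    return $ids;
  }

  // Made-up sample history.
  $commits = [
    ['author' => 'alice', 'date' => '2024-01-01', 'message' => 'Initial commit'],
    ['author' => 'bob',   'date' => '2024-01-02', 'message' => 'Add checksum page'],
    ['author' => 'alice', 'date' => '2024-01-03', 'message' => 'Fix typo'],
  ];
  $original = build_history($commits);

  // Alter only the oldest commit's message.
  $commits[0]['message'] = 'Initial commit (edited)';
  $altered = build_history($commits);

  foreach ($original as $i => $id) {
    echo $id === $altered[$i] ? "commit $i unchanged\n" : "commit $i has a new SHA\n";
  }
```

Although only the first commit was edited, every later identifier differs as well, because each one is derived from its parent's identifier.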
