Genealogy: File formats and preservation

File formats are the KEY to preservation. One of the goals of digital preservation is to prevent a loss of access to files due to file format obsolescence.

If you are using a file format migration strategy for preservation, then you will be refreshing the digital files over time to keep the content stored in formats that are readable by the current technology.

If you are practicing a software emulation strategy for preservation, then you are maintaining software that will be able to read the old file formats.

When you create a digital object, its file type is declared by its extension (.jpg, .pdf, etc.). The file type has big implications for how preservation practices can be applied to the object now and in the future, because access to the contents of a digital object depends on the ability to store, read, and edit its files.
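An extension is only a label, and it can be wrong or missing after a file has been renamed or copied around. As a minimal sketch, assuming Python and a small hand-picked table of leading byte signatures for the formats named in this article, the declared extension can be compared against what the file's first bytes suggest. The names SIGNATURES, sniff_format, and extension_matches are hypothetical helpers made up for this illustration, not part of any library.

```python
from pathlib import Path
from typing import Optional

# Leading byte signatures ("magic numbers") for the formats named in this
# article. This table is an illustrative assumption, not a complete registry.
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"%PDF-": "pdf",
    b"II*\x00": "tif",  # little-endian TIFF
    b"MM\x00*": "tif",  # big-endian TIFF
    b"8BPS": "psd",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # container used by legacy .doc and .ppt
}

# Map extensions onto the names used in SIGNATURES.
EXTENSION_ALIASES = {"jpeg": "jpg", "tiff": "tif", "doc": "ole2", "ppt": "ole2"}


def sniff_format(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def extension_matches(path: Path) -> bool:
    """Check whether a file's extension agrees with its leading bytes."""
    with path.open("rb") as fh:
        header = fh.read(16)
    detected = sniff_format(header)
    ext = path.suffix.lower().lstrip(".")
    return detected is not None and detected == EXTENSION_ALIASES.get(ext, ext)
```

A file for which extension_matches returns False is worth a closer look before it goes into long-term storage, since either the label or the content is not what it appears to be.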

When a software program creates a file, the program can re-open the file to view it, edit it, etc. This is because the program knows the file format’s specifications and was designed to work with them. As software programs get upgraded or disappear, the ability to read the files they created becomes less certain.

Software upgrades happen all the time, and it is usually possible to open a file created with the previous version of a program. That may not hold for files created many versions ago, and it certainly won’t be possible once the software stops getting upgraded and can no longer run on new machines.

The following factors should be considered in assessing a file format’s long-term stability:

  • Wide adoption
  • History of backward compatibility
  • Good metadata support (in open format such as XML)
  • Good range of functionality, but not overly complex
  • Available interchange format with usable target
  • Built-in error checking
  • Reasonable upgrade cycle

Open and Proprietary
File formats can be grouped into two categories:

  • Open
  • Proprietary

When a file format is proprietary, the format’s specifications are not publicly available because they are usually guarded as the property of the company that created the program that creates the files. Proprietary, closed formats underlie some of the most enduring and successful software in use.

However, these formats also tend to evolve quickly and exist in many different versions for different platforms, with only limited backward compatibility provided. Some proprietary formats are .psd, .ppt, and .doc.

Open file formats are those in which the file format specifications are publicly available. When this information is available, programs other than the one that created the file can be made to interpret the file’s format (or migrate an old file into a newer format), and we are not dependent on the original program. This makes it more likely that the file will remain readable in its original format. Some open file formats are .pdf, .jpg, and .tif.

Open and Popular
With digital preservation, the rule of thumb is to move your content into file formats that are:

  • Open
  • Popular

When a file format is open, we can get inside its structure and know what’s going on, even if the software that a file was originally created on no longer functions.

The thinking behind choosing a popular file format over one that is used less frequently is that a way to “get inside” the format becomes inevitable: so many people will have invested their content in that format that someone will find a way in.
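To apply this rule of thumb to a real collection, it helps to know which formats you actually hold. The following is a rough sketch in Python, assuming the open and proprietary examples given in this article are the only formats of interest. The sets and the function names (classify, migration_candidates, format_counts) are hypothetical and would need extending for any real collection.

```python
from collections import Counter
from pathlib import Path

# Example formats taken from this article; real collections will need more.
OPEN_FORMATS = {".pdf", ".jpg", ".tif"}
PROPRIETARY_FORMATS = {".psd", ".ppt", ".doc"}


def classify(filename: str) -> str:
    """Label a file as 'open', 'proprietary', or 'unknown' by its extension."""
    ext = Path(filename).suffix.lower()
    if ext in OPEN_FORMATS:
        return "open"
    if ext in PROPRIETARY_FORMATS:
        return "proprietary"
    return "unknown"


def migration_candidates(filenames):
    """Return the files that are not already in one of the listed open formats."""
    return [name for name in filenames if classify(name) != "open"]


def format_counts(filenames):
    """Count how many files use each extension, as a rough popularity check."""
    return Counter(Path(name).suffix.lower() for name in filenames)


files = ["census_1910.tif", "family_tree.psd", "Letter.DOC", "will.pdf"]
print(migration_candidates(files))
print(format_counts(files))
```

Files returned by migration_candidates are the ones to review first: either migrate them to an open, popular format or make sure the software needed to read them is being kept.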