
The Library of Congress and online archival – Part 1

This past weekend I read about the US Library of Congress’ online archival system, partly out of sheer fascination with the scale at which it operates, and partly to learn from it so I can build my own offline archive of web pages and websites that are important to me.

The Library of Congress’ site describes the process:

The Library’s goal is to create an archival copy—essentially a snapshot—of how the site appeared at a particular point in time. The Library attempts to archive as much of the site as possible, including html pages, images, flash, PDFs, and audio and video files to provide context for future researchers. The Library (and its agents) use special software to download copies of web content and preserve it in a standard format. The crawling tools start with a “seed URL” – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. Library staff also add scoping instructions for the crawler to follow links to that organization’s host on related domains, such as third party sites and social media platforms, based on permissions policies.
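The seed-and-scope process the quote describes can be sketched roughly as follows. This is a simplified illustration, not how Heritrix actually works: the PAGES map is a hypothetical in-memory stand-in for fetching pages over HTTP, and the scoping here is just a host allow-list.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler fetches each page over HTTP and extracts its links.
PAGES = {
    "https://example.org/": ["https://example.org/about",
                             "https://other.example.com/post"],
    "https://example.org/about": ["https://example.org/"],
    "https://other.example.com/post": ["https://elsewhere.net/"],
}

def crawl(seed, allowed_hosts):
    """Breadth-first crawl from a seed URL, keeping only URLs whose
    host is in scope (the "scoping instructions" from the quote)."""
    archived, queue, seen = [], deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        if urlparse(url).hostname not in allowed_hosts:
            continue  # out of scope: do not fetch or follow
        archived.append(url)  # a real crawler would save a snapshot here
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived

print(crawl("https://example.org/", {"example.org", "other.example.com"}))
```

Starting from the homepage seed, the crawler follows links it finds, visits the in-scope third-party host, and stops at the out-of-scope one.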

The Library of Congress uses open source and custom-developed software to manage different stages of the overall workflow. The Library has developed and implemented an in-house workflow tool called Digiboard, which enables staff to select websites for archiving, manage and track required permissions and notices, and perform quality review processes, among other tasks. To perform the web harvesting activity which downloads the content, we primarily use the Heritrix archival web crawler. For replay of archived content, the Library has deployed a version of OpenWayback to allow researchers to view the archives. Additionally, the program uses Library-wide digital library services to transfer, manage, and store digital content. Institutions and others interested in learning more about Digiboard and other tools the Library uses can contact the Web Archiving team for more information. The Library is continually evaluating available open-source tools that might be helpful for preserving web content.

It’s extremely encouraging that the Library explicitly calls out open-source tools. The most interesting part to me is the data format it uses:

Web archives are created and stored in the Web ARChive (WARC) and (for some older collections) the Internet Archive ARC container file formats.

I am now digging into the tools available to save, search and view articles in this format.
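As a first look at the format, here is a rough sketch of the WARC container layout using only the Python standard library. Real WARC records carry many more header fields (record IDs, dates, the captured HTTP headers) and are usually gzip-compressed, so treat this as an illustration of the record structure, not a substitute for purpose-built tools.

```python
# Minimal sketch of writing and reading one WARC-style record.
# A record is: a version line, named header fields, a blank line,
# the content block, then two blank lines terminating the record.

def write_record(uri, payload: bytes) -> bytes:
    """Serialize one 'response' record with a tiny subset of headers."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode()
    return headers + payload + b"\r\n\r\n"

def read_record(data: bytes):
    """Parse a record back into (target URI, payload bytes)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    fields = dict(
        line.split(": ", 1)
        for line in head.decode().splitlines()[1:]  # skip "WARC/1.0"
    )
    length = int(fields["Content-Length"])
    return fields["WARC-Target-URI"], rest[:length]

record = write_record("https://example.org/", b"<html>hello</html>")
print(read_record(record))
```

Because records are just length-prefixed blocks concatenated in one file, a whole crawl can live in a single archive that tools scan sequentially.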

(Part 2 – A little more on why this is important to me)