
MoMA's Digital Art Vault

April 14, 2015  |  Conservation, Media Conservation

MoMA’s data center. Photo: Ben Fino-Radin

Recently on Inside/Out, we heard from Assistant Media Conservator Peter Oleksik about MoMA’s efforts to preserve and digitize its collection of analog video art, amassed over the course of four decades. The heroic undertaking of digitizing over 4,000 videotapes was absolutely critical for preservation and access purposes. However, when we digitize analog videotape, we have just begun a new chapter in the artwork’s life, one that is rife with grave challenges and risks that are unique to digital materials. In another recent post, we learned from Media Conservator Kate Lewis that it is increasingly common for time-based media artworks to be delivered to the Museum in digital form, due to evolving tools and artistic practices. Today I’ll describe how MoMA has faced head-on the significant challenges in digital preservation by designing a state-of-the-art digital vault for these collections. In order to distill some rather technical and complex ideas that inform this effort, I’ll break this digital art vault down into three parts: the packager, the warehouse, and the indexer.


The packager addresses the most fundamental challenge in digital preservation: all digital files are encoded. They require special tools in order to be understood as anything more than a pile of bits and bytes. Just as a VHS tape is useless without a VCR, a digital video file is useless without some kind of software that understands how to interpret and play it, or tell you something about its contents. At least with a VHS tape you can hold it in your hand and say, “Hey, this looks like a VHS tape and it probably has an analog video signal recorded on it.” But there is essentially nothing about a QuickTime .MOV file that says, “Hello, I am a video file! You should use this sort of software to view me.” We rely on specially designed software—be it an operating system or something more specialized—to tell us these things. The problem is that these tools may not always be around, or may not always understand all formats the way they do today. This means that even if we manage to keep a perfect copy of a video file for 100 years, no one may be able to understand that it’s a video file, let alone what to do with it. To avoid this scenario, the “packager”—free, open-source software called Archivematica—analyzes all digital collections materials as they arrive, and records the results in an obsolescence-proof text format that is packaged and stored with the materials themselves. We call this an “archival information package.”
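To make the idea a bit more concrete, here is a rough, purely illustrative Python sketch of what identifying a file and recording a plain-text description alongside it might look like. This is not how Archivematica actually works (it relies on dedicated identification tools and standard metadata schemas); the tiny signature table and file names below are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# A tiny signature table, for illustration only. A real packager consults
# full format registries and dedicated identification tools instead.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"\xff\xd8\xff": "JPEG image",
}

def identify(path: Path) -> str:
    """Guess a human-readable format name from the file's leading bytes."""
    with path.open("rb") as f:
        head = f.read(16)
    if head[4:8] in (b"ftyp", b"moov", b"mdat", b"wide"):
        return "QuickTime / ISO base media video"
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"

def describe(path: Path) -> dict:
    """Build a record that can be stored, as plain text, next to the object."""
    return {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "identified_format": identify(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    target = Path("example.mov")  # hypothetical collection file
    record = describe(target)
    Path(target.name + ".description.json").write_text(json.dumps(record, indent=2))
```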


In addition to the issue of how to ensure that our successors will understand what a given stream of bits is supposed to represent, we also have the problem of authenticity. How can we prove in 100 years that a given digital object in the collection has not become corrupt, and has not been maliciously modified, since the moment it entered the collection? It would, of course, be impossible to manually inspect millions of digital files on a regular basis. To address this issue, the packager passes each and every digital object through a cryptographic hash function to generate a "checksum." The checksum for a digital file is essentially a short, unique-looking sequence of letters and numbers. This gives us the ability to come back to an archival package in the future, run the digital files through the same cryptographic process, and confirm that we wind up with the same values that were originally recorded. So, in summary, these archival packages contain not just MoMA's digital collections, but also the information we will need in the future to understand what the materials are and to confirm their authenticity. These archival packages are then sent off to what we call the "warehouse": a digital storage system maintained by the infrastructure division of MoMA's IT department.
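Conceptually, the fixity check works something like the sketch below. The choice of SHA-256 and the sidecar-file layout here are illustrative assumptions, not necessarily what the packager records.

```python
import hashlib
from pathlib import Path

def checksum(path: Path, algorithm: str = "sha256") -> str:
    """Compute a cryptographic checksum of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_fixity(path: Path) -> None:
    """At ingest: store the checksum in a plain-text sidecar file."""
    Path(str(path) + ".sha256").write_text(checksum(path) + "\n")

def verify_fixity(path: Path) -> bool:
    """Years later: recompute and compare against the stored value."""
    stored = Path(str(path) + ".sha256").read_text().strip()
    return checksum(path) == stored

if __name__ == "__main__":
    master = Path("artwork_master.mov")  # hypothetical collection file
    record_fixity(master)
    assert verify_fixity(master), "File has changed since ingest!"
```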

This is the digital equivalent of MoMA QNS, our physical art-storage facility in Long Island City. The "warehouse" is a very large cluster of hard drives configured as a Redundant Array of Independent Disks (RAID) that lives in our data center at 53rd Street, along with a duplicate of the entire cluster that lives offsite at MoMA QNS. This setup has served us well for about five years, but disk-based storage becomes an untenable expense with very large amounts of data. MoMA's digital collection is currently about 80 terabytes in size (80,000 gigabytes). This is a lot of data, but it is minuscule compared to our anticipated collection growth over the next 10 years. As MoMA acquires more digital artworks, and as the image resolutions used by artists and filmmakers increase, we project the digital collection to grow to approximately 1.2 petabytes (1.2 million gigabytes) by 2025.
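As a rough sanity check of that projection (the ten-year horizon comes from the figures above; assuming steady compounding, which is a simplification), growing from 80 terabytes to 1.2 petabytes works out to roughly 31% growth per year:

```python
# Back-of-the-envelope check of the growth projection quoted above:
# 80 TB today, ~1,200 TB (1.2 PB) projected ten years out.
current_tb = 80
projected_tb = 1200
years = 10

# Compound annual growth rate implied by those two figures.
cagr = (projected_tb / current_tb) ** (1 / years) - 1
print(f"Implied compound annual growth: {cagr:.1%}")  # ~31.1%
```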


A close-up of one of the units currently used to store our digital collections materials. Photo: Ben Fino-Radin

It would be irresponsibly expensive to continue using hard drive storage, as it was not quite intended for this scale of data. We are currently in the final stages of designing a completely new “warehouse” with a company called Arkivum. This system will include a small cluster of hard drives, but for primary long-term storage it adds a very cool new element to the mix: data tapes. When archival packages are first stored, they land on the cluster of disks, but are shortly thereafter copied to data tape, a process that is automated by software (and robots!). The video below provides an inside look at a machine very similar to the one that will be storing the projected 1.2 million gigabytes in MoMA’s digital collection.
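The actual disk-to-tape migration is handled by the vendor's software and the robotic tape library, so the snippet below is only a hypothetical illustration of the policy described above: packages land on a disk "landing" volume and, after a settling period, are copied to a tape-backed volume. The paths and threshold are made up.

```python
import shutil
import time
from pathlib import Path

# Hypothetical mount points: a disk landing area and a tape-backed volume.
# In the real system this copying is orchestrated by the storage vendor's
# software and a robotic tape library, not by a script like this.
DISK_LANDING = Path("/storage/landing")
TAPE_VOLUME = Path("/storage/ltfs_tape")
AGE_THRESHOLD_SECONDS = 24 * 60 * 60  # copy packages older than one day

def stage_to_tape() -> None:
    """Copy archival packages from the disk landing area to the tape volume."""
    now = time.time()
    for package in DISK_LANDING.iterdir():
        if not package.is_dir():
            continue
        if now - package.stat().st_mtime < AGE_THRESHOLD_SECONDS:
            continue  # still settling on disk; leave it for the next pass
        destination = TAPE_VOLUME / package.name
        if not destination.exists():
            shutil.copytree(package, destination)

if __name__ == "__main__":
    stage_to_tape()
```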

This system will allow us to store the projected 1.2 million gigabytes of digital collections material redundantly in three locations: the Museum, our art storage facility in Long Island City, and our film preservation center in Hamlin, Pennsylvania.

The two parts of MoMA's digital art vault discussed here (the packager and the warehouse) ensure that we will be able to keep a bit-for-bit stable and authentic copy of every digital collections object, and that we will be able to understand what these objects are and how to use them decades into the future. Unfortunately, neither the packager nor the warehouse facilitates day-to-day, active management of the stored collections. MoMA searched near and far for a system to handle this, and found none. So we built one ourselves: the indexer. Stay tuned for the next post for more about this new system.

Comments

Hi, very interesting and very clear post! That's a really cool read, and it's nice to see a museum sharing this kind of info with the general public.
I’m curious about the tape system: do you plan to upgrade it continuously during the next 10 or 20 years? Or is it built at start with the full capacity and you just need to add tapes?
I understand it’s a proprietary machine from IBM, but is the tape itself standardized?
All in all, it would be very interesting to hear about the criteria that made you go with tape as cold storage versus adding more disks along the way (the "add pod" strategy used by Backblaze, for example: https://www.backblaze.com/blog/vault-cloud-storage-architecture/).
I can guess the inertia of the tape might be a plus, but what are the other advantages over adding disks as needed? Power consumption?

Hi Julien, thanks for your comment – these are all great questions!

First I'll address your question about whether the tape itself is standardized. The answer in our case is: yes. We will be using LTO-6 tapes (LTO = Linear Tape-Open) and the LTFS (Linear Tape File System) filesystem.

We plan on refreshing/replacing all of our tapes about every five years. LTO-6 tapes store about 2.5 TB uncompressed, but by the time we refresh our tapes, a new, higher-capacity LTO generation will be available. This means that when we refresh our tapes and move to a newer LTO generation, we will take up fewer physical slots in the tape library. Knowing the number of tape slots available in the library, and knowing the LTO roadmap, we are confident that we will never outgrow the library's physically available slots, at least not in the next 10 years.
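To put rough numbers on that (the 2.5 TB LTO-6 figure is from above; the native capacities assumed for later generations are for illustration only), here is the kind of arithmetic involved:

```python
import math

# Rough tape-count arithmetic for the projected 1.2 PB (1,200 TB) collection.
# The 2.5 TB LTO-6 figure comes from the reply above; the later generations'
# native (uncompressed) capacities are assumed values for illustration only.
projected_tb = 1200
native_capacity_tb = {
    "LTO-6": 2.5,
    "LTO-7": 6.0,   # assumed
    "LTO-8": 12.0,  # assumed
}

for generation, capacity in native_capacity_tb.items():
    tapes_needed = math.ceil(projected_tb / capacity)
    print(f"{generation}: ~{tapes_needed} tapes to hold {projected_tb} TB")
```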

Lastly, I'll address your question about the criteria that led us to tape/cold storage. This is a question of cost, and of how the data is used. We certainly could have worked with a storage vendor to devise a disk-based system that is more affordable and modular than what we are moving away from, but when you look at how a museum, and MoMA in particular, uses its collections, tape really just makes more sense than disk. By this I mean: lots and lots of data that is not used very frequently. Museums typically show only a very small fraction of their collection at any given time, and our exhibition cycles run a year or more. We may exhibit something when it is first collected, and not show it again in the gallery for another 5-10 years!

One point I left out of the post is that Archivematica ("the packager") automatically makes compressed viewing copies for us, and that these viewing copies are stored in a separate storage system that is indeed disk. So we do have on-demand, instant streaming/viewing of anything that has been added to the repository. The only time we need to access the full-quality master files is if we are, for instance, manually preparing an exhibition file. This means that once the original masters for a work are stored, they really might not be requested for several years, and when they are requested, it is likely for an exhibition that has been on the calendar for a year or more, which gives us the ability to preemptively stage the files from tape to disk before a conservator even requests them (so that they experience no latency).

So, to summarize my very long-winded answer: tape just makes sense when you look at how we use the files we are storing on it, how massive those files are, and the per-TB cost of tape versus disk.
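For what it's worth, producing a compressed viewing copy of a master file boils down to a transcode along these lines; the tool and settings below (FFmpeg, H.264/AAC) are illustrative placeholders, not necessarily what Archivematica or MoMA actually uses:

```python
import subprocess
from pathlib import Path

# Illustrative only: produce a small H.264/AAC viewing copy of a master file.
# The real normalization rules and encoding settings differ; the parameters
# below are placeholder choices.
def make_viewing_copy(master: Path, output_dir: Path) -> Path:
    access_copy = output_dir / (master.stem + "_access.mp4")
    subprocess.run(
        [
            "ffmpeg", "-i", str(master),
            "-c:v", "libx264", "-crf", "23",  # compressed video
            "-c:a", "aac", "-b:a", "128k",    # compressed audio
            str(access_copy),
        ],
        check=True,
    )
    return access_copy

if __name__ == "__main__":
    make_viewing_copy(Path("artwork_master.mov"), Path("/storage/access_copies"))
```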

Hope that answers your questions!

Does MoMA use IBM storage solutions for their art vault?

Great article, Ben – nice to see such a complex system described in a fresh and succinct way. Best regards from Ireland!

David: The Arkivum system I’ve described here does indeed employ IBM hardware. The video above depicts a TS3500. We plan on deploying the newer TS4500.

Thanks for the very interesting post, Ben! This blog is wonderful and very helpful for small archives; I'm grateful for MoMA's ongoing willingness to share its knowledge.

I have a question you might be able to answer regarding the "archival information package": could you name the most important metadata fields that need to be captured (or that Archivematica captures), without which "no one may be able to understand that it's a video file, let alone what to do with it"? This will help me in creating the born-digital ingest workflow at the archive where I work. I'm considering what kind of technical metadata I should ask filmmakers to provide with their files (at the moment we can't take on new software, and it's all in its infancy). At this stage I'm aiming to ask them for only the minimal necessary fields.

Many thanks,
Hila

Hila: That's a very good question, but unfortunately a very big one, as it sounds like you're starting from square one.

One very basic thing you could do, in the absence of *any* sort of infrastructure, would be to run MediaInfo on the files you receive and output the results as PBCore XML metadata. This is far from a silver bullet, but at least it gives you a characterization of the materials in a standards-based format without too much effort.
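As a very rough sketch of that workflow (the exact MediaInfo output-template name varies between versions, so treat "PBCore2" here as an assumption):

```python
import subprocess
from pathlib import Path

# Minimal sketch of the suggestion above: run MediaInfo on an incoming file
# and keep its output as PBCore XML next to the file. The output template
# name ("PBCore2") may differ between MediaInfo versions.
def characterize(path: Path) -> Path:
    result = subprocess.run(
        ["mediainfo", "--Output=PBCore2", str(path)],
        capture_output=True, text=True, check=True,
    )
    sidecar = Path(str(path) + ".pbcore.xml")
    sidecar.write_text(result.stdout)
    return sidecar

if __name__ == "__main__":
    characterize(Path("incoming_video.mov"))  # hypothetical incoming file
```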

As for what metadata the content creators might provide – that depends on your context. Who are your content creators? Digitization vendors? Artists/filmmakers?
