It’s only been a year? It seems like two—yet there is one moment since the launch of Humanities Commons that stands out in my memory as particularly rewarding. Like many, I’ve grown increasingly concerned about access to data. I followed the Data Refuge events last spring. I participated in the March for Science in Princeton (since the crowd was about 4000:1 pro-science, it wasn’t exactly a challenge). We on the Humanities Commons team were of course aware of the proposed plans to shut down the NEH and NEA, so when Kathleen Fitzpatrick suggested we mark Endangered Data Week by archiving all the white papers that have originated from grants issued by the NEH’s Office of Digital Humanities, I was immediately intrigued.
The number of papers we wanted to deposit was fairly small, which meant there were multiple ways we could approach the task. In addition, the NEH has a public API for downloading metadata about all of its projects. Each project has its own Web page with a description and a link to related assets, so one can construct a query to download the project pages and then parse them to extract the links to the white papers themselves.
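The link-extraction step can be sketched in a few lines. This is an illustration in Python (the actual work was done in PHP), and the assumption that white-paper links end in `.pdf` is mine, not a description of the NEH pages’ real markup:

```python
from html.parser import HTMLParser

class WhitePaperLinkParser(HTMLParser):
    """Collect href values that look like white-paper PDFs.

    The ".pdf" heuristic is an assumption for illustration; a real
    scraper would key off the NEH project pages' actual structure.
    """
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".pdf"):
                self.links.append(href)

def extract_white_paper_links(project_page_html):
    """Return every PDF link found in one downloaded project page."""
    parser = WhitePaperLinkParser()
    parser.feed(project_page_html)
    return parser.links
```

Running this over each downloaded project page yields the list of documents to fetch and deposit.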
Next we needed to figure out how to map the NEH metadata to our fields in CORE and upload the documents. We have a utility that copies deposits from production to development, and that utility became the basis for our batch-load tool. Because we didn’t need to include a large number of fields, the data-mapping exercise went quickly, and we discovered just a few issues that we needed to resolve.
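A mapping like this usually boils down to a small translation table. The sketch below is in Python, and the field names on both sides are hypothetical — the real NEH and CORE schemas differ:

```python
# Hypothetical field names on both sides, for illustration only.
NEH_TO_CORE = {
    "ProjectTitle": "title",
    "ProjectDesc": "abstract",
    "YearAwarded": "date_issued",
    "Disciplines": "subjects",
}

def map_record(neh_record):
    """Translate one NEH metadata record into a CORE-style deposit dict,
    silently dropping any NEH fields we have no mapping for."""
    return {
        core_field: neh_record[neh_field]
        for neh_field, core_field in NEH_TO_CORE.items()
        if neh_field in neh_record
    }
```

Keeping the table small is what made the exercise quick: fields absent from the table simply never reach the repository.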
We were going to be uploading these materials—not the original authors—but at the time, only original authors could upload materials to CORE. What’s more, the NEH metadata listed people with the roles of both project director and author, but CORE only allowed for authors. Before going further, we needed to update CORE to add additional depositor roles, such as project director, editor, and translator, and allow an admin to deposit on behalf of others.
With that out of the way, we turned to the metadata itself. Names and subjects needed to be processed prior to depositing into CORE. There were 1340 distinct names in the NEH metadata, a few hundred of which matched Humanities Commons member records; we connected those to ensure that credit was given where credit was due. CORE’s subject taxonomy aligned reasonably well with that of the NEH: we added just 34 subjects and mapped about 50 NEH subjects that were nearly identical to those in CORE to their repository counterparts.
With these prerequisites out of the way, I got down to writing the PHP script that would upload all the white papers into CORE. I needed to process each row of metadata, look up any available usernames, map the subjects, and send the data to CORE in the expected format. It went fairly smoothly, and we hope to be able to build on this work to facilitate batch uploads in the future.
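The per-row loop amounts to assembling one deposit payload from each metadata record. Here is a skeletal version in Python rather than the original PHP; the member index, field names, and payload shape are all assumptions for illustration:

```python
# Hypothetical name -> Commons username index built from member records.
MEMBERS = {"jane doe": "jdoe"}

def lookup_username(name):
    """Return the Commons username for a name, or None if no member matches."""
    return MEMBERS.get(" ".join(name.lower().split()))

def build_deposit(row, subject_map):
    """Turn one row of NEH metadata into a deposit payload for CORE.

    `row` and the returned dict use invented field names; the real
    script sent this payload to CORE's deposit endpoint.
    """
    authors = [
        {"name": name, "username": lookup_username(name)}
        for name in row["names"]
    ]
    return {
        "title": row["title"],
        "authors": authors,
        "subjects": [subject_map.get(s, s) for s in row["subjects"]],
        "file_url": row["white_paper_url"],
    }
```

Attaching usernames where a member match exists is what lets the deposit credit existing Humanities Commons accounts automatically.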
While testing, I got the chance to explore the many interesting papers in the collection. My personal favorite was the Van Allen Project by Paul Allen Kaiser, Gil Weinberg, and Mark Downie, which dealt with the wartime work of astrophysicist James Van Allen, but there are many other interesting projects to be found!
Eric Knappe is head of software development at the Modern Language Association and the technical lead on Humanities Commons.