I’m looking forward to attending the Great Lakes THATCamp unconference later this month. What follows is an extended discussion of some of the things I have been thinking about. It starts from command-line tools and follows a path toward speculation about the nature of evidence in humanities research, especially from the perspective of history and of representing historical sources in digital form.

I’m currently in the midst of a project that involves what might be called hack-enhanced editing (aspiring toward inquiry-based hacking), preparing a digital collection of tens of thousands of articles that were translated and organized in an idiosyncratic paper database back in the 1930s, the Chicago Foreign Language Press Survey. (Images available at the Internet Archive; transcription project site not available, yet.)

With regular expressions, short scripts, bits of XPath and XSLT, and some venerable tools like ‘make’ and ‘grep’ (now approaching its fourth decade), I have gradually been building up tools and workflows to check quality and normalize some data-like elements across tens of thousands of transcribed articles. It’s not a complete Programming Historian approach, but I would be glad to share a few relevant geeky parts of this process if there’s interest, and to hear ideas or learn methods from others (where else but a THATCamp?). That said, it isn’t necessary to get into too many details in order to discuss the more general questions that this work can lead to.

There’s a kind of productive bootstrapping Catch-22 involved in editing in this way. It’s not possible to make informed decisions about certain aspects of how to structure the digital representation until the content is available in an initial transcription. We can’t decide what ought to be normalized, or how practical that will be to attempt, until the full range of variation is known. Exploring that variation is a matter of creating small automated tools that surface latent structures and data values. So editing is not a merely technical activity, and it’s also not a matter of searching to find information, as if the resource were a transparent carrier of historical data. Editing is more a process of asking, in all sorts of ways, what is this thing, what could imaginably go wrong with it, and in whose judgment would it count as wrong?
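To give a rough sense of what such a small tool can look like, here is a minimal Python sketch. It is not one of the project's actual scripts: the sample date strings and the `shape` and `tally_shapes` helpers are invented for illustration. The idea is to reduce each value to a coarse "shape" and count how many distinct shapes turn up, which is one way to see the range of variation before deciding what to normalize.

```python
import re
from collections import Counter


def shape(value):
    """Reduce a string to a coarse pattern: letter runs -> 'a', digit runs -> '9'."""
    value = re.sub(r"[A-Za-z]+", "a", value)
    value = re.sub(r"[0-9]+", "9", value)
    return value


def tally_shapes(values):
    """Count how often each shape occurs among the given values."""
    return Counter(shape(v) for v in values)


# Hypothetical date strings, standing in for values pulled from transcriptions.
samples = [
    "Jan. 5, 1932",
    "January 5, 1932",
    "Feb. 11, 1932",
    "5 Jan 1932",
    "1/5/32",
]

counts = tally_shapes(samples)
for pattern, n in counts.most_common():
    print(n, pattern)
```

A tally like this does not say which shape is "right". It only shows what exists, and the rare shapes are the ones that need an editorial judgment.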

Technical details aside, “what is this source?” is a fundamental question in many contexts, including just about any browser tab. With a queryable database, it seems to me that there may be little overt difference between querying a set of data for quality-control purposes, observing patterns and seeking inconsistencies and errors to be edited out, and performing essentially the same query to explore possible historical hypotheses about the data. Some “data errors” might themselves amount to historical evidence.
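A toy example may make the point more concrete. The table, column names, and rows below are entirely made up for illustration and do not reflect the project's real data or schema; the sketch uses Python's built-in `sqlite3` module. One grouping query lists the rarest combinations first, and the same result can be read either way: as a list of suspected transcription errors, or as a list of unusual cases worth a historical look.

```python
import sqlite3

# Invented rows: (group label, year). Some oddities are planted deliberately.
rows = [
    ("Polish", 1918), ("Polish", 1918), ("Polish", 1919),
    ("Polsh", 1918),   # possibly a typo for "Polish"
    ("Czech", 1918),
    ("Czech", 1981),   # possibly transposed digits, or something else entirely
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (grp TEXT, year INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)

# The same query serves quality control and exploratory reading:
# count articles per (group, year) and show the rarest combinations first.
QUERY = """
    SELECT grp, year, COUNT(*) AS n
    FROM articles
    GROUP BY grp, year
    ORDER BY n ASC, grp, year
"""

results = conn.execute(QUERY).fetchall()
for grp, year, n in results:
    print(grp, year, n)
```

Nothing in the query itself distinguishes an error from an anomaly. That judgment comes from going back to the source.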

We tend to think of searching as an activity directed at finding documents, representations of documents, or information. But in practice a certain amount of searching is better described as querying, not simply to find what a database can point to, but to size up the database itself, to better understand the nature of its mediation. Any interaction with an information source, digital or otherwise, involves a certain amount of figuring out what its limits are, what it ignores or takes for granted, what kind of processes produced it. When I can’t find what I’m looking for, I need to be prepared to be intellectually engaged at many different levels, not knowing in advance which will turn out to be relevant. It could be that I mistyped a term; it could be that the evidence I thought would exist is somewhere other than where I’m looking for it; it could be that the imagined evidence simply does not exist, or perhaps never existed. Making sense of a list of search results is not just a matter of searching and finding. It can in itself lead to provisional hypotheses about historical processes.

These are not new observations within digital humanities, but I think there is more to be done to keep drawing out the humanities continuities amid the perpetually ascribed novelties of the digital. What is the database when it’s understood as itself a kind of historically rooted document? And can we constructively make databases that accept their own document-like nature and don’t presume to evade history altogether? (I’m eager to see pragmatic Linked Data thrive as grounded documents, and I’m deeply skeptical of dreams of a Giant Global Graph if it is thought of as a kind of atemporal central vat containing a slurry of deracinated triples — standardized assertions of fact — which it doesn’t necessarily need to be.)

This line of thought makes me wonder if a language of evidence could help clarify issues that get muddied in item-focused battles over originals and digital surrogates, vexations over authority and authenticity, and perceptions of innovation in visualization. Historical inquiry has always looked past single documents toward pattern, with an understanding that the pattern often is not a property of any single document alone. Evidence has never just been in discrete items and their metadata; the quality of evidence depends on the quality of the questions we ask of it.

When we talk about search, it is convenient to make a simplifying presumption that what we are doing is looking for items already known by their type. But inquiry is a hermeneutic venture in which a set of questions is iteratively refined through the resistance of the world to answering them as initially stated. That resistance itself can be evidence.


It seems late. Isn’t blogging dead yet?

In any case, there’s a small backlog of miscellaneous things I have thought would suit a blog, generally relating to working with digital data and methods in the humanities.

I don’t promise to be timely, to post regularly, or to avoid an accumulation of miscellaneous incoherence.

This is an individually maintained blog. Opinions expressed here are my own, at best, and subject to change.

