Even before the Internet disrupted their environment in ways that are still unfolding, newspapers were complicated things, at once periodical publications, businesses, and devices of social organization and communication. The names of the best-known newspapers (the New York Times, the Wall Street Journal) carry an aura of institutional solidity, but the history of newspapers includes many locales, many more papers, some of them short-lived, and many changes in ownership, editorial leadership, and political stance. Mergers and renamings have left their stamp on names like Star-Ledger, Journal-Constitution, Post-Gazette. We cite historical newspapers by name and date, usually ignoring the complexities of daily variations in editions and other irregular publication patterns that made newspapers awkward misfits in book-oriented bibliographic contexts long before digital media added new complications. Editorial page writers and historians have often employed without apology the convenient social fiction that a newspaper is a continuous identity of singular agency, judging that a more precise account would be hopelessly unwieldy. But they have been in a position to know how much of a fiction it is.

Ten years ago I had the good fortune to participate in the preparation of the Encyclopedia of Chicago. The editors sought to supplement the alphabetical entries with a number of new maps, tables, and charts. One of my colleagues prepared several charts to visualize highlights of the history of daily papers in English, in other languages, and in the metropolitan region beyond Chicago. The scope was limited to dailies to keep the research and the visualizations manageable.

Last year the Library of Congress’s wonderful Chronicling America project announced a major release of its data on the Web, with a web-friendly API (Application Programming Interface) that provides access to the data, including the links between records. Anything with a three-letter acronym can sound complicated, but the essential idea of their API is quite simple and familiar: every major kind of data resource has a bookmarkable Web address, and the document found at that URL can have a structure suited to its content and links to related resources. Their API is intelligently organized to serve human readers and also, importantly, to provide information to machines built by others. Those machines can in turn serve even more readers at one remove from the Library of Congress team, providing services the team doesn’t have to anticipate in advance.

Chronicling America has more than a million newspaper pages digitized through the National Digital Newspaper Program (NDNP), and will grow to as much as 20 million pages. It also has approximately 140,000 bibliographic title records gathered from libraries across the U.S. There is a lot there. As a way of exercising my uneven digital humanities skills and engaging particular topical interests, I started a personal project, speculatively exploring a small subset of just the bibliographic records. So here’s a report on playing around:

Interestingly, one of the metadata fields in these bibliographic records is a pointer to other records representing the successor to the paper described. Making use of the friendly API, I downloaded about 1,700 bibliographic records for newspapers associated with Chicago to start to play with a subset of data more systematically than was possible by searching for single titles and traversing the links in a browser. [Brief technical note: this exploration was hacked together iteratively using Python, SPARQL queries, and Graphviz.]

If we represent each bibliographic record as a node (an oval), and we draw arrows between these nodes to indicate which record is succeeded by another record, we get a directed graph that at first glance seems to amount to a genealogical chart for a newspaper, the skeleton of a narrative that exists in no single bibliographic record, but emerges from linkages across a small subset of records. For example, we see that the Western Herald became the Prairie Herald in two quick steps in the late 1840s.
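The node-and-arrow representation described above can be sketched in a few lines of Python that emit Graphviz DOT text. This is only an illustration, not the code used for the charts in this post: the `Record` structure, the record identifiers, and the placeholder for the intermediate title are all hypothetical, and real catalog records carry many more fields.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a bibliographic record.
# "successors" holds the ids of records this one points to as its successor.
Record = namedtuple("Record", ["record_id", "title", "successors"])


def to_dot(records):
    """Render records as oval nodes and successor links as arrows (DOT text)."""
    lines = ["digraph succession {", "  node [shape=oval];"]
    for rec in records:
        # Escape double quotes so titles cannot break the DOT syntax.
        label = rec.title.replace('"', '\\"')
        lines.append(f'  "{rec.record_id}" [label="{label}"];')
    for rec in records:
        for succ in rec.successors:
            lines.append(f'  "{rec.record_id}" -> "{succ}";')
    lines.append("}")
    return "\n".join(lines)


# Illustrative data only: ids and the middle title are placeholders.
records = [
    Record("rec-a", "Western Herald", ["rec-b"]),
    Record("rec-b", "(intermediate title)", ["rec-c"]),
    Record("rec-c", "Prairie Herald", []),
]

dot_text = to_dot(records)
print(dot_text)
```

Saving the printed text to a file and running it through Graphviz's `dot` command (for example `dot -Tpng`) would produce a small chain of three ovals joined by two arrows.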

That’s an easy one, and there are many like it among the 177 graphs with two or more nodes based on the Chicago records I extracted. There are also a small number of denser graphs with multiple branches representing mergers and renamings, like the relationships between Swedish papers over a 75-year span:

It matters, though, that these nodes really are bibliographic records, and not newspapers. The linkages between records are imperfect. Some “successor” relationships associate related records that are not actual successor publications, but predecessors or coexistent publications that ended up merged. Catalogers created these records and their links for the sake of aiding discovery of library materials by researchers in particular contexts. Somehow the Southtown Economist ended up tangled in a bibliographic network that looks more complicated than the underlying data seems to be:

Still, if we understand the origins of the data and are willing to revise our understanding of what the arrows mean, we can infer some outlines of stories that suggest a tension between neighborhood identities and centralizing business considerations.
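Revising what the arrows mean can include ignoring their direction altogether. Counting separate graphs, like the graphs with two or more nodes mentioned earlier, only requires knowing which records are linked at all. The following is a minimal sketch of that grouping step, assuming the links have already been extracted as (source, target) pairs; the function name and the sample identifiers are made up for illustration.

```python
from collections import defaultdict


def weak_components(edges, nodes=()):
    """Group record ids into sets that are linked, ignoring arrow direction."""
    neighbors = defaultdict(set)
    for node in nodes:
        neighbors[node]  # register records that have no links at all
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen = set()
    components = []
    for start in list(neighbors):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from "start".
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Hypothetical links: a simple chain, a merger of two papers, and a loner.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "e")]
components = weak_components(edges, nodes=["g"])

# Keep only the graphs with two or more nodes.
multi_node = [c for c in components if len(c) >= 2]
print(len(multi_node), "graphs with two or more nodes")
```

Treating the links as undirected sidesteps the question of whether a given "successor" arrow really points forward in time, which matters given how imperfect the recorded relationships are.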

Because of the diverse origins of these records, there are also duplicates and inconsistencies. The Chicago Tribune, for example, has many bibliographic records describing what researchers would consider the same paper, and the successor relationships recorded in metadata don’t bring these together in a single connected graph:

Behind each of these records is an additional set of holdings records, and beyond that a set of institutional contexts, drawers of microfilm or even shelves of bound paper. These records were created to serve discovery in the research process. They weren’t meant to be graphed and read like this, exactly.
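One rough way to spot cases like the Chicago Tribune, where records for what researchers would call the same paper sit in separate graphs, is to group records by a normalized title and check whether the group spans more than one connected graph. The sketch below assumes each record has already been assigned a graph number. The normalization rule is a crude heuristic of my own, not a cataloging standard, and the sample records are invented.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Crude heuristic: lowercase, drop punctuation and a leading 'the'."""
    words = re.sub(r"[^a-z0-9 ]", " ", title.lower()).split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


def scattered_titles(titles, component_of):
    """Return normalized titles whose records fall in more than one graph."""
    graphs_by_title = defaultdict(set)
    for record_id, title in titles.items():
        graphs_by_title[normalize_title(title)].add(component_of[record_id])
    return {t: g for t, g in graphs_by_title.items() if len(g) > 1}


# Hypothetical record ids, titles, and graph numbers for illustration.
titles = {
    "r1": "Chicago Tribune",
    "r2": "The Chicago tribune.",
    "r3": "Prairie Herald",
}
component_of = {"r1": 0, "r2": 1, "r3": 2}

scattered = scattered_titles(titles, component_of)
print(scattered)
```

A match on normalized title is only a hint. Unrelated papers can share a name, and the same paper can appear under variant titles, so any result still needs a human reading of the records.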

The exercise of looking at these graphs makes me wonder about the many stories and important forgotten histories that must lie undiscovered in the millions of pages of old news. But it also makes me think about how the mechanics of a research process intended to lead eventually to interpretation are already necessarily a process of interpretation from the beginning. Metadata is created to serve discovery, but once created, it becomes evidence, and how it serves as evidence is beyond the control of its catalogers and creators.

I can imagine network graphs like these provoking three different kinds of story-construction simultaneously, in different dimensions (none of them new; they have long histories). We can look at the metadata as historical data. Even before we get to see a page of newsprint, we can use aggregated metadata not solely for the sake of discovery, but to look toward history, constructing provisional stories. Yet to appropriately discount and trust the data in this way, we need also to read the graphs with an eye to quality control, to look through the metadata to the circumstances of its production and aggregation, envisioning how what we’re seeing is the evidence of disparate and evolving library processes converging at network scale over spans of decades. And finally, we can look at these graphs as maps offering a field of prospective narratives of possible research paths. Any one of these nodes may have once been a card in a catalog. Each is now its own web page, and each is attached to holdings records. To find what we are looking for we may have to traverse a long chain of such records, reading and filtering, judging what’s likely to be the main path and what will be a distraction.

Without further comment, here are a few more interesting graphs:

