Exploring World History Dissertations, Digitally

Northeastern University’s World History PhD program was founded in 1994. World History is a growing field, but only a few doctoral programs around the country offer it. What distinguishes world history? Is it merely a concern with the whole world, or a focus on theories like world systems?¹ What could a closer analysis of Northeastern’s World History dissertations reveal about what makes a history a world history? Which areas of the world get preference in world history in practice? World History strives to move past the dominance of the nation-state and to combat Eurocentrism; how is this reflected in the corpus of Northeastern’s PhD dissertations? The wordcloud below shows the most prevalent terms in this corpus, potentially revealing a continuation of the dominance of Europe (even within a field that works explicitly against that dominance).

In its 25-year history, the Northeastern History Department has awarded slightly more than 30 doctoral degrees. This number of dissertations makes for a good corpus for textual analysis. To create the corpus, I created plain-text files for all of the dissertations available online through Northeastern’s Digital Repository Service (DRS). The DRS only holds dissertations published in 2008 or later, but only seven dissertations predate 2008. To complete the corpus, I went to Snell Library’s Archives and Special Collections to scan those remaining dissertations. In deciding what metadata to include, I consulted Mary Piorun and Lisa Palmer’s “Digitizing Dissertations for an Institutional Repository.”² I then used Google Drive’s OCR tool to turn my scans into plain-text documents. Using regular expressions and Find & Replace, I cleaned the documents by removing footnote numbers (matched with \s\d{1,3}\s) and page numbers (matched with \[[A-Za-z\d]+\]). In the titles of the documents, I included the year of publication and the last name of the advisor as metadata, to probe whether there are any discernible trends associated with different advisors. See below for an exploratory summary of the corpus.
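The cleaning step can be sketched in R. This is an illustration rather than the exact procedure used for the corpus: the sample sentence, the helper names clean_text and parse_title, the choice to replace each footnote number with a single space, and the year_advisor.txt naming pattern are all assumptions. One caveat is worth stating: the footnote pattern matches any standalone number of one to three digits, so it can also remove legitimate numbers from the running text.

```r
# Hypothetical helper: strip bracketed page numbers and standalone footnote
# numbers using the two patterns quoted above.
clean_text <- function(text) {
  # Remove bracketed page numbers such as [23] or [xiv]
  text <- gsub("\\[[A-Za-z\\d]+\\]", "", text, perl = TRUE)
  # Replace a standalone 1-3 digit number and its surrounding whitespace
  # with a single space (assumed replacement)
  text <- gsub("\\s\\d{1,3}\\s", " ", text, perl = TRUE)
  # Collapse any leftover runs of whitespace
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

# Hypothetical helper: recover year and advisor from a document title,
# assuming titles look like "2012_khuri-makdisi.txt"
parse_title <- function(path) {
  stem <- sub("\\.txt$", "", basename(path))
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]
  list(year = as.integer(parts[1]), advisor = parts[2])
}

sample_text <- "signed in Manila. 14 It was revised. [23] Next"
cleaned <- clean_text(sample_text)
meta <- parse_title("corpus/2012_khuri-makdisi.txt")
print(cleaned)
print(meta)
```

An underscore is used as the separator in this sketch because at least one advisor name in the corpus (khuri-makdisi) contains a hyphen.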

Using R, I turned these plain-text files into a more traditional dataset structure, with one row per word and columns for the word, the year of publication, and the advisor. While working in R, I tried to make effective use of the tidyverse.³ I used the following code to find the most frequently used words that are unique to a single advisor:

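library(dplyr)

# Count each word per advisor, then keep only words used under one advisor
NU_dissertations %>%
  group_by(word, advisor) %>%
  summarize(count = n(), .groups = "drop") %>%
  group_by(word) %>%
  filter(n() == 1) %>%   # the word appears for exactly one advisor
  ungroup() %>%
  arrange(desc(count))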

Some results:

##    word          advisor       count
##    <chr>         <chr>         <int>
##  1 kuklinski     burds          2455
##  2 moiseyev      robinson        968
##  3 aguinis       salter          645
##  4 battenberg    robinson        645
##  5 brewers       salter          563
##  6 salvationists frader          442
##  7 manila        havens          441
##  8 legazpi       havens          368
##  9 furnace       campbell        354
## 10 mahjar        khuri-makdisi   292
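For readers who want to try this on their own texts, the one-word-per-row table that the query runs on can be built roughly as follows. This is a minimal sketch in base R, not the code used for this post: the two toy texts, the tokenize helper, and the NU_toy name are assumptions, and the simple splitting rule only handles unaccented Latin letters. The unnest_tokens function from the tidytext package, covered in the book cited above, performs the same reshaping in one call.

```r
# Toy stand-ins for cleaned dissertation texts (not the real corpus)
docs <- data.frame(
  year    = c(2010L, 2015L),
  advisor = c("havens", "robinson"),
  text    = c("Manila and Legazpi; Manila again.",
              "The Moiseyev company toured."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase a text and split it on anything that is
# not a letter, digit, apostrophe or hyphen
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9'-]+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading separators
}

# One row per word, carrying the year and advisor along
tokens <- lapply(docs$text, tokenize)
NU_toy <- data.frame(
  word    = unlist(tokens),
  year    = rep(docs$year, lengths(tokens)),
  advisor = rep(docs$advisor, lengths(tokens)),
  stringsAsFactors = FALSE
)
print(NU_toy)
```

A table shaped like NU_toy can be passed through the same group_by, summarize and filter steps shown above.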

Unsurprisingly, the results were dominated by proper nouns that were the subjects of individual dissertations. For example, ‘Moiseyev’ is the most used unique word from dissertations advised by Professor Harlow Robinson; the Moiseyev Dance Company is the subject of one of the two dissertations in the corpus that he advised. These words also helped to confirm the geographic specialities of the various academics, as most of the proper nouns, like Moiseyev, refer to specific geographic regions. I then created a visualization of the most frequently used words across the entire corpus, which is included below, though it is difficult to read.

The words that are legible in the above visualization include American, Soviet, British, Political, and Government. The prominence of ‘political’ and ‘government’ reflects the dominance of narratives of high politics in the field of history, which methodologies such as subaltern studies have emerged to combat. Their position in the corpus potentially illuminates the difficulty of escaping these narratives. In both the above visualization and the wordcloud at the top of this post, ‘American’ and ‘Soviet’ are among the most frequently used words. The high frequency of ‘Soviet’ is particularly interesting, since it is less surprising that ‘American’ would dominate a corpus of dissertations written at an American university. While this could be read as revealing a dominance of Soviet subject matter across the entire corpus, the following visualization demonstrates that it is, rather, a case of a few dissertations using the word at very high frequencies.
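The same question can also be checked numerically by counting a term within each dissertation rather than across the whole corpus. The sketch below is illustrative only: the toy data, the dissertation identifier column, and the term_share helper are assumptions, and base R is used so that it runs without extra packages.

```r
# Toy one-word-per-row table; `dissertation` is an assumed identifier column
toy <- data.frame(
  dissertation = c(rep("A", 6), rep("B", 5), rep("C", 4)),
  word = c("soviet", "soviet", "soviet", "soviet", "union", "dance",
           "soviet", "trade", "port", "empire", "labor",
           "brewers", "guild", "trade", "city"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: occurrences of one term in each dissertation, as a raw
# count and as a share of that dissertation's words
term_share <- function(tokens, term) {
  total <- table(tokens$dissertation)
  hits <- table(factor(tokens$dissertation[tokens$word == term],
                       levels = names(total)))
  data.frame(
    dissertation = names(total),
    count = as.integer(hits),
    share = as.numeric(hits) / as.numeric(total),
    stringsAsFactors = FALSE
  )
}

soviet <- term_share(toy, "soviet")
print(soviet[order(-soviet$count), ])
```

If one or two rows carry most of the count, as dissertation A does in this toy data, the corpus-wide frequency says more about those few texts than about the corpus as a whole.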

Many of the dominant words within the corpus come as no surprise: world, history, and foreign, for example. War is another word with a very high frequency, with a similar number of mentions to world, potentially reflecting a substantial focus on the two World Wars across Northeastern’s World History dissertations. This, coupled with the high frequency of ‘Soviet,’ potentially indicates that the dissertations share a temporal focus on the twentieth century. The relations between the words, in terms of number of mentions, as well as between dissertations, can be seen below.

There is much left to explore within the corpus of Northeastern’s World History dissertations, and this inquiry only scratches the surface; it is currently more exploratory than anything. Because of difficulties working in R, this review has included supplementary visualizations created using Voyant Tools. In particular, further exploration of the dissertations’ words per sentence and vocabulary density could prove interesting.
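As a starting point for that further work, vocabulary density could be computed directly in R. The sketch below takes vocabulary density to mean the number of unique words divided by the total number of words in a dissertation; that definition, the toy data, and the vocabulary_density helper are assumptions, and other tools may calculate the measure differently.

```r
# Hypothetical helper: unique words divided by total words (assumed definition)
vocabulary_density <- function(words) {
  words <- tolower(words)
  length(unique(words)) / length(words)
}

# Toy one-word-per-row table; `dissertation` is an assumed identifier column
toy <- data.frame(
  dissertation = c(rep("A", 5), rep("B", 4)),
  word = c("world", "war", "world", "war", "world",
           "manila", "trade", "port", "trade"),
  stringsAsFactors = FALSE
)

# Apply the measure separately to each dissertation's words
density <- tapply(toy$word, toy$dissertation, vocabulary_density)
print(density)
```

Under this definition a higher value means less repetition. Because longer texts tend to score lower, comparisons are most meaningful between dissertations of similar length.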

¹ Immanuel Wallerstein, The Modern World-System (Berkeley: University of California Press, 1974).
² Mary Piorun and Lisa Palmer, “Digitizing Dissertations for an Institutional Repository: A Process and Cost Analysis,” Journal of the Medical Library Association 96, no. 3 (2008): 223-229.
³ Julia Silge and David Robinson, Text Mining with R: A Tidy Approach (eBook, 2017).
