Digital Methodologies: Using Semantic Vector Spaces for Identity Research


Quintus van Galen is a PhD candidate at Edge Hill University, working on the application of computerised textual analysis to historical questions in periodical studies. His thesis will focus on the methodological viability of such analyses within historiography, by the use of a case study on imperial identity in Victorian newspapers. When not working on his own research, he teaches methodology to undergraduates, and acts as an assistant for Bob Nicholson’s digital projects.

Quintus would be happy to answer any questions you have regarding LSI and what it can do for your research through Twitter @QVgalen.


When I graduated with an MA in History in 2015, I was part of the first cohort that started their programmes after the genesis and widespread adoption of digital archives in the early 2010s. We had been part of the first generation to have access to full-text web-searchable databases for our dissertations. Many of our generation, myself included, gratefully used those resources and many would not have been able to complete their dissertation without them. While on archival research in India, I could cross-reference documents found there with those in Kew, Brussels or The Hague with just a few mouse-clicks. Yet there was one thing our training left us unprepared for when using these fonts of knowledge – the incredible amount of material they contained. The methodologies that have served historians so well for as long as historiography has existed were always meant to cope with a dearth of information, in the implicit understanding that there was always some source that was unknown or unavailable to the historian. Now, it seems we are swamped by sources the moment we hit the search bar.

One possible solution to this textual overload comes in the form of ‘distant reading’. A term popularized by Franco Moretti in 2005, it offers the historian a way to analyse a large collection of texts and gain a clearer understanding of the way these texts influenced each other – and society. The simplest way to perform distant reading for historical research, on both a practical and conceptual level, is the N-gram method. First conceived in the 1990s, N-gram graphs measure occurrences of a string of words of length n over a set of documents over time. They are commonly used in the branch of historiography known as culturomics, which seeks to analyse culture based on statistical evidence.

The research for my PhD will use N-grams in a primarily exploratory role. By adding together the tables of the category under investigation (for example, a year of a specific periodical) and deriving the relative word frequencies from these, we can construct N-gram graphs, which can answer very specific research questions. While no N-gram corpora are publicly available for periodicals, short grams are easy to generate.[i] In the initial stages of the project, I built a simple 1-gram viewer. This allowed me to do two things: to get a very rough overview of the content of the digitised periodicals and its evolution over time, and to choose subsets for further programs to work with in a less biased way.
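As a rough illustration of the counting step behind such a 1-gram viewer, the sketch below adds up word tables per year and derives the relative frequency of one term. It is not the viewer built for this project: the function names, the tokenisation rule and the two-year toy corpus are all assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(articles_by_year, term):
    """Return {year: share of all tokens in that year that equal `term`}."""
    series = {}
    for year, articles in sorted(articles_by_year.items()):
        counts = Counter()
        for article in articles:
            # Add this article's word table to the table for the whole year.
            counts.update(tokenize(article))
        total = sum(counts.values())
        series[year] = counts[term.lower()] / total if total else 0.0
    return series


# Hypothetical toy corpus; real input would be the OCR text of a periodical.
toy_corpus = {
    1880: ["The frontier is quiet", "Trade news from the colony"],
    1899: ["Troops moved to the frontier", "The frontier war continues"],
}

series = relative_frequencies(toy_corpus, "frontier")
for year, share in series.items():
    print(year, round(share, 3))
```

Plotting the resulting series against the year gives the familiar N-gram graph. Dividing by the yearly token total is what makes years with very different amounts of surviving text comparable.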

But there are questions N-grams cannot answer. They look at grams in isolation and therefore lose the contextual meanings of the words they chart. An N-gram charting the adoption of the American term ‘skedaddle’ in Victorian newspapers would indeed show its adoption in the British vocabulary, but would also show the brief reign of the racehorse of the same name – without distinguishing between the two. The list of context-sensitive terms is almost endless. Without the contextual semantic information embedded in the rest of the sentence, it is impossible to know which meaning you are mapping. Collocate searches can provide some solace, but only to a degree.

The solution is to no longer consider words as solitary semantic units, but to include their contexts in the analysis. This is easier said than done, and only recently have computers become ‘smart’ enough to deal with semantic ambiguities; the most accessible way of achieving this is by using semantic vector spaces. These programs apply the saying “you shall know a word by the company it keeps”, and encode each word as a vector containing its relation with all other words in the corpus. One such method, among others, is Latent Semantic Indexing (LSI).
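The idea of encoding a word by the company it keeps can be sketched with plain co-occurrence counts. The example below is only that: it is not LSI, which goes further and compresses such counts into a smaller number of latent dimensions. The window size, function names and toy sentences are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words found within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy sentences standing in for newspaper text.
toy_sentences = [
    "Troops marched to the frontier",
    "Soldiers marched to the frontier",
    "The budget passed in parliament",
]

vectors = cooccurrence_vectors(toy_sentences)
print(cosine(vectors["troops"], vectors["soldiers"]))  # shared neighbours: high similarity
print(cosine(vectors["troops"], vectors["budget"]))    # no shared neighbours: zero
```

Words that never appear together can still end up close to each other, as ‘troops’ and ‘soldiers’ do here, because they share neighbours. That property is what allows a vector space to separate the verb ‘skedaddle’ from the racehorse of the same name.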

My PhD research focuses on the perceptions of imperial identity in the British press, and the way the press helps construct this identity. As part of this process, I am using LSI to discover whether there is an evolution in the topics discussed in British newspapers, and whether different newspapers generally report on different topics. Both of these questions require the use of a large dataset of digitised newspapers, which is fortunately available in the form of the British Library Nineteenth Century Newspaper Archive.

To answer these questions, a slice is made in the newspaper corpus to narrow it down to a time period, newspaper, or keyword-topic. In this case, the Gensim LSI algorithm was applied to a sample of one thousand newspaper articles which contained the keyword ‘empire’ and either ‘border’ or ‘frontier’, with the aim of discovering trends in the narrative about the edge of the Empire. The reason for this relatively small sample size is that the analysis program is still in heavy development, and using more articles would incur considerably longer processing times. Eventually, the goal is to use a sample around the ten to one hundred thousand mark. The output is a list of the strongest semantic clusters. Very few historians have employed the technology yet, so its full potential is still unknown. However, promising experiments on the Dutch newspaper corpus Delpher have seen the technique used to calculate the most important words in a text, and subsequently determine the associations between these words and the entire text. My PhD project was in large part inspired by those experiments.
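The keyword slice described above can be sketched as a simple filter that runs before any LSI step. This is an assumption about how such a slice might be made, not the project's actual code: the function names are hypothetical, and matching on whole words only (so that ‘frontiers’ or ‘borders’ would not count) is a simplification.

```python
import re


def matches_slice(text):
    """True if the text contains 'empire' and at least one of 'border' or 'frontier'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "empire" in words and bool(words & {"border", "frontier"})


def select_sample(articles, limit=1000):
    """Return up to `limit` articles that match the keyword slice, in original order."""
    selected = []
    for article in articles:
        if len(selected) >= limit:
            break
        if matches_slice(article):
            selected.append(article)
    return selected


# Hypothetical toy articles; real input would be OCR text from the archive.
toy_articles = [
    "The Empire guards its northern frontier.",
    "Border skirmishes trouble the Empire.",
    "The Empire celebrates the jubilee.",
    "A quiet week on the frontier.",
]

sample = select_sample(toy_articles)
print(len(sample), "of", len(toy_articles), "articles selected")
```

Taking the first matches in order, as this sketch does, would favour whichever part of the archive is read first. Drawing the sample at random from all matches is one way to avoid that.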

Several of these clusters are clearly garbage – a sure sign the OCR-cleaning algorithm is slacking at its job. Some of the associations are expected in a dataset such as this, for example the Boer War. This is a good sign, because it means the program is to some extent working as it should. There are, however, several interesting observations that can be made from the clusters. For one, about half of them contain some form of reference to the military, the backbone of the Empire; seven clusters contain “Troops”, and three mention staff officer ranks (Colonel and General). Additionally, there are clear clusters of enemies, most notably the Russians (who are strongly associated with the Turks and the French), and above all the Boers. It needs to be stressed, though, that the importance given to this latter group is a reflection of the overrepresentation of the late 1890s in the sample. Of particular interest is the cluster I interpreted as discussing the national budget, which contains strong associations with “Army” and “Tribes”. These findings seem to suggest that the British Empire was indeed strongly defined in the papers by its military actions on the frontiers, supporting scholarship by MacKenzie, Potter, Williams and others.

The second interesting find is the lack of “Britain” or “England” in the clusters. “British” occurs in only two clusters: once with “Government” and “War”, and once with “Africa” and “Turkish”. This absence might be explained by the theory of Banal Nationalism: there was no need to explicitly state a paper was discussing the British Empire – this was known by the reader by virtue of it being the only ‘Empire’ worth mentioning as such. A closer reading will have to be carried out to verify this, but this suggests the importance of the British press in ‘Flagging the Empire’ on a daily basis cannot be overestimated.

Going forward, the biggest obstacle to overcome is the lack of quality in the OCR, and improving this will be the first step towards producing better analyses. Fortunately, there are various options available to deal with this issue, so there are multiple avenues to explore. Once this is dealt with, I hope to apply LSI to various slices of the newspaper archive (by time, newspaper, or topic), which should allow for tracking trends over these categories. This should both provide insight into the viability of LSI as a historical research tool and allow an overview of the entirety of newspaper reporting without having to resort to sampling.



[i] The N-grams that do exist for the BLNP dataset have to remain within institutions for copyright reasons. They have been built and operated with great success by the University of Bristol. See: Lansdall-Welfare et al., ‘Content analysis of 150 years of British periodicals’, DOI: 10.1073/pnas.1606380114



Michel et al., ‘Quantitative Analysis of Culture Using Millions of Digitized Books’, Science, December 2010.

Nicholson, B. and Johnes, J., ‘Sport History and Digital Archives in Practice’, in G. Osmond and M. Phillips (eds.), Sport History in the Digital Era (University of Illinois Press: Chicago, 2015), pp. 53-74.

Lansdall-Welfare et al., ‘Content analysis of 150 years of British periodicals’, PNAS, January 2017.

Williams, K., Read All About It! A History of the British Newspaper (Abingdon: Routledge, 2010).

Potter, S. (ed.), Newspapers and Empire in Ireland and Britain: Reporting the British Empire, c.1857 – 1921 (Dublin: Four Courts Press, 2004).

MacKenzie, J., Propaganda and Empire: The Manipulation of British Public Opinion, 1880-1960 (Manchester: Manchester University Press, 1984).

Billig, M., Banal Nationalism (London: SAGE, 1995).

