Sunday, January 27, 2013

Data mining for witches with Ngrams

So, the blog got neglected. I have been too busy at work, and following my vitrectomy at the end of last February, I slowly developed a dense post-operative cataract in my eye. This seemed to cause me eye-strain, as though I was over-taxing the properly functioning eye. Reading in the evening became very restricted, and so gratuitously working through texts on EEBO fell from my list of priorities.

Frustrated by this, and facing more months of the cataract getting worse before the NHS could schedule me, last Tuesday I had the procedure done privately, at Reading's weirdly luxurious new Circle Hospital. This has the ambience of a brand new hotel, an illusion only spoiled by the occasional medico shuffling past in ward pyjamas and plastic clogs. Art works are everywhere, and, astonishingly, a young woman with a concert harp was playing to relax those in the lounge area. At no point was she summoned up the lift into theatre to assist in giving a client-customer a reassuring departure, but it seemed a disturbing possibility ("Emergency, harpist to theatre 4, please!"). I had my original operation wearing beneath my plastic shroud my jacket, shirt and tie as usual; for this far less major procedure, Circle had me don a 'Patient Dignity Gown'. I thought this suited me a lot, and I will take to wearing it round my department, especially if the same suppliers can set me up with a 'Wounded Dignity Gown' to alternate it with.

But the chart above is nothing to do with my well-being. It's a Google Ngram, off Google Books. I just discovered these last week, when following an OED request for information into user responses. Someone had tracked a pre-dating (I think it might have been for 'Ironman Triathlon') using Ngrams, and I followed that up, not having heard of them.

Ngrams use the corpus of digitised books, and will plot you a graph for frequency of occurrence of one word against another within the corpus. As all graphs would otherwise just show a fierce left to right ascent, the plotting is made proportional to the number of books being published within the period.
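For anyone who wants to see why that adjustment matters, here is a minimal sketch in Python. It is my own illustration and not Google's code: the function name `relative_frequency` is made up, all the figures are invented, and I am assuming the adjustment amounts to dividing each year's count for a word by the total amount printed in that year.

```python
def relative_frequency(word_counts, totals):
    """Divide each year's count for a word by the total printed that year.

    Both arguments are dicts keyed by year. This is an assumed, simplified
    version of the normalisation, not the actual Ngram implementation.
    """
    return {
        year: word_counts.get(year, 0) / totals[year]
        for year in sorted(totals)
    }


# Invented figures, purely for illustration.
totals = {1610: 1_000_000, 1650: 4_000_000, 1700: 10_000_000}
witch = {1610: 50, 1650: 400, 1700: 300}

rel = relative_frequency(witch, totals)

for year, share in rel.items():
    print(year, witch[year], f"{share:.6f}")
```

In these made-up numbers the raw count for 1700 is six times that for 1610, yet the word's share of everything printed has fallen. That is the difference between the fierce left to right ascent and a line that can actually go down.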

The results become more convincing if one plots related words in different graphs. My first effort, above, plotted 'witch' against 'conjurer', 1600 to 1800. It seems to indicate some quiet years after 1610, with not much chatter in print about the topic, and then, towards the end of that decade, a peaking concern (I wondered if it might be reflecting the Overbury case). Then, irregular peaks of concern between 1630 and 1650. There's a late 17th century minor peak (I wondered if one might see in that the late flurry of people like Glanvill asserting the existence of witchcraft). Then a splendidly rational 18th century, before Gothic Romanticism brings it all back in.

These tentative results look a bit more convincing when a related graph looks quite similar. Here's 'witchcraft' plotted against 'magic':

Somehow, and perhaps it isn't entirely an artefact from the sample, the quiet years after 1610 show up, then the late decade rise, things going quiet after King James' death, until the 1630s take off again (Lancashire? Loudun?). Irregular peaks thereafter, the 1670s and 1680s still quite strong, an enlightened 18th century, and then Gothicism.

My third try just plotted 'witch' against 'devil': it merely shows the predominance of devil talk over witch talk:

Maybe the late 17th century peak in devil talk is as dissenting literature gets more widely published...?

The Renaissance course I teach on is called 'Love, Honour, Obey: Literature 1525-1660'. So I ran the key terms, this time between 1600 and 2000, with this result:

'Love', we see, goes up and down ("the way it does", as a colleague wittily remarked when I was showing this round). There may be a Cavalier peak, and a Restoration one. 'Honour' does well through the Stuart years, and then shows very solidly as the novel gets going in the late 18th century (a range established around the Richardson peak, as it were). 'Obey' is the interesting line. It's far less common as a word, but there just may be a peak around the end of the Civil War. Gradually, society loses interest in the idea, as people also did with 'honour'.

Are the results of this data-mining real? One obviously has to be cautious. I do not know if Google have had access to the work of the Text Creation Partnership, which is working through EEBO. When that happens, and also when all texts have been digitised, then the true detail will emerge. But this looks completely convincing to me:

This is 'prophecy' plotted against 'throne'. I couldn't use 'king', as that word appears so often that 'prophecy' gets flattened out. Anyway, just look at the peak in 'prophecy' before the Restoration, and the steep drop-off in interest once Charles II is installed. That looks like a very plausible result to me.
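The flattening is just a matter of two lines sharing one vertical axis. A rough sketch in Python, with invented numbers and a helper I have called `scale_to_own_peak` (my own name, nothing to do with the Ngram viewer), shows the effect and one possible way round it: scaling each word to its own highest value before comparing shapes.

```python
def scale_to_own_peak(series):
    """Rescale a {year: value} series so that its own maximum becomes 1.0."""
    peak = max(series.values())
    if peak == 0:
        raise ValueError("series has no non-zero values to scale by")
    return {year: value / peak for year, value in series.items()}


# Invented relative frequencies, purely for illustration.
king = {1640: 900.0, 1650: 1000.0, 1670: 800.0}
prophecy = {1640: 8.0, 1650: 10.0, 1670: 2.0}

# On a shared axis, the axis height is set by the more common word.
axis_top = max(king.values())
shared_axis = {year: value / axis_top for year, value in prophecy.items()}

# Scaled to its own peak, the rare word's rise and fall becomes visible.
own_axis = scale_to_own_peak(prophecy)

print(shared_axis)
print(own_axis)
```

On the shared axis the rarer word never gets above one hundredth of the chart's height, so its peak and drop-off are invisible. Scaled to its own peak, the same figures show a fall to a fifth of the maximum. Choosing a less common comparison word, as I did with 'throne', gets a similar result without any rescaling.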
