Sunday, January 27, 2013
Data mining for witches with Ngrams
Frustrated by this, and facing more months of the cataract getting worse before the NHS could schedule me, last Tuesday I had the procedure done privately, at Reading's weirdly luxurious new Circle Hospital. This has the ambience of a brand new hotel, an illusion only spoiled by the occasional medico shuffling past in ward pyjamas and plastic clogs. Art works are everywhere, and, astonishingly, a young woman with a concert harp was playing to relax those in the lounge area. At no point was she summoned up the lift into theatre to assist in giving a client-customer a reassuring departure, but it seemed a disturbing possibility ("Emergency, harpist to theatre 4, please!"). I had my original operation wearing beneath my plastic shroud my jacket, shirt and tie as usual; for this far less major procedure, Circle had me don a 'Patient Dignity Gown'. I thought this suited me a lot, and I will take to wearing it round my department, especially if the same suppliers can set me up with a 'Wounded Dignity Gown' to alternate it with.
But the chart above is nothing to do with my well-being. It's a Google Ngram, from Google Books. I only discovered these last week, when following up user responses to an OED request for information. Someone had tracked a pre-dating (I think it might have been for 'Ironman Triathlon') using Ngrams, and I followed that up, not having heard of them before.
Ngrams use the corpus of digitised books, and will plot you a graph of the frequency of occurrence of one word against another within the corpus. As all graphs would otherwise just show a fierce left-to-right ascent, the plotting is made proportional to the number of books being published within the period.
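That normalisation step can be sketched in a few lines of Python. This is only a toy illustration of the principle (counts divided by the size of each year's print output), with invented figures; it is not Google's actual data or pipeline:

```python
# Toy sketch of Ngram-style normalisation: raw word counts are divided
# by the total volume of print for each year, so the curve shows
# relative frequency rather than the sheer growth of publishing.
# All figures below are invented for illustration only.

raw_counts = {            # hypothetical occurrences of 'witch' per year
    1600: 120, 1610: 80, 1620: 210, 1630: 150,
}
totals = {                # hypothetical total words printed per year
    1600: 2_000_000, 1610: 2_500_000, 1620: 3_000_000, 1630: 3_500_000,
}

def relative_frequency(counts, totals):
    """Return {year: count/total} - the proportion a graph would plot."""
    return {year: counts[year] / totals[year] for year in counts}

freqs = relative_frequency(raw_counts, totals)
for year, f in sorted(freqs.items()):
    print(f"{year}: {f:.2e}")
```

On this toy data the 1620 spike survives normalisation (more 'witch' talk even allowing for more books), which is exactly the kind of judgement the proportional plotting makes possible.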
The results become more convincing if one plots related words in different graphs. My first effort, above, plotted 'witch' against 'conjurer', 1600 to 1800. It seems to indicate some quiet years after 1610, with not much chatter in print about the topic, and then, towards the end of that decade, a peaking concern (I wondered if it might be reflecting the Overbury case). Then, irregular peaks of concern between 1630 and 1650. There's a minor late 17th-century peak (I wondered if one might see in that the late flurry of people like Glanville asserting the existence of witchcraft). Then a splendidly rational 18th century, before Gothic Romanticism brings it all back in.
These tentative results look a bit more convincing when a related graph looks quite similar. Here's 'witchcraft' plotted against 'magic':
My third try just plotted 'witch' against 'devil': it merely shows the predominance of devil talk over witch talk:
Maybe the late 17th century peak in devil talk is as dissenting literature gets more widely published...?
The Renaissance course I teach on is called 'Love, Honour, Obey: Literature 1525-1660'. So I ran the key terms, this time between 1600 and 2000, with this result:
Are the results of this data-mining real? One obviously has to be cautious. I do not know if Google have had access to the work of the digital Text Creation Partnership that's working through EEBO. When that happens, and when all the texts have been digitised, the true detail will emerge. But this looks completely convincing to me: