Culturomics: Books as Raw Material to Study Cultural Tendencies
When Google and Harvard work together, a new scientific field is born: culturomics. A team from Harvard University, led by Jean-Baptiste Michel and Erez Lieberman Aiden, is analyzing the data from the millions of books Google has scanned since 2004. Together with over 40 university libraries, the internet titan has so far scanned over 15 million books, creating a massive electronic library that represents 12% of all the books ever published.
The goal is to count how often specific words or strings of words recur over a given period, and to use the resulting numbers to infer cultural tendencies in a society.
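To make the idea concrete, here is a minimal sketch of counting a word or string of words per year. It is not the team's actual pipeline: the toy corpus, the simple tokenizer, and the function names `tokenize` and `phrase_counts_by_year` are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phrase_counts_by_year(corpus, phrase):
    """Count how often a word or string of words occurs in each year.

    corpus: iterable of (year, text) pairs, a hypothetical stand-in
    for the scanned books.
    """
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return {}
    counts = Counter()
    for year, text in corpus:
        tokens = tokenize(text)
        # Slide a window of n tokens over the text and compare it to the phrase.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                counts[year] += 1
    return dict(counts)


# Invented sample texts, one per year, for illustration only.
toy_corpus = [
    (1900, "The steam engine changed the world. The steam engine was loud."),
    (1950, "The jet engine replaced the steam engine on many routes."),
    (2000, "The search engine changed how people find books."),
]

steam = phrase_counts_by_year(toy_corpus, "steam engine")
engine = phrase_counts_by_year(toy_corpus, "engine")
```

In this toy data, "steam engine" fades out while "engine" persists, which is the kind of signal the counts are meant to surface.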
Though Google began scanning books in 2004, the culturomics project began in 2007, following the publication of a paper showing that verbs become more regular over time: the past forms of many irregular verbs have taken on the standard “-ed” suffix, in a way that fits a startlingly simple mathematical formula. Michel and Aiden then realized that they could use the mass of data being gathered and digitized by Google to study the evolution of culture, which led to the birth of culturomics.
Getting Google involved wasn’t hard. Lieberman Aiden recalls: “From the earliest stage, they realized that this had a lot of potential. We talked with them, the door was opened to advance the project some, we showed results, the door opened further. Eventually the door was just open.”
The team eventually worked with a third of the full corpus, selecting the books that were dated most accurately and scanned most crisply. They ended up with over 5 million books published in English, French, Spanish, German, Chinese, Russian and Hebrew, dating back to the 1500s. Together, the texts include 500 billion words. According to Michel: “The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words per minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome.”
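As a quick sanity check on that quote, the arithmetic can be worked through in a few lines. The word total it produces is only what the quoted reading pace and duration imply, not a figure given by the team, and the calculation assumes 365-day years.

```python
# Figures taken from the quote above.
WORDS_PER_MINUTE = 200
YEARS_OF_READING = 80

# Assumption: 365-day years, reading around the clock.
MINUTES_PER_YEAR = 365 * 24 * 60

# Words a non-stop reader would get through in eighty years.
words_read = WORDS_PER_MINUTE * MINUTES_PER_YEAR * YEARS_OF_READING

# Share of the 500-billion-word corpus that this represents.
CORPUS_WORDS = 500_000_000_000
share_of_corpus = words_read / CORPUS_WORDS

print(f"{words_read:,} words, or {share_of_corpus:.2%} of the corpus")
```

In other words, the entries for the year 2000 alone would come to roughly 8.4 billion words, a small fraction of the 500 billion in the full corpus.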
Copyright limitations prevent the team from publishing the full texts, but some of the data is available at www.culturomics.org. For non-scientists curious about how the frequency of a word has evolved, Google’s real-time browser lets anyone look it up.
This is the kind of surprising result one comes across.
Below are the tables showing the frequency of the words “book” and “ebook” between 1800 and 2000.
As you can see, the word “book” appears more and more often as its use spreads, which makes sense. (The data are weighted to reflect the relative number of occurrences, which avoids the bias caused by the increasing quantity of printed material available.)
More surprising is the behavior of the word “ebook”, which seems to appear as early as the beginning of the 19th century. So though the ebook itself is only 40 years old, the idea is much older.
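The weighting mentioned above can be illustrated with a small sketch. It assumes the simplest possible scheme, dividing each year's count by the total number of words printed that year; the article does not spell out the exact method, and the numbers and the function name `relative_frequency` are invented for illustration.

```python
def relative_frequency(word_counts, total_words):
    """Divide each year's raw count by the total words printed that year."""
    return {
        year: word_counts.get(year, 0) / total_words[year]
        for year in sorted(total_words)
        if total_words[year] > 0
    }


# Made-up numbers: raw occurrences of a word, and all words printed, per year.
raw_counts = {1800: 50, 1900: 400, 2000: 3000}
total_words = {1800: 100_000, 1900: 1_000_000, 2000: 10_000_000}

freq = relative_frequency(raw_counts, total_words)

for year, value in freq.items():
    print(year, raw_counts[year], f"{value:.4%}")
```

With these invented figures the raw count climbs from 50 to 3,000, yet the relative frequency falls, which is exactly the bias that weighting is meant to remove: more printed material means more occurrences of almost every word.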