Can I change the X axis to be something other than the years from 1000 CE to 2016 CE?
Sorry! That's the #1 suggestion everyone makes, and I didn't get that feature done in time for this initial release.
Where did this content come from?
When the Internet Archive scans books, we scan them as images, and then use OCR (optical character recognition) to find the text of the words on each page. We use this text, for example, to make books available to our patrons who have a difficult time reading text. OCR is an imperfect process, so most of the incorrect and odd-looking words you see in this experiment are probably the result of OCR, and not mistakes in the original book.
How did you choose the 82,000 books in the index?
I created a ranking system for books, based on data such as the publisher of a book: books from academic publishers received higher ranks than books from vanity presses. I also used data such as citations in Wikipedia and some book sales data from Better World Books, a used-book seller that partners with the Internet Archive. Using these ranks, I then picked the 82,000 books in our collections with the highest ranks. The sentences displayed in the results are also sorted using these ranks.
How does this relate to Google's Ngram Viewer?
Google's Ngram Viewer uses the publication date of the entire book as the date in its graph. This visualization instead uses dates found in the same sentence as the "thing" you searched for.
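To make the difference concrete, here is a minimal sketch of pulling candidate years out of a single sentence. It is only an illustration: the function name, the regular expression, and the 1000 to 2016 limits (borrowed from the graph's x-axis) are assumptions, not the date parser actually used by this experiment.

```python
import re

# Assumed pattern: any standalone 4-digit number is treated as a candidate year.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")

def years_in_sentence(sentence, first=1000, last=2016):
    """Return the 4-digit years mentioned in a sentence, in order of appearance.

    Hypothetical helper: numbers outside [first, last] are discarded.
    """
    candidates = (int(match) for match in YEAR_PATTERN.findall(sentence))
    return [year for year in candidates if first <= year <= last]

print(years_in_sentence("Newton published the Principia in 1687 and died in 1727."))
```

A sentence-level approach like this dates each mention individually, whereas a publication-date approach assigns one date to everything in the book.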
Can you talk a little about the technical aspects of this system?
Sure! The year graphs for each thing, and the sentence lists for each thing/date pair, are stored in a fully denormalized database. This means that any user interaction with the visualization needs only 1 call to the server, which needs to perform only 1 disk read to reply. The denormalized data trades extra storage for speed. A single server ought to be able to handle 20 queries per second from a regular disk (spinning rust), or as many as 10,000 queries per second from SSD.
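The idea of full denormalization can be sketched roughly as follows. This is an in-memory illustration only, not the actual storage engine: the key layout, the function names, and the shape of the input tuples are all assumptions, as is the choice to list the highest-ranked sentences first.

```python
from collections import Counter, defaultdict

def build_denormalized_store(mentions):
    """Precompute every answer the visualization can ask for.

    mentions: iterable of (thing, year, sentence, rank) tuples (assumed shape).
    Returns a dict in which each possible request maps to one ready-made value.
    """
    graphs = defaultdict(Counter)
    sentences = defaultdict(list)
    for thing, year, sentence, rank in mentions:
        graphs[thing][year] += 1
        sentences[(thing, year)].append((rank, sentence))

    store = {}
    for thing, counts in graphs.items():
        # One record per thing: the complete year graph.
        store[("graph", thing)] = dict(counts)
    for (thing, year), ranked in sentences.items():
        # One record per thing/date pair, sorted by rank (assumed: highest first).
        ranked.sort(key=lambda pair: pair[0], reverse=True)
        store[("sentences", thing, year)] = [text for _, text in ranked]
    return store

def lookup(store, *key):
    """A single key lookup, standing in for the single disk read per request."""
    return store.get(key)

mentions = [
    ("Isaac Newton", 1687, "A", 1.0),
    ("Isaac Newton", 1687, "B", 2.0),
    ("Isaac Newton", 1727, "C", 0.5),
]
store = build_denormalized_store(mentions)
print(lookup(store, "graph", "Isaac Newton"))
print(lookup(store, "sentences", "Isaac Newton", 1687))
```

All counting and sorting happens once at build time, so serving a request involves no joins or aggregation. The cost is that the same sentence is stored under every thing/date pair it belongs to.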
The database takes 45 gigabytes for 82,000 books. I expect to eventually index around 10 million books, which scales to a predicted 6 terabyte database.
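As a rough check on that projection, the arithmetic can be written out as below. It assumes the database grows linearly with the number of books and uses decimal units (1 TB = 1000 GB); both are simplifying assumptions, and the variable names are only illustrative.

```python
# Figures quoted above.
BOOKS_INDEXED = 82_000
DB_SIZE_GB = 45
TARGET_BOOKS = 10_000_000

# Assumption: storage per book stays constant as the index grows.
gb_per_book = DB_SIZE_GB / BOOKS_INDEXED
projected_gb = gb_per_book * TARGET_BOOKS
projected_tb = projected_gb / 1000  # decimal terabytes

print(f"{gb_per_book * 1000:.2f} MB per book")
print(f"{projected_tb:.1f} TB for {TARGET_BOOKS:,} books")
```

Under these assumptions the estimate comes out at roughly 5.5 TB, which is in line with the prediction of about 6 terabytes.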
What are some bugs in the current version?
The x-axis of this visualization isn't dynamic, which leads to an inability to see dates before 1000 AD/CE, and also very narrow bars in the graph at all times, even for "things" which only have sentences with a very limited range of dates.
The sentence splitter isn't very smart, and often will show 2 or 3 sentences run together. Occasionally this means that the date isn't very relevant to the "thing" you're interested in.
Dates in the BC/BCE range of years are incorrectly parsed as if they were AD/CE, so a search for [Julius Caesar] will have an inaccurate graph. (Fortunately, the bug mentioned above regarding the x-axis prevents you from seeing this!)
The counts in the year graph are frequently a bit higher than the actual number of results, because some sentences were double-counted in the year graph. In other cases, identical sentences from multiple editions of the same book were suppressed, which lowers the number of visible sentences.
Some extremely similar sentences still appear in the results. These are mostly caused by slight OCR-related differences.
Searches for things which do not have Wikipedia articles will always return 0 results, by design. We hope that a future iteration of this experiment will use a library such as Stanford NER to extract all "things" mentioned in a book.
Mentions of decades (1920s) and centuries (1900s) are treated as a reference to the start year, resulting in odd-looking graphs for things which have a lot of decade and century references.
The "thing" isn't always boldfaced in the sentence, due to various issues. Sometimes it can be difficult to figure out what's going on if the match is a synonym. For example, "Gregorian Calendar" and "Western calendar" are considered synonyms in the Wikipedia