A class that demonstrates how to build an inverted index directly from the zip-compressed CSV files of the Google Books N-Gram collection.

Since the original dataset tracks the frequency of n-grams over several years, we want to map each n-gram to its sequence of year/frequency tuples. To do so, the underlying code extracts one n-gram/year/frequency triple from each line, stores the triple in a record, and puts this record into an instance of Indexer.
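
The per-line extraction described above could be sketched roughly as follows. This is only an illustration, not the actual implementation: it assumes each line is tab-separated with the n-gram, the year, and the frequency in the first three columns, and it uses a plain Map as a stand-in for Indexer. The names NGramLineSketch, YearCount, and addLine are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NGramLineSketch {

    /** Hypothetical holder for one year/frequency tuple. */
    static final class YearCount {
        final int year;
        final long count;

        YearCount(int year, long count) {
            this.year = year;
            this.count = count;
        }
    }

    /**
     * Parses one line and appends its year/frequency tuple to the list
     * stored under the line's n-gram. The map stands in for Indexer.
     */
    static void addLine(String line, Map<String, List<YearCount>> index) {
        // Assumed layout: ngram TAB year TAB frequency [TAB further columns]
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            return; // skip lines that do not carry a full triple
        }
        String ngram = fields[0];
        int year = Integer.parseInt(fields[1].trim());
        long count = Long.parseLong(fields[2].trim());
        index.computeIfAbsent(ngram, k -> new ArrayList<>())
             .add(new YearCount(year, count));
    }

    public static void main(String[] args) {
        Map<String, List<YearCount>> index = new LinkedHashMap<>();
        addLine("hello world\t1999\t12\t10\t5", index);
        addLine("hello world\t2000\t20\t15\t7", index);
        // Print the year/frequency sequence collected for one n-gram.
        for (YearCount yc : index.get("hello world")) {
            System.out.println(yc.year + " -> " + yc.count);
        }
    }
}
```

Keeping the tuples in insertion order preserves the chronological order of the input, provided the files list the years of each n-gram in ascending order, which is also an assumption here.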

Note: It seems that Google made a mistake in their n-gram encoding. We found that certain n-gram files also contain n-grams of length (n-k), and that these shortened n-grams generate duplicates. Make sure that such n-grams are filtered out when parsing the zip files.
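
One possible way to filter out the shortened n-grams is to count the tokens of each n-gram and keep only those with exactly n tokens. The sketch below assumes that tokens are separated by whitespace. The names NGramLengthFilter and hasLength are hypothetical and not part of the class described here.

```java
final class NGramLengthFilter {

    /**
     * Returns true if the given n-gram consists of exactly n
     * whitespace-separated tokens (assumed tokenization).
     */
    static boolean hasLength(String ngram, int n) {
        String trimmed = ngram.trim();
        if (trimmed.isEmpty()) {
            return false; // an empty string is not a valid n-gram
        }
        return trimmed.split("\\s+").length == n;
    }

    public static void main(String[] args) {
        // When parsing a 3-gram file, a 2-gram should be rejected.
        System.out.println(hasLength("the quick brown", 3));
        System.out.println(hasLength("the quick", 3));
    }
}
```

A parser would call such a check on each line before storing the record and skip every line for which it returns false.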

