public class GoogleBooks
extends java.lang.Object
A class to demonstrate how to build a
BigHashMap directly from
the zip-compressed csv-files of the
Google Books N-Gram dataset.
Unlike the original dataset, which tracks the frequency of each n-gram over
several years, we want to map each n-gram to exactly one frequency value.
To do so, the underlying code first accumulates the frequencies given for
an n-gram and then writes this sum together with the n-gram as one record.
It turns out that this task can only be accomplished successfully if
the following issues are handled properly.
- Since the
BigHashMap has no built-in Unicode support, you have
to apply ToAsciiEncoder.encode(String) to the n-gram strings when
extracting records from the original data.
- It seems that Google made a mistake in their n-gram encoding. We found
out that certain n-gram files also contain n-grams of length (n-k), and that
these shortened n-grams generate duplicates. You have to make sure that such
n-grams are filtered out when parsing the zip-files.
- As mentioned above, the frequency of each n-gram is first accumulated
over all available years and then emitted as one key/value pair. Please take
care to write exactly one record for each n-gram, so that your key set
(the set of all n-grams) stays unique.
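The three rules above can be sketched as follows. This is an illustrative
helper, not part of the actual GoogleBooks class: the tab-separated field
layout (ngram, year, match_count, ...) follows the published Google Books
N-Gram csv format, and the class and method names here are assumptions.
Unicode transliteration (the ToAsciiEncoder step) is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NgramAccumulator {

    /**
     * Accumulates the per-year counts of each n-gram into one total,
     * dropping the shortened (n-k)-grams that would create duplicates.
     */
    public static Map<String, Long> accumulate(Iterable<String> csvLines, int n) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) continue;            // skip malformed lines
            String ngram = fields[0];
            // Rule 2: filter n-grams that do not have exactly n tokens.
            if (ngram.split(" ").length != n) continue;
            long count = Long.parseLong(fields[2]);     // match_count column
            // Rule 3: merge all years into a single value per n-gram,
            // so the map holds exactly one record per key.
            totals.merge(ngram, count, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> m = accumulate(java.util.List.of(
                "ice cream\t1990\t10\t5",
                "ice cream\t1991\t7\t3",
                "ice\t1990\t99\t9"),    // a 1-gram in a 2-gram file: filtered
                2);
        System.out.println(m);          // {ice cream=17}
    }
}
```

Emitting the map entries afterwards yields the one-record-per-n-gram stream
that the BigHashMap build step expects.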
- $Id: GoogleBooks.java,v 1.4 2011/04/09 23:22:46 trenkman Exp $
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public static final void createRecordFiles(java.io.File srcDir,
public static void main(java.lang.String[] args)