de.aitools.aq.bighashmap.usage
Class GoogleBooks
java.lang.Object
de.aitools.aq.bighashmap.usage.GoogleBooks
public class GoogleBooks
extends java.lang.Object
A class to demonstrate how to build a BigHashMap directly from the zip-compressed CSV files of the Google Books N-Gram collection.
Unlike the original dataset, which tracks the frequency of each n-gram over
several years, we want to map each n-gram to exactly one frequency value.
To do so, the underlying code first sums the frequencies given for
an n-gram and then writes this sum together with the n-gram as one record.
This task succeeds only if the following issues are handled properly:
- Since the BigHashMap has no built-in Unicode support, you have
to apply ToAsciiEncoder.encode(String) to the n-gram strings when
extracting records from the original data.
- Google appears to have made a mistake in its n-gram encoding. We found
that certain n-gram files also contain n-grams of length (n-k), i.e. shorter
than n, and that these shortened n-grams generate duplicates. Make sure that
such n-grams are filtered out when parsing the zip files.
- As mentioned above, the frequency of each n-gram is accumulated first
over all years available and then emitted as one key/value pair. Please take
care to write exactly one record for each n-gram, to keep your key set
(the set of all n-grams) unique.
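The filtering and accumulation steps above can be sketched as follows. This is a minimal illustration, not the code behind createRecordFiles: the class NGramAccumulatorSketch and its method accumulate are hypothetical, the line layout (n-gram, year, match count, separated by tabs) is an assumption about the CSV files, and the length check assumes that tokens are separated by single spaces. The encoding step is only marked by a comment, because ToAsciiEncoder is not available in a standalone example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NGramAccumulatorSketch {

    /**
     * Sums the per-year counts of every n-gram that has exactly n tokens.
     * Assumed line layout: ngram TAB year TAB match_count [TAB further columns].
     */
    public static Map<String, Long> accumulate(Iterable<String> lines, int n) {
        // A map keyed by n-gram yields exactly one total per key.
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip malformed rows
            }
            String ngram = fields[0].trim();
            // Drop shortened n-grams: keep only keys with exactly n tokens.
            if (ngram.isEmpty() || ngram.split(" +").length != n) {
                continue;
            }
            long count;
            try {
                count = Long.parseLong(fields[2].trim());
            } catch (NumberFormatException e) {
                continue; // skip rows without a numeric count
            }
            // In the real pipeline the key would first be passed through
            // ToAsciiEncoder.encode(String); omitted to stay self-contained.
            totals.merge(ngram, count, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "new york\t1999\t10\t8\t5",
            "new york\t2000\t15\t12\t7",
            "york\t2000\t3\t3\t2",      // shortened n-gram, filtered out for n = 2
            "big apple\t2000\t4\t4\t3");
        // Emit one tab-separated record per n-gram.
        for (Map.Entry<String, Long> e : accumulate(sample, 2).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Holding all totals in memory is only practical for small inputs; for the full collection a streaming variant that relies on the input order would be needed, which this sketch does not attempt.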
- Version:
- $Id: GoogleBooks.java,v 1.4 2011/04/09 23:22:46 trenkman Exp $
- Author:
- martin.trenkmann@uni-weimar.de
Method Summary
static void createRecordFiles(java.io.File srcDir, java.io.File desDir)
static void main(java.lang.String[] args)
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
GoogleBooks
public GoogleBooks()
createRecordFiles
public static final void createRecordFiles(java.io.File srcDir,
java.io.File desDir)
throws java.util.zip.ZipException,
java.io.IOException
- Throws:
java.util.zip.ZipException
java.io.IOException
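Reading the lines of a zip-compressed CSV file, the kind of input this method consumes, might look roughly like the sketch below. It uses only the standard java.util.zip API and is not the implementation of createRecordFiles; the class ZipLinesSketch and its method readLines are hypothetical names, and UTF-8 is assumed as the file encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipLinesSketch {

    /** Returns the text lines of all file entries in the given zip archive. */
    public static List<String> readLines(File zipFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(zipFile.toPath());
             ZipInputStream zip = new ZipInputStream(raw)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // One reader per entry; it is deliberately not closed, because
                // closing it would also close the underlying zip stream.
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ZipLinesSketch <file.zip>");
            return;
        }
        for (String line : readLines(new File(args[0]))) {
            System.out.println(line);
        }
    }
}
```

Malformed archives surface as java.util.zip.ZipException, which is a subclass of java.io.IOException, so a caller that catches IOException handles both.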
main
public static void main(java.lang.String[] args)