de.aitools.aq.invertedindex.usage
Class GoogleBooks
java.lang.Object
de.aitools.aq.invertedindex.usage.GoogleBooks
public class GoogleBooks
- extends java.lang.Object
Demonstrates how to build an inverted index directly from the
zip-compressed CSV files of the Google Books N-Gram collection.
Because the original dataset tracks the frequency of n-grams over several
years, each n-gram is mapped to its sequence of year/frequency tuples.
To do so, the underlying code extracts one n-gram/year/frequency triple from
each line, stores it in a record, and puts this record into an
instance of Indexer.
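The mapping described above can be sketched as follows. This is only an illustration, not the code of this class: the names NgramLineSketch, YearFrequency and group are hypothetical, the Indexer API is not used, and the sketch assumes that each line is tab-separated with the n-gram in the first field, the year in the second, and the frequency in the third.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups n-gram/year/frequency triples by n-gram.
// It does not use Indexer and is not the implementation of GoogleBooks.
public class NgramLineSketch {

    /** One year/frequency tuple belonging to an n-gram. */
    static final class YearFrequency {
        final int year;
        final long frequency;

        YearFrequency(int year, long frequency) {
            this.year = year;
            this.frequency = frequency;
        }

        @Override
        public String toString() {
            return year + ":" + frequency;
        }
    }

    /**
     * Parses each line into an n-gram/year/frequency triple and collects
     * the year/frequency tuples per n-gram, in order of appearance.
     * Assumed line layout: ngram TAB year TAB frequency [TAB further fields].
     */
    static Map<String, List<YearFrequency>> group(Iterable<String> lines) {
        Map<String, List<YearFrequency>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip lines that do not carry a full triple
            }
            int year = Integer.parseInt(fields[1]);
            long frequency = Long.parseLong(fields[2]);
            result.computeIfAbsent(fields[0], key -> new ArrayList<>())
                  .add(new YearFrequency(year, frequency));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = Arrays.asList(
                "hello world\t1999\t12\t5\t3",
                "hello world\t2000\t15\t6\t4",
                "foo bar\t2000\t3\t1\t1");
        System.out.println(group(lines));
    }
}
```

In a real run the lines would come from the entries of the zip files rather than from an in-memory list, and each grouped record would be handed to the indexer instead of being printed.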
Note: Google's n-gram encoding appears to contain a mistake. We found
that certain n-gram files also contain n-grams of length (n-k) and that
these shortened n-grams generate duplicates. Make sure that such
n-grams are filtered out when parsing the zip files.
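One possible way to apply such a filter is to compare the token count of each parsed n-gram with the n expected for the file being read. The sketch below is an assumption, not the filter used by this class: the names NgramLengthFilter, tokenCount and hasExpectedLength are hypothetical, and it assumes that the tokens of an n-gram are separated by spaces.

```java
// Hypothetical sketch: rejects n-grams whose token count differs from n.
public class NgramLengthFilter {

    /** Counts the space-separated tokens of an n-gram (assumed delimiter). */
    static int tokenCount(String ngram) {
        String trimmed = ngram.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split(" +").length;
    }

    /** Returns true only if the n-gram consists of exactly n tokens. */
    static boolean hasExpectedLength(String ngram, int n) {
        return tokenCount(ngram) == n;
    }

    public static void main(String[] args) {
        int n = 3; // expected length for a file of 3-grams
        String[] candidates = {"new york city", "new york"};
        for (String ngram : candidates) {
            // A shortened n-gram of length (n-k) is reported as dropped.
            String verdict = hasExpectedLength(ngram, n) ? "keep" : "drop";
            System.out.println(verdict + ": " + ngram);
        }
    }
}
```

The check would run before a record is created, so that shortened n-grams never reach the index and cannot produce duplicates.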
- Version:
- $Id: GoogleBooks.java,v 1.1 2011/04/10 16:41:25 trenkman Exp $
- Author:
- martin.trenkmann@uni-weimar.de
Method Summary
static void index(java.io.File ngramDir, java.io.File indexDir)
static void main(java.lang.String[] args)
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
GoogleBooks
public GoogleBooks()
index
public static final void index(java.io.File ngramDir,
java.io.File indexDir)
throws java.util.zip.ZipException,
java.io.IOException
- Throws:
java.util.zip.ZipException
java.io.IOException
main
public static void main(java.lang.String[] args)