de.aitools.aq.invertedindex.usage
Class GoogleBooks

java.lang.Object
  extended by de.aitools.aq.invertedindex.usage.GoogleBooks

public class GoogleBooks
extends java.lang.Object

Demonstrates how to build an inverted index directly from the zip-compressed CSV files of the Google Books N-Gram collection.

Because the original dataset tracks the frequency of n-grams over several years, we want to map each n-gram to its sequence of year/frequency tuples. To do so, the underlying code extracts one n-gram/year/frequency triple from each line, stores this data in a record, and puts the record into an instance of Indexer.
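
The mapping described above can be illustrated with a minimal sketch. It is not the code of this class and it does not use Indexer; the class NgramLineSketch, the helper type YearFreq, and the method group are hypothetical names introduced for illustration only. The sketch assumes that each line is tab-separated and that the first three columns are the n-gram, the year, and the frequency; any further columns are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: groups n-gram/year/frequency triples by n-gram. */
final class NgramLineSketch {

    /** One year/frequency tuple (hypothetical helper type). */
    static final class YearFreq {
        final int year;
        final long frequency;

        YearFreq(int year, long frequency) {
            this.year = year;
            this.frequency = frequency;
        }
    }

    /**
     * Parses each line into an n-gram/year/frequency triple and appends the
     * year/frequency tuple to the list stored under its n-gram.
     * Assumed column layout (tab-separated): ngram, year, frequency, ...
     */
    static Map<String, List<YearFreq>> group(Iterable<String> lines) {
        Map<String, List<YearFreq>> byNgram = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                continue; // skip lines that do not contain a full triple
            }
            String ngram = cols[0];
            int year = Integer.parseInt(cols[1]);
            long frequency = Long.parseLong(cols[2]);
            byNgram.computeIfAbsent(ngram, k -> new ArrayList<>())
                   .add(new YearFreq(year, frequency));
        }
        return byNgram;
    }
}
```

In this sketch, every map entry corresponds to one record: the key is the n-gram and the value is its sequence of year/frequency tuples in the order the lines were read.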

Note: There appears to be a mistake in Google's n-gram encoding. We found that certain n-gram files also contain n-grams of length (n-k), and that these shortened n-grams generate duplicates. Make sure that such n-grams are filtered out when parsing the zip files.
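
One possible way to filter out the shortened n-grams is to count the tokens of each n-gram and keep only those with exactly n tokens. The following sketch is an assumption about how such a check could look, not the filter used by this class; NgramLengthFilterSketch, tokenCount, and hasLength are hypothetical names, and the sketch assumes that the tokens of an n-gram are separated by whitespace.

```java
/** Hypothetical illustration: rejects n-grams whose token count differs from n. */
final class NgramLengthFilterSketch {

    /** Counts the whitespace-separated tokens of an n-gram string. */
    static int tokenCount(String ngram) {
        String trimmed = ngram.trim();
        // An empty string would otherwise split into one empty token.
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Returns true if the n-gram has exactly n tokens and should be kept. */
    static boolean hasLength(String ngram, int n) {
        return tokenCount(ngram) == n;
    }
}
```

When parsing a file of 3-grams, for example, a line would be skipped whenever hasLength(ngram, 3) returns false, so that n-grams of length (n-k) never reach the index.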

Version:
$Id: GoogleBooks.java,v 1.1 2011/04/10 16:41:25 trenkman Exp $
Author:
martin.trenkmann@uni-weimar.de

Constructor Summary
GoogleBooks()
           
 
Method Summary
static void index(java.io.File ngramDir, java.io.File indexDir)
           
static void main(java.lang.String[] args)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GoogleBooks

public GoogleBooks()

Method Detail

index

public static final void index(java.io.File ngramDir,
                               java.io.File indexDir)
                        throws java.util.zip.ZipException,
                               java.io.IOException
Throws:
java.util.zip.ZipException
java.io.IOException

main

public static void main(java.lang.String[] args)