Synopsis

Writing in a foreign language is difficult, even for an experienced author. Problems include choosing the right word or preposition in a given context, finding a wording that is commonly used, and avoiding grammatical forms that reflect the author's native language. The Netspeak Web service helps authors overcome these issues by using the World Wide Web as a source of common language. The service can be queried with short text phrases to determine how customary they are on the Web. Wildcard characters can be added to a query to search for variations and synonyms of the query phrase, which are returned as a list ranked by their occurrence frequency on the Web. [service] [api] [video 1 2 3 4]
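The idea of a wildcard query ranked by frequency can be illustrated with a minimal local sketch. It does not call the Netspeak service or reproduce its API. The `?` symbol standing for exactly one word, the function names, and the toy counts are all assumptions made for illustration; the counts are invented and are not corpus values.

```python
from typing import Dict, List, Tuple

# Toy phrase counts, invented for illustration only (not real corpus values).
TOY_COUNTS: Dict[str, int] = {
    "waiting for you": 900,
    "waiting on you": 120,
    "waiting to you": 3,
    "waiting for him": 400,
}


def matches(query_tokens: List[str], phrase_tokens: List[str]) -> bool:
    """Return True if the phrase fits the query, where '?' matches any single word."""
    if len(query_tokens) != len(phrase_tokens):
        return False
    return all(q == "?" or q == p for q, p in zip(query_tokens, phrase_tokens))


def rank(query: str, counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return all phrases matching the query, most frequent first."""
    query_tokens = query.lower().split()
    hits = [
        (phrase, count)
        for phrase, count in counts.items()
        if matches(query_tokens, phrase.split())
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for phrase, count in rank("waiting ? you", TOY_COUNTS):
        print(f"{count:>6}  {phrase}")
```

With the toy data, the query "waiting ? you" lists "waiting for you" first, which is the kind of evidence an author could use to pick a preposition.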

Research

Netspeak indexes the complete "Web 1T 5-gram Version 1" corpus as a source of common language on the Web. The corpus comprises about 3.8 billion phrases of up to five words (so-called n-grams) that were collected by Google from the English Web. The following table shows the size of the corpus in detail:

n-grams   count            size (compressed)   size (uncompressed)
1-grams      13 588 391    70.2 MB             177.0 MB
2-grams     314 843 401    1.6 GB              5.0 GB
3-grams     977 069 902    5.5 GB              19.0 GB
4-grams   1 313 818 354    8.4 GB              30.5 GB
5-grams   1 176 470 663    8.8 GB              32.1 GB
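As a quick check of the "about 3.8 billion" figure, the per-length counts from the table can be summed. The short sketch below does only this arithmetic; the variable names are illustrative.

```python
# n-gram counts per length, copied from the table above.
NGRAM_COUNTS = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

total = sum(NGRAM_COUNTS.values())

# Prints the exact total and the same value in billions, rounded to one decimal.
print(f"{total:,} n-grams (about {total / 1e9:.1f} billion)")
# → 3,795,790,711 n-grams (about 3.8 billion)
```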

People

Publications