Synopsis

Writing in a foreign language is difficult, even for an experienced author. Problems include choosing the right word or preposition in a given context, finding a wording that is commonly used, and avoiding grammatical forms that reflect the author's native language. The Netspeak Web service helps authors overcome these problems by using the World Wide Web as a source of common language. The service can be queried with short text phrases to determine how customary they are on the Web. Wildcard characters can be added to a query to search for variations and synonyms of the query phrase, which are returned as a list ranked by their occurrence frequency on the Web.
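The idea of a wildcard query ranked by frequency can be illustrated with a small local sketch. This is not the Netspeak API: the function name `rank_matches`, the use of `?` as a placeholder for exactly one word, and the phrase counts are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy counts, invented for this example (not Web 1T data).
TOY_COUNTS: Dict[str, int] = {
    "waiting for you": 900,
    "waiting on you": 300,
    "waiting to you": 5,
    "waiting for him": 400,
}


def rank_matches(query: str, counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return phrases matching the query, most frequent first.

    Assumption: "?" in the query stands for exactly one arbitrary word.
    """
    pattern = query.split()
    hits = []
    for phrase, freq in counts.items():
        words = phrase.split()
        # A phrase matches if it has the same length and every
        # non-wildcard position agrees with the query word.
        if len(words) == len(pattern) and all(
            p == "?" or p == w for p, w in zip(pattern, words)
        ):
            hits.append((phrase, freq))
    # Rank by occurrence frequency, highest first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


ranked = rank_matches("waiting ? you", TOY_COUNTS)
for phrase, freq in ranked:
    print(f"{freq:>6}  {phrase}")
```

With these invented counts, the query "waiting ? you" ranks "waiting for you" ahead of "waiting on you" and "waiting to you", which is the kind of evidence an author could use to pick a preposition.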

Netspeak indexes the complete "Web 1T 5-gram Version 1" corpus as a source of common language on the Web. The corpus comprises about 3.8 billion phrases of up to 5 words: 13,588,391 unigrams, 314,843,401 bigrams, 977,069,902 trigrams, 1,313,818,354 fourgrams, and 1,176,470,663 fivegrams. The corpus was derived from an extremely large collection of 1,024,908,267,229 tokens, which allowed for high frequency cutoffs (see the corpus developer's readme.txt): only tokens (words, numbers, and punctuation) appearing 200 times or more (1 in 5 billion) and N-grams appearing 40 times or more (1 in 25 billion) were kept. [api] [service] [video]
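The rounded figures quoted for the corpus can be checked against the exact counts with a few lines of arithmetic. The variable names below are chosen for this sketch only; the numbers are the ones stated in the text.

```python
# n-gram counts by length, as stated in the corpus description.
NGRAM_COUNTS = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}
TOTAL_TOKENS = 1_024_908_267_229

# Total number of indexed phrases (the "about 3.8 billion").
total_phrases = sum(NGRAM_COUNTS.values())

# Frequency cutoffs expressed as "1 in N tokens".
token_cutoff_ratio = TOTAL_TOKENS / 200  # tokens kept at 200+ occurrences
ngram_cutoff_ratio = TOTAL_TOKENS / 40   # n-grams kept at 40+ occurrences

print(f"total phrases: {total_phrases:,}")
print(f"token cutoff:  1 in {token_cutoff_ratio / 1e9:.1f} billion")
print(f"n-gram cutoff: 1 in {ngram_cutoff_ratio / 1e9:.1f} billion")
```

The sum comes to 3,795,790,711 phrases, and the two ratios are roughly 5.1 and 25.6 billion, consistent with the rounded "1 in 5 billion" and "1 in 25 billion" given above.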

People

Publications