Synopsis

A large-scale corpus of over 153 million fully-segmented emails from 14,635 public mailing lists.

The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails, crawled from gmane.io between February and May 2019 and covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.

The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

{"index": {"_id": "<urn:uuid:c1d95e4b-0f43-46c7-a99e-c575d1d8e1ce>"}}
{"headers": {"header name": "header value", ...}, "text_plain": "plaintext body", "lang": "en", "segments": [{"end": 99, "label": "paragraph", "begin": 0}, ...], "group": "gmane group name"}

The first line is the Elasticsearch index action with a document UUID; the second is the parsed email with a reduced and anonymized set of headers, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files are splittable every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.
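As an illustration of the two-line record layout, the following minimal Python sketch reads one corpus file and yields each document ID together with its parsed email. The function name `iter_records` and the file name in the usage note are hypothetical, and the sketch assumes UTF-8 encoded files in which every action line is directly followed by its document line.

```python
import gzip
import json
from typing import Iterator, Tuple


def iter_records(path: str) -> Iterator[Tuple[str, dict]]:
    """Yield (document ID, parsed email) pairs from one Gzip-compressed bulk file.

    Assumes UTF-8 text and strictly alternating action/document lines.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Skip blank lines, then consume the remaining lines two at a time.
        lines = (line for line in f if line.strip())
        for action_line, doc_line in zip(lines, lines):
            action = json.loads(action_line)
            doc = json.loads(doc_line)
            yield action["index"]["_id"], doc
```

A call such as `for doc_id, email in iter_records("part-00000.gz"): ...` (file name made up for this example) then gives access to `email["headers"]`, `email["text_plain"]`, `email["lang"]`, `email["segments"]`, and `email["group"]`.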

Available email headers are:

  • message_id
  • date (ES 7.x: yyyy-MM-dd HH:mm:ssXXX, ES 6.8: yyyy-MM-dd HH:mm:ssZZ)
  • subject
  • from
  • to
  • cc
  • in_reply_to
  • references
  • list_id

Available segment classes are:

  • paragraph
  • closing
  • inline_headers
  • log_data
  • mua_signature
  • patch
  • personal_signature
  • quotation
  • quotation_marker
  • raw_code
  • salutation
  • section_heading
  • tabular
  • technical
  • visual_separator
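To show how the segment annotations relate to the plaintext body, here is a small Python sketch that groups the text of each segment by its class label. The helper name `segments_by_label` is hypothetical. The sketch assumes that `begin` and `end` are 0-based character offsets into `text_plain` with an exclusive `end`, and that these offsets line up with Python string indices; both assumptions should be checked against a few real records.

```python
from typing import Dict, List


def segments_by_label(doc: dict) -> Dict[str, List[str]]:
    """Group segment texts of one parsed email by their predicted class label.

    Assumes 'begin' is inclusive and 'end' is exclusive (not confirmed here).
    """
    text = doc["text_plain"]
    grouped: Dict[str, List[str]] = {}
    for seg in doc.get("segments", []):
        snippet = text[seg["begin"]:seg["end"]]
        grouped.setdefault(seg["label"], []).append(snippet)
    return grouped
```

With this, `segments_by_label(email).get("quotation", [])` would return all quoted passages of a message, which could then be filtered out before further processing.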

Compatibility Note:

The corpus was originally indexed with Elasticsearch 6.8, which accepted a different date format. Indexing with Elasticsearch 7.x may fail due to invalid UTC zone offsets in some messages. The following Painless script fixes the issue by clipping offsets outside the +/-18:00 range:

if (ctx._source.headers.date != null) {
    // The last five characters are the offset without its sign, e.g. "05:30"
    String o = ctx._source.headers.date.substring(ctx._source.headers.date.length() - 5);
    // Clip the hour part of the offset to at most 18
    int oi = Integer.min(18, Integer.parseInt(o.substring(0, 2)));
    String co = String.format("%02d", new def[] {oi});
    // Keep the original minutes below 18 hours, otherwise force ":00"
    o = oi < 18 ? co + o.substring(o.length() - 3) : co + ":00";
    ctx._source.headers.date = ctx._source.headers.date.substring(0, ctx._source.headers.date.length() - 5) + o;
}

In Python, you can preprocess the date field with:

''.join((d[:-5], f'{min(18, int(d[-5:-3])):02d}', d[-3:] if int(d[-5:-3]) < 18 else ':00'))
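For use in a preprocessing script, the one-liner can be wrapped in a function that also passes through missing dates, mirroring the null check of the Painless script. This is only a sketch: the name `clip_utc_offset` is hypothetical, and it assumes the date string ends in a signed offset of the form `+HH:mm` or `-HH:mm`.

```python
from typing import Optional


def clip_utc_offset(d: Optional[str]) -> Optional[str]:
    """Clip the UTC offset of a date string to the +/-18:00 range.

    Assumes the string ends in '+HH:mm' or '-HH:mm'; None is returned unchanged.
    """
    if d is None:
        return None
    hours = int(d[-5:-3])  # hour part of the offset, without sign
    if hours < 18:
        return d  # offset is already valid
    # Offsets of 18 hours or more are clipped to exactly 18:00; the sign
    # sits before the last five characters and is therefore preserved.
    return d[:-5] + "18:00"
```

For example, a date ending in `-25:30` would come out ending in `-18:00`, while dates with offsets below 18 hours are returned unchanged.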

Download

As a qualified individual researcher or member of a research institution, you can request access to the Webis-Gmane-19 corpus on Zenodo.

The dataset is available only to individual researchers and research institutions. If you qualify for either one, we are happy to share the data with you under the following conditions: Any non-academic use and redistribution of the data are prohibited. By downloading the dataset, you agree to these terms. We request you be responsible in your research and in your handling of the data and adhere to ethical standards and privacy regulations.

Despite the anonymization of email addresses and headers and the fact that all data comes from a readily-available online source, we take this step as a measure to protect the privacy of users whose data can be found in the corpus.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus as [bib]. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.

The pre-trained segmentation model and code can be downloaded freely from GitHub.

The raw crawl data (including a snapshot of the whole gwene RSS feed headline hierarchy) is available at the Internet Archive.
