09:00-09:15 | Opening
09:15-09:45 | Author Identification Using Imbalanced and Limited Training Texts. Efstathios Stamatatos [paper] [slides]
09:45-10:15 | Serial Sharers: Detecting Split Identities of Web Authors. Einat Amitay, Sivan Yogev, Elad Yom-Tov [paper] [slides]
10:15-11:00 | Coffee Break
11:00-11:30 | Forensic Authorship Attribution for Small Texts. Ol'ga Feiguina, Graeme Hirst [paper] [slides]
11:30-12:00 | Investigating Topic Influence in Authorship Attribution. George K. Mikros, Eleni K. Argiri [paper] [slides]
12:00-12:30 | Authors, Genre, and Linguistic Convention. Jussi Karlgren, Gunnar Eriksson [paper] [slides]
12:30-14:00 | Lunch
14:00-14:30 | Adaption of String Matching Algorithms for Identification of Near-Duplicate Music Documents. Matthias Robine, Pierre Hanna, Pascal Ferraro, Julien Allali [paper] [slides]
14:30-15:00 | Intrinsic Plagiarism Analysis with Meta Learning. Benno Stein, Sven Meyer zu Eissen [paper] [slides]
For the complete proceedings, see the publisher's website.
The workshop brings together experts and prospective researchers around the exciting and forward-looking topics of plagiarism analysis, authorship identification, and high-similarity search. These topics receive increasing attention, owing in part to the fact that information about nearly any subject can be found on the World Wide Web. At first sight, plagiarism, authorship, and near-duplicates may pose very different challenges; however, they are closely related in several technical respects.
Plagiarism is the act of copying or including another author's ideas, language, or writing without proper acknowledgment of the original source. Plagiarism analysis is a collective term for computer-based methods to identify a plagiarism offense. For text documents, we distinguish between corpus-based and intrinsic analysis: the former compares a suspicious document against a set of potential original documents; the latter identifies potentially plagiarized passages by analyzing the suspicious document for changes in writing style. Such passages can then be used as a starting point for a Web search or for human inspection.
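As a rough illustration of the intrinsic approach, the following sketch (our own minimal example, not the method of any workshop paper; the two style features, the window size, and the z-score threshold are all assumptions) flags windows whose style statistics deviate markedly from the document average:

```python
import re
import statistics

def style_features(text):
    """Two coarse style statistics for a passage: average word length
    and type-token ratio (a crude measure of vocabulary richness)."""
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) < 20:  # too short for meaningful statistics
        return None
    avg_len = sum(len(w) for w in words) / len(words)
    ttr = len({w.lower() for w in words}) / len(words)
    return (avg_len, ttr)

def suspicious_offsets(document, window=1000, step=500, threshold=2.0):
    """Return start offsets of windows whose style deviates from the
    document mean by more than `threshold` standard deviations."""
    offsets, feats = [], []
    for i in range(0, max(len(document) - window + 1, 1), step):
        f = style_features(document[i:i + window])
        if f is not None:
            offsets.append(i)
            feats.append(f)
    if len(feats) < 2:
        return []
    flagged = set()
    for dim in range(len(feats[0])):
        values = [f[dim] for f in feats]
        mean, sd = statistics.mean(values), statistics.pstdev(values)
        if sd == 0:
            continue
        flagged.update(o for o, v in zip(offsets, values)
                       if abs(v - mean) / sd > threshold)
    return sorted(flagged)
```

The flagged offsets are exactly the "starting points" mentioned above: passages a human reviewer or a Web search would examine first.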
Both a plagiarism offense and its analysis can pertain to any kind of digital media, e.g., text, program code, or images.
Authorship identification divides into so-called attribution and verification problems. In the authorship attribution problem, one is given examples of the writing of a number of authors and is asked to determine which of them authored given anonymous texts. If it can be assumed for each test document that one of the specified authors is indeed the actual author, the problem fits the standard paradigm of a text categorization problem. In the authorship verification problem, one is given examples of the writing of a single author and is asked to determine if given texts were or were not written by this author. As a categorization problem, verification is significantly more difficult than attribution. Authorship verification and intrinsic plagiarism analysis represent two sides of the same coin.
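Framed as text categorization, attribution admits a very compact baseline. The sketch below is an illustrative setup of our own, not the method of any workshop paper; the character n-gram features and the linear classifier are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Training texts with known authors (placeholders for real writing samples).
texts = ["first sample text by author A", "first sample text by author B",
         "second sample text by author A", "second sample text by author B"]
authors = ["A", "B", "A", "B"]

# Character n-grams are a common, largely topic-neutral style marker.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(texts, authors)

# Attribution: assign an anonymous text to one of the candidate authors.
print(model.predict(["an anonymous sample text"]))
```

Verification lacks such a closed candidate set, which is one reason it is usually cast as a one-class or outlier-detection problem rather than as standard categorization.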
Near-duplicate detection is mainly a problem of the World Wide Web: duplicate Web pages inflate the index storage space of search engines, slow down result serving, and decrease retrieval precision. A naive solution to near-duplicate detection is the pairwise comparison of all documents. With specialized "document models", such as fingerprinting or locality-sensitive hashing, a significantly reduced number of comparisons suffices. These approaches provide efficient but incomplete solutions to the nearest-neighbor problem in high dimensions.
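As an illustration of such a document model, the following sketch implements min-wise fingerprinting over character shingles, a standard construction; the shingle length k and fingerprint size n are arbitrary choices, and salting a single hash stands in for a family of hash functions:

```python
import hashlib

def shingles(text, k=8):
    """All overlapping character k-grams (shingles) of a text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _h(salt, shingle):
    """Salted 64-bit hash, simulating one member of a hash family."""
    digest = hashlib.md5(f"{salt}:{shingle}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def fingerprint(text, k=8, n=64):
    """Min-wise fingerprint: for each of n hash functions, keep the
    minimum hash value over all shingles of the text."""
    grams = shingles(text, k)
    return [min(_h(salt, s) for s in grams) for salt in range(n)]

def resemblance(fp_a, fp_b):
    """The fraction of agreeing positions estimates the Jaccard
    similarity of the two documents' shingle sets."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)
```

Because near-duplicates agree in most fingerprint positions, candidate pairs can be found by grouping documents on fingerprint fragments instead of comparing all pairs.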
Near-duplicate detection relates directly to plagiarism analysis: at the document level, the two also represent two sides of the same coin. For a plagiarism analysis at the paragraph level, the same specialized document models (e.g., shingling, fingerprinting, hashing) can be applied, where a key problem is the selection of useful chunks from a document; one heuristic from the literature is sketched below.
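The text above leaves the chunk-selection strategy open; one well-known heuristic from the literature is winnowing (Schleimer et al., SIGMOD 2003), which keeps the minimum hash in each sliding window so that sufficiently long matching passages are guaranteed to share a selected chunk. A minimal sketch, with arbitrary k-gram and window sizes:

```python
import hashlib

def gram_hashes(text, k=5):
    """Hash every overlapping character k-gram of the text."""
    return [int(hashlib.md5(text[i:i + k].encode()).hexdigest()[:8], 16)
            for i in range(len(text) - k + 1)]

def winnow(hashes, w=4):
    """From every window of w consecutive hashes, keep the minimum
    (rightmost occurrence on ties); duplicates are recorded once."""
    selected = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        j = i + w - 1 - window[::-1].index(m)  # rightmost minimum
        selected.add((j, m))
    return sorted(selected)

# Shared selected hashes hint at overlapping passages between documents.
fp_a = {h for _, h in winnow(gram_hashes("the quick brown fox jumps"))}
fp_b = {h for _, h in winnow(gram_hashes("a quick brown fox leaps"))}
print(len(fp_a & fp_b))
```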
The development of new solutions to the outlined problems may benefit from combining existing technologies, and in this sense the workshop provides a platform that spans different views and approaches. The following list gives examples from the outlined field; contributions on these topics are welcome but are not restricted to them:
The workshop addresses researchers, users, and practitioners from different fields: data mining and machine learning, document and knowledge management, semantic technologies, computational linguistics, the social sciences, and information retrieval in general. We solicit contributions dealing with theoretical and practical questions concerning the development, use, and evaluation of theories and tools related to the workshop theme.
The workshop encourages the presentation of novel ideas in a less formal setting; nevertheless, we strive for high-quality contributions. Each contribution will be peer-reviewed by at least two experts from the field, and accepted papers will be included in the workshop proceedings. Depending on the quality of their contribution, the authors of an accepted paper will be invited to give either a long or a short presentation of their work. Moreover, we plan to invite the authors of the best papers to submit an extended article on their work for a special issue of an international journal, edited by the organizers with the support of the program committee.
Submitted papers should use the ACM conference style (see the ACM template page) and may not exceed six pages. Submissions must be in electronic form, as PDF or PostScript files, and e-mailed to pan-07@webis.de. The review process is double-blind; please anonymize your submission.