Picapica is a Web-based application for the automated detection of text reuse. Its underlying technologies and algorithms are developed in our research group and address the efficient retrieval and analysis of potentially plagiarized sources from the World Wide Web. Picapica combines several approaches to plagiarism analysis: identification of passages copied 1:1 from a Web document, identification of copies that have undergone modifications, and an in-depth analysis of an author's writing style.
The project is supported by the EXIST support program of the Federal Ministry of Economics and Technology (BMWi).
Picapica implements a plagiarism analysis process consisting of three basic steps:
- Retrieval of reference documents from the World Wide Web as well as from specially prepared plagiarism indexes.
- Detailed analysis of a suspicious document against reference documents.
- Knowledge-based post-processing of plagiarism indications to avoid the detection of correct citations as plagiarism.
In the first step, a suspicious document is analyzed to identify its language, its topic, its genre, important keywords, and other characteristics that may help to narrow a Web search for plagiarized sources. In addition, a special plagiarism index of commonly used sources for plagiarism (e.g. Wikipedia) is queried. The result of both heuristic searches is a set of URLs of Web documents, which are downloaded in parallel on a distributed server architecture.
In the second step, the suspicious document is compared to each of the downloaded documents. This step comprises the retrieval of passages that are identical or highly similar. Fuzzy-fingerprinting plays an important role here: a fuzzy fingerprint is computed from each text passage such that passages with a high similarity are likely to be mapped onto the same fingerprint. This allows similar text passages between the suspicious document and a reference document to be retrieved in linear time. Apart from the comparison with reference documents, the writing style of the suspicious document's author is analyzed. This analysis can be used to detect paragraphs copied from sources that are not available electronically.
The third step of the analysis process is the subject of our current research. Solutions to the problem of distinguishing between plagiarism and correct citations will be integrated into the Web service in the near future.
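The fingerprint-based comparison in the second step can be illustrated with a small sketch. It is not Picapica's implementation: the prefix classes, the thresholds, the passage size, and all function names are assumptions chosen for illustration, and the fuzzification is a strongly simplified stand-in for fuzzy-fingerprinting. Each passage is reduced to a coarse tuple describing how its words are distributed over a few first-letter classes, so that small edits tend to leave the tuple unchanged. Indexing the reference passages by fingerprint then lets each suspicious passage be matched with a single hash lookup.

```python
import re
from collections import defaultdict

# Assumed prefix classes: words are grouped by their first letter.
PREFIX_CLASSES = ("abcdef", "ghijklm", "nopqrs", "tuvwxyz")


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def fingerprint(passage):
    """Map a passage to a coarse tuple with one level (0, 1, 2) per class.

    The thresholds 0.15 and 0.35 are arbitrary illustration values.
    """
    words = tokenize(passage)
    if not words:
        return ()
    levels = []
    for letters in PREFIX_CLASSES:
        share = sum(w[0] in letters for w in words) / len(words)
        levels.append(0 if share < 0.15 else 1 if share < 0.35 else 2)
    return tuple(levels)


def passages(text, size=9):
    """Split a text into consecutive passages of `size` words."""
    words = tokenize(text)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def candidate_pairs(suspicious, reference, size=9):
    """Return (suspicious passage, reference passage) pairs sharing a fingerprint."""
    index = defaultdict(list)
    for ref_passage in passages(reference, size):
        index[fingerprint(ref_passage)].append(ref_passage)
    return [
        (susp_passage, ref_passage)
        for susp_passage in passages(suspicious, size)
        for ref_passage in index.get(fingerprint(susp_passage), [])
    ]


original = "The quick brown fox jumps over the lazy dog"
modified = "The quick brown fox leaps over the lazy dog"
other = "Zebras typically travel together toward water"
pairs = candidate_pairs(modified, original + " " + other)
```

In this toy example the modified sentence collides with the original but not with the unrelated passage. Pairs found this way are only candidates; a coarse fingerprint also produces false collisions, so a detailed comparison of the candidate passages would still be needed.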
The server architecture implements a scalable distributed system based on the message-oriented middleware paradigm. A gateway Web server handles all client interactions: it receives uploaded files and delivers analysis results. A plagiarism analysis is conducted in parallel on several analysis servers. The entire communication, all analysis results, and the information about all currently running tasks are stored in a message queue, which is realized with a relational database system.
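How a message queue can be realized on top of a relational database is sketched below. This is a minimal illustration under stated assumptions, not the actual Picapica schema: the table layout, the status values, the worker names, and the use of SQLite (via Python's standard `sqlite3` module) are all hypothetical. The gateway would enqueue a task per uploaded document, and each analysis server would claim the oldest pending task and later store its result in the same table.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the database and create the (assumed) task table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               document TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               worker TEXT,
               result TEXT)"""
    )
    conn.commit()
    return conn


def enqueue(conn, document):
    """Gateway side: register an uploaded document as a pending task."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO tasks (document) VALUES (?)", (document,))
    return cur.lastrowid


def claim_next(conn, worker):
    """Analysis-server side: claim the oldest pending task, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, document FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in the WHERE clause guards against another
        # worker having claimed the same task in the meantime.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        return row if cur.rowcount == 1 else None


def complete(conn, task_id, result):
    """Store the analysis result and mark the task as done."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
            (result, task_id),
        )


conn = open_queue()
first = enqueue(conn, "thesis.pdf")
second = enqueue(conn, "essay.txt")
claimed = claim_next(conn, "analysis-1")
complete(conn, claimed[0], "no reuse found")
```

Keeping tasks, their status, and their results in one table means the gateway can report progress with an ordinary query, and a task left in the `running` state by a crashed worker remains visible for rescheduling.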
EXIST scholarship students: Christof Bräutigam, Christina Eisenach, Jan Graßegger, Daniel Plath
Other students: Dennis Braunsdorf, Matthias Busse, Franz Coriand, Andreas Eiselt, Jan Hühne, Alexander Kleppe, Karsten Klüger, Alexander Kümmel, Marion Kulig, Christoph Lössnitz, Fabian Loose, Hagen-Christian Tönnies, Martin Trenkmann, Michael Völske, André Zölitz