Introduction


With the advent of the World Wide Web, information in many different formats has become easily accessible. Texts, images, videos, and audio are all available to consult, download, and modify. Under these circumstances, text re-use has increased in recent years. In particular, plagiarism has been defined by the IEEE as the reuse of someone else's prior ideas, processes, results, or words without explicitly acknowledging the original author and source. The problem has attracted attention from many research areas and has even generated new terms, such as the so-called copy&paste syndrome, or a new kind of text re-use: cyberplagiarism.


While people have enough expertise to detect text re-use when reading a document, the scale of potential source documents (that of the Web) makes manual analysis infeasible. As a countermeasure, different systems that assist in the detection of text re-use have been developed. The main idea is to automatically detect the text fragments in a document that are suspected of being re-used and, if available, to provide their presumed source. In that way, on the basis of the linguistic evidence provided, a human can make the final decision.


Recent efforts have been devoted to developing better models for the detection of text re-use. Probably one of the most interesting cases is PAN, the International Competition on Plagiarism Detection, held in conjunction with CLEF.


A special kind of phenomenon is cross-language text re-use, where the re-used text fragment and its source are written in different languages, which makes automatic detection even harder than in the monolingual case. Cross-language text re-use detection has barely been addressed in recent years, and better models are necessary.


Through the current initiative we aim to further promote the development of better models for text re-use detection and, in particular, cross-language text re-use detection. Our interest in the latter is motivated by the following facts:

  • Speakers of less-resourced languages (also known as under-resourced languages) are forced to consult documentation in a foreign language; and
  • People immersed in a foreign country can still consult material written in their native language.
Such environments make cross-language text re-use more likely and turn it into an interesting problem nowadays.




Task Description


The focus of the CL!TR evaluation task is on cross-language text re-use detection. To start with, this year's task targets two languages: English and Hindi. The source text is in English and the suspicious text is in Hindi.


You are provided with a set of suspicious documents in Hindi and a set of potential source documents in English. The task is to identify the documents in the suspicious set (Hindi) that are created by re-using fragments from the source set (English).


You are expected to identify the suspicious documents that have actually been generated by re-use, together with their corresponding sources. Note that this is a document-level task: no specific fragments inside the documents are expected to be identified, only pairs of documents. Determining whether a text has been re-used from its corresponding source is enough. Specifying the kind of re-use (Exact, Heavy, or Light) is not necessary.


CL!TR is divided into two phases: training and test. For the training phase we provide an annotated corpus that includes different levels of re-use. The annotation indicates whether a text fragment has been re-used and, if so, what its source is. In the test phase no annotations or hints about the cases are provided.


Result Submission

The results of your re-use detection software are required to be formatted in XML:


<document>
<!-- reused_reference: file name of the suspicious document -->
<!-- source_reference: file name of the source document -->
<reuse_case
  reused_reference="..."
  source_reference="..."
/>
<reuse_case
  reused_reference="..."
  source_reference="..."
/>
<!-- ... more detections in the collection ... -->
</document>

Each detected pair of suspicious and source documents must appear as exactly one <reuse_case .../> entry in the XML file.
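As an illustration of the format above, the following is a minimal Python sketch that serializes a list of detected pairs into a result file. The function names and the example file names are hypothetical; only the element name `document` and the attributes `reused_reference` and `source_reference` come from the format description.

```python
import xml.etree.ElementTree as ET


def build_result_document(pairs):
    """Return a <document> element with one <reuse_case/> per detected pair.

    `pairs` is an iterable of (suspicious_file_name, source_file_name) tuples.
    """
    root = ET.Element("document")
    for suspicious_name, source_name in pairs:
        ET.SubElement(
            root,
            "reuse_case",
            {
                "reused_reference": suspicious_name,
                "source_reference": source_name,
            },
        )
    return root


def write_result_file(pairs, path):
    """Write the detections to `path` as UTF-8 encoded XML."""
    tree = ET.ElementTree(build_result_document(pairs))
    tree.write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical file names, used only to show the call.
example_pairs = [
    ("suspicious-document00001.txt", "source-document00042.txt"),
    ("suspicious-document00002.txt", "source-document00317.txt"),
]
print(ET.tostring(build_result_document(example_pairs), encoding="unicode"))
```

Building the file through an XML library rather than by string concatenation ensures that special characters in file names are escaped correctly.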




Evaluation Corpus


Training Collection


The training corpus is available for download here:

  • CLITR_training_data.tar.bz2
    md5sum 53381673b76196110adf29428b552bb0, 14.8 MB
    (note that the potential source documents include Wiki-markup)

Test Collection


The test corpus is available for download here:

  • CLITR_test_data.tar.bz2
    md5sum dc2af9095c01270264e25604d9d9f2a4, 14.8 MB
    (note that the potential source documents include Wiki-markup)
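
The checksums listed for the two archives can be used to confirm that a download is intact. The following Python sketch shows one way to do this; the function names are hypothetical, and it assumes the archives were saved in the current directory under the file names given above.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True if the file's MD5 digest matches `expected_md5`."""
    return md5_of_file(path) == expected_md5.lower()


# Checksums as published on this page.
EXPECTED_MD5 = {
    "CLITR_training_data.tar.bz2": "53381673b76196110adf29428b552bb0",
    "CLITR_test_data.tar.bz2": "dc2af9095c01270264e25604d9d9f2a4",
}

if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.isfile(name):  # assumed download location
            status = "OK" if verify_archive(name, expected) else "MISMATCH"
            print(name, status)
```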


Evaluation Task


Let S be a set of suspicious documents and D a set of potential source documents. The task is to find those documents s in S that have actually been re-used, together with their source document d in D.


Evaluation Corpus


The corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files encoded in UTF-8. The source documents are taken from the English Wikipedia and include Wiki-markup.


Training Collection

To help you prepare and develop your detection software, we provide a training collection. This collection includes annotations for every case of re-use.

  • Training Corpus Statistics
    • 5032 source files in English
    • 198 suspicious files in Hindi

Test Collection

The test collection is composed in the same way as the training collection: a set of suspicious documents together with a set of potential source documents.

  • Test Corpus Statistics
    • 5032 source files in English
    • 190 suspicious files in Hindi

Both corpora can be downloaded from the Corpus section of this website.

Submission of Detection Results


Participants are allowed to submit up to three runs in order to experiment with different settings.

The results of your detection are required to be formatted in XML. The result document must follow the XML format below:


<document>
<!-- reused_reference: file name of the suspicious document -->
<!-- source_reference: file name of the source document -->
<reuse_case
  reused_reference="..."
  source_reference="..."
/>
<reuse_case
  reused_reference="..."
  source_reference="..."
/>
<!-- ... more detections in the collection ... -->
</document>
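
To check a result file before submission, it can be read back into a set of (suspicious, source) pairs. The sketch below is one possible way to do so in Python; the function name and the sample file names are hypothetical, and it assumes the file is well-formed XML with the `reuse_case` attributes described above.

```python
import xml.etree.ElementTree as ET


def read_reuse_pairs(xml_text):
    """Parse a result document and return the set of detected pairs.

    Each pair is a (reused_reference, source_reference) tuple. Using a set
    means that an entry listed twice is counted only once.
    """
    root = ET.fromstring(xml_text)
    pairs = set()
    for case in root.iter("reuse_case"):
        pairs.add((case.get("reused_reference"), case.get("source_reference")))
    return pairs


# Hypothetical content, used only to show the call.
sample = """<document>
<reuse_case reused_reference="suspicious-1.txt" source_reference="source-7.txt"/>
<reuse_case reused_reference="suspicious-2.txt" source_reference="source-9.txt"/>
</document>"""
print(sorted(read_reuse_pairs(sample)))
```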


Performance Measures


The performance of a text re-use detection system will be measured in terms of Precision (P), Recall (R), and F-measure (F) on detecting the re-used documents together with their sources in the test corpus.


A detection is considered correct if the re-used document is identified together with its corresponding source document. We consider:

  • total detected to be the set of suspicious-source pairs detected by the system.
  • correctly detected to be the subset of pairs detected by the system that actually constitute cases of re-use.
  • total re-used to be the gold standard, which includes all the pairs that constitute actual cases of re-use.

P, R and F are defined as follows, where |X| denotes the number of pairs in set X:

P = |correctly detected| / |total detected|

R = |correctly detected| / |total re-used|

F-measure = (2 * R * P) / (R + P)
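
The definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the reference implementation: the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption that the definitions do not address.

```python
def precision_recall_f(detected_pairs, gold_pairs):
    """Compute P, R and F-measure over sets of (suspicious, source) pairs."""
    total_detected = set(detected_pairs)
    total_reused = set(gold_pairs)
    correctly_detected = total_detected & total_reused

    # Assumption: a measure with an empty denominator is reported as 0.0.
    precision = len(correctly_detected) / len(total_detected) if total_detected else 0.0
    recall = len(correctly_detected) / len(total_reused) if total_reused else 0.0
    if precision + recall > 0:
        f_measure = 2 * recall * precision / (recall + precision)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy example with hypothetical file names: 3 detections, 2 of them correct,
# against a gold standard of 4 pairs.
gold = {("s1", "d1"), ("s2", "d2"), ("s3", "d3"), ("s4", "d4")}
detected = {("s1", "d1"), ("s2", "d2"), ("s3", "d9")}
p, r, f = precision_recall_f(detected, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.500 F=0.571
```

Note that the pair ("s3", "d9") counts as an error even though s3 is a re-used document, because a detection is only correct when the suspicious document is paired with its actual source.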

A reference implementation of the measures, coded in Perl, is no longer available.


It was run as follows:

perl getmeasures.pl <gold_standard.xml> <detection.xml>

(For example, it could be run with ref_small.xml as the gold standard and multiple_detection.xml as the detection file.)




Evaluation Results


Participants


Participant | Institution | Country
Aniruddha Ghosh | Jadavpur University | India
Karteek Addanki et al. | Hong Kong University of Science and Technology | Hong Kong (China)
Nitish Aggarwal et al. | DERI Galway and UPM Madrid | Ireland / Spain
Parth Gupta et al. | UPV & DA-IICT | Spain / India
Rambhoopal K. | IIIT Hyderabad | India
Yurii Palkovskii | Zhytomyr State University / SkyLine Inc. | Ukraine


Ranking


Rank | F-measure | Recall | Precision | Run | Leader
1 | 0.649 | 0.750 | 0.571 | 3 | Rambhoopal K.
2 | 0.609 | 0.821 | 0.484 | 1 | Nitish Aggarwal
3 | 0.608 | 0.643 | 0.576 | 2 | Rambhoopal K.
4 | 0.603 | 0.589 | 0.617 | 1 | Yurii Palkovskii
5 | 0.596 | 0.804 | 0.474 | 2 | Parth Gupta
6 | 0.589 | 0.795 | 0.468 | 2 | Nitish Aggarwal
7 | 0.576 | 0.589 | 0.564 | 1 | Rambhoopal K.
8 | 0.541 | 0.473 | 0.631 | 2 | Yurii Palkovskii
9 | 0.523 | 0.500 | 0.549 | 3 | Yurii Palkovskii
10 | 0.509 | 0.607 | 0.439 | 3 | Parth Gupta
11 | 0.430 | 0.580 | 0.342 | 1 | Parth Gupta
12 | 0.220 | 0.214 | 0.226 | 2 | Aniruddha Ghosh
13 | 0.220 | 0.214 | 0.226 | 3 | Aniruddha Ghosh
14 | 0.085 | 0.107 | 0.070 | 1 | Aniruddha Ghosh
15 | 0.000 | 0.000 | 0.000 | 1 | Karteek Addanki



Organizing Committee


  • Alberto Barrón-Cedeño, Paolo Rosso
    NLE Lab @ Universidad Politécnica de Valencia, Spain
  • Sobha Lalitha Devi
    CLR Group @ AU-KBC Research Centre, Chennai, India
  • Paul Clough, Mark Stevenson
    IR & NLP Groups @ University of Sheffield, UK

Program Committee


Tim Baldwin | Melbourne University
Rafael E. Banchs | Institute for Infocomm Research Singapore
Carole Chaski | Institute for Linguistic Evidence
Malcolm Coulthard | Centre for Forensic Linguistics, University of Aston
Marcelo Errecalde | Universidad Nacional de San Luis
Michael Granitzer | Know-Center Graz
Roman Kern | Graz University of Technology
Adam Kilgarriff | Lexicography MasterClass Ltd
Elisabeth Lex | Know-Center Graz
Qin Lu | The Hong Kong Polytechnic University
Manuel Montes y Gomez | INAOE-Puebla
Ted Pedersen | University of Minnesota in Duluth
Anselmo Peñas | UNED
Martin Potthast | Bauhaus-Universität Weimar
Ganesh Ramakrishnan | IIT Bombay
Grigori Sidorov | Instituto Politécnico Nacional
Thamar Solorio | University of Alabama at Birmingham
Efstathios Stamatatos | University of the Aegean
Benno Stein | Bauhaus-Universität Weimar
Dan Tufis | Romanian Academy
María Teresa Turell Juliá | ForensicLab, Universitat Pompeu Fabra
Vasudeva Varma | IIIT Hyderabad
Juan Velásquez | Universidad de Chile
Luis Villaseñor | INAOE-Puebla
Piek Vossen | Vrije Universiteit (VU) Amsterdam
Dekai Wu | Hong Kong University of Science and Technology