Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

Apr 17, 2025

Qualified relations

Novák, Michal; Popel, Martin; Zeman, Daniel; Žabokrtský, Zdeněk; Nedoluzhko, Anna; Acar, Kutay; Bamman, David; Bourgonje, Peter; Cinková, Silvie; Eckhoff, Hanne; Cebiroğlu Eryiğit, Gülşen; Hajič, Jan; Hardmeier, Christian; Haug, Dag; Jørgensen, Tollef; Kåsen, Andre; Krielke, Pauline; Landragin, Frédéric; Lapshinova-Koltunski, Ekaterina; Mæhlum, Petter; Martí, M. Antònia; Mikulová, Marie; Milintsevich, Kirill; Mujadia, Vandan; Muzerelle, Judith; Nam, Sangha; Nøklestad, Anders; Ogrodniczuk, Maciej; Øvrelid, Lilja; Pamay Arslan, Tuğba; Porada, Ian; Recasens, Marta; Solberg, Per Erik; Stede, Manfred; Straka, Milan; Swanson, Daniel; Toldova, Svetlana; Vadász, Noémi; Velldal, Erik; Vincze, Veronika; Zeldes, Amir; Žitkus, Voldemaras

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) (Publisher)

Resource type

Text

Subjects

coreference

bridging relations

harmonized annotation

dependency

treebank

Descriptions

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In its current version 1.3, CorefUD consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute because of their original license restrictions. It also contains the test data portions of all datasets.

When using any of the harmonized datasets, please familiarize yourself with its license (placed in the same directory as the data) and also cite the original data resource.

Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions, and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
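Because the coreference and bridging information is stored as attribute-value pairs in the MISC column, it can be read with a few lines of code. The following is a minimal sketch, assuming the general CoNLL-U layout (ten tab-separated columns, MISC last, pairs separated by `|`, `_` for an empty field). The sample sentence and the `Entity=(e1)` value are made up for illustration and are not taken from any CorefUD dataset; the function names are hypothetical. Consult the dataset README files for the actual attribute names and value syntax.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if not misc or misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else None
    return pairs


def iter_token_misc(lines):
    """Yield (ID, FORM, parsed MISC) for every token line of a CoNLL-U stream."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        yield cols[0], cols[1], parse_misc(cols[9])


# Illustrative input only: the MISC values below are assumptions.
sample = [
    "# sent_id = hypothetical-1",
    "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]

rows = list(iter_token_misc(sample))
for token_id, form, misc in rows:
    print(token_id, form, misc)
```

To process a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) to `iter_token_misc` instead of the sample list.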

Additional details

Primary language	Turkish
Related resources	This dataset is source of: CorefUD | ÚFAL (IRI: https://ufal.mff.cuni.cz/corefud). This dataset references: Coreference in Universal Dependencies 1.2 (CorefUD 1.2) (HANDLE: http://hdl.handle.net/11234/1-5478).

Files

Name	Size
CorefUD-1.3-public.zip MD5: 95acda0b065942f8e3dae25c7ae45ddc	102.5 MB
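
The MD5 checksum listed above can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib` module; it assumes the archive was saved under its listed name in the current working directory, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the Files table of this record.
EXPECTED_MD5 = "95acda0b065942f8e3dae25c7ae45ddc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's MD5 digest with an expected value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("CorefUD-1.3-public.zip")
    if archive.exists():
        print("checksum OK" if matches_md5(archive, EXPECTED_MD5) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading the file in chunks keeps memory use low for an archive of this size.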
