Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages with 200K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology — the set of linguistic features that each language expresses — such that we expect models performing well on this set to generalize across a large number of the languages in the world. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.
For a full description of the methodology used to create the corpus, see TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.
Below you will find a short selection of examples from the dataset. They provide a glimpse into the diversity of linguistic phenomena in our dataset. We specifically single out instances where identifying relevant contexts across the question-answer pair cannot be done by simple string matching of relevant terms.
Each example contains a question-answer pair. For simplicity, the answer shown is a single sentence from the original passage that contains the short answer. Examples are presented with Leipzig glosses or transliterations wherever relevant.
Q: | ؟ | موزارت | هو | من |
? | mwzArt | hw | mn | |
Who is Mozart ? |
A: | بالنمسا | سالزبورغ | في | 1756 | يناير | 27 | في | ولد | (1791 | ديسمبر | 5 | - | 1756 | يناير | 27) | موتسارت | أماديوس | فولفغانغ |
bAlnmsA | sAlzbwrg | fy | 1756 | ynAyr | 27 | fy | wld | (1791 | dysmbr | 5 | - | 1756 | ynAyr | 27) | mwtsArt | A#mAdyws | fwlfgAng | |
Wolfgang Amadeus Mozart (January 27, 1756 - December 5, 1791) was born on January 27, 1756 in Salzburg, Austria |
This Arabic example demonstrates variation in the spelling of non-native names. Both spellings of Mozart are correct and refer to the same entity across the QA pair.
Q: | Кто | изобрел | телефон | ? |
Kto | izobrel | telefon | ? | |
who | invented | telephone | ? | |
Who invented the telephone? |
A: | Сам | Рейс | назвал | сконструированное | им | устройство | Telephone | . |
Sam | Reis | nazval | skonstruirovannoe | im | ustroistvo | Telephone | . | |
self | Reis | called | constructed | him | device | Telephone | . | |
Reis himself called the device he created the Telephone. |
This Russian example demonstrates how some entities of non-native origin may maintain the original Latin-script spelling, especially when the term has been directly borrowed into Russian and is phonologically similar to the original. Here the question about the inventor of the telephone contains the more common Cyrillic rendition of the term, 'телефон'. However, the answer passage has it in the original English spelling as 'Telephone'.
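The script mismatch described above can be illustrated with a small sketch that romanizes the Cyrillic term and compares it approximately with the Latin-script one. This is only an illustration under stated assumptions: `CYR_TO_LAT` is a hypothetical, partial mapping that covers just the letters in this example (it is not a standard transliteration scheme), and `romanize` and `similarity` are names introduced here, with the comparison delegated to Python's `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical, partial Cyrillic-to-Latin table (assumption: it covers only
# the letters needed for this example; it is not a standard romanization).
CYR_TO_LAT = {"т": "t", "е": "e", "л": "l", "ф": "f", "о": "o", "н": "n"}


def romanize(text: str) -> str:
    """Lowercase the text and map known Cyrillic letters to Latin ones."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


question_term = "телефон"
answer_term = "Telephone"

# Exact matching fails because the two terms use different scripts.
exact = question_term.lower() == answer_term.lower()

# After romanization the terms are close but still not identical (f vs. ph),
# so an approximate comparison is needed rather than string equality.
romanized = romanize(question_term)
score = similarity(romanized, answer_term)
```

Even after romanization the two strings differ, so a matcher would need a similarity threshold rather than an equality check; the threshold itself would have to be tuned on real data.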
Q: | ؟ | العُماني | العلم | الوان | هي | ما |
? | AlEumAny | AlElm | AlwAn | hy | mA | |
What are the colors of the Omani flag? |
A: | ، | 1970 | ديسمبر | 17 | الموافق | هـ | 1391 | شوال | 18 | في | مرة | لأول | ورفع | سلطاني | بقرار | انشئ | عمان | لسلطنة | الوطني | العلم |
. | 1970 | dysmbr | 17 | AlmwAfq | h_ | 1391 | $wAl | 18 | fy | mrp | lA#wl | wrfE | slTAny | bqrAr | An$y# | EmAn | lslTnp | AlwTny | AlElm | |
The national flag of the Sultanate of Oman was established by a royal decree and was raised for the first time on Shawwal 18, 1391 AH corresponding to December 17, 1970. |
This Arabic example demonstrates variation in selective diacritization of a named entity, Oman. Note that here the question contains an optional diacritic (◌ُ, the damma) to emphasize the pronunciation, while the answer passage does not.
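Optional diacritics of this kind can be neutralized before comparison. The following is a minimal sketch, assuming that removing the Unicode range U+064B to U+0652 (the Arabic short-vowel and related marks) is acceptable for the matching task; `strip_harakat` is a name introduced here for illustration.

```python
# Arabic short-vowel and related marks (harakat), U+064B through U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Remove optional Arabic diacritics, leaving base letters untouched."""
    return "".join(ch for ch in text if ch not in HARAKAT)


# "Omani" as written in the question, with a damma (U+064F) after the ain.
with_damma = "\u0627\u0644\u0639\u064F\u0645\u0627\u0646\u064A"
# The same word written without the optional diacritic.
without_damma = "\u0627\u0644\u0639\u0645\u0627\u0646\u064A"

normalized = strip_harakat(with_damma)
```

Stripping removes only the diacritic mismatch. In this example the question uses the adjective 'Omani' with a definite article while the passage uses the noun 'Oman', so a morphological difference remains after normalization.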
Q: | Что | такое | атом | ? |
Chto | takoe | atom | ? | |
What | such | atom | ? | |
What is an atom? |
A: | А́том | — | частица | вещества | микроскопических | размеров | ... |
Átom | — | chastitsa | veschestva | mikroskopicheskih | razmerov | ... | |
Atom | pred | particle | matter | microscopic | sizes | ... | |
An atom is a microscopic particle of matter... |
This Russian example demonstrates variation in stress marking on Russian vowels. Russian encyclopedia entries typically include stress marks on vowels of the entity name to emphasize correct pronunciation. Here, 'Атом' (atom) receives an acute diacritic on the first vowel to indicate the word-initial stress: 'А́том'. The use of stress marks in questions, however, is very uncommon. This disparity in spelling between the question and the answer poses a challenge for establishing context matching: models may be misled by the special character and discard the head entity 'А́том' in the answer as a match candidate for 'Атом' in the question, even assuming normalized capitalization.
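A stress mark of this kind is typically encoded as a separate combining character (U+0301, combining acute accent) following the vowel, so it can be removed before comparison. The sketch below assumes that encoding; `strip_stress` is a name introduced here for illustration.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def strip_stress(text: str) -> str:
    """Remove combining acute accents used as Russian stress marks."""
    # Compose first so letters such as 'й' and 'ё' become single code points
    # and only free-standing acute accents remain to be removed.
    composed = unicodedata.normalize("NFC", text)
    return composed.replace(COMBINING_ACUTE, "")


question_term = "Атом"
# Build the stressed form by inserting the accent after the first vowel.
answer_term = question_term[0] + COMBINING_ACUTE + question_term[1:]

matches_raw = question_term == answer_term
matches_normalized = question_term == strip_stress(answer_term)
```

The removal is deliberately limited to the acute accent. Decomposing the text and dropping every combining mark would also turn 'й' into 'и' and 'ё' into 'е', which changes the spelling of ordinary Russian words.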
Q: | ؟ | محمد | بن | عبدالسلام | ولد | متى |
? | mHmd | bn | EbdAlslAm | wld | mtY | |
When was AbdulSalam bin Muhammad born? |
A: | . | ( | م | 1904 | - | 1830 | / | هـ | 1322 | - | هـ | 1246 | ) | العَلَمي | أحمد | بن | محمد | بن | السلام | عبد |
. | ( | m | 1904 | - | 1830 | / | h_ | 1322 | - | h_ | 1246 | ) | AlEalamy | A#Hmd | bn | mHmd | bn | AlslAm | Ebd | |
Abdul Salam bin Muhammad bin Ahmed Al-Alami (1246 AH - 1322 AH / 1830 - 1904 AD). |
This is an example of Arabic name de-spacing. The name appears as 'AbdulSalam' in the question and 'Abdul Salam' in the answer. This is potentially because of the visual break in the script between the two parts of the name. In manual orthography, the presence of the space would be nearly undetectable; its existence becomes an issue only in the digital realm.
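The spacing mismatch can be shown with a short sketch that compares the two spellings with whitespace removed. The `despace` helper is a name introduced here, and applying it only to candidate name spans (rather than to whole sentences) is an assumption of this illustration.

```python
def despace(text: str) -> str:
    """Drop all whitespace so spaced and unspaced spellings compare equal."""
    return "".join(text.split())


question_name = "عبدالسلام"   # written as one token in the question
answer_name = "عبد السلام"    # written as two tokens in the answer

matches_raw = question_name == answer_name
matches_despaced = despace(question_name) == despace(answer_name)
```

Removing whitespace also erases every other token boundary, so this comparison is best restricted to short spans that are already suspected to be the same name.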
Q: | ؟ | لبيتهوفن | سيمفونية | اول | هي | ما |
? | lbythwfn | symfwnyp | Awl | hy | mA | |
What is Beethoven's first symphony? |
A: | 21 | Op. | تصنيف | الكبير | دو | سلم | في | لبيتهوفن | الأولى | لسيمفونية |
21 | Op. | tSnyf | Alkbyr | dw | slm | fy | lbythwfn | AlA#wlY | lsymfwnyp | |
Beethoven's First Symphony in C Major Op. 21 |
This Arabic example demonstrates that grammatical gender marking can be inconsistent across the question-answer pair: the ordinal numeral 'first' shows gender variation ('Awl' vs 'AlA#wlY') between the question and answer.
Q: | Как | далеко | Уран | от | Земли | ? |
Kak | daleko | Uran | ot | Zemli | ? | |
how | far | Uranus-sg.nom | from | Earth-sg.gen | ? | |
How far is Uranus from Earth? |
A: | Расстояние | между | Ураном | и | Землёй | меняется | от | 2,6 | до | 3,15 | млрд | км... |
Rasstoyanie | mezhdu | Uranom | i | Zemlei | menyaetsya | ot | 2,6 | do | 3,15 | mlrd | km... | |
distance | between | Uranus-sg.instr | and | Earth-sg.instr | varies | from | 2,6 | to | 3,15 | bln | km... | |
The distance between Uranus and Earth fluctuates from 2.6 to 3.15 bln km... |
This Russian example shows morphological variation across the question-answer pair caused by a difference in syntactic context: the entities are identical but have different surface forms, which makes simple string matching more difficult. In the question, the names of the planets appear as the subject (Уран-, Uranus-nom) and as the object of a preposition (от Земли, from Earth-gen). The relevant passage with the answer has the names of the planets in a coordinated phrase that is the object of a preposition (между Ураном и Землёй, between Uranus-instr and Earth-instr). Because the syntactic contexts are different, the names of the planets have different case marking.
Q: | Missä | olut | on | keksitty? |
Where | beer.sg.nom | was | invented? | |
Where was beer invented? |
A: | Varhaisimiilta, | yli | 10000 | vuotta | vanhoilta | viljelyalueilta | Anatoliassa | on | löydetty | merkkejä | olu-en | valmistuksesta. |
Earliest, | over | 10000 | year | old | farmland | Anatolia | was | found | signs | beer-sg.gen | preparation. | |
Signs of beer preparation have been found in the earliest farmlands in Anatolia, which are over 10,000 years old. |
This Finnish example illustrates how a single difference in the morphological feature of case can result in a string mismatch between two realizations of the same entity. 'Beer' is singular in both the question and answer, but we see it in the nominative case in the question (olut-nom) and in the genitive case in the answer (olu-en-gen).
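The effect of case inflection on exact matching can be seen in a small sketch. It uses a crude shared-prefix heuristic purely for illustration: `shared_prefix`, `likely_same_lemma`, and the `min_ratio` threshold are hypothetical names and values introduced here, and a real system would more plausibly rely on a morphological analyzer or lemmatizer.

```python
def shared_prefix(a: str, b: str) -> str:
    """Return the longest common prefix of two lowercased strings."""
    a, b = a.lower(), b.lower()
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


def likely_same_lemma(a: str, b: str, min_ratio: float = 0.5) -> bool:
    """Crude heuristic: the shared prefix covers enough of the longer word."""
    prefix = shared_prefix(a, b)
    return len(prefix) / max(len(a), len(b)) >= min_ratio


question_form = "olut"   # 'beer', nominative
answer_form = "oluen"    # 'beer', genitive

matches_raw = question_form == answer_form
stem = shared_prefix(question_form, answer_form)
matches_heuristic = likely_same_lemma(question_form, answer_form)
```

A prefix threshold like this produces false positives for unrelated words that happen to start alike, and it misses forms whose stem changes near the beginning of the word, which is why it should be read as a demonstration of the problem rather than a solution.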
Q: | ఖండాలలో | అతి-పెద్ద | ఖండం | ఏ-ది | ? |
continents | superl-large | continent | which-pred | ? | |
What is the largest continent? |
A: | ఖండాలలో | వైశాల్యం | ఆధారంగా | అతి-పెద్ద-ది | : | ఆసియా | . |
continents | area | depending | superl-large-pred | : | Asia | . | |
In terms of area the largest continent is Asia. |
This Telugu example challenges our assumptions about the surface representation of copular verbs. Telugu does not always have explicit copula verbs and instead relies on predicate suffixation. In this example the predicate suffix is on the question word 'which' (ఏ-ది, which-pred), while the adjective 'large' has no overt suffixation (అతిపెద్ద). The answer context contains that adjective 'large', which is inflected with the predicate suffix (అతిపెద్ద-ది, large-pred). Because the adjective serves predicate function in the answer context, it has a different form than in the question.
Q: | Kuka | keksi | viiko-n-päivä-t | ? |
who | invented | week-gen-day-pl | ? | |
Who invented the days of the week? |
A: | Seitsen-päivä-inen | viikko | on | todennäköisesti | lähtöisin | Babylonia-sta |
seven-nom-day-pl.adj | week-nom | is | likely | origin | Babylonia-ela | |
The seven-day week is likely from Babylonia. |
This Finnish example demonstrates the interaction of complex morphology and compounding. Similar to English, 'weekdays' is a compound in Finnish. However, in the Finnish compound, 'week' is inflected in the genitive case, reflected by the attachment of the genitive suffix -n and the change of 'kk' to 'k' in the inflectional stem (a common morphophonological process in Finnish known as consonant gradation). The plural is marked on the head of the compound 'day' by the plural suffix -t. 'Week' is present in the answer context as a standalone word in the nominative case, which has no overt case marking. In the answer, 'week' is modified by a compound adjective composed of a numeric nominal indicating a set of seven and an adjectival, plural form of 'day'.
Q: | Как | называется | мясо | овцы | ? |
Kak | nazyvaetsya | myaso | ovtsy | ? | |
how | called | meat | sheep.NOUN.fem.sg.gen | ? | |
What is sheep's meat called? |
A: | Овечье | мясо | , | называемое | бараниной | , | является | одним | из | важнейших | продуктов... |
Ovechie | myaso | , | nazyvaemoe | baraninoi | , | yavlyaetsya | odnim | iz | vazhneishih | produktov... | |
sheep.ADJ.neut.sg.nom | meat | , | called | lamb | , | constitutes | one | from | important | products... | |
Sheep meat, called lamb, is one of the most important products... |
This Russian example demonstrates how attributive constructions in Russian exhibit part of speech fluctuation of the attribute. Typically, noun attributes come in post-position to the head in genitive case (similar to a possessive construction), while adjective attributes precede the noun head. Here, the syntactic function of the word 'sheep' enforces that it is used as a noun in the question (овцы, ovtsy-NOUN-fem.sg.gen) and as an adjective in the answer (овечье, ovechie-ADJ-neut.sg.nom). The noun/adjective fluctuation involves non-trivial and irregular derivation.