中文

Dataset Description

TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark for few-shot Russian language understanding evaluation that includes six complex NLU tasks, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation across different axes:

  • subpopulations for nuanced interpretation
  • linguistic-oriented adversarial attacks and perturbations for analysing robustness

General data collection principles of TAPE are based on combining "intellectual abilities" needed to solve GLUE-like tasks, ranging from world knowledge to logic and commonsense reasoning. Based on the GLUE format, we have built six new datasets from the ground up, each of them requiring the modeling abilities of at least two skills:

  • reasoning and logic (Winograd scheme);
  • reasoning and world knowledge (CheGeKa, and RuOpenBookQA and RuWorldTree);
  • multi-hop reasoning (MultiQ);
  • ethical judgments + reasoning (Ethics).

Dataset Structure

  • (a) D test is passed to the adversarial framework to create the adversarial D test that includes the original and adversarial examples.
  • (b) We randomly sample five sets of demonstration examples from D train for each k ∈ {1, 4, 8} . In the zero-shot scenario, we skip this stage.
  • (c) After that, we merge the demonstrations, when applicable, with the examples from the adversarial D test to construct evaluation episodes.
  • (d) Each episode is used to obtain predictions from the model.
  • (e) The performance is summarized in a diagnostic evaluation report.

The perturbations, included in the framework, can be divided into two categories:

  • Word-Level Perturbations : spelling (mimicking spelling mistakes) and modality (replacement of the input with emojis)
  • Sentence-Level Perturbations : random (token deletion and swaps), distraction (generation of additional text) and paraphrases (generating context variations)

Refer to the TAPE paper or the RuTransform repo for more information.

Tasks

Winograd

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning.

Motivation

The dataset presents an extended version of a traditional Winograd challenge (Levesque et al., 2012) : each sentence contains unresolved homonymy, which can be resolved based on commonsense and reasoning. The Winograd scheme is extendable with the real-life sentences filtered out of the National Corpora with a set of 11 syntactic queries, extracting sentences like " Katya asked Masha if she ..." (two possible references to a pronoun), "A change of scenery that ..." (Noun phrase & subordinate clause with "that" in the same gender and number), etc. The extraction pipeline can be adjusted to various languages depending on the set of ambiguous syntactic constructions possible.

Dataset Composition Data Instances

Each instance in the dataset is a sentence with unresolved homonymy.

{
    'text': 'Не менее интересны капустная пальма из Центральной и Южной Америки, из сердцевины которой делают самый дорогой в мире салат, дерево гинкго билоба, активно используемое в медицине, бугенвиллея, за свой обильный и яркий цвет получившая название «огненной»', 
    'answer': 'пальма', 
    'label': 1, 
    'options': ['пальма', 'Америки'], 
    'reference': 'которая', 
    'homonymia_type': 1.1, 
    'episode': [15], 
    'perturbation': 'winograd'
}

An example in English for illustration purposes:

{
    ‘text’: ‘But then I was glad, because in the end the singer from Turkey who performed something national, although in a modern version, won.’,
    ‘answer’: ‘singer’,
    ‘label’: 1, 
    ‘options’: [‘singer’, ‘Turkey’], 
    ‘reference’: ‘who’, 
    ‘homonymia_type’: ‘1.1’, 
    episode: [15], 
    ‘perturbation’ : ‘winograd’
}
Data Fields
  • text : a string containing the sentence text
  • answer : a string with a candidate for the coreference resolution
  • options : a list of all the possible candidates present in the text
  • reference : a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)
  • homonymia_type : a float corresponding to the type of the structure with syntactic homonymy
  • label : an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or not
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

The train and test sets are disjoint with respect to the sentence-candidate answer pairs but may include overlaps in individual sentences and homonymy type.

Test Perturbations

Each training episode in the dataset corresponds to six test variations, including the original test data and five adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDA swap : randomly swaps tokens in the text
  • AddSent : generates extra words or a sentence at the end of the text
General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split Size (Original/Perturbed) Label Distribution
Train.raw 804 66.3 / 33.7
Test.raw 3458 58.1 / 41.9
Train.episodes 60 72.8 / 27.1
Test.episodes 976 / 5856 58.0 / 42.0
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The texts for the dataset are taken from the Russian National Corpus , the most representative and authoritative corpus of the Russian language available. The corpus includes texts from several domains, including news, fiction, and the web.

Data Collection

The texts for the Winograd scheme problem are obtained using a semi-automatic pipeline.

First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled. For example, two noun phrases with a complex subordinate:

'A trinket from Pompeii that has survived the centuries.'

Second, requests corresponding to these constructions are submitted to the search of the Russian National Corpus, or rather its sub-corpus with removed homonymy.

Then, in the resulting 2k+ examples, homonymy is removed automatically with manual validation afterwards. Each original sentence is split into multiple examples in the binary classification format, indicating whether the homonymy is resolved correctly or not.

Sakaguchi et al. (2019) showed that the data Winograd Schema challenge might contain potential biases. We use the AFLite algorithm to filter out any potential biases in the data to make the test set more challenging for models. However, we do not guarantee that no spurious biases exist in the data.

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.

Motivation

The WorldTree dataset starts the triad of the Reasoning and Knowledge tasks. The data includes the corpus of factoid utterances of various kinds, complex factoid questions and a corresponding causal chain of facts from the corpus resulting in a correct answer.

The WorldTree design was originally proposed in (Jansen et al., 2018) .

Dataset Composition Data Instances

Each instance in the datasets is a multiple-choice science question with 4 answer options.

{
    'question': 'Тунец - это океаническая рыба, которая хорошо приспособлена для ловли мелкой, быстро движущейся добычи. Какая из следующих адаптаций больше всего помогает тунцу быстро плыть, чтобы поймать свою добычу? (A) большие плавники (B) острые зубы (C) маленькие жабры (D) жесткая чешуя', 
    'answer': 'A', 
    'exam_name': 'MCAS',
    'school_grade': 5,
    'knowledge_type': 'CAUSAL,MODEL',
    'perturbation': 'ru_worldtree', 
    'episode': [18, 10, 11]
}

An example in English for illustration purposes:

{
    'question': 'A bottle of water is placed in the freezer. What property of water will change when the water reaches the freezing point? (A) color (B) mass (C) state of matter (D) weight', 
    'answer': 'C', 
    'exam_name': 'MEA',
    'school_grade': 5,
    'knowledge_type': 'NO TYPE',
    'perturbation': 'ru_worldtree', 
    'episode': [18, 10, 11]
}
Data Fields
  • text : a string containing the sentence text
  • answer : a string with a candidate for the coreference resolution
  • options : a list of all the possible candidates present in the text
  • reference : a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)
  • homonymia_type : a float corresponding to the type of the structure with syntactic homonymy
  • label : an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or not
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

We use the same splits of data as in the original English version.

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDA swap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : replaces one or more choice options with a generated one
General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split Size (Original/Perturbed) Label Distribution
Train.raw 118 28.81 / 26.27 / 22.88 / 22.03
Test.raw 633 22.1 / 27.5 / 25.6 / 24.8
Train.episodes 47 29.79 / 23.4 / 23.4 / 23.4
Test.episodes 629 / 4403 22.1 / 27.5 / 25.6 / 24.8
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The questions for the dataset are taken from the original WorldTree dataset, which was sourced from the AI2 Science Questions V2 corpus, consisting of both standardized exam questions from 12 US states, and the AI2 Science Questions Mercury dataset, a set of questions licensed from a student assessment entity.

Data Collection

The dataset mainly consists of automatic translation of the English WorldTree Corpus and human validation and correction.

RuOpenBook

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.

Motivation

RuOpenBookQA is mainly based on the work of (Mihaylov et al., 2018) : it is a QA dataset with multiple-choice elementary-level science questions, which probe the understanding of 1k+ core science facts.

Very similar to the pipeline of the RuWorldTree, the dataset includes a corpus of factoids, factoid questions and correct answer. Only one fact is enough to find the correct answer, so this task can be considered easier.

Dataset Composition Data Instances

Each instance in the datasets is a multiple-choice science question with 4 answer options.

{
    'ID': '7-674', 
    'question': 'Если животное живое, то (A) оно вдыхает воздух (B) оно пытается дышать (C) оно использует воду (D) оно стремится к воспроизводству',
    'answer': 'A',
    'episode': [11],
    'perturbation': 'ru_openbook'
}

An example in English for illustration purposes:

{
    'ID': '7-674', 
    'question': 'If a person walks in the direction opposite to the compass needle, they are going (A) west (B) north (C) east (D) south',
    'answer': 'D',
    'episode': [11],
    'perturbation': 'ru_openbook'
}
Data Fields
  • ID : a string containing a unique question id
  • question : a string containing question text with answer options
  • answer : a string containing the correct answer key (A, B, C or D)
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDA swap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : replaces one or more choice options with a generated one
General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split Size (Original/Perturbed) Label Distribution
Train.raw 2339 31.38 / 23.64 / 21.76 / 23.22
Test.raw 500 25.2 / 27.6 / 22.0 / 25.2
Train.episodes 48 27.08 / 18.75 / 20.83 / 33.33
Test.episodes 500 / 3500 25.2 / 27.6 / 22.0 / 25.2
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The questions are taken from the original OpenBookQA dataset, created via multi-stage crowdsourcing and partial expert filtering.

Data Collection

The dataset mainly consists of automatic translation of the English OpenBookQA and human validation and correction.

Ethics 1

Ethics 1 (sit ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. Namely, the task requires models to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism.

Motivation

There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021) .

Dataset Composition Data Instances

Data instances are given as excerpts from news articles and fiction texts.

{
    'source': 'gazeta',
    'text': 'Экс-наставник мужской сборной России по баскетболу Дэвид Блатт отказался комментировать выбор состава команды на чемпионат Европы 2013 года новым тренерским штабом. «Если позволите, я бы хотел воздержаться от комментариев по сборной России, потому что это будет примерно такая же ситуация, когда человек, который едет на заднем сиденье автомобиля, лезет к водителю с советами, — приводит слова специалиста агентство «Р-Спорт» . — У российской сборной новый главный тренер, новый тренерский штаб. Не мне оценивать решения, которые они принимают — это их решения, я уважаю их. Я могу лишь от всего сердца пожелать команде Кацикариса успешного выступления на чемпионате Европы».', 
    'sit_virtue': 0,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 0,
    'sit_util': 0,
    'episode': [5],
    'perturbation': 'sit_ethics'
}

An example in English for illustration purposes:

{
    'source': 'gazeta',
    'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 
    'sit_virtue': 1,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 1,
    'sit_util': 1,
    'episode': [5],
    'perturbation': 'sit_ethics'
}
Data Fields
  • text : a string containing the body of a news article or a fiction text
  • source : a string containing the source of the text
  • sit_virtue : an integer, either 0 or 1, indicating whether the concept of virtue is present in the text
  • sit_moral : an integer, either 0 or 1, indicating whether the concept of morality is present in the text
  • sit_law :an integer, either 0 or 1, indicating whether the concept of law is present in the text
  • sit_justice : an integer, either 0 or 1, indicating whether the concept of justice is present in the text
  • sit_util : an integer, either 0 or 1, indicating whether the concept of utilitarianism is present in the text
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDAswap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : generates an extra sentence at the end of the text
General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split Size (Original/Perturbed) Label Distribution
Train.raw 254 31.9 / 39.0 / 44.9 / 5.9 / 38.2
Test.raw 1436 31.0 / 34.8 / 36.8 / 15.3 / 39.0
Train.episodes 59 30.51 / 38.98 / 35.59 / 6.78 / 37.29
Test.episodes 1000 / 7000 31.0 / 34.8 / 36.8 / 15.3 / 39.0
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) .

Data Collection

The composition of the dataset is conducted in a semi-automatic mode.

First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017) .

After that, we extract short texts containing these keywords.

Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:

Do you think the text…

  • virtue : is about someone's good/evil intentions?
  • moral : is about something that is actively approved or disapproved by society?
  • law : relates to something connected with law, routine, ceremonial?
  • justice : relates to karma (or the triumph of justice)?
  • util : refers to gains or losses (both material and emotional)?

Examples with low inter-annotator agreement rates were filtered out.

Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.

Ethics 2

Ethics 2 (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. The main objective of the task is to evaluate the positive or negative implementation of five concepts in normative with ‘yes’ and ‘no’ ratings. The included concepts are as follows: virtue, law, moral, justice, and utilitarianism.

Motivation

There are a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021) .

Our Ethics dataset would go through community validation and discussion as it is the first ethics dataset for Russian based on the established methodology. We acknowledge that the work (Hendrycks et al., 2021) has flaws; thus, we do not reproduce the generative approach. We construct the dataset using a similar annotation scheme: we avoid the direct question of whether the deed is good or bad. Instead, we make annotations according to five criteria that describe the aspects of the annotators' attitude to the deed.

Dataset Composition Data Instances

Data instances are given as excerpts from news articles and fiction texts.

{
    'source': 'interfax',
    'text': 'Вашингтон. 8 апреля. ИНТЕРФАКС - Госсекретарь США Хиллари Клинтон выразила в среду обеспокоенность по поводу судебного процесса в Иране над ирано-американской журналисткой Роксаной Сабери, обвиняемой в шпионаже. "Поступившая к нам информация вызывает у нас серьезное беспокойство. Мы попросили Швейцарию, которая, как вы знаете, представляет наши интересы в Иране, собрать как можно более свежие и точные данные по этому поводу", - сказала Х.Клинтон журналистам. Ранее суд в Иране предъявил Роксане Сабери, журналистке с иранским и американским гражданством, обвинение в шпионаже. Судья заявил, что "существуют доказательства вины Р.Сабери, и она уже призналась в преступлениях".',
    'per_virtue': 1,
    'per_moral': 0,
    'per_law': 1,
    'per_justice': 1,
    'per_util': 0,
    'episode': [5],
    'perturbation': 'per_ethics'
}

An example in English for illustration purposes:

{
    'source': 'gazeta',
    'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 
    'sit_virtue': 1,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 1,
    'sit_util': 1,
    'episode': [5],
    'perturbation': 'sit_ethics'
}
Data Fields
  • text : a string containing the body of a news article or a fiction text
  • source : a string containing the source of the text
  • per_virtue : an integer, either 0 or 1, indicating whether virtue standards are violated in the text
  • per_moral : an integer, either 0 or 1, indicating whether moral standards are violated in the text
  • per_law : an integer, either 0 or 1, indicating whether any laws are violated in the text
  • per_justice : an integer, either 0 or 1, indicating whether justice norms are violated in the text
  • per_util : an integer, either 0 or 1, indicating whether utilitarianism norms are violated in the text
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDAswap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : generates an extra sentence at the end of the text
General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split Size (Original/Perturbed) Label Distribution
Train.raw 259 69.1 / 65.3 / 78.4 / 40.9 / 23.9
Test.raw 1466 64.7 / 63.5 / 78.9 / 53.0 / 27.9
Train.episodes 58 67.24 / 65.52 / 77.59 / 46.55 / 24.14
Test.episodes 1000 / 7000 64.7 / 63.5 / 78.9 / 53.0 / 27.9
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) .

Data Collection

The composition of the dataset is conducted in a semi-automatic mode.

First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017) .

After that, we extract short texts containing these keywords.

Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:

Do you think the text…

  • virtue : do people in the text show their best qualities or not?
  • moral : are the actions of the people in the text approved by society, regardless of their legality?
  • law : are the actions of the people in the text legal?
  • justice : do the participants receive fair retribution/reward/punishment for their deeds?
  • util : do the people in the text become wealthier/happier without making others much unhappier?

Examples with low inter-annotator agreement rates were filtered out.

Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.

Motivation

The task can be considered the most challenging in terms of reasoning, knowledge and logic, as the task implies the QA pairs with a free response form (no answer choices); however, a long chain of causal relationships between facts and associations forms the correct answer.

The original corpus of the CheGeKa game was introduced in Mikhalkova (2021) .

Dataset Composition Data Instances

Data instances are given as question and answer pairs.

{
    'question_id': 966,
    'question': '"Каждую ночь я открываю конверт" именно его.',
    'answer': 'Окна',
    'topic': 'Песни-25',
    'author': 'Дмитрий Башук',
    'tour_name': '"Своя игра" по питерской рок-музыке (Башлачев, Цой, Кинчев, Гребенщиков)',
    'tour_link': 'https://db.chgk.info/tour/spbrock',
    'episode': [13, 18],
    'perturbation': 'chegeka'
}

An example in English for illustration purposes:

{
    'question_id': 3665,
    'question': 'THIS MAN replaced John Lennon when the Beatles got together for the last time.',
    'answer': 'Julian Lennon',
    'topic': 'The Liverpool Four',
    'author': 'Bayram Kuliyev',
    'tour_name': 'Jeopardy!. Ashgabat-1996',
    'tour_link': 'https://db.chgk.info/tour/ash96sv',
    'episode': [16],
    'perturbation': 'chegeka'
}
Data Fields
  • question_id : an integer corresponding to the question id in the database
  • question : a string containing the question text
  • answer : a string containing the correct answer to the question
  • topic : a string containing the question category
  • author : a string with the full name of the author
  • tour_name : a string with the title of a tournament
  • tour link : a string containing the link to a tournament (None for the test set)
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDAswap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : generates extra words or a sentence at the end of the question
General Statistics

The following table contains the number of examples in each data split:

Split Size (Original/Perturbed)
Train.raw 29376
Test.raw 520
Train.episodes 49
Test.episodes 520 / 3640
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The train data for the task was collected from the official ChGK database. Since that the database is open and its questions are easily accessed via search machines, a pack of unpublished questions written by authors of ChGK was prepared to serve as a closed test set.

Data Collection

For information on the data collection procedure, please, refer to Mikhalkova (2021) .

Multiq

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.

Motivation

Question-answering has been an essential task in natural language processing and information retrieval. However, certain areas in QA remain quite challenging for modern approaches, including the multi-hop one, which is traditionally considered an intersection of graph methods, knowledge representation, and SOTA language modeling.

Multi-hop reasoning has been the least addressed QA direction for Russian. The task is represented by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQUAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021) . In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata.

Dataset Composition Data Instances

Data instances are given as a question with two additional texts for answer extraction.

{
    'support_text': 'Пабло Андрес Санчес Спакес  ( 3 января 1973, Росарио, Аргентина), — аргентинский футболист, полузащитник. Играл за ряд клубов, такие как: "Росарио Сентраль", "Фейеноорд" и другие, ныне главный тренер чилийского клуба "Аудакс Итальяно".\\n\\nБиография.\\nРезультаты команды были достаточно хорошм, чтобы она заняла второе место. Позже он недолгое время представлял "Депортиво Алавес" из Испании и бельгийский "Харелбек". Завершил игровую карьеру в 2005 году в "Кильмесе". Впоследствии начал тренерскую карьеру. На родине работал в "Банфилде" и "Росарио Сентрале". Также тренировал боливийский "Ориенте Петролеро" (дважды) и ряд чилийских клубов.',
    'main_text': "'Банфилд' (полное название — ) — аргентинский футбольный клуб из города Банфилд, расположенного в 14 км к югу от Буэнос-Айреса и входящего в Большой Буэнос-Айрес. Один раз, в 2009 году, становился чемпионом Аргентины.\\n\\nДостижения.\\nЧемпион Аргентины (1): 2009 (Апертура). Вице-чемпион Аргентины (2): 1951, 2004/05 (Клаусура). Чемпионы Аргентины во Втором дивизионе (7): 1939, 1946, 1962, 1973, 1992/92, 2000/01, 2013/14.",
    'question': 'В какой лиге играет команда, тренера которой зовут Пабло Санчес?',
    'bridge_answers': [{'label': 'passage', 'offset': 528, 'length': 8, 'segment': 'Банфилде'}],
    'main_answers': [{'label': 'passage', 'offset': 350, 'length': 16, 'segment': 'Втором дивизионе'}],
    'episode': [18],
    'perturbation': 'multiq'
}

An example in English for illustration purposes:

{
    'support_text': 'Gerard McBurney (b. June 20, 1954, Cambridge) is a British arranger, musicologist, television and radio presenter, teacher, and writer. He was born in the family of American archaeologist Charles McBurney and secretary Anna Frances Edmonston, who combined English, Scottish and Irish roots. Gerard's brother Simon McBurney is an English actor, writer, and director. He studied at Cambridge and the Moscow State Conservatory with Edison Denisov and Roman Ledenev.',
    'main_text': 'Simon Montague McBurney (born August 25, 1957, Cambridge) is an English actor, screenwriter, and director.\\n\\nBiography.\\nFather is an American archaeologist who worked in the UK. Simon graduated from Cambridge with a degree in English Literature. After his father's death (1979) he moved to France, where he studied theater at the Jacques Lecoq Institute. In 1983 he created the theater company "Complicity". Actively works as an actor in film and television, and acts as a playwright and screenwriter.',
    'question': 'Where was Gerard McBurney's brother born?',
    'bridge_answers': [{'label': 'passage', 'length': 14, 'offset': 300, 'segment': 'Simon McBurney'}],
    'main_answers': [{'label': 'passage', 'length': 9, 'offset': 47, 'segment': Cambridge'}],
    'episode': [15],
    'perturbation': 'multiq'
}
Data Fields
  • question : a string containing the question text
  • support_text : a string containing the first text passage relating to the question
  • main_text : a string containing the main answer text
  • bridge_answers : a list of entities required to hop from the support text to the main text
  • main_answers : a list of answers to the question
  • perturbation : a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode : a list of episodes in which the instance is used. Only used for the train set
Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data : includes the original data with no additional sampling
  • episodes : data is split into evaluation episodes and includes several perturbations of test for robustness evaluation Test and train data sets are disjoint with respect to individual questions, but may include overlaps in support and main texts.
Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

  • ButterFingers : randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify : replaces the input words with the corresponding emojis, preserving their original meaning
  • EDA delete : randomly deletes tokens in the text
  • EDAswap : randomly swaps tokens in the text
  • BackTranslation : generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent : generates an extra sentence at the end of the text
General Statistics

The following table contains the number of examples in each data split:

Split Size (Original/Perturbed)
Train.raw 1056
Test.raw 1000
Train.episodes 64
Test.episodes 1000 / 7000
  • Original - original test data without adversarial perturbations
  • Perturbed - perturbed test, containing both original data and its perturbations
Dataset Creation Data Source

The data for the dataset is sampled from Wikipedia and Wikidata.

Data Collection

The data for the dataset is sampled from Wikipedia and Wikidata.

The pipeline for dataset creation looks as follows:

First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, verb, object) are needed to compose an answerable multi-hop question. For instance, the question "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?" (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: "Blok, Yokhannes" (Block, Johannes), "grazhdanstvo" (country of citizenship), "Germaniya" (Germany), "chast’ sveta" (continent), and "Yevropa" (Europe).

Second, several hundreds of the question templates are curated by a few authors manually, which are further used to fine-tune ruT5-large to generate multi-hop questions given a five-fold sequence.

Third, the resulting questions undergo paraphrasing and several rounds of manual validation procedures to control the quality and diversity.

Finally, each question is linked to two Wikipedia paragraphs, where all graph units appear in the natural language.

Considerations for Using the Data

Societal Impact

The design of our benchmark allows us to alleviate the problems of a large carbon footprint (Bender et al., 2021) and keep computational costs accessible to academic and industrial fields (Couldry and Mejias, 2020) . In particular, our evaluation approach does not consider LMs' fine-tuning and relies on a limited amount of episodes, while the number of attacks and perturbations can be adjusted based on the user's needs. However, achieving high robustness and task generalization may require additional computational costs based on the few-shot learning and prompting method.

Possible Misuse

The framework's usage implies working concerning zero-shot and few-shot practices, such as controlling that the test data is excluded from the pre-training corpus. Our train sets Dtrain are publicly available, and it is not anticipated that the users will apply this data for fine-tuning. Lack of control may lead to indicative and biased model evaluation.

Ethical Considerations

Ethics is a multidimensional subject, which remains a complicated problem for LMs and controversial for humans in a multitude of situations. Our approach is closely related to (Hendrycks et al., 2021) , who introduce the ETHICS benchmark for evaluating LMs' ability to predict ethical judgments about diverse text situations. Although our methodology spans general concepts in normative ethics, we acknowledge that it can be challenging to perform objective ethical judgments about some situations (Martineau, 2006t) . For instance, judgments about law are based on formal criteria (e.g., the criminal code), morality may rely on public sentiment, while justice may heavily rely on private sentiment and human worldview. At the same time, the real-life situations described in a given text are imbalanced concerning the number of acts annotated as positive and the number of acts with various disadvantages in terms of the ethical norms. In practice, this leads to the moderate inter-annotator agreement and approximate human and model performance estimates. Furthermore, other data-dependent problems can be indicated, such as genre bias and author's bias in specific publicly available text sources.

Additional Information

Dataset Curators

Ekaterina Taktasheva , Tatiana Shavrina , Alena Fenogenova , Denis Shevelev , Nadezhda Katricheva , Maria Tikhonova , Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova , Vladislav Mikhailov

Licensing Information

Apache 2.0

Citation Information

@article{taktasheva2022tape,
  title={TAPE: Assessing Few-shot Russian Language Understanding},
  author={Taktasheva, Ekaterina and Shavrina, Tatiana and Fenogenova, Alena and Shevelev, Denis and Katricheva, Nadezhda and Tikhonova, Maria and Akhmetgareeva, Albina and Zinkevich, Oleg and Bashmakova, Anastasiia and Iordanskaia, Svetlana and others},
  journal={arXiv preprint arXiv:2210.12813},
  year={2022}
}