Datasets:

allocine

Languages:

fr

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

License:

mit

Dataset Card for Allociné

Dataset Summary

The Allociné dataset is a French-language dataset for sentiment analysis. The texts are movie reviews written between 2006 and 2020 by members of the Allociné.fr community. It contains 100k positive and 100k negative reviews, divided into train (160k), validation (20k), and test (20k) splits.

Supported Tasks and Leaderboards

  • text-classification, sentiment-classification: The dataset can be used to train a model for sentiment classification. Model performance is evaluated by the accuracy of the predicted labels compared with the labels given in the dataset. A BERT-based model, tf-allociné, achieves 97.44% accuracy on the test set.
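As a rough illustration of the accuracy metric described above, the following sketch compares predicted labels with reference labels. The function name `accuracy` and the toy label lists are hypothetical and chosen for illustration; this is not the evaluation code used for tf-allociné.

```python
def accuracy(predicted, gold):
    """Return the fraction of predictions that match the reference labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy labels for illustration only (0 = negative, 1 = positive).
gold_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]

score = accuracy(predicted_labels, gold_labels)
print(f"accuracy = {score:.2%}")  # 4 of 5 correct
```

Because the splits are close to balanced, plain accuracy is a reasonable summary here; on a skewed dataset a per-class metric would usually be reported alongside it.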

Languages

The text is in French, as spoken by users of the Allociné.fr website. The BCP-47 code for French is fr.

Dataset Structure

Data Instances

Each data instance contains the following features: review and label. In the Hugging Face distribution of the dataset, the label has 2 possible values, 0 and 1, which correspond to negative and positive respectively. See the Allociné corpus viewer to explore more examples.

An example from the Allociné train set looks like the following:

{'review': 'Premier film de la saga Kozure Okami, "Le Sabre de la vengeance" est un très bon film qui mêle drame et action, et qui, en 40 ans, n\'a pas pris une ride.',
 'label': 1}

Data Fields

  • 'review': a string containing the review text
  • 'label': an integer, either 0 or 1, indicating a negative or positive review, respectively
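The two fields can be handled as in the sketch below, which maps the integer label to a readable name and rejects malformed instances. The names `LABEL_NAMES` and `describe` are hypothetical helpers introduced here, not part of the dataset or of any library; the example instance is the one from the train set shown earlier.

```python
# Label meaning as documented in this card: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(instance):
    """Validate one instance and return the name of its sentiment label."""
    review = instance["review"]
    label = instance["label"]
    if not isinstance(review, str):
        raise TypeError("'review' must be a string")
    if label not in LABEL_NAMES:
        raise ValueError(f"'label' must be 0 or 1, got {label!r}")
    return LABEL_NAMES[label]


example = {
    "review": (
        'Premier film de la saga Kozure Okami, "Le Sabre de la vengeance" '
        "est un très bon film qui mêle drame et action, et qui, en 40 ans, "
        "n'a pas pris une ride."
    ),
    "label": 1,
}

print(describe(example))
```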

Data Splits

The Allociné dataset has 3 splits: train, validation, and test. The splits contain disjoint sets of movies. The following table contains the number of reviews in each split and the percentage of positive and negative reviews.

| Dataset Split | Number of Instances in Split | Percent Negative Reviews | Percent Positive Reviews |
|---|---|---|---|
| Train | 160,000 | 49.6% | 50.4% |
| Validation | 20,000 | 51.0% | 49.0% |
| Test | 20,000 | 52.0% | 48.0% |
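To relate these percentages to the roughly 100k reviews per class mentioned in the summary, the arithmetic can be sketched as follows. The percentages are rounded to one decimal place, so the resulting counts are approximations rather than exact figures; the names `SPLITS` and `approximate_counts` are hypothetical.

```python
# Split sizes and rounded class percentages, copied from the table above.
SPLITS = {
    "train": (160_000, 49.6, 50.4),
    "validation": (20_000, 51.0, 49.0),
    "test": (20_000, 52.0, 48.0),
}


def approximate_counts(total, pct_negative, pct_positive):
    """Convert rounded percentages into approximate per-class review counts."""
    negative = round(total * pct_negative / 100)
    positive = round(total * pct_positive / 100)
    return negative, positive


counts = {name: approximate_counts(*row) for name, row in SPLITS.items()}
total_negative = sum(neg for neg, _ in counts.values())
total_positive = sum(pos for _, pos in counts.values())

for name, (neg, pos) in counts.items():
    print(f"{name}: ~{neg:,} negative, ~{pos:,} positive")
print(f"all splits: ~{total_negative:,} negative, ~{total_positive:,} positive")
```

Under these rounded percentages the totals come to roughly 99,960 negative and 100,040 positive reviews, which is consistent with the approximately balanced 100k-per-class figure.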

Dataset Creation

Curation Rationale

The Allociné dataset was developed to support large-scale sentiment analysis in French. It was released alongside the tf-allociné model and used to compare the performance of several language models on this task.

Source Data

Initial Data Collection and Normalization

The reviews and ratings were collected using a list of film page URLs and the allocine_scraper.py tool. Up to 30 reviews were collected for each film.

The reviews were originally labeled with a rating from 0.5 to 5.0, in steps of 0.5. Ratings less than or equal to 2 are labeled as negative and ratings greater than or equal to 4 are labeled as positive. Only reviews with fewer than 2,000 characters are included in the dataset.
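The labeling and length rules above can be sketched as a small filter. This is a minimal illustration, not the code of allocine_scraper.py: the names `label_from_rating`, `to_instance`, and `MAX_REVIEW_CHARS` are hypothetical, and dropping ratings between 2.5 and 3.5 is an assumption inferred from the fact that the dataset has only two labels.

```python
MAX_REVIEW_CHARS = 2000  # reviews must be strictly shorter than this


def label_from_rating(rating):
    """Map a 0.5-5.0 rating to 0 (negative), 1 (positive), or None."""
    # Ratings run from 0.5 to 5.0 in steps of 0.5.
    if not (0.5 <= rating <= 5.0) or (rating * 2) != int(rating * 2):
        raise ValueError(f"invalid rating: {rating!r}")
    if rating <= 2.0:
        return 0
    if rating >= 4.0:
        return 1
    return None  # assumption: ratings from 2.5 to 3.5 receive no label


def to_instance(review, rating):
    """Return a dataset instance, or None if the review is filtered out."""
    if len(review) >= MAX_REVIEW_CHARS:
        return None
    label = label_from_rating(rating)
    if label is None:
        return None
    return {"review": review, "label": label}


print(to_instance("Un très bon film.", 4.5))
print(to_instance("Film moyen.", 3.0))
```

Leaving a gap between the negative and positive thresholds keeps ambiguous, middling reviews out of a binary task, at the cost of discarding part of the collected data.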

Who are the source language producers?

The dataset contains movie reviews produced by the online community of the Allociné.fr website.

Annotations

The dataset does not contain any additional annotations.

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

Reviewer usernames or personal information were not collected with the reviews, but could potentially be recovered. The content of each review may include information and opinions about the film's actors, film crew, and plot.

Considerations for Using the Data

Social Impact of Dataset

Sentiment classification is a complex task which requires sophisticated language understanding skills. Successful models can support decision-making based on the outcome of the sentiment analysis, though such models currently require a high degree of domain specificity.

The community represented in the dataset may not represent the potential users of a given downstream application, and the observed behavior of a model trained on this dataset may vary with the domain and use case.

Discussion of Biases

The Allociné website lists a number of topics that violate its terms of service. Further analysis is needed to determine the extent to which moderators have successfully removed such content.

Other Known Limitations

The limitations of the Allociné dataset have not yet been investigated. However, Staliūnaitė and Bonfil (2017) detail linguistic phenomena that are generally present in sentiment analysis but difficult for models to label accurately, such as negation, adverbial modifiers, and reviewer pragmatics.

Additional Information

Dataset Curators

The Allociné dataset was collected by Théophile Blard.

Licensing Information

The Allociné dataset is licensed under the MIT License.

Citation Information

Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository, https://github.com/TheophileBlard/french-sentiment-analysis-with-bert

Contributions

Thanks to @thomwolf, @TheophileBlard, @lewtun and @mcmillanmajora for adding this dataset.