Dataset: allocine
Tasks: text classification
Languages: fr
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotations creators: no-annotation
Source datasets: original
License: mit
The Allociné dataset is a French-language dataset for sentiment analysis. The texts are movie reviews written between 2006 and 2020 by members of the Allociné.fr community for various films. It contains 100k positive and 100k negative reviews divided into train (160k), validation (20k), and test (20k).
The text is in French, as spoken by users of the Allociné.fr website. The BCP-47 code for French is fr.
Each data instance contains the following features: review and label. In the Hugging Face distribution of the dataset, the label has 2 possible values, 0 and 1, which correspond to negative and positive respectively. See the Allociné corpus viewer to explore more examples.
An example from the Allociné train set looks like the following:
{'review': 'Premier film de la saga Kozure Okami, "Le Sabre de la vengeance" est un très bon film qui mêle drame et action, et qui, en 40 ans, n\'a pas pris une ride.', 'label': 1}
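To make the instance format concrete, here is a minimal sketch that turns the integer label of a record into a readable name. The names `LABEL_NAMES`, `describe`, and `load_allocine` are hypothetical helpers introduced for illustration. `load_allocine` assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the identifier `allocine`; it is defined but not called here.

```python
# Label mapping stated in the card: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(instance):
    """Return (label name, review length in characters) for one record."""
    return LABEL_NAMES[instance["label"]], len(instance["review"])


def load_allocine(split="train"):
    """Assumption: needs the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import so the rest runs offline

    return load_dataset("allocine", split=split)


# The train-set example quoted above, written as a valid Python literal.
example = {
    "review": 'Premier film de la saga Kozure Okami, "Le Sabre de la vengeance" '
              'est un très bon film qui mêle drame et action, et qui, en 40 ans, '
              'n\'a pas pris une ride.',
    "label": 1,
}

print(describe(example)[0])  # positive
```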
The Allociné dataset has 3 splits: train, validation, and test. The splits contain disjoint sets of movies. The following table contains the number of reviews in each split and the percentage of positive and negative reviews.
Dataset Split | Number of Instances in Split | Percent Negative Reviews | Percent Positive Reviews |
---|---|---|---|
Train | 160,000 | 49.6% | 50.4% |
Validation | 20,000 | 51.0% | 49.0% |
Test | 20,000 | 52.0% | 48.0% |
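As a sanity check on the table, the percentages can be converted back into approximate review counts. The sketch below does this with integer arithmetic. The helper name `approximate_counts` and the tenths-of-a-percent encoding are choices made for this illustration. Because the published percentages are rounded, the resulting counts are estimates, not exact figures.

```python
# (split name, number of reviews, percent negative in tenths of a percent)
SPLITS = [
    ("train", 160_000, 496),
    ("validation", 20_000, 510),
    ("test", 20_000, 520),
]


def approximate_counts(splits):
    """Convert the rounded percentages into approximate review counts."""
    counts = {}
    for name, size, negative_tenths in splits:
        negative = size * negative_tenths // 1000
        counts[name] = {"negative": negative, "positive": size - negative}
    return counts


counts = approximate_counts(SPLITS)
total_negative = sum(c["negative"] for c in counts.values())
total_positive = sum(c["positive"] for c in counts.values())

print(counts["train"])                 # {'negative': 79360, 'positive': 80640}
print(total_negative, total_positive)  # 99960 100040
```

The estimated totals land within a few dozen reviews of the stated 100k negative and 100k positive, which is what rounding to one decimal place would produce.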
The Allociné dataset was developed to support large-scale sentiment analysis in French. It was released alongside the tf-allociné model and used to compare the performance of several language models on this task.
The reviews and ratings were collected using a list of film page urls and the allocine_scraper.py tool. Up to 30 reviews were collected for each film.
The reviews were originally labeled with a rating from 0.5 to 5.0, in steps of 0.5. Ratings less than or equal to 2 are labeled as negative, and ratings greater than or equal to 4 are labeled as positive. Only reviews with fewer than 2000 characters are included in the dataset.
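The labeling rule above can be sketched as a small function. This is an illustration, not the code used to build the dataset: the names `label_from_rating`, `keep_review`, and `MAX_CHARS` are hypothetical. Returning `None` for ratings strictly between 2 and 4 is an assumption, since the card does not say how those reviews are handled.

```python
from typing import Optional

MAX_CHARS = 2000  # only reviews with fewer than 2000 characters are included


def label_from_rating(rating: float) -> Optional[int]:
    """Map a 0.5-5.0 star rating to 0 (negative) or 1 (positive)."""
    if rating <= 2.0:
        return 0
    if rating >= 4.0:
        return 1
    # Assumption: ratings between 2 and 4 receive no label.
    return None


def keep_review(text: str, rating: float) -> bool:
    """True if the review is short enough and its rating maps to a label."""
    return len(text) < MAX_CHARS and label_from_rating(rating) is not None


print(label_from_rating(1.5))      # 0
print(label_from_rating(4.0))      # 1
print(label_from_rating(3.0))      # None
print(keep_review("Très bon film.", 4.5))  # True
```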
Who are the source language producers?
The dataset contains movie reviews produced by the online community of the Allociné.fr website.
The dataset does not contain any additional annotations.
Annotation process
[N/A]
Who are the annotators?
[N/A]
Reviewer usernames or personal information were not collected with the reviews, but could potentially be recovered. The content of each review may include information and opinions about the film's actors, film crew, and plot.
Sentiment classification is a complex task which requires sophisticated language understanding skills. Successful models can support decision-making based on the outcome of the sentiment analysis, though such models currently require a high degree of domain specificity.
It should be noted that the community represented in the dataset may not represent any downstream application's potential users, and the observed behavior of a model trained on this dataset may vary based on the domain and use case.
The Allociné website lists a number of topics which violate their terms of service. Further analysis is needed to determine the extent to which moderators have successfully removed such content.
The limitations of the Allociné dataset have not yet been investigated. However, Staliūnaitė and Bonfil (2017) detail linguistic phenomena that are generally present in sentiment analysis but difficult for models to label accurately, such as negation, adverbial modifiers, and reviewer pragmatics.
The Allociné dataset was collected by Théophile Blard.
The Allociné dataset is licensed under the MIT License.
Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository, https://github.com/TheophileBlard/french-sentiment-analysis-with-bert
Thanks to @thomwolf, @TheophileBlard, @lewtun and @mcmillanmajora for adding this dataset.