Dataset:

Bingsu/KcBERT_Pre-Training_Corpus

KcBERT Pre-Training Corpus (Korean News Comments)

KcBERT

beomi/kcbert-base

GitHub KcBERT repo: https://github.com/Beomi/KcBERT. KcBERT is a Korean Comments BERT model pretrained on this corpus. (You can use it via Hugging Face's Transformers library!)
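
For example, a minimal sketch of loading the model with Transformers (assuming the transformers library is installed; kcbert-base is a masked-LM checkpoint):

>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
>>> model = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")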

This dataset contains the CLEANED corpus, preprocessed with the code below.

import re
import emoji
from soynlp.normalizer import repeat_normalize

# Note: emoji.UNICODE_EMOJI was removed in emoji>=2.0;
# on recent versions, use emoji.EMOJI_DATA.keys() instead.
emojis = ''.join(emoji.UNICODE_EMOJI.keys())
# Keep spaces, the listed punctuation, the ASCII range, Korean jamo/syllables
# (ㄱ-힣), and emojis; replace runs of anything else with a single space.
pattern = re.compile(f'[^ .,?!/@$%~%·∼()\x00-\x7Fㄱ-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)                  # drop disallowed characters
    x = url_pattern.sub('', x)               # strip URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)   # collapse repeats: 'ㅋㅋㅋㅋㅋ' -> 'ㅋㅋ'
    return x
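
A rough usage sketch (the exact output depends on the installed emoji and soynlp versions):

>>> clean('ㅋㅋㅋㅋㅋ 좋아요!! https://example.com')
'ㅋㅋ 좋아요!!'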

License

CC BY-SA 4.0

Dataset Structure

Data Instance

>>> from datasets import load_dataset
>>> dataset = load_dataset("Bingsu/KcBERT_Pre-Training_Corpus")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 86246285
    })
})
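
Individual rows can then be indexed as plain dictionaries (the output below is a placeholder, not an actual row from the corpus):

>>> dataset["train"][0]
{'text': ...}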

Data Size

  • download: 7.90 GiB
  • generated: 11.86 GiB
  • total: 19.76 GiB

※ You can also download this dataset from Kaggle, where it is 5 GiB compressed (12.48 GiB when uncompressed).
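
Since the full corpus is large, streaming mode can iterate over it without downloading everything up front (a sketch; requires a datasets version with streaming support):

>>> stream = load_dataset("Bingsu/KcBERT_Pre-Training_Corpus", split="train", streaming=True)
>>> next(iter(stream))  # yields one {'text': ...} dict at a time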

Data Fields

  • text: string

Data Splits

train: 86,246,285 texts