w11wo/malaysian-distilbert-small

Model:

w11wo/malaysian-distilbert-small

Task:

Fill-Mask

Libraries:

PyTorch TensorFlow Transformers Safetensors

Datasets:

oscar

Language:

Other:

distilbert malaysian-distilbert-small AutoTrain Compatible

Preprint:

arxiv:1910.01108

License:

mit

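Malaysian DistilBERT Small

Malaysian DistilBERT Small is a masked language model based on the DistilBERT model. It was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset.

The model was initialized from HuggingFace's pretrained English DistilBERT model and then fine-tuned on the Malaysian dataset. It achieved a perplexity of 10.33 on the validation dataset (20% of the dataset). Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou.

The model was trained with Hugging Face's Transformers library, using the base DistilBERT model and the Trainer class. PyTorch was the backend framework during training, but the model remains compatible with TensorFlow.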

Model

Model	#params	Arch.	Training/Validation data (text)
malaysian-distilbert-small	66M	DistilBERT Small	OSCAR unshuffled_original_ms Dataset
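The parameter count in the table can be checked with a few lines of PyTorch. This is a minimal sketch, not the author's procedure: `count_parameters` and `count_pretrained` are hypothetical helper names, and loading the checkpoint through `AutoModelForMaskedLM` is an assumption. The exact figure depends on whether the masked-language-modelling head is included, so it may differ slightly from 66M.

```python
import torch


def count_parameters(model: torch.nn.Module) -> int:
    """Sum the number of elements over all parameters of a module."""
    return sum(p.numel() for p in model.parameters())


def count_pretrained(pretrained_name: str = "w11wo/malaysian-distilbert-small") -> int:
    """Download the checkpoint and count its parameters (needs network access)."""
    # Imported lazily so count_parameters works without transformers installed.
    from transformers import AutoModelForMaskedLM

    model = AutoModelForMaskedLM.from_pretrained(pretrained_name)
    return count_parameters(model)
```

Calling `count_pretrained()` downloads the weights once and returns the count as an integer.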

Evaluation Results

The model was trained for 1 epoch. The following are the final results once training ended.

train loss	valid loss	perplexity	total time
2.476	2.336	10.33	0:40:05
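The perplexity column follows from the validation loss: perplexity is the exponential of the mean cross-entropy loss. The sketch below assumes the reported losses are mean cross-entropy values in nats (the usual convention for the Trainer class); `perplexity` is an illustrative helper name.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(cross_entropy_loss)


valid_ppl = perplexity(2.336)  # validation loss from the table
train_ppl = perplexity(2.476)  # training loss from the table

print(f"validation perplexity: {valid_ppl:.2f}")
print(f"training perplexity:   {train_ppl:.2f}")
```

The validation value comes out at roughly 10.3, which agrees with the reported 10.33 up to rounding of the loss shown in the table.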

How to Use

As Masked Language Model

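from transformers import pipeline

pretrained_name = "w11wo/malaysian-distilbert-small"

# Load the model and its tokenizer into a fill-mask pipeline.
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Returns the highest-scoring candidate tokens for the [MASK] position.
fill_mask("Henry adalah seorang lelaki yang tinggal di [MASK].")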
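For a single `[MASK]`, the fill-mask pipeline returns a list of dictionaries with keys such as `score`, `token_str`, and `sequence`. The following sketch shows one way to reduce that output to the best few candidates. `top_candidates` and `predict_mask` are hypothetical helpers introduced here for illustration, not part of the model card.

```python
def top_candidates(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked[:k]]


def predict_mask(text, model_name="w11wo/malaysian-distilbert-small", k=3):
    """Run the fill-mask pipeline on text and keep the top k candidates."""
    # Imported lazily so top_candidates can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name, tokenizer=model_name)
    return top_candidates(fill_mask(text), k=k)
```

For example, `predict_mask("Henry adalah seorang lelaki yang tinggal di [MASK].")` would return up to three `(token, score)` pairs, best first.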

Feature Extraction in PyTorch

from transformers import DistilBertModel, DistilBertTokenizerFast

pretrained_name = "w11wo/malaysian-distilbert-small"
model = DistilBertModel.from_pretrained(pretrained_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_name)

prompt = "Bolehkah anda [MASK] Bahasa Melayu?"
# Tokenize the prompt into PyTorch tensors.
encoded_input = tokenizer(prompt, return_tensors='pt')
# output.last_hidden_state holds one feature vector per token.
output = model(**encoded_input)
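The model output contains one hidden vector per token. A common way to turn these into a single sentence vector is attention-masked mean pooling. This pooling step is an assumption added for illustration; the model card itself does not prescribe one, and `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def embed(text: str, pretrained_name: str = "w11wo/malaysian-distilbert-small") -> torch.Tensor:
    """Encode text and return one pooled vector per input (needs network access)."""
    # Imported lazily so mean_pool can be used without transformers installed.
    from transformers import DistilBertModel, DistilBertTokenizerFast

    model = DistilBertModel.from_pretrained(pretrained_name)
    tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_name)
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

Masking matters mainly for batched inputs with padding. For a single unpadded sentence, the result equals the plain mean over tokens.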

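Disclaimer

Do consider the biases from the OSCAR dataset that may be carried over into the results of this model.

Author

Malaysian DistilBERT Small was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using its free GPU access.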

Author:

Wilson Wongso

Dataset size:

857.92 MB