Dataset:
bigcode/programming-languages-keywords
A structured version of https://github.com/e3b0c442/keywords
Generated with:
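import re
import pandas as pd
import requests
from datasets import Dataset
# Fetch the README and split it into one section per "### " heading
r = requests.get("https://raw.githubusercontent.com/e3b0c442/keywords/main/README.md")
keywords = r.text.split("### ")[1:]
keywords = [i for i in keywords if not i.startswith("Sources")]
# Map each heading line to the alphabetic tokens found in its section body
keywords = {i.split("\n")[0]: [j for j in re.findall("[a-zA-Z]*", i.split("\n", 1)[1]) if j] for i in keywords}
keywords = pd.DataFrame(pd.Series(keywords)).reset_index().rename(columns={"index": "language", 0: "keywords"})
# Keep the heading text before the first ") ", then deduplicate and sort the keywords
keywords["language"] = keywords["language"].str.split(r"\) ").str[0]
keywords["keywords"] = keywords["keywords"].apply(lambda x: sorted(set(x)))
ds = Dataset.from_pandas(keywords)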
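To make the parsing step easier to follow, here is a minimal offline sketch of the same logic applied to a small made-up excerpt. `SAMPLE_README`, the `Foo (v1) 3 keywords` heading, and the helper `parse_keywords` are hypothetical and only imitate the layout the script appears to expect; the real README may differ in detail.

```python
import re

# Hypothetical excerpt: one "### " heading per language, followed by its
# keywords, plus a trailing "Sources" section that must be skipped.
SAMPLE_README = """# Keywords

### Foo (v1) 3 keywords

- `if`
- `else`
- `if`

### Sources

- not a language section
"""


def parse_keywords(markdown):
    """Return {language: sorted unique keywords} for each "### " section."""
    # Everything before the first "### " is preamble, so drop element 0.
    sections = markdown.split("### ")[1:]
    sections = [s for s in sections if not s.startswith("Sources")]
    result = {}
    for section in sections:
        # First line is the heading, the rest is the section body.
        heading, _, body = section.partition("\n")
        # Keep only the text before the first ") ", as the script does;
        # this drops the closing parenthesis along with what follows it.
        language = heading.split(") ")[0]
        # Runs of ASCII letters are treated as keywords.
        result[language] = sorted(set(re.findall(r"[a-zA-Z]+", body)))
    return result


parsed = parse_keywords(SAMPLE_README)
print(parsed)
```

Because the pattern matches letters only, a keyword containing digits or underscores is trimmed or split: `_Bool` would be recorded as `Bool`, and `char8_t` as `char` and `t`. Any other alphabetic text in a section body is captured as well.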