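Dataset: Bingsu/laion2B-multi-korean-subset
Task: feature extraction
Language: ko
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: crowdsourced
Annotation creators: crowdsourced
License: cc-by-4.0

This dataset is a subset of laion/laion2B-multi that contains only the Korean data.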
CC-BY-4.0

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("Bingsu/laion2B-multi-korean-subset")
    >>> dataset
    DatasetDict({
        train: Dataset({
            features: ['SAMPLE_ID', 'URL', 'TEXT', 'HEIGHT', 'WIDTH', 'LICENSE', 'LANGUAGE', 'NSFW', 'similarity'],
            num_rows: 11376263
        })
    })

    >>> dataset["train"].features
    {'SAMPLE_ID': Value(dtype='int64', id=None),
     'URL': Value(dtype='string', id=None),
     'TEXT': Value(dtype='string', id=None),
     'HEIGHT': Value(dtype='int32', id=None),
     'WIDTH': Value(dtype='int32', id=None),
     'LICENSE': Value(dtype='string', id=None),
     'LANGUAGE': Value(dtype='string', id=None),
     'NSFW': Value(dtype='string', id=None),
     'similarity': Value(dtype='float32', id=None)}

Download size: 1.56 GiB. Generated dataset size: 2.37 GiB. Total: 3.93 GiB.

| | train |
|---|---|
| # of data | 11376263 |

Note: the HEIGHT and WIDTH columns are swapped. HEIGHT holds the image width and WIDTH holds the image height.
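Because the two columns are swapped, code that depends on image dimensions may want to normalise them first. The following is a minimal sketch of one way to do that; the helper name `fix_dimensions` and the output keys `width` and `height` are hypothetical and not part of the dataset.

```python
from typing import Any, Dict


def fix_dimensions(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with explicit `width` and `height` keys.

    Assumes the swap described in this card: the HEIGHT column stores the
    image width and the WIDTH column stores the image height.
    """
    fixed = dict(row)  # shallow copy so the input row is not modified
    fixed["width"] = row["HEIGHT"]
    fixed["height"] = row["WIDTH"]
    return fixed


# Dimension values taken from row 98 of the train split.
sample = {"SAMPLE_ID": 2937471001780, "HEIGHT": 640, "WIDTH": 321}
fixed = fix_dimensions(sample)
print(fixed["width"], fixed["height"])
```

With the `datasets` library, the same function could be applied to every row with `dataset["train"].map(fix_dimensions)`, which would add the two new columns alongside the original ones.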

    >>> dataset["train"][98]
    {'SAMPLE_ID': 2937471001780,
     'URL': 'https://image.ajunews.com/content/image/2019/04/12/20190412175643597949.png',
     'TEXT': '인천시교육청, 인천 시군구발전협의회 임원진과의 간담회 개최',
     'HEIGHT': 640,
     'WIDTH': 321,
     'LICENSE': '?',
     'LANGUAGE': 'ko',
     'NSFW': 'UNLIKELY',
     'similarity': 0.33347243070602417}


    # pip install zstandard
    import pandas as pd
    from huggingface_hub import hf_hub_url

    url = hf_hub_url("Bingsu/laion2B-multi-korean-subset", filename="laion2B-multi-korean-subset.csv.zst", repo_type="dataset")
    # url = "https://huggingface.co/datasets/Bingsu/laion2B-multi-korean-subset/resolve/main/laion2B-multi-korean-subset.csv.zst"
    df = pd.read_csv(url)

File size: 778 MB

    import csv
    import re

    from datasets import load_dataset
    from tqdm import tqdm

    # Matches any precomposed Hangul syllable.
    pattern = re.compile(r"[가-힣]")


    def quote(s: str) -> str:
        # Strip triple double quotes from the caption.
        s = s.replace('"""', "")
        return s


    def filter_func(example) -> bool:
        # Keep rows tagged as Korean or whose caption contains Hangul.
        lang = example.get("LANGUAGE")
        text = example.get("TEXT")
        if not isinstance(lang, str) or not isinstance(text, str):
            return False
        return lang == "ko" or pattern.search(text) is not None


    file = open("./laion2B-mulit_korean_subset.csv", "w", encoding="utf-8", newline="")

    # Stream the source dataset so it does not have to be downloaded in full first.
    ds = load_dataset("laion/laion2B-multi", split="train", streaming=True)
    dsf = ds.filter(filter_func)
    header = [
        "SAMPLE_ID",
        "URL",
        "TEXT",
        "HEIGHT",
        "WIDTH",
        "LICENSE",
        "LANGUAGE",
        "NSFW",
        "similarity",
    ]
    writer = csv.DictWriter(file, fieldnames=header)
    writer.writeheader()

    try:
        for data in tqdm(dsf):  # total=11378843
            data["TEXT"] = quote(data.get("TEXT", ""))
            # Skip rows whose caption is empty after cleaning.
            if data["TEXT"]:
                writer.writerow(data)
    finally:
        file.close()

    print("Done!")

The script took about 8 hours to run. Afterwards, rows with an empty height or width were removed, and the data was uploaded.
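The code used for this clean-up step is not given. One way to drop rows whose HEIGHT or WIDTH is empty could be sketched as follows, where `has_dimensions` is a hypothetical helper and the small in-memory CSV is made up to stand in for the real file.

```python
import csv
import io
from typing import Dict, Optional


def has_dimensions(row: Dict[str, Optional[str]]) -> bool:
    """True when both HEIGHT and WIDTH hold a non-empty value."""
    return all((row.get(key) or "").strip() for key in ("HEIGHT", "WIDTH"))


# Tiny made-up CSV used only for illustration; these are not real dataset rows.
raw = io.StringIO(
    "SAMPLE_ID,HEIGHT,WIDTH\n"
    "1,640,321\n"
    "2,,480\n"
    "3,512,\n"
)

# Keep only the rows where both dimensions are present.
kept = [row for row in csv.DictReader(raw) if has_dimensions(row)]
print([row["SAMPLE_ID"] for row in kept])
```

Filtering row by row like this keeps memory use flat, which matters for a file with roughly 11 million rows.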
img2dataset can be used to download the images from their URLs and convert them into a dataset format.
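As a rough sketch of how that might look with img2dataset's Python `download` function: the column names `URL` and `TEXT` come from this dataset, while the local paths, the image size, the output format, and the worker counts below are assumptions chosen for illustration, and the helper `build_download_kwargs` is hypothetical.

```python
from typing import Any, Dict


def build_download_kwargs(csv_path: str, output_folder: str) -> Dict[str, Any]:
    """Collect keyword arguments for img2dataset's `download` function."""
    return {
        "url_list": csv_path,            # CSV file containing the image URLs
        "input_format": "csv",
        "url_col": "URL",                # column names used by this dataset
        "caption_col": "TEXT",
        "output_format": "webdataset",   # assumed; tar shards of image/caption pairs
        "output_folder": output_folder,
        "image_size": 256,               # assumed target resolution
        "processes_count": 4,            # assumed; tune to the machine
        "thread_count": 32,              # assumed; tune to the network
    }


def main() -> None:
    # Imported here so the helper above works without img2dataset installed.
    from img2dataset import download  # pip install img2dataset

    # Hypothetical local paths: a decompressed copy of the CSV and an output directory.
    kwargs = build_download_kwargs("laion2B-multi-korean-subset.csv", "korean-images")
    download(**kwargs)
```

Calling `main()` starts the download. Many URLs in a web-scraped dataset are likely to have expired, so the number of images retrieved may be lower than the number of rows.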