Dataset:
py_ast
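This dataset contains the parsed ASTs used to train and evaluate the DeepSyn tool. The Python programs were collected from GitHub repositories and then filtered: duplicate files were removed, project forks (copies of another existing repository) were removed, and only programs that parse and whose AST has at most 30,000 nodes were kept. We also aimed to remove obfuscated files.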
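The "parses and has at most 30,000 nodes" filter can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline that was used to build the dataset: it relies on Python 3's standard `ast` module, whereas the example record in this card contains a `Print` node, which suggests the original programs were Python 2 and were parsed with a different tool, so node counts will not match exactly. The names `keep_program` and `MAX_NODES` are hypothetical.

```python
import ast

# Node limit quoted in the dataset description.
MAX_NODES = 30_000


def keep_program(source, max_nodes=MAX_NODES):
    """Return True if `source` parses and its AST has at most `max_nodes` nodes.

    Assumption: nodes are counted with Python 3's `ast.walk`, which also
    counts context nodes such as Load/Store. The dataset's own counting
    scheme may differ.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # The program does not parse, so it would be dropped.
        return False
    node_count = sum(1 for _ in ast.walk(tree))
    return node_count <= max_nodes


print(keep_program("x = 7\nprint(x + 1)"))  # a small, valid program
print(keep_program("def f(:"))              # a syntax error
```

Deduplication, fork removal, and obfuscation filtering are separate steps and are not covered by this sketch.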
Code representation, unsupervised learning
Python
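A typical data point contains the parsed AST of one Python program. The main key is ast, which stores the AST of each program. Each node carries the following information: type (the type of the node), children (lists the node's children; non-empty if the node has any), and value (the hardcoded value of the node if it has one, otherwise "N/A"). For example: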
[ {"type":"Module","children":[1,4]},{"type":"Assign","children":[2,3]},{"type":"NameStore","value":"x"},{"type":"Num","value":"7"}, {"type":"Print","children":[5]}, {"type":"BinOpAdd","children":[6,7]}, {"type":"NameLoad","value":"x"}, {"type":"Num","value":"1"} ]
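To make the flat layout easier to read, the example above can be rendered as an indented tree. This is a minimal sketch that assumes each entry in `children` is the position of a child node in the same list, which is how the example reads but is not stated explicitly in this card. It also treats a missing `value` key the same as `"N/A"`, since the example omits the key while the description mentions `"N/A"`. The helper name `render` is hypothetical.

```python
import json

# The example record from this card, as a JSON string.
EXAMPLE = """
[ {"type":"Module","children":[1,4]},
  {"type":"Assign","children":[2,3]},
  {"type":"NameStore","value":"x"},
  {"type":"Num","value":"7"},
  {"type":"Print","children":[5]},
  {"type":"BinOpAdd","children":[6,7]},
  {"type":"NameLoad","value":"x"},
  {"type":"Num","value":"1"} ]
"""


def render(nodes):
    """Return one indented line per node, in pre-order, starting at node 0.

    Uses an explicit stack instead of recursion so that deep trees
    (up to 30,000 nodes) do not hit Python's recursion limit.
    """
    lines = []
    stack = [(0, 0)]  # (node index, depth)
    while stack:
        index, depth = stack.pop()
        node = nodes[index]
        label = node["type"]
        value = node.get("value", "N/A")
        if value != "N/A":
            label += " = " + str(value)
        lines.append("  " * depth + label)
        # Push children in reverse so they are popped in their original order.
        for child in reversed(node.get("children", [])):
            stack.append((child, depth + 1))
    return lines


nodes = json.loads(EXAMPLE)
tree_lines = render(nodes)
print("\n".join(tree_lines))
```

Read this way, the record corresponds to the two-statement program `x = 7` followed by `print x+1`.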
The data is split into a training set and a test set. The final split sizes are as follows:
| | train | validation |
|---|---|---|
| py_ast examples | 100000 | 50000 |
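[More Information Needed]
[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]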
Raychev, V., Bielik, P., and Vechev, M.
MIT, BSD and Apache
@InProceedings{OOPSLA ’16, ACM,title = {Probabilistic Model for Code with Decision Trees.},authors={Raychev, V., Bielik, P., and Vechev, M.},year={2016}}
@inproceedings{10.1145/2983990.2984041, author = {Raychev, Veselin and Bielik, Pavol and Vechev, Martin}, title = {Probabilistic Model for Code with Decision Trees}, year = {2016}, isbn = {9781450344449}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2983990.2984041}, doi = {10.1145/2983990.2984041}, booktitle = {Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications}, pages = {731–747}, numpages = {17}, keywords = {Code Completion, Decision Trees, Probabilistic Models of Code}, location = {Amsterdam, Netherlands}, series = {OOPSLA 2016} }
Thanks to @reshinthadithyan for adding this dataset.