This dataset contains source code and deployed bytecode for Solidity Smart Contracts that have been verified on Etherscan.io, along with a classification of their vulnerabilities according to the Slither static analysis framework.
The annotations are in English, while all source code is in Solidity.
Each data instance contains the following features: address, source_code and bytecode. The label comes in two configurations: either a cleaned-up plain-text version of the output produced by the Slither tool, or a multi-label version, which consists of a list of integers, each representing a particular vulnerability class. Label 4 indicates that the contract is safe.
An example from a plain-text configuration looks as follows:
    {
      'address': '0x006699d34AA3013605d468d2755A2Fe59A16B12B',
      'source_code': 'pragma solidity 0.5.4; interface IERC20 { function balanceOf(address account) external ...',
      'bytecode': '0x608060405234801561001057600080fd5b5060043610610202576000357c0100000000000000000000000000000000000000000000000000000000900...',
      'slither': '{"success": true, "error": null, "results": {"detectors": [{"check": "divide-before-multiply", "impact": "Medium", "confidence": "Medium"}]}}'
    }
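As a minimal sketch of how the plain-text label might be consumed, the snippet below parses the `slither` string from the example above and lists the reported detector names. It assumes the field is always valid JSON with the same shape as in the example; the helper name `detector_checks` is hypothetical and not part of the dataset.

```python
import json
from typing import List


def detector_checks(slither_field: str) -> List[str]:
    """Return the detector names found in a plain-text `slither` field.

    Assumes the field is a JSON string shaped like the example record.
    """
    report = json.loads(slither_field)
    # Treat an unsuccessful analysis as having no usable findings.
    if not report.get("success"):
        return []
    results = report.get("results") or {}
    return [finding["check"] for finding in results.get("detectors", [])]


# The `slither` value copied from the example record.
example = (
    '{"success": true, "error": null, "results": {"detectors": '
    '[{"check": "divide-before-multiply", "impact": "Medium", '
    '"confidence": "Medium"}]}}'
)

print(detector_checks(example))  # ['divide-before-multiply']
```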
An example from a multi-label configuration looks as follows:
    {
      'address': '0x006699d34AA3013605d468d2755A2Fe59A16B12B',
      'source_code': 'pragma solidity 0.5.4; interface IERC20 { function balanceOf(address account) external ...',
      'bytecode': '0x608060405234801561001057600080fd5b5060043610610202576000357c0100000000000000000000000000000000000000000000000000000000900...',
      'slither': [ 4 ]
    }
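For training a multi-label classifier, the integer list is usually turned into a fixed-length multi-hot vector. The sketch below shows one way to do this. It assumes the nine labels mentioned in this description are the integers 0 through 8, which is not stated explicitly here; the names `to_multi_hot` and `is_safe` are hypothetical helpers.

```python
from typing import List

NUM_LABELS = 9  # assumption: labels are the integers 0..8
SAFE_LABEL = 4  # stated in this description: label 4 means "safe"


def to_multi_hot(labels: List[int], num_labels: int = NUM_LABELS) -> List[int]:
    """Convert a list of label ids into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for label in labels:
        if not 0 <= label < num_labels:
            raise ValueError(f"label {label} outside 0..{num_labels - 1}")
        vector[label] = 1
    return vector


def is_safe(labels: List[int]) -> bool:
    """True when the only label attached to the contract is the safe one."""
    return set(labels) == {SAFE_LABEL}


print(to_multi_hot([4]))  # [0, 0, 0, 0, 1, 0, 0, 0, 0]
print(is_safe([4]))       # True
```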
The dataset comes in six configurations. Train, test and validation splits are provided only for the configurations whose names do not include all-. The test and validation splits each account for about 15% of the total.
slither-audited-smart-contracts was built to provide a freely available, large-scale dataset for vulnerability detection and classification on verified Solidity smart contracts. At the time of writing, the largest open-source dataset for this task is SmartBugs Wild, which contains 47,398 smart contracts labeled with 9 tools within the SmartBugs framework.
The dataset was constructed starting from the list of verified smart contracts provided at Smart Contract Sanctuary. The source code of each smart contract was then either downloaded from that repository or downloaded via Etherscan and flattened using the Slither contract flattener. The bytecode was downloaded using the Web3.py library, in particular the web3.eth.getCode() function, with INFURA as the endpoint. Finally, every smart contract was analyzed using the Slither static analysis framework. The tool found 38 different vulnerability classes in the collected contracts, which were then mapped to 9 labels as shown in the file label_mappings.json. These mappings were derived by following the guidelines of the Decentralized Application Security Project (DASP) and of the Smart Contract Weakness Classification Registry. They were also inspired by the mappings used for Slither's detectors by the team that labeled the SmartBugs Wild dataset, which can be found here.
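The mapping step described above can be sketched as a lookup from Slither detector names to label integers. The mapping below is made up for illustration only; the real assignments live in label_mappings.json, whose contents are not reproduced here. The sketch also assumes that detectors missing from the mapping do not contribute a label and that a contract with no mapped finding receives the safe label 4.

```python
from typing import Dict, Iterable, List

SAFE_LABEL = 4  # stated in this description: label 4 means "safe"

# HYPOTHETICAL mapping for illustration; the integers are invented.
# The dataset's actual mapping is in label_mappings.json.
HYPOTHETICAL_MAPPING: Dict[str, int] = {
    "reentrancy-eth": 1,
    "tx-origin": 2,
    "suicidal": 0,
}


def labels_for(
    checks: Iterable[str],
    mapping: Dict[str, int],
    safe_label: int = SAFE_LABEL,
) -> List[int]:
    """Map detector names to a sorted list of unique label ids.

    Assumptions: unmapped detectors are skipped, and a contract with no
    mapped finding is labeled as safe.
    """
    labels = {mapping[check] for check in checks if check in mapping}
    return sorted(labels) if labels else [safe_label]


print(labels_for(["tx-origin", "reentrancy-eth", "tx-origin"], HYPOTHETICAL_MAPPING))
print(labels_for([], HYPOTHETICAL_MAPPING))
```

With the hypothetical mapping, the first call yields the two distinct labels in ascending order and the second falls back to the safe label.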
The dataset was initially created by Martina Rossini as part of the project for the course Blockchain and Cryptocurrencies at the University of Bologna (Italy).
The license in the file LICENSE applies to all the files in this repository, except for the Solidity source code of the contracts. These are still publicly available, were obtained using the Etherscan APIs, and retain their original licenses.
If you use this dataset in your research, you can cite it as follows:
    @misc{rossini2022slitherauditedcontracts,
      title = {Slither Audited Smart Contracts Dataset},
      author = {Martina Rossini},
      year = {2022}
    }
Thanks to @mwritescode for adding this dataset.