SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples, was announced by Sophos and ReversingLabs on Monday.
The database offers metadata, labels, and features for the files it contains, and allows interested parties to download the available malware samples for further analysis, with the aim of promoting security improvements across the industry.
The publicly available dataset is meant to help accelerate machine learning research for malware detection by providing a curated and labelled collection of samples and related metadata.
While machine learning models are built on data, the security sector lacks a standard, large-scale dataset that can easily be accessed by all types of users (from independent researchers to laboratories and corporations), which has so far slowed down development, Sophos argues.
It is both costly and difficult to obtain a large number of curated, labelled samples, and sharing datasets is also difficult because of intellectual property concerns and the risk of supplying unknown third parties with malicious software. As a result, most published malware detection papers rely on proprietary, internal datasets, with results that cannot be directly compared with one another, the company says.
The SoReL-20M dataset, a production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware, aims to fix the problem.
For each sample, the dataset contains features extracted based on the EMBER 2.0 dataset, along with labels and identification metadata. It also includes full binaries for the malware samples used.
In addition, PyTorch and LightGBM models that have already been trained on this data as baselines are provided, along with the scripts needed to load and iterate over the data, as well as to load, train, and test the models.
Because the malware being released has been disarmed, it will take knowledge, skill, and time to reconstitute and run the samples, Sophos says.
The company acknowledges that skilled attackers could benefit from these samples or use them to build attack methods, but maintains that “there are already many other sources that could be leveraged by attackers to gain access to malware data and samples that are simpler, faster and more cost-effective to use.”
The company also argues that the disarmed samples are more useful to security researchers trying to advance their independent defences.
The disabled malware samples have been in the wild for some time and are expected to call back to infrastructure that has since been dismantled. In addition, most antivirus vendors can already detect them, and detection is expected to improve with the metadata published alongside the samples.
“As an industry, we recognise that malware is not confined to Windows or even executable files, which is why further detail is still required by researchers and protection teams,” said ReversingLabs, which claims to provide a reputable database of more than 12 billion goodware and malware files.