
Vulnerabilities in Deep Learning File Formats


Neural networks are trained through a process called backpropagation. After the network is fed a sample input with a known correct output, an error is computed by comparing the network's prediction to the desired output (for example, by calculating their squared difference). A neural network can be thought of as a chain of simple operations, the individual neurons. Using the chain rule, we can compute how a small change to any individual neuron affects the final error. By adjusting the strength of each connection between neurons in the direction that reduces the error, the network learns to produce outputs closer to the correct answers when given similar inputs again.
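To make the chain-rule step concrete, here is a minimal sketch in plain Python. It assumes a "network" of one neuron, y = w * x + b, a squared-difference error, and a single training sample; the function name, learning rate, and sample values are illustrative choices, not something taken from a particular framework.

```python
# Train one neuron, y = w * x + b, on a single (input, target) pair.
def train_single_neuron(x, target, w=0.0, b=0.0, lr=0.1, steps=50):
    losses = []
    for _ in range(steps):
        y = w * x + b              # forward pass: the network's prediction
        error = y - target
        losses.append(error ** 2)  # squared-difference error

        # Chain rule:
        #   dL/dw = dL/dy * dy/dw = (2 * error) * x
        #   dL/db = dL/dy * dy/db = (2 * error) * 1
        grad_w = 2 * error * x
        grad_b = 2 * error

        # Move each parameter a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


w, b, losses = train_single_neuron(x=1.5, target=3.0)
print(f"w={w:.3f}, b={b:.3f}")
print(f"first loss={losses[0]:.3f}, last loss={losses[-1]:.3e}")
```

The two floats that come out of this loop, w and b, are exactly the kind of values that end up stored in a weights file.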

This process of incremental adjustment produces a set of floating-point numbers, referred to as the neural network's weights (and biases). Some frameworks, like PyTorch, offer ways to store the weights in a file separate from the one describing the architecture of the network; other formats always combine both sets of information into one file. Either way, every neural network has a set of weights.

Serialization, deserialization, and Python pickles

In many frameworks, PyTorch included, it is common to store these neural network weights in a serialized format. Python's built-in object serialization library is called pickle, and that is what we will build an example around today. Keep in mind, though, that a similar vulnerability exists in many other object serialization paradigms across many other languages, including the npz format in NumPy (.npz), RData in R (.rds, .rdata, .rda), the Julia serialization format (.jls, .jldata, .juliaserial), and the built-in serialization in Java (.ser).

Arbitrary Code Execution in Pickle files

What these formats have in common is that they allow arbitrary code in their respective language to be included in the serialized data, and that code is executed when the data is deserialized.
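For Python's pickle, the mechanism can be illustrated with a deliberately harmless sketch. The class name and the expression passed to eval are placeholders chosen for this example; a real attack would substitute a callable and arguments that do damage.

```python
import pickle


class HarmlessPayload:
    """Illustrative class: __reduce__ tells pickle how to rebuild the object."""

    def __reduce__(self):
        # pickle stores "call eval with this string" instead of object state.
        # The unpickler performs that call while loading.
        return (eval, ("6 * 7",))


blob = pickle.dumps(HarmlessPayload())

# No HarmlessPayload instance comes back; we get the result of the call
# that ran during deserialization.
result = pickle.loads(blob)
print(result)  # 42
```

The loader never asked for code to run. Simply calling pickle.loads on the bytes was enough, which is why loading a pickle from an untrusted source is equivalent to running a program from that source.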


Scanning Pickle files for Arbitrary Code Execution

If you just want to know whether it is safe to download and work with a pickle (.pkl) file, the consensus is that there is no fully reliable way to verify the safety of a pickle file without executing it. If you can't avoid downloading a pickle from an untrusted source, the main tool in your toolbelt will be fickling, which can scan pickle files for well-known dangerous patterns but cannot guarantee detection of more complex ones (such as tensor steganography).

# Basic scan
fickling pickle_file.pkl

# Scan and display trace
fickling --trace pickle_file.pkl
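To get a feel for what a static scan looks at, the sketch below uses only the standard library's pickletools module to walk a pickle's opcodes without loading it. This is not how fickling is implemented, and the set of "suspicious" opcodes is an assumption made for this illustration. Legitimate model files also use some of these opcodes to rebuild their objects, so a hit means "inspect further", not "malicious".

```python
import pickle
import pickletools

# Opcodes that import names or call/construct objects during loading.
# The selection is an assumption for this sketch, not an official list.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE",
    "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(data: bytes) -> list:
    """Return the names of flagged opcodes, in order, without unpickling."""
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in SUSPICIOUS_OPCODES
    ]


class Demo:
    """Illustrative payload that makes the unpickler call eval."""

    def __reduce__(self):
        return (eval, ("6 * 7",))


plain = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
crafted = pickle.dumps(Demo())

print("plain data:", suspicious_opcodes(plain))
print("crafted   :", suspicious_opcodes(crafted))
```

A pickle holding only numbers, strings, lists, and dicts produces an empty list here, while the crafted one shows the opcodes that import and call eval.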

We have covered pickle vulnerabilities before, but not in the context of deep learning. A similar vulnerability exists in .pt and .pth files, because their underlying implementation relies on the same pickle library mentioned above. fickling cannot be used directly to scan these files, though you may be able to unpack them and scan the pickle files they contain. However, because fickling cannot guarantee the safety of a pickle file on its own, it is advisable to test files of any of these types in a sandbox environment, at least whenever you cannot trust the source.
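Unpacking can be sketched with the standard library's zipfile module, assuming the checkpoint is one of the ZIP-based layouts. The function name and the in-memory demo archive (with its "model/" prefix) are made up for this example. The raw bytes are only read, never unpickled, so they can be handed to a scanner afterwards.

```python
import io
import pickle
import zipfile


def extract_pickles(archive) -> dict:
    """Map each .pkl member of a ZIP-based checkpoint to its raw bytes.

    `archive` may be a path or a binary file object. Nothing is unpickled.
    """
    if not zipfile.is_zipfile(archive):
        return {}  # not a ZIP container, e.g. an older non-ZIP checkpoint
    with zipfile.ZipFile(archive) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.endswith(".pkl")
        }


# Build a stand-in archive in memory so the sketch runs without PyTorch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("model/data.pkl", pickle.dumps({"weights": [0.1, 0.2]}))
    zf.writestr("model/version", "3\n")

found = extract_pickles(buffer)
for name, raw in found.items():
    print(name, len(raw), "bytes")
```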

| Underlying File Format (pth/pt) | Description | Uses Pickle | fickling support for inserting code |
| --- | --- | --- | --- |
| PyTorch v1.3 | ZIP file containing data.pkl (1 pickle file) | Yes | Yes |
| TorchScript v1.3 | ZIP file with data.pkl and constants.pkl | Yes | Yes |
| TorchScript v1.4 | ZIP file with data.pkl, constants.pkl, and version set at 2 or higher (2 pickle files and a folder) | Yes | Experimental |

A more secure approach with Safetensors

A more secure representation of neural network weights, and one that has been widely adopted, is safetensors. The safetensors format contains only raw tensor data and associated metadata.

The security benefit with the safetensors format is that it doesn’t allow serializing of arbitrary Python code, and the architecture of the neural network is defined separately.
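The sketch below shows why there is nothing to execute. It assumes the published safetensors layout: an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, then the raw bytes. The tiny file is hand-built for the example and the helper name is hypothetical; in practice you would use the safetensors library rather than parsing by hand.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Parse only the JSON header of a safetensors-style byte string."""
    (header_len,) = struct.unpack("<Q", data[:8])  # unsigned 64-bit, little-endian
    return json.loads(data[8:8 + header_len].decode("utf-8"))


# Hand-build a file with one float32 tensor named "w" holding two values.
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = struct.pack("<2f", 0.5, -1.25)
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes

# Reading it back needs nothing beyond JSON parsing and byte slicing.
parsed = read_safetensors_header(blob)
start, end = parsed["w"]["data_offsets"]
body = blob[8 + len(header_bytes):]
values = struct.unpack("<2f", body[start:end])

print(parsed)
print(values)
```

Unlike unpickling, no step here looks up a callable or invokes one: the header is data describing data.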


Formats like ONNX and TensorFlow's pb are also safer than pickles, because they do not serialize the weights in a format that can be exploited for arbitrary code execution. Each of these can still contain custom neural network layers, though, and the implementation of those layers could include arbitrary Python code. In contrast, the safetensors format contains only raw tensor data, so there are no custom neural network layers in the file itself.

| File Format | Associated Ecosystems | Can Contain Deserialization Code Execution Exploit(s) |
| --- | --- | --- |
| Arrow | Spark | No |
| dill | scikit-learn | Yes |
| HDF5 (h5) | Keras | Yes |
| Java serialization | Java language | Yes |
| joblib | scikit-learn | Yes |
| json | Multiple | No |
| Julia Serialization | Flux.jl | Yes |
| MOJO | H2O.ai | Yes |
| MsgPack | Flax | No |
| NumPy | NumPy | Yes |
| ONNX | Multiple | Yes |
| pickle | PyTorch, scikit-learn | Yes |
| POJO | H2O.ai | Yes |
| RDS | R language | Yes |
| SafeTensors | Multiple | No |
| SavedModel | TensorFlow | No |
| TFLite (FlatBuffers) | TensorFlow | No |
| TorchScript | PyTorch | Yes |

Snyk Open Source Security for Deep Learning Libraries

Standing up a secure service that relies on neural networks begins with managing neural network weights in a secure format. Next, one must consider the possibility of malicious code in custom-implemented layers within the architecture of the network. Finally, even if the file representation is understood to be safe and no malicious layers are present in the architecture, out-of-date versions of popular deep learning libraries can still have other vulnerabilities. Historical examples include the ONNX Directory Traversal vulnerability, the TorchServe ShellTorch vulnerability, and a number of TensorFlow vulnerabilities.

Snyk vulnerability database findings

You can find many relevant vulnerabilities in the Snyk Security database simply by searching for the name of a package, such as torchserve. This article, though, focuses on the secure representation of neural network weights themselves, particularly those affected by pickle. In the next article in this series, we will explore poisoning pickles with malicious code, with some hands-on examples.
