
I'm building an information system around domain-specific models.

There are two actors: a trainer and a data provider. Suppose the data provider is concerned that his data will be stolen by the trainer.

How can I let the trainer train on the entire dataset while preventing him from stealing it?

With the least restrictive data provider, I think giving one sample to the trainer is enough. Since neural networks are defined by tensor shapes, all the trainer needs to know is the input shape and the output shape.

Thus, slicing along the first dimension, a.k.a. the sample dimension (axis=0), I can allow the trainer to take only one sample; the trainer can train on that one sample before handing the deep learning architecture back to me to be trained on the entire dataset.
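A minimal sketch of that one-sample handover (the dataset, shapes, and variable names are hypothetical, just for illustration):

```python
import numpy as np

# Hypothetical dataset held by the data provider:
# 10,000 samples of 32x32 RGB images with 10-class one-hot labels.
X = np.random.rand(10_000, 32, 32, 3).astype(np.float32)
y = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, size=10_000)]

# Slice a single sample along axis=0 (the sample dimension) to share
# with the trainer. The shapes (1, 32, 32, 3) and (1, 10) are all the
# trainer sees; the remaining 9,999 samples stay with the provider.
x_sample, y_sample = X[:1], y[:1]
print(x_sample.shape, y_sample.shape)  # (1, 32, 32, 3) (1, 10)
```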

To summarize, I (the system) am just a middleman here, facilitating (via a user interface) between the trainer and the data provider. So, is there any clever way to solve this issue?

I'd accept a specific tech-stack solution, such as an AWS/GCP/Azure feature that handles this, but I'd prefer a generic approach like giving only one sample to the trainer.

1 Answer


There are several standard, generic approaches to your training-data privacy problem, such as Federated Learning, Split Neural Networks, Homomorphic Encryption, and Differential Privacy.
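To give a flavour of one of these, here is a minimal Differential Privacy sketch using the classic Laplace mechanism (the data, bounds, and epsilon value are made up for illustration): the provider answers aggregate queries with calibrated noise instead of releasing raw records.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private attribute held by the data provider.
incomes = rng.uniform(20_000, 120_000, size=1_000)

def dp_mean(values, lower, upper, epsilon):
    """Release a differentially private mean via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean of n values bounded in [lower, upper]:
    # changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print(dp_mean(incomes, 20_000, 120_000, epsilon=0.5))
```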

Your proposed idea of providing the trainer with a single sample so he can learn the tensor shapes and develop the architecture is straightforward. However, it only works for simple static architectures, not for iterative deep learning designs where the trainer usually has to perform hyperparameter optimization, typically via cross-validation on many training and validation samples. The data provider doesn't have the ability to do such hyperparameter optimization himself; otherwise he could do the turnkey training in the first place. Also, there are many deep learning cases where one sample's tensor shape cannot completely determine the learning architecture, such as multi-modal training and time series or text sequences of indefinite length (see the sketch below).
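As a concrete illustration of that last point (the token values below are made up), two text samples of different lengths have different tensor shapes, so a single sample cannot pin down choices like padding length, truncation policy, or whether a length-agnostic architecture such as an RNN is needed:

```python
import torch

# Two hypothetical tokenized text samples of different lengths.
sample_a = torch.tensor([12, 7, 99, 3])          # shape: (4,)
sample_b = torch.tensor([5, 42, 8, 61, 2, 17])   # shape: (6,)

# A trainer who only ever sees sample_a might hard-code a sequence
# length of 4, which breaks on sample_b. Padding length, truncation
# policy, and vocabulary size all depend on the whole dataset.
print(sample_a.shape, sample_b.shape)  # torch.Size([4]) torch.Size([6])
```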

From your description, it seems that PySyft's Split Neural Networks approach fits your need best.

There is a growing appetite for learning techniques to be applied to domains where data is traditionally sensitive or private, i.e. healthcare, operational logistics or finance... Traditionally, PySyft has been used to facilitate federated learning. However, we can also leverage the tools included in this framework to implement distributed neural networks. These allow researchers to process data held remotely and compute predictions in a radically decentralised way.

Here's another blog comparing Split NNs vs. Federated Learning.

In split learning, a deep neural network is split into multiple sections, each of which is trained on a different client. The data being trained on might reside on one supercomputing resource or might reside in the multiple clients taking part in the collaborative training. But none of the clients involved in training the deep neural network can “see” each other’s data. Techniques are applied on the data which encode data into a different space before transmitting it to train a deep neural network.
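For intuition, here's a minimal split-learning sketch in plain PyTorch (this is not the PySyft API; the layer sizes, batch size, and single-process setup are assumptions for illustration). The data provider runs the first segment locally, and only the cut-layer activations cross the trust boundary; the raw samples never do:

```python
import torch
import torch.nn as nn

# Segment held by the data provider: raw data never leaves this party.
provider_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU())

# Segment held by the trainer: sees only intermediate activations.
trainer_net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

opt = torch.optim.SGD(
    list(provider_net.parameters()) + list(trainer_net.parameters()), lr=0.1
)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a fake batch (shapes are assumptions).
x = torch.randn(32, 784)             # provider's private inputs
labels = torch.randint(0, 10, (32,))

smashed = provider_net(x)            # provider computes cut-layer activations
# --- in a real deployment, only `smashed` crosses the network ---
logits = trainer_net(smashed)        # trainer continues the forward pass
loss = loss_fn(logits, labels)

opt.zero_grad()
loss.backward()                      # gradients flow back across the cut layer
opt.step()
```

In a real two-party deployment, the "smashed" activations and the cut-layer gradients would be serialized over the network, and each party would run its own optimizer over its own segment; the single optimizer above just keeps the sketch self-contained.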
