How does Knowledge Distillation help Federated Learning?

Question

As I understand it, a typical FL setup has a global server that interacts with various client devices. The server and each client hold an ML model. The clients update their models locally and then send the weights to the server, where they are averaged.

The paper "Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data" contains the following passage: "To rectify this, each device in FD stores per-label mean logit vectors, and periodically uploads these local-average logit vectors to a server. For each label, the uploaded local-average logit vectors from all devices are averaged, resulting in a global-average logit vector per label."

I am really lost as to what one can do with the "mean logit vectors" of a label. To me, that sounds like saying that a dataset consists of 2 labels, with the first label coming up 40% of the time and the second 60%. How does this help with prediction? Perhaps my understanding is wrong here.

score 1 · Answer 1 · answered Oct 03 '23 at 11:00

The logit vectors in that paper are the outputs of the models, not the reference values (labels) in the dataset, so they are not label frequencies. The paper defines them as follows:

"Each model output is a set of logit values normalized via a softmax function, hereafter denoted as a logit vector"

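The average serves as a communication-efficient proxy for comparing "teacher" and "student" outputs: devices exchange one averaged output vector per label rather than full model weights, and each device can compare its own (student) outputs against the global average (teacher) for the corresponding label.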
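To make this concrete, here is a minimal NumPy sketch of the averaging described in the quoted passage. It is not the paper's implementation: the function names, the random stand-in "model outputs", and the cross-entropy distillation term are assumptions for illustration, and the sketch assumes every device holds at least one sample of every label.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def per_label_mean_logits(outputs, labels, num_labels):
    # For each ground-truth label k, average the model's output vectors
    # over the local samples whose label is k.
    # Assumption: every label occurs at least once on this device.
    return np.stack([outputs[labels == k].mean(axis=0) for k in range(num_labels)])

def global_mean_logits(local_tables):
    # Server side: average the per-label tables uploaded by all devices.
    return np.mean(local_tables, axis=0)

def distillation_loss(student_outputs, labels, teacher_table):
    # Hypothetical distillation term: cross-entropy between the teacher
    # vector for each sample's label and the student's own output.
    teacher = teacher_table[labels]
    return float(-(teacher * np.log(student_outputs + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
num_labels, num_devices, samples_per_device = 2, 3, 20

# Stand-in data: random pre-softmax scores play the role of each device's model.
labels = np.arange(samples_per_device) % num_labels
device_outputs = [
    softmax(rng.normal(size=(samples_per_device, num_labels)))
    for _ in range(num_devices)
]

local_tables = [per_label_mean_logits(o, labels, num_labels) for o in device_outputs]
global_table = global_mean_logits(local_tables)  # shape: (num_labels, num_labels)

# Device 0 uses the global table as the "teacher" for its own "student" outputs.
loss = distillation_loss(device_outputs[0], labels, global_table)
print(global_table.shape, loss)
```

Each row of `global_table` is a full output distribution (how the models, on average, respond to samples of that label), not a single label frequency. Each device uploads only `num_labels * num_labels` numbers, regardless of model size, which is where the communication saving comes from.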

