One of the key challenges of machine learning is the need for large amounts of data. Gathering training datasets for machine learning models poses privacy, security, and processing risks that organizations would rather avoid.
The traditional process for developing machine learning applications is to gather a large dataset, train a model on the data, and run the trained model on a cloud server that users can reach through different applications such as web search, translation, text generation, and image processing.
In many cases, sending data to the server is inevitable. For example, this paradigm is inevitable for content recommendation systems because part of the data and content needed for machine learning inference resides on the cloud server.
Image by bdtechtalks.com
But in applications such as text autocompletion or facial recognition, the data is local to the user and the device. In these cases, it would be preferable for the data to stay on the user’s device instead of being sent to the cloud.
On-device machine learning has several added benefits. These applications can continue to work even when the device is not connected to the internet. They also provide the benefit of saving bandwidth when users are on metered connections. And in many applications, on-device inference is more energy-efficient than sending data to the cloud.
On-device inference is an important privacy upgrade for machine learning applications. But one challenge remains: Developers still need data to train the models they will push on users’ devices. This doesn’t pose a problem when the organization developing the models already owns the data (e.g., a bank owns its transactions) or the data is public knowledge (e.g., Wikipedia or news articles).
But if a company wants to train machine learning models that involve confidential user information such as emails, chat logs, or personal photos, then collecting training data entails many challenges. The company will have to make sure its collection and storage policy is conformant with the various data protection regulations and is anonymized to remove personally identifiable information (PII).
This is the problem federated learning addresses.
Image by bdtechtalks.com
The main idea behind federated learning is to train a machine learning model on user data without the need to transfer that data to cloud servers.
In the next stage, several user devices volunteer to train the model. These devices hold user data that is relevant to the model’s application, such as chat logs and keystrokes.
These devices download the base model at a suitable time, for instance when they are on a wi-fi network and are connected to a power outlet (training is a compute-intensive operation and will drain the device’s battery if done at an improper time). Then they train the model on the device’s local data.
After training, they return the trained model to the server. Popular machine learning algorithms such as deep neural networks and support vector machines is that they are parametric. Once trained, they encode the statistical patterns of their data in numerical parameters and they no longer need the training data for inference. Therefore, when the device sends the trained model back to the server, it doesn’t contain raw user data.
Once the server receives the data from user devices, it updates the base model with the aggregate parameter values of user-trained models.
The federated learning cycle must be repeated several times before the model reaches the optimal level of accuracy that the developers desire. Once the final model is ready, it can be distributed to all users for on-device inference.
Federated learning does not apply to all machine learning applications. If the model is too large to run on user devices, then the developer will need to find other workarounds to preserve user privacy.
When the training data is on the user’s device, the data engineers have no way of evaluating the data and making sure it will be beneficial to the application. For this reason, federated learning must be limited to applications where the user data does not need preprocessing.
Another limit of federated machine learning is data labeling. Most machine learning models are supervised, which means they require training examples that are manually labeled by human annotators. For example, the ImageNet dataset is a crowdsourced repository that contains millions of images and their corresponding classes.
In federated learning, unless outcomes can be inferred from user interactions (e.g., predicting the next word the user is typing), the developers can’t expect users to go out of their way to label training data for the machine learning model. Federated learning is better suited for unsupervised learning applications such as language modeling.
While sending trained model parameters to the server is less privacy-sensitive than sending user data, it doesn’t mean that the model parameters are completely clean of private data.
One important remedy to the privacy concerns of federated learning is to discard the user-trained models after they are integrated into the central model. The cloud server doesn’t need to store individual models once it updates its base model.
Another measure that can help is to increase the pool of model trainers. For example, if a model needs to be trained on the data of 100 users, the engineers can increase their pool of trainers to 250 or 500 users. For each training iteration, the system will send the base model to 100 random users from the training pool. This way, the system doesn’t collect trained parameters from any single user constantly.
Finally, by adding a bit of noise to the trained parameters and using normalization techniques, developers can considerably reduce the model’s ability to memorize users’ data.
Federated learning is gaining popularity as it addresses some of the fundamental problems of modern artificial intelligence. Researchers are constantly looking for new ways to apply federated learning to new AI applications and overcome its limits. It will be interesting to see how the field evolves in the future.
© 2021 LeackStat.com
2025 © Leackstat. All rights reserved