Distributed Keras



Introduction

Distributed Keras (DK) is a distributed deep learning framework built on top of Apache Spark and Keras, with the goal of significantly reducing training time through distributed machine learning algorithms. We designed the framework so that a developer can implement a new distributed optimizer with ease, allowing them to focus on research and model development. This project was started in collaboration with the CMS experiment during my internship at CERN. CMS is building a deep learning model for the high level trigger in order to handle the data rates of LHC Run 3 and beyond (increased luminosity, but identical (13-14 TeV) collision energy). Furthermore, they would like to train their models faster using distributed algorithms (which would allow them to tune the models more frequently), and to train those models on their complete dataset, which is on the order of several TB.

Our distributed deep learning approach mainly follows the data-parallel approach proposed in Large Scale Distributed Deep Networks by Jeffrey Dean et al. [3]. In this approach, copies of a model are replicated over different "trainers". These trainers can be distributed over different computing nodes, but several trainers may also share the resources of a single machine. Furthermore, the dataset is partitioned in such a way that every replicated model is trained on a different partition of the complete dataset. This is shown in the figure below.
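
To make this concrete, the snippet below sketches one synchronous round of such data-parallel training in plain Keras, with the replicas merged by averaging their parameters. This is a minimal illustration rather than DK's actual optimizers (which work asynchronously against Spark); it assumes a compiled Keras model named model, a list partitions of (features, labels) arrays, and the clone_model helper from Keras 2.

          import numpy as np
          from keras.models import clone_model

          def train_replica(model, x_part, y_part):
              # Fit a fresh replica of the master model on one data partition.
              replica = clone_model(model)
              replica.set_weights(model.get_weights())
              replica.compile(optimizer='sgd', loss='categorical_crossentropy')
              replica.fit(x_part, y_part, epochs=1, verbose=0)
              return replica.get_weights()

          def average_weights(weight_sets):
              # Element-wise average of the corresponding parameter tensors.
              return [np.mean(tensors, axis=0) for tensors in zip(*weight_sets)]

          # One synchronous round: train every replica on its own partition, then merge.
          weight_sets = [train_replica(model, x, y) for (x, y) in partitions]
          model.set_weights(average_weights(weight_sets))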

In order to meet the CMS use case and to stay within the realm of Big Data processing, we have chosen to build this framework on top of Apache Spark. Not only is Spark becoming the industry standard for big data processing and analysis, it also provides several convenient utilities for building a complete machine learning pipeline. Furthermore, these utilities are themselves distributed, which speeds up the data (pre-)processing pipeline as well.
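
As an illustration of what such a pipeline can look like, the sketch below prepares a dataset with Spark's standard ML utilities before handing it to the trainers. This is a hedged example, not part of DK's API: the application name, file name, column names, and partition count are placeholders.

          from pyspark.sql import SparkSession
          from pyspark.ml.feature import VectorAssembler, StandardScaler

          # Placeholder application name; any running Spark instance will do.
          spark = SparkSession.builder.appName('dk-preprocessing').getOrCreate()

          # Hypothetical input: a CSV file with numeric feature columns f0..f2.
          df = spark.read.csv('dataset.csv', header=True, inferSchema=True)

          # Assemble the raw columns into a single feature vector (a distributed job).
          assembler = VectorAssembler(inputCols=['f0', 'f1', 'f2'], outputCol='features')
          df = assembler.transform(df)

          # Standardize the features; fitting the scaler is distributed as well.
          scaler = StandardScaler(inputCol='features', outputCol='features_scaled')
          df = scaler.fit(df).transform(df)

          # Repartition so that every trainer receives its own slice of the dataset.
          df = df.repartition(8)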

Installation

We also focused on a straightforward installation process. Depending on your personal preferences, there are two ways to install DK. In both cases we assume that an up-and-running instance of Apache Spark is already available.

Pip

You can use this method when you only require the framework. We are planning to publish DK on PyPI for future convenience; before that happens, however, we first need to document the code a little better so it can be understood by everyone.

          pip install git+https://github.com/JoeriHermans/dist-keras.git
          

Git

Using this approach, you will have access to all the examples.

          git clone https://github.com/JoeriHermans/dist-keras
          

However, in order to install any missing dependencies and to set up the DK modules, we need to run Pip.

          cd dist-keras
          pip install -e .
          

This command tells Pip to use the setup.py file in the root directory of the DK Git repository.

References

[1] Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693).

[2] Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training deep networks in Spark. arXiv preprint arXiv:1511.06051.

[3] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems (pp. 1223-1231).