Distributed deep learning
Training deep learning models can be significantly accelerated with distributed computing on GPUs. IBM Watson Machine Learning supports speeding up training through data parallelism with three types of distributed deep learning:
Restrictions:
- Distributed deep learning is currently in beta.
- The Lite plan of IBM Watson Machine Learning does not support distributed deep learning.
- TensorFlow is the only framework supported at this stage.
- Online deployment (scoring) is supported only for native TensorFlow.
TensorFlow
TensorFlow supports distributed training through a parameter server and worker approach. Think of this as a master-worker pattern: the workers are responsible for carrying out the work, and the parameter server (master) is responsible for sharing the learnings (the calculated weights) among the workers.
In our current approach, all nodes start as equals. Each node is assigned an ID, which you can read from the environment variable $LEARNER_ID, and a host name with the prefix $LEARNER_NAME_PREFIX, giving a full host name of $LEARNER_NAME_PREFIX-$LEARNER_ID. Similarly, you can find the total number of nodes in $NUM_LEARNERS. It is the user's responsibility to write code that designates some of these nodes as parameter servers (at least one is needed) and others as workers (at least one is needed). We provide a sample launcher script that shows one approach to extracting this information.
When a distributed learning run is started, these nodes come up according to the distributed TensorFlow setup. A gRPC server is started on each node on port 2222, and the command provided by the user as part of the manifest is executed. Refer to the example to see how the launcher script can be used to provide an appropriate task ID and job name to each node, and how different nodes can act as a worker or a parameter server depending on the learner ID.
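As a minimal sketch of the role assignment described above (this is not the provided launcher script; the convention of designating the lowest learner ID as the parameter server is an assumption for illustration):

```python
import os

def resolve_role(learner_id, num_learners, num_ps=1):
    """Designate the first num_ps learners as parameter servers and the
    rest as workers; return a (job_name, task_index) pair."""
    if learner_id <= num_ps:
        return "ps", learner_id - 1
    return "worker", learner_id - num_ps - 1

def build_cluster_spec(prefix, num_learners, num_ps=1):
    """Build a ClusterSpec-style dict of gRPC endpoints on port 2222,
    using the $LEARNER_NAME_PREFIX-$LEARNER_ID host name convention."""
    hosts = ["%s-%d:2222" % (prefix, i) for i in range(1, num_learners + 1)]
    return {"ps": hosts[:num_ps], "worker": hosts[num_ps:]}

if __name__ == "__main__":
    # These environment variables are set by the service on each node.
    learner_id = int(os.environ.get("LEARNER_ID", "1"))
    num_learners = int(os.environ.get("NUM_LEARNERS", "2"))
    prefix = os.environ.get("LEARNER_NAME_PREFIX", "learner")

    job_name, task_index = resolve_role(learner_id, num_learners)
    cluster = build_cluster_spec(prefix, num_learners)
    print(job_name, task_index, cluster)
```

The resulting dict could then be passed to tf.train.ClusterSpec and tf.train.Server to start each node in its designated role.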
Requirements
API
To run TensorFlow distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
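To illustrate how these settings fit together, here is a hedged sketch of a manifest fragment expressed as a Python dict; the exact field names and nesting are assumptions for illustration, and only the values come from the requirements above:

```python
# Hypothetical manifest fragment for a distributed TensorFlow run.
# Only the framework name/version, compute configuration name, and
# node count are taken from the documented requirements.
manifest = {
    "framework": {
        "name": "tensorflow",   # FRAMEWORK_NAME
        "version": "1.13",      # FRAMEWORK_VERSION
    },
    "compute_configuration": {
        "name": "k80x2",        # one of k80x2, k80x4, or v100x2
        "nodes": 2,             # must be higher than one
    },
}
```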
IBM Distributed Deep Learning Library for TensorFlow
The IBM Distributed Deep Learning (DDL) library for TensorFlow automatically distributes the computation across multiple nodes and multiple GPUs. Users start with single-node GPU training code and modify it with a few statements to activate DDL-based distribution and leverage multi-GPU training.
Requirements
User program must be written for single GPU training.
API
To run DDL distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow-ddl
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
User code
See here for more details on DDL and a step-by-step process for modifying user code to enable DDL-based training and scoring in WML.
Horovod
Similar to IBM DDL, Horovod takes an approach with no parameter server: workers talk among themselves and learn from each other. Horovod is installed and configured for use if you decide to use it. As a user, you do not need to perform any installation or run the underlying MPI commands to orchestrate the process. You can simply run your command, and we take care of setting up the underlying infrastructure and orchestration.
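The workers-learning-from-each-other idea can be illustrated with a toy gradient-averaging step. This is plain Python for illustration only, not Horovod's API; in an actual Horovod job, hvd.DistributedOptimizer performs the equivalent allreduce for you during training:

```python
def allreduce_average(per_worker_grads):
    """Toy stand-in for ring-allreduce: every worker ends up with the
    element-wise average of all workers' gradients, with no central
    parameter server involved."""
    num_workers = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    averaged = [s / num_workers for s in summed]
    # After the exchange, each worker holds the same averaged gradient.
    return [list(averaged) for _ in range(num_workers)]

if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]  # gradients from two workers
    print(allreduce_average(grads))   # both workers get [2.0, 3.0]
```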
Requirements
API
To run Horovod distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow-horovod
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
Usage tip: When invoking tf.Session(), you may see the following error in your training log:
with tf.Session() as sess:
File "/opt/anaconda/envs/wmlce/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/opt/anaconda/envs/wmlce/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 699, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
while setting up XLA_GPU_JIT device number 1
To resolve this, pass the config argument when invoking tf.Session(config=config). For example:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.Session(config=config) as sess:
    ...
Next Steps
- Create your own new training runs.
- Go in depth with the following developerWorks article: Introducing deep learning and long short-term memory networks.