This guide walks you through using MPI for training.
If you haven’t already done so, please follow the Getting Started Guide to deploy Kubeflow.
An alpha version of MPI support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.
Check that the MPI Job custom resource is installed:
kubectl get crd
The output should include mpijobs.kubeflow.org:
NAME                     AGE
...
mpijobs.kubeflow.org     4d
...
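Alternatively, you can query the CRD by name; the command fails with a "not found" error if the custom resource definition is missing:
kubectl get crd mpijobs.kubeflow.org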
If it is not included, you can add it as follows:
cd ${KSONNET_APP}
ks pkg install kubeflow/mpi-job
ks generate mpi-operator mpi-operator
ks apply ${ENVIRONMENT} -c mpi-operator
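Once applied, you can verify that the operator pod came up. The check below assumes the operator was deployed into the kubeflow namespace; adjust the namespace to match your ksonnet environment:
kubectl -n kubeflow get pods | grep mpi-operator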
You can create an MPI job by defining an MPIJob config file. See the TensorFlow benchmarks example config file; you may change it based on your requirements.
cat examples/tensorflow-benchmarks.yaml
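For reference, the example config is roughly of the following shape. This is a minimal sketch inferred from the sample status output shown later in this guide, so field names and values may differ from the actual example file:
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-16
spec:
  gpus: 16                      # number of GPUs requested for the job (see sample output below)
  template:
    spec:
      containers:
      - name: tensorflow-benchmarks
        image: mpioperator/tensorflow-benchmarks:latest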
Deploy the MPIJob resource to start training:
kubectl create -f examples/tensorflow-benchmarks.yaml
You should now be able to see the created pods matching the specified number of GPUs.
kubectl get pods -l mpi_job_name=tensorflow-benchmarks-16
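If you also want to see which nodes the pods were scheduled on, the same label selector works with wide output:
kubectl get pods -l mpi_job_name=tensorflow-benchmarks-16 -o wide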
Training should run for 100 steps and take a few minutes on a GPU cluster. You can inspect the logs to see the training progress.
PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks-16,mpi_role_type=launcher -o name)
kubectl logs -f ${PODNAME}
To monitor the job status, retrieve the MPIJob resource and inspect its status section:
kubectl get -o yaml mpijobs tensorflow-benchmarks-16
Here is sample output when the job has completed successfully:
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-14T19:48:44Z
  generation: 1
  name: tensorflow-benchmarks-16
  namespace: default
  resourceVersion: "7670207"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mpijobs/tensorflow-benchmarks-16
  uid: 0d24b791-9ffb-11e8-9b38-029ed2ab0d38
spec:
  gpus: 16
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
        resources: {}
status:
  launcherStatus: Succeeded
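If you only need the completion state, for example from a script, you can extract the launcherStatus field shown above with a JSONPath query; this is a convenience command and not part of the original example:
kubectl get mpijobs tensorflow-benchmarks-16 -o jsonpath='{.status.launcherStatus}'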