Chainer Training

This guide walks you through using Chainer for training on Kubeflow.

What is Chainer?

Chainer is a powerful, flexible and intuitive deep learning framework.

Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
Forward computation can include any Python control flow statement without losing the ability to backpropagate, which makes code intuitive and easy to debug.

ChainerMN is an additional package for Chainer, a flexible deep learning framework. ChainerMN enables multi-node distributed deep learning with the following features:

Scalable — it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,
Flexible — even dynamic neural networks can be trained in parallel thanks to Chainer’s flexibility, and
Easy — minimal changes to existing user code are required.

This blog post provides benchmark results using up to 128 GPUs.

Installing Chainer Operator

If you haven’t already done so, please follow the Getting Started Guide to deploy Kubeflow.

An alpha version of Chainer support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0.

Verify that Chainer support is included in your Kubeflow deployment

Check that the ChainerJob custom resource definition (CRD) is installed:

kubectl get crd

The output should include chainerjobs.kubeflow.org:

NAME                                       AGE
...
chainerjobs.kubeflow.org                   4d
...
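
If you want to script this check, the following is a minimal Python sketch that scans the table printed by `kubectl get crd` for the ChainerJob entry. The function names (`crd_names`, `has_chainer_crd`, `check_cluster`) are made up for illustration, and `check_cluster` assumes `kubectl` is on your PATH and pointed at the right cluster.

```python
import subprocess

CHAINER_CRD = "chainerjobs.kubeflow.org"


def crd_names(kubectl_output):
    """Return the NAME column from `kubectl get crd` table output."""
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and elided "..." rows.
        if not fields or fields[0] in ("NAME", "..."):
            continue
        names.append(fields[0])
    return names


def has_chainer_crd(kubectl_output):
    """True if the ChainerJob CRD appears in the listing."""
    return CHAINER_CRD in crd_names(kubectl_output)


def check_cluster():
    """Run `kubectl get crd` against the current context and check the result.

    Hypothetical helper: needs kubectl and a reachable cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "crd"], capture_output=True, text=True, check=True
    )
    return has_chainer_crd(result.stdout)
```

The parsing is kept separate from the `kubectl` call so it can be exercised on saved output without a cluster.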

If the CRD is not listed, you can add it as follows:

cd ${KSONNET_APP}
ks pkg install kubeflow/chainer-job
ks generate chainer-operator chainer-operator
ks apply ${ENVIRONMENT} -c chainer-operator

Creating a Chainer Job

You can create a Chainer Job by defining a ChainerJob config file. First, create a file named example-job-mn.yaml with the following contents:

apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    mpiConfig:
      slots: 1 
    activeDeadlineSeconds: 6000
    backoffLimit: 60
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      mpiConfig:
        slots: 1
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done

See examples/chainerjob-reference.yaml for the definition of each attribute. You may change the config file based on your requirements. As written, the example job runs distributed training across 3 nodes (1 master, 2 workers).
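
If you change the number of workers, the `-n` value passed to `mpiexec` in the master command most likely needs to change with it. The sketch below computes the expected process count from the spec, under the assumption that `mpiConfig.slots` is the number of MPI processes each pod runs. The function name, the plain-dict form of the spec, and the fallback of 1 for missing values are illustrative choices, not part of the operator.

```python
def total_mpi_processes(spec):
    """Sum slots over the master pod and every worker replica.

    Assumption: `mpiConfig.slots` is the number of MPI processes per pod.
    The fallback of 1 for a missing value is a guess made for this sketch.
    """
    def slots(section):
        return section.get("mpiConfig", {}).get("slots", 1)

    total = slots(spec["master"])  # there is exactly one master pod
    for worker_set in spec.get("workerSets", {}).values():
        total += worker_set.get("replicas", 1) * slots(worker_set)
    return total


# The relevant fields of example-job-mn.yaml, written as a plain dict.
example_spec = {
    "backend": "mpi",
    "master": {"mpiConfig": {"slots": 1}},
    "workerSets": {"ws0": {"replicas": 2, "mpiConfig": {"slots": 1}}},
}

print(total_mpi_processes(example_spec))  # 3, matching `mpiexec -n 3`
```

Under the same assumption, raising `replicas` to 4 would give 5 processes, so the command would become `mpiexec -n 5`.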

Deploy the ChainerJob resource to start training:

kubectl create -f example-job-mn.yaml

You should now see the pods that make up the Chainer job:

kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn

Training runs for only 2 epochs and should finish within a few minutes, even on a CPU-only cluster. Inspect the master pod's logs to follow training progress:

PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
kubectl logs -f ${PODNAME}
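
Both commands above select pods by the labels chainerjob.kubeflow.org/name and chainerjob.kubeflow.org/role. If you drive `kubectl` from a script, a small helper can build those selectors for any job name. This is only a sketch; `pod_selector` and `get_pods_command` are hypothetical names, and the only role value taken from this guide is `master`.

```python
LABEL_PREFIX = "chainerjob.kubeflow.org"


def pod_selector(job_name, role=None):
    """Build the -l selector for a ChainerJob's pods, optionally for one role."""
    parts = [f"{LABEL_PREFIX}/name={job_name}"]
    if role is not None:
        parts.append(f"{LABEL_PREFIX}/role={role}")
    return ",".join(parts)


def get_pods_command(job_name, role=None):
    """Return the kubectl argument list that prints matching pod names."""
    return ["kubectl", "get", "pods", "-l", pod_selector(job_name, role), "-o", "name"]


print(" ".join(get_pods_command("example-job-mn", role="master")))
```

The returned list can be passed directly to `subprocess.run` when a cluster is available.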

Monitoring a Chainer Job

kubectl get -o yaml chainerjobs example-job-mn

See the status section of the output to monitor the job. Here is sample output for a job that completed successfully:

apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
...
status:
  completionTime: 2018-09-01T16:42:35Z
  conditions:
  - lastProbeTime: 2018-09-01T16:42:35Z
    lastTransitionTime: 2018-09-01T16:42:35Z
    status: "True"
    type: Complete
  startTime: 2018-09-01T16:34:04Z
  succeeded: 1
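
To check the status from a script rather than by eye, the following sketch reads a status section like the one above. It assumes the status has already been loaded into a Python dict with timestamps as strings (as they would be after parsing `kubectl get -o json` output). The function names are illustrative, and treating a `Complete` condition with status `"True"` as success is inferred from the sample output rather than from a specification.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_complete(status):
    """True if any condition has type Complete and status "True"."""
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


def run_seconds(status):
    """Seconds elapsed between startTime and completionTime."""
    start = datetime.strptime(status["startTime"], TIME_FORMAT)
    end = datetime.strptime(status["completionTime"], TIME_FORMAT)
    return (end - start).total_seconds()


# The status section from the sample output, written as a dict.
sample_status = {
    "completionTime": "2018-09-01T16:42:35Z",
    "conditions": [
        {
            "lastProbeTime": "2018-09-01T16:42:35Z",
            "lastTransitionTime": "2018-09-01T16:42:35Z",
            "status": "True",
            "type": "Complete",
        }
    ],
    "startTime": "2018-09-01T16:34:04Z",
    "succeeded": 1,
}

print(is_complete(sample_status))  # True
print(run_seconds(sample_status))  # 511.0
```

For the sample above, the job ran for 511 seconds, about eight and a half minutes.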