Version v0.2 of the documentation is no longer actively maintained. The site that you are currently viewing is an archived snapshot. For up-to-date documentation, see the latest version.
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Instructions for optimizing and deploying Kubeflow on GKE.
Running Kubeflow on GKE comes with the following advantages:
Create an OAuth client ID that identifies IAP when it requests access to a user's email address to verify their identity.
Set up your OAuth consent screen:
kubeflow
<project>.cloud.goog
where <project> is your GCP project ID.
On the Credentials screen:
https://<hostname>/_gcp_gatekeeper/authenticate
<name>.endpoints.<project>.cloud.goog
After you enter the details, click Create.
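The hostname pattern and redirect URI above can be assembled mechanically. The following is a minimal sketch; `iap_hostname` and `iap_redirect_uri` are hypothetical helper names, not part of Kubeflow, and `kubeflow` and `my-project` are placeholder values.

```shell
# Hypothetical helpers (not part of Kubeflow) that assemble the values
# entered on the Credentials screen.

# Build <name>.endpoints.<project>.cloud.goog from a name and a project ID.
iap_hostname() {
  printf '%s.endpoints.%s.cloud.goog\n' "$1" "$2"
}

# Build the authorized redirect URI for that hostname.
iap_redirect_uri() {
  printf 'https://%s/_gcp_gatekeeper/authenticate\n' "$(iap_hostname "$1" "$2")"
}

# Placeholder values; substitute your own deployment name and project ID.
iap_redirect_uri kubeflow my-project
```

Generating the URI this way avoids typos when copying it into the OAuth client configuration.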
Create environment variables from the OAuth client ID and secret:
export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>
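Because the later steps rely on these variables, a quick pre-flight check can catch an empty value early. This is a minimal sketch; `require_vars` is a hypothetical helper, not something the deploy script provides.

```shell
# Hypothetical pre-flight check (not part of deploy.sh): return non-zero
# if any of the named environment variables is unset or empty.
require_vars() {
  missing=0
  for var in "$@"; do
    # Indirect lookup via eval keeps this POSIX-compatible.
    # Only pass trusted, literal variable names.
    eval "value=\${${var}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${var}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage:
# require_vars CLIENT_ID CLIENT_SECRET || echo "set the OAuth variables first"
```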
Run the following steps to deploy Kubeflow.
Run the deploy script to create GCP and K8s resources
export KUBEFLOW_VERSION=0.2.5
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/gke/deploy.sh | bash
Check resources deployed in namespace kubeflow
kubectl -n kubeflow get all
Kubeflow will be available at
https://<name>.endpoints.<Project>.cloud.goog/
Use kubectl proxy and kubectl port-forward to connect to services in the cluster.
The deployment script will create the following directories containing your configuration.
The setup process makes it easy to customize GCP or Kubeflow for your particular use case. Under the hood deploy.sh
This makes it easy to change your configuration by updating the config files and reapplying them.
Deployment Manager uses YAML files to define your GCP infrastructure. deploy.sh creates a copy of these files in ${DEPLOYMENT_NAME}_deployment_manager_config.
You can modify these files and then update your deployment.
CONFIG_FILE=${DEPLOYMENT_NAME}_deployment_manager_config/cluster-kubeflow.yaml
gcloud deployment-manager --project=${PROJECT} deployments update ${DEPLOYMENT_NAME} --config=${CONFIG_FILE}
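Before running the update, it can help to confirm that the copied config file actually exists. This is a minimal sketch, assuming the directory naming described above; `dm_config_file` is a hypothetical helper, not part of deploy.sh.

```shell
# Hypothetical helper (not part of deploy.sh): print the Deployment Manager
# config path for a deployment name, or fail if the file is missing.
dm_config_file() {
  config="${1}_deployment_manager_config/cluster-kubeflow.yaml"
  if [ ! -f "${config}" ]; then
    echo "config not found: ${config}" >&2
    return 1
  fi
  printf '%s\n' "${config}"
}

# Example usage (commented out because it needs gcloud and a real deployment):
# CONFIG_FILE=$(dm_config_file "${DEPLOYMENT_NAME}") &&
#   gcloud deployment-manager --project="${PROJECT}" deployments update \
#     "${DEPLOYMENT_NAME}" --config="${CONFIG_FILE}"
```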
Add GPU nodes to your cluster
To use VMs with more CPUs or RAM
To grant additional users IAM permissions to access Kubeflow
After making the changes you need to update your deployment.
For more information please refer to the deployment manager docs.
If you want to use your own domain instead of ${name}.endpoints.${project}.cloud.goog, follow these instructions.
Modify your ksonnet application to remove the cloud-endpoints component
cd ${DEPLOYMENT_NAME}_ks_app
ks delete default -c cloud-endpoints
ks component rm cloud-endpoints
Set the domain for your ingress to be the fully qualified domain name
ks param set iap-ingress hostname ${FQDN}
ks apply default -c iap-ingress
Get the address of the static IP that was created
IPNAME=${DEPLOYMENT_NAME}-ip
gcloud compute --project=${PROJECT} addresses describe --global ${IPNAME}
Use your DNS provider to map the fully qualified domain specified in the first step to the IP address reserved in GCP.
To delete your deployment and reclaim all resources
gcloud deployment-manager --project=${PROJECT} deployments delete ${DEPLOYMENT_NAME}
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
Here are some tips for troubleshooting IAP.
This section provides troubleshooting information for 404 (page not found) errors returned by the central dashboard, which is served at
https://${KUBEFLOW_FQDN}/
kubectl -n ${NAMESPACE} get pods -l service=envoy
NAME READY STATUS RESTARTS AGE
envoy-76774f8d5c-lx9bd 2/2 Running 2 4m
envoy-76774f8d5c-ngjnr 2/2 Running 2 4m
envoy-76774f8d5c-sg555 2/2 Running 2 4m
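To scan this kind of listing automatically, the table can be filtered for pods that are not fully ready. This is a minimal sketch, assuming the default column layout shown above (NAME, READY, STATUS, ...); `pods_not_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the name of every pod that is not Running or whose READY count
# (for example 1/2) is incomplete. The header row is skipped.
pods_not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Demonstration with the sample listing; all pods are healthy, so this
# prints nothing.
pods_not_ready <<'EOF'
NAME READY STATUS RESTARTS AGE
envoy-76774f8d5c-lx9bd 2/2 Running 2 4m
envoy-76774f8d5c-ngjnr 2/2 Running 2 4m
envoy-76774f8d5c-sg555 2/2 Running 2 4m
EOF
```

Against a live cluster, the same function could be fed from `kubectl -n ${NAMESPACE} get pods -l service=envoy | pods_not_ready`.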
https://${KUBEFLOW_FQDN}/whoami
https://${KUBEFLOW_FQDN}/tfjobs/ui
https://${KUBEFLOW_FQDN}/hub
Check that the centraldashboard is running
kubectl get pods -l app=centraldashboard
NAME READY STATUS RESTARTS AGE
centraldashboard-6665fc46cb-592br 1/1 Running 0 7h
Check that a service for the central dashboard exists
kubectl get service -o yaml centraldashboard
Check that an Ambassador route is properly defined
kubectl get service centraldashboard -o jsonpath='{.metadata.annotations.getambassador\.io/config}'
apiVersion: ambassador/v0
kind: Mapping
name: centralui-mapping
prefix: /
rewrite: /
service: centraldashboard.kubeflow,
Check the logs of Ambassador for errors. See if there are errors like the following indicating an error parsing the route.
If you are using the new Stackdriver Kubernetes monitoring, you can use the following filter in the Stackdriver console
resource.type="k8s_container"
resource.labels.location=${ZONE}
resource.labels.cluster_name=${CLUSTER}
metadata.userLabels.service="ambassador"
"could not parse YAML"
A 502 usually means traffic isn't reaching the envoy reverse proxy, which typically indicates that the load balancer doesn't consider any backends healthy.
In Cloud Console select Network Services -> Load Balancing
Read the ingress.kubernetes.io/url-map annotation on your ingress object
URLMAP=$(kubectl --namespace=${NAMESPACE} get ingress envoy-ingress -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/url-map}')
echo ${URLMAP}
This will show you the backend services associated with the load balancer
NODE_PORT=$(kubectl --namespace=${NAMESPACE} get svc envoy -o jsonpath='{.spec.ports[0].nodePort}')
BACKEND_NAME=$(gcloud compute --project=${PROJECT} backend-services list --filter=name~k8s-be-${NODE_PORT}- --format='value(name)')
gcloud compute --project=${PROJECT} backend-services get-health --global ${BACKEND_NAME}
Make sure the load balancer reports the backends as healthy
Check that health checks are properly configured
Check firewall rules to ensure traffic from the GCP load balancer isn't blocked
gcloud compute firewall-rules create $NAME \
--project $PROJECT \
--allow tcp:$PORT \
--target-tags $NODE_TAG \
--source-ranges 130.211.0.0/22,35.191.0.0/16
# From the GKE cluster get the name of the managed instance group
gcloud --project=$PROJECT container clusters --zone=$ZONE describe $CLUSTER
# Get the template associated with the MIG
gcloud --project=${PROJECT} compute instance-groups managed describe --zone=${ZONE} ${MIG_NAME}
# Get the instance tags from the template
gcloud --project=${PROJECT} compute instance-templates describe ${TEMPLATE_NAME}
For more info see GCP HTTP health check docs
In Stackdriver Logging, look at the Cloud HTTP Load Balancer logs
Logs are labeled with the forwarding rule
The forwarding rules are available via the annotations on the ingress
ingress.kubernetes.io/forwarding-rule
ingress.kubernetes.io/https-forwarding-rule
Verify that requests are being properly routed within the cluster
Connect to one of the envoy proxies
kubectl exec -ti `kubectl get pods --selector=service=envoy -o jsonpath='{.items[0].metadata.name}'` /bin/bash
Install curl in the pod
apt-get update && apt-get install -y curl
Verify access to the whoami app
curl -L -s -i http://envoy:8080/noiap/whoami
If this doesn't return a 200 OK response, then there is a problem with the K8s resources
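When redirects are followed, the output of `curl -L -s -i` contains one status line per hop, so the final status can be easy to miss. This is a minimal sketch for extracting it; `final_http_status` is a hypothetical helper, and it assumes the response body contains no line starting with `HTTP/`.

```shell
# Hypothetical helper: read `curl -L -s -i` output on stdin and print the
# status code of the last response in the redirect chain.
final_http_status() {
  awk '{ gsub(/\r/, "") }          # drop carriage returns from header lines
       /^HTTP\// { code = $2 }     # remember the most recent status code
       END { print code }'
}

# Example usage inside the envoy pod (commented out; needs cluster access):
# curl -L -s -i http://envoy:8080/noiap/whoami | final_http_status
```

curl can also report the code directly with `-o /dev/null -w '%{http_code}'`, which avoids parsing the headers at all.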