The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Instructions for optimizing and deploying Kubeflow on GKE.
Running Kubeflow on GKE comes with the advantage of tight integration with GCP services used throughout this guide, such as Deployment Manager, Cloud IAP, and Stackdriver.
Create an OAuth client ID to be used to identify IAP when requesting access to a user's email address to verify their identity.
Set up your OAuth consent screen:
Configure the consent screen.
Under Email address, select the address that you want to display as a public contact. You must use either your email address or a Google Group that you own.
In the Product name box, enter a suitable name, for example kubeflow.
Under Authorized domains, enter
<project>.cloud.goog
where <project> is your GCP project ID.
Click Save.
On the Credentials screen:
Click Create credentials, and then click OAuth client ID.
Under Application type, select Web application.
In the Name box enter any name.
In the Authorized redirect URIs box, enter
https://<hostname>/_gcp_gatekeeper/authenticate
<hostname> will be used later for iap-ingress and should be in the format
<name>.endpoints.<project>.cloud.goog
where <name> and <project> will be set in the next step, when you run deploy.sh.
After you enter the details, click Create.
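For example, with the hypothetical values <name>=kubeflow and <project>=my-project, the hostname would be kubeflow.endpoints.my-project.cloud.goog and the authorized redirect URI would be
https://kubeflow.endpoints.my-project.cloud.goog/_gcp_gatekeeper/authenticate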
Create environment variables from the OAuth client ID and secret:
export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>
Follow these steps to deploy Kubeflow.
Run the deploy script to create the GCP and K8s resources:
export KUBEFLOW_VERSION=0.2.5
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/gke/deploy.sh | bash
Check the resources deployed in the kubeflow namespace:
kubectl -n kubeflow get all
Kubeflow will be available at
https://<name>.endpoints.<project>.cloud.goog/
You can also use kubectl proxy and kubectl port-forward to connect to services in the cluster (see the sketch below). The deployment script creates local directories containing your configuration, described in the sections below.
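As referenced above, here is a minimal sketch of port-forwarding to one of the Envoy pods; it assumes the Envoy pods carry the service=envoy label and listen on port 8080, as shown in the troubleshooting section later in this guide:
```
# Forward a local port to one of the Envoy pods (label and port taken from the
# troubleshooting section below; adjust if your deployment differs).
NAMESPACE=kubeflow
POD=$(kubectl -n ${NAMESPACE} get pods -l service=envoy -o jsonpath='{.items[0].metadata.name}')
kubectl -n ${NAMESPACE} port-forward ${POD} 8080:8080
# e.g. http://localhost:8080/noiap/whoami should then respond locally
```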
The setup process makes it easy to customize GCP or Kubeflow for your particular use case. Under the hood, deploy.sh uses Deployment Manager configs to create the GCP resources and a ksonnet application to create the Kubeflow resources, and it saves a local copy of each.
This makes it easy to change your configuration by updating the config files and reapplying them.
Deployment Manager uses YAML files to define your GCP infrastructure. deploy.sh creates a copy of these files in ${DEPLOYMENT_NAME}_deployment_manager_config.
You can modify these files and then update your deployment.
CONFIG_FILE=${DEPLOYMENT_NAME}_deployment_manager_config/cluster-kubeflow.yaml
gcloud deployment-manager --project=${PROJECT} deployments update ${DEPLOYMENT_NAME} --config=${CONFIG_FILE}
For example, you might want to:
Add GPU nodes to your cluster
Use VMs with more CPUs or RAM
Grant additional users IAM permissions to access Kubeflow (see the example below)
After making the changes, you need to update your deployment.
For more information, please refer to the Deployment Manager docs.
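As a sketch of the last item above, one way to grant an additional user access to Kubeflow behind IAP, as an alternative to editing the Deployment Manager config, is to bind the IAP-secured Web App User role directly; the email address below is hypothetical:
```
# Grant the IAP-secured Web App User role to a user (hypothetical address).
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=user:alice@example.com \
  --role=roles/iap.httpsResourceAccessor
```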
If you want to use your own domain instead of ${name}.endpoints.${project}.cloud.goog, follow these instructions.
Modify your ksonnet application to remove the cloud-endpoints component:
cd ${DEPLOYMENT_NAME}_ks_app
ks delete default -c cloud-endpoints
ks component rm cloud-endpoints
Set the domain for your ingress to be the fully qualified domain name
ks param set iap-ingress hostname ${FQDN}
ks apply default -c iap-ingress
Get the address of the static IP that was created:
IPNAME=${DEPLOYMENT_NAME}-ip
gcloud --project=${PROJECT} addresses describe --global ${IPNAME}
Use your DNS provider to map the fully qualified domain specified in the first step to the IP address reserved in GCP.
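For example, if your DNS zone is hosted in Cloud DNS, a sketch of creating the A record looks like the following; the managed zone name my-zone is hypothetical, and other DNS providers have their own tooling:
```
# Look up the reserved address and create an A record for ${FQDN} (zone name is hypothetical).
IP=$(gcloud --project=${PROJECT} addresses describe --global ${IPNAME} --format='value(address)')
gcloud dns record-sets transaction start --zone=my-zone --project=${PROJECT}
gcloud dns record-sets transaction add ${IP} --name=${FQDN}. --ttl=300 --type=A --zone=my-zone --project=${PROJECT}
gcloud dns record-sets transaction execute --zone=my-zone --project=${PROJECT}
```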
To delete your deployment and reclaim all resources
gcloud deployment-manager --project=${PROJECT} deployments delete ${DEPLOYMENT_NAME}
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
Here are some tips for troubleshooting IAP.
This section provides troubleshooting information for 404 (page not found) errors returned by the central dashboard, which is served at
https://${KUBEFLOW_FQDN}/
Since you were able to sign in, the Ambassador reverse proxy is up and healthy. You can confirm this by running the following command:
kubectl -n ${NAMESPACE} get pods -l service=envoy
NAME READY STATUS RESTARTS AGE
envoy-76774f8d5c-lx9bd 2/2 Running 2 4m
envoy-76774f8d5c-ngjnr 2/2 Running 2 4m
envoy-76774f8d5c-sg555 2/2 Running 2 4m
Try other services to see whether they're accessible, e.g.
https://${KUBEFLOW_FQDN}/whoami
https://${KUBEFLOW_FQDN}/tfjobs/ui
https://${KUBEFLOW_FQDN}/hub
If other services are accessible, then the problem is specific to the central dashboard and not the ingress.
Check that the centraldashboard is running
kubectl get pods -l app=centraldashboard
NAME READY STATUS RESTARTS AGE
centraldashboard-6665fc46cb-592br 1/1 Running 0 7h
Check that a service for the central dashboard exists:
kubectl get service -o yaml centraldashboard
Check that an Ambassador route is properly defined
kubectl get service centraldashboard -o jsonpath='{.metadata.annotations.getambassador\.io/config}'
apiVersion: ambassador/v0
kind: Mapping
name: centralui-mapping
prefix: /
rewrite: /
service: centraldashboard.kubeflow
Check the logs of Ambassador for errors. See if there are errors like the following, indicating a problem parsing the route:
"could not parse YAML"
If you are using the new Stackdriver Kubernetes monitoring, you can use the following filter in the Stackdriver console:
resource.type="k8s_container"
resource.labels.location=${ZONE}
resource.labels.cluster_name=${CLUSTER}
metadata.userLabels.service="ambassador"
A 502 usually means traffic isn't even making it to the Envoy reverse proxy, and it usually indicates that the load balancer doesn't consider any of the backends healthy.
In Cloud Console select Network Services -> Load Balancing
Click on the load balancer (the name should contain the name of the ingress)
The exact name can be found in the ingress.kubernetes.io/url-map annotation on your ingress object:
URLMAP=$(kubectl --namespace=${NAMESPACE} get ingress envoy-ingress -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/url-map}')
echo ${URLMAP}
Click on your load balancer.
This will show you the backend services associated with the load balancer
There is one backend service for each K8s service the ingress rule routes traffic to.
The named port will correspond to the NodePort the service is using.
NODE_PORT=$(kubectl --namespace=${NAMESPACE} get svc envoy -o jsonpath='{.spec.ports[0].nodePort}')
BACKEND_NAME=$(gcloud compute --project=${PROJECT} backend-services list --filter=name~k8s-be-${NODE_PORT}- --format='value(name)')
gcloud compute --project=${PROJECT} backend-services get-health --global ${BACKEND_NAME}
Make sure the load balancer reports the backends as healthy.
If the backends aren't reported as healthy, check that the pods associated with the K8s service are up and running.
Check that the health checks are properly configured (see the example further below).
Check the firewall rules to ensure traffic isn't blocked from the GCP load balancer.
The firewall rule should be added automatically by the ingress, but it's possible it was deleted if you have some automatic firewall policy enforcement. You can recreate the firewall rule, if needed, with a rule like this:
gcloud compute firewall-rules create $NAME \
--project $PROJECT \
--allow tcp:$PORT \
--target-tags $NODE_TAG \
--source-ranges 130.211.0.0/22,35.191.0.0/16
To get the node tag
# From the GKE cluster get the name of the managed instance group
gcloud --project=$PROJECT container clusters --zone=$ZONE describe $CLUSTER
# Get the template associated with the MIG
gcloud --project=$PROJECT compute instance-groups managed describe --zone=${ZONE} ${MIG_NAME}
# Get the instance tags from the template
gcloud --project=$PROJECT compute instance-templates describe ${TEMPLATE_NAME}
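Alternatively, here is a sketch that pulls the tags straight out of the template with a --format expression, assuming ${TEMPLATE_NAME} is the template found via the previous command:
```
# Print just the instance tags from the template.
gcloud --project=$PROJECT compute instance-templates describe ${TEMPLATE_NAME} \
  --format='value(properties.tags.items)'
```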
For more information, see the GCP HTTP health check docs.
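As a quick way to inspect what the ingress created, you can list the health checks and firewall rules matching the usual k8s- naming; this is a sketch based on that naming assumption, and on some GKE versions the health check may appear under http-health-checks instead:
```
# List the health check for the backend's NodePort and the ingress firewall rules.
gcloud compute --project=${PROJECT} health-checks list --filter="name~k8s-be-${NODE_PORT}-"
gcloud compute --project=${PROJECT} firewall-rules list --filter="name~k8s-fw-"
```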
In Stackdriver Logging, look at the Cloud HTTP Load Balancer logs. The names of the relevant forwarding rules can be found in the following annotations on your ingress object:
ingress.kubernetes.io/forwarding-rule
ingress.kubernetes.io/https-forwarding-rule
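Mirroring the url-map lookup above, you can read these annotations off the ingress object; this assumes the ingress is named envoy-ingress, as earlier in this section:
```
# Print the names of the HTTP and HTTPS forwarding rules from the ingress annotations.
kubectl --namespace=${NAMESPACE} get ingress envoy-ingress \
  -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/forwarding-rule}'
kubectl --namespace=${NAMESPACE} get ingress envoy-ingress \
  -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/https-forwarding-rule}'
```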
Verify that requests are being properly routed within the cluster
Connect to one of the envoy proxies
```
kubectl exec -ti `kubectl get pods --selector=service=envoy -o jsonpath='{.items[0].metadata.name}'` /bin/bash
```
Install curl in the pod:
apt-get update && apt-get install -y curl
curl -L -s -i http://envoy:8080/noiap/whoami