Hyperparameter Tuning (Katib)
The Katib project is inspired by Google Vizier. Katib is a scalable and flexible hyperparameter tuning framework that is tightly integrated with Kubernetes. It does not depend on any specific deep learning framework (such as TensorFlow, MXNet, or PyTorch).
Installing Katib
To run Katib jobs, you must install the required packages as shown in this section.
In your ksonnet application’s root directory, run the following commands:
export KF_ENV=default
ks env set ${KF_ENV} --namespace=kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
The KF_ENV environment variable represents a conceptual deployment environment such as development, test, staging, or production, as defined by ksonnet. This example uses the default environment.
You can read more about Kubeflow’s use of ksonnet in the Kubeflow ksonnet component guide.
TFJob (tf-operator)
To install a TensorFlow job operator, run the following commands:
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/common
ks generate tf-job-operator tf-job-operator
ks apply ${KF_ENV} -c tf-job-operator
PyTorch operator
To install a PyTorch job operator, run the following commands:
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${KF_ENV} -c pytorch-operator
Katib
Then run the following commands to install Katib:
ks pkg install kubeflow/katib
ks generate katib katib
ks apply ${KF_ENV} -c katib
If you want to use Katib outside Google Kubernetes Engine (GKE) and your cluster has no StorageClass for dynamic volume provisioning, you must create a persistent volume (PV) that your persistent volume claim (PVC) can bind to.
This is the YAML file for a PV:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
After deploying the Katib package, run the following command to create the PV:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
Running examples
After deploying everything, you can run some examples.
Example using random algorithm
You can create a StudyJob for Katib by defining a StudyJob config file. See the random algorithm example.
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/random-example.yaml
Running this command launches a StudyJob. The StudyJob runs a series of training jobs that train models with different hyperparameters, and it saves the results.
The configuration of the study (hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on) is defined in random-example.yaml.
In this demo, hyperparameters are embedded as args.
You can embed hyperparameters in another way (for example, as environment variables) by using the template defined in WorkerSpec.GoTemplate.RawTemplate.
The template is written in Go template format.
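On the worker side, hyperparameters embedded as args arrive as ordinary command-line flags such as --lr, --num-layers, and --optimizer. The following is a minimal sketch of how a training script could read them, with an environment variable fallback for the alternative embedding. It is not the code of the demo's train_mnist.py; the function name parse_hyperparameters, the default values, and the environment variable names (LR, NUM_LAYERS, OPTIMIZER) are assumptions made for illustration.

```python
import argparse
import os


def parse_hyperparameters(argv=None, env=None):
    """Read hyperparameters from flags, falling back to environment variables.

    The environment variable names (LR, NUM_LAYERS, OPTIMIZER) and the
    defaults are hypothetical; they are not defined by Katib.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical training worker")
    parser.add_argument("--lr", type=float, default=float(env.get("LR", "0.01")))
    parser.add_argument("--num-layers", type=int, default=int(env.get("NUM_LAYERS", "2")))
    parser.add_argument(
        "--optimizer",
        choices=["sgd", "adam", "ftrl"],
        default=env.get("OPTIMIZER", "sgd"),
    )
    parser.add_argument("--batch-size", type=int, default=64)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Flags in the "name=value" form that the worker template appends.
    args = parse_hyperparameters(
        ["--lr=0.02", "--num-layers=3", "--optimizer=adam"], env={}
    )
    print(args.lr, args.num_layers, args.optimizer)
```

Explicit flags take precedence over the environment in this sketch, so the same script would work with either embedding style.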
This demo randomly generates three hyperparameters:
- Learning rate (--lr) - type: double
- Number of NN layers (--num-layers) - type: int
- Optimizer (--optimizer) - type: categorical
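To make the idea of random search concrete, the sketch below draws one value per parameter from the feasible space of this example and renders each as a name=value argument, the form that the worker template appends to the training command. It is an illustration only, not Katib's suggestion service. The names FEASIBLE_SPACE, sample_trial, and to_args are hypothetical, and the ranges are copied from the random-example spec.

```python
import random

# Feasible space copied from the random-example StudyJob spec.
FEASIBLE_SPACE = [
    {"name": "--lr", "type": "double", "min": 0.01, "max": 0.03},
    {"name": "--num-layers", "type": "int", "min": 2, "max": 5},
    {"name": "--optimizer", "type": "categorical", "list": ["sgd", "adam", "ftrl"]},
]


def sample_trial(space, rng):
    """Draw one value per parameter, uniformly at random within its range."""
    trial = {}
    for param in space:
        if param["type"] == "double":
            trial[param["name"]] = rng.uniform(param["min"], param["max"])
        elif param["type"] == "int":
            # randint includes both bounds.
            trial[param["name"]] = rng.randint(param["min"], param["max"])
        elif param["type"] == "categorical":
            trial[param["name"]] = rng.choice(param["list"])
        else:
            raise ValueError(f"unsupported parameter type: {param['type']}")
    return trial


def to_args(trial):
    """Render a trial as "name=value" strings, one per hyperparameter."""
    return [f"{name}={value}" for name, value in trial.items()]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same trials
    for _ in range(3):  # the example spec requests three suggestions
        print(to_args(sample_trial(FEASIBLE_SPACE, rng)))
```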
Check the study status:
$ kubectl -n kubeflow describe studyjobs random-example
Name: random-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: <none>
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
  Creation Timestamp: 2019-01-18T16:30:46Z
  Finalizers:
    clean-studyjob-data
  Generation: 5
  Resource Version: 1777650
  Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
  UID: 687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Metricsnames:
    accuracy
  Objectivevaluename: Validation-accuracy
  Optimizationgoal: 0.88
  Optimizationtype: maximize
  Owner: crd
  Parameterconfigs:
    Feasible:
      Max: 0.03
      Min: 0.01
    Name: --lr
    Parametertype: double
    Feasible:
      Max: 5
      Min: 2
    Name: --num-layers
    Parametertype: int
    Feasible:
      List:
        sgd
        adam
        ftrl
    Name: --optimizer
    Parametertype: categorical
  Requestcount: 4
  Study Name: random-example
  Suggestion Spec:
    Request Number: 3
    Suggestion Algorithm: random
    Suggestion Parameters:
      Name: SuggestionCount
      Value: 0
  Worker Spec:
    Go Template:
      Raw Template: apiVersion: batch/v1
kind: Job
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: {{.WorkerID}}
        image: katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Status:
  Condition: Running
  Early Stopping Parameter Id:
  Last Reconcile Time: 2019-01-18T16:30:46Z
  Start Time: 2019-01-18T16:30:46Z
  Studyid: y456536bd1e0ad5e
  Suggestion Count: 1
  Suggestion Parameter Id: i31c2adcab54f891
  Trials:
    Trialid: ka897d189e024460
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: ma76ebe2b23fec02
    Trialid: v9ec0edbb16befd7
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: yc5053df337dbeec
    Trialid: be68860be22cfce3
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: v095e6b93d87e9eb
Events: <none>
The demo should start a study and run three jobs with different parameters.
When Status.Condition changes to Completed, the StudyJob is finished.
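If you would rather wait for completion from a script than rerun the describe command, you could poll the resource. The sketch below shells out to kubectl and reads the condition from the JSON output. It assumes the condition appears at status.condition in the JSON form of the StudyJob, which is inferred from the describe output and not verified against the CRD. The function names, the 30-second interval, and the poll limit are hypothetical.

```python
import json
import subprocess
import time


def fetch_condition(name, namespace="kubeflow"):
    """Return the StudyJob condition by calling kubectl (needs cluster access)."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "studyjobs", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # Assumption: the condition is stored at status.condition.
    return json.loads(out).get("status", {}).get("condition", "")


def wait_for_completion(name, fetch=fetch_condition, interval=30.0,
                        max_polls=120, sleep=time.sleep):
    """Poll until the condition is Completed; return False if polls run out."""
    for attempt in range(max_polls):
        if fetch(name) == "Completed":
            return True
        if attempt < max_polls - 1:
            sleep(interval)
    return False
```

Calling wait_for_completion("random-example") would block until the study reports Completed or the poll limit is reached. The fetch and sleep parameters exist so that the loop can be exercised without a cluster.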
TensorFlow operator example
To run the TensorFlow operator example, you must first create a volume.
If you are using GKE with the default StorageClass, you must create this PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
If you are not using GKE and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create a PVC and a PV:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
Now you can run the TensorFlow operator example:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfjob-example.yaml
You can check the status of the study:
kubectl -n kubeflow describe studyjobs tfjob-example
PyTorch example
This is an example for the PyTorch operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
You can check the status of the study:
kubectl -n kubeflow describe studyjobs pytorchjob-example
Monitoring results
You can monitor your results in the Katib UI. To access the Katib UI, you must install Ambassador.
In your ksonnet application’s root directory, run the following commands:
ks generate ambassador ambassador
ks apply ${KF_ENV} -c ambassador
Then port-forward the Ambassador service:
- For Kubernetes version 1.9 and later:
  kubectl port-forward svc/ambassador -n kubeflow 8080:80
- For Kubernetes version 1.8 and earlier:
  kubectl get pods -n kubeflow # Find one of the Ambassador pods
  kubectl port-forward [Ambassador pod] -n kubeflow 8080:80
Now you can access the Katib UI at this URL: http://localhost:8080/katib/
Cleanup
Delete the installed components:
ks delete ${KF_ENV} -c katib
ks delete ${KF_ENV} -c pytorch-operator
ks delete ${KF_ENV} -c tf-job-operator
If you created a PV for Katib, delete it:
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
If you created a PV and PVC for the TensorFlow operator, delete them:
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
If you deployed Ambassador, delete it:
ks delete ${KF_ENV} -c ambassador
Metrics collector
Katib has a metrics collector that gathers metrics from each worker by reading the stdout of that worker. Print metrics in the following format: {metrics name}={value}. For example, when your objective value name is loss and the metrics are recall and precision, your training container should print output like this:
epoch 1:
loss=0.3
recall=0.5
precision=0.4
epoch 2:
loss=0.2
recall=0.55
precision=0.5
Katib collects all logs of metrics.
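As a rough illustration of this format, the sketch below prints metrics the way a training container could and parses such lines back into one list of values per metric. It is not Katib's collector; the function names and the regular expression are assumptions chosen for this example.

```python
import re

# A line counts as a metric only if it is exactly "<name>=<number>".
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z_][\w-]*)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def format_metrics(epoch, metrics):
    """Return the stdout lines a worker could print for one epoch."""
    lines = [f"epoch {epoch}:"]
    lines += [f"{name}={value}" for name, value in metrics.items()]
    return lines


def parse_metrics(log_text, names):
    """Collect every logged value for the requested metric names, in log order."""
    collected = {name: [] for name in names}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in collected:
            collected[match.group(1)].append(float(match.group(2)))
    return collected


if __name__ == "__main__":
    log = "\n".join(
        format_metrics(1, {"loss": 0.3, "recall": 0.5, "precision": 0.4})
        + format_metrics(2, {"loss": 0.2, "recall": 0.55, "precision": 0.5})
    )
    print(log)
    print(parse_metrics(log, ["loss", "recall", "precision"]))
```

Lines that do not match the name=value shape, such as the epoch headers, are ignored by this parser, so ordinary log output can be mixed with metric lines.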