You are viewing documentation for Kubeflow 0.5

This is a static snapshot from the time of the Kubeflow 0.5 release.
For up-to-date information, see the latest version.

ModelDB

ModelDB - A system to manage machine learning models

Introduction

ModelDB is an end-to-end system for managing machine learning models. It ingests models and their associated metadata while the models are being trained, stores the model data in a structured format, and exposes it through a web frontend for rich querying. ModelDB can be used with any ML environment through the ModelDB Light API. ModelDB native clients provide advanced support for spark.ml and scikit-learn.

For more information, see the ModelDB project documentation.

Deploying ModelDB

Use the following ksonnet commands to deploy ModelDB.

ks generate modeldb modeldb
ks apply default -c modeldb

Concepts

ModelDB organizes model data in a three-level hierarchy, listed here from bottom to top:

  1. ExperimentRun: every execution of a script/program creates an ExperimentRun.
  2. Experiment: related ExperimentRuns can be grouped into an Experiment (e.g., “running hyperparameter optimization for the Neural Network”).
  3. Project: Finally, all Experiments and ExperimentRuns belong to a Project (e.g., “churn prediction”).
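
The hierarchy above can be pictured as nested containers. The following is a minimal sketch using plain Python dataclasses. The class and field names are illustrative assumptions chosen to mirror the concepts; they are not the ModelDB or verta classes themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentRun:
    """One execution of a script or program (hypothetical stand-in)."""
    name: str
    hyperparameters: Dict[str, float] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Experiment:
    """A group of related ExperimentRuns (hypothetical stand-in)."""
    name: str
    runs: List[ExperimentRun] = field(default_factory=list)


@dataclass
class Project:
    """Top level: owns every Experiment and, through them, every run."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)

    def all_runs(self) -> List[ExperimentRun]:
        # Flatten the hierarchy: Project -> Experiment -> ExperimentRun
        return [run for exp in self.experiments for run in exp.runs]


project = Project("churn prediction")
tuning = Experiment("hyperparameter optimization for the neural network")
tuning.runs.append(ExperimentRun("run-1", {"learning_rate": 0.1}))
tuning.runs.append(ExperimentRun("run-2", {"learning_rate": 0.02}))
project.experiments.append(tuning)
```

Here every run is reachable from its project through exactly one experiment, which is the grouping the three levels describe.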

Classes:

  1. Datasets takes file paths and optional metadata. Associate a tag (key) with each Dataset (value).
  2. Model takes the model type, the model, and the path to the model as arguments.
  3. ModelConfig takes the model type and the model configuration.
  4. ModelMetrics takes the metric to use as an argument.

Using ModelDB

After ModelDB is deployed and the modeldb-db, modeldb-backend, and modeldb-frontend pods are running, follow these steps:

  1. Install ModelDB

    ModelDB is now part of the verta library. verta is compatible with Python 3.5+, and the latest verta releases are available as source packages through pip. When using pip, it is generally recommended to install packages in a virtual environment to avoid modifying system state.

  • Check your Python version:

    python --version

  • Create and activate a new virtual environment:

    python -m venv .env

    source .env/bin/activate

  • Install verta:

    pip install verta==versionNumber
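
The installation requirements above can also be checked from Python before connecting to ModelDB. This is a minimal sketch using only the standard library; the helper names `python_is_supported` and `verta_is_installed` are hypothetical and are not part of verta.

```python
import importlib.util
import sys


def python_is_supported(version_info=sys.version_info, minimum=(3, 5)):
    """Return True if the interpreter meets the minimum version stated for verta."""
    return tuple(version_info[:2]) >= minimum


def verta_is_installed():
    """Return True if the verta package can be found in this environment."""
    return importlib.util.find_spec("verta") is not None


if __name__ == "__main__":
    print("Python 3.5+:", python_is_supported())
    print("verta installed:", verta_is_installed())
    # Inside an environment created with venv, sys.prefix differs from sys.base_prefix
    print("inside a virtual environment:", sys.prefix != sys.base_prefix)
```

Running the script inside the activated environment should report whether the interpreter, the package, and the virtual environment are in the expected state.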

  1. Setup

    Get the host and port details of the ModelDB backend proxy service.

    kubectl get service modeldb-backend-proxy --namespace kubeflow
    

    Set HOST and PORT to the values reported for the service so that the client can connect to the ModelDB backend.

    from verta import ModelDBClient
    HOST = ""
    PORT = ""
    
    client = ModelDBClient(HOST, PORT)
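
The HOST and PORT values can also be read programmatically. The sketch below assumes the service is queried with `kubectl get service modeldb-backend-proxy --namespace kubeflow -o json` and that the first listed port is the one to use. The helper names `endpoint_from_service` and `fetch_service` are hypothetical, and the sample Service object is illustrative rather than real cluster output.

```python
import json
import subprocess


def endpoint_from_service(service):
    """Extract (host, port) from a Kubernetes Service object given as a dict."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if ingress:
        # LoadBalancer services report an external IP or hostname here
        host = ingress[0].get("ip") or ingress[0].get("hostname")
    else:
        # Otherwise fall back to the in-cluster address
        host = service["spec"]["clusterIP"]
    port = service["spec"]["ports"][0]["port"]
    return host, str(port)


def fetch_service(name="modeldb-backend-proxy", namespace="kubeflow"):
    """Run kubectl and return the Service as a dict (requires cluster access)."""
    out = subprocess.check_output(
        ["kubectl", "get", "service", name, "--namespace", namespace, "-o", "json"]
    )
    return json.loads(out)


# Illustrative Service object; the address and port are made up for this example
sample = {
    "spec": {"clusterIP": "10.0.0.12", "ports": [{"port": 8080}]},
    "status": {"loadBalancer": {}},
}
HOST, PORT = endpoint_from_service(sample)
```

With cluster access, `HOST, PORT = endpoint_from_service(fetch_service())` would supply the values passed to `ModelDBClient`.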
    
  2. Creating a project

    Begin by creating a project and adding all the models as runs within the project. Each run can represent a strategy to solve the problem.

    project = client.set_project(proj_name="My Project")  # a project is a goal
    experiment = client.set_experiment(expt_name="My Experiment")  # strategy for project
    run = client.set_experiment_run(run_name="First run")
    
  3. Logging hyperparameters, metrics and datasets

    Use the run.log_xxx() methods (for example, run.log_hyperparameter(), run.log_metric(), and run.log_model()) in your code to record hyperparameters, metrics, datasets, and models.

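    # Assumes X_train, y_train, X_test, and y_test are already defined
    import joblib
    from sklearn.ensemble import GradientBoostingRegressor

    # Hyperparameters: candidate values for each setting
    param_grid = {'n_estimators': [100],
                  'learning_rate': [0.1, 0.02],
                  'max_depth': [6, 4],
                  'max_leaf_nodes': [3, 15],
                  'max_features': [1.0, 0.1]
                 }
    # This run uses the first candidate value of each hyperparameter
    hyperparameters = {h: v[0] for h, v in param_grid.items()}
    for h, v in hyperparameters.items():
        run.log_hyperparameter(h, v)

    # Metrics
    model = GradientBoostingRegressor(**hyperparameters)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # For a regressor, score() returns the R^2 coefficient of determination
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    run.log_metric("Accuracy_train", train_score)
    run.log_metric("Accuracy_test", test_score)

    # Models
    # Save the model with either joblib or pickle, then log the saved file
    filename_2 = "simple_model_gbr_2.joblib"
    joblib.dump(model, filename_2)
    run.log_model("model_gbr_2", filename_2)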
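
When a grid of candidate values is explored, each combination can be recorded as its own ExperimentRun within one Experiment. The sketch below shows only the expansion step, using the standard library. The helper name `expand_grid` is hypothetical, the grid is shortened for brevity, and the verta calls appear only in comments because they need a connected client.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of candidate hyperparameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


param_grid = {
    'n_estimators': [100],
    'learning_rate': [0.1, 0.02],
    'max_depth': [6, 4],
}

combinations = expand_grid(param_grid)
for i, hyperparameters in enumerate(combinations):
    # With a connected client, each combination would get its own run, e.g.:
    #   run = client.set_experiment_run(run_name="run-{}".format(i))
    #   for h, v in hyperparameters.items():
    #       run.log_hyperparameter(h, v)
    print(i, hyperparameters)
```

Logging one scalar value per hyperparameter per run keeps runs comparable, since each run then describes exactly one trained model.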
    
  4. View your models in the webapp

    Get the IP address of the ModelDB webapp service and open it in a browser.

    kubectl get service modeldb-webapp --namespace kubeflow
    

Samples

These notebooks show how datasets, models, model configurations, and model metrics can be initialized and logged to ModelDB:

  • TensorFlow
  • PyTorch
  • sklearn