honeydipper

Workflow Composing Guide

DipperCL is the control language that Honeydipper uses to configure data, assets and logic for its operation. It is essentially YAML with a Honeydipper-specific schema.

Composing Workflows

A workflow defines what to do, and how to do it, when an event is triggered. A workflow can be defined directly in the do section of a rule, or it can be defined independently with a name so that it can be reused and shared among multiple rules and workflows. A workflow can be as simple as invoking a single driver rawAction. It can also contain complicated logic and procedures dealing with various scenarios. All workflows are built with the same building blocks and follow the same process, and they can be stacked and combined with each other to achieve more complicated goals.

An example of a workflow defined in a rule calling a rawAction:

---
rules:
  - when:
      driver: webhook
      conditions:
        url: /test1
    do:
      call_driver: redispubsub.broadcast
      with:
        subject: internal
        channel: foo
        key: bar

An example of a named workflow that can be invoked from other workflows:

---
workflows:
  foo:
    call_function: example.execute
    with:
      key1: val2
      key2: val2

rules:
  - when:
      source:
        system: example
        trigger: happened
    do:
      call_workflow: foo

Simple Actions

There are 4 types of simple actions that a workflow can perform.

They cannot be combined.

A workflow can also have no action at all. {} is a perfectly legitimate no-op workflow.
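As a minimal sketch: the action fields that appear in the examples of this guide are call_driver, call_function, call_workflow and wait. Treating these as the four simple actions is an assumption drawn from those examples, and the workflow names below are hypothetical.

```yaml
---
workflows:
  pause_briefly:          # hypothetical name; a single `wait` action
    wait: 10s

  run_example:            # hypothetical name; a single `call_function` action
    call_function: example.execute

  do_nothing: {}          # no action at all, a no-op workflow
```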

Complex Actions

Complex actions are groups of multiple workflows organized together to do some complex work.

These cannot be combined with each other or with any of the simple actions.

When using steps or threads, you can control the behaviour of the workflow upon a failure or error status through the fields on_failure and on_error. The allowed values are continue and exit. By default, on_failure is set to continue while on_error is set to exit. When using threads, exit means that when one thread returns an error, the workflow returns without waiting for the other threads to return.
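The following is a minimal sketch of these fields, assuming that threads takes a list of child workflows in the same way steps does. The workflow names check_a, check_b and report are hypothetical.

```yaml
---
workflows:
  run_checks_in_order:
    on_failure: exit      # override the default (continue): stop at the first failed step
    steps:
      - call_workflow: check_a
      - call_workflow: check_b
      - call_workflow: report

  run_checks_together:
    on_error: continue    # override the default (exit): do not return early when a thread errors
    threads:
      - call_workflow: check_a
      - call_workflow: check_b
```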

Iterations

Any of the actions can be combined with an iterate or iterate_parallel field to be executed multiple times with different values from a list. The current element of the list is stored in a local contextual data item named current. Optionally, you can customize the name of that contextual data item using iterate_as. The elements of the list don’t have to be simple strings; they can be maps or other complex data structures.

For example:

---
workflows:
  foo:
    iterate:
      - name: Peter
        role: hero
      - name: Paul
        role: villain
    call_workflow: announce
    with:
      message: '{{ .ctx.current.name }} is playing the role of `{{ .ctx.current.role }}`.'
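A variation on the same example, sketched under the assumption that iterate_parallel accepts the same list form as iterate and that iterate_as takes the new item name as a plain string. The workflow name announce_cast and the item name actor are hypothetical.

```yaml
---
workflows:
  announce_cast:
    iterate_parallel:
      - name: Peter
        role: hero
      - name: Paul
        role: villain
    iterate_as: actor     # the current element is stored as `actor` instead of `current`
    call_workflow: announce
    with:
      message: '{{ .ctx.actor.name }} is playing the role of `{{ .ctx.actor.role }}`.'
```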

Conditions

We can also specify the conditions that the workflow checks before taking any action.

Some examples for using skeleton data matching:

---
workflows:
  do_foo:
    if_match:
      foo: bar
    call_workflow: do_something

  do_bar:
    unless_match:
      team: :regex:engineering-.*
    call_workflow: complaint
    with:
      message: Only engineers are allowed here.

  do_something:
    if_match:
      user:
        - privileged_user1
        - privileged_user2
    call_workflow: assert
    with:
      message: you are either privileged_user1 or privileged_user2

  do_some_other-stuff:
    if_match:
      user:
        age: 13
    call_workflow: assert
    with:
      message: .ctx.user matches a data structure with age field equal to 13

Note how the examples use a regular expression and a list of options to match the contextual data, and how they match a field deep inside a data structure.

Below are some examples of using a list of conditions:

---
workflows:
  run_if_all_meets:
    if:
      - $ctx.exists # ctx.exists must not be empty and must not be one of the strings `false`, `nil`, `{}`, `[]`, `0`.
      - $ctx.also  # ctx.also must also be truthy
    call_workflow: assert
    with:
      message: '`exists` and `also` are both truthy'

  run_if_either_meets:
    if_any:
      - '{{ empty .ctx.exists | not }}'
      - '{{ empty .ctx.also | not }}'
    call_workflow: assert
    with:
      message: at least one of `exists` or `also` is not empty

Looping

We can also repeat the actions in the workflow through looping fields.

For example:

---
workflows:
  retry_func: # a simple forever retry
    on_error: continue
    on_failure: exit
    with:
      success: false
    until:
      - $ctx.success
    steps:
      - call_function: $ctx.func
      - export:
          success: '{{ eq .labels.status "success" }}'
    no_export:
      - success

  retry_func_count_with_exp_backoff:
    on_error: continue
    on_failure: exit
    with:
      success: false
      backoff: 0
      count-: 2
    until:
      - $ctx.success
      - $ctx.count
    steps:
      - if:
          - $ctx.backoff
        wait: '{{ .ctx.backoff }}s'
      - call_function: $ctx.func
      - export:
          count: '{{ sub (int .ctx.count) 1 }}'
          success: '{{ eq .labels.status "success" }}'
          backoff: '{{ .ctx.backoff | default 10 | int | mul 2 }}'
    no_export:
      - success
      - count
      - backoff

Hooks

Hooks are child workflows executed at specified moments in the parent workflow’s lifecycle. They are a great way to separate auxiliary work, such as sending heartbeats, sending Slack messages, making announcements, cleaning up or preparing data, from the actual work. Hooks are defined through context data, so they can be pulled in through predefined contexts, which makes the actual workflow less cluttered.

For example,

contexts:
  _events:
    '*':
      hooks:
        - on_first_action: workflow_announcement
  opsgenie:
    '*':
      hooks:
        - on_success:
            - snooze_alert

rules:
  - when:
      source:
        system: foo
        trigger: bar
    do:
      call_workflow: do_something

  - when:
      source:
        system: opsgenie
        trigger: alert
    do:
      context: opsgenie
      call_workflow: do_something

In the above example, although not explicitly spelled out in the rules, both events will trigger the execution of the workflow_announcement workflow before the first action is executed. And if the workflow responding to the opsgenie.alert event is successful, the snooze_alert workflow will be executed.

The supported hooks:

Contextual Data

Contextual data is the key to stitching different events, functions, drivers and workflows together.

Sources

Every workflow receives contextual data from a few sources:

Since the data is received in the particular order listed above, a later source can override data from earlier sources. A child workflow’s context data is independent of its parent workflow: anything defined in with or inherited is only in effect during the life cycle of the current workflow, except for exported data. Once a field is exported, it is available to all outer workflows. You can override this by specifying the list of fields that you don’t want to export.

Pay attention to the example retry_func_count_with_exp_backoff in the previous section. In order to not contaminate parent context with temporary fields, we use no_export to block the exporting of certain fields.
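A smaller sketch of the same idea follows. The workflow names lookup_owner and page_owner and the context fields team, owner and scratch are hypothetical; announce is the workflow used in the iteration example, assumed to take a message field.

```yaml
---
workflows:
  lookup_owner:
    steps:
      - call_function: example.execute
      - export:
          owner: '{{ .ctx.team }}-oncall'   # exported, so visible to the calling workflow
          scratch: temporary value          # would be exported as well ...
    no_export:
      - scratch                             # ... unless it is blocked here

  page_owner:
    steps:
      - call_workflow: lookup_owner
      - call_workflow: announce
        with:
          message: 'Paging {{ .ctx.owner }}'  # reads the field exported by the child
```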

Interpolation

We can use interpolation in workflows to make the workflow flexible and versatile. You can use interpolation in most of the fields of a workflow. Besides contextual data, other data available for interpolation includes:

It is recommended to avoid using event and data in workflows and to stick to ctx as much as possible. The raw unexposed data might eventually be deprecated and hidden, although it may still be available in system definitions.

DipperCL provides the following ways of interpolation:

See interpolation guide for detail on how to use interpolation.

Merging Modifier

When data from different sources is merged, by default, map structures are deeply merged, while all other types of data with the same name are replaced by the newer source. One exception is that if the data in the new source is not of the same type as the existing data, the old data is kept.

For example, undesired merge behaviour:

---
workflows:
  merge:
    - export:
        data: # original
          foo: bar
          foo_map:
            key1: val1
          foo_list:
            - item1
            - item2
          foo_param: "a string"
    - export:
        data: # overriding
          foo: foo
          foo_map:
            key2: val2
          foo_list:
            - item3
            - item4
          foo_param: # type inconsistent
            key: val

After merging with the second step, the final exported data will be like below. Notice the fields that are replaced.

data: # final
  foo: foo
  foo_map:
    key1: val1
    key2: val2
  foo_list:
    - item3
    - item4
  foo_param: "a string"

We can change the behaviour by using merging modifiers at the end of the overriding data names.

Usage:

Here var is an example name for the overriding data; the character that follows it indicates which merge modifier to use.
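As an illustration, the append modifier + is the one used in the steps+ examples later in this guide. The sketch below assumes that + appends the new list items to the existing list instead of replacing it; the workflow name merge_with_append is hypothetical.

```yaml
---
workflows:
  merge_with_append:
    steps:
      - export:
          data:
            foo_list:
              - item1
              - item2
      - export:
          data:
            foo_list+:    # `+` asks for append instead of the default replace
              - item3
              - item4
```

Under that assumption, the exported foo_list would hold item1 through item4 rather than only item3 and item4.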

Essential Workflows

We have made a few helper workflows available in the honeydipper-config-essentials repo. Hopefully, they will make it easier for you to write your own workflows.

notify

Sends a chat message using the configured system. The chat system can be anything that provides a say and a reply function.

Required context fields

workflow_announcement

This workflow is intended to be invoked through on_first_action hook to send a chat message to announce what will happen.

Required context fields

Besides the fields above, this workflow also uses a few context fields that are set internally from the definition of the host workflow (not the hook itself).

workflow_status

This workflow is intended to be invoked through on_exit, on_error, on_success or on_failure.

Required context fields

Besides the fields above, this workflow also uses a few context fields and labels that are set internally from the host workflow (not the hook itself).
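A minimal sketch of wiring this workflow in as a hook, following the hook layout shown in the Hooks section. It assumes that on_exit takes a workflow name in the same way on_first_action does, and it leaves out the required context fields.

```yaml
---
contexts:
  _events:
    '*':
      hooks:
        - on_exit: workflow_status   # report the status whenever a workflow exits
```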

send_heartbeat

This workflow can be used in on_success hooks or as a stand-alone step. It sends a heartbeat to the alerting system.

Required context fields

snooze_alert

This workflow can be used in on_success hooks or as a stand-alone step. It snoozes the alert that triggered the workflow.

Running a Kubernetes Job

We can use the predefined run_kubernetes workflow from the honeydipper-config-essentials repo to run kubernetes jobs. A simple example is below:

---
workflows:
  kubejob:
    run_kubernetes:
      system: samplecluster
      steps:
        - type: python
          command: |
            ...python script here...
        - type: bash
          shell: |
            ...shell script here...

Basics of run_kubernetes

The run_kubernetes workflow requires a system context field that points to a predefined system. The system must be extended from the kubernetes system so that it has the createJob, waitForJob and getJobLog functions defined. The predefined system should also have the information required to connect to the kubernetes cluster, the namespace to use, etc.

The required steps context field tells the workflow what containers to define in the kubernetes job. If there is more than one step, all steps before the last one are defined in the initContainers section of the pod, and the last step is defined in containers.

Each step of the job has its type, which defines what docker image to use. The workflow comes with a few types predefined.

A step can be defined using a command or a shell. A command is a string or a list of strings that is passed to the default entrypoint using args in the container spec. A shell is a string or a list of strings that is passed to a customized shell script entrypoint.

For example

---
workflows:
  samplejob:
    run_kubernetes:
      system: samplecluster
      steps:
        - type: python3
          command: 'print("hello world")'
        - type: python3
          shell: |
            cd /opt/app
            pip install -r requirements.txt
            python main.py

The first step uses command to pass a python command or script directly to the container, while the second step uses shell to run a script using the same container image.

A shared emptyDir volume is mounted at /honeydipper in every step, so that the steps can use the shared storage to pass information along. Note that the steps don’t honour the default WORKDIR defined in the image; instead, all the steps use /honeydipper as workingDir in the container spec. This can be customized using workingDir in the step definition itself.
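A minimal sketch of overriding the working directory in one step. It uses the predefined git-clone step described later in this guide, which clones into /honeydipper/repo, and assumes the repo contains the requirements.txt and main.py files from the earlier example.

```yaml
---
workflows:
  samplejob:
    run_kubernetes:
      system: samplecluster
      steps:
        - git-clone
        - type: python3
          workingDir: /honeydipper/repo   # run inside the cloned repo instead of /honeydipper
          shell: |
            pip install -r requirements.txt
            python main.py
```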

The workflow returns success in .labels.status when the job finishes successfully. If it fails to create a job, or fails to get the status or job output, the status will be error. If the job is created but fails to complete or returns a non-zero status code, .labels.status will be set to failure. The workflow exports a log context field that contains a map from pod name to a map of container name to log output. A simple string version of the output, containing all the concatenated logs, is exported as the output context field.
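The sketch below shows one way the status label and the exported output field might be consumed in a following step. It assumes that .labels.status is visible to the next step, as in the retry_func example, and that announce is a workflow taking a message field; the workflow name run_and_report is hypothetical.

```yaml
---
workflows:
  run_and_report:
    steps:
      - call_workflow: run_kubernetes
        with:
          system: samplecluster
          steps:
            - type: python3
              command: 'print("hello world")'
      - if:
          - '{{ eq .labels.status "success" }}'   # only report when the job succeeded
        call_workflow: announce
        with:
          message: 'Job output: {{ .ctx.output }}' # concatenated logs exported by run_kubernetes
```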

Environment Variables and Volumes

You can define environment variables and volumes in each step, or as global context fields to share them across steps. For example,

---
workflows:
  samplejob:
    run_kubernetes:
      system: samplecluster
      env:
        - name: CLOUDSDK_CONFIG
          value: /honeydipper/.config/gcloud
      steps:
        - git-clone
        - type: gcloud
          shell: |
            gcloud auth activate-service-account $GOOGLE_APPLICATION_ACCOUNT --key-file=$GOOGLE_APPLICATION_CREDENTIALS
          env:
            - name: GOOGLE_APPLICATION_ACCOUNT
              value: sample-service-account@foo.iam.gserviceaccount.com
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/gcloud/service-account.json
          volumes:
            - mountPath: /etc/gcloud
              volume:
                name: credentials-volume
                secret:
                  defaultMode: 420
                  secretName: secret-gcloud-service-account
                
        - type: tf
          shell: |
            terraform plan -no-color

Note that the CLOUDSDK_CONFIG environment variable is shared among all the steps. This ensures that all steps use the same gcloud configuration directory. The volume definition here is a combination of the volumes and volumeMounts definitions from the pod spec.

Predefined Step

To make writing kubernetes job workflows easier, we have created a few predefined_steps that you can use instead of writing your own from scratch. To use the predefined_step, just replace the step definition with the name of the step. See the example from the previous section, where the first step of the job is git-clone.

This step clones the given repo into the /honeydipper/repo folder on the shared volume. It requires that the system contains a few fields to identify the repo to be cloned. That includes:

We can also use the predefined step as a skeleton to create our steps by overriding the settings. For example,

---
workflows:
  samplejob:
    run_kubernetes:
      system: samplecluster
      steps:
        - use: git-clone
          volumes: [] # no need for secret volumes when cloning a public repo
          env:
            - name: REPO
              value: https://github.com/honeydipper/honeydipper
            - name: BRANCH
              value: DipperCL
        - ...

Pay attention to the use field of the step.

Expanding run_kubernetes

If run_kubernetes only supports built-in types or predefined steps, it won’t be too useful in a lot of places. Luckily, it is very easy to expand the workflow to support more things.

To add a new step type, just extend the _default context under start_kube_job in the script_types field.

For example, to add a type with the rclone image,

---
contexts:
  _default:
    start_kube_job:
      script_types:
        rclone:
          image: kovacsguido/rclone:latest
          command_prefix: []
          shell_entry: [ "/bin/ash", "-c" ]

Supported fields in a type:

Similarly, to add a new predefined step, extend the predefined_steps field in the same place.

For example, to add an rclone step:

---
contexts:
  _default:
    start_kube_job:
      predefined_steps:
        rclone:
          name: backup-replicate
          type: rclone
          command:
            - copy
            - --include
            - '{{ coalesce .ctx.pattern (index (default (dict) .ctx.patterns) (default "" .ctx.from)) "*" }}'
            - '{{ coalesce .ctx.source (index (default (dict) .ctx.sources) (default "" .ctx.from)) }}'
            - '{{ coalesce .ctx.destination (index (default (dict) .ctx.destinations) (default "" .ctx.to)) }}'
          volumes:
            - mountPath: /root/.config/rclone
              volume:
                name: rcloneconf
                secret:
                  defaultMode: 420
                  secretName: rclone-conf-with-ca

See Defining steps on how to define a step.

Using run_kubernetes in GKE

GKE is a Google-managed kubernetes cluster service. You can use run_kubernetes to run jobs in GKE as you would in any kubernetes cluster. There are a few more helper workflows and predefined steps specifically for GKE.

If the context variable google_credentials_secret is defined, the use_google_credentials workflow will add a step to the steps list to activate the service account. The service account must exist in the kubernetes cluster as a secret. The service account key can be specified using google_credentials_secret_key and defaults to service-account.json. This is a great way to run your job with a service account other than the default account defined through the GKE node pool. This step has to be executed before you call run_kubernetes, and the following steps in the job have to be added through the append modifier.

For example:

---
workflows:
  create_cluster:
    steps:
      - call_workflow: use_google_credentials
      - call_workflow: run_kubernetes
        with:
          steps+: # using append modifier here
            - type: gcloud
              shell: gcloud container clusters create {{ .ctx.new_cluster_name }}

The use_gcloud_kubeconfig workflow adds a step that runs gcloud container clusters get-credentials to fetch the kubeconfig data for GKE clusters. This step requires that the cluster context variable is defined and describes a GKE cluster with fields like project, cluster, zone or region.

For example:

---
workflows:
  delete_job:
    with:
      cluster:
        type: gke # specify the type of the kubernetes cluster
        project: foo
        cluster: bar
        zone: us-central1-a
    steps:
      - call_workflow: use_google_credentials
      - call_workflow: use_gcloud_kubeconfig
      - call_workflow: run_kubernetes
        with:
          steps+:
            - type: gcloud
              shell: kubectl delete jobs {{ .ctx.job_name }}

The use_local_kubeconfig workflow adds a step that clears the kubeconfig file so that kubectl can use the default in-cluster settings to work on the local cluster.

For example:

---
workflows:
  copy_deployment_to_local:
    steps:
      - call_workflow: use_google_credentials
      - call_workflow: use_gcloud_kubeconfig
        with:
          cluster:
            project: foo
            cluster: bar
            zone: us-central1-a
      - export:
          steps+:
            - type: gcloud
              shell: kubectl get -o yaml deployment {{ .ctx.deployment }} > kubernetes.yaml
      - call_workflow: use_local_kubeconfig
      - call_workflow: run_kubernetes
        with:
          steps+:
            - type: gcloud
              shell: kubectl apply -f kubernetes.yaml

Slash Commands

The new version of DipperCL comes with Slack integration, including slash commands, right out of the box. Once the integration is set up, we can easily add and customize slash commands. See the integration guide (coming soon) for detailed instructions. There are a few predefined commands that you can try out without any further customization.

Predefined Commands

Adding New Commands

Let’s say that you have a new workflow that you want to trigger through slash command. Define or extend a _slashcommands context to have something like below.

contexts:
  _slashcommands:
    slashcommand:
      slashcommands:
        <command>:
          workflow: <workflow>
          usage: just some brief intro to your workflow
          contexts: # optionally you can run your workflow with these contexts
            - my_context

Replace the content in <> with your own content.

Mapping Parameters

Most workflows expect certain context variables to be available in order to function. For example, you may need to specify which DB to back up or restore using a DB context variable when invoking a backup/restore workflow. When a slash command is defined, a parameters context variable is made available as a string that can be accessed through $ctx.parameters using path interpolation or {{ .ctx.parameters }} in go templates. We can use the _slashcommands context to transform the parameters context variable into the actual variables the workflow requires.

For a simple example,

contexts:
  _slashcommands:
    slashcommand:
      slashcommands:
        my_greeting:
          workflow: greeting
          usage: respond with greet, take a single word as greeter

    greeting: # here is the context applied to the greeting workflow
      greeter: $ctx.parameters # the parameters is transformed into the variable required

In case you want a list of words,

contexts:
  _slashcommands:
    slashcommand:
      slashcommands:
        my_greeting:
          workflow: greeting
          usage: respond with greet, take a list of greeters

    greeting: # here is the context applied to the greeting workflow
      greeters: ':yaml:{{ splitList " " .ctx.parameters | toJson }}' # this generates a list

A more complex example, a command with subcommands:

contexts:
  _slashcommands:
    slashcommand:
      slashcommands:
        jobs:
          workflow: jobHandler
          usage: handling internal jobs

    jobHandler:
      command: '{{ splitList " " .ctx.parameters | first }}'
      name: '{{ splitList " " .ctx.parameters | rest | first }}'
      jobParams: ':yaml:{{ splitList " " .ctx.parameters | slice 2 | toJson }}'

Messages and notifications

By default, a slashcommand sends its acknowledgement and return status messages to the channel where the command is launched. The messages are only visible to the sender; in other words, they are ephemeral. We can define a list of channels to receive the acknowledgement and return status in addition to the sender. This increases visibility and auditability. This is done simply by adding a slash_notify context variable to the slashcommand workflow in the _slashcommands context.

For example,

contexts:
  _slashcommands:
    slashcommand:
      slash_notify:
        - "#my_team_channel"
        - "#security"
        - "#dont_tell_the_ceo"
      slashcommands:
        ...

Secure the commands

When defining each command, we can use the allowed_channels field to define a whitelist of channels from which the command can be launched. For example, it is recommended to override the reload command so that it can only be launched from whitelisted channels, like below.

contexts:
  _slashcommands:
    slashcommand:
      slashcommands:
        reload: # predefined
          allowed_channels:
            - "#sre"
            - "#ceo"