Clear Some Doubts on Health Probes for Pod

Zhimin Wen
5 min readJul 3, 2020

--

I was asked about some fundamental concepts of the Pod’s health probes such as liveness, readiness probes,

Q1. When are the liveness and the readiness probes starting to probe? Will the readiness probe start probing only after liveness success?

Q2. Will the readiness probe failure triggering the restart of the pod?

Q3. How the timeout affects the probe?

To answer these questions, other than referring them to the Kubernetes document, additionally, I created some test cases to clear the doubts.

The Test Program

Let us create a quick go hello world Http handler program for the testing. (The hello world handler is skipped here.)

The liveness probe handler is listed as below. It will log the request’s remote address so that we will be able to know where the probe comes from and the triggering time. If the sleep parameter is not zero, we sleep some seconds to simulate the timeout scenario. Based on a flag, we return the request success or failure accordingly.

func live(w http.ResponseWriter, r *http.Request) {
logrus.Infof("Liveness probed from %s", r.RemoteAddr)

if livenessSleepTime > 0 {
logrus.Infof("liveness sleeping %d seconds", livenessSleepTime)
time.Sleep(time.Duration(livenessSleepTime) * time.Second)
}
if livenessFlag == true {
fmt.Fprintf(w, "Live")
return
}
w.WriteHeader(http.StatusInternalServerError)
w.Write([]byte("Not live"))
}

The same applies to the readiness handler.

We also have a handler to dynamically change the flag values after the program starts.

func updateProbe(w http.ResponseWriter, r *http.Request) {
probe := r.FormValue("probe")
timeOut := r.FormValue("timeout")
success := r.FormValue("success")
if timeOut != "" {
logrus.Infof("setting timeout")
sleepTime, err := strconv.Atoi(timeOut)
if err != nil {
w.WriteHeader(http.StatusInternalServerError)
logrus.Errorf("Failed to convert to int:%v", err)
fmt.Fprintf(w, "Failed to convert timeout value to int", err)
return
}
mutex.Lock()
if probe == "liveness" {
livenessSleepTime = sleepTime
} else if probe == "readiness" {
readinessSleepTime = sleepTime
}
mutex.Unlock()
}
if success != "" {
var result bool
if success == "true" {
result = true
} else if success == "false" {
result = false
}
mutex.Lock()
if probe == "liveness" {
livenessFlag = result
} else if probe == "readiness" {
readinessFlag = result
}
mutex.Unlock()
}
fmt.Fprintf(w, "probe setting updated")
}

Set the initial value of the parameters and mount the handlers in the main function,

func main() {
logrus.Infof("Program started...")
livenessFlag = true
readinessFlag = true
for _, env := range quote.Word(`LIVENESS_SLEEP READINESS_SLEEP`) {
s := os.Getenv(env)
i, err := strconv.Atoi(s)
if err != nil {
logrus.Fatalf("Failed to convert %s: %v", env, err)
}
if env == "LIVENESS_SLEEP" {
livenessSleepTime = i
} else if env == "READINESS_SLEEP" {
readinessSleepTime = i
}
}
logrus.Infof("sleep for liveness:%d. sleep for readiness: %d", livenessSleepTime, readinessSleepTime)
http.HandleFunc("/", greet)
http.HandleFunc("/alive", live)
http.HandleFunc("/ready", ready)

http.ListenAndServe(":8080", nil)
}

Compile the program and deploy it as a K8s deployment. Particularly, we set the following readiness probe and liveness probe.

livenessProbe:
httpGet:
path: /alive
port: 8080
initialDelaySeconds: 3
periodSeconds: 5
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 3
periodSeconds: 5
timeoutSeconds: 5
env:
- name: LIVENESS_SLEEP
value: "0"
- name: READINESS_SLEEP
value: "0"

Q1. When are the liveness and the readiness probes starting to probe? Will the readiness probe start probing only after liveness success?

Based on the pod description, we can find out the startup time of the pod.

$ kubectl describe pods probe-767d896f-hbz6f | grep Started:
Started: Fri, 03 Jul 2020 13:18:34 +0000

List the first few line logs of the pod, where our probes record the request.

$ kubectl logs probe-767d896f-hbz6f | head -10
time="2020-07-03T13:18:34Z" level=info msg="Program started..."
time="2020-07-03T13:18:34Z" level=info msg="sleep for liveness:0. sleep for readiness: 0"
time="2020-07-03T13:18:39Z" level=info msg="Liveness probed from 10.131.0.1:43630"
time="2020-07-03T13:18:40Z" level=info msg="Readiness probed from 10.131.0.1:43634"
time="2020-07-03T13:18:44Z" level=info msg="Liveness probed from 10.131.0.1:43668"
time="2020-07-03T13:18:45Z" level=info msg="Readiness probed from 10.131.0.1:43674"
time="2020-07-03T13:18:49Z" level=info msg="Liveness probed from 10.131.0.1:43706"
time="2020-07-03T13:18:50Z" level=info msg="Readiness probed from 10.131.0.1:43710"
time="2020-07-03T13:18:54Z" level=info msg="Liveness probed from 10.131.0.1:43744"
time="2020-07-03T13:18:55Z" level=info msg="Readiness probed from 10.131.0.1:43750"

The logs tell that both liveness and readiness probes are running independently right after the container starts. The readiness is not depending on the liveness and vice versa.

We can confirm this, by delaying the liveness probe startup with the value initialDelaySeconds: 30

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
probe-6f65dd6c79-6fzgq 0/1 Terminating 0 93s
probe-6f65dd6c79-swk7q 1/1 Running 0 7s
$ kubectl logs probe-6f65dd6c79-swk7q | head -10
time="2020-07-03T13:27:05Z" level=info msg="Program started..."
time="2020-07-03T13:27:05Z" level=info msg="sleep for liveness:0. sleep for readiness: 0"
time="2020-07-03T13:27:08Z" level=info msg="Readiness probed from 10.131.0.1:49488"
time="2020-07-03T13:27:13Z" level=info msg="Readiness probed from 10.131.0.1:49520"
time="2020-07-03T13:27:18Z" level=info msg="Readiness probed from 10.131.0.1:49556"
time="2020-07-03T13:27:23Z" level=info msg="Readiness probed from 10.131.0.1:49590"
time="2020-07-03T13:27:28Z" level=info msg="Readiness probed from 10.131.0.1:49626"

As we delay the liveness probe, the readiness probe is running without any impact, it still starts after the initial 3 seconds delay. It is as expected but yet interesting to notice that even the liveness probe is not triggered, the pod is ready to serve the request.

Q2. Will the readiness probe failure triggering the restart of the pod?

Let’s use the tester to find out the answer. Exec into the pod to set the readiness flag to false, so that the readiness probe will return false value.

kubectl exec -it probe-6f65dd6c79-swk7q -- curl "http://localhost:8080/updateProbe?probe=readiness&success=false"

After 3 failures (failureThreshold default is 3), the pod is marked as not ready but still running, as what is expected. And it will not be restarted.

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
probe-6f65dd6c79-swk7q 1/1 Running 0 16m
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
probe-6f65dd6c79-swk7q 0/1 Running 0 17m
$ kubetl get pods
NAME READY STATUS RESTARTS AGE
probe-6f65dd6c79-swk7q 0/1 Running 0 23m

Q3. How the timeout affects the probe?

If the probe could not return value within timeout value, the probe result is considered as failure.

First, let's make the probe working again from Q2.

$ kubectl exec -it probe-6f65dd6c79-swk7q -- curl "http://localhost:8080/updateProbe?probe=readiness&success=true"

Set the sleep time more than the timeout value to simulate timeout.

oc exec -it probe-6f65dd6c79-swk7q -- curl "http://localhost:8080/updateProbe?probe=readiness&timeout=6"

After 3 times of timeout, the pod is not ready.

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
probe-6f65dd6c79-swk7q 0/1 Running 0 30m

The log shows,

$ kubectl logs probe-6f65dd6c79-swk7q | tail -10
time="2020-07-03T13:57:43Z" level=info msg="Readiness sleeping 6 seconds"
time="2020-07-03T13:57:47Z" level=info msg="Liveness probed from 10.131.0.1:34786"
time="2020-07-03T13:57:48Z" level=info msg="Readiness probed from 10.131.0.1:34802"
time="2020-07-03T13:57:48Z" level=info msg="Readiness sleeping 6 seconds"
time="2020-07-03T13:57:52Z" level=info msg="Liveness probed from 10.131.0.1:34822"
time="2020-07-03T13:57:53Z" level=info msg="Readiness probed from 10.131.0.1:34836"
time="2020-07-03T13:57:53Z" level=info msg="Readiness sleeping 6 seconds"
time="2020-07-03T13:57:57Z" level=info msg="Liveness probed from 10.131.0.1:34862"
time="2020-07-03T13:57:58Z" level=info msg="Readiness probed from 10.131.0.1:34878"
time="2020-07-03T13:57:58Z" level=info msg="Readiness sleeping 6 seconds"

It is noticed that the readiness probe is running at a fixed interval even the previous probe is not finished yet.

--

--