Bash:Waitpid Function
=====================

This last week I ran into a bit of shell scripting that really caused me some
grief for a few days. The person who wrote it had named the script
*stopService.sh*. Now, one should assume that if a script as simple as a
service stop script has been working for the last two years and nothing has
changed, it would continue working. It turns out that with various other
environmental changes, a bug introduced two years ago was suddenly showing up.

That said, I took apart this stop script and noticed at the bottom that the
original author had written a check for a service to stop that looked like...

.. code-block:: sh

  # Get the pid of tomcat
  pid=$(ps -ef | grep tomcat | grep -v grep | tr -s ' ' | cut -d ' ' -f 2 | head -1)

  # Send kill command to the process
  kill -0 ${pid}

  while [ true ]; do
    sleep 15
    ps -f ${pid} 2>/dev/null 1>/dev/null

    if [[ $? -gt 0 ]]; then
      break
    fi

    break
  done


If you read much bash or code at all, you'll notice that while loop there and
think "Oh, how clever - looping until the process exits". Then you'll arrive at
the break statement at the end and think "Wait, why put all this code in a loop
and always break on the first iteration?".

When I saw that block of code at the end of a 243 line script designed to stop
a single process type, I again realized that, regardless of title (senior
principal architect in this case), not everyone who scripts knows how to do
process management or understands basic programming logic.

I don't claim to be an expert at all. I do however have many years of
experience with this, so perhaps I can contribute something new to some
people's knowledge. If you disagree with how I go about solving this problem,
please feel free to send me an email. I'd be happy to learn something new if
you've got a better way to do it! With that, let's get started.


What's wrong with that excerpt?
-------------------------------

First, there are several problems with that code excerpt that, with a little
knowledge, could be solved easily and in a portable, reproducible, and
maintainable manner. The problems are more than just technical as well. Here
are the problems that I see.

* It doesn't ensure the process stops before proceeding.

* Kill signal 0 does nothing except allow for error checking (e.g. whether the
  process is still running). Check ``man 1 kill`` for more information. TL;DR:
  ``kill -0`` shouldn't be used for asking a process to stop.

* The pid status check relies on the output from a subshell.

* There is no contingency for when the process won't shut down.

* The code exists outside of a function, and thus is more difficult to reuse.


Writing a waitpid function
--------------------------

Whether shutting down a service or simply blocking a process until another
exits (like waiting for a backgrounded download to finish, for instance), the
humble waitpid function can often help out.

The concept of a waitpid function is actually standardized in many places, such
as POSIX C and glibc. That said, let's write our own for bash.


.. code-block:: sh

  #!/usr/bin/env bash
  set -e

  #
  # Waits the requested time for the specified pid to exit. If the pid does not
  # exit in that time, the function return code is 1 (error). If the specified
  # pid does exit within the given threshold, then the return code is 0.
  #
  # Relies on the Linux proc filesystem (/proc/<pid>) to check for the process.
  #
  # @param pid       Pid to wait for exit
  # @param threshold Max amount of time in seconds to wait for the pid to exit
  #
  waitpid() {
    local _pid="${1:-}"
    local _threshold="${2:-}"
    local i

    # Check that arguments were specified (errors go to stderr)
    [ -z "${_pid}" ] && printf "Pid required\n" >&2 && return 1
    [ -z "${_threshold}" ] && printf "Wait threshold required\n" >&2 && return 1

    # Check every second up to the threshold wait time
    for (( i=0; i<${_threshold}; i++ )); do
      [ ! -d "/proc/${_pid}" ] && return 0
      sleep 1
    done

    # Final check, in case the process exited during the last sleep
    [ ! -d "/proc/${_pid}" ] && return 0
    return 1
  }


The benefits of this function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Function Arguments
^^^^^^^^^^^^^^^^^^

This function takes two arguments: the first is the pid number, the second is
the wait threshold. This is particularly useful because the code can be reused
without being rewritten.

Wait Threshold
^^^^^^^^^^^^^^
If you don't want your program waiting forever for a process that's stuck, you
can specify a wait threshold. The function returns an error code if the process
did not exit within the specified time, and 0 if the process did exit within
the wait threshold.

Return Codes
^^^^^^^^^^^^
This function makes use of return codes. As mentioned, the return code tells
you whether the process exited in the specified time or not. This is useful
because we can write a script that checks whether the process exited in the
specified time and, if not, sends a ``kill -9`` to the process. Something like
this...

**How to kill a stubborn process**

.. code-block:: sh

  pid=7932      # Pid to wait for shutdown
  threshold=12  # Wait threshold in seconds

  # Send SIGTERM
  kill -15 "${pid}"

  # Wait for pid exit. If waitpid returned 1, send SIGKILL
  waitpid "${pid}" "${threshold}" || kill -9 "${pid}"


That code excerpt will wait 12 seconds for the process to exit. If it does not
exit within that time, it sends a SIGKILL signal to the process, forcing it to
shut down.


Process State Check does not Rely on a Subshell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the original code excerpt, the check for the process' presence in the
process table was done via a very complicated ``ps`` command daisy chained to
several greps, ``tr``, ``cut``, and ``head``. This is dangerous for myriad
reasons. For example, ``grep tomcat`` matches any process whose command line
contains that string, and ``head -1`` silently picks whichever match is listed
first.

The better way to do it is to read the process ID from a pid file that was
written at startup (if you're not writing pid files at startup, you should be).
This process number is then state checked by looking in the system's proc
filesystem at ``/proc/${pid}``. This is much safer, and doesn't rely on screen
scraping the output of a tool that has several different 'standard' (e.g. GNU,
BSD, POSIX) versions.

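A minimal sketch of the pid file approach, assuming a Linux-style ``/proc``
filesystem. The ``pid_from_file`` and ``pid_is_running`` helpers are
hypothetical names used for illustration, and the pid file path in the usage
comment stands in for whatever your start script writes.

.. code-block:: sh

  #!/usr/bin/env bash

  # Hypothetical helper: print the pid stored in a pid file.
  # Returns 1 if the file is missing, unreadable, or does not hold a number.
  pid_from_file() {
    local _file="${1:-}"
    local _pid=""

    [ -f "${_file}" ] && [ -r "${_file}" ] || return 1

    # read fails on a file with no trailing newline but still fills _pid
    read -r _pid < "${_file}" || [ -n "${_pid}" ] || return 1

    # Reject anything that is not purely digits
    [[ "${_pid}" =~ ^[0-9]+$ ]] || return 1
    printf '%s\n' "${_pid}"
  }

  # Hypothetical helper: succeed if the pid has an entry in /proc (Linux).
  pid_is_running() {
    [ -n "${1:-}" ] && [ -d "/proc/${1}" ]
  }

  # Usage, with a hypothetical pid file path:
  #   pid=$(pid_from_file /var/run/myservice.pid) && pid_is_running "${pid}"

Validating that the file holds only digits keeps a truncated or corrupted pid
file from turning into a check against the wrong path.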

One final thing to note: if you want functionality that simply waits for a
process to exit without limit, read about the ``wait`` POSIX shell builtin. The
downsides are that it only works for child processes of the current shell, and
that it waits without a timeout, so it may never return.

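For reference, a small sketch of ``wait`` in use. The backgrounded ``sleep`` is
only a stand-in for a real job started by the same shell.

.. code-block:: sh

  #!/usr/bin/env bash

  # Start a background job; sleep stands in for a long-running task
  sleep 2 &
  job_pid=$!

  # Block until that child exits; wait returns the child's exit status
  wait "${job_pid}"
  status=$?

  printf 'Job %s exited with status %s\n' "${job_pid}" "${status}"

Because ``wait`` hands back the exit status of the job, it is a good fit when
the script itself launched the process and a timeout is not needed.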

More about Kill Signals
-----------------------

As I mentioned, the kill signal ``0`` sends no signal, but rather is useful for
checking if a process is still running. If you want to request a process exit
like a real friend, use ``kill -15``. If you are interested in what other
signals are available, check out the man page for ``signal``, section ``7``
(``man 7 signal``). It contains a full list of standard signals and what they
do (very useful for doing things like we are in this blog post).

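As a rough illustration of signal ``0`` used the way it is intended, here is a
sketch of a liveness check. The ``is_alive`` helper is a hypothetical name, and
the backgrounded ``sleep`` stands in for a real service.

.. code-block:: sh

  #!/usr/bin/env bash

  # Hypothetical helper: succeed if a signal could be delivered to the pid.
  # Signal 0 runs the existence and permission checks without sending anything.
  is_alive() {
    kill -0 "${1:-}" 2>/dev/null
  }

  sleep 30 &
  pid=$!

  is_alive "${pid}" && printf 'Pid %s is running\n' "${pid}"

  kill -15 "${pid}"            # Ask the process to exit
  wait "${pid}" 2>/dev/null    # Reap the child so it leaves the process table

  is_alive "${pid}" || printf 'Pid %s has exited\n' "${pid}"

Note that ``kill -0`` also fails when you lack permission to signal the
process, so a failure here does not always mean the process is gone.
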
There are really two signals we're interested in for the purposes of this post,
though, that will *most of the time* ensure your process quits (this doesn't
account for zombies). Those signals are ``15`` and ``9``.

**Signal 15** is called SIGTERM. It effectively requests a given process to
exit nicely. This is the default signal sent by the ``kill`` command
(``man 1 kill``).

**Signal 9** is called SIGKILL. Per the signal man page, this signal cannot be
caught, blocked, or ignored, so the process has no choice but to exit.
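
To see the difference between the two signals, here is a small sketch. The
child process below is contrived for the demonstration: it ignores SIGTERM, so
only SIGKILL stops it.

.. code-block:: sh

  #!/usr/bin/env bash

  # Contrived child that ignores SIGTERM and loops forever
  bash -c 'trap "" TERM; while true; do sleep 1; done' &
  pid=$!
  sleep 1                      # Give the child time to install its trap

  kill -15 "${pid}"            # Polite request, which the child ignores
  sleep 1
  kill -0 "${pid}" 2>/dev/null && printf 'Still running after SIGTERM\n'

  kill -9 "${pid}"             # Cannot be caught, blocked, or ignored
  wait "${pid}" 2>/dev/null
  status=$?

  # Bash reports 128 plus the signal number for a signal-terminated child
  printf 'Exit status after SIGKILL: %s\n' "${status}"
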