Bash:Waitpid Function
=====================

This last week I ran into a bit of shell scripting that caused me some grief
for a few days. The person who wrote it had named the script *stopService.sh*.
Now, one would assume that if a script as simple as a service stop script has
been working for the last two years and nothing in it has changed, it would
continue working. It turns out that with various other environmental changes,
a bug introduced two years ago was suddenly showing up.

That said, I took apart this stop script and noticed at the bottom that the
original author had written a check for a service to stop that looked like...

.. code-block:: sh

  # Get the pid of tomcat
  pid=$(ps -ef | grep tomcat | grep -v grep | tr -s ' ' | cut -d ' ' -f 2 | head -1)

  # Send kill command to the process
  kill -0 ${pid}

  while [ true ]; do
    sleep 15
    ps -f ${pid} 2>/dev/null 1>/dev/null

    if [[ $? -gt 0 ]]; then
      break
    fi

    break
  done


If you read much bash or code at all, you'll notice that while loop there and
think "Oh, how clever - looping until the process exits". Then you'll arrive at
the break statement at the end and think "Wait, why put all this code in a loop
and always break on the first iteration?".

When I saw that block of code at the end of a 243 line script designed to stop
a single process type, I again realized that, regardless of title (senior
principal architect in this case), not everyone who scripts knows how to do
process management or understands basic programming logic.

I don't claim to be an expert at all. I do however have many years of
experience with this, so perhaps I can contribute something new to some
people's knowledge. If you disagree with how I go about solving this problem,
please feel free to send me an email. I'd be happy to learn something new if
you've got a better way to do it! With that, let's get started.


What's wrong with that excerpt?
-------------------------------

First, there were several problems with that code excerpt that, with a little
knowledge, could be solved easily and in a portable, reproducible, and
maintainable manner. The problems are more than just technical as well. Here
are the problems that I see.

* It doesn't ensure the process stops before proceeding. The loop always
  breaks on its first iteration, so the script sleeps 15 seconds once and
  moves on whether or not the process has exited.

* Kill signal 0 does nothing except allow for error checking (eg: whether the
  process is still running). Check ``man 1 kill`` for more information. TLDR:
  ``kill -0`` shouldn't be used for asking a process to stop.

* The pid lookup relies on scraping the output of a subshell pipeline

* There is no contingency for when the process won't shut down

* The code exists outside of a function, and thus is more difficult to reuse


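To make the second point concrete, here is a minimal sketch of what ``kill -0``
is actually good for: asking whether a pid exists without sending it anything.
The function name ``is_running`` is made up for illustration, and the
``sleep 30`` is just a throwaway stand-in for a real service. One caveat:
``kill -0`` also fails when you lack permission to signal the process, so a
failure really means "not running, or not mine".

.. code-block:: sh

  #!/usr/bin/env bash

  # Hypothetical helper: returns 0 if the pid exists and the current user may
  # signal it, non-zero otherwise. Signal 0 itself is never delivered.
  is_running() {
    kill -0 "${1:-}" 2>/dev/null
  }

  sleep 30 &   # Throwaway stand-in for a real service
  is_running "$!" && printf "Pid %s is still running\n" "$!"
  kill -15 "$!"
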
Writing a waitpid function
--------------------------

Whether shutting down a service or simply blocking a script until another
process exits (like waiting for a backgrounded download to finish for
instance), the humble waitpid function can often help out.

The concept of a waitpid function is actually standardized in many places, such
as POSIX C and glibc. That said, let's write our own for bash.


.. code-block:: sh

  #!/usr/bin/env bash
  set -e

  #
  # Waits the requested time for the specified pid to exit. If the pid does not
  # exit in that time, the function return code is 1 (error). If the specified
  # pid does exit within the given threshold, then the return code is 0.
  #
  # @param pid       Pid to wait for exit
  # @param threshold Max amount of time in seconds to wait for the pid to exit
  #
  waitpid() {
    local _pid="${1:-}"
    local _threshold="${2:-}"
    local _i

    # Check that arguments were specified (errors go to stderr)
    [ -z "${_pid}" ] && printf "Pid required\n" >&2 && return 1
    [ -z "${_threshold}" ] && printf "Wait threshold required\n" >&2 && return 1

    # Check every second up to the threshold wait time
    for (( _i=0; _i<${_threshold}; _i++ )); do
      [ ! -d "/proc/${_pid}" ] && return 0
      sleep 1
    done

    # One last check, in case the pid exited during the final sleep
    [ ! -d "/proc/${_pid}" ] && return 0
    return 1
  }


The benefits of this function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Function Arguments
^^^^^^^^^^^^^^^^^^

This function takes two arguments: the first is the pid number, the second is
the wait threshold. This is particularly useful because the code can be re-used
without being rewritten.

Wait Threshold
^^^^^^^^^^^^^^

If you don't want your program waiting forever for a process that's stuck, you
can specify a wait threshold. The function returns an error code if the process
did not exit within that time, and returns 0 as soon as it sees that the
process has exited.

Return Codes
^^^^^^^^^^^^

This function makes use of return codes. As mentioned, the return code tells
you whether the process exited in the specified time or not. That lets us write
a script that checks whether the process exited in the specified time and, if
not, sends a ``kill -9`` to the process. Something like this...

**How to kill a stubborn process**

.. code-block:: sh

  pid=7932      # Pid to wait for shutdown
  threshold=12  # Wait threshold in seconds

  # Send SIGTERM
  kill -15 "${pid}"

  # Wait for pid exit. If waitpid returned 1, send SIGKILL
  waitpid "${pid}" "${threshold}" || kill -9 "${pid}"


That code excerpt will wait up to 12 seconds for the process to exit. If it
does not exit within that time, it sends a SIGKILL signal to the process,
forcing it to shut down.


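The same two steps can be wrapped into a single reusable function. What follows
is only a sketch under a few assumptions: the name ``stop_process`` and its
return codes are invented for illustration, it polls with ``kill -0`` rather
than calling ``waitpid`` so that the block stands on its own, and it assumes
you own the process (otherwise ``kill`` fails for permission reasons and the
function wrongly concludes the pid is gone).

.. code-block:: sh

  #!/usr/bin/env bash

  #
  # Hypothetical wrapper: asks a pid to exit with SIGTERM, waits up to the
  # threshold, then falls back to SIGKILL. Returns 0 if the pid exited on its
  # own, 1 if it had to be killed, and 2 on bad arguments.
  #
  # @param pid       Pid to stop
  # @param threshold Max amount of time in seconds to wait before SIGKILL
  #
  stop_process() {
    local _pid="${1:-}"
    local _threshold="${2:-}"
    local _i

    [ -z "${_pid}" ] && printf "Pid required\n" >&2 && return 2
    [ -z "${_threshold}" ] && printf "Wait threshold required\n" >&2 && return 2

    # Ask nicely first. If the signal can't be sent, assume the pid is gone.
    kill -15 "${_pid}" 2>/dev/null || return 0

    # Poll once a second. kill -0 only tests whether the pid can be signalled.
    for (( _i=0; _i<${_threshold}; _i++ )); do
      kill -0 "${_pid}" 2>/dev/null || return 0
      sleep 1
    done
    kill -0 "${_pid}" 2>/dev/null || return 0

    # Out of patience. Ignore a failure here; the pid may have just exited.
    kill -9 "${_pid}" 2>/dev/null || true
    return 1
  }

Calling ``stop_process 7932 12`` would then behave like the excerpt above, with
the return code telling you whether the SIGKILL was needed.
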
Process State Check does not Rely on a Subshell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the original code excerpt, the pid was found via a very complicated ``ps``
command daisy chained to several greps, tr, cut, and head, and the check for
the process' presence in the process table was yet another ``ps`` call. This is
dangerous for myriad reasons. For one, ``grep tomcat`` matches any process
whose ``ps`` line happens to contain "tomcat", and ``head -1`` just takes
whichever match comes first.

The better way to do it is to read the pid from a pid file that was written at
startup (if you're not writing pid files at startup, you should be). This
process number is then state checked by looking in the system's proc filesystem
at ``/proc/${pid}``. This is much safer, and doesn't rely on screenscraping the
output of a tool that has several different 'standard' (eg: gnu, bsd, posix,
etc) versions. Do note that ``/proc/${pid}`` is a Linux convention, so this
check assumes a Linux system.


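Reading the pid file is worth doing carefully too, since a missing or mangled
file shouldn't end up as an argument to ``kill``. Here is a rough sketch of how
that could look. The function name ``read_pidfile`` is hypothetical, and it
assumes the pid file holds nothing but the pid on its first line.

.. code-block:: sh

  #!/usr/bin/env bash

  #
  # Hypothetical helper: prints the pid stored in a pid file. Returns 1 if the
  # file is unreadable or its first line is not a plain number.
  #
  # @param pidfile Path to the pid file
  #
  read_pidfile() {
    local _pidfile="${1:-}"
    local _pid=""

    [ -r "${_pidfile}" ] || return 1

    # Read the first line only. read fails on a missing trailing newline but
    # still fills the variable, so the result is validated below instead.
    read -r _pid < "${_pidfile}" || true

    # Reject anything that isn't purely digits
    case "${_pid}" in
      ''|*[!0-9]*) return 1 ;;
    esac

    printf '%s\n' "${_pid}"
  }

With that in place, something like
``pid=$(read_pidfile /var/run/tomcat.pid) && waitpid "${pid}" 30`` ties the two
together. The path there is only an example, not a location any particular
tomcat package is known to use.
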
One final thing to note: if you want functionality that simply waits for a
process to exit without limit, read about the ``wait`` POSIX shell builtin. The
downsides to this are that it only works on child processes of the current
shell, and that it waits without a timeout, so it may never return.


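For comparison, here is a small sketch of the ``wait`` builtin used on a
backgrounded child. The subshell is only a stand-in for a real job such as a
download, and the exit code of 3 is arbitrary.

.. code-block:: sh

  #!/usr/bin/env bash

  # Stand-in for a long-running job; exits with an arbitrary code of 3
  ( sleep 1; exit 3 ) &
  job_pid=$!

  # Blocks until the child exits. wait returns the child's exit code, so it
  # is captured with || to stay safe under set -e.
  job_status=0
  wait "${job_pid}" || job_status=$?

  printf "Job %s exited with code %s\n" "${job_pid}" "${job_status}"
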
More about Kill Signals
-----------------------

As I mentioned, the kill signal ``0`` sends no signal, but rather is useful for
checking if a process is still running. If you want to request a process exit
like a real friend, use ``kill -15``. If you are interested in what other
signals are available, check out the man page for ``signal``, section ``7``
(``man 7 signal``). It contains a full list of standard signals and what they
do (very useful for doing things like we are in this blog post).

There are really two signals we're interested in for the purposes of this post
though, which will *most of the time* ensure your process quits (they don't
account for zombies). Those signals are ``15`` and ``9``.

**Signal 15** is called SIGTERM. It effectively requests a given process to
exit nicely. This is the default signal sent by the kill (``man 1 kill``)
command.

**Signal 9** is called SIGKILL. Per the signal man page, this signal cannot be
caught, blocked, or ignored, so the process has no choice but to exit.
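
The difference between the two is easy to see with a child that traps SIGTERM.
This is a rough sketch: the ``bash -c`` one-liner is a throwaway stand-in for a
real service, ``start_child`` is a made-up helper name, and the exit code of 42
is arbitrary.

.. code-block:: sh

  #!/usr/bin/env bash

  # Hypothetical helper: starts a child that catches SIGTERM and exits with 42
  start_child() {
    bash -c 'trap "exit 42" TERM; while :; do sleep 1; done' &
    child_pid=$!
    sleep 1   # Give the child a moment to install its trap
  }

  # SIGTERM is a request: the child's trap runs and it picks its own exit code
  start_child
  kill -15 "${child_pid}"
  term_status=0
  wait "${child_pid}" || term_status=$?

  # SIGKILL can't be trapped: the shell reports 128 + 9 = 137
  start_child
  kill -9 "${child_pid}"
  kill_status=0
  wait "${child_pid}" || kill_status=$?

  printf "SIGTERM: %s, SIGKILL: %s\n" "${term_status}" "${kill_status}"
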
