Supervisors in Erlang

Erlang (or, to be specific, OTP) provides built-in support for monitoring and restarting processes that crash or terminate unexpectedly through the supervisor behaviour. This is a pretty handy feature when building concurrent, distributed systems with strict uptime SLAs. Let's take a look at a simple demonstration of this functionality. First up, a simple program that fails in a deterministic fashion:

			
 -module(my_supervisor_example).
 -export([loop/1]).

 %% Count down once per second, then terminate abnormally on purpose.
 loop(Count) ->
     io:fwrite("Counting down...currently at ~p.~n", [Count]),
     timer:sleep(1000),
     case Count of
         0 -> exit("Boom!");
         _ -> loop(Count - 1)
     end.
	
If we compile and run this example, we get a little countdown and then the program exits with an error:
	
 1> c(my_supervisor_example).
 {ok,my_supervisor_example}
 2> my_supervisor_example:loop(5).  
 Counting down...currently at 5.
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 ** exception exit: "Boom!"

Since we are concerned with the lifecycle of a given process, we will want to run this loop as a separate process using spawn_link/3. This function takes a module name, a function name, and a list of arguments; it spawns a new process running that function, links it to the calling process, and returns the Pid of the new process. The link is what allows the supervisor to receive an exit signal when the process dies, and the Pid is what we will pass back to the supervisor so it knows which process to monitor. So, rather than invoking our exploding loop directly, let's create a wrapper function that starts the loop using spawn_link:

 -module(my_supervisor_example).
 -export([start/1]).
 -export([loop/1]).
 .
 .
 .
 %% Spawn the loop as a linked process and hand back {ok, Pid},
 %% the shape the supervisor expects from a child's start function.
 start(Count) ->
     io:fwrite("Starting...~n"),
     Pid = spawn_link(my_supervisor_example, loop, [Count]),
     {ok, Pid}.
	
Note that the return value of this function has the form {ok, Pid}. This is the shape the supervisor expects a child's start function to return. If it gets a different shape back, the supervisor will treat the child as having failed to start and exit with an error, so be careful.
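For reference, the OTP documentation lists the return values a supervisor will accept from a child's start function; anything else is treated as a failed start:

 %% Valid returns from a child's start function:
 %%   {ok, Pid}        - child started successfully
 %%   {ok, Pid, Info}  - child started; Info is passed through to the caller
 %%   ignore           - no process started; the supervisor carries on
 %%   {error, Reason}  - the child could not be started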

Supervisor is a behaviour in Erlang. Implementing a behaviour is similar to inheriting from an abstract class in Java: you must implement certain callback functions to compile without warnings and run without errors. These callbacks are how the underlying behaviour interacts with your program. In the case of supervisor, we must implement the function init/1, which returns a tuple containing the information the supervisor needs to determine the monitoring and restart policy for its child processes. Here is what we will add to our program:

 -module(my_supervisor_example).
 -behaviour(supervisor).
 -export([init/1]).
 -export([start/1]).
 -export([loop/1]).
 .
 .
 .
 %% Supervisor callback. Returns the restart policy
 %% ({Strategy, MaxRestarts, WithinSeconds}) and a list of child specs of
 %% the form {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}.
 init([Count]) ->
     {ok, {{one_for_one, 1, 60},
           [{my_supervisor_example, {my_supervisor_example, start, [Count]},
             permanent, brutal_kill, worker, [my_supervisor_example]}]}}.

Now we can start our program using the supervisor framework:

	
 1> c(my_supervisor_example).
 {ok,my_supervisor_example}
 2> supervisor:start_link(my_supervisor_example, [5]).
 Starting...
 Counting down...currently at 5.
 {ok,<0.38.0>}
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 Starting...
 Counting down...currently at 5.
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 ** exception exit: shutdown
 3>

Huh?!? It looks like the loop was successfully restarted once, but then it errored out at the end of its second run! How come the supervisor didn't restart it again? The answer lies in the restart policy we returned from init/1, specifically the tuple {one_for_one, 1, 60}. one_for_one is a strategy that says, in effect: "when a child exits, restart just that child" (as opposed to one_for_all, which restarts every child when any one of them dies). The 1 and the 60 set the restart intensity: allow at most 1 restart within any 60-second window. If that threshold is exceeded, the supervisor gives up, terminates its remaining children, and shuts itself down, which is the shutdown exception we saw above. In our example, the second crash pushed us over that threshold. To keep restarting indefinitely, choose an interval appropriate for your purposes. For example, we can set those parameters to {one_for_one, 30, 60}:
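Here is a sketch of the revised init/1; the child spec is unchanged, only the policy tuple differs:

 %% Allow up to 30 restarts in any 60-second window. Each countdown run
 %% takes about six seconds, so roughly ten restarts per minute stays
 %% comfortably under the limit.
 init([Count]) ->
     {ok, {{one_for_one, 30, 60},
           [{my_supervisor_example, {my_supervisor_example, start, [Count]},
             permanent, brutal_kill, worker, [my_supervisor_example]}]}}.

If we recompile and run: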

	
 1> c(my_supervisor_example).                         
 {ok,my_supervisor_example}
 2> supervisor:start_link(my_supervisor_example, [5]).
 Starting...
 Counting down...currently at 5.
 {ok, <0.53.0>}
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 Starting...
 Counting down...currently at 5.
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 Starting...
 Counting down...currently at 5.
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 Starting...
 Counting down...currently at 5.
 Counting down...currently at 4.
 Counting down...currently at 3.
 Counting down...currently at 2.
 Counting down...currently at 1.
 Counting down...currently at 0.
 Starting...
 .
 .
 .

And there you go: a repeatedly failing process, repeatedly restarted. While this example is trivial, you can probably see its usefulness in many real-world scenarios.
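The tuple format shown above is the classic one; on OTP 18 and later, init/1 can also return supervisor flags and child specs as maps, which are a bit more self-documenting. The equivalent configuration in map form would look something like this:

 %% Map-based equivalent of the tuple configuration above (OTP 18+).
 init([Count]) ->
     SupFlags = #{strategy => one_for_one,
                  intensity => 30,  % max restarts...
                  period => 60},    % ...per 60-second window
     ChildSpec = #{id => my_supervisor_example,
                   start => {my_supervisor_example, start, [Count]},
                   restart => permanent,
                   shutdown => brutal_kill,
                   type => worker,
                   modules => [my_supervisor_example]},
     {ok, {SupFlags, [ChildSpec]}}.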