Feb 05, 2012

Process supervision

daemontools, god, monit, ruby, supervisor

If you have ever had to administer servers, you have certainly run into this problem:
How can I make sure all my applications are running and get restarted when they crash?

There are three different schools I know of for doing this:

My applications never crash

Let's call this one the silly man's approach: you are sure your application will never ever crash, so a basic init script works for you. If something does fail, though, you will probably not know until a client calls your support.

In this category we have the init systems used on many Linux/BSD hosts. They vary in execution but the idea stays the same: you have a script; when called with start as argument it starts the application, and when called with stop it stops it.
This approach works well if you are sure the applications will behave no matter what and cannot crash (or have their own supervision mechanism).

Polling (by PIDs)

The second way is to have a dedicated application check on the processes at regular intervals using their PIDs. For me the problem with this approach is the lag between the moment the process crashes and the moment the external supervisor notices that the application is no longer running.

A few seconds may sound negligible, but they are not, even on a low-traffic system, and especially with persistent connections.

In this category we have (these are the main ones I know of; I am sure there are many more):
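
To make the lag concrete, here is a minimal sketch of PID polling in Ruby. It is not how monit, god or bluepill are implemented; process_alive? and watch_pid are hypothetical helpers written for this post. The point is that a crash can go unnoticed for up to one full polling interval.

```ruby
# Hypothetical helper: true if a process with this pid currently exists.
def process_alive?(pid)
  Process.kill(0, pid)   # signal 0 only checks for existence, nothing is sent
  true
rescue Errno::ESRCH
  false                  # no such process
rescue Errno::EPERM
  true                   # the process exists but belongs to another user
end

# Hypothetical helper: check `pid` every `interval` seconds, at most
# `max_checks` times. Yields the pid and returns true as soon as the
# process is found dead, returns false if it was alive on every check.
def watch_pid(pid, interval, max_checks)
  max_checks.times do
    unless process_alive?(pid)
      yield pid
      return true
    end
    sleep interval       # a crash during this sleep goes unnoticed
  end
  false
end

puts process_alive?(Process.pid)   # prints "true"
```

A real poller would restart the application inside the block given to watch_pid; with a 5 second interval the application can stay down for up to 5 seconds before anyone notices.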

monit
god (ruby application)
bluepill (ruby too)

The first one is a C application and works more or less, but I dropped it the first day I noticed it did not restart a process as I had asked it to.

The other two have an incredibly horrible configuration syntax, so I never really tried either of them, but I am not fond of the way those three do things anyway.

Subprocesses

Daemontools

The last way of supervising processes is to spawn them as children of the supervisor. I have used daemontools until now and it works really well. If one of your applications crashes, daemontools is notified right away and can restart the process without delay.
What I always missed with daemontools is the ability to control it (with something other than the command line) and to be notified of what actually happens, as well as resource usage monitoring (for example if you want to restart a process that has been using 100% of the CPU for too long).

Daemontools has another particularity: it assumes your daemon will not fork into the background and will write its logs to stdout, so they can be piped to a dedicated process that writes and rotates them on disk.
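
The idea can be sketched in a few lines of Ruby. This is not daemontools' code, only an illustration of the principle; supervise is a hypothetical helper, and it stops after a fixed number of runs so the example terminates, where a real supervisor would loop forever.

```ruby
# Hypothetical helper: run `command` (an array of program + arguments) as a
# child process `runs` times in a row, restarting it as soon as it exits.
# Returns the exit statuses that were observed.
def supervise(command, runs)
  statuses = []
  runs.times do
    pid = Process.spawn(*command)
    # wait2 blocks until this exact child exits: the parent learns about
    # the exit immediately, there is no polling interval involved
    _, status = Process.wait2(pid)
    statuses << status.exitstatus
  end
  statuses
end

# Assumes a POSIX `sh` is available on the machine.
p supervise(["sh", "-c", "exit 3"], 2)   # prints [3, 3]
```

Because the supervisor is the parent of the process, the operating system tells it about the exit directly, which is why no delay is involved.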

I really like treating a daemon that way, for many reasons:

since you do not need to go into the background, you can run your application in production the same way as in development, and you don't need to spend time figuring out how to daemonize.
no need for any specialized logger class/object: just write to stdout and you are good to go. The default log handling process of daemontools can add a timestamp in front of each line you write to stdout, so you can even remove that from your code.
no need to write a pid file anywhere. It is not that the task is hard, but it is one less thing to worry about.
the daemontools logger process can guarantee that the space taken by the log files will never exceed what you allow (X files of Y bytes), unlike syslog (newsyslog to be precise) for example, which only checks at regular intervals and cannot prevent an application from flooding your disk (at least not the syslog installed by default on most distributions).
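
The "X files of Y bytes" guarantee from the last point can be modelled with a toy, in-memory Ruby class. CappedLog is a hypothetical name and this is not the daemontools logger; it only shows why checking the size on every write, rather than at regular intervals, keeps the total bounded.

```ruby
# Toy model of a size-capped log: at most `max_files` chunks of at most
# `max_bytes` bytes each, the oldest chunk being dropped first.
class CappedLog
  attr_reader :files

  def initialize(max_files, max_bytes)
    @max_files = max_files
    @max_bytes = max_bytes
    @files = [String.new]
  end

  def write(line)
    # truncate oversized lines so a single write can never break the cap
    line = line.byteslice(0, @max_bytes - 1) + "\n"
    if !@files.last.empty? && @files.last.bytesize + line.bytesize > @max_bytes
      @files << String.new                          # rotate to a new chunk
      @files.shift if @files.size > @max_files      # drop the oldest one
    end
    @files.last << line
  end

  def total_bytes
    @files.inject(0) { |sum, file| sum + file.bytesize }
  end
end

log = CappedLog.new(2, 10)
10.times { |i| log.write("line#{i}") }
p log.files   # prints ["line8\n", "line9\n"]
```

However much the application writes, total_bytes can never exceed max_files * max_bytes, because the limit is enforced before each write.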

Supervisor

In this category I only had daemontools for a long time, but I recently stumbled upon another interesting alternative: supervisor.

It does most if not all of the things described above and adds some nice ones:

an XML-RPC interface allowing full control over supervisor from another process.
you can register notification processes which will be notified of any state change of your applications (start, stop, restart).
you can use this registration mechanism to plug in specialized processes that monitor the resources used by each application and act on them through the XML-RPC interface.
processes have more than one state: if a process crashes on start, it is put aside and supervisor tries to restart it later instead of burning the CPU like daemontools does.

I am still in the testing phase with supervisord, but I have high hopes for it.

If you know other supervision applications with interesting options to offer, I would gladly hear about them.


Sep 25, 2011

On fibers and threads

fiber, ruby, thread

Now that ruby 1.9 is gaining attention and more and more people are saying goodbye to ruby 1.8 to welcome ruby 1.9.2 (don't even try 1.9.1, its latest patch release does not even work correctly...), they are rediscovering the new concurrency tool 1.9 gave us: the Fiber.

So what is it all about and why can it be useful?

Threads

First let's start with some facts about ruby 1.9 threads:

Ruby threads are now mapped 1:1 to native OS threads. It may sound great, but it is not that great because of the evil Global Interpreter Lock, which guarantees that only one thread runs at any given time, so threads never actually run in parallel (in short, the garbage collector was not designed to handle multithreading, so the GIL is what allows it to work).

The problem is that while you do not gain real parallelism, you still have to deal with the pain associated with threads in any language: specifically, your thread can be put to sleep at any time and another thread will start or continue its job.

Here is an example of such behavior:

require 'thread'

MUTEX = Mutex.new

def msg(str)
  MUTEX.synchronize { puts str }
end

th1 = Thread.new do
  100.times {|n| msg "[Thread 1] Tick #{n}" }
end


th2 = Thread.new do
  100.times {|n| msg "[Thread 2] Tick #{n}" }
end


th1.join
th2.join

Just paste this code in a text file and execute it with ruby. What you will see depends on many factors, but you will most likely see interleaved lines starting with Thread 1 and Thread 2, which means that neither thread did its job in one go. Here is the result on my computer (I only kept the context switches):

[Thread 1] Tick 0
[...]
[Thread 1] Tick 67
[Thread 2] Tick 0
[...]
[Thread 2] Tick 99
[Thread 1] Tick 70
[...]
[Thread 1] Tick 99

This example only uses the standard output (and has to use a mutex to keep both threads from writing at the same time), but imagine what it can be like with concurrent database queries...

I did not dig very far into the ruby sources to see how threads are implemented, so this explanation may be slightly off, but my understanding so far is that a ruby thread is given a fixed amount of time to do its work. If it is not finished within that time, the scheduler suspends it and wakes up another thread waiting in the queue. This goes on and on until all threads are done or the program is stopped.

Fibers

Now that we have seen what threads have to offer, let's talk about their counterpart: the Fiber. First I want to say that I have no idea how fibers are implemented, but that does not change the validity of what follows.

You can think of fibers as threads without automatic scheduling: when a fiber is running, nothing else is. This is a core concept and a really important one, but we will talk more about it later.
Let's see an example first:

def msg(str)
  puts str
end

fb1 = Fiber.new do
  100.times {|n| msg "[Fiber 1] Tick #{n}" }
end

fb2 = Fiber.new do
  100.times {|n| msg "[Fiber 2] Tick #{n}" }
end

fb1.resume
fb2.resume

All I did was convert the previous thread example to fibers. The result is now really different and, most importantly, does not depend on external factors the way threads do. When executing this example, here is what you will get on any machine:

[Fiber 1] Tick 0
[...]
[Fiber 1] Tick 99
[Fiber 2] Tick 0
[...]
[Fiber 2] Tick 99

When we called resume on the first fiber, it started its job and completed it. Then the next line was executed: the second fiber was started and finished its job before the program completed. No magic, no randomness, just what you wrote!

Now there are things you should know about fibers, because the power you have over them comes at a price:
You need to be sure that none of your fibers will block the whole application. If you are writing a server doing heavy computation in ruby on each request, you can forget about fibers since only one client will be served at a time. If you are working on a server that mostly does I/O, typically the network and one or more databases, then you are good to go (with proper 1.9 drivers).

There are also some calls to avoid in the standard library. The first that comes to mind is "sleep";
here is an example of the problem with it:
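
Since nothing is scheduled automatically, a fiber only gives up control when it says so. Here is a small sketch of that, using Fiber.yield to hand control back to the caller after each tick; interleave is a hypothetical helper written for this post, and the lines are collected in an array instead of being printed directly so the order is easy to inspect.

```ruby
# Hypothetical helper: run two fibers that each yield after every tick and
# return the lines in the order they were produced.
def interleave(ticks)
  log = []

  fb1 = Fiber.new do
    ticks.times { |n| log << "[Fiber 1] Tick #{n}"; Fiber.yield }
  end

  fb2 = Fiber.new do
    ticks.times { |n| log << "[Fiber 2] Tick #{n}"; Fiber.yield }
  end

  # we are the scheduler: each resume runs a fiber up to its next Fiber.yield
  ticks.times do
    fb1.resume
    fb2.resume
  end

  log
end

puts interleave(2)
# prints:
# [Fiber 1] Tick 0
# [Fiber 2] Tick 0
# [Fiber 1] Tick 1
# [Fiber 2] Tick 1
```

The interleaving is exactly the one written in the code, on every run and on every machine.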

require 'fiber'

fb1 = Fiber.new do
  puts "[Fiber 1] Started"
  sleep 2
  puts "[Fiber 1] Completed"
end

fb2 = Fiber.new do
  10.times {|n| puts "[Fiber 2] tick #{n}" }
end

fb1.resume
fb2.resume

This program will sleep for 2 seconds, blocking everything, and only then print the 10 messages. We will see in the next section how EventMachine can help us with this specific case. Here is the output for this one:

[Fiber 1] Started
[Fiber 1] Completed
[Fiber 2] tick 0
[...]
[Fiber 2] tick 9

EventMachine

Short introduction

If you already know what EventMachine is you can jump to the next section.

When you write a standard ruby application, your program runs from start to end in a linear way once executed. This is where EventMachine changes things slightly: when you execute an EventMachine based ruby application you are no longer in charge of the "main loop", which now lives in native code. Instead you register for events and act on those events.
A simple example would be a console program: while it waits for user input the application is doing nothing, and that is where the EventMachine main loop would be. When the user types something, a callback in your application is called with the text, and once you have handled it the EventMachine main loop takes back control.
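
To make the idea of "not being in charge of the main loop" more concrete, here is a toy event loop in pure Ruby that only knows about timers. ToyLoop is a hypothetical class written for this post; it is in no way EventMachine's implementation or API, just the same principle at a very small scale: you register callbacks, then the loop decides when to call them.

```ruby
class ToyLoop
  def initialize
    @timers  = []
    @running = false
  end

  # Register a block to be called `delay` seconds from now.
  def add_timer(delay, &block)
    @timers << [Time.now + delay, block]
  end

  # Ask the loop to exit once the current callback returns.
  def stop
    @running = false
  end

  # Run the setup block, then dispatch timers in due order until stop is
  # called or no timer is left.
  def run
    @running = true
    yield self if block_given?
    while @running && !@timers.empty?
      @timers.sort_by! { |due_at, _| due_at }
      due_at, callback = @timers.shift
      wait = due_at - Time.now
      sleep(wait) if wait > 0     # the loop is idle here, waiting for an event
      callback.call               # your code runs, then the loop takes over again
    end
  end
end

ToyLoop.new.run do |reactor|
  reactor.add_timer(0.2) { puts "second" }
  reactor.add_timer(0.1) { puts "first" }
  puts "registered"
end
# prints "registered", then "first", then "second"
```

Note that "registered" is printed before either callback runs, even though it is the last line of the block.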

How can it help us

The power of fibers is best put to use in an asynchronous environment. In ruby we have the great EventMachine, but working in asynchronous mode is a real pain. Here is an example from the em-http-request gem:

require 'eventmachine'
require 'em-http-request'

EM.run do
  http = EM::HttpRequest.new('http://google.com/').get(
      :query => {'keyname' => 'value'}
    )

  http.errback { p 'Uh oh'; EM.stop }
  http.callback do
    p http.response_header.status
    p http.response_header
    p http.response

    EM.stop
  end

  puts "Done."
end

The callback and errback blocks you provide will be executed when a result or an error arrives, respectively. In the meantime execution continues, so "Done." will be displayed before either of the two blocks runs.
We can use threads or fibers to turn any asynchronous code path into the synchronous code path we are used to working with. Here is an example:

require 'fiber'
require 'eventmachine'
require 'em-http-request'

EM.run do
  Fiber.new do
    fb = Fiber.current
    http = EM::HttpRequest.new('http://google.com/').get(
        :query => {'keyname' => 'value'}
      )

    http.errback  { fb.resume }
    http.callback { fb.resume }

    # suspend the fiber
    Fiber.yield

    if http.error
      puts 'An error occured, damn !'
    else
      p http.response_header.status
      p http.response_header
      p http.response
    end

    EM::stop()
  end.resume
end

See what we did? The program execution is now linear inside the fiber. This code may run a little slower than the asynchronous version (I never ran any serious benchmarks), but the program is so much easier to write this way that you will see the gains right away!

Note that the same can be done with threads:

require 'eventmachine'
require 'em-http-request'

EM.run do
  Thread.new do
    th = Thread.current
    http = EM::HttpRequest.new('http://google.com/').get(
        :query => {'keyname' => 'value'}
      )

    http.errback  { th.wakeup }
    http.callback { th.wakeup }

    # suspend the thread until explicit
    # wakeup
    sleep

    if http.error
      puts 'An error occured, damn !'
    else
      p http.response_header.status
      p http.response_header
      p http.response
    end

    EM::stop()
  end
end

Fibers have another limitation though: they only have access to a limited stack size (4 KB in 1.9.2). You may use fibers without ever noticing, but there is a catch: Rails 3.1, which recently hit a stable release, can reach the limits of a fiber stack (see my Rails 3.1 test application), resulting in a stack overflow error. The sad thing is that there is nothing you can really do about it unless you run your own server with a modified ruby interpreter that extends the stack.

A last example to show you how to sleep in a fibered EventMachine application:

require 'fiber'
require 'eventmachine'

def em_sleep(n)
  fb = Fiber.current
  EM::add_timer(n){ fb.resume }
  Fiber.yield
end

EM::run do
  fb1 = Fiber.new do
    puts "[Fiber 1] Started"
    em_sleep 2
    puts "[Fiber 1] Completed"
    EM::stop()
  end

  fb2 = Fiber.new do
    10.times {|n| puts "[Fiber 2] tick #{n}" }
  end

  fb1.resume
  fb2.resume
end

And here is the output:

[Fiber 1] Started
[Fiber 2] tick 0
[...]
[Fiber 2] tick 9
[Fiber 1] Completed

One last thing to know about fibers: as I already said, only one can be running at a time, but this limitation is per thread, so you could have more fibers running by using several threads. I see little use for this though, since you get back to the same problems you would have with threads.

My experience in this field

I designed and implemented the core and network library of my current company's ruby servers. These servers form a telecommunication platform and talk to a heavy client on users' computers as well as to an Asterisk server. Only one of the 4 or 5 application servers we now have is a Ruby on Rails application; the others are headless EventMachine servers communicating with each other.

When development started, ruby 1.9 was not yet released, so I started the work on 1.8 with threads and later switched to 1.9 + fibers to get away from the thread concurrency hell. Our productivity rose by a great factor and so did the overall performance (faster queries), but ruby 1.9 is by itself faster than 1.8, so the two implementations cannot be fairly compared.

Interesting links