Sunday, January 11, 2009

Changed to Apache

As some of you may have noticed, during the last week of 2008 and the start of 2009 we had some major stability issues. The main site was unavailable for several hours on a daily basis.

After our major hardware upgrade last year, the main server is running VMware ESXi to host multiple virtual servers. During this time we have experienced several hangs on the web VM (100% CPU). The problem was so severe that we were unable to shut down the specific VM; the entire VM server had to be restarted.
These events happened randomly. There were no system logs for the crash, and because the console was not responding we were unable to request kernel dumps.

There are multiple VMs on this server and none of the others presented this problem, so it was something specific to our configuration. After several days of testing changes and moving services to a different VM, we found that the problem happened with the lighttpd/fastcgi services.

As a workaround we have changed the HTTP server software to Apache. Since this change the server has been running stably.

