Compute Rack - Docker maintenance

This article describes common maintenance operations on Nex!™ compute racks. Compute racks use Docker to run containers, and the Docker daemon occasionally needs maintenance or troubleshooting. The steps below cover the most common cases.

1 - Upgrade Docker version

First, look up the Compute Rack causing issues and SSH into it. For reference, note the list of currently running containers with the command:

docker ps


Docker evolves continuously, so upgrading the Docker service may fix some issues.

Before upgrading the Docker daemon, make sure there are no breaking changes in the Docker release, and verify that everything works in a test environment first.
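
One quick way to gauge the risk of an upgrade is to compare the installed version with the candidate version and treat a change in the leading component as a signal to read the release notes closely. The sketch below is only an illustration: the helper name is_major_change is hypothetical, the target version in the usage comment is a placeholder, and treating the first dotted component as the marker of a potentially breaking release is an assumption rather than a Docker guarantee.

```shell
# Hypothetical helper: succeeds (exit 0) when the first dotted component
# of the two version strings differs, e.g. 1.12.6 -> 17.03.0.
is_major_change() {
  old_major=${1%%.*}
  new_major=${2%%.*}
  [ "$old_major" != "$new_major" ]
}

# Example usage on a rack (17.03.0 is a placeholder target version):
#   current=$(sudo docker version --format '{{.Server.Version}}')
#   is_major_change "$current" "17.03.0" && echo "Major change: review the release notes first"
```

A minor or patch bump can still break things, so this check complements the test-environment run rather than replacing it.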


Log in to a Compute Rack:

nex-cli racks:ssh < rack ip address | e.g. 10.0.0.1 >


Run the following commands to upgrade Docker:

On Ubuntu:

# Update APT repository information
sudo apt-get -y update

# Upgrade docker package
sudo apt-get install --only-upgrade -y docker-engine

On RHEL / CentOS:

# Upgrade docker package
sudo yum update -y docker-engine


Then restart the Docker daemon with:

sudo service docker restart


At this point, all containers are stopped and the applications become unavailable for a few minutes. Monit should restart all containers automatically. You can track container restarts by running:

# Verify list of running containers
sudo docker ps

# Verify list of failed containers (-a includes stopped containers)
sudo docker ps -a | grep Exited
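
To confirm that every container came back after the daemon restart, the list of containers noted before the upgrade can be compared with the list afterwards. The following is a minimal sketch; the function names and the snapshot file paths are illustrative assumptions and are not part of the Nex!™ tooling.

```shell
# Hypothetical helper: save the names of running containers to the file
# given as $1, one per line, sorted.
snapshot_containers() {
  sudo docker ps --format '{{.Names}}' | sort > "$1"
}

# Hypothetical helper: print names listed in the "before" file ($1) that
# are absent from the "after" file ($2). Both files must be sorted.
missing_containers() {
  comm -23 "$1" "$2"
}

# Example usage (paths are placeholders):
#   snapshot_containers /tmp/containers.before   # before the upgrade
#   ... upgrade Docker, restart the daemon, wait for Monit ...
#   snapshot_containers /tmp/containers.after
#   missing_containers /tmp/containers.before /tmp/containers.after
```

An empty result from missing_containers means every container that was running before the restart is running again.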

2 - Restart all applications on a Compute Rack from the Nex!™ console

First, log in to one of the Nex!™ machines and open the Rails console:

cd /app/nex/current
bundle exec rails c < environment | e.g. 'production' >


Run the following script:

# Restart cubes
# Select Compute Rack by instance id (e.g. AWS instance id)
rack = ComputeRack.find_by(machine_id: 'i-a8daec64')

# OR select Compute Rack by IP address
rack = ComputeRack.find_by(private_ip_address: '10.0.0.15')

# Issue a command to restart all the CubeInstances attached to it
rack.cube_instances.each do |cube|
  cube.restart_on_grid
end


Verify the containers are running on the Docker server:

# Use the nex-cli on your local machine to login to the server
nex-cli racks:ssh 10.0.0.15

# Verify list of running containers
sudo docker ps

# Verify list of failed containers (-a includes stopped containers)
sudo docker ps -a | grep Exited

3 - AUFS error on creating new containers

The Docker daemon on a compute rack may become unstable and fail to start some containers, returning device mounting errors such as:

error creating aufs mount to /var/lib/docker/aufs/mnt/54f1c9be7a4455861a3412e2f93c84eaf82c0e50eff1565f34e0f06f7a01eea1-init: invalid argument

Most of the time, restarting the Docker daemon fixes the issue. However, doing so stops all running containers, which then have to be restarted.

sudo service docker restart
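
Before restarting the daemon, it can help to check how often the AUFS error actually occurs in the Docker logs. The sketch below counts matching lines read from standard input. The helper name is hypothetical, and the log sources shown in the usage comments are assumptions that depend on the rack's init system and logging setup.

```shell
# Hypothetical helper: read log lines on stdin and print the number of
# lines that report an AUFS mount failure (prints 0 when there are none).
count_aufs_mount_errors() {
  grep -c 'error creating aufs mount' || true
}

# Example usage (log locations are assumptions, adjust to the rack):
#   sudo journalctl -u docker | count_aufs_mount_errors        # systemd hosts
#   count_aufs_mount_errors < /var/log/upstart/docker.log      # upstart hosts
```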

4 - Investigating container CPU bursts

When NewRelic triggers an alert on a Compute Rack CPU burst, the first step is to identify the process causing it.

Running top lists the most CPU-intensive processes:

$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                    
13669 www-data  20   0  690404 167904   5556 R  24.9  1.1   3:21.81 apache2                                                                                                                                    
  938 root      20   0 6336560  91664   6568 S   7.2  0.6 636:35.68 docker                                                                                                                                     
13573 www-data  20   0  703196  70080   5468 R  50.7  0.5   6:01.42 apache2                                                                                                                                    
13880 www-data  20   0  647140  67988   5560 R  24.9  0.4   4:00.75 apache2                                                                                                                                    
19546 message+  20   0 1420272  66292   3676 S   0.0  0.4   4:48.02 mysqld 
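
When an interactive top session is inconvenient (for example when collecting data over SSH in a script), a non-interactive listing can give the same information. The following is a small sketch; the helper name busiest is hypothetical, and the ps invocation in the usage comment assumes the procps version of ps found on most Linux distributions.

```shell
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print the
# N busiest processes (default 3), highest CPU usage first.
busiest() {
  sort -k2,2 -nr | head -n "${1:-3}"
}

# Example usage (the trailing "=" suppresses the column headers):
#   ps -eo pid=,pcpu=,comm= | busiest 5
```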

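Then walk up the process tree from the offending process (for example PID 13669) to its top-level ancestor, which should be the main process of a Docker container (PID 12823 in the output below):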

$ ps -axfo pid,uname,cmd
...
12823 root      \_ /bin/bash /root/init.sh /root/init.sh
 1555 root      |   \_ /bin/sh /usr/bin/mysqld_safe
29084 message+  |   |   \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock 
29085 root      |   |   \_ logger -t mysqld -p daemon.error
10684 root      |   \_ /usr/sbin/apache2 -k start
32604 www-data  |   |   \_ /usr/sbin/apache2 -k start
13423 www-data  |   |   \_ /usr/sbin/apache2 -k start
13437 www-data  |   |   \_ /usr/sbin/apache2 -k start
13669 www-data  |   |   \_ /usr/sbin/apache2 -k start
11462 root      |   \_ cron
16713 root      |   |   \_ CRON
16775 root      |   |       \_ /bin/sh -c   [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifeti
17754 root      |   |           \_ /bin/sh /usr/lib/php5/sessionclean /var/lib/php5 24
17755 root      |   |               \_ /usr/bin/lsof -w -l +d /var/lib/php5
17966 root      |   |               |   \_ /usr/bin/lsof -w -l +d /var/lib/php5
17756 root      |   |               \_ awk -- { if (NR > 1) { print $9; } }
17757 root      |   |               \_ xargs -i touch -c {}
...

Finally, find the container whose main process has that PID:

$ docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.ID}}, {{.Name}}' | grep "12823"
12823, d3201560c34aa170658ecede4c1f5a0ea5c6822352616a652cfbae133a084a78, /mcube-qcb
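
The manual lookup above can be scripted: collect the offending PID and all of its ancestors, then print the container whose main PID is one of them. This is a minimal sketch, assuming a procps ps and the same docker inspect output format as above; the function names pid_ancestors and match_container are hypothetical.

```shell
# Print the given PID followed by each of its ancestors, one per line,
# stopping before PID 1.
pid_ancestors() {
  pid=$1
  while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    echo "$pid"
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  done
}

# Read "pid, id, name" lines on stdin and print the first line whose PID
# appears in the space-separated list passed as $1.
match_container() {
  awk -F', ' -v pids="$1" '
    BEGIN { n = split(pids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    ($1 in want) { print; exit }
  '
}

# Example usage for PID 13669:
#   pids=$(pid_ancestors 13669 | tr '\n' ' ')
#   sudo docker ps -q \
#     | xargs sudo docker inspect --format '{{.State.Pid}}, {{.ID}}, {{.Name}}' \
#     | match_container "$pids"
```

Matching on the whole PID field avoids the false positives that a plain grep can produce when the PID also appears inside a container ID.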