Compute Rack - Docker maintenance
This article describes how to perform common operations on Nex!™ compute racks. Compute racks use Docker to run containers, and the Docker daemon sometimes requires maintenance or troubleshooting. The steps below cover the most common cases.
1 - Upgrade Docker version
First, look up the Compute Rack that is causing issues and SSH into it. For reference, note the list of currently running containers with the command:
docker ps
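To make that reference list easy to compare after maintenance, the container names can be saved to a file. The following is a minimal sketch and not part of the Nex!™ tooling: the function name `missing_containers` and the `/tmp` file paths are assumptions chosen for illustration.

```shell
# Sketch: report containers that were running before maintenance but are not
# running afterwards. Function name and file paths are illustrative assumptions.

# $1: file with container names saved before, $2: file saved after
# (one name per line). Prints the names found in $1 but not in $2.
missing_containers() {
  awk -v after="$2" '
    BEGIN { while ((getline name < after) > 0) running[name] = 1 }
    !($0 in running)
  ' "$1"
}

# Typical use on the rack:
#   docker ps --format '{{.Names}}' > /tmp/containers.before
#   ... perform the maintenance ...
#   docker ps --format '{{.Names}}' > /tmp/containers.after
#   missing_containers /tmp/containers.before /tmp/containers.after
```

An empty result means every container that was running before is running again.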
Docker evolves continuously, so upgrading the Docker service may fix some issues.
Before upgrading the Docker daemon, check the Docker release notes for breaking changes and verify that everything works in a test environment first.
Log in to a Compute Rack:
nex-cli racks:ssh < rack ip address | e.g. 10.0.0.1 >
Run the commands that match the rack's package manager to upgrade Docker:
# On APT-based systems
# Update APT repository information
sudo apt-get -y update
# Upgrade docker package
sudo apt-get install --only-upgrade -y docker-engine

# On yum-based systems
# Upgrade docker package
sudo yum update -y docker-engine
Then restart the Docker daemon:
service docker restart
At this point, all containers are stopped and the applications become unavailable for a few minutes. Monit should restart all containers automatically. You can track container restarts by running:
# Verify list of running containers
sudo docker ps
# Verify list of failed containers (-a includes stopped containers)
sudo docker ps -a | grep Exited
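Because `docker ps` lists only running containers unless `-a` is given, a check for failed containers has to look at the full list. A minimal sketch of that check follows; the function name `exited_containers` is an assumption, and the function reads the listing on stdin so it does not call Docker itself.

```shell
# Sketch: list containers that are in the Exited state.
# Reads "name status..." lines on stdin, as produced by
#   docker ps -a --format '{{.Names}} {{.Status}}'
# and prints the name of every container whose status begins with "Exited".
exited_containers() {
  awk '$2 == "Exited" { print $1 }'
}

# Typical use on the rack:
#   sudo docker ps -a --format '{{.Names}} {{.Status}}' | exited_containers
# Docker can also filter directly:
#   sudo docker ps -a --filter status=exited
```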
2 - Restart all applications on a Compute Rack from the Nex!™ console
First, log in to one of the Nex!™ machines and open the Rails console:
cd /app/nex/current
bundle exec rails c < environment | e.g. 'production' >
Run the following script:
# Select Compute Rack by instance id (e.g. AWS instance id)
rack = ComputeRack.find_by(machine_id: 'i-a8daec64')

# OR select Compute Rack by IP address
rack = ComputeRack.find_by(private_ip_address: '10.0.0.15')

# Issue a command to restart all the CubeInstances attached to it
rack.cube_instances.each do |cube|
  cube.restart_on_grid
end
Verify the containers are running on the Docker server:
# Use the nex-cli on your local machine to login to the server
nex-cli racks:ssh 10.0.0.15
# Verify list of running containers
sudo docker ps
# Verify list of failed containers (-a includes stopped containers)
sudo docker ps -a | grep Exited
3 - AUFS error on creating new containers
The Docker daemon on a compute rack may become unstable and fail to start some containers, returning device mounting errors such as:
error creating aufs mount to /var/lib/docker/aufs/mnt/54f1c9be7a4455861a3412e2f93c84eaf82c0e50eff1565f34e0f06f7a01eea1-init: invalid argument
Most of the time, restarting the Docker daemon fixes the issue. Doing so stops all running containers, which then have to be restarted:
service docker restart
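Before restarting, it can help to see how often the error occurs. The sketch below counts matching lines in daemon log output read from stdin. It assumes the daemon log records the same message text as the error shown above. The function name `count_aufs_errors` and the log source are also assumptions, since the log location depends on how the rack is set up.

```shell
# Sketch: count AUFS mount errors in Docker daemon log lines read from stdin.
# Prints the number of matching lines (0 when there are none).
count_aufs_errors() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable in scripts that run with "set -e".
  grep -c 'error creating aufs mount' || true
}

# Typical use (log source is an assumption, adjust to the rack's setup):
#   sudo journalctl -u docker --since today | count_aufs_errors
```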
4 - Investigating container CPU bursts
When NewRelic triggers an alert on a Compute Rack CPU burst, the first step is to identify the process causing the issue.
Running top lists the most CPU-intensive processes:
$ top
  PID USER      PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
13669 www-data  20   0  690404 167904  5556 R 24.9  1.1   3:21.81 apache2
  938 root      20   0 6336560  91664  6568 S  7.2  0.6 636:35.68 docker
13573 www-data  20   0  703196  70080  5468 R 50.7  0.5   6:01.42 apache2
13880 www-data  20   0  647140  67988  5560 R 24.9  0.4   4:00.75 apache2
19546 message+  20   0 1420272  66292  3676 S  0.0  0.4   4:48.02 mysqld
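When the list is long, the busy processes can be filtered out of a single batch run of top. The following is a minimal sketch: the function name `busy_processes` and the default threshold of 20% are assumptions, and it expects the column header row to contain `PID`, `%CPU` and `COMMAND` as in the output above.

```shell
# Sketch: print PID, %CPU and COMMAND for processes at or above a CPU threshold.
# Reads top batch output on stdin; $1 is the threshold in percent (default 20).
busy_processes() {
  awk -v min="${1:-20}" '
    # Locate the column header row and remember where %CPU and COMMAND are.
    !cpu && $1 == "PID" {
      for (i = 1; i <= NF; i++) {
        if ($i == "%CPU") cpu = i
        if ($i == "COMMAND") cmd = i
      }
      next
    }
    # After the header, keep rows whose %CPU meets the threshold.
    cpu && cmd && NF >= cmd && $cpu + 0 >= min + 0 { print $1, $cpu, $cmd }
  '
}

# Typical use on the rack:
#   top -b -n 1 | busy_processes 20
```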
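Then walk up the process tree from the busy process (for example PID 13669) to its top-level ancestor, which should be the main process of a Docker container. In the output below, PID 13669 descends from PID 12823: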
$ ps -axfo pid,uname,cmd
...
12823 root      \_ /bin/bash /root/init.sh /root/init.sh
 1555 root      | \_ /bin/sh /usr/bin/mysqld_safe
29084 message+  | | \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock
29085 root      | | \_ logger -t mysqld -p daemon.error
10684 root      | \_ /usr/sbin/apache2 -k start
32604 www-data  | | \_ /usr/sbin/apache2 -k start
13423 www-data  | | \_ /usr/sbin/apache2 -k start
13437 www-data  | | \_ /usr/sbin/apache2 -k start
13669 www-data  | | \_ /usr/sbin/apache2 -k start
11462 root      | \_ cron
16713 root      | | \_ CRON
16775 root      | | \_ /bin/sh -c [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifeti
17754 root      | | \_ /bin/sh /usr/lib/php5/sessionclean /var/lib/php5 24
17755 root      | | \_ /usr/bin/lsof -w -l +d /var/lib/php5
17966 root      | | | \_ /usr/bin/lsof -w -l +d /var/lib/php5
17756 root      | | \_ awk -- { if (NR > 1) { print $9; } }
17757 root      | | \_ xargs -i touch -c {}
...
Finally, find the container whose main process has that PID (12823 in this example):
$ docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.ID}}, {{.Name}}' | grep "12823"
12823, d3201560c34aa170658ecede4c1f5a0ea5c6822352616a652cfbae133a084a78, /mcube-qcb
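The lookup from a busy PID to its container can also be scripted instead of reading the process tree by eye. The following is a minimal sketch under stated assumptions: the function names `ancestors` and `match_container` and the `/tmp` file paths are hypothetical, and both functions work on saved command output so they can be tried without touching Docker.

```shell
# Sketch: map a busy PID to its container. Names and paths are assumptions.

# Print the ancestor chain of PID $1 (starting with $1 itself and stopping
# before PID 1). Reads a "pid ppid" table on stdin, e.g. from ps -eo pid=,ppid=
ancestors() {
  awk -v pid="$1" '
    { parent[$1] = $2 }
    END {
      # NR bounds the number of hops, which guards against a malformed table.
      for (hops = 0; pid + 0 > 1 && (pid in parent) && hops < NR; hops++) {
        print pid
        pid = parent[pid]
      }
    }
  '
}

# Print the "pid, id, name" lines (from the docker inspect command above)
# whose pid appears in the ancestor chain.
# $1: file with the ancestor chain, $2: file with the inspect lines.
match_container() {
  awk -F', ' -v chain_file="$1" '
    BEGIN { while ((getline p < chain_file) > 0) chain[p] = 1 }
    $1 in chain
  ' "$2"
}

# Typical use on the rack, for the busy PID 13669:
#   ps -eo pid=,ppid= | ancestors 13669 > /tmp/chain
#   docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.ID}}, {{.Name}}' > /tmp/containers
#   match_container /tmp/chain /tmp/containers
```

Matching against the whole ancestor chain avoids having to know in advance which ancestor is the container's main process.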