Investigating a Frontend downtime
Problem
A frontend (myfrontend.maestrano.io in our example) is reported as down, either by the alerting system or by a user.
Step-by-step resolution and investigation
This guide assumes that the Frontend infrastructure is deployed on Nex!
As a reminder, the Nex! environment must be set up properly before the command-line tools can be run:
export NEX_API_KEY="your-api-key"
export NEX_ENV=production
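Before running any nex-cli command, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch; check_nex_env is a hypothetical helper name and is not part of nex-cli.

```shell
# Hypothetical helper (not part of nex-cli): fail early if a Nex! variable is missing.
check_nex_env() {
  missing=0
  for var in NEX_API_KEY NEX_ENV; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_nex_env || echo "set up the Nex! environment first"
```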
Resolution
The first step is to get the Frontend back online in order to minimize user impact.
Use nex-cli to determine the app name and retrieve its status:
nex-cli domains | grep myfrontend.maestrano.io
This outputs something similar to:
| 5569ff29-0c86-4228-ba20-5e9a8a853c30 | myfrontend.maestrano.io | true | persevering-dolphin-5 |
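A plain grep also matches longer domains that contain the one you searched for, and the dots in the pattern match any character. As a sketch, the app name can be extracted with an exact match on the domain column. app_for_domain is a hypothetical helper, and it assumes the column order shown in the sample row above (ID, domain, flag, app name).

```shell
# Hypothetical helper: read `nex-cli domains` output on stdin and print the app
# name (assumed to be the 4th table column) for an exact domain match.
app_for_domain() {
  awk -F'|' -v domain="$1" '
    {
      d = $3; a = $5                      # $1 is empty because rows start with "|"
      gsub(/^[ \t]+|[ \t]+$/, "", d)      # trim the cell padding
      gsub(/^[ \t]+|[ \t]+$/, "", a)
      if (d == domain) { print a; exit }
    }'
}

# Intended use (requires nex-cli, so left commented out):
#   export app=$(nex-cli domains | app_for_domain myfrontend.maestrano.io)
```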
Next, check the app's status:
export app=persevering-dolphin-5
nex-cli apps:info $app
+-----------------------+--------+--------------------+------+---------+------------------+------+-------+--------------+------------------+------------------+
| App Details                                                                                                                                                 |
+-----------------------+--------+--------------------+------+---------+------------------+------+-------+--------------+------------------+------------------+
| NAME                  | STATUS | IMAGE              | SSL  | STORAGE | PREFERRED REGION | SIZE | NODES | OWNER        | DESCRIPTION      | TAGS             |
+-----------------------+--------+--------------------+------+---------+------------------+------+-------+--------------+------------------+------------------+
| persevering-dolphin-5 | active | maestrano/web-ruby | true | false   | ap-southeast-1   | 4    | 0/12  | u:tartempion | My cool frontend | development,ruby |
+-----------------------+--------+--------------------+------+---------+------------------+------+-------+--------------+------------------+------------------+
...
+--------------------------------------+------------------+------------+----------------+---------+----+------+---------+
| Cube Instances                                                                                                        |
+--------------------------------------+------------------+------------+----------------+---------+----+------+---------+
| ID                                   | STATUS           | CONTAINER  | REGION         | STORAGE | IP | PORT | CLUSTER |
+--------------------------------------+------------------+------------+----------------+---------+----+------+---------+
| d2a2d9d9-ef8b-41ea-9e45-3cc911254a80 | stopped          | terminated | ap-southeast-1 | false   | -  | -    | -       |
| 9b3d4de3-405d-4624-bced-80145bac28da | stopped          | terminated | ap-southeast-1 | false   | -  | -    | -       |
+--------------------------------------+------------------+------------+----------------+---------+----+------+---------+
Here we see that no node is running: NODES shows 0/12 and every cube instance is stopped.
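The same check can be scripted by reading the NODES column instead of eyeballing the table. This is only a sketch: nodes_field and has_running_node are hypothetical helpers, and they assume the table layout shown above, with the first number in NODES being the count of running nodes.

```shell
# Hypothetical helper: read `nex-cli apps:info` output on stdin and print the
# NODES value (for example 0/12) from the first data row after the header.
nodes_field() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    col == 0 {
      # Still looking for the header row: remember where NODES sits.
      for (i = 1; i <= NF; i++) if (trim($i) == "NODES") col = i
      next
    }
    /^\|/ { print trim($col); exit }      # first row after the header; borders start with "+"
  '
}

# Succeeds only when the first number in NODES is greater than zero.
has_running_node() {
  running=$(nodes_field | cut -d/ -f1)
  [ -n "$running" ] && [ "$running" -gt 0 ]
}

# Intended use (requires nex-cli, so left commented out):
#   nex-cli apps:info "$app" | has_running_node || echo "no node is running"
```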
The first thing to do is restart the service. Restarting may erase clues about why it went down, so delay it only if you can afford to keep the service down longer:
nex-cli apps:up $app
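The node may take a while to come up after this command. One way to wait for it, sketched here, is to poll a health check of your choice. retry is a hypothetical helper, and the curl call in the usage comment assumes the frontend answers over HTTPS at its root path.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds between
# attempts, and succeed as soon as the command succeeds.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Do not sleep after the final attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Intended use (assumption: the site is reachable from where you run this):
#   retry 10 30 curl -fsS -o /dev/null https://myfrontend.maestrano.io/
```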
Now that we have at least one node up (we will bring a second one up later), let's try to understand what happened.
Root cause analysis
Now that the service is back, we can look at the logs to see what went wrong:
nex-cli apps:logs $app
Node: 5f621ba9-e282-4396-8dba-8e0f6abd761f
--------------------------------------------------
2017-01-20T00:45:25.623127175Z 00:45:25 web.1 | * Listening on tcp://0.0.0.0:3000
2017-01-20T00:45:25.623150529Z 00:45:25 web.1 | Use Ctrl-C to stop
2017-01-20T00:45:36.100704636Z 00:45:36 web.1 | Started GET "/" for 127.0.0.1 at 2017-01-20 00:45:36 +0000
2017-01-20T00:45:36.106560955Z 00:45:36 web.1 | Processing by StaticPagesController#home as */*
2017-01-20T00:45:36.121811713Z 00:45:36 web.1 | Rendered static_pages/home.html.erb within layouts/application (1.2ms)
2017-01-20T00:45:36.134399184Z 00:45:36 web.1 | Rendered layouts/_shim.html.erb (0.3ms)
2017-01-20T00:45:36.139367950Z 00:45:36 web.1 | Rendered layouts/_header.html.erb (1.0ms)
2017-01-20T00:45:36.143246075Z 00:45:36 web.1 | Rendered layouts/_footer.html.erb (0.4ms)
2017-01-20T00:47:06.398298961Z 00:47:06 web.1 | Rendered layouts/_shim.html.erb (0.0ms)
2017-01-20T00:47:06.398304191Z 00:47:06 web.1 | Rendered layouts/_header.html.erb (0.6ms)
2017-01-20T00:47:06.398309625Z 00:47:06 web.1 | Rendered layouts/_footer.html.erb (0.2ms)
2017-01-20T00:47:06.398314432Z 00:47:06 web.1 | Completed 200 OK in 8ms (Views: 6.1ms | ActiveRecord: 0.2ms)
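The sample above only contains a successful request. When the log is longer, a filter on the Rails "Completed" lines can surface failed requests quickly. This is a sketch: failed_requests is a hypothetical helper, and it relies only on the standard Rails request log format seen above.

```shell
# Hypothetical helper: read `nex-cli apps:logs` output on stdin and keep only
# Rails "Completed" lines whose HTTP status is 4xx or 5xx.
failed_requests() {
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -E 'Completed [45][0-9]{2} ' || true
}

# Intended use (requires nex-cli, so left commented out):
#   nex-cli apps:logs "$app" | failed_requests
```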
If there is nothing useful here, or if restarting the nodes erased the relevant logs, we have to check the cube events directly.
For each cube (listed in the nex-cli apps:info output), we can run:
nex-cli cubes:events --tail 200 d2a2d9d9-ef8b-41ea-9e45-3cc911254a80
2017-01-19 23:59:04 UTC | container | ERROR | I, [2017-01-19T05:19:45.167568 #509] INFO -- : Writing /app/public/assets/mno_enterprise/config-d760054ad0107c504a224466629f81cb8145bd7c8897bfea3e22d76c72dde01a.js
2017-01-19 23:59:04 UTC | container | ERROR | I, [2017-01-19T05:19:45.168649 #509] INFO -- : Writing /app/public/assets/mno_enterprise/config-d760054ad0107c504a224466629f81cb8145bd7c8897bfea3e22d76c72dde01a.js.gz
2017-01-19 23:59:04 UTC | container | ERROR | ng-process: /usr/local/bundle/bundler/gems/mno-enterprise-d0b9588ff879/frontend/app/assets/javascripts/mno_enterprise/application_lib.js
2017-01-19 23:59:04 UTC | container | ERROR | ng-ignore: /usr/local/bundle/gems/jquery-rails-4.2.2/vendor/assets/javascripts/jquery_ujs.js
2017-01-19 23:59:04 UTC | container | ERROR | ng-process: /usr/local/bundle/bundler/gems/mno-enterprise-d0b9588ff879/frontend/app/assets/javascripts/mno_enterprise/lib/sortable.js
2017-01-19 23:59:04 UTC | container | ERROR | I, [2017-01-19T05:19:48.130491 #509] INFO -- : Writing /app/public/assets/mno_enterprise/mail-2d6d5c1db70669c34d3438c4a36dd26220f3f08f7958f8b5fa10fab46fb1ec29.css.gz
2017-01-19 23:59:04 UTC | container | ERROR | rake aborted!
...
2017-01-19 23:59:06 UTC | container | ERROR | /usr/local/bundle/gems/sprockets-3.7.1/lib/sprockets/processor_utils.rb:75:in `call_processor'
2017-01-19 23:59:12 UTC | container | ERROR | /usr/local/bundle/gems/sprockets-rails-2.3.3/lib/sprockets/rails/task.rb:70:in `block (3 levels) in define'
2017-01-19 23:59:13 UTC | container | ERROR | /usr/local/bundle/gems/sprockets-3.7.1/lib/rake/sprocketstask.rb:147:in `with_logger'
2017-01-19 23:59:13 UTC | container | ERROR | /usr/local/bundle/gems/sprockets-rails-2.3.3/lib/sprockets/rails/task.rb:69:in `block (2 levels) in define'
2017-01-19 23:59:13 UTC | container | ERROR | /usr/local/bundle/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'
2017-01-19 23:59:13 UTC | container | ERROR | /usr/local/bin/bundle:22:in `load'
2017-01-19 23:59:13 UTC | container | ERROR | /usr/local/bin/bundle:22:in `<main>'
2017-01-19 23:59:13 UTC | container | ERROR | Tasks: TOP => assets:precompile
2017-01-19 23:59:13 UTC | container | ERROR | (See full trace by running task with --trace)
2017-01-19 23:59:13 UTC | container | ERROR | Could not start container. Aborting. Please see the nex-start logs for more details.
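In this output every line carries the ERROR level, including the INFO messages, so filtering by level does not help. A sketch of how to repeat the check for every cube and keep only the decisive lines follows. cube_ids and failure_lines are hypothetical helpers, and the patterns are taken from the sample events above rather than from any Nex! specification.

```shell
# Hypothetical helper: print every cube ID (a UUID in the first table column)
# found in `nex-cli apps:info` output read from stdin.
cube_ids() {
  grep -oE '^\| *[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12} ' | tr -d '| '
}

# Hypothetical helper: keep only the lines that explain a failed start.
# The three patterns come from the sample events shown above.
failure_lines() {
  grep -E 'rake aborted!|Tasks: TOP|Could not start container' || true
}

# Intended use (requires nex-cli, so left commented out):
#   nex-cli apps:info "$app" | cube_ids | while read -r id; do
#     echo "== $id"
#     nex-cli cubes:events --tail 200 "$id" | failure_lines
#   done
```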
Gotcha!
An untested code version was pushed: the assets:precompile rake task aborted, so the container failed to start.