This article describes how to monitor and troubleshoot Maestrano's infrastructure with New Relic.







1 - Overview

Server monitoring collects metrics about the servers running on the Maestrano environment.

Drilling down into a specific server shows details such as CPU and memory usage, disk and network I/O, and load average.


The New Relic server agent is installed by Nex! on all the racks.

1.1 - CPU

The chart displays the average CPU usage over the reported time period.

CPU States:

  • IO Wait: Time that the CPU is idle and there is at least one input or output operation in progress.
  • Stolen: CPU time "stolen" from this virtual machine by the hypervisor for other tasks (such as running another virtual machine). New Relic will only show increased stolen activity when the app has activity. New Relic does not count stolen time for CPU activity alerts. For example, if resources are stolen but a virtual machine is not actively processing, you see no stolen load. However, if resources are being stolen and a virtual machine is even slightly active, the load spikes proportionally. The more stolen resources there are, the less active the virtual machine needs to be to generate a high load rating.
  • System: Time used by the kernel and its associated processes. This is mostly system housekeeping, but things like RAID rebuilding, and handling network transmission and checksums fall into this category as well.
  • User: Time the CPU has spent running users' processes.
  • Idle: Anything between the top of your graphed usage and 100% (in white) is time when the CPU is not doing anything at all.

For web applications, the average CPU usage is expected to be lower than 20%. CPU usage on job servers varies depending on the jobs executed. If a server's CPU usage stays at 100% for a long period (more than 5 minutes), the root cause should be investigated, as there may be performance issues.
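As an illustration of how the CPU states above add up to 100%, here is a minimal Python sketch that turns two samples of cumulative CPU counters into per-state percentages. It is not New Relic's implementation: the function name, the dictionary layout and the sample values are assumptions, loosely modelled on the counters Linux exposes in /proc/stat.

```python
def cpu_state_percentages(before, after):
    """Return the share of CPU time spent in each state between two samples.

    `before` and `after` map a state name to a cumulative tick counter
    (hypothetical layout, similar in spirit to the first line of /proc/stat).
    """
    deltas = {state: after[state] - before[state] for state in after}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed between the samples: nothing to report.
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


# Made-up sample values, for illustration only.
before = {"user": 1000, "system": 400, "iowait": 100, "steal": 0, "idle": 8500}
after = {"user": 1150, "system": 450, "iowait": 120, "steal": 30, "idle": 9250}

shares = cpu_state_percentages(before, after)
busy = 100.0 - shares["idle"]  # everything that is not idle
print(shares)
print(f"busy: {busy:.1f}%")
```

With these sample values the server is 25% busy, which would be slightly above the 20% average expected for a web application.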

1.2 - Load

These values represent the average system load over the last 1, 5, and 15 minutes. The chart on the New Relic Server monitor page shows the 1-minute value at the time the data is sampled.
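The same three load values can be read directly on a server. The sketch below uses Python's standard `os.getloadavg()` (available on Unix-like systems only) and divides each value by the number of CPU cores. Comparing load to core count is a common rule of thumb rather than something New Relic prescribes, and the helper name `load_per_core` is an assumption.

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if hasattr(os, "getloadavg"):
    try:
        one, five, fifteen = os.getloadavg()  # 1, 5 and 15 minute averages
    except OSError:
        print("load average is not available on this system")
    else:
        cores = os.cpu_count() or 1
        for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
            print(f"{label}: load {value:.2f}, per core {load_per_core(value, cores):.2f}")
```

A per-core value that stays well above 1.0 suggests that processes are queuing for CPU or I/O.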

1.3 - Memory

The physical memory chart illustrates the percentage of physical memory and swap space being used. If memory usage exceeds 100% of the available RAM, swap usage increases, which degrades application performance. Memory usage should be fine-tuned per application server.
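To make the "more than 100%" case concrete, the sketch below assumes the percentage is computed as used RAM plus used swap, divided by total RAM. This formula is an assumption made for illustration, not a documented New Relic calculation, and the function name and figures are hypothetical.

```python
def memory_usage_percent(ram_used_mb, swap_used_mb, ram_total_mb):
    """Memory usage relative to physical RAM, counting swap as extra usage.

    Assumed formula: (used RAM + used swap) / total RAM * 100.
    """
    if ram_total_mb <= 0:
        raise ValueError("ram_total_mb must be positive")
    return 100.0 * (ram_used_mb + swap_used_mb) / ram_total_mb


# A 4 GB server using half of its RAM and no swap.
print(memory_usage_percent(2048, 0, 4096))

# The same server with RAM almost full and 500 MB swapped out:
# the result goes above 100%.
print(memory_usage_percent(3900, 500, 4096))
```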

1.4 - Disk I/O Utilization

This chart represents the percentage of input versus output for the disks on a server. It is not a representation of throughput. Note that if the server swaps memory, the disks will be under heavy I/O load. On AWS, this may slow down performance when using EBS.

1.5 - Network I/O

The Network I/O chart shows the data transmitted and received, in megabytes per second.
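A rate in megabytes per second is typically derived from two readings of a cumulative byte counter. The sketch below shows that arithmetic. The function name is hypothetical, and it assumes 1 megabyte = 1,000,000 bytes, which may differ from the convention New Relic uses.

```python
BYTES_PER_MEGABYTE = 1_000_000  # assumption: decimal megabytes


def throughput_mb_per_s(bytes_before, bytes_after, interval_seconds):
    """Average transfer rate between two readings of a cumulative byte counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (bytes_after - bytes_before) / interval_seconds / BYTES_PER_MEGABYTE


# 30,000,000 bytes received over one minute.
rate = throughput_mb_per_s(1_000_000_000, 1_030_000_000, 60)
print(f"{rate} MB/s")
```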

2 - Server alert policies

New Relic provides a set of default server alerts based on CPU and memory usage. Depending on the applications running on the servers, you may want to tune these alerts.


Go to New Relic Dashboard > Servers > Alerts > Server Policies.

Select the policy group you want to edit, or create a new one. These are the recommended settings:

  • CPU: send an alert when usage stays above 95% for 5 minutes
  • Disk I/O: send an alert when utilization stays above 75% for 10 minutes
  • Memory: send an alert when usage stays above 100% for 10 minutes (this includes swap)
  • Fullest disk: send an alert when usage stays above 90% for 30 minutes
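The recommended settings above can be read as "threshold exceeded continuously for N minutes". The following Python sketch encodes them as data and evaluates one rule against per-minute samples. It only illustrates the logic and is not New Relic's alerting engine: the `AlertRule` class, the metric names and the one-sample-per-minute assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str               # hypothetical metric name
    threshold_percent: float  # alert when the value stays above this
    duration_minutes: int     # for at least this many minutes


# Thresholds taken from the recommended settings listed above.
RECOMMENDED_RULES = [
    AlertRule("cpu", 95.0, 5),
    AlertRule("disk_io", 75.0, 10),
    AlertRule("memory", 100.0, 10),
    AlertRule("fullest_disk", 90.0, 30),
]


def should_alert(rule, samples):
    """Return True if the most recent samples all exceed the rule's threshold.

    `samples` holds one reading per minute, oldest first (an assumption).
    """
    if len(samples) < rule.duration_minutes:
        return False  # not enough history yet
    window = samples[-rule.duration_minutes:]
    return all(value > rule.threshold_percent for value in window)


cpu_rule = RECOMMENDED_RULES[0]
print(should_alert(cpu_rule, [50, 96, 97, 98, 99, 100]))  # sustained: True
print(should_alert(cpu_rule, [96, 97, 90, 98, 99]))       # dipped to 90: False
```

Requiring every sample in the window to exceed the threshold avoids alerting on short spikes, which matches the intent of the "after N minutes" settings.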

It is highly recommended to send alerts to the Slack channel #alerts.