0 - New Relic
New Relic provides deep performance analytics for every part of your platform, both applications and servers. You can view and analyse large amounts of data and gain actionable insights in real time. It is a key tool during an incident.
1 - Applications / APM
To learn more about APM, see Application Performance Monitoring (APM).
Important KPIs to follow on an APM:
- Web or non-web Transactions
- Response time
- Error rate
- Apdex score
- Availability
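As an illustration of two of the KPIs above, here is a minimal sketch of how an Apdex score and an error rate can be computed from raw samples. It uses the standard Apdex definition (satisfied samples plus half of the tolerating samples, divided by the total). The function names and the 0.5 second threshold are assumptions made for this example, not values taken from our New Relic configuration.

```python
def apdex(response_times_s, threshold_s=0.5):
    """Return the Apdex score for a list of response times in seconds.

    Satisfied:  t <= T
    Tolerating: T < t <= 4T
    Frustrated: t > 4T (counts for nothing)
    threshold_s (T) is an assumed value; use the one set for your app.
    """
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


def error_rate(error_count, transaction_count):
    """Return the fraction of transactions that ended in an error."""
    if transaction_count == 0:
        return 0.0
    return error_count / transaction_count
```

For example, with samples of 0.1 s, 0.2 s, 0.6 s and 3.0 s and T = 0.5 s, two samples are satisfied, one is tolerating and one is frustrated, giving an Apdex of (2 + 0.5) / 4 = 0.625.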
2 - Servers
To learn more about Servers: documentation coming soon.
Important KPIs to follow on a server:
- CPU usage
- Memory
- Load
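Load is easier to read once it is related to the number of CPU cores. The sketch below shows one way to do that on a Unix-like host using only the Python standard library. The helper names and the limit of 1.0 per core are hypothetical and chosen for illustration; they are not New Relic settings.

```python
import os


def load_per_core(load_1m, cores):
    """Normalise a 1-minute load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1m / cores


def is_overloaded(load_1m, cores, limit=1.0):
    """Return True when the load per core exceeds the (assumed) limit."""
    return load_per_core(load_1m, cores) > limit


def current_load_per_core():
    """Read the live 1-minute load average (Unix only) and normalise it."""
    load_1m, _, _ = os.getloadavg()
    return load_per_core(load_1m, os.cpu_count() or 1)
```

A load of 6.0 on a 4-core server gives 1.5 per core, which this sketch would flag, while the same load on an 8-core server (0.75 per core) would not be flagged.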
3 - Synthetics
To learn more about Synthetics, see Synthetics.
4 - Alerting
- KPIs
- Ping
- Channels: Slack and email
5 - Process
Once alerts are configured correctly, follow these tips whenever you receive one:
- Acknowledge the alert (click on the link sent with the alert and click on ""). To know more: https://docs.newrelic.com/docs/alerts/new-relic-alerts/reviewing-alert-incidents/acknowledge-alert-incidents
- Use the #incident channel and @here on Slack to communicate about the incident resolution and to ask for help
- Contact someone else from the team to help you resolve the incident, starting with the Operations Manager, so that they can communicate appropriately about the incident (inside the engineering team / to the customer success team / on the incident status page / to enterprise customers) and help you (directly or indirectly).
- If possible, focus first on resolving the incident, and only then on the investigation (example for a platform outage: focus on bringing the platform back up before investigating the root cause)
- If the incident is resolved, or if someone new needs to replace you on the incident investigation (e.g.: time zone problem):
  - Raise an "Incident" ticket in JIRA describing: "Incident description and consequences", "Actions taken", "Root Cause", "Recommendation"
  - If the incident is closed, close the JIRA ticket; otherwise, change it to the correct status.
  - If the incident is closed but the root cause has not yet been identified or corrected, raise a "Problem" ticket in JIRA, linked to the incident ("caused by")
- Before leaving an investigation, make sure that someone from Operations Management and/or an investigator is across the incident and ready to investigate.
- Assign the JIRA tickets appropriately, and post the JIRA ticket(s) in the #incident channel on Slack
- If you started investigating an issue after being assigned a JIRA ticket, describe your progress in the "Comments" section of the JIRA ticket(s)
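To keep "Incident" tickets consistent, the four sections listed above can be generated from a small template. The following is only a sketch: it builds plain text to paste into the ticket description and does not call the JIRA API. The function name and the "TBD" placeholder for sections that are still unknown are assumptions made for this example.

```python
def incident_description(description, actions_taken,
                         root_cause="", recommendation=""):
    """Build the text of an "Incident" ticket with the four expected sections.

    Empty sections are filled with "TBD" (an assumed placeholder) so that
    a missing root cause or recommendation stays visible in the ticket.
    """
    sections = [
        ("Incident description and consequences", description),
        ("Actions taken", actions_taken),
        ("Root Cause", root_cause),
        ("Recommendation", recommendation),
    ]
    lines = []
    for title, body in sections:
        lines.append(title)
        lines.append(body.strip() or "TBD")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"
```

Calling `incident_description("Platform outage, users cannot log in", "Restarted the application servers")` produces a description whose "Root Cause" and "Recommendation" sections read "TBD", which is a reminder to raise the linked "Problem" ticket if they remain unknown once the incident is closed.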