


0 - New Relic

New Relic provides deep performance analytics for every part of your platform, both applications and servers. You can view and analyse large amounts of data and gain actionable insights in real time. It is a key tool to use during an incident.

1 - Applications / APM

To learn more about APM, see Application Performance Monitoring (APM).

Important KPIs to follow on an APM:

  • Web or non-web Transactions
  • Response time
  • Error rate
  • Apdex score
  • Availability
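
As an illustration of how two of these KPIs are derived, here is a minimal sketch of the standard Apdex formula, (satisfied + tolerating / 2) / total, and of a plain error rate. The function names, the sample values and the 0.5 s threshold are assumptions made for this example, not values taken from this platform. The sketch also ignores errors, whereas a monitoring tool may count failed requests as frustrated.

```python
def apdex(response_times_s, threshold_s=0.5):
    """Return the Apdex score (0 to 1) for response times given in seconds.

    threshold_s is the Apdex "T" value; 0.5 s is an assumed example value.
    """
    if not response_times_s:
        raise ValueError("need at least one response time")
    # Satisfied: at or below T. Tolerating: above T and at or below 4T.
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    # Frustrated requests (above 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times_s)


def error_rate(error_count, total_count):
    """Return the fraction of requests that ended in an error."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return error_count / total_count


# Made-up sample: 2 satisfied, 2 tolerating, 1 frustrated request.
samples = [0.2, 0.4, 0.9, 1.8, 3.0]
print(f"Apdex: {apdex(samples):.2f}")
print(f"Error rate: {error_rate(5, 200):.1%}")
```

A score close to 1 means most requests completed within the threshold; a falling score with a stable error rate usually points at slowness rather than failures.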

2 - Servers

To learn more about Servers: documentation coming soon.

Important KPIs to follow on a server:

  • CPU usage
  • Memory
  • Load

3 - Synthetics

To learn more about Synthetics, see: Synthetics

4 - Alerting

KPIs

Ping

Channels: Slack & emails

5 - Process

Once alerts are configured correctly, follow these steps whenever an alert is received:

  1. Acknowledge the alert (open the link sent with the alert and click on ""). To learn more: https://docs.newrelic.com/docs/alerts/new-relic-alerts/reviewing-alert-incidents/acknowledge-alert-incidents
  2. Use the #incident channel and @here on Slack to communicate about the incident resolution and to ask for help.
  3. Contact someone else from the team to help you resolve the incident, starting with the Operations Manager, so that they can communicate appropriately about the incident (inside the engineering team / to the customer success team / on the incident status page / to enterprise customers) and help you (directly or indirectly).
  4. If possible, focus first on resolving the incident, and only then on the investigation (example for a platform outage: bring the platform back up before investigating the root cause).
  5. If the incident is resolved, or if someone new needs to replace you on the incident investigation (e.g. because of a time zone difference):
    1. Raise an "Incident" ticket in JIRA describing: "Incident description and consequences", "Actions taken", "Root Cause", "Recommendation".
    2. If the incident is closed, close the JIRA ticket; otherwise, change it to the correct status.
    3. If the incident is closed but the root cause has not been identified or corrected yet, raise a "Problem" ticket in JIRA, linked to the incident ("caused by").
    4. Before leaving an investigation, make sure that someone from Operations Management and/or another investigator is across the incident and ready to investigate.
    5. Assign the JIRA tickets appropriately and publish the JIRA ticket(s) on Slack in the #incident channel.
  6. If you started investigating an issue after being assigned a JIRA ticket, describe your progress in the "Comments" section of the JIRA ticket(s).
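
To make the ticket content in step 5 concrete, the sketch below models the four "Incident" sections and the rule for raising a linked "Problem" ticket. It is only an illustration: the class, field and method names are hypothetical, the example incident is made up, and nothing here calls the JIRA API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentTicket:
    """Hypothetical in-memory model of the "Incident" ticket from step 5."""

    description_and_consequences: str
    actions_taken: List[str] = field(default_factory=list)
    root_cause: str = ""  # left empty while the root cause is unknown
    recommendation: str = ""
    closed: bool = False
    root_cause_corrected: bool = False

    def needs_problem_ticket(self) -> bool:
        # Incident closed, but root cause not identified or not corrected yet.
        return self.closed and (
            not self.root_cause or not self.root_cause_corrected
        )

    def to_text(self) -> str:
        # Render the four sections as plain text for the ticket body.
        actions = "\n".join(f"- {a}" for a in self.actions_taken) or "- (none yet)"
        return (
            "Incident description and consequences:\n"
            f"{self.description_and_consequences}\n\n"
            f"Actions taken:\n{actions}\n\n"
            f"Root Cause:\n{self.root_cause or '(not identified yet)'}\n\n"
            f"Recommendation:\n{self.recommendation or '(none yet)'}"
        )


# Made-up example: the incident is closed but the root cause is still unknown.
ticket = IncidentTicket(
    description_and_consequences="Platform outage, users could not log in.",
    actions_taken=["Restarted the application servers"],
    closed=True,
)
print(ticket.to_text())
print("Raise a linked Problem ticket:", ticket.needs_problem_ticket())
```

Keeping the four sections as separate fields makes it easy to see at handover which parts of the ticket are still empty.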