Centralised Monitoring For Compliance

Mar 22, 2024 · 5 min read

End-to-end monitoring can help reduce infrastructure issues and downtime by proactively identifying fault lines and bottlenecks in infrastructure.

A centralised monitoring dashboard can ensure that everyone, from engineers to CXOs, is on the same page when it comes to the state of infra.

I want to tell you about a very interesting conversation from my sysadmin days.

This was back in 2010, while I was working as a sysadmin. One day our head of infra services called us and asked: “Is everything fine with the network? Why have I not seen any alerts for the last few months?”

To that our senior replied, "There are no alarms raised because everything has been working fine."

He was still not convinced, so he had to be taken through the monitoring dashboard and shown that there were no alerts because there were no issues. Interesting story, isn’t it? Even though it has been more than a decade since then, I still see organisations having trust issues with infrastructure. It’s the easiest scapegoat to blame for any issue in the services.

As the saying goes, seeing is believing. It is necessary to bring transparency into all activities happening in infrastructure through a robust monitoring system. My ideal design is a one-stop station for monitoring all kinds of metrics and analytics, that is, a centralised dashboard with metrics from:

Infrastructure resources - Network, System, Storage etc
Applications - Frontend and Backend Services, Third Party Integrations etc
Backend Services - Deployment state, health check, performance, latency etc
Frontend App (Mobile) - Synthetics, Crash analytics etc
Business - User activity, sessions, transactions etc

How to categorise Monitoring?

Monitoring data can be categorised in two ways:

What to Monitor:

Infrastructure
Application
Services

How to Monitor:

State
Performance
Events
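The two categories above can be read as a grid, where every check belongs to one "what" and one "how". The snippet below is a minimal sketch of that idea; the class names, the `group_checks` helper and the sample check names are assumptions made for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    """What to monitor."""
    INFRASTRUCTURE = "infrastructure"
    APPLICATION = "application"
    SERVICES = "services"


class Kind(Enum):
    """How to monitor."""
    STATE = "state"
    PERFORMANCE = "performance"
    EVENTS = "events"


@dataclass(frozen=True)
class Check:
    name: str  # hypothetical check name, for illustration only
    target: Target
    kind: Kind


def group_checks(checks):
    """Group check names by their (target, kind) cell in the grid."""
    grid = {}
    for check in checks:
        grid.setdefault((check.target, check.kind), []).append(check.name)
    return grid


# Sample checks (assumed names) placed into the grid.
checks = [
    Check("uptime", Target.INFRASTRUCTURE, Kind.STATE),
    Check("cpu_utilisation", Target.INFRASTRUCTURE, Kind.PERFORMANCE),
    Check("failed_logins", Target.INFRASTRUCTURE, Kind.EVENTS),
    Check("third_party_connectivity", Target.APPLICATION, Kind.STATE),
]
grid = group_checks(checks)
```

An empty cell in such a grid is a quick way to spot a layer that is not being monitored at all.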

Infrastructure Monitoring:

Based on the two categories above, infrastructure must be monitored for its:

State:

Health Check
Uptime/Downtime
Availability - Data Center/AZ monitoring
Connectivity - Intranet & Internet
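As a rough illustration of a state check, the sketch below tests whether a TCP endpoint accepts connections, using only the Python standard library. The function name `is_reachable` and the example host and port are assumptions; a real monitoring tool would typically layer retries, scheduling and protocol-level health checks on top of something like this.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example usage (hypothetical internal endpoint):
# is_reachable("intranet.example.internal", 443)
```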

Performance:

CPU/Memory/Disk utilisation
Network Bandwidth
Peak hour Traffic
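For performance metrics, a common pattern is to compare a utilisation figure against warning and critical thresholds. The sketch below does this for disk usage with the standard library; the threshold values of 80 and 90 percent and the function names are assumptions chosen for the example, not recommendations.

```python
import shutil


def disk_utilisation_percent(path="/"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(value, warn_at=80.0, critical_at=90.0):
    """Map a utilisation percentage to 'ok', 'warning' or 'critical'."""
    if value >= critical_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


# Example usage: classify the current disk utilisation of the working directory.
status = classify(disk_utilisation_percent("."))
```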

Events:

Authentication failures/Too many failed logins
Unauthorised access
Change Management
Configuration Changes like FW rules
Deployments
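Event monitoring usually means counting occurrences within a time window. As a hedged sketch of the "too many failed logins" case, the class below keeps a sliding window of failures per user; the class name, the limit of five failures and the 60 second window are all assumptions for illustration.

```python
from collections import deque


class FailedLoginMonitor:
    """Flag users with more than `max_failures` failed logins in a sliding window."""

    def __init__(self, max_failures=5, window_seconds=60):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = {}  # user -> deque of failure timestamps (seconds)

    def record_failure(self, user, timestamp):
        """Record a failed login; return True if the user is over the limit."""
        window = self._failures.setdefault(user, deque())
        window.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_failures
```

In practice the timestamps would come from authentication logs, and a `True` result would raise an alert rather than just return a flag.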

Application Monitoring:

Likewise, for applications and services, monitoring must be set up for the following:

State:

Third party connectivity
Integration
Frontend to Backend Flow
Truepath/Purepath for fault identification

Performance:

Crash analytics for mobile apps
Performance tests
Regression Tests

Events:

User activity/sessions
Transactions
Downloads
Business analytics
Business metrics, conversions, behavior etc
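Application events such as sessions, transactions and downloads can be rolled up into business metrics. The sketch below assumes each event is a dictionary with a `type` field; the event type names and the `summarise_events` helper are hypothetical, and conversion is defined here simply as transactions per session.

```python
from collections import Counter


def summarise_events(events):
    """Roll raw application events up into a few business metrics."""
    counts = Counter(event["type"] for event in events)
    sessions = counts.get("session_start", 0)
    transactions = counts.get("transaction", 0)
    return {
        "sessions": sessions,
        "transactions": transactions,
        "downloads": counts.get("download", 0),
        # Avoid dividing by zero when no sessions were recorded.
        "conversion_rate": transactions / sessions if sessions else 0.0,
    }


# Sample events (assumed shape) for illustration.
events = [
    {"type": "session_start"},
    {"type": "session_start"},
    {"type": "session_start"},
    {"type": "session_start"},
    {"type": "transaction"},
    {"type": "download"},
]
summary = summarise_events(events)
```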

Centralised Monitoring Dashboard:

The centralised dashboard is a one-stop station that provides end-to-end, pure path visibility of how each and every component involved in the software delivery cycle is performing.

This monitoring system should be integrated with a centralised identity provider that offers role-based access control and single sign-on to provide secure user access.

Compliance of user data, localised caching, security and cloud service monitoring are some more factors that should be considered while selecting any monitoring system.

How to choose your monitoring tool:

There are both cloud-managed and self-managed options available for monitoring tools. The choice depends on various factors:

Ensure the tool you choose can monitor everything from infrastructure to applications to business metrics. Centralised monitoring means you have everything under one umbrella. Many clouds provide centralised managed services for monitoring, such as AWS CloudWatch and GCP Operations.

There are also managed versions of various third-party or self-managed tools, such as Prometheus and Dynatrace. You can either set these up yourself or use the managed or SaaS versions where a vendor offers them.

Cloud-managed or SaaS versions of monitoring tools are efficient because they take the operations overhead away, for a service charge. If you compare the time and effort spent on setting up and managing an entire centralised monitoring tool by yourself, you will find that spending a few more dollars on managed tools is cost efficient.

However, there are some security factors to consider when using managed services. At times, it has been observed that these cloud-managed monitoring tools scan the apps and infra from a management console outside the customer VPC. In some of my previous projects this raised red flags, as we had access to neither those networks nor the management console. Hence we had to get a written agreement from the cloud provider confirming that none of our data was cached outside our designated region and that it was protected from third-party snooping or data theft.

Another important aspect is the ability of the monitoring tool to integrate with your centralised identity provider (IdP). Look for a tool that can integrate with your IdP through SSO. This makes onboarding and off-boarding users a lot easier and more traceable. There is all the more reason to choose a tool if it also provides RBAC.
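To make the RBAC point concrete, the sketch below maps group names, as they might arrive in an SSO token from the IdP, to dashboard permissions. Everything here is an assumption for illustration: the `groups` claim, the group names and the permission names are hypothetical and not tied to any specific IdP or monitoring product.

```python
# Hypothetical mapping from IdP group names to dashboard permissions.
GROUP_PERMISSIONS = {
    "sre": {"view_infra", "view_app", "acknowledge_alerts"},
    "developers": {"view_app"},
    "leadership": {"view_business"},
}


def permissions_for(claims):
    """Collect the permissions granted by every group in the user's claims."""
    granted = set()
    for group in claims.get("groups", []):
        # Unknown groups grant nothing, so access is denied by default.
        granted |= GROUP_PERMISSIONS.get(group, set())
    return granted


def is_allowed(claims, permission):
    """Return True if the user's groups grant the requested permission."""
    return permission in permissions_for(claims)


# Example usage with assumed token claims.
alice = {"sub": "alice", "groups": ["developers", "sre"]}
can_ack = is_allowed(alice, "acknowledge_alerts")
```

Because access derives from IdP groups, removing a user from a group at off-boarding removes their dashboard access without touching the monitoring tool itself.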

Don't Forget The Alerts/Alarms:

As much as it is important to choose the right tool, it’s equally important to set the alerts/alarms properly. You must set alerts for exactly what you want to be notified about; alert neither on everything nor on nothing.

Here are some useful alerts, grouped by area:
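One way to avoid alerting on everything or on nothing is to keep an explicit list of rules and alert only when one of them fires. The sketch below is a minimal version of that idea; the `AlertRule` structure, the metric names and the thresholds are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # name of the metric to watch (hypothetical)
    threshold: float  # fire when the value is at or above this
    severity: str     # e.g. "warning" or "critical"


# Hypothetical rule set: only metrics listed here can ever raise an alert.
RULES = [
    AlertRule("cpu_percent", 90.0, "critical"),
    AlertRule("disk_percent", 80.0, "warning"),
]


def evaluate(rules, metrics):
    """Return (severity, message) for every rule whose threshold is met."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value >= rule.threshold:
            message = f"{rule.metric}={value} >= {rule.threshold}"
            fired.append((rule.severity, message))
    return fired


# memory_percent is high but has no rule, so it stays silent.
alerts = evaluate(RULES, {"cpu_percent": 95.0, "disk_percent": 40.0, "memory_percent": 99.0})
```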

Cloud Account

Billing payment due or credit check
Root Account Login
Renewal of Subscription
Modifications of ACLs and security groups
Updates - versions, patches, certificates etc

Service Availability

Alert when something goes down and comes back up
Fault Aware and Tolerant
Separate Alerts for infra and app
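Alerting when something goes down and again when it comes back up amounts to alerting on state transitions rather than on every failed check. The sketch below shows one way to do that while tagging each alert with its layer; the class name, the message format and the choice to treat an unseen component as previously up are assumptions.

```python
class AvailabilityTracker:
    """Emit an alert only when a component changes between up and down."""

    def __init__(self):
        self._last_state = {}  # component name -> last observed is_up value

    def observe(self, component, layer, is_up):
        """Record an observation; return an alert message on a transition, else None."""
        # Assume a component was up before its first observation,
        # so a component first seen as down still raises an alert.
        previous = self._last_state.get(component, True)
        self._last_state[component] = is_up
        if previous == is_up:
            return None
        status = "RECOVERED" if is_up else "DOWN"
        # The layer tag ("infra" or "app") keeps the two alert streams separable.
        return f"[{layer}] {component} {status}"
```

Repeated failures of a component that is already down return `None`, so the outage is reported once and the recovery is reported once.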

Operations Dashboard

Collect alerts in one system such as Slack or PagerDuty
Sanitise alerts, don't spam

And last but not least, once an alert is received, don't forget to acknowledge it, or it will spam your inbox or annoy you with constant notifications.
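Sanitising and acknowledging alerts can be sketched as a small gate in front of the notification channel: repeated alerts are suppressed during a cool-down period, and acknowledged alerts stay silent until the incident is resolved. The class below is a hedged illustration; the name `AlertInbox`, the five minute cool-down and the alert keys are assumptions, and real tools implement this with far more nuance.

```python
class AlertInbox:
    """Suppress duplicate and acknowledged alerts before notifying anyone."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}        # alert key -> time of last notification
        self._acknowledged = set()  # alert keys someone has acknowledged

    def should_notify(self, key, now):
        """Return True if a notification for `key` should go out at time `now`."""
        if key in self._acknowledged:
            return False
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cool-down, do not spam
        self._last_sent[key] = now
        return True

    def acknowledge(self, key):
        """Mark an alert as seen so that reminders stop."""
        self._acknowledged.add(key)

    def resolve(self, key):
        """Clear all state so a future incident notifies again."""
        self._acknowledged.discard(key)
        self._last_sent.pop(key, None)
```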

Things to consider for monitoring setup:

First responders need a better dashboard where valid alerts are collected, monitored and acknowledged. Alerts must be sanitised and categorised to prevent spamming.
It is ok to go with licensed monitoring tools, provided they cover all the layers like infra, app, mobile, business etc and provide technical support and maintenance.
You can use more than one tool, but ensure all of them are integrated into one dashboard.
Monitoring screens also help in satellite service centers or during critical launches.

Summary:

For a fault-tolerant infrastructure, you need robust end-to-end monitoring and alerting.

This can be achieved through a centralised system that monitors the state, performance and events across infrastructure and applications, and sends alerts almost immediately when incidents occur.

This monitoring system should be integrated with a centralised identity provider that offers role-based access control and single sign-on to provide secure user access.

If you like this article, I am sure you will find the 10-Factor Infrastructure even more useful. It compiles all these tried and tested methodologies, design patterns & best practices into a complete framework for building secure, scalable and resilient modern infrastructure.

Don’t let your best-selling product suffer due to an unstable, vulnerable & mutable infrastructure.

Be fit to launch & scale on a compliance ready cloud from Day 1 with 10factorinfra

Thanks & Regards

Kamalika Majumder

The 10-Factor Infrastructure