![Streamlining Incident Response with Central Log Management](https://static.wixstatic.com/media/981170_d491cc1638e342b7baed72736bebd75d~mv2.png/v1/fill/w_980,h_551,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/981170_d491cc1638e342b7baed72736bebd75d~mv2.png)
If you ask me the best place to get more insights into infra, my answer would be from infrastructure issues. I spent a good first half of my career supporting and solving infra issues, be it network or server. In those days orgs used to depend heavily on IT support, and they still do, though it has transformed into present-day tech support or cloud support.
Unfortunately, even today when it comes to infra, a reactive approach to solving these issues is the norm. Proper analysis of incidents, combined with proactive measures, can prevent them, but such measures get ignored because they are considered non-functional or operational. They come to notice only when businesses start losing customers to competitors, review their annual performance, and identify that the cause is lower quality of service (QoS). The faster you can respond to and solve any issue, the better your QoS and the more trust you establish with your customers.
How many hours did you spend last month on production issues, how many downtimes did you have, and how long did it take to bring a system back up? Before I continue further, spend the next few seconds finding out how long those so-called non-productive hours were: rewind your memory and start noting the downtimes.
Why log?
Observability, so you can debug faster.
Root cause analysis and postmortems, to identify what caused an incident and take preventive measures.
In turn, better commitment to your SLAs and SLOs, with the maximum number of 9s in them (see the quick calculation after this list).
In short, logging helps identify the fault lines in your infra and helps you build a fault-tolerant service.
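To put those 9s in perspective, here is a quick back-of-the-envelope calculation of the downtime budget each level of availability leaves you per year:

```python
# Downtime budget per year for each number of 9s in an availability SLA.
MINUTES_PER_YEAR = 365 * 24 * 60

for nines in range(1, 6):
    availability = 1 - 10 ** -nines              # e.g. 3 nines -> 99.9%
    budget_minutes = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{availability:.4%} uptime -> {budget_minutes:,.1f} min downtime/year")
```

At three 9s (99.9%) you are allowed roughly 8.8 hours of downtime a year; at five 9s, barely five minutes.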
Some common challenges in logging:
Missing logs
Format issues
Too many tools
Too many logs
Log corruption
What should be logged?
Logging must be enabled to build observability for both infrastructure and applications. It is a must for:
Infrastructure components like network, system, and storage, to monitor health, performance, incidents, configuration, and access management. Unless you log, how will you know why the last deployment failed, or whether it was an infrastructure issue or an application bug?
Likewise, log collectors must be enabled by default for all applications being developed, to build observability and traceability into their performance.
In addition to standard logging, enable audit logging for security events, such as when a user gets created or deleted, when there are multiple failed logins, or when there are too many access errors. These audit logs are a must during compliance and regulatory audits.
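As a minimal sketch of what such audit events can look like, the snippet below writes one JSON line per security event using Python's standard logging module; the field names and file path are illustrative assumptions, not a mandated schema:

```python
import json
import logging
import time

# Hypothetical audit logger: one JSON line per security event.
audit = logging.getLogger("audit")
handler = logging.FileHandler("audit.log")   # illustrative path
handler.setFormatter(logging.Formatter("%(message)s"))
audit.addHandler(handler)
audit.setLevel(logging.INFO)

def audit_event(action: str, user: str, **extra) -> None:
    """Record one audit event as a single JSON line."""
    audit.info(json.dumps({"ts": time.time(), "action": action,
                           "user": user, **extra}))

audit_event("user.created", user="alice", by="admin")
audit_event("login.failed", user="bob", attempts=5, source_ip="10.0.0.7")
```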
Central Log Management:
In modern IT infrastructures, centralised logging is crucial for effective system monitoring, troubleshooting, and compliance. Centralised logging setups can be deployed either on the cloud or on-premise, each offering distinct advantages and challenges. This article explores both approaches, focusing on traceability, standardisation, log aggregation, storage, log rotation, and retention policies.
A One-Stop Station for Traceability:
Cloud platforms like AWS CloudWatch, Google Cloud Logging, and Azure Monitor provide integrated environments for centralised logging. These services offer a unified interface to collect, analyse, and visualise logs from various sources, enhancing traceability across distributed systems. The scalability of cloud infrastructure ensures that as the volume of logs grows, the system can handle the increased load without significant manual intervention. Cloud logging services also integrate seamlessly with other cloud-native tools, improving traceability through automated correlation of logs, metrics, and traces.
On-premise solutions, often utilizing tools like the Elastic Stack (ELK: Elasticsearch, Logstash, Kibana), Splunk, or Graylog, provide robust traceability within a controlled environment. On-premise setups can be tailored to specific organizational needs, offering complete control over data and infrastructure. However, scaling to handle large volumes of logs can be challenging and expensive. Traceability might require significant manual configuration and integration, especially in heterogeneous IT environments.
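One building block of traceability, whichever platform you choose, is stamping every log line with a correlation id so a single request can be followed across services. A minimal sketch in Python follows; the `trace_id` field name is an assumption, not a standard:

```python
import logging
import uuid

# Include a correlation id in the log format so one request can be
# traced across services and log sources.
logging.basicConfig(
    format="%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s")
log = logging.getLogger("orders")

def handle_request() -> None:
    # In practice the id would be propagated via request headers.
    ctx = {"trace_id": uuid.uuid4().hex}
    log.warning("payment gateway timed out", extra=ctx)
    log.error("order failed after retries", extra=ctx)

handle_request()
```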
Formatting For Human Readability:
Use a standard format to keep logs human-readable, shipped with standard log collectors like Filebeat, Fluentd, or journald, to prevent missing logs or corruption issues.
Cloud logging services often come with pre-configured support for standard log formats, making them human-readable and easily searchable. Tools like Filebeat, Fluentd, and journald can be integrated to forward logs in standardised formats. Cloud providers ensure that logs are parsed, indexed, and formatted correctly, reducing the risk of missing or corrupted logs. Additionally, cloud platforms provide built-in dashboards and visualisation tools, making log analysis more accessible.
On-premise setups also support standard log collectors like Filebeat, Fluentd, and journald to gather and forward logs. The flexibility of on-premise solutions allows for customization of log formats and processing pipelines. However, this flexibility requires skilled personnel to configure and maintain the system. Ensuring logs are consistently formatted and free from corruption necessitates rigorous monitoring and maintenance, which can be resource-intensive.
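Whichever collector you use, emitting one self-describing JSON object per line keeps logs both human-readable and machine-parseable. Here is a minimal sketch using Python's standard library; the field names are a common convention, not a fixed standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, which shippers like
    Filebeat or Fluentd can forward without custom parsing rules."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger("payments").warning("retrying charge, attempt %d", 2)
```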
Log Aggregation and Shared Storage:
Logs must be aggregated and stored in shared storage such as a NAS or SAN.
Cloud environments excel in log aggregation and storage, leveraging services like Amazon S3, Google Cloud Storage, and Azure Blob Storage. These services provide scalable, resilient, and cost-effective storage solutions. Logs from various sources are aggregated into a central repository, ensuring that they are accessible for analysis and auditing. Cloud storage solutions offer redundancy and high availability, minimising the risk of data loss.
On-premise central log management systems typically utilise a Network Attached Storage (NAS) or Storage Area Network (SAN) for log aggregation and storage. While these solutions offer high performance and control, they also come with higher costs and complexity in terms of setup and maintenance. Ensuring data redundancy and availability requires additional infrastructure and management effort. Aggregating logs in an on-premise environment can be challenging, especially when dealing with large-scale distributed systems.
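As an illustration of the cloud route, the sketch below ships local log files into a central S3 bucket with boto3; the bucket name, prefix, and source directory are hypothetical, and AWS credentials are assumed to be configured in the environment:

```python
import gzip
import pathlib

import boto3  # assumes AWS credentials are available to the process

s3 = boto3.client("s3")
BUCKET = "central-log-archive"          # hypothetical bucket

# Compress and upload each local log file into the shared repository.
for path in pathlib.Path("/var/log/myapp").glob("*.log"):
    key = f"logs/{path.name}.gz"
    s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(path.read_bytes()))
    print(f"shipped {path} -> s3://{BUCKET}/{key}")
```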
Log Rotation and Retention Policies:
Enable regular log rotation and archive logs older than 30 days. Set a log retention policy in line with your compliance needs. Most regulators, especially in the banking or financial sector, may require you to retain logs for at least a year or two. In that case, archive them and keep them in separate storage; otherwise you might hit a disk space issue on your log aggregator.
Cloud logging services offer built-in support for log rotation and retention policies. These services can automatically archive logs older than a specified period to cheaper storage tiers or even delete them, ensuring compliance with regulatory requirements. For industries like banking and finance, where logs must be retained for extended periods, cloud providers offer solutions to archive logs securely and cost-effectively. Automation reduces the administrative burden and minimizes the risk of hitting storage limits.
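For example, on AWS a single lifecycle rule can implement the 30-day archive and two-year retention described above; here is a sketch with boto3, where the bucket name and prefix are hypothetical:

```python
import boto3  # assumes AWS credentials and an existing bucket

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="central-log-archive",       # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            # Move logs to cold storage after 30 days...
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            # ...and delete them after roughly two years.
            "Expiration": {"Days": 730},
        }]
    },
)
```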
Implementing log rotation and retention policies on-premise requires careful planning and configuration. Tools like Logrotate can manage log rotation, but administrators must ensure that archived logs are moved to separate storage to prevent disk space issues. Meeting regulatory requirements in sectors like banking often involves setting up secure, long-term storage solutions, which can be both costly and complex. Ensuring compliance and avoiding storage constraints require diligent monitoring and proactive management.
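Where Logrotate does not fit, the same sweep can be scripted. The sketch below compresses logs older than 30 days onto a separate archive mount; the paths are hypothetical:

```python
import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")              # hypothetical source
ARCHIVE_DIR = Path("/mnt/log-archive/myapp")  # e.g. a NAS mount
CUTOFF = time.time() - 30 * 24 * 3600         # 30 days ago

ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
for path in LOG_DIR.glob("*.log"):
    if path.stat().st_mtime < CUTOFF:
        target = ARCHIVE_DIR / (path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # free disk space on the log aggregator
```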
Conclusion:
The faster you can debug and resolve an issue, the tighter the SLAs and SLOs you can commit to your customers, and sales grow with the number of 9s in that SLA.
For faster debugging you need a one-stop station of human-readable, well-formatted logs: in short, a robust centralised logging system that is easy to operate, stable, scalable, secure, and cost-effective.
Choosing between cloud-based and on-premise centralized logging depends on various factors, including scalability, control, cost, and regulatory requirements. Cloud-based solutions offer ease of use, scalability, and integrated services, making them suitable for organizations looking for a hassle-free, scalable logging setup. On-premise solutions, while offering greater control and customization, come with higher complexity and cost, suitable for organizations with specific security and compliance needs.
Both approaches can provide a one-stop station for traceability, standardised log formats, effective log aggregation, and robust log rotation and retention policies. The choice ultimately depends on the organisation's priorities and resources.
If you like this article, I am sure you will find 10-Factor Infrastructure even more useful. It compiles all these tried and tested methodologies, design patterns & best practices into a complete framework for building secure, scalable and resilient modern infrastructure.
If you like this article do like 👍 and share ♻ it in your network and follow Kamalika Majumder for more.
Thanks & Regards
Kamalika Majumder