Centralised Logging: Log Better, Debug Faster, Be Safer

Mar 15, 20243 min read

If you ask me what is the best place to get more insights to infra, my answer would be from infrastructure issues. I spend a good first 5 years of my career supporting and solving infra issues be it network or server. In those days organisations used to heavily depend on IT support teams, they still do, though it has transformed to present day tech support or cloud support or SRE.

Unfortunately, even today, when it comes to infra, its reactive approach to solve these issues has been the norm.

Although proper analysis of incidents with proactive measures can prevent them but they get ignored since they are considered non-functional or operational .

These come to notice when businesses start losing customers to their competitors, review their annual performance and identify its due to lower QoS.

The faster you can respond and solve any issue the better is your QoS and the more trust you establish with your customers.

Let me ask you something:

How many hours do you spend every month on production issues ?
How many downtimes have you experienced and how long did you spend to bring your systems back up ?

Before I continue further, spend the next few seconds to find out how long those so-called non-productive hours are. So let's rewind your memory and start noting the downtimes.

Why logging is a must?

For end to end observability and faster debugging.
To perform root cause analysis(RCA) and postmortem of an incident and take preventive measures.
It ensures better commitment to service availability increasing the number of 9s in your SLAs.

In short it helps identify the fault lines in your infra and help to build a fault tolerant service.

There are some common challenges that you might encounter while logging such as:

Missing Logs
Format Issue
Too many tools
Too much logs
Log corruption

Heres how you can mitigate these issues and avoid any information loss while logging.

What should be logged ?

Logging must be enabled to build observability for both infrastructure and applications.

It is a must for critical infrastructure layers like network, system, storage, to monitor health, performance, incidents, configuration and access management.

Unless you log, how will you know why the last deployment failed, if it was because your servers were unhealthy or an application bug.

Likewise, log collectors must be enabled by default while developing applications to build observability and traceability in performance.

In addition to the standard system or application logging, audit logging is a must for observing secured activities such as:

User creation or deletion.
Multiple failed logins, or
Too many access errors etc.

These audit logs are a must during compliance and regulatory audits.

Once logging is enabled it is essential to preserve them properly, sometimes for years depending on what kind of security standards your organisation has to comply. This is true not only for real time logging but also for archived ones. To avoid the common issues seen in logging as mentioned earlier you need "Centralised Log Aggregation".