
If you ask me what the best place is to get deeper insights into infra, my answer would be: from infrastructure issues. I spent a good first five years of my career supporting and solving infra issues, be it network or server. In those days organisations depended heavily on IT support teams; they still do, though the role has transformed into present-day tech support, cloud support, or SRE.
Unfortunately, even today, a reactive approach to solving infra issues remains the norm.
Although proper analysis of incidents, combined with proactive measures, can prevent them, these measures get ignored because they are considered non-functional or merely operational.
They come to notice only when businesses start losing customers to their competitors, review their annual performance, and identify that it is due to lower quality of service (QoS).
The faster you can respond to and resolve an issue, the better your QoS and the more trust you establish with your customers.
Let me ask you something:
How many hours do you spend every month on production issues?
How many downtimes have you experienced, and how long did it take to bring your systems back up?
Before I continue further, spend the next few seconds working out how long those so-called non-productive hours add up to. So rewind your memory and start noting the downtimes.
Why is logging a must?
For end-to-end observability and faster debugging.
To perform root cause analysis (RCA) and postmortems of incidents, and take preventive measures.
It ensures better commitment to service availability, increasing the number of 9s in your SLAs.
In short, it helps identify the fault lines in your infra and build a fault-tolerant service.
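To put those 9s in perspective, here is a quick back-of-the-envelope calculation (my own illustration, not from the article) of how much downtime each availability level leaves you per year:

```python
# Allowed downtime per year implied by an SLA availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by a given SLA percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> {allowed_downtime_minutes(sla):.1f} min/year")
```

Roughly: two 9s allow over 87 hours of downtime a year, three 9s under 9 hours, four 9s under an hour. Every extra 9 you can commit to is a direct result of how fast you can detect and debug, which is where logging comes in.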
There are some common challenges that you might encounter while logging such as:
Missing logs
Format issues
Too many tools
Too many logs
Log corruption
Here's how you can mitigate these issues and avoid information loss while logging.
What should be logged?
Logging must be enabled to build observability for both infrastructure and applications.
It is a must for critical infrastructure layers like network, system, and storage, to monitor health, performance, incidents, configuration, and access management.
Unless you log, how will you know why the last deployment failed, whether it was because your servers were unhealthy or due to an application bug?
Likewise, log collectors must be enabled by default while developing applications to build observability and traceability into performance.
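As an illustration of application-side logging in a collector-friendly format, here is a minimal sketch using Python's standard logging module. It emits one JSON object per line; the field names here are my own choice for the example, not a mandated schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, which collectors
    like Filebeat or Fluentd can parse without fragile regexes."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")   # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("deployment started")  # emits one structured line per event
```

Because every event is a self-describing JSON line, the same pipeline can ingest logs from any service without per-application parsing rules, which is what prevents the "format issues" problem listed above.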
In addition to the standard system or application logging, audit logging is a must for observing secured activities such as:
User creation or deletion.
Multiple failed logins, or
Too many access errors, etc.
These audit logs are a must during compliance and regulatory audits.
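For instance, a simple audit check for repeated failed logins might look like the sketch below. The log line format and the threshold are assumptions for illustration; in practice you would run this over your aggregated auth logs:

```python
from collections import Counter

# Example audit-log lines; the "FAILED_LOGIN user=..." format is an
# assumption made for this illustration, not a standard.
audit_lines = [
    "2024-05-01T10:00:01 FAILED_LOGIN user=alice",
    "2024-05-01T10:00:05 FAILED_LOGIN user=alice",
    "2024-05-01T10:00:09 FAILED_LOGIN user=alice",
    "2024-05-01T10:01:00 LOGIN_OK user=bob",
]

THRESHOLD = 3  # flag users with this many failed logins or more

def suspicious_users(lines, threshold=THRESHOLD):
    """Return users whose failed-login count meets the threshold."""
    failures = Counter(
        line.split("user=")[1]
        for line in lines
        if "FAILED_LOGIN" in line
    )
    return [user for user, count in failures.items() if count >= threshold]

print(suspicious_users(audit_lines))  # ['alice']
```

A check like this only works if the audit events were logged in the first place, which is why audit logging has to be on by default rather than bolted on after an incident.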
Once logging is enabled, it is essential to preserve logs properly, sometimes for years, depending on the security standards your organisation has to comply with. This is true not only for real-time logging but also for archived logs. To avoid the common logging issues mentioned earlier, you need "Centralised Log Aggregation".
Centralised Logging:
It's a one-stop station for traceability.
Use a standard format to keep logs human readable, with standard log collectors like Filebeat, Fluentd, or journald, to prevent missing logs or corruption issues.
Logs must be aggregated and stored in shared storage like NAS or SAN.
Enable regular log rotation and archive logs older than 30 days.
Set a log retention policy in line with your compliance needs.
Most regulators, especially in the banking or financial sector, may require you to retain logs for at least a year or two.
In such a case, archive them and keep them in separate storage. Otherwise you might hit a disk space issue on your log aggregator.
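The archive step can be sketched as below. In production you would typically let a tool like logrotate do this; the paths and the 30-day cut-off here are illustrative assumptions:

```python
import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")         # hypothetical aggregator log directory
ARCHIVE_DIR = Path("/mnt/archive/logs")  # hypothetical separate archive storage
MAX_AGE_DAYS = 30

def archive_old_logs(log_dir=LOG_DIR, archive_dir=ARCHIVE_DIR,
                     max_age_days=MAX_AGE_DAYS):
    """Gzip logs older than max_age_days into archive_dir and delete
    the originals, freeing disk space on the log aggregator."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    archived = []
    for log in log_dir.glob("*.log"):
        if log.stat().st_mtime < cutoff:
            target = archive_dir / (log.name + ".gz")
            with log.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)  # compress into archive storage
            log.unlink()                      # remove original from aggregator
            archived.append(log.name)
    return archived
```

Keeping the archive on storage separate from the aggregator is the design point here: retention for compliance and disk headroom for live ingestion stop competing with each other.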
To Summarise:
The faster you can debug and resolve an issue, the tighter the SLA or SLO you can commit to your customers. Customer sales increase with the number of 9s in the SLA.
For faster debugging you need a one-stop station of human-readable, well-formatted logs: in short, a centralised, robust logging system that is easy to operate, stable, scalable, secure, and cost effective.
Unidentified fault lines in infrastructure can become bottlenecks for software development and delivery. They can seriously hinder the growth of a venture transitioning to a cloud-based infrastructure. The 10-Factor Infrastructure makes this transition seamless, ensuring you get infrastructure right the first time.
If you like this article, I am sure you will find 10-Factor Infrastructure even more useful. It compiles all these tried and tested methodologies, design patterns & best practices into a complete framework for building secure, scalable and resilient modern infrastructure.
Don’t let your best-selling product suffer due to an unstable, vulnerable & mutable infrastructure.
Get Compliance Ready Cloud For Startups & Enterprises
In hours, not months
Thanks & Regards
Kamalika Majumder