SRE Up to Speed. Data Collection. Part 1

When I moved into the SRE space from software engineering two years ago, I quickly realized that I lacked the troubleshooting depth of a sysadmin. Had I known this earlier, I would have resolved incidents more efficiently, more predictably and with more certainty. My strengths were in software and system design. In the cloud-native infrastructure space, the ideal contributor combines the best of both worlds: system administration and system design.

To close this gap, I decided to first identify the most crucial system administration and troubleshooting techniques. I am sharing what I learned here to save time for software engineers venturing into the infra/SRE space.

I have classified the skill set into three categories:

  1. Data Collection.

Data Collection

When a failure occurs, data collection means finding out the current state of the system. It is done by identifying the scope and then collecting all the data relevant to that scope. If required, bundle the data and transport it to your local system.

Let's define a failure: an undesirable state of a system or an application. For the scope of this article, an application is any user software and the system is the OS (Unix/Linux).

An operating system supports a running application by managing its life cycle. The life cycle begins with reading some initial configuration, continues with a running process that has state and writes its internal events to log files, and ends when the process exits with success or failure.

By design, a Unix/Linux file system organizes the items mentioned above in separate directory hierarchies.

  1. Configurations in /etc



The /etc hierarchy contains host specific configuration files. A "configuration file" is a local file used to control the operation of a program; it must be static and cannot be an executable binary.

A configuration file can be a static text file in any format, from key-value pairs to JSON.


Linux systems that follow the XDG Base Directory convention keep user-specific configuration files in a .config directory in the user's home directory. These files are owned by that user.
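To make the two locations concrete, here is a minimal sketch that prints whichever configuration directories exist for an application. The function name config_dirs and the assumption that an application keeps its files in a directory named after itself are mine, not a rule; many applications use a single file directly under /etc instead.

```shell
#!/bin/sh
# Print the configuration directories that exist for a given application.
# $1: application name (hypothetical example: nginx)
# $2: system configuration root, defaults to /etc
config_dirs() {
    app="$1"
    etc_root="${2:-/etc}"
    # XDG_CONFIG_HOME falls back to ~/.config when it is not set.
    user_root="${XDG_CONFIG_HOME:-$HOME/.config}"
    for dir in "$etc_root/$app" "$user_root/$app"; do
        # Only report locations that are actually present on this host.
        [ -d "$dir" ] && printf '%s\n' "$dir"
    done
    return 0
}

config_dirs nginx
```

If neither directory exists, the function prints nothing, which is itself a useful hint that the application is configured somewhere else.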



The directory /var/log contains miscellaneous log files. Most logs are written to this directory or an appropriate sub-directory.

Taken from

Some of the most important Linux system logs include:

  • /var/log/syslog and /var/log/messages store all global system activity data, including startup messages. Debian-based systems like Ubuntu store this in /var/log/syslog, while Red Hat-based systems like RHEL or CentOS use /var/log/messages.

Application developers can choose to write their logs to standard output or to a file. The path to the file is usually provided in the configuration.
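Since the global system log has a different name on Debian-based and Red Hat-based systems, a small helper can pick whichever one is present. This is only a sketch: system_log is a hypothetical name, and the optional argument exists so the function can be pointed at a directory other than /var/log.

```shell
#!/bin/sh
# Print the path of the global system log under a log root.
# $1: log root, defaults to /var/log
system_log() {
    log_root="${1:-/var/log}"
    # syslog is the Debian/Ubuntu name, messages the RHEL/CentOS name.
    for name in syslog messages; do
        if [ -f "$log_root/$name" ]; then
            printf '%s\n' "$log_root/$name"
            return 0
        fi
    done
    return 1   # neither file exists on this host
}

if log_file=$(system_log); then
    echo "system log: $log_file"
fi
```

From there, something like tail -n 50 "$log_file" shows the most recent entries, although reading the file may require elevated permissions.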


The proc filesystem is the de-facto standard Linux method for handling process and system information.

For each process on a Linux system, there is a corresponding directory in the /proc file system.

Expect the number of numeric (PID) directories under /proc to be nearly equal to the number of processes listed by ps.

ls -d /proc/[0-9]* | wc -l
ps -ef | wc -l

The state of a process is organized in distinct files. A list of these files and their contents is given below.

Files under each process directory. Taken from

More information here.

The status file under the process directory contains status information in human-readable form. In fact, ps gets its data from the files in this directory, such as stat and status.
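As a quick illustration of reading the status file by hand, the sketch below extracts the State value for one PID. The function name proc_state and the optional second argument (an alternative /proc root) are my own additions for illustration.

```shell
#!/bin/sh
# Print the State value of a process from its status file.
# $1: PID, defaults to the PID of the current shell ($$)
# $2: proc root, defaults to /proc
proc_state() {
    pid="${1:-$$}"
    proc_root="${2:-/proc}"
    # Each line of status looks like "Key:<tab>value"; keep the State value.
    awk -F':[ \t]*' '$1 == "State" { print $2 }' "$proc_root/$pid/status"
}

# Inspect the current shell, but only where /proc is available.
if [ -r "/proc/$$/status" ]; then
    proc_state
fi
```

The value is a one-letter code followed by a description in parentheses, which is the same state letter that ps shows in its STAT column.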

Example (nginx)

Let's collect configuration, state and log data for an application, in our case nginx.


nginx writes its access and error logs under the nginx directory in /var/log. Logs are rotated: the most recent entries stay in access.log, and older files get a numeric suffix, with the highest number being the oldest, e.g. access.log.2.

Paths for all access logs can be collected with

find /var/log/nginx | grep -E 'access\.log'

which prints one path per line, for the current access.log and for each rotated copy.

You can pipe the output to an archiver such as zip or tar, and then copy the archive to your local machine (for example with scp) for later analysis.
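The bundling step could look roughly like the sketch below. It uses tar rather than zip and relies on the -T option (read file names from standard input), which GNU tar and bsdtar provide; the function name bundle_logs and the example paths in the comments are assumptions for illustration.

```shell
#!/bin/sh
# Bundle every file whose name starts with a prefix into a gzip-compressed tarball.
# $1: log directory (e.g. /var/log/nginx)
# $2: file name prefix (e.g. access.log)
# $3: path of the archive to create
bundle_logs() {
    log_dir="$1"
    prefix="$2"
    archive="$3"
    # List the files relative to the log directory, then let tar read that
    # list from stdin. -C keeps the paths inside the archive relative too.
    ( cd "$log_dir" && find . -type f -name "$prefix*" | sort ) |
        tar -czf "$archive" -C "$log_dir" -T -
}

# Hypothetical usage:
#   bundle_logs /var/log/nginx access.log /tmp/nginx-access.tar.gz
```

The resulting archive can then be copied off the host with scp or a similar tool. File names containing newlines would break the list passed to tar, which is rarely a concern for log files.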


Let's see how many processes of the nginx daemon are sleeping or running. We will find all nginx processes, read the State attribute from the status file of each one, and count the values.

ps -ef | grep -E '[n]ginx' | awk '{print "/proc/" $2}' | xargs -I{} find {} -maxdepth 1 -name status 2>/dev/null | xargs cat | grep '^State:' | sort | uniq -c

This prints one line per distinct State value, prefixed with the number of nginx processes in that state.
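The same count can also be obtained without ps, by scanning the status files directly. The following is a sketch under a few assumptions: count_states is a hypothetical name, the Name field in status appears before the State field (which is the case on the kernels I have looked at), and Name is truncated to 15 characters by the kernel, so longer command names need a truncated argument.

```shell
#!/bin/sh
# Count processes per state for a command name by scanning status files.
# $1: command name as shown in the Name field (e.g. nginx)
# $2: proc root, defaults to /proc
count_states() {
    name="$1"
    proc_root="${2:-/proc}"
    # Processes can exit while we scan, so read errors are silenced.
    awk -F':[ \t]*' -v want="$name" '
        FNR == 1                 { matched = 0 }            # new status file
        $1 == "Name"             { matched = ($2 == want) } # is it our command?
        $1 == "State" && matched { count[$2]++ }
        END { for (s in count) print count[s], s }
    ' "$proc_root"/[0-9]*/status 2>/dev/null | sort
}

count_states nginx
```

Matching on the Name field is stricter than grepping the ps output, since it does not pick up unrelated processes that merely have nginx somewhere in their command line.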


nginx organizes its configuration in files with a .conf suffix.

find /etc/nginx/ -maxdepth 1 | grep -E '\.conf$'

This lists the top-level configuration files which, as mentioned before, you can read for analysis or archive and ship for later.


A Unix/Linux file system organizes configuration, state and logs under the /etc, /proc and /var/log hierarchies respectively. As an SRE, your heuristics should be to

  1. Read up on the naming convention your application uses to organize its configuration and logs.
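Putting the three locations together, a collection helper might look like the sketch below. It assumes the application keeps its files in directories named after itself under /etc and /var/log, as nginx does; collect_app and its optional root arguments are hypothetical and exist mainly so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Collect configuration, logs and process state for one application into a
# single directory that can be archived and shipped.
# $1: application name      $2: output directory
# $3, $4, $5: optional roots, defaulting to /etc, /var/log and /proc
collect_app() {
    app="$1"
    out="$2"
    etc_root="${3:-/etc}"
    log_root="${4:-/var/log}"
    proc_root="${5:-/proc}"
    mkdir -p "$out/config" "$out/logs" "$out/state"

    # Configuration and logs: copy the application's directory if it exists.
    [ -d "$etc_root/$app" ] && cp -R "$etc_root/$app/." "$out/config/"
    [ -d "$log_root/$app" ] && cp -R "$log_root/$app/." "$out/logs/"

    # State: save the status file of every process whose Name matches.
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        if grep -q "^Name:[[:space:]]*$app\$" "$status"; then
            pid=$(basename "$(dirname "$status")")
            cat "$status" > "$out/state/$pid.status"
        fi
    done
    return 0
}

# Hypothetical usage (reading logs usually needs elevated permissions):
#   collect_app nginx /tmp/nginx-snapshot
```

The application name is used as part of a regular expression here, so names containing characters such as a dot would need escaping.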

Part 2 of data collection describes regular expressions in detail.



Bikes, Tea, Sunsets in that order. Software Engineer who fell in love with cloud-native infrastructure.
