SRE Up to Speed. Data Collection. Part 1

Nov 4, 2022

When I moved into the SRE space from software engineering two years ago, I quickly realized that I lacked the troubleshooting depth of a sysadmin. Had I known this earlier, I would have resolved incidents more efficiently, more predictably and with more certainty. My strengths were in software and system design. In the cloud-native infrastructure space, the ideal contributor combines the best of both worlds: system administration and system design.

To close that gap, I first set out to identify the most crucial system administration and troubleshooting techniques. I am sharing what I learned here to save time for software engineers venturing into the infra/SRE space.

I have classified the skill set into three categories:

Data Collection.
Data formatting and analysis.
Network and OS monitoring.

Data Collection

When a failure occurs, data collection means finding out the current state of the system. It is done by identifying the scope of the failure and then collecting all the data relevant to that scope. If required, bundle the data and transport it to your local system.

Let's define a failure: an undesirable state of a system or an application. For the scope of this article, an application is any user software and the system is the OS (Unix/Linux).

An operating system supports a running application by managing its life cycle. The life cycle begins with reading some initial configuration, continues with a running process that has state and writes its internal events to log files, and ends when the process exits with success or failure.

By design, a Unix/Linux file system organizes the above-mentioned items in separate directory hierarchies:

Configurations in /etc
Logs in /var/log
State in /proc

Configurations

Host

The /etc hierarchy contains host specific configuration files. A "configuration file" is a local file used to control the operation of a program; it must be static and cannot be an executable binary.

A configuration file can be a static text file in any format, from key-value pairs to JSON.

User

On Linux systems that follow the XDG Base Directory specification, user-specific configuration files are placed in the .config folder in the user's home directory. These files belong to that specific user only.
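As a minimal sketch of locating that folder: the XDG specification lets a user override the location through the XDG_CONFIG_HOME environment variable and falls back to $HOME/.config when it is unset or empty. The function name user_config_dir is made up for this illustration.

```shell
# Resolve the per-user configuration directory:
# XDG_CONFIG_HOME if it is set and non-empty, otherwise $HOME/.config.
# (user_config_dir is a hypothetical helper name, not a standard command.)
user_config_dir() {
  printf '%s\n' "${XDG_CONFIG_HOME:-$HOME/.config}"
}

# List what is in there, with owners, if the directory exists.
ls -la "$(user_config_dir)" 2>/dev/null || echo "no user config directory found"
```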

Logs

Host

The directory /var/log contains miscellaneous log files. Most logs are written to this directory or an appropriate sub-directory.

Taken from loggly.com

Some of the most important Linux system logs include:

/var/log/syslog and /var/log/messages store all global system activity data, including startup messages. Debian-based systems like Ubuntu store this in /var/log/syslog, while Red Hat-based systems like RHEL or CentOS use /var/log/messages.
/var/log/auth.log and /var/log/secure store all security-related events such as logins, root user actions, and output from pluggable authentication modules (PAM). Ubuntu and Debian use /var/log/auth.log, while Red Hat and CentOS use /var/log/secure.
/var/log/kern.log stores kernel events, errors, and warning logs, which are particularly helpful for troubleshooting custom kernels.
/var/log/cron stores information about scheduled tasks (cron jobs). Use this data to verify your cron jobs are running successfully.

Application developers can choose to write their logs to standard output or to a file. The path to the file is usually provided in the configuration.
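Since the file names differ between distribution families, a small helper can pick whichever one exists on the host at hand. This is only a sketch; first_readable is a hypothetical function name, and the candidate paths are the ones listed above.

```shell
# Print the first file from the argument list that exists and is readable.
# Returns non-zero when none of the candidates qualifies.
first_readable() {
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Debian/Ubuntu use syslog and auth.log; RHEL/CentOS use messages and secure.
sys_log=$(first_readable /var/log/syslog /var/log/messages) || sys_log=""
auth_log=$(first_readable /var/log/auth.log /var/log/secure) || auth_log=""

echo "system log: ${sys_log:-not found or not readable by this user}"
echo "auth log:   ${auth_log:-not found or not readable by this user}"
```

The auth logs are often readable by root only, so an empty result may simply mean the command needs elevated privileges.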

State

The proc filesystem is the de-facto standard Linux method for handling process and system information.

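For each process on a Linux system, there is a corresponding directory in the /proc file system, named after its process ID.

Expect the number of entries under /proc and the number of processes listed by ps to be nearly equal. /proc also holds a few dozen non-process entries such as cpuinfo and meminfo, so the first count is slightly higher.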

ls -l /proc | wc -l ≈ ps -ef | wc -l
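For a tighter comparison, one option is to count only the entries whose names are purely numeric, since those are the per-process directories. The sketch below assumes a Linux host with ps available; count_pid_dirs is a hypothetical helper name.

```shell
# Count the entries of a directory whose names consist of digits only.
# In /proc, each such entry is the directory of one process ID.
count_pid_dirs() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # function usable in scripts that run with "set -e".
  ls "$1" 2>/dev/null | grep -c -E '^[0-9]+$' || true
}

proc_dirs=$(count_pid_dirs /proc)
# "-o pid=" prints only the PID column and suppresses the header line.
ps_count=$(ps -e -o pid= 2>/dev/null | wc -l)

echo "process directories in /proc: $proc_dirs"
echo "processes reported by ps:     $ps_count"
```

The two numbers can still differ by one or two, because processes start and exit between the two commands.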

The state of a process is organized in distinct files. A list of these files and their content is given below.

Files under each process directory. Taken from kernel.org

More information here.

The status file in the process directory contains status information in human-readable form. In fact, ps reads its information from the files in this directory.
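To get a feel for the format, here is a minimal sketch that prints a few fields from a status file. Name, State, Pid and PPid are real field names in /proc/[pid]/status; the helper name status_fields is made up for this example.

```shell
# Print a handful of fields from a process status file.
# Each line in the file has the form "Key:<tab>value".
status_fields() {
  grep -E '^(Name|State|Pid|PPid):' "$1"
}

# $$ expands to the PID of the current shell, so this shows the shell itself.
if [ -r "/proc/$$/status" ]; then
  status_fields "/proc/$$/status"
fi
```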

Example (nginx)

Let's collect configuration, state and log data for an application, in our case nginx.

Logs

nginx writes its access and error logs under the /var/log/nginx directory. Logs are rotated: the most recent entries are kept in access.log, and older files get a numeric suffix, with the highest number being the oldest, e.g. access.log.2

The paths of all access logs can be collected with

find /var/log/nginx | grep -E 'access\.log' which will print something like

You can pipe the output to zip and then copy the archive to your local machine (for example with scp) for later analysis.
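The bundling step could look like the sketch below. It uses tar with gzip instead of zip, purely because tar is present on almost every Linux host. The function name bundle_logs and the output path are hypothetical, and the sketch assumes file names without newlines.

```shell
# Archive every file under a directory whose name starts with a prefix.
# Arguments: source directory, file name prefix, absolute path of the archive.
bundle_logs() {
  src_dir=$1
  prefix=$2
  out=$3
  # Run in a subshell so the cd does not leak; paths inside the archive
  # become relative to src_dir. "-T -" makes tar read the file list from stdin.
  (
    cd "$src_dir" || exit 1
    find . -type f -name "${prefix}*" | tar -czf "$out" -T -
  )
}

# Example call (the output path is an assumption, pick your own):
#   bundle_logs /var/log/nginx access.log /tmp/nginx-access.tar.gz
# Then, from your local machine (user and host are placeholders):
#   scp user@host:/tmp/nginx-access.tar.gz .
```

Rotated logs that are already compressed (for example access.log.2.gz) are picked up as well, since only the prefix is matched.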

State

Let's see how many processes under the nginx daemon are sleeping or running. We will find all nginx processes, read the State attribute from the status file of each, and count the occurrences.

ps -ef | grep -E 'nginx' | awk '{print "/proc/"$2}' | xargs -I{} find {} -maxdepth 1 -name status 2>/dev/null | xargs cat | grep 'State' | sort | uniq -c

Should get us an output like

Configuration

nginx organizes its configuration in files with the .conf suffix.

find /etc/nginx/ -maxdepth 1 | grep -E '\.conf$'

should output

which, as mentioned before, you can read for analysis or archive and ship for later.
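Configuration is often split across sub-directories as well (conf.d is a common convention, though the layout depends on how nginx was packaged). A sketch that searches at any depth, with list_conf_files as a hypothetical helper name:

```shell
# List regular files ending in .conf under a directory, at any depth.
# -name takes a shell glob, so the dot is literal here and needs no escaping.
list_conf_files() {
  find "$1" -type f -name '*.conf' | sort
}

list_conf_files /etc/nginx 2>/dev/null
```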

Conclusion

A Unix/Linux file system organizes configuration, state and logs under /etc, /proc and /var/log respectively. As an SRE, your heuristics should be to:

Read up on the naming convention by which your application organizes its configuration and logs.
Find those files, or specific content in them, by searching the respective directories using regular expressions and pattern matching.
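The second heuristic, searching for content rather than file names, can be sketched with a recursive grep. The helper name search_tree is made up; error_log and access_log are the nginx directives that name its log files, which ties the configuration back to the logs.

```shell
# Print "file:line:match" for every line under a directory that matches
# an extended regular expression.
search_tree() {
  grep -rnE -- "$1" "$2" 2>/dev/null
}

# Where does this nginx installation say it writes its logs?
search_tree 'error_log|access_log' /etc/nginx || echo "no matches found"
```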

Part 2 of data collection describes regular expressions in detail.