SRE Up to Speed. Data Collection. Part 2. Regular Expression
Continued from Part 1, where I describe why every software engineer looking to become an SRE must know sysadmin skills.
I have classified the skill set into three categories:
- Data Collection.
- Data formatting and analysis.
- Network and OS monitoring.
Data Collection
When something fails, the first job is to find out the current state of the system. You do that by identifying the scope and then collecting all the data relevant to that scope. If required, bundle the data and transport it to your local system.
In practice this means finding log files, state files, configurations, or patterns in any file. You can't go far without regular expressions: awk, find, sed, and grep all aid data collection and support regular expressions.
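To make the collection step concrete, here is a minimal sketch in Python of what find and grep do with a pattern. The function names find_files and grep_file are hypothetical helpers made up for this illustration, not part of any tool mentioned above.

```python
import os
import re


def find_files(root, name_pattern):
    """Yield paths under root whose file name matches name_pattern (roughly what find does)."""
    name_re = re.compile(name_pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name_re.search(name):
                yield os.path.join(dirpath, name)


def grep_file(path, line_pattern):
    """Return the lines of one file that match line_pattern (roughly what grep does)."""
    line_re = re.compile(line_pattern)
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if line_re.search(line)]
```

Calling find_files with a directory and a pattern for file names, then grep_file on each result, narrows a whole directory tree down to the few lines that matter. The constructs used to write such patterns are covered in the rest of this part.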
Regular Expression
A sequence of characters that describes a pattern in text.
Finding a pattern in text is a problem that has existed since the 1950s. For example,
in a protocol handshake, the client sends an init request and the server sends back an acknowledgement. Both the init and the acknowledgement follow a structured pattern; without recognizing that pattern, the handshake wouldn't be possible.
To look for a pattern, you have to describe it to the system. Describing a large and dynamic pattern is cumbersome. That's where regular expressions come in.
Think of a regular expression as a combination of two things:
- Compression, where, as with a dictionary, large text is substituted with single characters or constructs, packing more information into each character.
- Templates that describe the occurrence of characters dynamically.
Here are the constructs:
Meta Characters:
*
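Matches the preceding character or group 0 or more times. (Compression)
For example: access.log and access.1.log can both be represented as .*\.log, where .* stands for any run of characters and \. is a literal dot. (A shell glob, such as the one find -name takes, writes this as *.log, because there * on its own stands for any run of characters.)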
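The two readings of * are easy to mix up, so here is a minimal sketch contrasting them, assuming Python's standard fnmatch module for the shell-glob reading and the re module for the regular-expression reading. The file names are the ones from the example; access.txt is added only as a non-matching case.

```python
import fnmatch
import re

names = ["access.log", "access.1.log", "access.txt"]

# Glob reading: * on its own stands for any run of characters.
glob_hits = [n for n in names if fnmatch.fnmatch(n, "*.log")]

# Regex reading: * repeats the element before it, so "any run of
# characters" is written .* and the literal dot is escaped as \.
log_name = re.compile(r".*\.log")
regex_hits = [n for n in names if log_name.fullmatch(n)]

# * allows zero repetitions: ab*c accepts "ac" as well as "abbbc".
zero_or_more_b = re.compile(r"ab*c")
star_hits = [s for s in ["ac", "abc", "abbbc", "adc"] if zero_or_more_b.fullmatch(s)]

print(glob_hits)
print(regex_hits)
print(star_hits)
```

Both approaches select the same two log files here; the difference is only in how the pattern is spelled.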
+
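Matches the preceding character or group 1 or more times. (Compression)
For example: finding IP addresses in which a number before a dot, such as the network ID, is not a single digit (brackets are described under Brackets and Classes).
[0-9][0-9]+\.
will find a match in 192.168.11.11 but not in 1.1.1.1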
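As a sketch of the "not a single digit" idea, the Python snippet below checks whether the first number of an address has at least two digits. The pattern and the helper name first_number_is_multi_digit are assumptions chosen for illustration: one digit, then + applied to a second digit class, then a literal dot.

```python
import re

# A digit, then one or more further digits, then a literal dot.
multi_digit_prefix = re.compile(r"[0-9][0-9]+\.")


def first_number_is_multi_digit(ip):
    """True if the address starts with a number of two or more digits."""
    # re.match only looks at the beginning of the string.
    return multi_digit_prefix.match(ip) is not None


for ip in ["192.168.11.11", "10.0.0.1", "1.1.1.1", "1.168.0.1"]:
    print(ip, first_number_is_multi_digit(ip))
```

The + needs something to repeat: here it repeats the second [0-9], which is what forces at least two digits in total.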
?
Matches the preceding character or group 0 or 1 time.
Finding plurals. tables?
will match both table and tables.
.
Matches any single character, exactly once.
Use it where some character must be present, whatever it is. For example, collecting all files in a log rotation except the latest. access...log
will match access.1.log, access.2.log but will skip access.log
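A short Python sketch of ? and . follows, using the re module. Note that ? applies to the character right before it, so the optional s has to be written out: tables? makes the s optional, while table? makes the e optional. The extra test strings (tablet, tabl, accessXYZlog, access.10.log) are made up for illustration.

```python
import re

plural = re.compile(r"tables?")        # the trailing s is optional
optional_e = re.compile(r"table?")     # here the e is optional instead
rotated = re.compile(r"access...log")  # exactly three arbitrary characters

plural_hits = [w for w in ["table", "tables", "tablet"] if plural.fullmatch(w)]
optional_e_hits = [w for w in ["tabl", "table", "tables"] if optional_e.fullmatch(w)]

candidates = ["access.log", "access.1.log", "access.2.log", "access.10.log", "accessXYZlog"]
rotated_hits = [f for f in candidates if rotated.fullmatch(f)]

print(plural_hits)
print(optional_e_hits)
print(rotated_hits)
```

Because . accepts any character, accessXYZlog is matched too; escaping the dots as \. would restrict the pattern to literal dots.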
Grouping
Matches a sub-sequence described in parentheses. The regular expression treats all characters inside the parentheses as a single entity to match in the provided sequence.
(access)
will match every text that has the keyword access in it.
But you can also use operators in a group.
(access|error)
will match every text that has either access or error in it. Do not put spaces around the |, because a space inside a group is matched literally.
Lastly, a group can be used to match multiple successive occurrences as well.
(abc){2}
will match abcabc.
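Here is a small Python sketch of grouping, alternation, and repetition of a group, assuming the re module. The file names are invented for illustration. It also shows why spaces around | matter: inside a group they are literal characters.

```python
import re

either = re.compile(r"(access|error)")
spaced = re.compile(r"(access | error)")  # matches "access " or " error"
repeated_group = re.compile(r"(abc){2}")  # the whole group, twice
repeated_char = re.compile(r"abc{2}")     # only the c, twice

files = ["access.log", "error.log", "debug.log"]
either_hits = [f for f in files if either.search(f)]
spaced_hits = [f for f in files if spaced.search(f)]

print(either_hits)
print(spaced_hits)
print(bool(repeated_group.fullmatch("abcabc")), bool(repeated_group.fullmatch("abcc")))
print(bool(repeated_char.fullmatch("abcabc")), bool(repeated_char.fullmatch("abcc")))
```

Without the parentheses, {2} applies only to the last character, which is why abc{2} accepts abcc rather than abcabc.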
Brackets and Classes
Instead of the any-character wildcard, if you want a character to match a specific set or range, you describe it in brackets. A range can be built from letters (lower and upper case are distinct), digits, and special characters.
[abc]
will match a or b or c.
(access)\.[12]\.log
will match access.1.log and access.2.log
Regex also has some built-in shortcuts that can be used.
For example \w+\s\w+
will match any sequence of two words separated by a space, where a word is a run of letters, digits, or underscores. For example Hello World
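The sketch below tries brackets and the \w and \s shortcuts in Python's re module. The file access.3.log is a made-up non-matching case. A single \w stands for one word character only, so a whole word needs a quantifier such as +.

```python
import re

rotation = re.compile(r"(access)\.[12]\.log")
two_words = re.compile(r"\w+\s\w+")
two_chars = re.compile(r"\w\s\w")  # one word character on each side

files = ["access.1.log", "access.2.log", "access.3.log"]
rotation_hits = [f for f in files if rotation.fullmatch(f)]

print(rotation_hits)
print(two_words.fullmatch("Hello World") is not None)
# Without +, only the characters around the space are matched.
print(two_chars.search("Hello World").group())
```

The last line prints the three-character slice "o W", which shows how much of the text \w\s\w actually covers.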
Back Reference
A back reference matches the same text that an earlier group in the regular expression has already matched. Groups are numbered from left to right by the position of their opening parenthesis, and \1, \2 and so on refer to the text captured by group 1, group 2 and so on.
Imagine markup tags.
<i> Some Text </i> , <head> Some Header </head>
Here both i and head occur twice. Instead of writing the group twice, we can refer its second occurrence back to the first.
<([a-z]+)>.*</\1>
Here \1
is the back reference to the ([a-z]+) group, and .* stands for the text between the tags. Using back references we can compress our regular expressions even further. Back references can be made to as many groups as needed by incrementing the reference number. For example: <div><i> Some Test </i> </div> can be matched with
<([a-z]+)><([a-z]+)>.*</\2>\s*</\1>
Here \1 refers to div and \2 refers to i. The inner tag closes first, so \2 appears before \1.
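A minimal Python sketch of back references on the markup examples follows, assuming the re module. Groups are numbered left to right, so in the nested case the outer tag is group 1 and the inner tag is group 2. The patterns are one possible way to write this and are not meant as a general HTML parser.

```python
import re

# \1 must repeat exactly what the first group captured.
tag_pair = re.compile(r"<([a-z]+)>.*</\1>")

# Outer tag is group 1, inner tag is group 2; the inner tag closes first.
nested_pair = re.compile(r"<([a-z]+)><([a-z]+)>.*</\2>\s*</\1>")

italic = tag_pair.fullmatch("<i> Some Text </i>")
header = tag_pair.fullmatch("<head> Some Header </head>")
mismatch = tag_pair.fullmatch("<i> Some Text </b>")
nested = nested_pair.fullmatch("<div><i> Some Test </i> </div>")

print(italic.group(1), header.group(1))
print(mismatch)
print(nested.groups())
```

The mismatched closing tag fails because \1 only accepts the text captured by the opening tag.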
Anchor and Boundary
Anchor $
after a character or group, matches only text that ends with it.
(\.log)$
will match all file names that end in .log. It will skip log.text
Boundary ^
before a character or group, matches only text that starts with it.
^(apache).*(log)$
will match all text that starts with apache and ends with log.
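The following Python sketch exercises both anchors with the re module. The file names are invented for illustration; catalog is included to show why the dot in front of log should be escaped when a literal dot is meant.

```python
import re

ends_with_dot_log = re.compile(r"(\.log)$")
unescaped = re.compile(r"(.log)$")  # . accepts any character before log
apache_log = re.compile(r"^(apache).*(log)$")

names = ["access.log", "log.text", "apache_access.log", "apache.conf", "catalog"]

end_hits = [n for n in names if ends_with_dot_log.search(n)]
unescaped_hits = [n for n in names if unescaped.search(n)]
apache_hits = [n for n in names if apache_log.search(n)]

print(end_hits)
print(unescaped_hits)
print(apache_hits)
```

With search, the pattern may start anywhere in the text, so the anchors are what pin it to the beginning and the end.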
Conclusion
Now you’re equipped to quickly find files or logs that match a pattern.
The next step is to know how a UNIX system organizes log, configuration, and state files. Stay tuned for Part 3.