Site Reliability Engineering: The Time for Change is Now!

OK… I have a confession to make. Despite my intentions, it appears that I have not been entirely truthful with you. I know that I had planned to discuss and demonstrate GitHub usage for this post, but I just can’t do it. I think it’s time that I take this blog and “change course” and start diving into some real topics and concepts. When I say “real”, I’m referring to the core reason and motivation behind me even starting this blog: being an SRE. Also, let’s be honest, if you want to learn how to use GitHub, go to github.com. Their site has plenty of documentation, and that documentation is very thorough and easy to follow. There’s also a very good chance that GitHub is a more reliable source than I am.

Having said that, I think it’s time to dive into a more relevant and far more exciting topic: Site Reliability Engineering. However, in order to achieve enlightenment, one cannot simply decide to become an SRE. To embark on this quest, one must be in the correct mindset. This can be achieved by asking yourself a simple question…

What the f**k is a Site Reliability Engineer?

Introduction: The World of SRE

Admittedly, you may not actually need to ask yourself that question. Still, what is a Site Reliability Engineer? A Site Reliability Engineer takes a programmatic approach to monitoring, managing, and maintaining systems, applications, and infrastructure. In the short time that I’ve been in this role, I’ve already been all over the board, learning and working with everything from application flow to network and database management. It occurred to me very quickly that this role requires at least a fundamental understanding of the entire spectrum of IT (software, hardware, networking, etc.). I’ve even had to draw on my limited InfoSec knowledge in order to accomplish various tasks. However, the core objective of an SRE is pretty straightforward. Learn every inch of the system, determine its shortcomings, analyze its issues, and, finally, automate a solution for them. In a Windows-based environment, this is where PowerShell really flexes its muscles.

There’s never been a question about the value of GUI-based tools such as SCCM, SSMS, or any of the many other server and AD management tools available. However, if you rely only on GUI-based tools, the truth is that they will never provide the same flexibility and advantages as PowerShell or any other command-line interface. In fact, the vast majority of diagnostic, troubleshooting, or even evaluation tasks can be automated in some fashion. This is where the concepts of being an SRE come into play. One of the most frequently recurring ones, in my case anyway, is being able to identify a problem, determine its root cause, create a plan or procedure to prevent it from occurring or recurring, and then create an automated solution for it.

Maybe it’s just me, but I would be lying to you if I said it was easy to get to that point. In order to really make an impact in your environment, you need to know every inch of it. In my case, this meant hours of sifting through line after line of log files, learning various error codes for a multitude of applications, and learning “every inch” of the infrastructure. This is actually where my experience in the systems world really helped me. When I was offered this job, I was fairly surprised given my background. However, it was made abundantly clear to me that, while I might not have been able to code a fully functioning application from scratch, I do know my way around an infrastructure, at least on the Windows side. When you’ve had plenty of hands-on experience with every component of a given infrastructure, administered all the various Active Directory components, and thoroughly understand the core concepts behind programming, your knowledge of or experience with a particular syntax almost becomes irrelevant. The reason is as simple as it gets. If you don’t know or understand something, you dig in and learn it. With that in mind…

Enter POSH aka Windows PowerShell and the Beginning of an Addiction

Although I’d used it intermittently throughout my IT career, I really did not get into PowerShell until a very absurd and tedious task was given to me. The task involved taking a CSV of computers with mismatched or empty computer descriptions, both local and in Active Directory. My job was to ensure that the local and AD descriptions matched across the board. This posed a few problems for me. First, I was looking at something like 500 computers that needed to be updated. To make matters worse, our infra was garbage and slow. This meant that manually updating a few hundred descriptions would have likely taken days. However, while I was debating how I would tackle this project, or even whether I would do it, a thought popped into my head: “I wonder if I can write a PowerShell script to do it”. A little less than two hours later, I had “achieved enlightenment” and had created a very ugly and basic script, but it did work. It’s a pretty simple concept for me now, but back then I thought I was entering the matrix for the first time. The script imported the CSV and, for each machine, copied the local description to AD or vice versa. I was able to knock out the entire list in less than 15 minutes.
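
For anyone curious what a script like that can look like, here is a minimal sketch. It is not the original script. The function names, the ComputerName column in the CSV, and the rule that a non-empty local description wins over the AD one are all assumptions made for illustration. The AD cmdlets come from the ActiveDirectory module (RSAT), and the local description is read from Win32_OperatingSystem.

```powershell
# Hypothetical helper: decide which direction to copy a description.
# Assumption (not from the original script): a non-empty local description wins.
function Get-DescriptionSyncAction {
    param(
        [string]$LocalDescription,
        [string]$ADDescription
    )
    if ($LocalDescription -eq $ADDescription) { return 'None' }
    if (-not [string]::IsNullOrWhiteSpace($LocalDescription)) { return 'LocalToAD' }
    if (-not [string]::IsNullOrWhiteSpace($ADDescription)) { return 'ADToLocal' }
    return 'None'
}

# Hypothetical driver: walk the CSV and bring both descriptions in line.
# Assumes the CSV has a 'ComputerName' column and that the ActiveDirectory
# module is available on the machine running the script.
function Sync-ComputerDescription {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)]
        [string]$CsvPath
    )
    foreach ($row in Import-Csv -Path $CsvPath) {
        $name = $row.ComputerName
        try {
            # Local description lives on Win32_OperatingSystem
            $os = Get-CimInstance -ClassName Win32_OperatingSystem -ComputerName $name -ErrorAction Stop
            $ad = Get-ADComputer -Identity $name -Properties Description -ErrorAction Stop
        } catch {
            Write-Warning "Skipping ${name}: $($_.Exception.Message)"
            continue
        }
        switch (Get-DescriptionSyncAction -LocalDescription $os.Description -ADDescription $ad.Description) {
            'LocalToAD' {
                if ($PSCmdlet.ShouldProcess($name, 'Copy local description to AD')) {
                    Set-ADComputer -Identity $name -Description $os.Description
                }
            }
            'ADToLocal' {
                if ($PSCmdlet.ShouldProcess($name, 'Copy AD description to local')) {
                    Set-CimInstance -InputObject $os -Property @{ Description = $ad.Description }
                }
            }
        }
    }
}
```

Because the driver supports ShouldProcess, something like Sync-ComputerDescription -CsvPath .\computers.csv -WhatIf (the file name is made up) previews the changes without touching anything, which is a sensible first run against a few hundred machines.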

After writing my first script, I had no doubt in my mind that PowerShell was something I really wanted to learn, not only for its potential value to my resume but also because it was fun. I spent the rest of my time at that job writing scripts for anything I could think of. The majority were really short, simple functions created for very specific tasks. Over time, I started stringing some functions together and creating some more “complex” scripts. I also attempted to share my scripts and get other colleagues on board, but the idea was generally rejected by managers and other SysAdmins alike. At first that part really bothered me. All I could think was “You people are insane, why wouldn’t you want to automate shit?” Eventually, it dawned on me that it doesn’t matter at all if others are on board or if I get recognition. At the end of the day, I’m taking this path to do my job. This went on to motivate me even more, because I began focusing my energy on learning more about PowerShell scripting as a means of creating better opportunities for myself.

When I was contacted by a recruiter for this role, I had not even heard the term Site Reliability Engineer. In fact, I was certain that I was not qualified for the job and even ignored the email when I initially received it. Then an unpleasant day came about and, while spending some time in the restroom, I thought to myself “Fuck it, let’s go for it!”. I responded to the recruiter and within an hour I was already on the phone for my first interview (screening or what have you). The next day, I had my second phone interview with two folks who were already engineers with the company. I can honestly say that it was probably the most nervous I’ve ever been for an interview, and it wasn’t even in person yet. At the time, I didn’t have any way to show my skill level with scripting. Minutes before the interview, I quickly edited a couple of scripts, threw them on Pastebin, and saved the links to my desktop just in case. At the end of the interview, I asked if they would have any interest in seeing them, to which they, obviously, said yes. Duh! Finally, after a three-hour-long in-person interview with four other random people, I was offered the role of an SRE. To say that I was excited would be a drastic understatement.

Time to GSD: A Brief Overview of What’s To Come

For those of you wondering, GSD means Get Shit Done. But first, it’s important for readers to understand that this is my interpretation of the role of an SRE. Other SRE roles will likely be different, and I would say that it’s a safe bet that some mistakes will be made over the course of my next posts. But I guess that’s the point, right?

To begin, I’ve created a few bare-bones scripts that will be the baseline of a troubleshooting module that I’ll be building over the course of the next few posts. I’ve also created a GitHub repo for sharing and, even more importantly, implementing version control. For this initial post, I’m not going to be diving very deep into code. This post is already looking like my longest to date, and the functions that I’ve created are as simple and basic as they come. Also, I’m still working through creating my “itinerary” for my future posts. My main goal is to provide some useful information to the interwebs while not adding more redundant information centered around basic concepts (to an extent).

Even the most basic scripts can illustrate the process of creating automated solutions. Whether it be troubleshooting, recovery, or even just gathering information, you would be surprised how much can be accomplished with a few basic scripts. With that in mind, I’ve created four scripts, which have been added to my GitHub. They are very basic and designed for a “Point and Shoot” scenario: performing tasks like baseline troubleshooting and information gathering, with minimal logic for control. For now, I’m going to keep the functions separate while eventually adding more functionality and a few additional scripts or functions. Here’s a list of the scripts that I’ve added thus far.

Get-CPUInfo – Get CPU Usage of remote PC
Check-FolderPermissions – Check security permissions on a given directory
Network-Check – Confirm Network Connectivity
Get-FirefoxVersion – Confirming Software Version

Even being this basic, these four scripts provide enough functionality to create “an alpha version” of an automated solution. To demonstrate, consider the following scenario. Let’s say that multiple users are unable to launch a browser-based application. In a situation like this, the troubleshooting is pretty straightforward.

Uninstall/Reinstall
Confirm Network Connection
Confirm Server Accessibility
You get the idea…

95% of the time, the issue is resolved by one of the few steps above. The other 5% usually require digging in and performing in-depth troubleshooting. However, even with a 95% success rate, if this type of issue tends to occur fairly often, the amount of time lost can add up quickly. Even more so if you happen to be dealing with end users who are not very tech savvy. Lucky for us, rather than having to repeat this course of events over and over, it makes far more sense to automate the troubleshooting, or at least as much of it as possible.

The reality is that the majority of these troubleshooting steps can be accomplished with only a few commands (Ping/Test-Connection, Reg Query, etc.). It took me roughly 50 minutes to write all of them, another 10 to debug, and 5 more to add them to git, for a grand total of roughly 65 minutes. After adding one last script to link the functions together, all of the troubleshooting steps now run in around 30 seconds. Think about how much time would be saved over the course of a year from something so basic.
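
To give a rough idea of what “linking the functions together” can mean, here is a minimal sketch of a runner that executes a list of checks in order and returns one result object per check. This is not the script from my repo. The name Invoke-Troubleshooting, its parameters, and the shape of the result objects are assumptions made for illustration, as is the computer name in the usage comment.

```powershell
# Hypothetical runner: executes each check in order and reports the outcome.
# Each check is a script block that takes a computer name and returns $true/$false.
function Invoke-Troubleshooting {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName,

        # Ordered map of check name -> script block, e.g. built with [ordered]@{ ... }
        [Parameter(Mandatory)]
        [System.Collections.Specialized.OrderedDictionary]$Checks
    )
    foreach ($name in $Checks.Keys) {
        $check = $Checks[$name]
        $timer = [System.Diagnostics.Stopwatch]::StartNew()
        try {
            $passed = [bool](& $check $ComputerName)
            $detail = ''
        } catch {
            # A check that throws counts as failed; keep the message for later digging
            $passed = $false
            $detail = $_.Exception.Message
        }
        [pscustomobject]@{
            ComputerName = $ComputerName
            Check        = $name
            Passed       = $passed
            Detail       = $detail
            Seconds      = [math]::Round($timer.Elapsed.TotalSeconds, 2)
        }
    }
}

# Example usage (the computer name is made up):
# $checks = [ordered]@{
#     Network = { param($c) Test-Connection -ComputerName $c -Count 1 -Quiet }
# }
# Invoke-Troubleshooting -ComputerName 'PC-0142' -Checks $checks | Format-Table
```

Returning objects instead of writing text to the console means the same results can be piped to Format-Table for a quick look or to Export-Csv when you want a record, and a check that blows up doesn’t stop the rest from running.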

Conclusion

Wow, that was a long one, and we’ve finally reached the end of my inaugural SRE post. As promised, here’s the link to my GitHub repository: https://github.com/noobishsre/NSREPowershell. The scripts I discussed are all posted, as well as some of the SQL functions that I discussed in previous posts. As for the SQL functions, keep in mind that they require a database configured in the same manner as my test DB. Regardless, I think it goes without saying that automation is “where it’s at”.

Stay tuned for my next post in which I’ll be expanding on a couple more SRE concepts, adding some more functionality to our current scripts, and configuring the repo into a module. It’s going to be lots of fun.

As always, thank you to any and all readers. I hope you enjoyed it. Also, feedback is welcomed and encouraged, whether it be comments, suggestions, or you just want to say hi.

Unless, of course, you’re a Nashville Predators fan. In that case you can suck it… I know, I’m only kidding, and I also may still be a little sour about my Blackhawks getting swept in the first round. You try watching your team play 4 games and only score a total of 3 effing goals. I mean, COME ON!!

Okay, I’m done. 😛

For The Love of Engineering