Channel: Loop1

THWACK MVP Insights: A Decade of Data Fusion with SolarWinds


Over the course of the last decade, I have made my SolarWinds Orion instances do a lot of things that the original development team might not have had in mind.  In this article, I’ll talk about some examples of the ways I brought data into Orion or, in some cases, pulled data out.  I’ll start off with some of the less technically demanding strategies from my early days, but the later examples relied on significant custom scripting and database work.

Most of the first custom ideas I had for my Orion instance came about during my early days as a network technician. Where I worked, things were pretty poorly documented, and I didn’t have the skills to do anything sophisticated, but I wanted a place to store information that I was picking up about the building I worked in so the next noobie wouldn’t be as lost as I was. These weren’t the most elegant solutions, but they made life easier for my team, and I could execute them with just the few months of IT knowledge I had at the time.

Connecting with documentation in Google Sheets

So, one of the sources of pain in my life at this time was that we had patch panels all over the building in every closet, and the labels were either missing or just totally incorrect.  I realized that getting called up to fix a device that couldn’t connect and then having to go tone the wires to trace it back to the switch was a huge time sink, so I wanted to proactively figure out where all this spaghetti was going during some of my slower shifts.  Initially, I was just keeping a pocket notepad and writing things down as I figured them out, but it didn’t take long to decide that was not going to work long term.

To help keep track of the information I was figuring out about our cabling, I created a spreadsheet in Google Sheets and started keeping a simple grid describing what was connected to either side of the panel so that the next time I got a call from the front desk manager that they couldn’t get on the network, I would have the necessary information handy.  SolarWinds came into this story when I decided to add a custom property to interfaces where I would paste the link to the relevant cell in the spreadsheet.  If I got an alert from SolarWinds that an interface had gone down, I could quickly jump from the Interface Details view to cross reference against the spreadsheet, and I would have at least a little bit more information to help me move toward a resolution.  Over the years, I have found that using custom properties to link back to relevant documentation makes a measurable impact.  And if you find that you don’t have documentation to cover an alert-worthy situation, then that just raises the flag that maybe that is something that needs to be focused on.

Leveraging QR codes

As time went on and I got a better handle on the environment, I built out groups in SolarWinds that represented each of my MDF/IDF closets and all the racks inside them.  As I learned about each server in my building, I would populate custom properties with things like its Closet, its RackUnit, the Applications it hosted, and who to call when it was broken. With that information in place, it wasn’t too hard to build Atlas diagrams that showed the current polled status of each bit of hardware in each rack.

By this time, I was considered relatively “senior” on my team, but we had new hires who were back at square one, trying to learn our naming conventions and not knowing what any of the applications were for, and I wanted to give them an easy place to look things up when they were in front of the rack.

One day, I took the URLs for each of my rack groups, put them into a QR code generator, printed the codes out, and stuck one on each rack.  Now, any time we wanted to know more about the rack we were standing in front of, we could scan the QR code with a phone, and since we were on the internal corporate WiFi, it would pull open the Group Details view in SolarWinds, and we would see all the useful properties I had been populating.  If some bit of hardware was getting swapped out, we could update the properties right then and there instead of trying to remember to do it when we got back to the desk.  Any time I needed to remember to update documents later, you can be sure that I was going to be stopped in the halls at least three times on the way.

Adding user-defined images to web server

When I got into consulting, an early customer of mine was an MSP (Managed Services Provider) who supported a collection of several hundred rural cotton gins. As you can imagine, cotton gins often don’t have proper data closets. Their routers and switches would be sitting under someone’s desk, or on top of a shelf, or hidden behind a potted plant. The technicians would take pictures of any managed IT assets so they knew where to find them next time.

When I came in to help set up their SolarWinds, they asked me if there was a way that we could include those pictures when their techs looked at nodes in SolarWinds. After some discussion, the solution we came to was to create a directory in the Orion website folder that kept copies of all the pictures, organized by the site code of the client. Imagine a file path like /Orion/Images/SiteABCD/Image1.jpg through Image10.jpg. Then we added a tab that we called Images to both the Node Details and the Group Details views. I’m not an HTML guru, but I knew enough to fill the page with a set of 10 HTML widgets. Each widget had something to the effect of <img src="/Orion/Images/${site}/ImageX.jpg">, with X running from 1 to 10, so if there was an image for that site, it would pop into the view. It took us a few hours to create all the relevant folders, fill them with the images, and get the files renamed, but it felt like it was worth the effort.
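The renaming step is the kind of chore that is easy to script. Here is a minimal sketch in Python, assuming each site already has its own folder of .jpg photos; the function name and the two-pass staging approach are illustrative choices, not what was used on the original project.

```python
from pathlib import Path


def normalize_site_images(site_dir: Path) -> list[str]:
    """Rename every .jpg in one site folder to Image1.jpg, Image2.jpg, ...

    Photos are numbered in alphabetical order of their original names.
    """
    photos = sorted(p for p in site_dir.iterdir() if p.suffix.lower() == ".jpg")

    # First pass: move everything to temporary names so that a photo can
    # never overwrite a file that already carries one of the target names.
    staged = []
    for index, photo in enumerate(photos, start=1):
        temp = photo.with_name(f"__staged_{index}.tmp")
        photo.rename(temp)
        staged.append(temp)

    # Second pass: give each staged file its final ImageN.jpg name.
    names = []
    for index, temp in enumerate(staged, start=1):
        target = temp.with_name(f"Image{index}.jpg")
        temp.rename(target)
        names.append(target.name)
    return names
```

Running this once per site folder leaves the files in the shape the widgets expect. Only the first 10 would show up on a page built with 10 widgets.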

Skipping ahead a few years…

I spent about 5 years consulting for various companies until I eventually went back to being a full-time employee at a large enterprise. By this time, my work was pretty much always done sitting at a desk, and I rarely got up and headed into a data closet to fix something myself. Being there for the longer term let me get much deeper into creating custom integrations and scripts, because I no longer had to hold back out of concern about handing a client something they wouldn’t be able to maintain after the project ended. My team were all monitoring specialists who defined corporate policy and then built tools to simplify the usage of various monitoring and observability tools. We supported tens of thousands of devices across 7 different monitoring-related tools, so the main mission was to automate our lives as much as possible while creating a more cohesive and unified experience for our users.

Pulling from a CMDB API

This is the first order of business in any big environment. Many companies use ServiceNow (SNOW), and if you have not committed to a CMDB platform yet, you could consider SolarWinds Service Desk, but the process is roughly the same no matter which CMDB you use. You need to take the list of attributes being populated in the CMDB and make sure you have equivalent custom properties in SolarWinds. I also prefer to include the CI itself as a custom property so I can easily generate links that will take me directly to the CMDB object that I have associated with the SolarWinds node.

The other big factor to keep in mind is that often you have to build up some logic about what kind of CI objects you actually want to match. Quite often there will be a separate CI for a server and the virtual machine that really represents the same object. It took me a fair bit of trial and error in my environment to come up with logic that matched my Orion objects to the “thing” in SNOW which most people agreed was the “same” in both tools. Every different type of CI object in the CMDB tends to have different sets of attributes, so it can be a struggle to coalesce them into one comprehensive set of data. If you stick to it and have some level of cooperation from the teams who populate this data, you will be able to figure out the matching logic that makes sense within your environment.
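To make the matching problem concrete, here is a minimal sketch in Python of one possible rule: compare short hostnames case-insensitively, and when several CIs share a name, prefer one CI class over another. The `ConfigItem` fields, the class names, and the priority order are all assumptions for illustration; the real logic will depend on how your CMDB is populated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigItem:
    sys_id: str     # identifier of the CI in the CMDB (hypothetical field name)
    name: str       # hostname as recorded in the CMDB
    ci_class: str   # CI type, e.g. server vs. virtual machine


# Assumed preference order: earlier classes win when several CIs share a name.
CLASS_PRIORITY = ["cmdb_ci_server", "cmdb_ci_vm_instance"]


def short_name(hostname: str) -> str:
    """Lowercase the hostname and drop any domain suffix."""
    return hostname.split(".")[0].lower()


def pick_ci(node_name: str, items: list[ConfigItem]) -> Optional[ConfigItem]:
    """Return the preferred CI for a monitored node, or None if nothing matches."""
    candidates = [
        ci for ci in items
        if short_name(ci.name) == short_name(node_name)
        and ci.ci_class in CLASS_PRIORITY
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda ci: CLASS_PRIORITY.index(ci.ci_class))
```

Keeping the preference order in one list makes it easy to adjust as the teams who own the CMDB data weigh in on which object should count as the "same" thing.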

Several years ago, I published a simplified example of pulling SNOW CMDB data into Orion and syncing a set of properties here

If you are at a big enough enterprise, it is not unusual to end up getting throttled by your SaaS vendors because you send too many requests from all your tools and users. Over time I became painfully aware of the limits of SNOW semaphores. We had a tool that cloned our SNOW data to a SQL Server instance, so I was able to set up a linked server and join the SN_CI property to that database and display any SNOW data I wanted directly in my SolarWinds console. If you are relying purely on traditional SNOW, it would be a little more complicated, but doable using custom HTML widgets and JavaScript making calls to the SNOW API as you load pages in Orion. I liked the SNOW mirror database because it reduced the number of calls we made to the SNOW API.
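The join itself is simple once both sets of data are reachable from one query. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere; the table names, column names, and sample rows are invented for illustration and do not reflect the real Orion or SNOW schemas.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with a node table and a CMDB mirror table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE nodes (caption TEXT, sn_ci TEXT);
        CREATE TABLE snow_mirror (
            sys_id TEXT PRIMARY KEY,
            support_group TEXT,
            environment TEXT
        );
        INSERT INTO nodes VALUES ('app01', 'ci-001'), ('db01', NULL);
        INSERT INTO snow_mirror VALUES ('ci-001', 'Payments', 'prod');
        """
    )
    return conn


def nodes_with_cmdb(conn: sqlite3.Connection) -> list[tuple]:
    """Join each node to its CMDB record through the stored CI identifier.

    A LEFT JOIN keeps nodes that have no CI yet, which makes gaps visible.
    """
    query = """
        SELECT n.caption, n.sn_ci, c.support_group, c.environment
        FROM nodes AS n
        LEFT JOIN snow_mirror AS c ON c.sys_id = n.sn_ci
        ORDER BY n.caption
    """
    return conn.execute(query).fetchall()
```

Because the lookup runs against the mirror rather than the live API, loading a page with this data costs no SNOW API calls at all.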

Connecting monitoring to ticketing databases

If you are using an on-premises ticketing tool, then there will always be a database that you can pull in. My environment was using Tivoli (… I know…), so I was able to link our databases and easily show the status, resolution code, and history of all tickets associated with any given node inside SolarWinds. Once I started looking at the full life cycle from monitoring to alerting to the ticket’s resolution, I found things like the same alert firing off over and over for months, followed by the support team closing it as “No trouble found” over and over. Clearly something needed to be done to improve the logic of the alert or the training of the people responding to the tickets. Once I was looking at the big picture, I was able to reduce the volume of alerts our support teams saw by 70%.
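Spotting those repeat offenders is mostly a counting exercise. Here is a minimal sketch in Python, assuming the ticket history has already been pulled into a list of dictionaries; the field names (`node`, `alert`, `resolution`) and the threshold of three are hypothetical.

```python
from collections import Counter


def repeat_no_trouble(tickets, min_count=3):
    """Find node/alert pairs that keep getting closed as 'No trouble found'.

    Returns (node, alert) pairs with their counts, most frequent first.
    """
    counts = Counter(
        (t["node"], t["alert"])
        for t in tickets
        if t["resolution"].strip().lower() == "no trouble found"
    )
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]
```

Anything this returns is a candidate for either better alert logic or a conversation with the team closing the tickets.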

Standardizing patterns across multiple tools

My team was the enterprise system monitoring team, and we were responsible for defining the corporate policies across all monitoring tools. That meant things like requiring that every production system be monitored and have an appropriate version of system-down alerting. Different teams might prefer one tool or another, but it was our job to ensure that thousands of developers working across hundreds of projects were all aligned on consistent, repeatable standards, with nothing slipping through the cracks.

Teams would want different thresholds, or they would want different amounts of delay before triggering a particular alert action, but what we found was that the learning curve for our developers to be able to reliably create the alerts they thought they were building across all the different tools was too steep. I did not enjoy getting dragged into 3 AM Priority 1 bridges to explain why the alert the developers thought they had configured wouldn’t work because of some quirk of the vendors they relied on.

Setting up an alert in Splunk or New Relic is very different from setting one up in SolarWinds. We developed a set of standards about how alerts should work that was consistent across all the many tools our developers were allowed to use. Instead of asking them to go into each tool one by one and learn the UI and hope they didn’t fall into any of the “quirks,” we asked that they just follow a set of naming patterns we had documented and publish a list of alerts that they wanted to their team’s Git repository.

The monitoring team would then scrape those requested policies, and we built a translation tool that created the necessary alert logic in whatever monitoring tools that team was using. So, if they needed a SolarWinds CPU alert, they would have a line in their Git repo config file like “CpuCrit_10m_C”, which we knew meant: set up an alert that fires any time a device stays above the critical CPU threshold for more than 10 minutes, and generate a critical severity ticket.
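As a rough illustration of the first step in such a translation tool, here is a minimal Python sketch that parses a name like “CpuCrit_10m_C” into a structured alert request. Only that single example comes from the article; the general grammar, the "Warn" level, the "h" unit, and the "W" severity code are assumptions made to round out the example.

```python
import re
from dataclasses import dataclass

# Assumed vocabularies. Only "Crit", "m", and "C" appear in the article's example.
LEVELS = {"Crit": "critical", "Warn": "warning"}
UNIT_SECONDS = {"m": 60, "h": 3600}
SEVERITIES = {"C": "critical", "W": "warning"}

NAME_PATTERN = re.compile(
    r"^(?P<metric>[A-Z][a-z]+)(?P<level>Crit|Warn)"
    r"_(?P<amount>\d+)(?P<unit>[mh])"
    r"_(?P<severity>[CW])$"
)


@dataclass
class AlertSpec:
    metric: str            # what is measured, e.g. "cpu"
    threshold: str         # which threshold must be exceeded
    duration_seconds: int  # how long the condition must hold
    severity: str          # severity of the ticket to generate


def parse_alert_name(name: str) -> AlertSpec:
    """Turn a requested alert name into a tool-independent specification."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized alert name: {name!r}")
    return AlertSpec(
        metric=match["metric"].lower(),
        threshold=LEVELS[match["level"]],
        duration_seconds=int(match["amount"]) * UNIT_SECONDS[match["unit"]],
        severity=SEVERITIES[match["severity"]],
    )
```

From a structure like this, a separate backend per monitoring tool could build the vendor-specific alert. Rejecting unknown names loudly is deliberate here, so a typo in a repo config surfaces at scrape time rather than during an outage.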

On our end, we were the SMEs, so we did the heavy lifting to understand the limitations of any given alert in each tool. Ultimately, the goal was to abstract away all that specialist knowledge of how exactly a Splunk, SolarWinds, or any other monitoring tool works so the thousands of developers didn’t have to each go through training on each tool. They built software, and we made sure they knew when it was broken.

Auditing and reporting on our observability program

Once we had the tools in place to be able to easily generate the alerts that needed to exist, we started to work toward being able to accurately measure the effectiveness and coverage of the monitoring policies we had in place. I could write all the documentation I wanted, but if the developers were ignoring it I wasn’t being effective.

I used PowerShell to scrape data from all our relevant sources into a database I wrote from scratch to be able to understand the lay of the land. I started to build a collection of reports that would find every object in our VMware, GCP, and AWS environments, compare that to each of the monitoring tools, and validate that each system had at least a minimal set of alerts that were compliant with the corporate standards. It took some time, but I could reliably say things like 100% of prod systems were in the appropriate monitoring tools for their team and had the expected set of alerts enabled. If there was something missing, I could schedule a meeting to review it with the teams who owned that system. Maybe they had a good reason, in which case we needed to document an exemption from the blanket policies.
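The core of that comparison boils down to set arithmetic. The original work was done in PowerShell against a custom database; the sketch below shows the same idea in Python with plain in-memory data, and every name in it (function names, alert names, system names) is hypothetical.

```python
def audit_coverage(inventory, monitored, required):
    """Compare the systems that exist against what the monitoring tools know.

    inventory: iterable of system names found in the hosting platforms
    monitored: dict mapping system name -> set of alert names enabled for it
    required:  set of alert names every system must have
    Returns a dict of non-compliant systems and the reason for each.
    """
    findings = {}
    for system in sorted(inventory):
        if system not in monitored:
            findings[system] = "not monitored"
            continue
        missing = required - monitored[system]
        if missing:
            findings[system] = "missing alerts: " + ", ".join(sorted(missing))
    return findings


def coverage_percent(inventory, findings):
    """Percentage of inventoried systems with no findings against them."""
    total = len(inventory)
    if total == 0:
        return 100.0
    return 100.0 * (total - len(findings)) / total
```

The findings dictionary doubles as the agenda for the review meetings: each entry is either a gap to close or an exemption to document.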

Then, when the alerts were consistently in place, we could review the SLAs attached to those incidents. We looked at how long a system would be down before the tools saw the issue (MTTD), how long before someone had been notified and begun troubleshooting (MTTA), and how long before the problems were resolved (MTTR). Armed with this data, we set up meetings to help dev teams work through their gaps in terms of tooling, documentation, or procedure to ensure that our business continued to be able to serve our customers.
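For anyone who wants to compute these numbers themselves, here is a minimal Python sketch, assuming each incident record carries four timestamps. The field names are hypothetical, and the choice of start points (MTTD and MTTR measured from the moment of failure, MTTA from the moment of detection) is one common convention rather than the only one.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


def sla_summary(incidents):
    """Compute MTTD, MTTA, and MTTR in minutes for a list of incidents."""
    return {
        "MTTD": mean_minutes(incidents, "failed", "detected"),
        "MTTA": mean_minutes(incidents, "detected", "acknowledged"),
        "MTTR": mean_minutes(incidents, "failed", "resolved"),
    }
```

Grouping the incidents by owning team before calling `sla_summary` gives a per-team view, which is usually what the follow-up meetings need.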

I hope this gave you a few ideas on ways you can take advantage of the flexibility of the SolarWinds platform to tie different types of data together from a variety of sources to improve the ability of your engineers to get their work done.

Marc Netterfield
THWACK ID: mesverrum

The post THWACK MVP Insights: A Decade of Data Fusion with SolarWinds appeared first on Loop1.

