5 Ways to Automate Data Center Operations
Data center automation is the practice of completing routine data center operations tasks without manual effort. It increases operational efficiency, improves data accuracy, and simplifies data center management.
One of the leading organizations in driving automation via integration with Data Center Infrastructure Management (DCIM) software is Workday. Workday has been a Sunbird customer for over five years and has more than 900 users on their system, a number that continues to grow.
In a recent Automation Workshop webinar, the data center experts from Workday shared their real-world use cases and insights on how they use APIs and integration to eliminate manual effort in their data center operations.
"DCIM really is for us a source of truth for the tens of thousands of bare metal assets that we have in our global data centers," said Tim Putney, Senior Software Engineer Manager.
Workday's Key Design Goals for Their DCIM API Architecture
When Workday was designing their DCIM API model, they had three key goals in mind:
- Automatic. Workday wanted to be able to programmatically execute nearly any task that could be done via the DCIM GUI to keep the support burden low and save time for other priorities.
- Secure. With so many users on the system, the Workday team knew that they'd need granular role-based access control that adheres to the least privilege principle both in terms of accessing the service and what could be done with the data.
- Fast. Since they were expecting an increasing number of clients, requests, and assets, Workday required robust performance to support their API-driven global environment as well as extreme scalability.
Workday's DCIM API Architecture
During the event, Tony Lincoln, Principal DevOps Engineer, walked through a schematic of Workday's DCIM API architecture.
The bottom layer is Sunbird's dcTrack DCIM Operations appliance. It has a GUI and two ways to programmatically access the underlying data: the native dcTrack API and an ODBC interface. The top layer is Workday's DCIM suite, which contains multiple layers itself.
Workday's logical architecture reflects their design goals to be automatic, secure, and fast, but also reflects some practical considerations. They have a variety of users with different needs and access methods. Developers and engineers run either the command-line interface (CLI) or the API, service accounts are part of their continuous integration pipelines, and business users directly access the GUI.
"Our API users are used to a common API platform here at Workday," said Lincoln. "Some of our queries—especially a few years ago—required custom SQL statements in order to optimize performance. We had monitoring queries that happen at very high frequencies and we have other queries that are just very complex. As you can see, we offer multiple layers of access. We have a Python library that provides direct client methods, abstracting out the two access points for dcTrack."
This architecture gives Workday the ability to optimize their API calls by choosing between native dcTrack API calls and raw SQL via ODBC. They can pick and choose the performance that they need. They also have a REST API on top of that.
"This is the common API with common authentication, authorization, and logging," said Lincoln. "It's a homegrown Workday layer and we achieve efficiencies by developing to that layer."
Above the API is a CLI that uses the underlying layers to provide a simple, interactive tool. Workday's engineers use it heavily to make individual changes, and some embed it in more complex workflows captured in Ansible playbooks.
"They can sort of supersize their CLI by mixing and matching that with Ansible," said Lincoln.
Workday's Automation Architecture
Moshe Haber, Senior DevOps Engineer, explained how Workday uses Jenkins in their automation architecture.
"Jenkins is a very powerful tool for automation and it has a lot of integrations using various plugins to Jira, Slack, and other applications. Of course, we have the REST API that we developed which allows us to authenticate, authorize, and collect some metrics. For example, the various automation jobs have the ability to run as a Jenkins job and they have the ability to access the dcTrack API via the Workday API. They're able to access the various tables, event logs, audit logs, transaction logs, and other tables, and generate detailed reports. There are a lot of capabilities that are provided using this architecture."
Workday's 5 Use Cases of Automation via Integration
1. Provisioning and orchestration.
Before Workday deployed dcTrack, they had a homegrown tool that they had begun integrating with other systems. Right away, they learned that unless they were able to do this in near real-time, they were going to have data integrity issues where their asset tool could report values that weren't accurate until a machine configuration was complete. This had the potential to create issues with operations, compliance, and credibility. To overcome this challenge, Workday began discussing the concept of a "source of becoming," similar to the well-known "source of truth."
"We realized we needed a way to track both," said Lincoln. "The solution for us was to leverage the custom fields capability that dcTrack has natively. Instead of using a standalone database to capture a lot of these desired values, we capture data for both desired values and reported values within dcTrack. Then, we use those desired values as integration points for our provisioning and orchestration tooling."
Lincoln then explained Workday's provisioning workflow.
"dcTrack plays a critical role in providing both information about the way things are and the way we want them to be. We use DCIM lookups to determine which racks and hosts are ready to build based on their attributes. We write back to DCIM about the state of the given device as it moves through its lifecycle. Then, service teams can rely on DCIM to show which racks are ready to use."
"SNAP is the API-driven compute provisioning system that we developed in-house," Lincoln continued. "It integrates tightly with dcTrack and all these other tools to build, update, and manage our fleet. There's a process we call 'DCIM to Chef' that takes [a custom field value] and other Chef-related values and dynamically updates node data in the reference Chef server. This allows engineers with the appropriate access to change the configuration and functionality of a given compute node in near real-time."
Workday's network automation team also has an integration that scans their fleet of network devices and reports to DCIM with relevant information about them. They're also working on an automated validation tool that will run a series of tests on newly installed top-of-rack switches and populate a field in DCIM that will tell SNAP that the related rack is ready for provisioning.
"That… will make our provisioning process zero-touch from the point at which a cabinet is installed, cabled, and powered," said Lincoln. "All of this depends on knowing the state of your assets, where they've been, what they currently look like, and where they're going. Leveraging custom fields let us build a framework to track all that."
Skip to 11:23 for more information from Workday on this use case.
2. VM data management.
Workday has a significant virtual machine installation running parts of their service, and they wanted to track data about the nodes that make up those clusters. However, because of the type of service that Workday provides, they must meet the strictest requirements for protecting their customers' data, and they found that some out-of-the-box methods wouldn't work for them.
Their solution was to flip the direction that the data flows.
"dcTrack gives you multiple ways to accomplish things and our stack adds to that," said Lincoln. "We use our API client to push the data into dcTrack directly. Internal requirements met. Problem solved."
"The virtual machines are tracked in DCIM," said Haber. "Our requirement was also to filter some of those VMs and automatically create entries in DCIM upon detection and populate not only the VM data but populate certain attributes related to that group of VMs."
Workday implements virtual machine tracking in their various data centers via a telegraph application by VMware that collects the data from the associated vCenter, packages the data in a CSV file, and pushes it to their central Jenkins. There, they have an automation job that processes the CSV file and creates or updates information in their DCIM.
"Everything is done automatically," said Haber.
Skip to 16:28 for more information from Workday on this use case.
3. Device state tracking.
In another use case, Workday wanted to keep their device data up to date with little to no effort.
"With all this automation powering additions and changes to our fleet, how do we keep the current state of our assets up to date?" Lincoln said. "In what should be a common theme by now, we automate it."
Workday's devices run Chef several times an hour and they cache system information. When it changes, they pass that information to Splunk for retrieval directly into dcTrack.
"The server itself has what we call Splunk forwarding," said Haber. "What we have is a scheduled search that collects all the data from all the servers and puts it in a CSV file and pushes it to our Jenkins. On the Jenkins, we have a job that processes the information and populates or updates all the entries in DCIM if there was any change."
Skip to 20:22 for more information from Workday on this use case.
4. Parts management.
Workday uses Sunbird's Parts Management feature to extend the functionality of dcTrack for their Data Center Operations teams.
"We were very excited to add Sunbird's Parts Management module to our environment, but we were a little nervous," said Lincoln. "If you're tracking tens of thousands of assets and suddenly you need to track ten parts per asset—which is a pretty conservative amount—your environment just got a lot bigger. Suddenly, there are going to be a lot of burning questions that your DC Operations team will want answers for: Are we reordering the right things? Are we reordering the right amount of things? And maybe most importantly, what else can we learn from this kind of information?"
Workday's solution was to report and alert on parts usage to automatically provide answers to those three questions and more. They provide daily reports of parts consumed per location, alert if the number of parts consumed exceeds a certain threshold, alert if a specific part is consumed excessively, create depreciation reports for obsolete parts, and much more. Some of the insights they've gathered from this data include how reliable different parts are and how responsive different vendors are.
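As a rough illustration of the reporting and alerting described above, the sketch below counts parts consumed per location and raises an alert when a location or a specific part exceeds a threshold. The event format, the threshold values, and the message wording are assumptions, not details from the webinar.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, str]  # (location, part) for one consumed part


def parts_per_location(events: List[Event]) -> Counter:
    """Daily report: how many parts each location consumed."""
    return Counter(location for location, _ in events)


def usage_alerts(events: List[Event], location_limit: int, part_limit: int) -> List[str]:
    """Alert when a location, or a single part type, exceeds its limit."""
    messages = []
    for location, count in sorted(parts_per_location(events).items()):
        if count > location_limit:
            messages.append(f"{location}: {count} parts consumed (limit {location_limit})")
    for part, count in sorted(Counter(part for _, part in events).items()):
        if count > part_limit:
            messages.append(f"{part}: consumed {count} times (limit {part_limit})")
    return messages


events = [
    ("DC1", "dimm"),
    ("DC1", "dimm"),
    ("DC1", "ssd"),
    ("DC2", "dimm"),
    ("DC2", "psu"),
]
```

The same counts, accumulated over a longer period and grouped by part model or vendor, would be the raw material for the reliability and vendor-responsiveness insights mentioned above.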
Skip to 23:33 for more information from Workday on this use case.
5. External integrations.
Once the rest of the organization recognized the Workday team's efforts and results, they had a new challenge on their hands.
"This is one of those problems that you hope you have," said Lincoln. "Several years in, DCIM is a mature service here at Workday and contains a wealth of information that the business uses. We had a business analyst who recently saw all our data and called DCIM 'the hub' for asset data for our organization. It's a great place to be, but it does mean that every spoke out there wants to connect and you can't help everyone, so how do you make those kinds of choices?"
Workday decided that the best solution was to open up their API and share it along with their wealth of documentation and examples to the interested parties. Then, they just had to step out of the way.
They have implemented a number of features that make them more comfortable giving that freedom to their partners. They have API-driven onboarding and configuration so that people can gain access to their common API layer, granular role-based access control, and rate limiting on the number of API requests clients can make per hour and per day. They also perform audit log monitoring to check for specific field changes and the number of concurrent changes.
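Rate limiting per hour and per day can be approximated with fixed-window counters, as in the sketch below. The source does not say how Workday implements its limits, so the RateLimiter class, its parameters, and the fixed-window approach are assumptions chosen for simplicity.

```python
from collections import defaultdict
from typing import Dict, Tuple


class RateLimiter:
    """Fixed-window limiter: at most per_hour and per_day requests per client.

    Counters for old windows are never pruned here; a long-running service
    would need to expire them.
    """

    def __init__(self, per_hour: int, per_day: int) -> None:
        self.per_hour = per_hour
        self.per_day = per_day
        self._counts: Dict[Tuple[str, str, int], int] = defaultdict(int)

    def allow(self, client: str, now: float) -> bool:
        """Return True and record the request if both windows have room."""
        hour_key = (client, "hour", int(now // 3600))
        day_key = (client, "day", int(now // 86400))
        if self._counts[hour_key] >= self.per_hour:
            return False
        if self._counts[day_key] >= self.per_day:
            return False
        self._counts[hour_key] += 1
        self._counts[day_key] += 1
        return True
```

Passing the timestamp in as an argument (in production it would typically come from time.time()) keeps the limiter easy to test. For example, with per_hour=2 and per_day=3, a client's third request within the same hour is refused, while a different client is unaffected.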
"We also have sub-production environments," said Lincoln. We always want to make sure we're testing in a non-production environment, but the nice added benefit of having these is that you can open those up to your customers as well so that when they're testing an aggressive query or something like that, they don't have to try it for the first time in production."
"And, of course, we never want to have to use it, but we knew from the beginning that we wanted to have a robust failover and disaster recovery procedure," said Lincoln. "We have lots of documentation as well as systems built in to make sure that if we do have a problem in production, we can recover from it. We have a regular method for generating a backup of the system. We use the one that comes with dcTrack and then we have a method for copying that to a separate system. Recently, we also deployed dcTrack's high availability functionality, so we have that as well as an added layer of protection."
Skip to 26:50 for more information from Workday on this use case.
Bringing It All Together
Sunbird's second-generation DCIM vision is to radically simplify data center management with elegant software. Automation via integration is one of the key pillars that achieves this vision. Modern data center managers need to integrate systems and automate data center operations to save time, save money, and ensure data is accurate.
Workday is a leading example of how organizations can leverage DCIM software to drive automation via integration.
Watch the Automation Workshop recording to learn from the best in the industry and gain insight into how you can drive automation in your data center.
Want to try Sunbird's enterprise-class DCIM software yourself? Take a free test drive today.