Old Dogs Learning New Tricks

Old Dogs

The saying goes, “You can’t teach an old dog new tricks.”  Certainly, all of us have found examples where that holds true; otherwise the statement wouldn’t still be in use today.  However, what if I told you that you can teach an old dog a new trick?  In our industry, plenty of companies, as they age and mature, become satisfied with maintaining the status quo and forget to push innovation.  The storage industry is rife with well-established companies that continue to dominate their market share.  Odds are, you probably have one of their products in your data center.

That being said, what if I were to tell you that there’s a transformation going on within one of those well-established storage companies?  If I threw out terms like “OpenStack” or “Kubernetes”, I would bet the first storage company you thought of would be a Valley startup.  In this case, you would be wrong.  That company, strangely enough, is NetApp.  Feel free to take a moment to work that Keanu Reeves-esque “Whoa!” look off your face.  Honestly, I was pretty shocked as well, considering that, until recently, I had never personally worked with anything in the NetApp portfolio.  That changed when I became a customer of SolidFire (which was acquired by NetApp back in late 2015).  I’m not going to go into specifics about the SolidFire platform in this post; that would distract from the message about the internal metamorphosis going on within NetApp.  Just know that, in my belief, SolidFire was acquired to continue molding this internal transformation.

The Road to Change

Change is hard, especially for the well established.  You slowly become the embodiment of “That’s how we’ve always done it.”  However, something had to be done within NetApp.  The overall technology industry was asking for more from all of its vendors.  No longer could we get by with hardware-dominated solutions and very little software to interface with that hardware.  Storage administration had turned into an overcomplicated mess, with little being done to resolve it.  Full-time administrators had to be devoted to these hardware masses just to perform the equivalent of keeping the lights on.

Just when those administrators thought they could take a breather, along came an organizational shift.  I’m not talking about a simple spin of the organizational chart realignment wheel.  DevOps came along.  It forced conversations about the way we approach our IT departments.  No longer could IT departments be content with protecting their individual silos or technical fiefdoms; they had to start aligning with the business goals of the overall organization.  Contentment with just keeping your realm operational was no longer going to satisfy those outside of the IT department.  We’ve dubbed this “better business outcomes”.

To help enable these changes within the IT department, the people within had to start looking toward automation.  They needed to make their processes more efficient and encode them into technical systems to better deliver the components necessary for better business results.  This meant those in the data center needed to start looking to their partners for help in delivering software that would help them automate and orchestrate their data centers.

What I heard during the Tech Field Day 14 presentation was the story of a large company trying to help its customers along that very journey.  With better software development, NetApp has been able to make headway into products that you wouldn’t normally associate with NetApp.  Better API capabilities within their ONTAP software have opened up integrations with systems like OpenStack, Puppet, Chef, Ansible, PowerShell, Docker, Jenkins, and even Kubernetes (in the form of Trident).

Final Thoughts

I will admit that it was extremely refreshing to listen to a Tech Field Day (specifically, Tech Field Day 14) presentation from a storage vendor that included only one mention of the underlying storage architecture.  As someone who spends most of his day job busting silos and getting a technology organization to see that what we provide is greater than the sum of ports, spindles, virtual machines, and blades, I was pleased with the messaging during this presentation.  The goal is easier accessibility for those who aren’t traditionally data center personnel, and I really believe that NetApp is on the right path to delivering it.


The Dawn of a New Networking Era

Preface:  I recently attended a Tech Field Day event (Tech Field Day Extra – Cisco Live 2017).  While the event is put on by Gestalt IT, I was not paid by, nor did I receive any sort of reimbursement from, Gestalt IT or the presenting companies.

Recently, I had the pleasure of being a delegate for a full day of Tech Field Day Extra – Cisco Live 2017.  The entire day was devoted to Cisco and their recent announcements about their latest enterprise networking initiative, or as they call it, the Network Intuitive.  During the sessions, we got plenty of substance to fill in what the primary keynote, delivered by Cisco CEO Chuck Robbins, was all about.

Interestingly enough, while I will never profess to being a packet plumber, my involvement in the day was as an observer from other sides of IT operations and architecture.  During the day, I was able to draw many parallels to the transformations that other IT disciplines/silos have gone through or are currently going through.

The three major definitions of future enterprise networks outlined by the Cisco CEO were as follows:  Scale, Simplification, and Security.  The goal of this post is to outline what Cisco is doing to help realize those definitions in their newest enterprise networking gear.

Scale

Going into these sessions, my understanding of software-defined networking and the subsequent skirmishes between various vendors was very limited.  As far as I was concerned, I didn’t care much about the intricate details of networking; I only cared about the result it delivered to my customers.  The skirmishes between the various companies and their approaches only fueled my disdain for the specifics.

However, having attended these Cisco sessions (which, subsequently, were delivered by some of Cisco’s most distinguished minds and presenters), the importance of the network dawned on me.  Like many architectures in other IT disciplines, while we’ve been able to abstract many of the architectural components of, say, a storage array, its purpose in delivering business value will always be a high priority.

That being said, I was able to fully grasp the reasoning behind the need for the ASIC technology in Cisco’s new Catalyst 9000 line.  If you go back to the points that Cisco CEO Chuck Robbins brought up in his Day 1 keynote, scale was referred to as a primary definition of enterprise networking moving forward.  With a constant stream of new devices on-ramping into enterprise networks (whether smartphones, tablets, or even IoT devices), the ability to process the onslaught of packets becomes something that needs to be addressed.

Thus the importance of the ASIC in Cisco’s Catalyst lines over the last few years.  The ASIC was created specifically to offload and process what software alone could not scale to with commodity processing capabilities.  To draw a parallel, think of all the offloads that come on a NIC in a server these days.  If you tried to perform VXLAN services on those NICs without the offload, you would put the burden on the CPUs, which means the applications running on those servers would suffer.  The ASIC is essentially providing the same style of services, but at a different layer in the network.

Lastly, Cisco intends for these ASICs to be adaptable over time.  Cisco used VXLAN as an example of just how adaptable the UADP ASIC is.  When the ASIC was first designed, VXLAN had not even been invented yet.  However, over the course of time, Cisco’s software element, IOS-XE, was able to add support for the protocol while wrapping it around the ASIC’s capabilities.  This led to VXLAN being performed fully in hardware, rather than having to run purely in software.  The possibility, along with the realization, of new features in the platform should ensure a long-term investment that does not require a constant stream of hardware refreshes (not unlike certain legacy storage vendors) to take advantage of new protocols over time.

Simplification

You can’t call yourself a software-defined architecture without providing software elements to manage all these capabilities.  This is also the layer that many traditional networkers struggle with.  The software encapsulates many of the functions those networkers used to perform by hand, mostly through the power of the command-line interface.

As we move towards a more automation-forward enterprise, we are realizing that the devil is truly in the details.  Those details have led us to needing to understand minute bits of information down to the tiniest object within our infrastructure.  As enterprises scale, this mentality does not.  Abstracting that information into software-based policies becomes all the more important to scaling out the duties of the network administrator.

This is where Cisco’s DNA (Digital Network Architecture) product line comes into play.  Included in DNA are a management dashboard (called DNA Center), a policy-driven engine to automatically segment network traffic (called Software-Defined Access), and a machine learning platform (included in DNA Center) to help drive more insight from predictive analytics.  All of these tools aim to change the thought process of networking to a more intent-based system versus the low-level detail orientation of prior network generations.

This is where I bring in my own work with Cisco management software.  If you read through my prior blogs, I work with Cisco’s UCS Director product to help automate our technical processes in more legacy networking stacks, along with the combination of storage, compute, and virtualization.  I mention this because I see a lot of parallels between that product and what DNA Center is ultimately going to provide.  The purpose is to be able to define a policy and implement it across the fabric.  Becoming more ingrained with tools like RESTful APIs, and the scripting languages that execute against those APIs, is going to be one of the biggest challenges for networkers moving forward.  The same movement is happening with storage administrators, compute administrators, and virtualization administrators.  Welcome to the party, networkers!  As one who’s moved toward manipulating IT systems through these protocols and scripting languages, I can assure you that you aren’t abandoning anything and, in fact, will become much more valuable to the organization by adding these skillsets to your toolset.

Security

Now, more than ever, security is becoming a primary initiative within every technical silo/domain.  For the longest time, we’ve considered security to be the “NO” of IT operations (in fact, you can’t spell “operations” without “no”).  Businesses can no longer risk having security be an afterthought to projects.  When you factor in the sheer number of devices onboarding to enterprise networks these days, that fact alone is enough to keep C-level executives awake with fear at night.

Cisco has woven security through all layers of this entire platform.  Many of the ASIC’s roles involve driving home encryption capabilities so that, again, the work doesn’t need to be done by the software itself.  However, some of the more interesting security measures come from Encrypted Traffic Analytics.

I found this to be one of the most intriguing aspects of the security profile of the platform.  I had no idea that the vast majority of our network traffic was already being encrypted.  I also had no idea that many recent attacks were being carried out using encryption as well.  This has introduced a double-edged sword for those in the security realm: they want us to encrypt our traffic, yet the malicious traffic is now using the same means, making it harder to detect when these sorts of events occur within the enterprise.

This is where Cisco has decided that analytics would be a great approach to identifying potential threats before they become full-blown nightmares.  Instead of having to decrypt traffic to learn the intricate details of a potentially malicious threat, pattern recognition is applied to the traffic.  If the traffic looks drastically different from what an organization is used to seeing, these anomalies can be flagged or even immediately set to deny so that further transmission can no longer occur.  The net result is that by using this engine to analyze the pattern, you aren’t wasting precious resources fully decrypting the transmission.

Conclusion

The rise of business-goal-driven IT is forcing many IT technologists to find new ways to exist once the shift starts to happen in their specific environments.  We, as technologists, are attempting to adapt across the board, and we are more dependent than ever on our technology partners to understand and provide us with the means to realize these goals.  My personal belief is that Cisco is genuinely turning that corner and can help adapt IT in the ways we’ve started to experience, or have come to expect, in other technology disciplines.  I look forward to watching these platforms evolve, along with watching my IT networking brethren evolve.  I welcome you all to the new evolution of IT.

Now, can we stop arguing over the definition of what SDN is supposed to be?  😉


Tech Field Day 14 Preview – Datrium

Most of my peers know that I’m not one to dig deep into the many nuances of storage.  I consider storage to be a necessary evil in which there will always be a surly storage administrator to give me the runaround on why a virtual workload isn’t performing up to expectations.  However, in an effort to become a more well-rounded technologist, I have decided to make my first Tech Field Day 14 preview post specifically about a storage company.  During the event, which will happen on May 11th and 12th of 2017 (in Boston), we will hear much more about a storage company that appears to be doing things differently than most of their storage brethren.

The Architecture

I’ll reference, again, that I’m not much of a storage expert.  In fact, in the right moment, you might be able to sneak made-up storage terms by me (say, triple crunch parity).  That being said, one of the things about Datrium’s architecture that stood out to me was its disaggregated approach.  While I may have seen various attempts at this approach, I’ve not seen it taken to the level Datrium has taken it.

Datrium’s storage solution, called DVX Rackscale, is a combination of a back-end storage array and a software element that is installed onto a VMware-based host.  As part of the requirements for that VMware-based host, some amount of flash storage (between one and ten disks) needs to be installed in the host.  This local flash storage is used to hold read-only, cached copies of the more active data.  Connected to each host, via 10Gb Ethernet, is a single back-end disk array that contains the high-capacity, slower-performance disks on which the master copy of all data resides.  This back-end array is also where writes occur.  One of the interesting facets of this architecture is the claim that the more hosts you add to the solution, the faster the overall system becomes.  Datrium’s data sheets estimate roughly 100k IOPS per host for 4K workloads, with the potential to drive nearly 3,000,000 IOPS in a maximum configuration.  Those numbers for 32K workloads are reduced (as expected) but still impressive (40k IOPS per host; 1.2 million IOPS across a maximum configuration).

There are some maximums to be aware of in the architecture.  For instance, you can only connect 32 VMware ESXi hosts to a single back-end storage array.  Also, it appears that the back-end array is non-upgradable (the system has 12 4TB 7200rpm HDDs), meaning the effective capacity you will be working with will vary between 60TB and 180TB.  Like many things storage related, this number is going to vary wildly based on the reduction techniques and their effectiveness within the system (Datrium expects between 2x and 6x reduction ratios, based on their datasheets).

An Industry First? – Blanket Encryption

External to the architecture of DVX are some very unique features not found on many other storage devices.  One such feature is called Blanket Encryption.  This feature, a completely software-based solution, manages to overcome some of the tradeoffs we see in current solutions.  Many hardware solutions only go as far as offering data-at-rest encryption.  Combine that with the penalties that encrypting at the source imposes on data reduction, and you’ve got a lot of wasted resources providing minimal protection.

Datrium’s DVX solution claims to provide the best of all worlds.  Because they depend on their own software in the hypervisor stack, they are capable of doing better source protection in the VMware host RAM and local flash disks without incurring the same penalties you would see from the hypervisor alone.  The very same software is also capable of performing in-flight encryption from the host to the storage array.  According to Datrium’s datasheets, the data is first fingerprinted, then compressed, and finally encrypted while it’s being created in the ESXi host RAM.  The data is then transmitted to the storage array as ciphertext.  All the while, this is being done using some of Intel’s advanced instruction sets to help offload the performance hit of all of these operations.

My Non-Storage Enthusiast Take

There is plenty more to look at when it comes to Datrium’s DVX solution.  Recently added to the portfolio are a very unique replication system and snapshot catalog.  The replication system is extensive enough to almost justify an entire write-up to discuss its intricate details.  Combine this replication system with plans to work towards public cloud replication (while ensuring the public cloud data is able to take advantage of the same on-premises blanket encryption!) and we have a lot more to talk about with Datrium.  Personally, I’m excited to hear and see more about blanket encryption.  Also, it seems that Datrium has decided to get into the compute market with their own Intel-based servers, preconfigured for VMware ESXi and the DVX software.  It’ll be interesting to ask why they felt the need to jump into that overly saturated hardware market.


Why I’m Heading to OpenStack Summit – An Architect Perspective

The other day I was presented with a question from someone in my office.  They had seen a LinkedIn update I had posted about being invited to Tech Field Day 14 and about going to OpenStack Summit.  I posted this shortly after arriving back from Seattle and a Microsoft hybrid cloud airlift event (focused on Azure Stack).  Given that we have corporate initiatives to eventually implement Microsoft Azure Stack (along with expanding the Cloud Service Provider practice within), they found it curious and wanted to know why I even wanted to go to OpenStack Summit.

I felt this was an interesting question to start writing about.  It made me realize that there are many in our industry who draw very definitive lines between the various cloud platforms out there and who can’t seem to fathom why anyone would want to pay attention to all of them, rather than just going deep into one platform and living there for their technical eternities.

For me, the answer wasn’t a simple one.  I come at cloud platforms (and cloud, in general) as someone who tries to grasp the basics of the various camps and find what is unique to each.  Like many parts of technology, I view these as different tools for different jobs.  While many of them provide the bare basics (like virtual machines or virtual networks), many differ in their approach for how to interact with various audiences.

One of the primary reasons I wanted to go to OpenStack Summit came down to finding out who the primary audience of OpenStack is.  I want to be able to ask questions of those on the technical side, as well as the business side.  One of the weird things I’ve noticed about myself, when selecting sessions for OpenStack Summit, is that I’m going to have plenty that are labeled “Cultural and Organizational Change”.  While the technology and the inner workings of all the OpenStack projects fascinate me, I want to see how organizations are using OpenStack and what they intend to change within.

Another reason, as I alluded to in the prior paragraph, is the technology within the ecosystem.  I know many of the tried-and-true OpenStack projects (like Nova, Cinder, Keystone, Neutron, Glance, and Swift), but I haven’t had the opportunity to administer or work with OpenStack for longer than mild attempts in a very closed lab scenario.  I want to see how the other projects are coming along and which ones might break into the core project realm, especially as business drivers dictate it.

That last statement leads into my most important reason for wanting to attend OpenStack Summit.  We all hear about the business drivers for cloud adoption.  What I want to see and hear is how businesses came to understand that OpenStack can achieve their goals.  Those of us from heavy virtualization backgrounds, especially, have likely heard about the complexities of standing up OpenStack, particularly on your own, in the early stages of using open source software, and in an enterprise environment.

In the end, when you peel back the nuts and bolts of the various cloud platforms out there, you come to realize they all tend to look the same.  What differentiates each of them is their intended target market.  For instance, Microsoft has been positioning Azure and Azure Stack as a hybrid answer for those still running Microsoft-centric environments but also trying to move toward micro-services architectures for cloud-native applications.  I want to see if OpenStack is maturing its message to enterprises, while reducing the supposed complexities that exist in standing up the platform, whether from source or from the ever-expanding vendor and partner integrations that are out there.

So, to those in my organization who feel I’m wasting my time going to OpenStack Summit, I’m going to leave you with some final thoughts.  Too often in this industry, especially at the operational level (along with some of those who are architects in a technical subject matter expert sense), we get caught up in only going to events that satisfy or maintain the subject-matter-expert parts of our brain.  My role is one that asks me to see many different perspectives before trying to solve business problems.  To do this, we have to go to events that may not make sense on the surface or even provide any tangible benefit to those business outcomes.  Sometimes, we go to these events purely out of curiosity and/or for information gathering.  I don’t know about you, but I’d rather be armed with information than be one of the uninformed.


Cisco UCS Director and the PowerShell Agent – Part 5

In this blog post, we will finally work our way up to what I’ve tried to accomplish with a combination of PowerShell, Cisco UCS Director’s northbound APIs, and the concept of parallel processing with workflows in UCS Director.

The Assumptions and Caveats

First, let’s talk about the assumptions for this PowerShell script and UCS Director workflow. My current network architecture is one that models itself after one of Cisco’s validated designs for data centers. This design has a network spine (provided by the Nexus 7000 switch line) and network leaves (provided by the Nexus 5000 switch lines). As part of the configuration of this design, Cisco FabricPath is enabled across the switching lines.

With FabricPath implemented, the process to create a simple L2 VLAN across each switch is relatively straightforward and lends itself to taking simple variables (mostly in the form of which switch the operation needs to be performed on).  The only difference between the two switch families that forces me to create two different VLAN creation scripts comes from the Nexus 7000 family, which requires me to specify the VDC (virtual device context) before running the small script to create a VLAN and enable FabricPath mode on that VLAN.

Also, part of my architecture and current usage of these switches is the fact that not all of the L2 VLANs were created with Cisco UCS Director.  In response, I created two different workflows; we’ll call them Nexus List Add and Nexus List Rollback Setup.  Nexus List Add is used when the L2 VLAN is detected to NOT already exist on the switch.  It runs a block of NX-OS CLI that creates the VLAN based on passed VLAN name and VLAN ID variables, enables FabricPath mode, and then saves the running-config to startup (a sketch of that command block is below).  Nexus List Rollback Setup, instead of trying to create a VLAN that already exists on the switch, registers a rollback task with UCS Director (for consistent VLAN removal across the entire FabricPath domain) and sets ownership of the VLAN in UCS Director for the customer in question.  This way, UCS Director knows about the VLAN and which customer owns it.
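For illustration, here is a minimal sketch of the NX-OS command block that Nexus List Add pushes to a switch, assuming a hypothetical VLAN ID of 100 and VLAN name of CUST-VLAN-100 (on the Nexus 7000s, the workflow changes to the appropriate VDC first).  I’m wrapping it in a PowerShell here-string purely to show the block as a single variable; this isn’t the exact block from my workflow.

# Assumed VLAN ID/name; the real values arrive as UCS Director workflow inputs
$nxosCommands = @"
vlan 100
  name CUST-VLAN-100
  mode fabricpath
exit
copy running-config startup-config
"@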

One last caveat for the PowerShell environment on my Cisco PowerShell Agent servers:  I’ll admit that I’ve been rather lazy since originally deploying them, and the PowerShell version on them is still 4.0.  This causes issues with the large JSON responses returned by some of the UCS Director northbound APIs.  In version 4.0, there is a maximum response size beyond which you cannot use the ConvertFrom-Json cmdlet.  I was forced to use some borrowed code that adjusts some of the JSON parameters in the system for my runspace and then creates a custom PowerShell object.  Unfortunately, this process adds a lot of overhead.  I’ve recently found that by upgrading my environment to PowerShell 5.0, the issues go away and I can get away from the custom code to create the PoSH object.
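For reference, the borrowed workaround I’m describing generally looks something like the following (a sketch, not my exact code): it drops down to the .NET JavaScriptSerializer and raises its MaxJsonLength so that a very large JSON string can still be deserialized on PowerShell 4.0, after which the result can be massaged into a custom PowerShell object.

Add-Type -AssemblyName System.Web.Extensions
$serializer = New-Object System.Web.Script.Serialization.JavaScriptSerializer
$serializer.MaxJsonLength = [int]::MaxValue    # lift the size limit that trips up ConvertFrom-Json on 4.0
$parsed = $serializer.DeserializeObject($largeJsonString)    # $largeJsonString holds the raw API response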

The Code

The code is available at the following GitHub location: My Example Code

Declared Functions

I wrote two specific functions in my PoSH code to be reusable throughout this script: Get-UCSDURIs and Connect-NestedPSSession.  For reference, I’ve previously blogged about why I needed to create Connect-NestedPSSession (Using Invoke-WebRequest with the Cisco PowerShell Agent).  Get-UCSDURIs was created so that I could go through my entire switching stack and generate all the specific URIs for calling the two Nexus workflows and filling in all the variables.  Since I have 2 Nexus 7000s and 10 Nexus 5000s in this configuration, I need to generate a combination of 12 URIs to send to the UCS Director northbound API userAPISubmitWorkflowServiceRequest.

In Get-UCSDURIs, I also do a quick lookup of the workflow inputs using the northbound API userAPIGetWorkflowInputs.  The reason for this is that even if the input names on the workflows are the same, UCS Director appends a specific number to each variable name to make it unique (example below from one of the JSON returns on Nexus List Add).

[Screenshot: Screen Shot 2017-03-14 at 2.31.23 PM – JSON return showing the numbered workflow input names]

A total of three parameters are passed to the API when executing userAPISubmitWorkflowServiceRequest.  The first, param0, is a string containing the name of the workflow you wish to execute.  The second, param1, is a lengthy string passed in the userAPISubmitWorkflowServiceRequest URI; most of the code in Get-UCSDURIs focuses on creating this parameter.  Since all of this information is put into the URI (and not into any sort of request body), I could not create it as a JSON literal.  I had to build it as a large string object, which is why the code looks the way it does, with multiple uses of double quotes and backticks.  Lastly, we send a UCS Director service request ID (or SR ID) as param2.  In my case, I usually send an SR ID of -1.  This means that each workflow call is independent and does not register rollback tasks with the parent workflow.  I handle rollbacks in a different way later, since I also want to use parallel processing when removing everything that I created.
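To make that more concrete, here is a rough sketch (not my actual Get-UCSDURIs output) of what one of these URI strings can end up looking like.  The opData layout with param0/param1/param2 follows the format described above; the host name and the numbered input names are hypothetical placeholders, and in practice the string generally needs to be URL-encoded before it is sent.

$ucsdHost  = "ucsd.example.local"     # hypothetical UCS Director FQDN
$wfName    = "Nexus List Add"         # param0 - workflow name
$inputList = "{list:[{name:`"vlanId_1234`",value:`"100`"},{name:`"vlanName_1235`",value:`"CUST-VLAN-100`"}]}"   # param1 - built as one big string
$srId      = -1                       # param2 - parent SR ID (-1 = standalone)
$uri = "http://$ucsdHost/app/api/rest?formatType=json&opName=userAPISubmitWorkflowServiceRequest" +
       "&opData={param0:`"$wfName`",param1:$inputList,param2:$srId}"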

The Main Script Body

I start the main script body by passing in multiple arguments from UCS Director.  Many of these are specific to my environment, but the gist is that I need to pass in which of my sites this needs to be created at, which VLAN ID is being requested, what the requested VLAN name is going to be, which UCS Director group I want to assign ownership to, and which parent UCSD SR ID I want to assign the rollback tasks to (as noted in the prior paragraph, this is almost always -1, but I wanted the function to be usable for other cases too, not just this specific one).

I’m also passing in specifics for the PowerShell Agent at the site in question (the PS Agent) and the username and password I want to use to initiate the nested remote PowerShell session using Connect-NestedPSSession.

With UCS Director and the northbound APIs, there is no Basic authorization header to work with.  UCS Director creates an API key for each login, accessible through the GUI.  To use that key, you need to store it in a hashtable in which one of the keys is labeled X-Cloupia-Request-Key.  I do this by creating an empty hashtable first, then using the Add method on that object to insert the key/value pair.  For the most part, this is the only header required for the northbound APIs I’m using in this script.
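That header setup is only a couple of lines; a minimal sketch (with an obviously fake key) looks like this:

$headers = @{}                                                               # empty hashtable for the request headers
$headers.Add("X-Cloupia-Request-Key", "A1B2C3D4E5F60718293A4B5C6D7E8F90")   # placeholder - use the API key from your UCS Director user profile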

After setting the header, I have everything I need to start making northbound API calls to UCS Director.  Before I initiate any specific workflow, I need to go through a couple of lists to check what needs to be worked with and to determine which subsequent workflow to call (Nexus List Add or Nexus List Rollback Setup).  I use the northbound API userAPIGetTabularReport to get these lists of networking equipment (MANAGED-NETWORK-ELEMENTS-T52) and the VLANs that have already been pulled from the switch inventories into the UCS Director databases (VLANS-T52).
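The call pattern is the same one used for every northbound API in this script; here is a hedged sketch of pulling one of those tabular reports.  The exact opData parameter layout for userAPIGetTabularReport is an assumption on my part, so verify it against the REST API browser in your own UCS Director instance.

$ucsdHost  = "ucsd.example.local"     # hypothetical UCS Director FQDN
$reportUri = "http://$ucsdHost/app/api/rest?formatType=json&opName=userAPIGetTabularReport" +
             "&opData={param0:`"1`",param1:`"MANAGED-NETWORK-ELEMENTS-T52`"}"   # parameter layout is an assumption
$rawJson   = (Invoke-WebRequest -Uri $reportUri -Headers $headers -UseBasicParsing).Content   # $headers built as shown above
$networkElements = $rawJson | ConvertFrom-Json    # straightforward on PowerShell 5.0+; see the 4.0 caveat earlier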

After running through these APIs and parsing the responses (and filtering down to very specific information), we begin the process of cross-checking whether the VLAN exists on the equipment.  Depending on whether or not it exists AND the model type of the equipment being checked, the switch information gets placed into one of four lists.  Each is labeled with either a 5k or 7k string and whether or not it is a rollback.  These four lists are then processed and the URIs are generated using the Get-UCSDURIs function.  Lastly, all four URI string returns are smashed together into one large, comma-separated list, which should contain, depending on the site, either 10 or 12 lines of URIs to process.

We start the execution process by taking our URI list and initiating each call with a foreach loop.  I store each SR ID in a hashtable that I can use to keep an eye on the workflows I just initiated.  To monitor each of these requests, we use the northbound API userAPIGetWorkflowStatus.  This returns a numerical code telling us the status of the workflow.  If the status code returned is one in which the workflow has reached some sort of stop, I remove the SR ID from the hashtable.  I also put the SR ID into a string called $sr_list.  This list stores all the workflow SR IDs for rollback purposes, and I return it to UCS Director as well.  Once all the SR IDs have been removed from the hashtable, my while loop exits and the script finishes.
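Stripped down, the submit-then-poll pattern looks roughly like the sketch below.  This is not my exact code: the serviceResult property name and the status-code check are assumptions about the response shape and about which values mean “stopped”, so treat them as placeholders and confirm them against the userAPIGetWorkflowStatus documentation.

$activeSRs = @{}                                                     # SR ID -> URI for workflows still running
foreach ($uri in $uriList) {                                         # $uriList is the combined output of Get-UCSDURIs
    $resp = (Invoke-WebRequest -Uri $uri -Headers $headers -UseBasicParsing).Content | ConvertFrom-Json
    $activeSRs.Add($resp.serviceResult, $uri)                        # assumed: serviceResult carries the new SR ID
}
$sr_list = ""
while ($activeSRs.Count -gt 0) {
    foreach ($srId in @($activeSRs.Keys)) {                          # snapshot the keys so entries can be removed mid-loop
        $statusUri = "http://$ucsdHost/app/api/rest?formatType=json&opName=userAPIGetWorkflowStatus&opData={param0:$srId}"
        $status = ((Invoke-WebRequest -Uri $statusUri -Headers $headers -UseBasicParsing).Content | ConvertFrom-Json).serviceResult
        if ($status -ge 3) {                                          # assumed: 3 and above means the workflow reached a stop
            $sr_list += "$srId,"
            $activeSRs.Remove($srId)
        }
    }
    Start-Sleep -Seconds 10                                           # breathe between polling passes
}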

The assumption is that all my API calls have completed (without a critical stop) and that the Layer 2 VLAN has been pushed out in a parallel-processed way.

Parallel Processing

I haven’t mentioned much about this concept up to this point, having focused on the code instead.  The reason I wanted to perform this in parallel is that each switch acts independently of the others.  While FabricPath requires you to create a VLAN on all the switches for traffic purposes, there’s no major error checking that forces you to have it all created at once.

Taking this idea, you can see that instead of waiting through a long sequential order (switch1 -> switch2 -> … -> switch10), I can initiate all 10 requests at the same time.  This is what I mean by parallel processing.  Before I created this script, my entire process of creating an L2 VLAN was taking upwards of 10-12 minutes.  In a cloud world, this seemed extremely high, especially for something as simple as an L2 VLAN.

After this implementation, I have execution times as low as three minutes at one site (which has fewer VLANs to parse through) and as high as five minutes (due to 12 switches with 1500 VLANs on them; that works out to roughly 175,000 lines of raw JSON information).  Oh, and I forgot to mention I also ADDED new functionality to the overall workflow (these efficiencies were just one part of the major fixes requested).  I added automatic creation of vSphere distributed portgroups and the addition of those portgroups to a customer’s network policy in their virtual data center.  I also added the ability to go through UCS Central, find any customer service profiles, and add the VLAN to the vNICs of each of those service profiles.  Included in all of this was also permission granting to our VMware hosted environment for those distributed portgroups.  So, I added some major automated assumptions AND drastically cut the time of execution.

Conclusion

I hope that you either read through the entire series or at least got something new and useful out of it.  I was amazed at just how little was being published, from either Cisco SMEs or the community, with regards to PowerShell and Cisco UCS Director.  I hope that you, the reader, have realized just how powerful the PowerShell capabilities can be and how easily you can extend systems that have no direct support in Cisco UCS Director.


Cisco UCS Director and the PowerShell Agent – Part 4

In this blog post, we will go over some advanced use cases of PowerShell with Cisco UCS Director and the PowerShell Agent.  We will walk through a scenario in which we need multiple bits of data returned from a PowerShell script and how that can be handled with some custom tasks that parse XML data and return the values as UCS Director macros/variables.

Real World Use Case

In my own environment, I had one great use case that made me start leveraging more data returns from PowerShell scripts.  In my lab and in production, we run Cisco UCS Central to provide a centralized repository for many Cisco UCS-specific constructs (like MAC pools, service profile templates, vNIC templates, etc.).  As we grew to multiple data centers, we started to worry about major overlap problems between the pools in each Cisco UCS Manager instance, so we decided to start using UCS Central to provide and divide these entities from a global perspective.

Cisco UCS Director includes a certain number of tasks that Cisco themselves authored.  Unfortunately, as with many out-of-the-box task implementations in UCS Director, they didn’t quite fit everything we needed to perform our specific processes when it came to building UCS devices, whether for our own virtualization environments or for bare metal for our customers.

My main use case came from a limitation in the code of the task for cloning a global service profile template.  After upgrading UCS Central from 1.2 to 1.3, this task started to return the value “undefined” for values like the MAC addresses assigned to our NICs or the HBA WWPN information.  We found that a delay now had to occur to properly pull this information from the cloned service profile and return it to UCS Director.

PowerShell Saves the Day

As with most of the Cisco UCS Director out-of-the-box tasks, you are unable to see the JavaScript/CloupiaScript code within.  This made it impossible to resolve the issue through the existing task (although a TAC case was logged about the issue).  We resorted to recreating the main functionality in PowerShell using the Cisco UCS Central module (available in the Cisco UCS PowerTool Suite).

The Code

A caveat before we continue:  this code was written well over a year ago, so some of it may have changed drastically as the UCS Central PowerShell module has gone through revisions since those earlier iterations.  Also, you are going to notice that I hard-coded the password sent to the Connect-UcsCentral cmdlet.  Ask any security person about this practice and you’ll likely get hit with some sort of random object ranging in size from an eraser to a city bus.

The Script

Import-Module CiscoUcsCentralPs

$ucsc_org = ($args[0].Split(";"))[1]   # Passing a UCS Director variable and getting the Central Org DN from it
$ucsc_account = ($args[0].Split(";"))[0]  # Passing a UCS Director variable and getting the UCS Central account (if multiples)
$uscs_gspt = ($args[1].Split(";"))[2]  # Passing a UCS Director variable and getting the Global Service Profile Template DN from it
$customer_id = $args[2]  # Passing a string for usage in creating the name of the service profile
$device_sid = $args[3]  # Passing a string for usage in creating the name of the service profile

$ucsc_username = "*Insert UserName to authenticate to Central*"
$ucsc_password = ConvertTo-SecureString -String "*Password for Central account*" -AsPlainText -Force
$ucsc_credential = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $ucsc_username, $ucsc_password
$ucsc_conn = Connect-UcsCentral -Name "*Insert UCS Central hostname or IP*" -Credential $ucsc_credential

$gsp_name = $customer_id + "-" + $device_sid    # Create combined global service profile name
$new_gsp = Get-UcsCentralServiceProfile -Dn $uscs_gspt | Add-UcsCentralServiceProfileFromTemplate -NamePrefix $gsp_name -Count 1 -DestinationOrg (Get-UcsCentralOrg -Dn $ucsc_org) | Rename-UcsCentralServiceProfile -NewName $gsp_name   # Create GSP from template and rename to remove "1" from end of name

Start-Sleep 15   # Sleep for 15 seconds to allow for UCS Central to process global pool values into GSP
$new_gsp = Get-UcsCentralServiceProfile -Name $new_gsp.Name   # Reload the service profile

$ucsd = @{}   # Create our hashtable to store values

# Create the hashtable values for the various parts of the global service profile to be used by later UCS Director tasks

$ucsd["VNIC1_MAC"] = ($new_gsp | Get-UcsCentralVnic -Name ESX_Mgmt_A).Addr   # MAC for Mgmt NIC/PXE Boot NIC, named ESX_Mgmt_A
$ucsd["VNIC2_MAC"] = ($new_gsp | Get-UcsCentralVnic -Name ESX_Mgmt_B).Addr   # Secondary MAC for Mgmt NIC, named ESX_Mgmt_B
$ucsd["VHBA1_WWPN"] = ($new_gsp | Get-UcsCentralvHBA -Name vHBA1).Addr   # WWPN of vHBA1, used for zoning, named vHBA1
$ucsd["VHBA2_WWPN"] = ($new_gsp | Get-UcsCentralvHBA -Name vHBA2).Addr   # WWPN for vHBA2, used for zoning, named vHBA2
$ucsd["VHBA1_WWN"] = ($new_gsp | Get-UcsCentralvHBA -Name vHBA1).NodeAddr + ":" + ($new_gsp | Get-UcsCentralvHBA -Name vHBA1).Addr  # WWN used for EMC initiator creation for vHBA1
$ucsd["VHBA2_WWN"] = ($new_gsp | Get-UcsCentralvHBA -Name vHBA2).NodeAddr + ":" + ($new_gsp | Get-UcsCentralvHBA -Name vHBA2).Addr  # WWN used for EMC initiator creation for vHBA2
$ucsd["ServiceProfileIdentity"] =  $ucsc_account + ";" + $ucsc_org + ";" + $new_gsp.Dn   # UCS Central Service Profile Identity, in UCS Director variable format

return $ucsd   # Return hashtable to UCS Director for processing with custom task

From the beginning, you’ll notice that we must import the modules we wish to use.  The PowerShell Agent does not have full access to things like Windows profiles or scripts to load these into the runtime environment for us.  You must declare all the modules you wish to use (and they must be installed on the device in question) in every PowerShell script you want the PowerShell Agent to run!

Our next block of code brings in the arguments we sent to the script in question.  At the end of my last blog post, I explained how we can use the automatic $args array to pass arguments from Cisco UCS Director to our PowerShell scripts.  In this code, I’m passing in a total of four arguments, but I’m creating five PowerShell variables (all strings) from them.

Now, some object types in Cisco UCS Director are formatted in certain ways.  Take $args[0] that I’m sending to this PowerShell script.  You can tell, by the use of the Split function and how the split is specified, that it’s semicolon-delimited.  The format of the string (which I believe is how UCS Director represents UCS Central organization objects) looks like this:  1;org-root/org-XXXXX.  UCS Central organization objects appear like this to specify the registered UCS Central account ID in Director (the “1” in this example) and the Cisco distinguished name (or DN) of the organization in UCS Central.  So, from one argument, we can get two variables for use by this script.

After the argument section of the script, we perform the operations to create a PSCredential object and use that object to log into UCS Central.  Next comes a line of code specific to my organization’s naming convention for the service profile.  Follow that up with our second UCS Central-specific cmdlet, Get-UcsCentralServiceProfile.  From here, we have examples of how objects can be passed between the Cisco UCS Central cmdlets using the PowerShell pipeline.  This command gets the global service profile template (which we passed in as an argument), clones a global service profile from it, and renames the resulting global service profile with our custom naming convention.

Now, the code that fixed the issues we were having with the Cisco UCS Director out-of-the-box task is the next couple of lines.  Start-Sleep allows you to put a hard wait into the execution of the script and wait for background processes to occur.  Once we have waited 15 seconds, we re-read the profile information we just created.  This changes all the values that were listed as “undefined” to their proper values.

Returning the Information

The last part of this code focuses on a PowerShell object called a hash table.  A hash table is a collection of key/value pairs, much like dictionary entries: a key (the term, in dictionary speak) and a value (the definition).  Using this knowledge, we can use a hash table to store multiple pieces of UCS Central global service profile information and give them back to UCS Director for parsing and processing.

You’ll see in the code that we first declare our hash table.  From there, we can declare our keys and start storing values in the table.  You’ll notice that some of the keys I chose to put in this hash table hold address values from the vNICs and vHBAs in the service profile.  Lastly, I also created a value, under the key ServiceProfileIdentity, that will be returned to UCS Director.  That value is semicolon-delimited and in the format UCS Director expects for a UCS Central Service Profile Identity object.

Lastly, we tell the script to return the contents of the hash table.  At this point, the PowerShell Agent will create the XML response and list all the hash table contents within it.  We then need to do some XML parsing on the Cisco UCS Director side to store these values as macros for other parts of our UCS Director workflows.
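Loosely, the XML that comes back to UCS Director looks something like the sketch below: the element names match what the parsing task looks for (Objects, Object, Property), while the hash table contents come through as alternating key/value Property entries.  The exact attributes and the address values shown are simplified placeholders, not an actual response.

<!-- Simplified sketch of the PowerShell Agent response; values are placeholders -->
<Objects>
  <Object>
    <Property>VNIC1_MAC</Property>
    <Property>00:25:B5:AA:BB:01</Property>
    <Property>ServiceProfileIdentity</Property>
    <Property>1;org-root/org-XXXXX;org-root/org-XXXXX/ls-CUST-SID1</Property>
  </Object>
</Objects>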

Parsing the Response

Ages ago, I found a great example out on the UCS Director Community site (UCS Director Community Site).  It laid the foundation for how to parse through the response and extract the variables I needed in order to create macros for use by UCS Director.

I downloaded the workflow example from the above URL and imported it into my UCS Director environment.  When I did this, it automatically imported the custom task for parsing.  I cloned a copy of that task and started to make some edits to make it my own.  This can all be found by navigating (as a system administrator) in UCS Director to Policies > Orchestration and then clicking on the Custom Workflow Tasks tab.

We can start to edit this custom task by selecting the cloned task and clicking on the Edit button in the bar above it.  I usually skip the Custom Task Information section and proceed to the next section, Custom Task Inputs.  In this section, you can see the following:

[Screenshot: Screen Shot 2017-03-01 at 4.21.09 PM – the Custom Task Inputs screen]

The input expected is going to be called xml.  We will be passing the output from the Execute PowerShell task to this input.

Moving along to the next screen, this is where the customization begins.  Knowing the key names coming from our PowerShell hash table, we can create outputs based on those very names.  Here’s what I put for my values:

[Screenshot: Screen Shot 2017-03-01 at 4.23.29 PM – the Custom Task Outputs screen]

Everything coming out of this task is going to be a generic text value, except ServiceProfileIdentity.  The reason is that my workflow has a later task that requires this object type in order to perform its work.

We skip past the controller section, as we are not going to be performing any marshalling on this task.  That leads us to this script:

importPackage(com.cloupia.lib.util);
importPackage(java.util);

var xml = input.xml;

// Try and parse the ... section
var objects_xml = XMLUtil.getValue("Objects", xml);

// Parse the objects list now (should also be a single section):
object_list = XMLUtil.getTag("Object", objects_xml.get(0));

// Parse the object_list to get properties:
property_list = XMLUtil.getTag("Property", object_list.get(0));

// PowerShell returns arrays weirdly to UCSD, alternating rows of keys/values
// Like this:
//   ip
//   192.168.100.1
//   server_name
//   New Server
//
// Store output in a HashMap:
var variable_map = new HashMap();

// Store previous keys in buffer:
var key_buffer = "";

// Loop through all values taking even as keys and odd as values:
for (i = 0; i < property_list.size(); i++) {
    // Remove XML tags (can't seem to coax the XML library to do this for me!)
    property_list.set(i, property_list.get(i).replaceAll("<[^>]*>", ""));

    // Keys
    if ((i % 2) == 0) {
        key_buffer = property_list.get(i);
    }
    // Values
    else {
        variable_map.put(key_buffer, property_list.get(i));
    }
}

// Match desired output to HashMap fields:
output.VNIC1_MAC = variable_map.get("VNIC1_MAC");
output.VNIC2_MAC = variable_map.get("VNIC2_MAC");
output.VHBA1_WWPN = variable_map.get("VHBA1_WWPN");
output.VHBA2_WWPN = variable_map.get("VHBA2_WWPN");
output.VHBA1_WWN = variable_map.get("VHBA1_WWN");
output.VHBA2_WWN = variable_map.get("VHBA2_WWN");
output.ServiceProfileIdentity = variable_map.get("ServiceProfileIdentity");

The lines to focus on for getting our outputs are the last ones.  The code parses through the XML return, builds a JavaScript version of a hash table (called a HashMap), and then pulls the values back out by their keys.  By assigning those values to the output variables in the script, we complete the last step needed to pull that information in as UCS Director macros and, thusly, pass the information on to other parts of our workflow!

You can see here that I can take the MAC of my primary NIC for OS deployment and assign it as the NIC to use for PXE job creation on the Bare Metal Agent server by passing in the macro that I created:

[Screenshot: Screen Shot 2017-03-01 at 4.32.19 PM – the VNIC1_MAC macro passed into the PXE job creation task]

To Be Continued

The last part of this blog series will try to take the PowerShell Agent to another level by showing how you can use the PowerShell Agent service to perform northbound API calls to other systems OR even to UCS Director itself.  I’ll show examples of recent enhancements I made to my workflows to enable parallel processing and gain massive efficiencies in the overall execution time of some of the tasks within my datacenter.


Observations on Blame Cultures and the S3 Outage

One would think that this was scripted, given the way it happened, but I can assure you that was not the case.  I recently finished reading a book (a really good book on blame cultures; I highly suggest picking up a copy:  Here).  The day after I finished it, my tech social media feeds were aflame with mentions of problems with AWS (specifically in the S3 service in the US-East-1 region).  Much has been said about the need for proper application architecture using cloud building blocks, and much reflection has occurred on whether the cost of that resiliency is justified when weighed against a significant outage.  I fully expect that there are plenty of discussions happening within organizations about these very factors.

I found myself not necessarily focused on the incident itself.  I was more interested, strangely enough, in any sort of public post-mortem that would be brought forth.  Having read many DevOps books recently, the concept of a public post-mortem is not new to me, but I can guess that for many private organizations, it could seem like a foreign concept.  When an incident occurs, many in the organization just want the incident to go away.  There’s an extreme negative connotation associated with incidents and incident management in many organizations.  To me, post-mortems give great insight into how an organization treats blame.

Recently, I’ve been doing quite a bit of research into how organizations, specifically IT organizations, deal with blame.  Now, in Amazon’s case, they’ve listed “human error” as a contributing cause of the outage.  What comes after that, in the post-mortem, goes to show how Amazon handles blame internally.  The two quotes below, taken from the post-mortem (available here:  https://aws.amazon.com/message/41926/), are telling in how this was handled internally:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks

I’ve put in bold my key terminology of this event.  Notice that, outside of one mention of the authorized S3 team member, every other mention has something to do with the tools used to perform that action or with the process that would have helped to prevent the issue.  In this case, the root cause is NOT the operator that entered the command; it was the process that led to the input and the associated actions the system took based on the runbook operation.

So, why the long-winded breakdown of the S3 post-mortem?  It got me thinking about all the organizations I’ve worked for in the past, and it made me realize that when it comes to any sort of employment change, especially one that involves on-call duty or primary ownership of production systems, I’ve got a perfect question to ask of a potential employer.  Ask that employer about their last major internal incident.  While you might not get a full post-mortem, especially if the organization doesn’t believe in the benefit of such a document, key information about the incident and how it was handled should become immediately apparent.  If the incident was a human error, ask how the operator that performed the action was treated.

Unfortunately, in many IT organizations, the prevailing thought is that a root cause can easily be established, that the operator was incapable of performing their role, and that immediate termination is the appropriate reaction to the event.  If not immediate termination, you can rest assured that the organization will forever assign a hidden asterisk to your HR file and the incident will always be held against you.  Either way, this sort of thought process ends up causing more harm to the organization in the long term.  Sure, you think you removed the “bad apple” from the mix, but there’s going to be collateral damage in the ranks of those who still must deal with the imperfect technical systems that need their “care and feeding” to function optimally.

Honestly, if this is the sort of response you get from a potential employer, I would end the interview right there and have no further discussions with that organization.  Based on their response to the incident, you can easily see that:

  • The organization has no real sense of, or appreciation for, the fact that the technical systems IT staff work with on a day-to-day basis are extremely complex
  • Those systems are also designed such that updating or changing them is considered a mandatory operational requirement
  • When change occurs, you can never guarantee the desired outcome 100% of the time. Failure is inevitable.  All you can do is mitigate the damage failure can do to the system in question.
  • Reacting to the incident by pinning the entire root cause on the operator is a knee-jerk reaction and keeps you from ever getting to the real root cause(s) of your incident
  • Levying a punishment of termination on the operator in question causes a ripple effect through the rest of the staff. The staff are now less likely to accurately report incident information, out of fear (for their employment, of being branded a “bad apple”).  This leads to obscured root causes, which then leads, ultimately, to more failures in the system.

Are you sure you want to work for an organization that prides itself on “just enough analysis” and breeds a culture of self-interest and self-preservation?  No, me neither.  Culture matters in an organization and to those seeking opportunities within it.  It’s best to figure out what the culture really is before realizing you made a major mistake working for an organization that loves to play the name/blame/shame game.
