Getting Real about Production Software

WE DROPPED THE BALL

Last week, we got a call from a customer that one of their users had attempted to reset their password and received a “something went wrong” message from their app. Upon investigation, we were horrified to discover that their app had no email service running. Why so horrified? Here was the initial picture of the situation…

  1. CabForward deployed the app and was supporting the app, so it was our responsibility to discover and resolve any issues.

  1. One of the startup investors discovered the problem — not us.

  1. If the app’s email feature wasn’t working, it had likely not been working since we migrated them between cloud providers just prior to going live a handful of months ago.

  1. Transactions within the app require an email-based approval process and no transactions had been completed since going live.

Make no mistake about it. This was a very serious issue, and we treated it as such. Unfortunately, it appeared our relationship with this customer had likely reached its end. They had trusted us with their proud creation and we simply had not given it the attention it deserved. Well, okay, let’s get real! We hadn’t given it the attention that it required.

HOW COULD THIS HAPPEN?

As CabForward CEO, I am frequently evangelizing the idea of Rugged Software to our employees, partners, prospects, customers, friends: pretty much whoever will listen. But somehow we completely dropped the ball on this one. As in any situation with systemic or team-wide failures, many smaller issues conspired to create the perfect storm. Having said that, if CabForward were truly as rugged as we aspire to be, none of these smaller issues would have been allowed to happen in the first place.

There were very few automated tests and little or no test coverage for this product. This customer had come to us with a product that had been developed by another shop, a dev shop that had recently switched to Ruby from another language and brought a few anti-patterns with them. The code quality score for this app was painfully low, and little time or budget had been committed to cleaning it up, something that becomes more time-consuming the larger a product becomes. What’s worse, the previous developer had convinced the product owner that they would write “the right mix” of tests, as if the right 40% of code coverage were really enough.

There was no error reporting installed to capture run-time production issues. We completely missed implementing any error reporting on this application when we migrated between cloud providers. Software in production requires the full gamut of system testing and monitoring and this app had very little.
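
To make this concrete, here is a minimal Ruby sketch of the kind of safety net we were missing. It is not code from the customer’s app: `ErrorReporter` and `with_error_reporting` are hypothetical names, and this reporter only writes to standard error, where a real setup would forward each report to an error-tracking service. The point is simply that a failure hiding behind a friendly “something went wrong” message still gets recorded somewhere we will see it.

```ruby
# Hypothetical in-process reporter (an assumption for illustration).
# A production setup would forward each report to an error-tracking service.
class ErrorReporter
  attr_reader :reports

  def initialize(io = $stderr)
    @io = io
    @reports = []
  end

  # Records the error with its context and writes one line to the output.
  def notify(error, context = {})
    @reports << { error: error, context: context }
    @io.puts("[error] #{error.class}: #{error.message} #{context.inspect}")
  end
end

# Runs the block, reports any StandardError, then re-raises it so the
# caller can still decide what to show the user.
def with_error_reporting(reporter, context = {})
  yield
rescue StandardError => e
  reporter.notify(e, context)
  raise
end

reporter = ErrorReporter.new

begin
  with_error_reporting(reporter, feature: "password_reset") do
    # Stand-in for a mail delivery call failing in production.
    raise IOError, "email service unavailable"
  end
rescue IOError
  puts "Something went wrong." # what the user sees
end

puts "Errors recorded: #{reporter.reports.size}"
```

The user-facing message stays the same. The difference is that the failure is now counted and visible to the team instead of waiting for an investor to stumble on it.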

There was no real difference in the way we handled code in production versus other environments, like staging and testing. As developers, we believe that there must be strong consistency between all of our environments. Without it, there is no confidence that the different environments will behave similarly. But this view of production as just one of the three or four standard environments can cause developers to forget how much more important production is than the others. As an organization, the way we treat the production environment must be different. If we are changing variables, data, and/or code on a whim, and without any real discipline, we are gambling with other people’s money and disrespecting their investment and interests. It’s the responsibility of the organization to educate its developers on the importance of structure and discipline and, in this case, we failed to do that.

There was no visibility or collaboration between development and QA/testing. When we adopted this project from the previous developer, we were assured that testing was being handled by a third party. Testing should not be optional, just as quality and security should not be optional, and it shouldn’t be disengaged from development. Separating the two muddies accountability and breaks down team dynamics. DevOps is founded on the principle of intense collaboration and we’ve been studying the topic for years. Was this particular case so exceptional it didn’t apply? What makes an edge case and why? Sorry, no. No exceptions.

There was no quality control enforced by operations. Policies for standards across projects were not enforced at the time of this folly. We pride ourselves on our flexibility, but with too much flexibility comes the risk of chaos. Standards are there for a reason — enforce them or go into another line of work.

There was no real support agreement. Because we engaged with the customer near the end of their product development lifecycle, funds were low. This created a scenario where the customer felt they couldn’t justify the expense of our standard support agreement. So instead we coasted into one. We extended the original time-and-materials engagement to include support activities. This type of unstructured T&M support engagement simply doesn’t work because it creates a reactive rather than proactive environment.

There was no excuse for any of this. We won’t let our customers talk us into a diminutive low-cost or no-cost support agreement again. We’re drawing the line. We didn’t explain the importance of these things and therefore the customer didn’t know to ask for them.

Regardless of all of these past decisions, we should have pushed harder to encourage, or even require, the development of more discipline. And that is really the primary offense. Product owners are often non-technical entrepreneurs who don’t fully understand the why’s and why-not’s that are so often obscure in most technology decisions. They trust us to be honest. They trust us to be professional. They trust us to enforce standards and communicate the reasons they exist. We could hide behind the veil of secrecy or shame here, but we’re coming out with it. We’re drawing the line.

DRAWING THE LINE

  • Our customers trust us to mentor them and communicate effectively. By not saying ‘no’ to bad decisions, we are essentially enabling our customers to sabotage their own investment.

  • Our customers trust us to be professional and commit to doing things right. When we lower our standards, we are not living up to our fiduciary responsibility to protect our customers’ interests, not to mention their trust in us.

  • We will not allow our customers to fail based on false or unclear assumptions. Assumptions stifle growth and hinder creativity. Assumptions cause missed opportunities, erroneous beliefs, misunderstandings and errors in judgment. Assumptions can kill a relationship if they are incorrect or inconsistently understood between parties.

  • We cannot be swayed by a single customer or unique case from doing what should be required for all of our customers. A canonical system driven by best practices was implemented to provide governance and must be enforced. This practice should not require significant ramp up or investment beyond standard estimates. We cannot place this responsibility squarely on the customer — they shouldn’t have to pay extra to get what we suggest makes us different in the first place.

  • Risk cannot be deferred solely on the customer’s decision to accept it. We have to let the customer know that we insist on the structure; the customer is depending on us for our expertise. We must be willing to exit the customer engagement if the customer wishes to accept unnecessary risk in order to save a relatively small amount of money.

  • We must not hold back on ruggedization just because the customer is frugal. Instead, we should be fully committed to the ruggedization effort and apply it equally to every customer, every engagement, every product we touch.

  • Ruggedization requires a commitment to continuous improvement. In our efforts to keep support costs down for our frugal support customers, we constrained the subcontractors’ hours and thus met the customer’s budgetary constraints. From a support perspective, however, this should never have been allowed to happen: it drives the wrong results, and the customer suffers because we failed to find a more creative way for them to afford these quality and security measures.

TAKING UP THE GAUNTLET

We accept! We accept the challenge and take up the gauntlet. Some things must change right away. Let’s outline them.

  1. We will require a minimum fixed cost support contract to be in place for all production systems. A more rugged support contract will be proposed for higher traffic, more mission critical apps.

  1. We will define, document and explain our project workflow and deployment processes. We promise customer visibility and must deliver without exception.

  1. We will set and agree upon a minimum grade requirement for code quality. We will continue to use and explain to our customers the importance of Code Climate, and tools like it, to demonstrate quality code throughout the project life cycle. This is especially critical if we are taking over code from another development team.

  1. We will exercise the application’s mission critical features regularly. We will require a comprehensive set of test cases. From this list, we will require that the business owners declare which of the test cases are mission critical. These mission critical test cases will be subjected to regular exercise and validation. When testing the service dependencies used by the application, we must test the same plumbing used by the application itself. No more blind stubbing! Blind stubbing is not really exercise — it’s like saying “I thought about exercising,” but is not true exercise.

    We will no longer allow our customers to assume that their app is healthy without clear and regularly scheduled proof. Generic application-wide code coverage is not an accurate representation of risk. Success is not about 80% code coverage or 90% code coverage — even 100% code coverage is not a clear indicator that your app is actually going to work when it’s game time. True success is about regularly exercising 100% of the app’s mission critical plumbing.

  1. We will enforce pre-release conditions anchored in three primary documents. These documents already exist for most support projects today, but are of inconsistent quality and are inconsistently managed. Regarding handoff of this application to or from CabForward℠, these three documents are required. Any system that is promoted to production by our team will require these documents. Handoff of existing applications to CabForward℠ would follow this same requirement — before we accept a new production application, either the customer will require this documentation to be supplied by the former developer, or we will be hired to produce these documents. This will assure that risks will be identified and understood. During the on-boarding and off-boarding processes, we will list and validate all assumptions.

    • Application Architecture: This document will include the initial user stories, the proposed architectural diagram, and all touch points with third-party utilities, along with full descriptions of intent of use.
    • Operations Administration: This document explains the system architecture at a very detailed technical level, including how to set up the environments and full documentation on third-party touch points. It will also contain all login details and service ownership details for infrastructure components. Furthermore, it will outline agreed-upon acceptable levels of risk, including what degrees of monitoring, code analysis, testing, and self-healing are sufficient for the business.
    • Release Plan: This is the documented plan for taking the product from staging to production, including sign-off from the customer and detailed instructions on how to push all releases to production. High-level snapshots of differences should be presented so non-technical stakeholders can understand exactly what is being pushed to production.

  1. We will maintain operational readiness by leveraging continuous delivery systems that address any and all issues with repeatability, reliability and process continuity in real time. We will test every build and require that all pre-release conditions are met as part of every iteration towards release. This will be done without exception and will be enforced at the level of the organization rather than the developer.

  1. We will continuously audit and improve our internal systems and processes. We recognize that our current systems are imperfect and require attention. The systems put in place two years ago are no longer adequate. We will not be beholden to any single system or antiquated tool or one way of doing things. We will embrace and enforce the concept that a system is not rugged without many different types of software running in many different places looking for many different things.
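
The commitment above to exercise mission-critical features against real plumbing, rather than stubbing it, could look something like the following Ruby sketch. Treat it as an illustration under stated assumptions, not our actual tooling: `Check`, `tcp_reachable!`, `run_checks` and `failing` are hypothetical names. Each check opens a genuine network connection, so a missing email service fails loudly. A TCP connection only proves the service is reachable; a fuller check would send a real test message through the same mail settings the app uses.

```ruby
require "socket"

# One mission-critical check: a name plus a probe that exercises the real
# dependency. Both names are hypothetical, chosen for this illustration.
Check = Struct.new(:name, :probe)

# Opens a real TCP connection to host:port and closes it again.
# Raises (for example Errno::ECONNREFUSED) when nothing is listening.
def tcp_reachable!(host, port, timeout: 5)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
end

# Runs every check and returns a hash mapping each check name to nil when
# it passed, or to an error description when it failed.
def run_checks(checks)
  checks.to_h do |check|
    begin
      check.probe.call
      [check.name, nil]
    rescue StandardError => e
      [check.name, "#{e.class}: #{e.message}"]
    end
  end
end

# Names of the checks that failed, ready to hand to an alerting step.
def failing(results)
  results.reject { |_name, error| error.nil? }.keys
end
```

In use, the business owners’ list of mission-critical features becomes a list of `Check` objects, for example one whose probe calls `tcp_reachable!` with the mail host and port the application itself is configured with. A scheduled job runs `run_checks` and raises an alert whenever `failing` returns anything. Had a check like this been running, the missing email service would have surfaced within one scheduled run of going live, not months later.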

FAIL FORWARD

The lessons we have learned here will take us into tomorrow. We cannot and will not stagnate. We will not be content with previous standards. We will iterate. We will learn. We will embrace each challenge with honesty and humility. If we fail, we fail forward — fearless because we are true believers in the rugged way. We will build a better, stronger, more resilient team, process, product, customer, relationship, standard.

Our customers trust us with their products. They trust us to be technology professionals. We have an obligation to be more than just an assemblage of cowboys. This is our promise to our customers. And we must continue to push ourselves to stay current, sharp and ready for battle at any moment. This is the rugged way.

JOIN US IN OUR RUGGED MISSION

I invite you to join us in our mission to be more rugged and to stop making excuses. Strive to make fewer mistakes, and when you do make them, share your failures and learnings with the world so we can all benefit. The idea of “rugged” starts with a personal commitment to quality and excellence, but it can’t stop there, especially when you’re the CEO. No excuses, no exceptions. Pick up the gauntlet. Let’s go!

Lance