During this incident, TaxJar customers were not able to access the TaxJar App or use the TaxJar API. We know this was impactful, and we are truly sorry it happened.
We have already implemented the following operational changes to ensure this type of failure does not happen again:
Incident Root Cause Analysis
The incident started with a routine Kubernetes minor version upgrade using our vendor’s managed Kubernetes service.
Immediately following completion of the upgrade of our production cluster, Kubernetes workers began reporting “Not Ready” status.
Within a few minutes, all nodes were in a “Not Ready” state, which caused our load balancers to mark all workloads as offline.
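For illustration only, a post-upgrade readiness check for this symptom could be sketched as below. It is not a description of the tooling used during the incident. It assumes the node list has been exported in the JSON shape produced by `kubectl get nodes -o json`, and the function name `not_ready_nodes` is hypothetical.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True".

    `node_list` is assumed to be the parsed output of
    `kubectl get nodes -o json` (a dict with an "items" list).
    """
    failing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as ready only if it reports a condition of
        # type "Ready" with status "True"; "False", "Unknown", or a
        # missing condition are all treated as not ready.
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(node["metadata"]["name"])
    return failing
```

A check like this could gate the upgrade of further clusters: if the returned list is non-empty shortly after an upgrade completes, the rollout would be paused for investigation.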
The vendor’s support team was able to identify the issue:
We manually added the missing rule, which restored connectivity to our managed Kubernetes cluster.
At this point our services started coming back online.
After the upgrade, we also found that several other security groups, managed with CloudFormation, had been unexpectedly altered. These groups relied on the same rule for connectivity between our K8s workloads and other services provided by the vendor (such as memory caches and databases), and they had to be repaired before all services could be restored.
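As a rough sketch of how this kind of unexpected change could be detected, the example below compares the rules each security group is expected to have with the rules it actually has. Everything here is an assumption for illustration: the function name `find_missing_rules`, the group identifiers, and the rule tuples are hypothetical and do not represent the rule involved in this incident.

```python
def find_missing_rules(expected, actual):
    """Return, per security group, the expected rules that are absent.

    Both arguments map a group identifier to an iterable of rules.
    Each rule is a hashable tuple, here assumed to be
    (protocol, from_port, to_port, source).
    """
    missing = {}
    for group_id, rules in expected.items():
        # Rules that should exist but are not present on the live group.
        absent = set(rules) - set(actual.get(group_id, ()))
        if absent:
            missing[group_id] = sorted(absent)
    return missing


# Hypothetical data: one group has lost a rule, the other is intact.
expected = {
    "sg-example-cache": [("tcp", 1000, 1000, "sg-example-workers")],
    "sg-example-db": [("tcp", 2000, 2000, "sg-example-workers")],
}
actual = {
    "sg-example-cache": [],
    "sg-example-db": [("tcp", 2000, 2000, "sg-example-workers")],
}
drift = find_missing_rules(expected, actual)
```

Running a comparison like this before and after an upgrade would surface any group whose rules changed, rather than relying on failed connections to reveal it.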
We continue to work with the vendor to understand why the managed service did not operate as documented.