GitHub has fixed an issue that disrupted pull requests and other services, blaming it on a network configuration change that was applied to the wrong environment.
Late on March 11, developers began experiencing intermittent errors when opening pull requests. Error rates ranged from 1% for the API to as high as 100% for Secret Scanning and for 2FA using GitHub Mobile. The company says the issue has been addressed and that it is working on measures to prevent it from happening again.
The issue was caused by a deployment of network-related configuration that was inadvertently applied to the incorrect environment. This error was detected within 4 minutes and a rollback was initiated. While error rates began dropping quickly at 22:55 UTC, the rollback failed in one of our data centers, leading to a longer recovery time. At this point, many failed requests succeeded upon retrying. The rollback failure was due to an unrelated issue that had occurred earlier in the day, in which the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in this one data center to fail. A manual removal of the incorrect data allowed the full rollback to complete at 00:48 UTC, thereby restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.
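The note that many failed requests succeeded on retry suggests that client-side retries could have softened the impact for some callers. Below is a minimal sketch of retrying with exponential backoff and jitter. It is not taken from GitHub's tooling: `TransientError` and `call_with_retries` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a failure worth retrying, such as an HTTP 5xx."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits a random time between 0 and an exponentially growing ceiling
    (capped at max_delay) before each retry. Defaults are illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The jitter spreads retries out so that many clients recovering at once do not hit a partially restored service at the same moment.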
We are working on various measures to ensure the safety of this kind of configuration change, faster detection of the problem through better monitoring of the related subsystems, and improvements to the robustness of our underlying configuration system, including prevention and automatic cleanup of polluted records, so that we can automatically recover from this kind of data issue in the future.
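GitHub has not described how it will make such changes safer. As an illustration only, one common safeguard against applying a change to the wrong environment is to have each change declare its intended environment and to refuse deployment when that does not match the target. The sketch below assumes this approach. `ConfigChange`, `EnvironmentMismatchError`, and `apply_change` are hypothetical names, not part of any GitHub system.

```python
from dataclasses import dataclass


class EnvironmentMismatchError(Exception):
    """Raised when a change is aimed at an environment it was not written for."""


@dataclass(frozen=True)
class ConfigChange:
    name: str
    intended_environment: str  # e.g. "staging" or "production" (assumed labels)
    payload: dict


def apply_change(change, target_environment, apply):
    """Apply change.payload via apply() only if the environments match.

    apply is a caller-supplied function that performs the real deployment;
    it is never called when the check fails.
    """
    if change.intended_environment != target_environment:
        raise EnvironmentMismatchError(
            f"{change.name!r} is intended for {change.intended_environment!r}, "
            f"not {target_environment!r}"
        )
    return apply(change.payload)
```

For example, a change declared for `"staging"` would be rejected before any deployment step runs if the tooling were pointed at `"production"`.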