Hardest QA Fails of 2023
This time of year, everyone likes to recap all the wonderful things that happened over the past twelve months. At Testery, we are lifelong students of complex technical systems and of how to prevent their failures. So instead, we're going to recap some of our favorite system and testing failures of 2023.
IKEA Pays $24M For Printing Credit Card Numbers on Receipts
It isn't clear whether this happened because of the requirements or by mistake, but earlier this year, IKEA got stuck paying a $24M settlement for printing too many digits of the credit card number on its receipts.
This is a helpful reminder that well-seasoned product owners and testers should be familiar with the laws that apply to their companies and be on the lookout for features that don't respect those laws.
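One way a tester might guard against this kind of regression is an automated check on what the receipt actually prints. The sketch below is illustrative only: `mask_card_number`, `count_visible_digits`, and the limit of four visible digits are assumptions made for the example, not IKEA's code and not a statement of what any particular law requires. Confirm the real limit for your business before relying on it.

```python
import re

# Assumed limit for illustration; confirm the legal requirement that applies to you.
MAX_VISIBLE_DIGITS = 4


def mask_card_number(card_number: str, visible: int = MAX_VISIBLE_DIGITS) -> str:
    """Return the card number with all but the last `visible` digits replaced by '*'."""
    digits = re.sub(r"\D", "", card_number)  # drop spaces, dashes, and other separators
    if visible < 1 or len(digits) <= visible:
        raise ValueError("cannot mask: check the card number and the visible digit count")
    return "*" * (len(digits) - visible) + digits[-visible:]


def count_visible_digits(receipt_line: str) -> int:
    """Count the digit characters that appear in a printed receipt line."""
    return sum(ch.isdigit() for ch in receipt_line)


receipt_line = "CARD " + mask_card_number("4111 1111 1111 1111")
# A receipt test can then fail the build if too many digits are exposed.
assert count_visible_digits(receipt_line) <= MAX_VISIBLE_DIGITS
```

A check like this is cheap to run against every receipt template, which matters because the person writing the template may not know the rule exists.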
Rough Year for Southwest Airlines
In December of 2022, Southwest Airlines faced a major disruption. While weather was a significant cause, other airlines weren't impacted to the same degree. It's not clear to what extent software was to blame, but industry experts suggest that "the company's uniquely complex flight coordination model and its antiquated internal scheduling systems" played a big role.
Regardless, it's a good reminder of what happens when complex systems fail big.
"The holiday disruption led to over 16,700 flight cancellations, costing the company more than $1 billion" (https://abcnews.go.com/Business/causing-flights-meltdown-southwest-airlines/story?id=95888949)
Four months later, however, Southwest faced another major issue which appeared to be more software-related in nature.
“Southwest has resumed operations after temporarily pausing flight activity this morning to work through data connection issues resulting from a firewall failure,” said Dan Landson, a spokesperson for Southwest Airlines. “Early this morning, a vendor-supplied firewall went down and connection to some operational data was unexpectedly lost.”
Datadog $5M Outage
You have to give Datadog a lot of credit for this one. They did a lot of things right. They were actually applying updates to their systems. They had rolling deployments. They had multiple regions.
The Pragmatic Engineer does a better job explaining what happened than I could ever hope to, so please read the article.
ASUS Routers Stopped Routing
Earlier this year, a configuration file change left thousands of users without internet access for 48 hours.
"The mass outage, the company said, was the result of 'an error in the configuration of our server settings file.'"
Janitor Turns Off Beeping Noise Causing $1M in Damages
Earlier this year, a janitor caused $1M in damages by unplugging a freezer that was constantly making a beeping sound.
It's a helpful reminder that the physical security of our devices has an impact on their ability to function properly. This holds true not only for freezers, but also for servers.
"A janitor cleaning in a laboratory at a university in Troy, New York, is accused of damaging at least $1 million in scientific research after shutting off the storage freezer while trying to turn off a constant beeping noise"
U.S. Banks Fail To Process 850,000 Transactions On Time
Data masking is a helpful technique for protecting production data. In many systems, configuration files or switches control settings such as whether or not the data is masked.
"... instructions were sent to financial institutions 'with the account number and names of customers masked.'"
In this particular case, data masking was accidentally turned on in production (and probably not caught in QA, because it was supposed to be on in QA), which delayed the processing of hundreds of thousands of bank transactions.
Be sure to keep a close eye on any settings that are different between dev, test, and production environments!
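A minimal sketch of one way to keep that eye on things: diff each environment's settings against production and flag any difference that hasn't been explicitly approved, then add a guard that refuses to run production with a test-only setting. The setting names (`mask_customer_data`, `db_host`, `batch_size`), the `ALLOWED_DIFFERENCES` allowlist, and the helper functions are all hypothetical; nothing here is taken from the actual incident.

```python
# Hypothetical settings for illustration; real values would be loaded from config files.
PROD = {"mask_customer_data": False, "db_host": "prod-db", "batch_size": 500}
QA = {"mask_customer_data": True, "db_host": "qa-db", "batch_size": 500}

# Keys that are expected to differ between environments (assumed allowlist).
ALLOWED_DIFFERENCES = {"db_host", "mask_customer_data"}

MISSING = "<missing>"  # placeholder shown when a key exists in only one environment


def config_differences(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose value differs."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, MISSING), b.get(key, MISSING)
        if left != right:
            diffs[key] = (left, right)
    return diffs


def unexpected_differences(a: dict, b: dict, allowed: set) -> dict:
    """Return only the differences that are not on the approved allowlist."""
    return {k: v for k, v in config_differences(a, b).items() if k not in allowed}


def assert_production_safe(config: dict) -> None:
    """Refuse to start production if a test-only setting has leaked in."""
    if config.get("mask_customer_data"):
        raise RuntimeError("data masking is enabled in production")


print(config_differences(QA, PROD))
print(unexpected_differences(QA, PROD, ALLOWED_DIFFERENCES))
assert_production_safe(PROD)
```

The diff alone would not have caught this failure, since masking is supposed to differ between QA and production. That is why the sketch pairs it with a production-side guard: any setting on the allowlist is, by definition, one that QA cannot verify for you.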