Selecting a Test Data Management Strategy
As with any good engineering problem, there is no one-size-fits-all approach to Test Data Management. Requirements matter. Constraints matter. So we need to do some thinking to determine the best Test Data Management strategy for our situation. Spoiler alert: you will usually need to apply a hybrid of several of these approaches. Here are the common approaches and when to use them.
Cloning Production Data
In this approach, a copy of the production database is made and deployed to a pre-production environment for testing. For many organizations, this is a good place to start because it's often very easy to implement. Export, import, update your QA app config and you're ready to go.
Working with production data surfaces scenarios your customers actually face that you may not have thought of (e.g. "That drop-down box has 500 values!?" or "We have Canadian zip codes in our data?"). It can be really useful.
But this approach has its limits. When I worked for a big data analytics company, I floated the idea of cloning production for testing. Finding out how expensive cloning a big data database would be was a real eye opener.
Believe it or not, production data might not have all the scenarios you need to test. Another company I worked for loaded data from thousands of data sources. Our systems needed to be robust enough to handle the data they would send us tomorrow, which wasn't always the same as the data they had sent us to date.
Another limitation of cloning production is that data privacy and security concerns are higher than they ever have been. Depending on your industry, the type of data you deal with, and who your customers are, putting production data in a test environment may make it impossible for your developers to work in the test environment or may even be an unthinkable option.
Advantages ✅ Easy to implement
Disadvantages ❌ Production copies may be too expensive
Sampling Production Data
This brings us to our next strategy: Sampling Production Data. Here you write scripts to copy data from production, but instead of copying all the data, you copy a smaller, representative sample.
How you sample your data largely determines how easy this approach is to implement and how useful it is. Most systems have data segments that are largely independent of each other and can be separated fairly easily. For example, you may be able to copy all your reference tables and all accounts starting with "A". If account names are spread roughly evenly across the alphabet, you'll get about 1/26th of the production data. At the same time, you're not covering the scenario of an account name that starts with a special character.
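The "reference tables plus accounts starting with A" idea can be sketched as follows. This is a minimal illustration, not a production script: it uses in-memory SQLite databases to stand in for production and test, and the `account` and `account_type` tables, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one reference table and one table that refers to it.
SCHEMA = """
CREATE TABLE account_type (code TEXT PRIMARY KEY, label TEXT);
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    name TEXT,
    type_code TEXT REFERENCES account_type(code)
);
"""

def sample_accounts(prod, test, prefix="A"):
    """Copy the reference table in full, plus accounts whose name starts with prefix."""
    # Reference data is copied whole so foreign keys in the sample still resolve.
    types = prod.execute("SELECT code, label FROM account_type").fetchall()
    test.executemany("INSERT INTO account_type VALUES (?, ?)", types)
    # Only the account segment selected by the (case-sensitive) prefix is copied.
    accounts = prod.execute(
        "SELECT id, name, type_code FROM account WHERE substr(name, 1, ?) = ?",
        (len(prefix), prefix),
    ).fetchall()
    test.executemany("INSERT INTO account VALUES (?, ?, ?)", accounts)
    test.commit()
    return len(accounts)

def demo():
    """Build a tiny fake 'production' database and sample it into 'test'."""
    prod = sqlite3.connect(":memory:")
    test = sqlite3.connect(":memory:")
    prod.executescript(SCHEMA)
    test.executescript(SCHEMA)
    prod.executemany(
        "INSERT INTO account_type VALUES (?, ?)",
        [("B", "Business"), ("P", "Personal")],
    )
    prod.executemany(
        "INSERT INTO account VALUES (?, ?, ?)",
        [(1, "Acme", "B"), (2, "Apex", "B"), (3, "Zenith", "P"), (4, "#1 Plumbing", "B")],
    )
    copied = sample_accounts(prod, test)
    names = [row[0] for row in test.execute("SELECT name FROM account ORDER BY id")]
    return copied, names
```

In this sketch, `demo()` copies "Acme" and "Apex" but not "#1 Plumbing", which is exactly the special-character gap described above: the sampling rule decides which scenarios your test environment can ever see.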
Sampling addresses some, but not all, of the disadvantages of Cloning Production Data.
Advantages ✅ Still a more production-like test environment
Disadvantages ❌ Still doesn't cover scenarios that haven't happened yet
Masking Production Data
Advantages ✅ Still results in production-like data
Disadvantages ❌ Masking may impact properties and usefulness of the data (e.g. testing a validator on masked data may tell you everything is invalid)
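To make the validator problem concrete, here is a small sketch of masking a single email column. It is an illustration under assumptions, not a recommended anonymization scheme: `mask_email`, `mask_naive`, the salt value, and the simplified `EMAIL_RE` pattern are all hypothetical, and a salted hash on its own may not satisfy your privacy requirements.

```python
import hashlib
import re

# Deliberately simple email check, standing in for whatever validator the app uses.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def mask_naive(value):
    """Blank out every character. Hides the data but destroys its format."""
    return "*" * len(value)

def mask_email(email, salt="test-env"):
    """Replace an email with a deterministic fake that still looks like an email.

    The same input always maps to the same output, so rows that shared an
    address in production still share one after masking.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    # example.com is reserved for documentation, so no real mailbox is hit.
    return f"user_{digest[:12]}@example.com"
```

With this sketch, `EMAIL_RE.match(mask_naive("jane.doe@corp.com"))` fails, which is the "everything is invalid" outcome, while `EMAIL_RE.match(mask_email("jane.doe@corp.com"))` still passes because the masked value keeps the shape of an email address.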
Hand-Crafted Test Data
With Hand-Crafted Test Data, you take the time to add test data for new scenarios and features. This is the most time-consuming of the strategies, but it results in the smallest, most meaningful data set.
It's not a great option for performance testing or reproducing client issues. But it can be a really great option for developer environments where you may want many copies, including copies that may live on developer workstations (the idea of production data living on a developer laptop that goes missing at the gas station is a nightmare that can be avoided here).
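One way to keep a hand-crafted data set small and meaningful is to name each record after the scenario it exists to cover. The sketch below assumes a hypothetical `Account` shape and invented sample values; the scenarios are borrowed from the surprises mentioned earlier (Canadian postal codes, names starting with a special character).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    name: str
    postal_code: str
    country: str

# Each key names the scenario the record was written to exercise.
HAND_CRAFTED_ACCOUNTS = {
    "us_typical": Account("Acme Hardware", "30301", "US"),
    "canadian_postal_code": Account("Maple Outfitters", "K1A 0B1", "CA"),
    "name_starts_with_symbol": Account("#1 Plumbing", "73301", "US"),
}

def scenarios_for(country):
    """Return the scenario names whose account is in the given country."""
    return sorted(
        key for key, account in HAND_CRAFTED_ACCOUNTS.items()
        if account.country == country
    )
```

Because every row is fictional and has a stated purpose, a set like this can be checked into source control and loaded on any developer workstation without privacy concerns.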
Advantages ✅ Results in the smallest reasonable data set
Disadvantages ❌ Requires the most work
Automatic Generation of Test Data
Because of the expense required to craft test data, tools are cropping up to help solve this problem. "Test data automation" tools like Curiosity make it easier to populate your pre-production databases with production-like data that is purely fictional.
This is typically less work than hand-crafting data, but be aware that the implementation isn't free. Chasing down data integrity constraints and creating realistic data still requires some work.
Automatically generated data is good because you can still have volume for performance testing and don't need to worry about data security and privacy. But it can also be harder to make assertions on the data for your scenarios.
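The trade-off between volume and assertability can be illustrated with a few lines of standard-library Python. This is a rough sketch and not how any particular test data automation tool works: `generate_accounts`, the name lists, and the field names are hypothetical.

```python
import random

# Small, obviously fictional vocabularies to draw names from.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dee"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Smith"]

def generate_accounts(count, seed=42):
    """Generate `count` fictional account rows, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator so the seed fully controls output
    rows = []
    for i in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "id": i,
            "name": f"{first} {last}",
            # The id is embedded so emails stay unique (a typical integrity constraint).
            "email": f"{first}.{last}.{i}@example.com".lower(),
            "balance_cents": rng.randint(0, 1_000_000),
        })
    return rows
```

Calling `generate_accounts(100_000)` gives volume for a performance test with no real customer data involved. Since individual values are random, assertions tend to work better on properties (row counts, uniqueness, value ranges) than on specific rows, and fixing the seed keeps runs repeatable.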
Advantages ✅ Results in production-like data
Disadvantages ❌ Requires additional software
So Which Approach is Right for You?
The short answer is that you probably need some combination of these approaches. Hand-Crafted Test Data can be a fantastic approach for Development environments, but it falls short when running performance tests or troubleshooting client issues in QA/Staging. Masking individual columns (e.g. company name, first names, last names, emails) can be fairly straightforward, but masking everything can get really complicated.