Selecting a Test Data Management Strategy

As with any good engineering problem, there is no one-size-fits-all approach to Test Data Management. Requirements matter. Constraints matter. So we need to do some thinking to determine the best Test Data Management strategy for our situation. Spoiler alert: you will usually need to apply a hybrid of several of these approaches. Here are the common approaches and when to use them.

Cloning Production Data

In this approach, a copy of the production database is made and deployed to a pre-production environment for testing. For many organizations, this is a good place to start because it's often very easy to implement. Export, import, update your QA app config and you're ready to go.
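
As a rough illustration of the export/import step, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a real production database. The table name, the sample rows, and the `clone_database` helper are assumptions made for the example. A real system would normally use its database vendor's dump and restore tooling instead.

```python
import sqlite3


def clone_database(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    """Copy every table and row from source into target."""
    # Connection.backup copies the whole database in one call.
    source.backup(target)


# Stand-in for production: an in-memory database with one table.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
prod.executemany("INSERT INTO accounts (name) VALUES (?)", [("Acme",), ("Globex",)])
prod.commit()

# Stand-in for the pre-production environment.
qa = sqlite3.connect(":memory:")
clone_database(prod, qa)

# The QA app config would then point at the clone instead of production.
cloned_rows = qa.execute("SELECT name FROM accounts ORDER BY id").fetchall()
```

The clone is a point-in-time copy, so later changes in production do not reach it until the next refresh.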

When dealing with production data, you can find scenarios your customers are really facing that you maybe didn't think of (e.g. "That drop down box has 500 values!?" or "We have Canadian zip codes in our data?"). It can be really useful.

But this approach has its limits. When I worked for a big data analytics company, I floated the idea of cloning production for testing. Finding out how expensive cloning a big data database would be was a real eye opener.

Believe it or not, production data might not have all the scenarios you need to test. Another company I worked for loaded data from thousands of data sources. Our systems needed to be robust enough to handle the data they would send us tomorrow, which wasn't always the same as the data they had sent us to date.

Another limitation of cloning production is that data privacy and security concerns are higher than they ever have been. Depending on your industry, the type of data you deal with, and who your customers are, putting production data in a test environment may make it impossible for your developers to work in the test environment or may even be an unthinkable option.

Advantages

✅ Easy to implement
✅ Good for performance testing
✅ Most production-like test environment
✅ Find scenarios you haven't thought of

Disadvantages

❌ Production copies may be too expensive
❌ Doesn't cover scenarios that haven't happened yet
❌ May introduce data privacy and security concerns
❌ Test data may change, impacting your tests

Sampling Production Data

This brings us to our next strategy: Sampling Production Data. In Sampling Production Data, you write scripts to copy data from production. But instead of copying all the data, you copy a smaller, representative sample of the data.

How you sample your data largely determines how easy this approach is to implement and how useful it is. Most systems have data segments that are largely independent of each other. For example, you may be able to copy all your reference tables and all accounts starting with "A". If your account names are roughly evenly distributed across the alphabet, you'll get about 1/26th of the production data. At the same time, you're not covering the scenario of an account name that starts with a special character.
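
The "reference tables plus accounts starting with A" idea can be sketched as follows. This is only an illustration under assumed names: the `countries`, `accounts`, and `orders` tables, the sample rows, and the `sample_by_prefix` helper are all hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total REAL);
"""


def sample_by_prefix(source, target, prefix):
    """Copy reference data whole, plus accounts whose name starts with
    `prefix` and the orders belonging to those accounts."""
    target.executescript(SCHEMA)
    # Reference tables are small, so copy them in full.
    target.executemany(
        "INSERT INTO countries VALUES (?, ?)",
        source.execute("SELECT code, name FROM countries").fetchall(),
    )
    # substr() keeps the match case-sensitive, unlike SQLite's default LIKE.
    target.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        source.execute(
            "SELECT id, name, country FROM accounts "
            "WHERE substr(name, 1, length(?)) = ?",
            (prefix, prefix),
        ).fetchall(),
    )
    # Child rows follow their sampled parents so references stay valid.
    target.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        source.execute(
            "SELECT o.id, o.account_id, o.total FROM orders o "
            "JOIN accounts a ON a.id = o.account_id "
            "WHERE substr(a.name, 1, length(?)) = ?",
            (prefix, prefix),
        ).fetchall(),
    )
    target.commit()


# Stand-in for production.
prod = sqlite3.connect(":memory:")
prod.executescript(SCHEMA)
prod.executemany("INSERT INTO countries VALUES (?, ?)",
                 [("US", "United States"), ("CA", "Canada")])
prod.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Acme", "US"), (2, "Apex", "CA"),
                  (3, "Globex", "US"), (4, "#1 Plumbing", "US")])
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 3, 20.0), (3, 2, 30.0), (4, 4, 5.0)])
prod.commit()

qa = sqlite3.connect(":memory:")
sample_by_prefix(prod, qa, "A")
sampled_names = [row[0] for row in qa.execute("SELECT name FROM accounts ORDER BY id")]
sampled_order_ids = [row[0] for row in qa.execute("SELECT id FROM orders ORDER BY id")]
```

In this sketch the "#1 Plumbing" account never reaches the sample, which is exactly the kind of gap a prefix-based sample leaves behind.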

Sampling addresses some, but not all, of the disadvantages of Cloning Production Data.

Advantages

✅ Still a more production-like test environment
✅ Finding scenarios you haven't thought of
✅ Can be significantly less expensive than cloning production

Disadvantages

❌ Still doesn't cover scenarios that haven't happened yet
❌ May still introduce data privacy and security concerns
❌ Test data may change, impacting your tests

Masking Production Data

Advantages

✅ Still results in production-like data
✅ Addresses some concerns around data privacy and security

Disadvantages

❌ Masking may impact properties and usefulness of the data (e.g. testing a validator on masked data may tell you everything is invalid)
❌ Can be a challenge to implement, especially if keys need to be masked
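
Both disadvantages can be softened with deterministic, format-preserving masking. The sketch below is one possible approach, not a recommendation of a specific tool: it uses a keyed hash (HMAC) from the Python standard library so that the same input always masks to the same output, which keeps masked keys joinable across tables. The `SECRET` value, the field names, and the helper functions are assumptions for illustration, and keyed hashing alone may not satisfy your organization's privacy requirements.

```python
import hashlib
import hmac

# Hypothetical masking key; in practice it would live outside source control.
SECRET = b"test-env-masking-key"


def mask_value(value: str, prefix: str = "") -> str:
    """Return a stable pseudonym: the same input always gives the same output."""
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"


def mask_email(email: str) -> str:
    """Mask the local part but keep a valid email shape for validators."""
    local, _, _domain = email.partition("@")
    return f"{mask_value(local, 'user_')}@example.com"


customers = [{"customer_id": "C-1001", "email": "jane.doe@corp.example"}]
orders = [{"order_id": 1, "customer_id": "C-1001"}]

masked_customers = [
    {"customer_id": mask_value(c["customer_id"], "C-"), "email": mask_email(c["email"])}
    for c in customers
]
masked_orders = [
    {"order_id": o["order_id"], "customer_id": mask_value(o["customer_id"], "C-")}
    for o in orders
]
```

Because the key column is masked the same way in both tables, an order still joins to its customer after masking, and the masked email still looks like an email to a validator.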

Hand-Crafted Test Data

With Hand-crafted Test Data, you take the time to add test data for new scenarios and features. This is the most time-consuming of the strategies, but it results in the smallest, most meaningful data set.

It's not a great option for performance testing or reproducing client issues. But it can be a really great option for developer environments where you may want many copies, including copies that may live on developer workstations (the idea of having production data living on a developer laptop that goes missing while they're at the gas station is a nightmare that can be avoided here).
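
A hand-crafted data set can be as simple as a handful of records, each named for the scenario it exists to exercise. The sketch below is hypothetical throughout: the `Account` type, the record values, and the `is_valid_us_zip` function under test are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    name: str
    postal_code: str
    country: str


# Each record exists to exercise exactly one named scenario.
HAND_CRAFTED_ACCOUNTS = {
    "typical_us_account": Account("Acme Corp", "30301", "US"),
    "canadian_postal_code": Account("Maple Goods", "K1A 0B1", "CA"),
    "name_starts_with_special_character": Account("#1 Plumbing", "10001", "US"),
    "empty_postal_code": Account("No Address LLC", "", "US"),
}


def is_valid_us_zip(code: str) -> bool:
    """Hypothetical function under test: accept five-digit US ZIP codes only."""
    return len(code) == 5 and code.isdigit()


# Results keyed by scenario name, so a failure points at its cause.
results = {
    scenario: is_valid_us_zip(account.postal_code)
    for scenario, account in HAND_CRAFTED_ACCOUNTS.items()
}
```

Because the data set is tiny and contains nothing from production, it can be checked into the repository and copied to every developer workstation.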

Advantages

✅ Results in the smallest reasonable data set
✅ Best coverage of specific scenarios
✅ Test failures are more meaningful
✅ Catches issues earlier
✅ Helps inform the design of the software

Disadvantages

❌ Requires the most work
❌ May require a culture shift
❌ In some organizations, it's not clear who should do this and whether or not they'll have access

Automatic Generation of Test Data

Because of the expense required to craft test data, tools are cropping up to help solve this problem. "Test data automation" tools like Curiosity make it easier to populate your pre-production databases with production-like data that is purely fictional.

This is typically less work than hand-crafting data, but be aware that the implementation isn't free. Chasing down data integrity constraints and creating realistic data still requires some work.

Automatically generated data is good because you can still have volume for performance testing and don't need to worry about data security and privacy. But it can also be harder to make assertions on the data for your scenarios.
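
To make the trade-off concrete, here is a minimal sketch of a generator built only on Python's standard library. It is not how Curiosity or any other commercial tool works; the name lists, the field names, and the `generate_customers` function are assumptions for the example. Fixing the random seed is one way to ease the assertion problem, since the same seed reproduces the same data set.

```python
import random

# Purely fictional building blocks for generated records.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak", "Haddad"]


def generate_customers(count: int, seed: int) -> list:
    """Generate `count` fictional customers; the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, independent of global state
    customers = []
    for i in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        customers.append({
            "id": i,
            "name": f"{first} {last}",
            # The id suffix keeps emails unique even when names repeat.
            "email": f"{first}.{last}.{i}@example.com".lower(),
            "balance_cents": rng.randint(0, 1_000_000),
        })
    return customers


customers = generate_customers(1000, seed=42)
```

With generated data, tests tend to assert invariants (every email is unique, every balance is within range) rather than exact values for a specific record.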

Advantages

✅ Results in production-like data
✅ Alleviates security and data privacy concerns
✅ Less work than hand-crafting data

Disadvantages

❌ Requires additional software
❌ May require more work than you think
❌ Hard to test assertions on generated data

So Which Approach is Right for You?

The short answer is that you probably need some combination of these approaches. Hand-crafted Test Data can be a fantastic approach for Development environments, but it falls short when running performance tests or troubleshooting client issues in QA/Staging. Masking individual columns (e.g. company name, first names, last names, emails) can be fairly straightforward, but masking everything can get really complicated.