automated data masking
Static Data Masking is such an important aspect of an organization's data security posture that I hoped it could somehow be included in dbatools 1.0. I'm excited to report that, thanks to Sander Stad, data masking is now available as of 0.9.725 and works against all versions of SQL Server!
why this is awesome
Static Data Masking is a data protection feature that helps users sanitize sensitive data in a copy of their SQL Server databases. Support for Static Data Masking is now in preview in SQL Server Management Studio, and Red Gate has a product that helps as well.
As Microsoft outlines, Static Data Masking can be used for:
- Development and testing
- Analytics and business reporting
- Troubleshooting
- Sharing the database with a consultant, a research team, or any third-party
Static Data Masking facilitates compliance with security requirements such as the separation between production and dev/test environments. For organizations subject to GDPR, the feature is a convenient tool to remove all personal information while preserving the structure of the database for further processing.
Another really good reason for masking is that attackers often skip production and go straight to dev/test because it may not be as well protected or monitored. Data masking helps keep our data secure by replacing sensitive values with randomly generated fake ones.
For example, the employees of an organization may be assigned an eight-digit employee ID number written like so: 8765-4321. If that data were to be masked, the eight-digit format would remain the same but the values within it would be changed: 3561-2847. Some of the more common methods of data masking are character substitution, numeric variance, character shuffling and format preserving encryption.
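To make that concrete, here's a minimal PowerShell sketch of format-preserving substitution. This is purely illustrative and is not how dbatools generates values (dbatools uses the Bogus library covered later in this post): each digit is swapped for a random digit while the dash and overall shape stay intact.

```powershell
# Replace every digit with a random digit, keeping non-digit characters
# (like the dash in an employee ID) exactly where they are.
function Get-MaskedId {
    param([string]$Value)
    $chars = $Value.ToCharArray() | ForEach-Object {
        if ($_ -match '\d') { Get-Random -Minimum 0 -Maximum 10 } else { $_ }
    }
    -join $chars
}

Get-MaskedId -Value '8765-4321'   # e.g. 3561-2847
```

The masked value keeps the eight-digit, dash-separated format, so code that validates the format still works against the sanitized copy.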
Note that this is different from Dynamic Data Masking, which masks data only as it is queried. Static Data Masking masks the data at rest, in the database itself.
but how?
Masking is challenging because it requires a bunch of dictionaries of fake but realistic-looking data: first names, last names, addresses, email addresses, and more. I didn't know this before Sander told me, but the open source community actually has projects that revolve around data masking! Most important to us is the C# project Bogus, which is based on the JavaScript project faker.js.
masking in dbatools
As Sander highlights in his post, Data masking with dbatools, dbatools now offers two commands to help you mask your data. These commands are based on Sander's commands originally found in PSDatabaseClone.
New-DbaDbMaskingConfig
The first command creates an easy-to-edit JSON file which tells the mask invoker how you'd like your data masked. Here's how I tested the command against AdventureWorks2014.
generate the masking configuration
New-DbaDbMaskingConfig -SqlInstance sql2014 -Database AdventureWorks2014 -Path C:\temp
The command returns the newly created file as Get-ChildItem output. You can see the contents of the config file here.
edit the file and remove unnecessary columns
While the auto config generator creates an entry for every column in the database, it's unlikely that all columns will need to be masked.
With AdventureWorks2014, I took out things like error numbers but left in data like error messages because error messages may contain private information.
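You can edit the file by hand in any editor, but pruning can also be scripted. Here's a hedged sketch using PowerShell's built-in JSON cmdlets; it assumes the top-level Tables collection seen in the generated file, and the excluded column names are just illustrative.

```powershell
# Sketch: remove columns you don't want masked from the generated config.
# The path matches the earlier example; the column names are illustrative only.
$path   = 'C:\temp\sql2014.AdventureWorks2014.tables.json'
$config = Get-Content $path -Raw | ConvertFrom-Json

foreach ($table in $config.Tables) {
    # Keep everything except the columns listed here
    $table.Columns = @($table.Columns | Where-Object Name -notin 'ErrorNumber', 'ErrorSeverity')
}

# -Depth is needed so nested Tables/Columns aren't truncated on the way back out
$config | ConvertTo-Json -Depth 5 | Set-Content $path
```

Either way, the result is the same: a trimmed config that only masks what actually needs masking.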
working with constraints
As it turns out, some columns have constraints that can't be easily determined, so they must be set manually. It was pretty fun figuring out and adding support for constraints. I'll use Employee and EmployeeDepartmentHistory as examples.
First, I ran the invoker command and got a ton of errors because the generated data was out of bounds. So I opened up SSMS and took a look at the constraints.
I then updated all of the constrained columns to match. In the example below, only M and S are allowed in the MaritalStatus column.
{
    "Name": "MaritalStatus",
    "ColumnType": "nchar",
    "CharacterString": "SM", // right here
    "MinValue": null,
    "MaxValue": 1,
    "MaskingType": "Random",
    "SubType": "String"
}
The HumanResources.Employee table was particularly constrained, and I had to update a bunch of the values as seen below.
{
    "Name": "Employee",
    "Schema": "HumanResources",
    "Columns": [
        { "Name": "NationalIDNumber", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 15, "MaskingType": "Random", "SubType": "String" },
        { "Name": "LoginID", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 256, "MaskingType": "Person", "SubType": "UserName" },
        { "Name": "JobTitle", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 50, "MaskingType": "Name", "SubType": "JobType" },
        { "Name": "BirthDate", "ColumnType": "date", "CharacterString": null, "MinValue": "1/1/1970", "MaxValue": "1/1/1990", "MaskingType": "Random", "SubType": "Date" },
        { "Name": "MaritalStatus", "ColumnType": "nchar", "CharacterString": "SM", "MinValue": null, "MaxValue": 1, "MaskingType": "Random", "SubType": "String" },
        { "Name": "Gender", "ColumnType": "nchar", "CharacterString": "MF", "MinValue": 1, "MaxValue": 1, "MaskingType": "Random", "SubType": "String" },
        { "Name": "VacationHours", "ColumnType": "smallint", "CharacterString": null, "MinValue": -40, "MaxValue": 200, "MaskingType": "Random", "SubType": "Number" },
        { "Name": "SickLeaveHours", "ColumnType": "smallint", "CharacterString": null, "MinValue": 0, "MaxValue": 120, "MaskingType": "Random", "SubType": "Number" }
    ]
}
HumanResources.EmployeeDepartmentHistory was also interesting to work with. I had to ensure the StartDate was before the EndDate, which I did by setting their Min and Max values.
{
    "Name": "EmployeeDepartmentHistory",
    "Schema": "HumanResources",
    "Columns": [
        {
            "Name": "StartDate",
            "ColumnType": "date",
            "CharacterString": null,
            "MinValue": null,
            "MaxValue": "1/1/2018", // can't start later than Jan 1
            "MaskingType": "Random",
            "SubType": "Date"
        },
        {
            "Name": "EndDate",
            "ColumnType": "date",
            "CharacterString": null,
            "MinValue": "1/2/2018", // so it'll always be after
            "MaxValue": null,
            "MaskingType": "Random",
            "SubType": "Date"
        }
    ]
}
Here's a sample of the final file.
figuring out data types
You can see above that we used masking types like "Random" and "Date". We figured these out by performing good ol' Get-Member against the Bogus class.
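If you'd like to explore the available generators yourself, you can do the same. A rough sketch, assuming the Bogus assembly is already loaded into your session (it ships with dbatools, so importing the module takes care of that):

```powershell
# Create a Bogus faker and list its data sets (Name, Person, Date, Address, etc.)
$faker = New-Object Bogus.Faker
$faker | Get-Member -MemberType Property

# Drill into one data set to see the individual generator methods it exposes
$faker.Name | Get-Member -MemberType Method
```

Each property on the faker roughly corresponds to a MaskingType in the config file, and the methods beneath it to the SubTypes.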
Though it's a bit rough right now, we also generated a website to help guide you when picking more specific data types.
Invoke-DbaDbDataMasking
Now that we've got our config file ready, it's time to execute!
Invoke-DbaDbDataMasking -SqlInstance sql2017 -Database AW -FilePath C:\temp\sql2014.AdventureWorks2014.tables.json
We also wanted a single config file to be flexible, so we added support for processing only specific tables and columns. Let's say you have two dev environments: one needs to update Employee and the other EmployeeDepartmentHistory. You can limit which tables and columns are masked.
Get-ChildItem C:\temp\sql2014.AdventureWorks2014.tables.json | Invoke-DbaDbDataMasking -SqlInstance sql2017 -Database AW -Table EmployeeDepartmentHistory
Note that if you don't specify -Database, it'll use the database name listed in the JSON file.
time for some gorgeous output
Here's the data masker in action
try it yourself
Restore AdventureWorks2014 to your test instance and simply run:
Invoke-DbaDbDataMasking -SqlInstance sql2017 -FilePath https://sqlps.io/maskconfig
You will be prompted to confirm changes to your database.
If you'd like to avoid the prompt, just add -Confirm:$false.
Awesome!
limitations
Here's a list of Static Data Masking limitations, taken from the Limitations section of Microsoft's site, that applies to us as well (though we added the dbatools-specific data type limitation ourselves).
- Static Data Masking does not support databases with temporal tables.
- Static Data Masking does not mask memory-optimized tables.
- Static Data Masking does not mask computed columns or identity columns.
- Static Data Masking does not support Azure SQL Hyperscale databases.
- Specifically in dbatools, Static Data Masking does not support the geometry, geography, and xml data types.
In addition, Static Data Masking presents three limitations in its masking abilities:
- Static Data Masking does not update histogram statistics. Consequently, the masked copy of the database may still contain sensitive data in the histogram statistics once Static Data Masking has been completed. Consider running UPDATE STATISTICS to remedy this issue.
- If Static Data Masking returns an error, all masking operations are suspended. The copy of the database is not deleted and may contain sensitive information. The user is responsible for deleting the copy of the database if Static Data Masking returns an error.
- (SQL Server only) The data file(s) and the log file may still contain bits of sensitive data in unallocated memory after Static Data Masking has completed. This sensitive data may be retrievable with a hex editor if given access to the data file(s) and the log file.
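For the histogram-statistics limitation above, one remedy is to refresh all statistics on the masked copy after masking completes. A hedged sketch using dbatools (the instance and database names match the earlier examples; depending on your dbatools version this command may be named Invoke-DbaSqlQuery instead):

```powershell
# Rebuild statistics on the masked copy so histograms no longer
# contain values sampled from the pre-mask data
Invoke-DbaQuery -SqlInstance sql2017 -Database AW -Query "EXEC sp_updatestats"
```

sp_updatestats updates statistics on every table in the database, which is usually good enough for a dev/test copy; you can run targeted UPDATE STATISTICS statements instead if you only masked a few tables.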
Cheers! - Chrissy