automated data masking
Static Data Masking is such an important aspect of an organization's data security posture that I hoped it could somehow be included in dbatools 1.0. I'm excited to report that, thanks to Sander Stad, data masking is now available as of 0.9.725 and works against all versions of SQL Server!
why this is awesome
Static Data Masking is a data protection feature that helps users sanitize sensitive data in a copy of their SQL Server databases. Support for Static Data Masking is now in preview in SQL Server Management Studio, and Red Gate has a product that helps as well.
As Microsoft outlines, Static Data Masking can be used for:
- Development and testing
- Analytics and business reporting
- Troubleshooting
- Sharing the database with a consultant, a research team, or any third-party
Static Data Masking facilitates compliance with security requirements such as the separation between production and dev/test environments. For organizations subject to GDPR, the feature is a convenient tool to remove all personal information while preserving the structure of the database for further processing.
Another really good reason for masking is that attackers often skip production and go straight to dev/test because it may not be as well protected or monitored. Data masking helps keep our data secure by replacing sensitive values with randomly generated fake ones.
For example, the employees of an organization may be assigned an eight-digit employee ID number written like so: 8765-4321. If that data were to be masked, the eight-digit format would remain the same but the values within it would be changed: 3561-2847. Some of the more common methods of data masking are character substitution, numeric variance, character shuffling and format preserving encryption.
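To make that concrete, here's a minimal PowerShell sketch of format-preserving substitution. This is purely illustrative and is not how dbatools generates values (dbatools uses the Bogus library covered later in this post): each digit is swapped for a random digit while the dash and overall shape stay intact.

```powershell
# Replace every digit with a random digit, keeping non-digit characters
# (like the dash in an employee ID) exactly where they are.
function Get-MaskedId {
    param([string]$Value)
    $chars = $Value.ToCharArray() | ForEach-Object {
        if ($_ -match '\d') { Get-Random -Minimum 0 -Maximum 10 } else { $_ }
    }
    -join $chars
}

Get-MaskedId -Value '8765-4321'   # e.g. 3561-2847
```

The masked value keeps the eight-digit, dash-separated format, so code that validates the format still works against the sanitized copy.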
Note that this is different from Dynamic Data Masking, which masks data only as it is queried. Static Data Masking masks the data at rest, in the database itself.
but how?
Masking is challenging because it requires a bunch of dictionaries of fake but realistic-looking data: first names, last names, addresses, email addresses, and more. I didn't know this before Sander told me, but the open source community actually has projects that revolve around data masking! Most important to us is the C# project Bogus, which is based on the JavaScript project faker.js.
masking in dbatools
As Sander highlights in his post, Data masking with dbatools, dbatools now offers two commands to help you mask your data. These commands are based on Sander's commands originally found in PSDatabaseClone.
New-DbaDbMaskingConfig
The first command creates an easy-to-edit JSON file which tells the mask invoker how you'd like your data masked. Here's how I tested the command against AdventureWorks2014.
generate the masking configuration
New-DbaDbMaskingConfig -SqlInstance sql2014 -Database AdventureWorks2014 -Path C:\temp
The command returns the newly created file as Get-ChildItem output. You can see the contents of the config file here.
edit the file and remove unnecessary columns
While the auto config generator creates an entry for every column in the database, it's unlikely that all columns will need to be masked.
With AdventureWorks2014, I took out things like error numbers but left in data like error messages because error messages may contain private information.
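You can edit the file by hand in any editor, but pruning can also be scripted. Here's a hedged sketch using PowerShell's built-in JSON cmdlets; it assumes the top-level Tables collection seen in the generated file, and the excluded column names are just illustrative.

```powershell
# Sketch: remove columns you don't want masked from the generated config.
# The path matches the earlier example; the column names are illustrative only.
$path   = 'C:\temp\sql2014.AdventureWorks2014.tables.json'
$config = Get-Content $path -Raw | ConvertFrom-Json

foreach ($table in $config.Tables) {
    # Keep everything except the columns listed here
    $table.Columns = @($table.Columns | Where-Object Name -notin 'ErrorNumber', 'ErrorSeverity')
}

# -Depth is needed so nested Tables/Columns aren't truncated on the way back out
$config | ConvertTo-Json -Depth 5 | Set-Content $path
```

Either way, the result is the same: a trimmed config that only masks what actually needs masking.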
working with constraints
As it turns out, some columns have constraints that can't be easily determined, so they must be set manually. It was pretty fun figuring out and adding support for constraints. I'll use Employee and EmployeeDepartmentHistory as examples.
First, I ran the invoker command and got a ton of errors because the generated data was out of bounds. So I opened up SSMS and took a look at the constraints.
I then updated all of the constrained columns to match. In the example below, only M and S are allowed in the MaritalStatus column.
{
    "Name": "MaritalStatus",
    "ColumnType": "nchar",
    "CharacterString": "SM", // right here
    "MinValue": null,
    "MaxValue": 1,
    "MaskingType": "Random",
    "SubType": "String"
}
The HumanResources.Employee table was particularly constrained, and I had to update a bunch of the values as seen below.
{
    "Name": "Employee",
    "Schema": "HumanResources",
    "Columns": [
        { "Name": "NationalIDNumber", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 15, "MaskingType": "Random", "SubType": "String" },
        { "Name": "LoginID", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 256, "MaskingType": "Person", "SubType": "UserName" },
        { "Name": "JobTitle", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": null, "MaxValue": 50, "MaskingType": "Name", "SubType": "JobType" },
        { "Name": "BirthDate", "ColumnType": "date", "CharacterString": null, "MinValue": "1/1/1970", "MaxValue": "1/1/1990", "MaskingType": "Random", "SubType": "Date" },
        { "Name": "MaritalStatus", "ColumnType": "nchar", "CharacterString": "SM", "MinValue": null, "MaxValue": 1, "MaskingType": "Random", "SubType": "String" },
        { "Name": "Gender", "ColumnType": "nchar", "CharacterString": "MF", "MinValue": 1, "MaxValue": 1, "MaskingType": "Random", "SubType": "String" },
        { "Name": "VacationHours", "ColumnType": "smallint", "CharacterString": null, "MinValue": -40, "MaxValue": 200, "MaskingType": "Random", "SubType": "Number" },
        { "Name": "SickLeaveHours", "ColumnType": "smallint", "CharacterString": null, "MinValue": 0, "MaxValue": 120, "MaskingType": "Random", "SubType": "Number" }
    ]
}
HumanResources.EmployeeDepartmentHistory was also interesting to work with. I had to ensure the StartDate was before the EndDate, which I did by setting their Min and Max values.
{
    "Name": "EmployeeDepartmentHistory",
    "Schema": "HumanResources",
    "Columns": [
        {
            "Name": "StartDate",
            "ColumnType": "date",
            "CharacterString": null,
            "MinValue": null,
            "MaxValue": "1/1/2018", // can't start later than Jan 1
            "MaskingType": "Random",
            "SubType": "Date"
        },
        {
            "Name": "EndDate",
            "ColumnType": "date",
            "CharacterString": null,
            "MinValue": "1/2/2018", // so it'll always be after
            "MaxValue": null,
            "MaskingType": "Random",
            "SubType": "Date"
        }
    ]
}
Here's a sample of the final file.
figuring out data types
You can see above that we used masking types like "Random" and "Date". We figured these out by performing good ol' Get-Member against the Bogus class.
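If you'd like to explore the available generators yourself, you can do the same. A rough sketch, assuming the Bogus assembly is already loaded into your session (it ships with dbatools, so importing the module takes care of that):

```powershell
# Create a Bogus faker and list its data sets (Name, Person, Date, Address, etc.)
$faker = New-Object Bogus.Faker
$faker | Get-Member -MemberType Property

# Drill into one data set to see the individual generator methods it exposes
$faker.Name | Get-Member -MemberType Method
```

Each property on the faker roughly corresponds to a MaskingType in the config file, and the methods beneath it to the SubTypes.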
Though it's a bit rough right now, we also generated a website to help guide you when picking more specific data types.
Invoke-DbaDbDataMasking
Now that we've got our config file ready, it's time to execute!
Invoke-DbaDbDataMasking -SqlInstance sql2017 -Database AW -FilePath C:\temp\sql2014.AdventureWorks2014.tables.json
We also wanted a single config file to be flexible, so we added support for processing only specific tables and columns. Let's say you have two dev environments: one needs to update Employee and the other EmployeeDepartmentHistory. You can limit which tables and columns are masked.
Get-ChildItem C:\temp\sql2014.AdventureWorks2014.tables.json | Invoke-DbaDbDataMasking -SqlInstance sql2017 -Database AW -Table EmployeeDepartmentHistory
Note that if you don't specify -Database, it'll use the database name listed in the JSON file.
time for some gorgeous output
Here's the data masker in action
try it yourself
Restore AdventureWorks2014 to your test instance and simply run:
Invoke-DbaDbDataMasking -SqlInstance sql2017 -FilePath https://sqlps.io/maskconfig
You will be prompted to confirm changes to your database.
If you'd like to avoid the prompt, just add -Confirm:$false.
Awesome!
limitations
Here's a list of Static Data Masking limitations, taken from the Limitations section of Microsoft's site, that applies to us as well (though we added the dbatools-specific data type limitation ourselves).
- Static Data Masking does not support databases with temporal tables.
- Static Data Masking does not mask memory-optimized tables.
- Static Data Masking does not mask computed columns or identity columns.
- Static Data Masking does not support Azure SQL Hyperscale databases.
- Specifically in dbatools, Static Data Masking does not support the geometry, geography, and xml data types.
In addition, Static Data Masking presents three limitations in its masking abilities:
- Static Data Masking does not update histogram statistics. Consequently, the masked copy of the database may still contain sensitive data in the histogram statistics once Static Data Masking has been completed. Consider running UPDATE STATISTICS to remedy this issue.
- If Static Data Masking returns an error, all masking operations are suspended. The copy of the database is not deleted and may contain sensitive information. The user is responsible for deleting the copy of the database if Static Data Masking returns an error.
- (SQL Server only) The data file(s) and the log file may still contain bits of sensitive data in unallocated memory after Static Data Masking has completed. This sensitive data may be retrievable with a hex editor if given access to the data file(s) and the log file.
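For the histogram-statistics limitation above, one remedy is to refresh all statistics on the masked copy after masking completes. A hedged sketch using dbatools (the instance and database names match the earlier examples; depending on your dbatools version this command may be named Invoke-DbaSqlQuery instead):

```powershell
# Rebuild statistics on the masked copy so histograms no longer
# contain values sampled from the pre-mask data
Invoke-DbaQuery -SqlInstance sql2017 -Database AW -Query "EXEC sp_updatestats"
```

sp_updatestats updates statistics on every table in the database, which is usually good enough for a dev/test copy; you can run targeted UPDATE STATISTICS statements instead if you only masked a few tables.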
Cheers! - Chrissy