You’ve probably already heard about it, but it’s closer than you think: GDPR (General Data Protection Regulation), the new European regulation regarding privacy and data protection, comes into effect on May 25th, 2018. The regulation is about how the personal data of European Union citizens should be handled. But it’s just as important outside the EU, because if your website is targeting EU citizens you’ll have to comply with this regulation too.
This article is not about the details of this new regulation; there are plenty of blogs and articles that cover this (there are relevant links in the footer below). What I do want to do is discuss the reasoning behind this regulation and the possible impact for Umbraco. In other words, I will focus specifically on what could or should be changed in Umbraco to accommodate this. Keep in mind that securing privacy and data on a website is just a small part of this new regulation. It also serves to force you into casting a critical gaze upon your own internal processes, network architecture, HTTPS, testing environments, etcetera. At our company Perplex, I’ve been dealing with online security for quite a while and the past few months my attention has been directed towards GDPR. In this article I will discuss five questions regarding GDPR and Umbraco and attempt to visualize my ideas in a concept. The five questions are as follows:
- What data do you store?
- Where do you store the data?
- How do you store the data?
- Who has access to the data?
- How long do you store the data?
It is my hope that this article inspires you to participate in the discussion so we can design and implement a system that will serve our needs. This post became quite extensive while writing it. If it’s too long you can always take a shortcut directly to my mock-up where I’ve tried to visualize some of my thoughts: http://downloads.perplex.eu/umbraco/gdpr/members.html.
GDPR in a nutshell
But first, let’s start with the idea behind this new regulation. Are you the owner of a website? If yes, chances are your visitors are leaving (personal) data behind and are trusting you to treat this with the utmost care. After all, you yourself expect the same from sites where you leave your (personal) data, right? GDPR is essentially nothing more than a set of official guidelines on how to treat (personal) data with care and respect. It forces you to critically consider what data you store, where, for how long, and who can access it. In my opinion, it’s just a matter of thinking logically about the matter and I personally am a supporter of this regulation.
1. What data do you store?
Take a look at the data you store with your website and ask yourself how sensitive this data is. To get a better idea of how sensitive the data really is, you could try to classify it as follows:
Public data: This data is available on the public areas of your site or could be gained via other public sources. A good example is an overview of club members on your website or someone’s LinkedIn profile. It’s not really possible to ‘leak’ this data as it is available to every visitor or could easily be gained through other public sources.
Private data: This is data that is only available on a part of your site accessible only to those with specific access, yet isn’t overly sensitive. You will want to protect this data to the best of your abilities, but it’s not the end of the world if the data is stolen. Think of things like addresses, phone numbers, e-mail addresses or titles.
Sensitive data: This data is the data you most definitely don’t want to fall into the wrong hands. The regulation specifically mentions things such as health related data, biometrics, religion, union memberships, etc.
For each of these categories it is important to consider whether or not the data actually needs to be asked for and stored on your website. If it needs to be stored, you’ll need to consider for how long and how you want to store this data. Remember: you can’t lose what you do not have. Data that you never save can’t get lost.
Also keep in mind that the category in which your data falls depends on the situation. Membership of a dating site is a tad more private than entering your name at some website to win a free promotional product. If a list of given names and surnames were to leak, it would cause more of an uproar in the case of the dating site. In other words, the sensitivity of the data mostly depends on the context, except for the data that is specifically mentioned in GDPR. So if you’re asking on your event registration website whether someone has allergies and you store the answer, then you’re storing health related data and you should treat this as sensitive data and put the correct countermeasures in place.
In my opinion, it is up to Umbraco to provide the correct security measures for storing data as well as methods to change how long this data is stored. However, the responsibility of correctly classifying the data and implementing these measures lies with the owner of the website.
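To make the split between fixed rules and owner decisions a bit more tangible, here is a minimal sketch. It is written in Python purely for brevity (Umbraco itself is C#), and every name in it (`Sensitivity`, `SPECIAL_CATEGORIES`, `classify`, the tags) is hypothetical and not part of Umbraco or of the regulation's wording.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 1
    PRIVATE = 2
    SENSITIVE = 3


# Hypothetical tags for the kinds of data the regulation mentions explicitly.
SPECIAL_CATEGORIES = {"health", "biometrics", "religion", "union_membership"}


def classify(field_tags, owner_choice):
    """Combine the site owner's context-dependent choice with the fixed rule
    that explicitly mentioned kinds of data are always sensitive."""
    if SPECIAL_CATEGORIES & set(field_tags):
        return Sensitivity.SENSITIVE
    return owner_choice


# The same kind of field can land in different categories depending on context.
name_on_contest_site = classify({"name"}, Sensitivity.PRIVATE)
name_on_dating_site = classify({"name"}, Sensitivity.SENSITIVE)
# An allergy question on an event form is health data, whatever the owner picked.
allergies_on_event_form = classify({"health"}, Sensitivity.PRIVATE)
```

The point of the sketch is only that the owner's choice is an input, while the specially mentioned categories override it.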
2. Where do you store it?
For Umbraco there are only three options for storing data that make sense to me.
Umbraco Members: stored data of persons in a member profile in the members section
Umbraco Forms: filled in forms that are stored via Umbraco Forms
Custom solution: a custom solution where you store data in your own table(s) and/or on the file system
I think you shouldn’t store personal data in any other way in Umbraco (such as in the Content tree).
If we take a close look at each of these options, we can see where it’s stored in Umbraco.
- Umbraco Members: the member profile data is stored in a few database tables. The first table is cmsMember where the login name and email addresses are stored. All other data is stored in the table cmsPropertyData, alongside the data of the content tree.
By executing the following query, you’ll get a decent overview of all data that is being stored for your members in your own website.
-- One row per member property; the value is taken from whichever data column is filled.
SELECT member.email,
       type.Name,
       ISNULL(CAST(data.dataint as nvarchar(255)),
              ISNULL(CAST(data.datadate as nvarchar(255)),
                     ISNULL(data.dataNvarchar, data.datantext)))
FROM [cmsMember] member
LEFT JOIN cmsPropertyData data ON member.nodeId = data.contentNodeId
LEFT JOIN cmsPropertyType type ON data.propertytypeid = type.id
ORDER BY member.nodeId
The output would look as follows:
| email | Name | Value |
|---|---|---|
| email@example.com | Failed Password Attempts | 0 |
| firstname.lastname@example.org | Is Locked Out | 0 |
| email@example.com | Last Lockout Date | Nov 21 2016 5:36PM |
| firstname.lastname@example.org | Last Login Date | Nov 6 2017 12:10PM |
| email@example.com | Last Password Change Date | Jul 24 2017 10:55AM |
| firstname.lastname@example.org | Birth date | August, 31st, 1984 |
| email@example.com | My biggest secret | I love unicorns |
- Umbraco Forms: the filled in form entries are also stored in the database. They are stored in the UFRecords table and if you look into the column RecordData you’ll notice all saved data is stored here in JSON format. This data is also stored in UFRecordDatabit, UFRecordDataDateTime, UFRecordDataInteger, UFRecordDataLongString and UFRecordDataString. If there’s an upload field on your form, the uploaded file gets stored on disk in the folder /media/forms/upload/.
Retrieving this data can be done using one of these queries:
SELECT TOP (1000) [Id], [RecordData]
FROM [UFRecords]

The result is a list with the form data of each form in JSON format. The second option:

SELECT TOP (1000) fields.record,
       fields.alias,
       ISNULL(ISNULL(ISNULL(ISNULL(datastring.value, datalongstring.value), datainteger.value), databit.value), datadatetime.value)
FROM [UFRecordFields] fields
LEFT JOIN UFRecordDataString datastring ON fields.[key] = datastring.[key]
LEFT JOIN UFRecordDataLongString datalongstring ON fields.[key] = datalongstring.[key]
LEFT JOIN UFRecordDataInteger datainteger ON fields.[key] = datainteger.[key]
LEFT JOIN UFRecordDatabit databit ON fields.[key] = databit.[key]
LEFT JOIN UFRecorddatadatetime datadatetime ON fields.[key] = datadatetime.[key]
ORDER BY 1
This will result in a row of data for every form field.
- Custom solution: this data is stored in your own way and you should think about where and how you are storing it, whether you encrypt it or not, and for how long you store it. This, however, is up to you and outside the scope of this article.
3. How is the data stored?
Currently all data is stored unencrypted in the database. This means that everyone with access to this database, or the ability to execute some SQL-statement on your site (for example via SQL-injection), can read this data. SQL-injection isn’t something to worry about too much in the Umbraco Core itself; there are enough checks and precautions for that. However, external packages or especially your own code might still be vulnerable. Even in 2017, injection (with SQL-injection as its best-known form) is still number 1 in the OWASP Top 10 of web application security risks. If a malicious party gains access to the database and executes any of the earlier listed queries, they have full access to readable and usable data.
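For readers who have never seen the problem up close, the following self-contained sketch shows why your own code is the risky part. It uses Python's built-in `sqlite3` module and an invented `member` table, so it is an illustration of the general mechanism and not Umbraco code; the function names are hypothetical.

```python
import sqlite3

# In-memory database with an invented table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (email TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO member VALUES (?, ?)",
    [
        ("alice@example.com", "I love unicorns"),
        ("bob@example.com", "I love dragons"),
    ],
)


def find_unsafe(email):
    # Vulnerable: user input is pasted straight into the SQL text.
    query = "SELECT secret FROM member WHERE email = '" + email + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(email):
    # Parameterized: the input is passed as data and never parsed as SQL.
    query = "SELECT secret FROM member WHERE email = ?"
    return [row[0] for row in conn.execute(query, (email,))]


payload = "nobody@example.com' OR '1'='1"
leaked = find_unsafe(payload)  # the injected OR clause matches every row
safe = find_safe(payload)      # no member has this literal address
```

With the concatenated query the attacker reads every stored value in plain text, which is exactly the situation encryption is meant to soften.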
Another situation that could arise, is someone getting access to the files on disk. They could overwrite the files and use Razor and C# to read the database. Maybe you remember the Umbraco Forms patch release that fixed a possible exploit that would allow this (issue: https://umbraco.com/blog/security-advisory-update-umbraco-forms-immediately/, exploit: https://vimeo.com/205564261/02bfa2680d).
As a countermeasure against these attacks you could encrypt your data. I believe Umbraco should provide the option to allow you to determine which Forms-fields and Member-datatypes will be saved with encryption. This isn’t needed for every field, just those that contain data that falls in the category ‘Sensitive’, because encryption has its own downsides. Umbraco should provide the options and means, leaving the final decision with the owner of the website.
There are several different options to handle encryption. I will share my experience on this with you as I go over the pros and cons.
SQL Server Encryption
Several weeks ago, I saw some online suggestions on our.umbraco.org to use SQL Server Encryption. This method is available by default in Microsoft SQL Server and thus readily available. I do not think this is a good solution, or at the very least not a complete solution (I could dedicate an entire article to just this subject to substantiate this, so sorry if this seems a little blunt).
There are different ways to apply encryption on a database level:
Transparent Data Encryption (TDE): This method will encrypt the database files (.mdf and .ldf) preventing them from being read without the proper certificate. While a good and personally recommended security measure that has been in SQL Server for some time now, it does not prevent problems related to SQL-Injection.
Always Encrypted: Using this method will encrypt the data in the database in such a way that an administrator cannot read the data directly. Still, this method is far from foolproof as the SQL-user set up for the (Umbraco) website will have the correct rights to read the data. Also, columns of the type Text and Ntext cannot be encrypted in this manner. Remember the queries from earlier? Guess what type the columns we were reading with those queries are.
By using one of these methods, you are not safe from SQL-injection. Of course, they are good ways to secure your data, but this really should be one of many layers of security. So if you ask me, I think we need to go a step further.
Encrypting the data yourself
A logical alternative is changing the code and making sure the data is encrypted before it gets saved to the database. At Perplex we’ve done this before in our PerplexMail-package. With this, we prevent potentially sensitive e-mails from being leaked from the website via SQL-injection, a backup of our database that we didn’t protect well enough, or someone who got access to our database server. If you download the package and send a couple of e-mails, you'll notice the data in the database is stored as unreadable, encrypted values.
The sensitive, encrypted data is saved in the database making it not only unreadable for people that have access to the database, but also those that manage to use SQL-Injection and execute a simple SELECT query. The data is encrypted with a salt using Rijndael-AES with a 256 bit keysize. The encryption also takes place in the C# code, meaning access to the database is meaningless to an attacker as he can’t simply read the data contained in it. This encryption method has evolved over time and I would like to take a moment to explain that process.
Encryption and Fine Tuning
In the first version of our package, we didn’t add a unique salt to each record when we encrypted the data. This caused the data in the column ‘from’ and the start of the data in the column 'body' to look the same across records. This makes sense: after all, the e-mails were always sent from the same address and every e-mail started with the same chunk of HTML. When encrypting with Rijndael-AES without a salt, the same input always gives the same result, so those columns ended up with identical encrypted values in many rows.
Due to fields having the same encrypted value for several records, it seemed to me that it would be easier to retrieve the encryption key.
So, we added a salt for each record. This had the desired effect of making every record look completely different and I think it increased the safety of our data (this is an assumption and I’m not totally sure though. Yes, salting is essential when hashing your passwords, but is it also increasing security when encrypting?).
It did, however, come with its own problem. We had developed a search function in our package, so we could search for specific mails within the UI of Umbraco. But to get results from this function we needed to decrypt every record first and then try to match the search query. With 50 or so e-mails this wasn’t really an issue. But once we reached numbers of about 5,000 e-mails we started to notice significant performance issues. Our conclusion was that we had to stop encrypting fields that had to be searchable when using a salt. In our latest version, we decided to drop encryption completely on the mail addresses column, in order to make these searchable again.
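As an aside, one technique that is sometimes described for this trade-off is a so-called blind index: next to the encrypted value you store a keyed hash (HMAC) of the plain value, and you search on that hash. This is not what PerplexMail does; it is only a sketch of the idea, in Python for brevity, with a hypothetical key and record layout, and with the encrypted column itself left out.

```python
import hashlib
import hmac

# Assumption: a secret key kept outside the database, separate from the
# encryption key. This literal value is for illustration only.
INDEX_KEY = b"example-only-key-do-not-use"


def blind_index(email):
    """Keyed hash of the normalized address; equal addresses give equal indexes."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(INDEX_KEY, normalized, hashlib.sha256).hexdigest()


# Each record keeps the index next to the (here omitted) encrypted address.
records = [
    {"id": 1, "to_index": blind_index("alice@example.com")},
    {"id": 2, "to_index": blind_index("bob@example.com")},
    {"id": 3, "to_index": blind_index("Alice@Example.com")},
]


def find_by_email(email):
    # Hash the search term once and compare hashes; nothing gets decrypted.
    wanted = blind_index(email)
    return [r["id"] for r in records if hmac.compare_digest(r["to_index"], wanted)]


matches = find_by_email("alice@example.com")
```

The catch is that this only supports exact matches, and that equal values become recognizable as equal again, which is the same property that worried me about encrypting without a salt. Whether that is acceptable depends on the field.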
Another ‘downside’ of this method is that the output is always going to be a byte stream or text string. This makes bit (true/false), datetime and integer fields on a database level useless when you want to encrypt them. After all, you can’t save them in their original columns, as you will have to save them in an nvarchar or ntext column. Besides, I think encrypting a bit is not really helping as there are only two possible values, making it relatively trivial to crack the encryption once you have the data (but this is the second assumption that I make and I hope some encryption guru can shed some light on these two assumptions).
With this short primer about encryption out of the way, we can jump back to the original topic, namely the ability to encrypt specific data to make it unreadable with SQL-injection. For each datatype on your member type or form you will have to decide if this data is considered sensitive. This is not a decision to be made lightly as there are downsides to encrypting the data such as losing the ability to (easily) search, filter or compare these fields and it will have some impact on performance.
I think Umbraco (or a package) should provide some new datatypes that encrypt the data before writing it to the database. Examples of these datatypes would include text strings, text areas, rich text editors and labels. This way, you can store sensitive data about your users as securely as possible. By implementing encryption via additional datatypes you do not have to confuse users with extra checkboxes in your doctype-editors and form-editors, but you can use new datatypes that handle this for you and the impact for Umbraco is minimal. It’s a pretty easy add-on that can even work on older versions of Umbraco websites.
4. Who has access to and has accessed the data
Well then, let's go take a peek at the Umbraco back office. The data is now stored securely, but in the back office the data is still visible in both the Members- and Forms-sections. But which users have access to these sections? I doubt people think about this a lot and suspect too many users currently have access to the Member section and thus the data of your members. So, that raises the question of whether or not this is desired and even necessary.
In the most ideal situation you decide which users can access Member configuration (the Membertype-folder in Umbraco) and who can access the actual Member data. After all, most people won't need access to both. In my eyes, a developer needs access to the Membertype section as they will need this during development. However, I also feel like this area should be moved over to the Settings tree where it fits nicely next to the Document- and Mediatype configuration sections. This way, a programmer won't have access to the member section, and potentially personal and sensitive data, by default.
This way, only those users that actually need to handle this data (who are these people exactly? Did you ever consider this?) can access it. My suggestion would be to not show all data straight up either. By default, the user can only see data categorized as Public or Private. This data is already saved unencrypted, so no harm there. The data that falls under the category ‘Sensitive’ should be saved with encryption (the new datatype) and should only become visible once the user is a member of a specific Umbraco usergroup (hooray, for usergroups!), for example 'Sensitive data viewers'. This should be a configuration setting on the new datatype; which usergroups could see the unencrypted content of the record in the Umbraco back office.
If you’re not part of the correct Umbraco user group, a member record in the back office could look something like this:
By doing this, you’ll have more control over who has access to sensitive data of members.
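The visibility rule described above could be sketched roughly like this. The sketch is in Python for brevity and is not Umbraco code: the function `profile_for_user`, the mask string and the way groups are passed in are all assumptions for illustration. Only the group name 'Sensitive data viewers' comes from the proposal itself.

```python
SENSITIVE_VIEWERS_GROUP = "Sensitive data viewers"
MASK = "********"


def profile_for_user(profile, sensitive_fields, user_groups):
    """Return the member profile as a given back office user may see it."""
    may_view_sensitive = SENSITIVE_VIEWERS_GROUP in user_groups
    return {
        name: (value if may_view_sensitive or name not in sensitive_fields else MASK)
        for name, value in profile.items()
    }


member = {
    "Email": "email@example.com",
    "Birth date": "1984-08-31",
    "My biggest secret": "I love unicorns",
}
sensitive = {"My biggest secret"}

# A regular editor sees public and private fields, but a mask for sensitive ones.
editor_view = profile_for_user(member, sensitive, {"Editors"})
# A user in the dedicated group sees everything.
privileged_view = profile_for_user(
    member, sensitive, {"Editors", SENSITIVE_VIEWERS_GROUP}
)
```

In a real implementation the decryption itself would ideally be skipped for users outside the group, so the plain value never leaves the server.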
Finally, it would be great to know who accessed any member data and maybe more specifically sensitive member data. Storing it securely is all fine, but what we really want is to keep track of who has or had access to this data and when they actually viewed it (in other words, who opened a specific Member page). We could log this and show it in the member section.
This could simply be an extra datatype that people should store on a member-type, but I think this should be implemented by Umbraco HQ in the Core of the project.
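A minimal sketch of such a view log, again in Python for brevity and with hypothetical names (`access_log`, `log_member_view`, `views_of_member`); a real version would write to a database table that back office users cannot edit.

```python
from datetime import datetime, timezone

# In-memory stand-in for a log table.
access_log = []


def log_member_view(user, member_id, saw_sensitive, when=None):
    """Record that a back office user opened a member page."""
    entry = {
        "user": user,
        "member_id": member_id,
        "saw_sensitive": saw_sensitive,
        "when": when or datetime.now(timezone.utc),
    }
    access_log.append(entry)
    return entry


def views_of_member(member_id):
    """Everything that could be shown on the member's own page."""
    return [e for e in access_log if e["member_id"] == member_id]


log_member_view("editor@example.com", 1001, saw_sensitive=False)
log_member_view("admin@example.com", 1001, saw_sensitive=True)
log_member_view("admin@example.com", 1002, saw_sensitive=False)
history = views_of_member(1001)
```

Recording whether sensitive fields were shown makes it possible to answer the narrower question of who actually saw the sensitive data.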
The same goes for the Forms section in Umbraco. The situation differs a tad from the Members-section because you want clients to be able to create their own forms. As such, moving the creation of a form to the Settings tree (like the suggestion I made for the Member configuration) would not be very practical. It would however, be a good idea to make a separate group for Forms as well. With access to just the Forms section, you could create, delete or edit your forms. But only when you are part of the 'Form entry viewers' group can you actually see the entries of these forms. Also, we should log who viewed the form entries and when so we have a track record of those who accessed the sensitive data of people.
5. For how long do we store the data
This leaves us with the two last topics that are somewhat related. Just how long do you need to store data? How much use is the data of a member that hasn't logged in to your site for over a year? On one hand there's the question of whether the data is still relevant and up to date, on the other you should probably wonder why you're even saving this data for all that time.
Wouldn't it be smarter to send these 'sleeping' members for example an e-mail asking them if they wish to remain members or if they can be removed? If there's been no reaction within a month, we can delete the data automatically. A dashboard in the back office could show an overview of members that are close to deletion, allowing you to take action based on this. Wouldn't you want other sites to handle your (personal) data in this same manner?
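The flow for 'sleeping' members could be sketched as a small decision function. This is Python for brevity, not Umbraco code; the function name and the exact periods (365 days of inactivity, 30 days to respond) are assumptions that a site owner would configure.

```python
from datetime import datetime, timedelta

INACTIVE_AFTER = timedelta(days=365)  # assumed: "hasn't logged in for over a year"
GRACE_PERIOD = timedelta(days=30)     # assumed: "no reaction within a month"


def retention_step(last_login, reminder_sent, now):
    """Decide what to do with a member: keep, send_reminder, wait or delete."""
    if now - last_login < INACTIVE_AFTER:
        return "keep"
    # No reminder yet, or only an old one from before the last login.
    if reminder_sent is None or reminder_sent <= last_login:
        return "send_reminder"
    if now - reminder_sent >= GRACE_PERIOD:
        return "delete"
    return "wait"


now = datetime(2018, 5, 25)
active = retention_step(datetime(2018, 1, 10), None, now)
sleeping = retention_step(datetime(2017, 3, 1), None, now)
reminded = retention_step(datetime(2017, 3, 1), datetime(2018, 5, 10), now)
expired = retention_step(datetime(2017, 3, 1), datetime(2018, 4, 1), now)
```

Members for whom the function returns "wait" are the ones the dashboard would list as close to deletion.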
The same, of course, goes for the forms. What use is a form entry from over a year ago? Might it not be a better idea to just tell each form how long it can keep entries stored (with a globally set default)? Sure, there will probably be some forms where you do need to store the entries for a longer period of time. But you can just set these specific forms to store their entries for a longer period. Aside from that, I would like the option of not storing the entries in the database at all. There are plenty of forms where you need the entered data just to send a single e-mail, so why store this in the database at all? In our Perplex Forms On Steroids package we implemented just this option by way of a workflow called 'delete on submit'. Combined with our mail package, this achieves what I am looking for, but I feel like this should really be a part of the Umbraco Forms Core.
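A per-form retention setting with a global default could look roughly like this. It is a Python sketch with invented names and values (`DEFAULT_RETENTION_DAYS`, `entries_to_delete`), not a description of how Umbraco Forms or our packages work.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_DAYS = 365  # assumed global default


def entries_to_delete(entries, now, retention_days=None):
    """Return ids of entries older than the form's retention period.

    retention_days=None falls back to the global default; 0 means that
    nothing is kept once the submit has been handled.
    """
    days = DEFAULT_RETENTION_DAYS if retention_days is None else retention_days
    cutoff = now - timedelta(days=days)
    return [e["id"] for e in entries if e["created"] <= cutoff]


now = datetime(2018, 5, 25)
entries = [
    {"id": 1, "created": datetime(2017, 1, 15)},
    {"id": 2, "created": datetime(2018, 5, 1)},
]

with_default = entries_to_delete(entries, now)
long_lived_form = entries_to_delete(entries, now, retention_days=730)
mail_only_form = entries_to_delete(entries, now, retention_days=0)
```

A scheduled task could run this per form and remove the returned entries, including any uploaded files that belong to them.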
Do keep in mind that this has no influence on the life cycle of the e-mail itself, which is equally important where GDPR is concerned. You are not preventing the e-mail from being saved on your mail server, kept in someone’s mailbox, or printed, each of which would still make the data accessible to anyone who sees the e-mail. But, at the very least, you've made sure your website doesn't carry this data anymore.
How can we be forgotten
Finally, there's the 'right to be forgotten' part within the GDPR. If a visitor submits this request, it would be nice if we can easily gather all data of this user. But how do we do this? Based on an e-mail address? What about forms the member might have submitted? My suggestion is the creation of a datatype that can easily link form entries to members (maybe even automatically if they are logged in while submitting the form). This way, we can easily keep track of which form entries belong to a certain member and knowing that, it should be relatively simple to just delete it all with the push of a button.
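Once form entries carry a link to the member, the 'push of a button' becomes a small operation. The sketch below is Python for brevity, with in-memory stand-ins for the member and form tables; the field `member_id` and the function `forget_member` are hypothetical.

```python
# In-memory stand-ins for the member and form entry storage.
members = {
    1001: {"email": "email@example.com"},
    1002: {"email": "other@example.com"},
}
form_entries = [
    {"id": 1, "member_id": 1001, "form": "Contact"},
    {"id": 2, "member_id": 1002, "form": "Contact"},
    {"id": 3, "member_id": 1001, "form": "Event registration"},
    {"id": 4, "member_id": None, "form": "Contact"},  # submitted while not logged in
]


def forget_member(member_id):
    """Delete the member and every form entry linked to that member."""
    linked = [e for e in form_entries if e["member_id"] == member_id]
    form_entries[:] = [e for e in form_entries if e["member_id"] != member_id]
    member_removed = members.pop(member_id, None) is not None
    return {"member_removed": member_removed, "entries_removed": len(linked)}


result = forget_member(1001)
```

Entries submitted while not logged in (like entry 4) are not covered by the link, so for those you would still need a fallback such as matching on e-mail address.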
What Do You Think?
In this article I’ve tried to structure all my thoughts about GDPR and Umbraco into something concrete. I tried to discuss all topics that are directly related to Umbraco and proposed some improvements that could be pretty easily implemented and give developers a way to implement the GDPR regulations into their Umbraco website. With this mock-up I’ve tried to visualize it a bit so you could get a feeling how I think this could work.
I’ve been researching security and GDPR for a while now and I hope this will be a good starting point to continue the discussion of what needs to be done. I would love to have feedback on my proposal: did I interpret some regulations incorrectly, am I missing parts, or do you totally agree? Please let me know. I’ve put a lot of time and thought into this and I hope you can help me make it even better!
Some GDPR background articles:
Useful and inspiring packages related to this article:
- Perplex Dashboard: A package that gives you insight into all authentication-related actions for Umbraco Users. It’s a first step in the direction of a full overview of the actions of Umbraco Users in the back office.
- uMunge: A useful package that masks personal information when you copy a live database to your local environment.
- Umbraco 2FA: A package that allows you to do two-factor-authentication. A crucial layer in your security!
- PerplexMail: A package we released almost two years ago that encrypts its data out of the box. It can be used to gain insight into your e-mails and to give Umbraco Users the ability to edit e-mails, e-mail templates and e-mail addresses.
Related forum discussions:
Some related issues
(I will create new issues for all mentioned issues in this article, when people do agree that this is a good direction):