Geo data – support for researchers

Personal vs Anonymous Data

What is Personal Data?

What makes data personal depends on both the nature of the data and its context. For example, a data entry like “12 December 1980”, devoid of any context, is not personal data – it is just a date. However, if that date refers to someone’s birthday (for example, in an employee dossier), it becomes personal data. Data by itself (numbers, text, pictures, audio, etc.) is not inherently personal. It becomes personal when it refers or relates to an individual, directly or indirectly.

By itself, a name like John Smith may not always be personal data because there are many individuals with that name – unless the data context (such as an address, a place of work, or a telephone number) narrows it down to a single individual.

But just because the name of an individual is not known, that does not mean they can’t be identified. Many of us do not know the names of all our neighbours, but we are still able to identify them when we see them.

Formally, the GDPR (Art. 4(1)) defines personal data as any information relating to an identified or identifiable natural person (‘data subject’). An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

A special category of personal data is defined in the GDPR as information that reveals racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, genetic and biometric data when it is used for the purpose of uniquely identifying a natural person, and data concerning health or data concerning a natural person’s sex life or sexual orientation. Keep in mind that processing special categories of personal data is in principle prohibited, unless one or more of the exceptions listed in Art 9(2) applies.

Processing in the GDPR (Art. 4(2)) is defined as any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction. In other words, it covers any use of personal data.

An email address like info@uu.nl is not considered personal data, because it does not refer or relate to an individual. A personal email address like john.doe@uu.nl is considered personal data because it refers to an identified person – John Doe. Likewise, an IP address like 8.8.8.8 is not personal data, as it relates to Google Public DNS. But the IP address of a mobile phone is considered personal data because it relates to an identifiable person: the user of that mobile phone. Even if the name of the mobile user is not known, that IP address is enough to uniquely identify them as a website visitor and can be used to keep track of what other sites they have visited. Using a research example, power production data from a solar panel may be considered non-personal data, but if the data includes the panel’s geographical location, it may be considered personal data if the location reveals someone’s house address.

Therefore, any information that relates to – or could identify – an individual through their tools, applications, or devices, like their computer or smartphone, is also considered personal data.

Of note, the GDPR does not apply to personal data of deceased individuals (Recital 27). It also does not apply to data that “does not relate to an identified or identifiable natural person, or to personal data where the data subject (the individual behind the personal data) is not or no longer identifiable” (Recital 26).

In other words, when data does not refer to people individually (for example, it refers to individuals as a group), it is excluded from the GDPR. Data referring or related to a group of individuals is generally considered anonymous data.

Keep in mind that processing anonymous data may still raise ethical concerns. Special caution is required in handling anonymised information, especially when such information says something about a group of individuals (and is thus considered anonymous) but is later used to take decisions that produce effects, albeit indirectly, on individuals belonging to that group. Even if such processing is not in the scope of the GDPR, it may still fall under the scope of other articles of the EU Charter of Fundamental Rights.

Personal data can become anonymous once it no longer falls within the definition of ‘personal data’ according to the GDPR – as previously explained.  Data is no longer considered personal when it stops relating or referring to individual people, or when it only refers or relates to a group of people. Therefore, the process of rendering personal data anonymous consists of making sufficient modifications to the data and its context with the goal of eliminating its identifiability.

In its basic meaning, identifiability is about whether someone is identified or identifiable. Essentially, if you can distinguish an individual from other individuals, then they are identifiable. Identifiability depends both on the identifiers present in the data and on the context of the data.

Reducing identifiability to sufficiently low levels can be challenging given the broad definition of personal data. The Article 29 Working Party (the predecessor of the EDPB), in its Opinion 05/2014 on Anonymisation Techniques, puts forward three key indicators for determining whether information is personal data or not:

  • singling out,
  • linkability, and
  • inferences.

Singling out means that you can tell one individual from another individual in a dataset. For example, if you can isolate some or all records about an individual in the data you process, then that individual is singled out.

Linkability is the concept of combining multiple records about the same individual or group of individuals together. These records may be in a single system or across different systems (e.g., within one database, or in two or more different databases). Individual data sources may seem non-identifying in isolation but could lead to the identification of an individual if combined.

An inference refers to the potential to infer, guess or predict details about someone. In other words, using information from various sources to deduce something about an individual (e.g., based on the qualities of others who appear similar). To determine the likelihood of identifiability through inference, you need to consider the possibility of deducing the identity of individuals from incomplete datasets, e.g., where some of the identifying information has been removed or generalised; from pieces of information in the same dataset that are not obviously or directly linked; or from other information that you either possess or may reasonably be expected to obtain – for example, publicly available information like census data or voter registration records.
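To make these three indicators more concrete, here is a minimal, hypothetical sketch in Python (using pandas). All names, postcodes and attributes are invented for illustration, and a real assessment would of course depend on the actual data and its context:

```python
import pandas as pd

# Fabricated survey data with direct identifiers (names, emails) already removed.
survey = pd.DataFrame({
    "postcode":   ["3511", "3511", "3584"],
    "birth_year": [1980, 1992, 1980],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["asthma", "asthma", "diabetes"],
})

# A separate, seemingly harmless membership list that still contains names.
members = pd.DataFrame({
    "name":       ["A. Jansen", "B. de Vries"],
    "postcode":   ["3511", "3584"],
    "birth_year": [1980, 1980],
    "sex":        ["F", "F"],
})

# 1. Singling out: the combination postcode + birth year + sex isolates exactly
#    one survey record, even though no name is present.
singled_out = survey[(survey["postcode"] == "3584")
                     & (survey["birth_year"] == 1980)
                     & (survey["sex"] == "F")]
print(len(singled_out))        # 1 -> that individual is singled out

# 2. Linkability: joining the two sources on the shared quasi-identifiers
#    attaches a name to the "de-identified" survey records.
linked = survey.merge(members, on=["postcode", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])

# 3. Inference: if everyone in a small group shares an attribute, that attribute
#    can be deduced for any individual known to belong to the group.
group = survey[survey["postcode"] == "3511"]
if group["diagnosis"].nunique() == 1:
    print("Everyone in postcode 3511 has:", group["diagnosis"].iloc[0])
```

If any of these three succeeds, the data should still be treated as personal data.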

In other words, data can only be considered properly anonymous when it does not relate to an identified or identifiable individual, or when it has been modified in such a way that individuals are not (or are no longer) identifiable.

Broadly speaking, there are two different approaches to anonymising data: randomisation and attribute generalisation (a short illustrative sketch of both follows the list below).

  • Randomisation is a family of techniques – including noise addition, permutation, and differential privacy – that alters the veracity of the data in order to remove the strong link between the data and the individual. If the data is sufficiently uncertain, it can no longer be referred or linked to a specific individual.
  • Generalisation is a family of techniques – including aggregation, k-anonymity, l-diversity and t-closeness – that generalises, or dilutes, the attributes of data subjects by modifying their scale or order of magnitude (e.g., a region rather than a city, a month rather than a week).
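As a rough illustration only (not a recipe for compliant anonymisation), the hypothetical Python sketch below shows the idea behind both families: noise addition as a simple form of randomisation, and age bands and regions as a simple form of generalisation. All values and mappings are invented; formal guarantees would require properly calibrated techniques such as differential privacy or a k-anonymity assessment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Fabricated example records (all values invented for illustration).
df = pd.DataFrame({
    "city": ["Utrecht", "Amersfoort", "Rotterdam"],
    "age":  [34, 57, 41],
    "weekly_energy_kwh": [52.3, 48.1, 60.7],
})

# Randomisation: add noise to the measured values so that a published value can
# no longer be tied with certainty to one individual's true value.
df["weekly_energy_kwh_noisy"] = df["weekly_energy_kwh"] + rng.normal(0, 5, len(df))

# Generalisation: replace precise attributes with coarser ones, e.g. an age band
# instead of an exact age, a province rather than a city.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120],
                        labels=["<30", "30-49", "50-69", "70+"])
to_province = {"Utrecht": "Utrecht", "Amersfoort": "Utrecht", "Rotterdam": "Zuid-Holland"}
df["province"] = df["city"].map(to_province)

# The released version keeps only the randomised / generalised columns.
released = df[["province", "age_band", "weekly_energy_kwh_noisy"]]
print(released)
```

Whether such a released table can actually be considered anonymous still depends on context – group sizes, and what other data is available for linkage – which is exactly what the three indicators above are meant to test.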

The Opinion 05/2014 on Anonymisation Techniques explains in more detail how to apply these approaches.

It should be clear by now that data cannot be considered anonymous merely because it is not possible to find out the real identity (name, address, etc.) of the individual. An identifiable individual is someone whose identity is not fully established but who can still be distinguished from other individuals.

Keep in mind that there is a trade-off between reduced identifiability and data utility, because data anonymisation practices rely on deleting, reducing or modifying data details. Anonymisation is a process to produce data that is considered safe to share openly and publicly, but it only makes sense if what is being produced is safe, useful data. Anonymisation should therefore not be understood independently of the intended use(s) of the anonymised data.

Broadly speaking, based on its identifiability, personal data can be divided into two types: identified and deidentified.

Identified personal data is data that can be readily attributed to a specific individual, where the controller is able to identify the data subject. For example, a student record identified by name and/or student ID number; or a research panel where the participant ID is linked to the name of the research participant.

Deidentified personal data is data that cannot be readily attributed to an individual but cannot yet be considered anonymous data. With this type of data, the controller is not able to identify the data subject, unless the data subject provides additional information enabling his or her identification. For example, a student record where the names, email and other direct identifiers have been removed, and only indirectly identifying information (like grades or other academic information) remains.
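One common way to produce such deidentified data is to strip the direct identifiers, replace them with random participant codes, and keep the key file separately under stricter access controls (or hand it over to the participants themselves). The hypothetical Python sketch below illustrates this; all column names and values are invented:

```python
import secrets
import pandas as pd

# Fabricated identified records about research participants.
identified = pd.DataFrame({
    "name":      ["A. Jansen", "B. de Vries"],
    "email":     ["a.jansen@example.org", "b.devries@example.org"],
    "grade":     [7.5, 8.2],
    "programme": ["Geosciences", "Geosciences"],
})

# Replace the direct identifiers with random participant codes.
identified["participant_id"] = [secrets.token_hex(4) for _ in range(len(identified))]

# Key file linking codes to identities: stored separately, with restricted access.
key_file = identified[["participant_id", "name", "email"]]
key_file.to_csv("key_file.csv", index=False)

# Working research copy: only the code and the indirectly identifying variables.
deidentified = identified[["participant_id", "grade", "programme"]]
deidentified.to_csv("research_data.csv", index=False)
```

Whether the research copy counts as deidentified in the sense described above (rather than merely pseudonymised) depends on who holds the key file and on how identifying the remaining variables are.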

Working with deidentified data has several advantages:

  • Compliance with Art. 89. Several GDPR derogations for scientific research depend on compliance with Art. 89. Using deidentified personal data is an important measure that demonstrates respect for the principle of data minimisation, showing that the research purposes can be fulfilled by processing that does not permit the identification of the data subject.
  • Some data subject rights no longer apply. Since the controller can’t identify the specific personal data of a specific data subject, it won’t be possible to respond to requests for data access (Art. 15), data rectification (Art. 16), data deletion (Art. 17), restriction of processing and notification (Art. 18 & 19) and data portability (Art. 20) – unless the data subject provides additional information enabling his or her identification.
  • Withdrawing consent is likely to have no effect on the project’s data. Since the controller can’t identify the specific personal data of the individual withdrawing his or her consent, it is not possible to stop processing that data or to remove it from the project – unless the data subject provides additional information enabling his or her identification.
  • The right to object is likely to have no effect. As with consent withdrawal, since the controller can’t identify the specific personal data of the individual objecting to the processing, it is not possible to specifically stop the processing of his or her data.
  • Further processing of data that was collected for other purposes is likely allowed. According to Art. 5(1)(b), further processing for scientific research purposes shall, in accordance with Art. 89(1), not be considered incompatible with the initial purposes.
  • Providing information directly to data subjects when their data is further processed is likely not necessary. The obligation listed in Art. 14 to provide information individually and directly to data subjects (e.g., by personally addressed email or post) cannot be met, as the controller can’t identify the specific data subjects to whom the information would have to be provided. Making this information publicly available (e.g., by posting it on the project’s website) would often be considered sufficient.
  • Personal data breach notifications are likely not needed. Using deidentified data can reduce the risk of harm to individuals that may arise from a personal data breach. Since the controller can’t identify the specific data subjects who would have to be informed of a breach, personal notifications of data breaches are not possible.
  • Data may be stored for longer periods. It is easier to justify storing data for 10 years or more to safeguard the scientific record, as Art. 5(1)(e) allows data to be stored for longer periods insofar as the personal data will be processed solely for scientific research purposes in accordance with Art. 89(1).

It is often not possible to fully anonymise research data without destroying its utility, and there are also cases where using identified personal data is clearly necessary (contact information is needed to book interviews with data subjects, or to send survey invitations). Apart from those cases, projects should aim to deidentify their personal data as much as possible. Deidentified data represents the compromise between the requirement to apply data minimisation and the necessity of having data that is useful for the purpose.

The reason for not making personal data publicly available is to protect individuals from any negative consequences (physical, material and non-material damages) that could arise from making their personal data publicly available. However, privacy should not be used as an excuse to prevent individuals from reaping positive outcomes from making their data publicly available. Quote attribution, authorship of scientific works, or professional recognition are a few examples where making personal data (e.g., names) publicly available may bring a positive outcome for individuals.

Making personal data publicly available is a processing operation that is covered by the GDPR, which requires Implementing Data Protection by Design and by Default. In summary, you need to:

  • Identify the purpose of making personal data publicly available – why this is necessary to achieve your project’s goals – and define which ‘lawful basis’ you will rely on; consent is often used for research projects.
  • Identify the minimum amount of personal data that needs to be published while still satisfying your project purpose. Even if it is not possible to fully anonymise the data, you must deidentify it as much as possible while ensuring it retains its utility.
  • Provide sufficient information to ensure that the people behind the data fully understand the benefits of making their data public, how their data will be deidentified and what possible risks this may bring to them.
  • If consent is used for making personal data public, you also need to consider that if individuals withdraw their consent, you will likely have to remove their personal data from the published dataset. However, as mentioned above, if published data is deidentified, it will not be possible to identify the requesting individual’s personal data within the published dataset, and (per Art. 11) you will not be required to remove their data.
  • Lastly, you need to keep sufficient documentation that demonstrates the compliance efforts that allowed personal data to be published (often, by completing a Privacy Scan).

The Anonymisation Decision-Making Framework (ADF) aims to provide a practical understanding of anonymisation. It has produced a Guide and other specific tools and templates intended for those who have microdata that they need to anonymise.

The UK Information Commissioner’s Office has guidance on anonymisation: How do we ensure anonymisation is effective?

The Opinion 05/2014 on Anonymisation Techniques | WP 216 (10 April 2014) is the Article 29 Working Party’s official guidance on anonymisation and on assessing whether information is personal data.

The Irish Data Protection Commission Guidance on Anonymisation and Pseudonymisation.