Data Anonymization

Anonymized Data Definition

Data anonymization is the process of irreversibly altering sensitive or personally identifiable data in order to protect the privacy of data subjects.

[Figure: Data anonymization diagram showing the process of achieving anonymized data. Image from Nature.com]

FAQs

What is Data Anonymization?

Data anonymization is a type of information sanitization in which data anonymization tools encrypt or remove personally identifiable information from datasets to preserve data subjects’ privacy. This reduces the risk of unintended disclosure when information is transferred across boundaries, and it allows the data to be evaluated and analyzed after anonymization.

The European Union’s General Data Protection Regulation (GDPR) governs the processing of personal data about people in the EU and names pseudonymization as a safeguard for that data. Because truly anonymized data sets are no longer considered personal data, they fall outside the scope of the GDPR, which allows businesses to use the data for much wider purposes without violating their data anonymization policies or the data protection rights of data subjects. In the United States, de-identifying patient data under HIPAA is an integral part of the healthcare industry’s commitment to preserving patient privacy.

Data Anonymization Techniques

Data anonymization algorithms are designed to automate the process of protecting the identity of a data subject in a dataset. Some data anonymization methods include:

  • Generalization: removes or coarsens some parts of the data (for example, replacing an exact age with an age range) to make it less identifiable while retaining a useful degree of accuracy.
  • Perturbation: slightly modifies a dataset by adding random noise and rounding numbers.
  • Pseudonymization: replaces private identifiers with fake identifiers or pseudonyms. The terms data anonymization and pseudonymization are often used interchangeably, although pseudonymized data can still be re-linked to a person by whoever holds the mapping.
  • Scrambling: thoroughly mixes and rearranges the letters of a value.
  • Shuffling: also known as permutation or data swapping, swaps and rearranges attribute values between records in a dataset.
  • Synthetic data: algorithmically manufactures artificial datasets rather than altering the original dataset.
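
As an illustration, several of the techniques listed above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library, not a production anonymizer: the function names (generalize_age, perturb, scramble, shuffle_column), the sample records, and the noise and rounding parameters are all assumptions made for the example.

```python
import random


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a coarser range, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value: float, noise_scale: float, step: float, rng: random.Random) -> float:
    """Perturbation: add Gaussian noise, then round to the nearest multiple of `step`."""
    noisy = value + rng.gauss(0.0, noise_scale)
    return round(noisy / step) * step


def scramble(text: str, rng: random.Random) -> str:
    """Scrambling: rearrange the characters of a single value."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


def shuffle_column(rows: list[dict], column: str, rng: random.Random) -> list[dict]:
    """Shuffling: permute one attribute's values across rows, leaving other columns in place."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


# Hypothetical sample records, invented for illustration only.
rows = [
    {"name": "Ana", "age": 34, "salary": 52300.0, "city": "Lyon"},
    {"name": "Ben", "age": 47, "salary": 61850.0, "city": "Oslo"},
    {"name": "Cy", "age": 29, "salary": 48990.0, "city": "Riga"},
]

rng = random.Random(0)  # fixed seed keeps the sketch repeatable; avoid a fixed seed in real use
released = [
    {
        "name": scramble(row["name"], rng),
        "age": generalize_age(row["age"]),
        "salary": perturb(row["salary"], noise_scale=500.0, step=100.0, rng=rng),
        "city": row["city"],
    }
    for row in rows
]
released = shuffle_column(released, "city", rng)

for row in released:
    print(row)
```

Each function trades some accuracy for privacy in a different way: ages keep their decade, salaries stay close to the true value but land on a multiple of 100, and the set of cities is unchanged even though each city is no longer tied to its original record. Scrambling a short value such as a first name offers little protection by itself, so in practice direct identifiers are more often dropped or pseudonymized.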

The best choice of data anonymization tool depends largely on the complexity of the project and the programming language in use. A student conducting a survey will have different requirements from a data scientist analyzing banking customer transaction data.

Overall, professional data anonymization software should support compliance with GDPR requirements for anonymized data and offer interactive capabilities that let analysts query data dynamically through an interface after a one-time initial setup. R is one of the most popular languages in which to anonymize personal data.

Anonymized Data vs De-identification

De-identification is the process of preventing an individual’s identity from being compromised by removing or masking personally identifiable information. A common de-identification technique is pseudonymization, which masks personal identifiers in data records by replacing real names with a temporary ID.

When applied to metadata or general data about identification, de-identification is also known as data anonymization. While data anonymization prevents any future re-identification, even by the data controllers under any condition, de-identification may preserve identifying information that is capable of being re-linked by a trusted party in certain situations.
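
The distinction above can be made concrete with a small Python sketch. It assumes a hypothetical Pseudonymizer helper (not part of any library) that swaps real names for random temporary IDs and keeps a private lookup table. While the table exists, a trusted party can re-link an ID to a person, which corresponds to de-identification; once the table is destroyed, that re-linking path is gone.

```python
import secrets


class Pseudonymizer:
    """Hypothetical helper: maps real identifiers to random temporary IDs."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}  # real identifier -> temporary ID
        self._reverse: dict[str, str] = {}  # temporary ID -> real identifier

    def pseudonym(self, identifier: str) -> str:
        """Return a stable temporary ID for the identifier, creating one if needed."""
        if identifier not in self._forward:
            token = "ID-" + secrets.token_hex(4)
            while token in self._reverse:  # extremely unlikely collision; retry
                token = "ID-" + secrets.token_hex(4)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def relink(self, token: str) -> str:
        """Re-identification by the trusted party that holds the lookup table."""
        return self._reverse[token]

    def forget(self) -> None:
        """Destroy the lookup table so temporary IDs can no longer be re-linked here."""
        self._forward.clear()
        self._reverse.clear()


# Hypothetical records, invented for illustration only.
visits = [
    {"name": "Alice Martin", "diagnosis": "asthma"},
    {"name": "Alice Martin", "diagnosis": "flu"},
    {"name": "Bo Chen", "diagnosis": "flu"},
]

p = Pseudonymizer()
pseudonymized = [{"id": p.pseudonym(v["name"]), "diagnosis": v["diagnosis"]} for v in visits]
```

Here both of Alice Martin's visits receive the same temporary ID, so the records remain useful for analysis, and p.relink(...) recovers her name on request. After p.forget(), the same call raises KeyError. Destroying the table alone does not guarantee anonymity, since the remaining attributes may still identify someone when combined with other data.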

Data Anonymization Best Practices

The best approach to anonymization is multiple layers of defense, especially in Big Data analytics, where a single layer of anonymization is not sufficient. Implementing the following security measures adds layers of protection against de-anonymization attacks.

  • Database activity monitoring provides real-time alerts on policy violations in data warehouses, big data sets, mainframes, and relational databases.
  • A database firewall evaluates known vulnerabilities and blocks SQL injections.
  • Data discovery determines where data resides and data classification identifies the quantity and context of data on-premises and in the cloud.
  • Data loss prevention software detects potential data breaches by inspecting sensitive information while in use, in motion, and at rest.
  • Data masking will render sensitive data useless in the wrong hands.
  • User behavior analytics uses machine learning to establish a baseline for data access behavior and detect abnormal activity. 
  • A user rights management feature monitors data access and privileged user activity, and identifies inappropriate privileges.

Data Masking vs Anonymization

Data masking intentionally randomizes data by creating characteristic but inauthentic versions of personal user data using encryption and data shuffling techniques. This obfuscates personally identifiable data while preserving the characteristics of the data set, so that tests run on masked data should yield results comparable to those run on the original data set.

Data masking adds another layer of security to data anonymization by hiding certain pieces of data and showing data handlers only the pieces they are explicitly authorized to see. This facilitates safe application testing, in which authorized testers see only what they need to see.
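
A minimal Python sketch of this idea follows. The helper names (mask_value, view_for), the sample record, and the choice to leave the last four characters visible are assumptions made for illustration, not a description of any particular masking product.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`, keeping the length."""
    if visible <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[-visible:]


def view_for(record: dict[str, str], allowed_fields: set[str]) -> dict[str, str]:
    """Return a copy of the record in which fields outside `allowed_fields` are fully masked."""
    return {
        field: value if field in allowed_fields else mask_value(value, visible=0)
        for field, value in record.items()
    }


# Hypothetical customer record; the card number is a commonly used test value.
record = {"name": "Alice Martin", "card": "4111111111111111", "plan": "premium"}

# A tester who is only authorized to see the subscription plan.
tester_view = view_for(record, allowed_fields={"plan"})

# A partially masked card number, as often shown on receipts.
masked_card = mask_value(record["card"])

print(tester_view)
print(masked_card)
```

Because the masked values keep the original length and the field layout is unchanged, an application under test can still process the record, while the tester sees only the plan field in clear text.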

Does HEAVY.AI Offer a Data Anonymization Solution?

The HEAVY.AI platform provides interactive visualizations of massive amounts of aggregated and anonymized data, giving policy makers a detailed view of human behavior without compromising individuals’ personal data. Examples of anonymized data on HEAVY.AI’s Immerse platform include Covid-19 tracking and US Political Donations.