Data access and availability are a barrier to entry for deep tech applications.
We can overcome this during the initial performance-indicator stages through data anonymization.
**Main anonymization techniques**
## Data Masking
Data masking gives access to a modified version of the sensitive data.
Common techniques include:
- encryption,
- term or character shuffling,
- dictionary substitution
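As a minimal sketch of the last two techniques, character shuffling and dictionary substitution might look like this (the substitution table and function names are illustrative assumptions, not a standard API):

```python
import random

def mask_by_shuffling(value: str, seed: int = 42) -> str:
    """Mask a sensitive string by shuffling its characters."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

# Hypothetical substitution dictionary mapping real terms to stand-ins.
SUBSTITUTIONS = {"Acme Corp": "Company A", "London": "City X"}

def mask_by_substitution(text: str) -> str:
    """Replace each known sensitive term with its stand-in."""
    for real, fake in SUBSTITUTIONS.items():
        text = text.replace(real, fake)
    return text
```

A production system would typically use deterministic encryption or a vetted masking library rather than these toy helpers.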
## Pseudonymization
Pseudonymization is a data de-identification method: private identifiers are replaced with pseudonyms or false identifiers, preserving data confidentiality and statistical precision.
For example, the name “David Bloomberg” might be replaced with “John Smith”.
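A key property is that the same identifier always maps to the same pseudonym, so statistics across records stay consistent. A minimal sketch (the `Person-NNNN` naming scheme is an assumption for illustration):

```python
import itertools

class Pseudonymizer:
    """Replace private identifiers with stable, reusable pseudonyms."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._mapping: dict[str, str] = {}

    def pseudonym(self, identifier: str) -> str:
        # Assign a new pseudonym on first sight, reuse it afterwards.
        if identifier not in self._mapping:
            self._mapping[identifier] = f"Person-{next(self._counter):04d}"
        return self._mapping[identifier]
```

In practice the mapping table is itself sensitive and must be stored separately from the pseudonymized data, since it allows re-identification.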
## Generalization
Generalization excludes or coarsens certain data to make it less identifiable.
Data could be changed into a range of values with logical boundaries.
For instance, the house number at a specific address could be omitted, or replaced with a range within 200 house numbers of the original value.
The idea is to **remove certain identifiers** without compromising the data’s accuracy.
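The house-number example above can be sketched as a simple bucketing function (the bucket width of 200 follows the example; the function name is an assumption):

```python
def generalize_house_number(number: int, bucket: int = 200) -> str:
    """Replace an exact house number with a range of logical boundaries."""
    low = (number // bucket) * bucket
    return f"{low}-{low + bucket - 1}"
```

The same pattern applies to ages, salaries, or dates: replace the exact value with the interval that contains it.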
## Data Swapping
Data swapping, also called shuffling or data permutation, rearranges dataset attribute values so that they don’t match the initial information.
Switching columns (attributes) that feature recognizable values, including date of birth, can greatly influence anonymization.
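A minimal sketch of swapping one attribute across records, assuming records are plain dictionaries (the function name and seed handling are illustrative):

```python
import random

def swap_column(records: list[dict], key: str, seed: int = 0) -> list[dict]:
    """Shuffle the values of one attribute across all records.

    The overall distribution of the attribute is preserved, but the
    value no longer matches the record it originally belonged to.
    """
    values = [r[key] for r in records]
    random.Random(seed).shuffle(values)
    return [{**r, key: v} for r, v in zip(records, values)]
```

Note that swapping preserves column-level statistics but destroys correlations between the swapped attribute and the rest of the record, which may or may not be acceptable for the analysis at hand.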
## Data Perturbation
Data perturbation changes the initial dataset slightly by using **rounding methods and random noise**.
The magnitude of the perturbation must be proportional to the values being disturbed.
It is important to carefully select the noise scale or rounding base used to modify the original values: if it is too small, the data will not be sufficiently anonymized, and if it is too large, the data may no longer be recognizable or usable.
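Both variants can be sketched in a few lines; the parameter names (`noise_scale`, `base`) are assumptions for illustration:

```python
import random

def perturb(values: list[float], noise_scale: float = 1.0,
            decimals: int = 1, seed: int = 0) -> list[float]:
    """Add bounded uniform random noise to each value, then round."""
    rng = random.Random(seed)
    return [round(v + rng.uniform(-noise_scale, noise_scale), decimals)
            for v in values]

def round_to_base(value: float, base: float) -> float:
    """Round a value to the nearest multiple of a chosen base."""
    return base * round(value / base)
```

Choosing `noise_scale` or `base` too small leaves values nearly unchanged (weak anonymization); choosing them too large destroys the signal.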
## Synthetic Data
Synthetic data is algorithmically produced data with no connection to any real case.
Artificial datasets are created in place of using or modifying the original dataset, so protection and privacy are never put at risk.
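A toy sketch of fully synthetic record generation, with no lookup into any real dataset (the name pools and fields are invented for illustration; real systems often fit a generative model to the original data's distribution instead):

```python
import random

def synthesize_records(n: int, seed: int = 0) -> list[dict]:
    """Generate artificial person records with no link to real individuals."""
    rng = random.Random(seed)
    first_names = ["Alex", "Sam", "Jordan", "Casey"]
    last_names = ["Lee", "Patel", "Garcia", "Kim"]
    return [
        {
            "name": f"{rng.choice(first_names)} {rng.choice(last_names)}",
            "age": rng.randint(18, 90),
        }
        for _ in range(n)
    ]
```

Libraries such as Faker take this idea further with realistic names, addresses, and locales.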