Data access and availability are a barrier to entry for deep tech applications.
We can overcome this during the initial performance-indicator stages through data anonymization.
**Main anonymization techniques**
## Data Masking
Data masking gives access to a modified version of the sensitive data.
Common techniques include:
- encryption,
- term or character shuffling,
- dictionary substitution
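As a minimal sketch of the last two techniques, character shuffling and dictionary substitution might look like this (the substitution table and function names are illustrative assumptions, not a standard API):

```python
import random

def mask_by_shuffling(value: str, seed: int = 42) -> str:
    """Mask a sensitive string by shuffling its characters."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

# Hypothetical substitution dictionary mapping real terms to stand-ins.
SUBSTITUTIONS = {"Acme Corp": "Company A", "London": "City X"}

def mask_by_substitution(text: str) -> str:
    """Replace each known sensitive term with its stand-in."""
    for real, fake in SUBSTITUTIONS.items():
        text = text.replace(real, fake)
    return text
```

A production system would typically use deterministic encryption or a vetted masking library rather than these toy helpers.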
## Pseudonymization
Pseudonymization is a data de-identification method: private identifiers are replaced with pseudonyms or false identifiers, preserving data confidentiality and statistical precision.
For example, the name “David Bloomberg” might be replaced with “John Smith”.
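A key property is that the same identifier always maps to the same pseudonym, so statistics across records stay consistent. A minimal sketch (the `Person-NNNN` naming scheme is an assumption for illustration):

```python
import itertools

class Pseudonymizer:
    """Replace private identifiers with stable, reusable pseudonyms."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._mapping: dict[str, str] = {}

    def pseudonym(self, identifier: str) -> str:
        # Assign a new pseudonym on first sight, reuse it afterwards.
        if identifier not in self._mapping:
            self._mapping[identifier] = f"Person-{next(self._counter):04d}"
        return self._mapping[identifier]
```

In practice the mapping table is itself sensitive and must be stored separately from the pseudonymized data, since it allows re-identification.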
## Generalization
Generalization excludes or coarsens certain data to make it less identifiable.
Data could be changed into a range of values with logical boundaries.
For instance, the house number at a specific address could be omitted, or replaced with a range within 200 house numbers of the original value.
The idea is to **remove certain identifiers** without compromising the data’s accuracy.
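The house-number example above can be sketched as a simple bucketing function (the bucket width of 200 follows the example; the function name is an assumption):

```python
def generalize_house_number(number: int, bucket: int = 200) -> str:
    """Replace an exact house number with a range of logical boundaries."""
    low = (number // bucket) * bucket
    return f"{low}-{low + bucket - 1}"
```

The same pattern applies to ages, salaries, or dates: replace the exact value with the interval that contains it.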
## Data Swapping
Data swapping, also called shuffling or data permutation, rearranges dataset attribute values so that they don’t match the initial information.
Switching columns (attributes) that feature recognizable values, including date of birth, can greatly influence anonymization.
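A minimal sketch of swapping one attribute across records, assuming records are plain dictionaries (the function name and seed handling are illustrative):

```python
import random

def swap_column(records: list[dict], key: str, seed: int = 0) -> list[dict]:
    """Shuffle the values of one attribute across all records.

    The overall distribution of the attribute is preserved, but the
    value no longer matches the record it originally belonged to.
    """
    values = [r[key] for r in records]
    random.Random(seed).shuffle(values)
    return [{**r, key: v} for r, v in zip(records, values)]
```

Note that swapping preserves column-level statistics but destroys correlations between the swapped attribute and the rest of the record, which may or may not be acceptable for the analysis at hand.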
## Data Perturbation
Data perturbation changes the initial dataset slightly by using **rounding methods and random noise**.
The magnitude of the perturbation must be proportional to the values being disturbed.
It is important to carefully select the noise scale or rounding base used to modify the original values: if it is too small, the data will not be sufficiently anonymized, and if it is too large, the data may no longer be recognizable or usable.
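Both variants can be sketched in a few lines; the parameter names (`noise_scale`, `base`) are assumptions for illustration:

```python
import random

def perturb(values: list[float], noise_scale: float = 1.0,
            decimals: int = 1, seed: int = 0) -> list[float]:
    """Add bounded uniform random noise to each value, then round."""
    rng = random.Random(seed)
    return [round(v + rng.uniform(-noise_scale, noise_scale), decimals)
            for v in values]

def round_to_base(value: float, base: float) -> float:
    """Round a value to the nearest multiple of a chosen base."""
    return base * round(value / base)
```

Choosing `noise_scale` or `base` too small leaves values nearly unchanged (weak anonymization); choosing them too large destroys the signal.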
## Synthetic Data
Synthetic data is algorithmically produced data with no connection to any real case.
Artificial datasets are created in place of using or modifying the original dataset, so protection and privacy are never put at risk.
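A toy sketch of fully synthetic record generation, with no lookup into any real dataset (the name pools and fields are invented for illustration; real systems often fit a generative model to the original data's distribution instead):

```python
import random

def synthesize_records(n: int, seed: int = 0) -> list[dict]:
    """Generate artificial person records with no link to real individuals."""
    rng = random.Random(seed)
    first_names = ["Alex", "Sam", "Jordan", "Casey"]
    last_names = ["Lee", "Patel", "Garcia", "Kim"]
    return [
        {
            "name": f"{rng.choice(first_names)} {rng.choice(last_names)}",
            "age": rng.randint(18, 90),
        }
        for _ in range(n)
    ]
```

Libraries such as Faker take this idea further with realistic names, addresses, and locales.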