A Novel Evaluation Metric for Synthetic Data Generation
  Differentially private synthetic data generation (SDG) solutions take input datasets Dp consisting of sensitive, private data and generate synthetic data Ds with similar qualities. The importance of such solutions is growing, both because more and more people realize how much data is collected about them and used in machine learning contexts, and because of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel, composite SDG evaluation metric that takes into account macro-statistical dataset similarities and data utility in machine learning tasks, weighed against the privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarities and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (Dp) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (Ds) based on them. We then test our evaluation metric for different values of the privacy budget ε. Based on our experiments, we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and shows the expected sensitivity to various privacy budget values.
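A composite metric of this kind can be illustrated with a small sketch. The specific ingredients below, a marginal-based statistical similarity (1 minus total variation distance per column) and a train-on-synthetic vs. train-on-real accuracy ratio, blended with a weight `w_stat`, are illustrative assumptions and not the exact formulation from the paper:

```python
# Hedged sketch of a composite SDG evaluation score. The marginal-based
# similarity, the utility ratio, and the weighting are illustrative
# assumptions, not the metric defined in the paper.
from collections import Counter

def marginal_similarity(real_col, synth_col):
    """1 - total variation distance between the two empirical marginals."""
    pr, ps = Counter(real_col), Counter(synth_col)
    n_r, n_s = len(real_col), len(synth_col)
    support = set(pr) | set(ps)
    tvd = 0.5 * sum(abs(pr[v] / n_r - ps[v] / n_s) for v in support)
    return 1.0 - tvd

def composite_score(real_cols, synth_cols, acc_real, acc_synth, w_stat=0.5):
    """Weighted blend of average marginal similarity and relative ML utility."""
    stat = sum(marginal_similarity(r, s)
               for r, s in zip(real_cols, synth_cols)) / len(real_cols)
    # Utility: accuracy of a model trained on synthetic data relative to one
    # trained on real data, capped at 1.
    utility = min(acc_synth / acc_real, 1.0)
    return w_stat * stat + (1.0 - w_stat) * utility

# Toy usage: identical column marginals and equal downstream accuracy
# yield a perfect composite score.
real = [["a", "a", "b", "b"], [0, 1, 0, 1]]
synth = [["b", "a", "b", "a"], [1, 0, 1, 0]]
print(composite_score(real, synth, acc_real=0.8, acc_synth=0.8))  # → 1.0
```

Under such a scheme, lower privacy budgets (smaller ε) would be expected to push both components, and hence the composite score, downward.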


Galloni, A., Lendák, I., & Horváth, T. (2020). A Novel Evaluation Metric for Synthetic Data Generation. In Intelligent Data Engineering and Automated Learning – IDEAL 2020: 21st International Conference, Guimarães, Portugal, November 4–6, 2020, Proceedings, Part II (pp. 25–34).



Synthetic data generation, Differential privacy, Evaluation metrics
