TY - GEN
T1 - A Synthetic Supplemental Public Use File of Low-Income Information Return Data
T2 - International Conference on Privacy in Statistical Databases, PSD 2020
AU - Bowen, Claire Mc Kay
AU - Bryant, Victoria
AU - Burman, Leonard
AU - Khitatrakun, Surachai
AU - McClelland, Robert
AU - Stallworth, Philip
AU - Ueyama, Kyle
AU - Williams, Aaron R.
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - US government agencies possess data that could be invaluable for evaluating public policy, but often may not be released publicly due to disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data by using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also tested and evaluated the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
AB - US government agencies possess data that could be invaluable for evaluating public policy, but often may not be released publicly due to disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data by using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also tested and evaluated the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
KW - Classification and Regression Trees
KW - Disclosure control
KW - Synthetic data
KW - Utility
UR - http://www.scopus.com/inward/record.url?scp=85092075542&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092075542&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-57521-2_18
DO - 10.1007/978-3-030-57521-2_18
M3 - Conference contribution
AN - SCOPUS:85092075542
SN - 9783030575205
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 257
EP - 270
BT - Privacy in Statistical Databases - UNESCO Chair in Data Privacy, International Conference, PSD 2020, Proceedings
A2 - Domingo-Ferrer, Josep
A2 - Muralidhar, Krishnamurty
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 September 2020 through 25 September 2020
ER -