A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications

Claire McKay Bowen, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, Aaron R. Williams

Research output: Chapter in Book/Entry/Poem › Conference contribution

9 Scopus citations

Abstract

US government agencies possess data that could be invaluable for evaluating public policy, but often may not be released publicly due to disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data by using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also tested and evaluated the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
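
The abstract names two ingredients: sequential Classification and Regression Trees, and kernel density smoothing. A minimal sketch of how they can fit together is shown here. It is not the authors' implementation. The function names (`build_tree`, `smooth_draw`, `synthesize`), the toy variables, the `min_leaf` setting, and the normal-reference bandwidth rule are all assumptions made for illustration. Each variable is modeled with a tree fitted on the variables that precede it, and each synthetic value is a donor drawn from the matching leaf, perturbed with Gaussian kernel noise so that no confidential value is copied exactly.

```python
import random
import statistics

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, min_leaf=5):
    """Grow a regression tree; each leaf stores the observed y values."""
    best = None  # (total SSE after split, feature index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [v for row, v in zip(X, y) if row[j] <= t]
            right = [v for row, v in zip(X, y) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    # Stop when no admissible split reduces the squared error.
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, t = best
    li = [i for i, row in enumerate(X) if row[j] <= t]
    ri = [i for i, row in enumerate(X) if row[j] > t]
    return {
        "feat": j,
        "thr": t,
        "left": build_tree([X[i] for i in li], [y[i] for i in li], min_leaf),
        "right": build_tree([X[i] for i in ri], [y[i] for i in ri], min_leaf),
    }

def leaf_values(tree, row):
    """Route a (partial) record down the tree and return its leaf's values."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feat"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

def smooth_draw(values, rng):
    """Sample a donor value, then add Gaussian kernel noise around it."""
    donor = rng.choice(values)
    if len(values) < 2:
        return donor
    # Assumed bandwidth: normal-reference rule of thumb, 1.06 * sd * n^(-1/5).
    h = 1.06 * statistics.stdev(values) * len(values) ** -0.2
    return rng.gauss(donor, h)

def synthesize(data, n_synth, seed=0, min_leaf=5):
    """Generate n_synth fully synthetic records, one variable at a time."""
    rng = random.Random(seed)
    # First variable: smoothed draws from its marginal distribution.
    first = [r[0] for r in data]
    synth = [[smooth_draw(first, rng)] for _ in range(n_synth)]
    # Later variables: tree fitted on confidential data, applied to synthetic rows.
    for k in range(1, len(data[0])):
        tree = build_tree([r[:k] for r in data], [r[k] for r in data], min_leaf)
        for row in synth:
            row.append(smooth_draw(leaf_values(tree, row), rng))
    return synth

# Toy "confidential" records: [wages, withholding] (hypothetical variables).
toy_rng = random.Random(42)
toy = []
for _ in range(60):
    wages = toy_rng.uniform(0, 100)
    toy.append([wages, 0.1 * wages + toy_rng.gauss(0, 1)])
toy_synth = synthesize(toy, n_synth=20, seed=1)
```

In this sketch `min_leaf` and the kernel bandwidth act as the tuning knobs: larger leaves and wider kernels move synthetic values further from any single real record at some cost in fidelity, which is the kind of utility versus disclosure-risk tradeoff the abstract says was evaluated. Gaussian noise can also push values outside a variable's valid range (for example below zero), which a production method would need to handle.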

Original language: English (US)
Title of host publication: Privacy in Statistical Databases - UNESCO Chair in Data Privacy, International Conference, PSD 2020, Proceedings
Editors: Josep Domingo-Ferrer, Krishnamurty Muralidhar
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 257-270
Number of pages: 14
ISBN (Print): 9783030575205
State: Published - 2020
Event: International Conference on Privacy in Statistical Databases, PSD 2020 - Tarragona, Spain
Duration: Sep 23 2020 → Sep 25 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12276 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Conference on Privacy in Statistical Databases, PSD 2020
Country/Territory: Spain
City: Tarragona
Period: 9/23/20 → 9/25/20

Keywords

  • Classification and Regression Trees
  • Disclosure control
  • Synthetic data
  • Utility

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
