Synthetic Data Project

The Synthetic Data Project (SDP) was funded by the Institute of Education Sciences through the Maryland State Department of Education as part of the U.S. Department of Education’s ongoing support of state longitudinal data systems. The project was proposed by and is being implemented by the Research Branch of the Maryland Longitudinal Data System Center (MLDSC), a collaborative effort across the University of Maryland Baltimore and College Park campuses. The project began in July 2016 and runs through September 2019.

The central purpose of the SDP is to increase the usefulness and accessibility of the Maryland Longitudinal Data System (MLDS) data to researchers, policy analysts, and stakeholders at the local, state, and national levels. The MLDSC has an obligation to make data accessible; however, a constellation of confidentiality laws and policies included in the statute that created the MLDSC limits the ability of the Center to provide even de-identified data to individuals outside of the Center. The development of synthetic data sets from the data available in the MLDS would give researchers and policy makers the ability to analyze synthetic MLDS data directly. Synthetic data statistically behave like the actual individual-level data but do not include any real individuals or their variable values; they can therefore be shared with researchers and policy makers outside the MLDS Center staff. This technology has grown out of recent developments in the imputation of missing values in data sets. In other words, synthetic data sets are entirely imputed data sets based on the statistical properties of the actual data they mimic. The use of such synthetic data to expand access to government data sets is a cutting-edge but growing strategy.
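To make the idea of "entirely imputed" data concrete, the following is a minimal sketch of sequential synthesis for two numeric variables. It is not the MLDSC's actual method: the variables, the normal and linear models, and the function names `fit_line` and `synthesize` are all assumptions chosen for illustration, and the "real" records are made up.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)


def synthesize(xs, ys, n, rng):
    """Draw n synthetic (x, y) records from models fitted to the real data."""
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    a, b, sd = fit_line(xs, ys)
    records = []
    for _ in range(n):
        x = rng.gauss(mx, sx)              # x drawn from its fitted marginal
        y = a + b * x + rng.gauss(0, sd)   # y drawn from its model given x
        records.append((x, y))
    return records


rng = random.Random(0)
# Made-up stand-in for confidential records (hypothetical variables).
real_x = [rng.gauss(3.0, 0.5) for _ in range(500)]
real_y = [10 + 4 * x + rng.gauss(0, 1.0) for x in real_x]

synthetic = synthesize(real_x, real_y, 500, rng)
```

Every synthetic record is drawn from the fitted models rather than copied from a real record, yet the relationship between the two variables is approximately preserved. Real education data would need richer models for categorical, longitudinal, and clustered variables.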

Three synthetic data sets are planned: 1) high school to postsecondary education institutions, 2) high school to the work force, and 3) postsecondary education to the work force. Given the current data available in the MLDS, these data sets will include six years of data and one or more cohorts of students across each of those three transitions. Once the synthetic data sets are created, we will evaluate their utility by conducting the same analyses on both the raw and synthetic data sets and comparing the results. Data disclosure risk will also be tested at the individual level. Finally, a secondary goal of the project is to determine whether it is feasible to synthesize data in a way that retains the hierarchical or clustered nature of education data: students clustered in classrooms, schools, and school districts. In this case, possible disclosure of cluster-specific characteristics will also be evaluated to determine whether it raises confidentiality concerns. Once we have established the research utility and disclosure safety of these synthetic data sets, we will seek permission from the MLDSC Governing Board to release the data sets for use by interested policy makers, administrators, and researchers.
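The evaluation described above can be sketched as follows: run the same analyses on the raw and synthetic data, compare the estimates, and run a simple individual-level disclosure check. This is an illustrative sketch only. The statistics compared, the rounding-based match rule, and the names `utility_report` and `exact_match_rate` are assumptions, not the project's actual evaluation procedure, and both data sets here are simulated.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def analyses(records):
    """The set of analyses run identically on each data set."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "corr_xy": pearson(xs, ys),
    }


def utility_report(real, synth):
    """Map each analysis name to (real estimate, synthetic estimate, abs diff)."""
    real_est, synth_est = analyses(real), analyses(synth)
    return {
        name: (real_est[name], synth_est[name], abs(real_est[name] - synth_est[name]))
        for name in real_est
    }


def exact_match_rate(real, synth, digits=1):
    """Share of synthetic records equal to some real record after rounding."""
    real_keys = {tuple(round(v, digits) for v in r) for r in real}
    hits = sum(tuple(round(v, digits) for v in s) in real_keys for s in synth)
    return hits / len(synth)


def draw(n, rng):
    """Simulate n made-up (x, y) records; stands in for either data set."""
    out = []
    for _ in range(n):
        x = rng.gauss(3.0, 0.5)
        out.append((x, 10 + 4 * x + rng.gauss(0, 1.0)))
    return out


real = draw(1000, random.Random(1))
synth = draw(1000, random.Random(2))

for name, (r, s, d) in utility_report(real, synth).items():
    print(f"{name}: real={r:.3f} synthetic={s:.3f} abs_diff={d:.3f}")
print(f"exact-match rate: {exact_match_rate(real, synth):.3f}")
```

Small differences between the two columns suggest the synthetic data support similar conclusions for those analyses. The match rate is only a crude proxy for disclosure risk, since a coincidental match on coarse values does not by itself identify anyone.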

The Institute of Education Sciences (IES) Statewide Longitudinal Data Systems Grant Program produces a monthly publication, the SLDS Issue Brief. The August edition is devoted to Maryland’s Synthetic Data Project. The Issue Brief provides an overview of what synthetic data are, the goals of the Center’s Synthetic Data Project, and a discussion of the benefits of the project for Maryland and for other states seeking to protect confidential data while encouraging use of statewide longitudinal data systems for research, training, and evaluation.