Before the United States House of Representatives
Committee on Government Reform
Subcommittee on Government Efficiency, Financial Management and
Intergovernmental Relations
Mr. Chairman, Members of the Subcommittee, thank you for
inviting me to testify on the "Confidential Information
Protection and Statistical Efficiency Act of 2002," H.R.
5215. I ask that my written testimony be entered into
the record.
I am the Project Manager of The Heritage Foundation's Center for
Data Analysis (CDA). I help direct the work of researchers
who routinely use a wide variety of data supplied by the federal
government. In addition, the CDA has entered into licensing
agreements with a few federal agencies that permit our analysts to
use data that are not generally available to the public.
Although The Heritage Foundation is recognized as a conservative
public policy research institution, our analysts work with those
from diverse ideological perspectives on issues involving access to
quality data. This is why The Heritage Foundation
is a member of broad-based organizations such as the Association of
Public Data Users (APDU) and is an affiliate member of the Council
of Professional Associations on Federal Statistics (COPAFS).
It should be noted that the following testimony is my own view and
does not necessarily reflect that of The Heritage Foundation, or
any other organization.
Three standards for improving federal statistical policy
Government statistics are an indispensable component of
much of the work done by policy makers. Obvious examples include
economic indicators such as inflation and unemployment and
budgetary estimates involving taxes and the overall level of
spending. Crime, education and health care are just a few of
the other public policy areas in which statistics are regularly
used to better understand social problems and evaluate programs
that may affect them.
Today, I would like to discuss three standards that should guide
any proposal to improve America's statistical system. These
standards are: (1) protection of individual identity for the
respondents who provide original data, (2) production of useful,
timely information for data users, and (3) independent evaluations
of the data for decision-makers. These are the three I's of
statistical policy: Identity protection,
Information value, and Independent
evaluation.
The need to improve federal statistical policy is directly
related to our nation's dependence on high quality statistics. Data
sharing provisions, such as those contained in H.R. 5215, can
improve the quality of economic statistics produced by the
government. In addition, with appropriate
modifications, the identity of those providing data can be
better protected by confidentiality policies such as those in H.R.
5215. However, as I will explain later, it is crucial that
the language used to protect confidentiality not inadvertently and
unnecessarily eliminate the type of data access that is currently
available. After allowing for reasonable adjustments to
protect the identity of respondents, the public should have access
to the greatest amount of data possible. In addition, data
should be provided in a form that allows nongovernment researchers
to provide alternative interpretations of information produced by
the government's statisticians.
Two of the principles cited above have been applied in H.R.
5215. The sections concerning statistical efficiency
contained in Title 2 are examples of measures that can enhance the
value of information by improving the accuracy and timeliness of
economic data. I have left the more detailed discussion of
these issues to the economists and information providers who work
daily with these data. My testimony will focus primarily on
the identity protection aspects of Title 1. I will also
discuss the importance of data access to nongovernment
researchers.
Standard 1: Identity protection
Given the importance of numbers to government
decision-makers, it is perhaps surprising that the federal
statistical system is so fragmented and confusing. Individual
agencies have been added to the U.S. statistical system over a
period of many years and for different legislative
reasons. Over 70 agencies participate in the
collection, preparation, and dissemination of data collected from
administrative records, surveys and censuses. While some
agencies routinely generate wide-ranging products (e.g., the Bureau
of the Census) others focus on more specific areas. In
addition, statistics are produced as by-products in data collection
associated with administrative tasks (e.g., the Internal Revenue
Service).
The growth of America's statistical system has produced not only
a confusing set of statistical agencies but also an
inconsistent set of laws and policies designed to protect the
confidentiality of respondents who supply the government with
data.1 Some of the interagency coordination problems between
the Bureau of the Census, the Bureau of Labor Statistics, and
the Bureau of Economic Analysis would be reduced by changes such as
those in Title 2 of H.R. 5215.
In addition, the legislation provides a new set of definitions
and protections of confidentiality that would apply throughout the
government. Protections such as these are important because
the federal statistical system faces a serious problem of declining
public trust in government, specifically trust that a respondent's
identity will be kept confidential and that respondents will not be
harmed by the information they supply. A uniform policy to
protect the confidentiality of data providers is basic to the
development of high-quality data. Unless respondents can be
assured that the data they provide to the government for
statistical purposes will not be used against them through
regulations or other enforcement efforts, they will either not
provide data or they will report inaccurate information. In
either case, the effect is to create measurement biases and
errors.
Unfortunately, Congress is not actively considering any proposal
that would replace the current system with a coherent and
comprehensive set of rules for the protection of
confidentiality. Nevertheless, standards such as those in
H.R. 5215 provide a framework for resolving these differences in
the future. An important first step is to clearly distinguish
between statistical and administrative data.
The government collects a vast amount of administrative data in
conjunction with federally funded programs. With appropriate
safeguards, these data can be used for research purposes. For
example, administrative data can be used to determine whether
federal job training programs are effective in raising the incomes
of workers. However, data collected for statistical purposes
should rarely, if ever, be used for administrative
reasons.
Those who provide data to statistical agencies should not have
to worry that the government will use their individual responses to
decrease a monthly benefit check, increase their tax liability, or
impose a fine for violating a government regulation.
Confidentiality protections that clearly distinguish between
statistical and nonstatistical purposes, such as those found in
H.R. 5215, will help reinforce this important difference.
Statistical agencies must also protect the identity of
individuals who provide data that may eventually be released to the
public. Agencies protect confidentiality by modifying or
suppressing data that could be used to directly or indirectly
identify an individual respondent. Items such as names,
addresses and identifying codes such as social security numbers are
removed from publicly available databases.
In addition, reasonable steps are taken to ensure that
statistical disclosure does not occur. Statistical disclosure
can occur if the information that is released is so detailed that
analysts can, with a high degree of probability, associate the
information with a specific person or business.
Statistical agencies use procedures to alter data in order to
reduce the chance that this type of disclosure will occur.
Examples of these adjustments include cell suppression, the random
modification of data, and the use of topcoding.2 The effect
is to produce a database that is similar to the original file but
with anonymous information. Data in this form limit the risk
that the identity of respondents can be exposed through indirect
means. Provisions for protecting individual identities can be
found in plans such as H.R. 5215, which prohibit the release of
data in a form that could reasonably be expected to either directly
or indirectly yield the identity of a respondent.
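As a rough illustration of two of the adjustments named above, the following Python sketch applies topcoding and cell suppression to a handful of records. The records, the income cap, the cell threshold, and the function names are all hypothetical and chosen only for illustration; actual agency procedures are more elaborate (for example, suppressing additional cells so that a hidden count cannot be recovered from published totals).

```python
from collections import Counter

def topcode(values, cap):
    """Topcoding: report any value above `cap` as `cap` itself."""
    return [min(v, cap) for v in values]

def suppress_small_cells(counts, threshold):
    """Cell suppression: withhold (None) any count below `threshold`."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

# Hypothetical microdata as (county, income) pairs; invented for illustration.
records = [("A", 32000), ("A", 41000), ("A", 58000), ("B", 47000), ("B", 1250000)]

# The one very high income is no longer distinguishable from other high incomes.
incomes = topcode([income for _, income in records], cap=150000)

# County B has only two respondents, so its count is withheld from the table.
cell_counts = suppress_small_cells(Counter(county for county, _ in records), threshold=3)

print(incomes)
print(cell_counts)
```

Both adjustments hide the unusual respondent, but both also discard detail that a researcher might have wanted, which is the trade-off taken up under the next standard.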
Standard 2: Information value
Although necessary, procedures that protect
confidentiality also tend to reduce the amount and the value of
data that can be released. Technical adjustments to the
data by statistical agencies reduce the usefulness of data that are
available to the public and researchers. It is vital that the
methods adopted to protect individual identity do not inadvertently
or unnecessarily reduce the amount of information available to the
public.3
It is important that a distinction be made between a
respondent's identity and the data they provide. Individual-level
data are often referred to as microdata files because they contain
information about individual persons, families, business entities
or some other individual unit. They include items such as
age, race, sex, education levels, income and expenses.
Examples of these files include the Current Population Survey, the
Consumer Expenditure Survey, and the Survey of Consumer
Finances. These files provide the basis for much of the social
and economic research conducted by analysts in academic
institutions and in public policy organizations. This
research depends on convenient access to individual-level
data.4
Provisions to protect confidentiality are intended to shield the
identity of the respondent but not suppress all data at the
individual level. It is not necessary to adopt such extreme
forms of data suppression as those found in H.R. 5215. As
currently written, this bill states that agencies cannot disclose
data that are in "identifiable form." The bill further
defines data in "identifiable form" to mean the representation of
information that permits information about a specific
respondent to be reasonably inferred through either direct or
indirect means. This method of protecting confidentiality
precludes the disclosure of all individual-level
information that respondents provide despite the use of safeguards
that protect the identity of the respondents. Denying
researchers access to all the individual-level data would
drastically reduce the value of publicly available information and
undermine the quality of important research performed in the United
States.
The problem with the approach taken in H.R. 5215 arises because
it does not clearly distinguish between the identity of
the individual respondent and the information they
provide. Protection of confidentiality requires that the
identity of the individual be kept confidential. However,
other information that is currently available to
researchers should remain accessible. Confidentiality
protections such as those in H.R. 5215 should be modified so it is
clear that they protect the identity of
respondents.
Data providers often refer to a tension between the protection
of individual identity and the degree of information
usefulness. On the one hand, government statisticians want to
reassure respondents who provide data. On the other hand,
they would like to fulfill legitimate requests for data by
users. The tension is often depicted by statisticians in a
graph where the risk of disclosure is measured on one axis and the
amount of information provided is measured on the other axis.5 The
graph shows a trade-off in which a lower level of disclosure risk
leads to a reduction in the amount of information that can be
provided. The goal is to strike a balance that provides
reasonable protections for confidentiality and the greatest amount
of useful data. Although helpful, graphs that only plot
disclosure risks and the usefulness of data omit the role that data
play in protecting our form of government.
Standard 3: Independent evaluation
Although providing valuable data is a very important
standard, it is not enough for government statisticians to view
data access solely in terms of the amount of data they provide to
the public. In addition, the data should be sufficient so
that researchers outside the government can respond effectively to
government proposals - either to validate or to challenge
them. To function properly, the U.S. government depends on
the ability of potentially opposing interests to influence the
decision-making process and thereby reach a more informed and
reasoned outcome. The U.S. system of government was designed
with checks and balances, and depends for its effectiveness on the
free flow of information.
There is a subtle but critical difference between a standard for
the quality of information that is provided and a standard that
deals with the form in which it is provided. Government
statisticians may supply the public with a large quantity of
valuable data but this information typically comes packaged in
numerical aggregations and generalized categories. If
nongovernment researchers are to provide an independent evaluation
of official government data, they must have access to information
that is similar to that used by government statisticians.
Without this access, a basic U.S. principle of open government,
reflected in the U.S. Constitution and in many laws, most notably
the Freedom of Information Act (FOIA), will be violated. The
U.S. government was designed to be of and for the people, not to be
run by an elite with the unique ability to choose how data are to
be categorized, processed, and released.
A few examples may help clarify why the distinction between the
amount and form of data accessibility makes a difference. I
have selected two studies conducted by Heritage's data center and
ask that they be included in the record.6 Although these are
Heritage publications, I must point out that public policy analysts
commonly produce this type of research and I could have selected
from a large number of studies from individuals associated with
universities and nonprofit organizations.
The first report is an analysis of the distribution of income in
the United States. The authors of this study identify four
weaknesses with the official measurements of income inequality used
by the Census Bureau. For example, the quintiles that Census
uses to divide income do not contain an equal number of
people. In addition, the conventional Census figures do not
take into account the effects of taxation and omit many types of
cash and non-cash income. Because the underlying Census data
are publicly available, Heritage analysts were able to make the
adjustments they believed were appropriate to recompute the
distribution of income. The revised analysis shows a more
even distribution of income than that contained in official Census
reports.
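The point about unequal quintiles can be sketched in a few lines of Python. The sketch assumes, as an illustration rather than a description of the Census method, that quintiles are formed by ranking households and that household size tends to rise with income; the household incomes and sizes below are invented.

```python
# Hypothetical households as (income, number of persons); invented for illustration.
households = [(10000, 1), (20000, 1), (30000, 1), (40000, 2), (50000, 2),
              (60000, 2), (70000, 3), (80000, 3), (90000, 4), (100000, 4)]

def persons_per_household_quintile(households):
    """Rank households by income, cut them into five groups with equal
    numbers of households, and count the persons in each group."""
    ranked = sorted(households)
    size = len(ranked) // 5  # assumes the household count is divisible by 5
    return [sum(p for _, p in ranked[i * size:(i + 1) * size]) for i in range(5)]

people = persons_per_household_quintile(households)
print(people)  # persons in each quintile, lowest income first
```

In this invented example every quintile holds two households, yet the top quintile holds four times as many people as the bottom one. An analyst with access to the underlying records can regroup them so that each fifth holds the same number of persons; an analyst given only the published quintile totals cannot.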
A second Heritage report asked what share of child poverty can
be attributed to the growth of single parenthood since the
1960s. As with the previous study, analysts used data in a
form similar to that available to Census statisticians.
The report notes that "The March 2001 [Current Population Survey]
supplement, also known as the annual demographic file, includes
extensive questions on family demographic characteristics and
previous year income that make it useful for social analyses, such
as this one."7 Heritage analysts utilized the Census data to
estimate the effects that marriage rates have on poverty.
They were also able to use an expanded definition of income that
counts the Earned Income Tax Credit and food stamps as part of a
family's resources for determining whether the family is poor.
Examples of similar research can be found in Heritage reports on
education, taxation and the Social Security system. And, more
important, other public policy analysts who have divergent
political perspectives rely on the same type of data.
Although statistical agencies often state that they are committed
to providing access that allows for independent evaluations, there
are few regulations or laws that require them to do so.
Authors and sponsors of federally funded program evaluations seem
particularly reluctant to release their data sets to independent
researchers.8 Requiring public access to program evaluation
data encourages government evaluators to apply more rigorous
methods than would otherwise be the case. If we are to have
open and informed debate on public policy issues, it is vital that
all researchers have access to data that permit them to challenge
the government's official reports and to offer alternative
perspectives.
What Congress Should Do
To implement the three statistical standards described in this
testimony, Congress should:
- Provide guidelines, such as those in H.R. 5215, that clearly
distinguish between data that are used for statistical and
nonstatistical purposes. In addition, the guidelines should
specify, as they do in H.R. 5215, that reasonable measures be
implemented so that respondent identities cannot be determined
either directly or indirectly.
- Provide guidelines that clearly indicate that confidentiality
applies to the identity of the respondent. The current
version of H.R. 5215 is not sufficiently clear in this
respect. The protection of a respondent's identity does not
require that all the information about the respondent be
suppressed.
- Require that, whenever possible, federal agencies provide data
to independent researchers in a form that permits them to conduct
complete and independent evaluations.
ENDNOTES
1. Joe Cecil, Senior Research Associate at the Federal
Judicial Center, notes that "Records maintained by U.S. federal
agencies are governed by a web of federal statutes that are
'inconsistent at best and chaotic at worst' (Commission on Federal
Paperwork, 1977). The exchange of statistical information
must conform to standards that often were designed to guard against
administrative abuses, standards that may be inappropriate for
records used only for statistical purposes. As a result,
researchers who seek information maintained by federal agencies
often must recast their request for access in terms of a regulatory
scheme that does little to anticipate the special characteristics
of statistical data." See Joe S. Cecil, "Confidentiality
Legislation and the United States Federal Statistical System,"
Journal of Official Statistics, Vol. 9, No. 2, 1993, p.
519.
2. For a review of the adjustments that statistical agencies
employ and the possible effects they may have on the usefulness of
data, see articles in Pat Doyle, Julia I. Lane, Jules J.M. Theeuwes,
and Laura V. Zayatz, editors, Confidentiality, Disclosure, and
Data Access: Theory and Practical Applications for
Statistical Agencies (New York: Elsevier Science, 2001).
3. Some agencies are allowed to provide data to external
researchers through data licensing or use agreements. These
licenses extend the legal responsibilities for handling
confidential data to the external researcher. They can be an
effective means of preserving respondent confidentiality without
significantly affecting the quality of research that can be
performed off-site by nongovernment analysts. For a review of
licensing arrangements see: Marilyn M. Seastrom, "Licensing,"
pp. 279-289, in Doyle, et al., editors, Confidentiality,
Disclosure, and Data Access: Theory and Practical
Applications for Statistical Agencies. Other alternatives,
such as making researchers special sworn employees, are much less
effective in providing data access. The access provided is
time-consuming to obtain, costly, temporary and must be carried out
at a remote site. In addition, special requirements often
limit the research to those subjects that further the mission of
the statistical agency.
4. This issue was considered by the members of The Panel
on Confidentiality and Data Access of the Committee on National
Statistics. They warn that efforts by statistical agencies to
protect confidentiality could significantly reduce the value of the
data. "Because of legitimate concerns about the possibility
of disclosure of individual information, statistical agencies have
limited the amount of detailed data provided to nongovernment users
in tabulations and public-use microdata files. This lack of
detail restricts the ability of users to do analyses that could
contribute to the understanding of significant economic, social,
and health problems." The panel recommended that "Statistical
agencies should continue widespread release, with minimal
restrictions on use, of microdata sets with no less detail than
currently provided." See George T. Duncan, Thomas B. Jabine,
and Virginia A. de Wolf, editors, Private Lives and Public
Policies: Confidentiality and Accessibility of Government
Statistics (Washington, D.C.: National Academy Press,
1993), p. 7.
5. See, for example, various papers in: Doyle, et al.,
Confidentiality, Disclosure, and Data Access: Theory
and Practical Applications for Statistical Agencies.
6. See the attached reports: Robert Rector and Rea
S. Hederman, "Income Inequality: How Census Data Misrepresent
Income Distribution," The Heritage Foundation, Center for Data
Analysis Report, September 29, 1999, and Robert Rector, Kirk A.
Johnson, and Patrick F. Fagan, "The Effect of Marriage on Child
Poverty," The Heritage Foundation, Center for Data Analysis Report,
April 15, 2002.
7. Rector, Johnson, and Fagan, "The Effect of Marriage on
Child Poverty," p. 3.
8. For example, The National Job Corps Study,
funded by the Department of Labor (DOL) and authored by Mathematica
Policy Research (MPR), was published in July 2001. The DOL
and MPR have denied requests to release the data used for the
study. In addition, the Community Oriented Policing Services
(COPS) refused a FOIA request by The Heritage Foundation to release
data from the National Evaluation of the Effect of COPS Grants
on Crimes from 1994 to 1999.
*******************
The Heritage Foundation is a public policy, research, and
educational organization operating under Section 501(c)(3). It is
privately supported, and receives no funds from any government at
any level, nor does it perform any government or other contract
work.
The Heritage Foundation is the most broadly supported think tank
in the United States. During 2001, it had more than 200,000
individual, foundation, and corporate supporters representing every
state in the U.S. Its 2001 contributions came from the following
sources:
Individuals 60.93%
Foundations 27.02%
Corporations 7.61%
Investment Income 1.60%
Publication Sales and Other 2.84%
The top five corporate givers provided The Heritage Foundation
with less than 3.5% of its 2001 income. The Heritage Foundation's
books are audited annually by the national accounting firm of
Deloitte & Touche.
Members of The Heritage Foundation staff testify as individuals
discussing their own independent research. The views expressed are
their own, and do not reflect an institutional position for The
Heritage Foundation or its board of trustees.
*******************
Ralph A. Rector, Ph.D., is Research Fellow and Project Manager at
The Heritage Foundation's Center for Data Analysis (CDA). The
CDA conducts research and publishes empirical studies on issues
such as education, crime, welfare, and public finance. Rector
directs CDA research and development activities, including the
development of new computer software and databases. He serves
on the Board of Directors of the Council of Professional
Associations on Federal Statistics (COPAFS). Before joining
Heritage, he worked in the Tax Policy Economics Group at Coopers
& Lybrand, L.L.P., where he supervised the construction of
microsimulation models used to analyze the impact of tax reform on
businesses and individuals. He has managed projects involving
the use of large-scale relational databases and economic
models. He has also served as a tax analyst and revenue
estimator at the state and federal levels. Rector holds a
Ph.D. in economics from George Mason University.