Beyond the Ballot: A Survey of Statistical Methods for Uncovering Election Anomalies

Issue Brief | Election Integrity
November 8, 2024

Kevin Dayaratna
Chief Statistician, Data Scientist, Senior Research Fellow

Summary

Free and fair elections are a cornerstone of a self-governing republic, yet malfeasance remains a persistent concern. Election forensics, a growing subfield of political science, offers a range of statistical tools to detect anomalies in election data and help ensure electoral integrity. This Issue Brief surveys several key methods, including Benford’s law, Bayesian finite mixture modeling, and outlier analysis, along with a range of other statistical techniques. It closes with policy recommendations aimed at ensuring that election data are made publicly available for independent analysis, reinforcing the protection of electoral integrity.

Key Takeaways

A variety of statistical tools can detect potential breaches in the integrity of elections.

These techniques draw on tools from many fields, including pure mathematics, statistics, and machine learning.

State policymakers should ensure that necessary data are publicly available so that, if concerns arise, voters can use them to assess the integrity of an election.

Free and fair elections are the foundation of a self-governing republic. Malfeasance is always a possibility, and proper security measures are of paramount importance.REF A number of statistical tools developed in a growing subfield of political science—known as election forensics—can be used to detect potential anomalies in election data. This Issue Brief surveys some of these tools and offers policy recommendations to help the public better leverage them.

Benford’s Law

Benford’s law, also known as the first-digit law, has been used to detect fraud in accounting, health care, real estate, government statistics, and science, among other areas.REF

In short, Benford’s law stipulates that in naturally occurring data, “1” should appear as the first digit approximately 30 percent of the time, “2” approximately 18 percent of the time, and each subsequent digit with declining probability; formally, the probability that the first digit is d is log10(1 + 1/d). It has, however, been well established that because election data—when distributed at the precinct level on a county-by-county basis—do not span enough orders of magnitude, analysis of the first digit is not useful for detecting potential fraud in elections.REF As a result, researchers have instead suggested a number of variations on Benford’s law for detecting anomalies in election data, including analysis of the distribution of the second digit or first-digit analysis after a mathematical transformation.REF These variations on traditional representations of Benford’s law have been applied to numerous elections, both executive and legislative, across the United States as well as internationally.REF For example, a 2022 study by Katie Anderson and colleagues flagged a number of counties in the contentious state of Ohio in the 2004 U.S. presidential election between George W. Bush and John Kerry.REF
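
To make the mechanics concrete, the following Python sketch tests a set of precinct-level vote counts against the second-digit Benford distribution using a chi-square goodness-of-fit test. The vote counts are purely illustrative, and a real analysis would use far more precincts than shown here.

    import math
    from collections import Counter

    from scipy.stats import chisquare  # chi-square goodness-of-fit test

    # Benford's law: P(first digit = d) = log10(1 + 1/d).
    # The second-digit analogue sums over all possible first digits f:
    # P(second digit = d) = sum over f of log10(1 + 1/(10*f + d)).
    second_digit_probs = {
        d: sum(math.log10(1 + 1 / (10 * f + d)) for f in range(1, 10))
        for d in range(10)
    }

    def second_digit_test(vote_counts):
        """Chi-square test of vote counts against second-digit Benford."""
        digits = [int(str(c)[1]) for c in vote_counts if c >= 10]
        tally = Counter(digits)
        observed = [tally.get(d, 0) for d in range(10)]
        expected = [second_digit_probs[d] * len(digits) for d in range(10)]
        return chisquare(observed, f_exp=expected)

    # Illustrative data: hypothetical precinct-level totals for one candidate.
    counts = [112, 87, 305, 241, 98, 156, 78, 432, 267, 189, 93, 301]
    stat, pvalue = second_digit_test(counts)
    print(f"chi-square = {stat:.2f}, p = {pvalue:.3f}")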

Some in the field of political science, however, have expressed skepticism about the use of Benford’s law in the context of elections. For example, a 2010 study by Joseph Deckert and colleagues in the journal Political Analysis issued a strong criticism of the use of Benford’s law in elections, claiming that Benford’s law is not adept at detecting anomalies in election data.REF A subsequent paper responded to this criticism, pointing out flaws in the arguments presented in the Deckert paper and maintaining that Benford’s law can indeed be appropriate in certain settings.REF

That said, many potential forms of malfeasance may remain undetected by Benford’s law. For example, if a fixed percentage of votes were switched between two candidates, the resulting alteration to each candidate’s aggregate distribution of vote counts would almost surely be undetectable by Benford’s law. Similarly, if a fixed number of ballots were added to each precinct, the alteration would also likely evade detection.REF
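
One way to probe claims like these is by simulation. The sketch below generates hypothetical log-normal vote counts for two candidates, switches a fixed 5 percent of votes between them, and compares second-digit frequencies before and after; the two distributions typically remain close, which illustrates why a digit test alone is unlikely to notice the switch.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical precinct-level counts for two candidates, drawn from a
    # log-normal distribution (a common stand-in for "natural" count data).
    a = rng.lognormal(mean=5.0, sigma=1.0, size=5000).astype(int) + 10
    b = rng.lognormal(mean=5.0, sigma=1.0, size=5000).astype(int) + 10

    # Switch a fixed 5 percent of each candidate's votes to the other.
    p = 0.05
    a_switched = ((1 - p) * a + p * b).astype(int)

    def second_digit_freqs(counts):
        """Empirical frequency of each second digit among counts >= 10."""
        digits = [int(str(c)[1]) for c in counts if c >= 10]
        return np.bincount(digits, minlength=10) / len(digits)

    # The before/after frequencies are typically nearly identical.
    print(np.round(second_digit_freqs(a), 3))
    print(np.round(second_digit_freqs(a_switched), 3))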

Moreover, as Walter Mebane has noted, fraud is not the only possible reason for deviations from Benford’s law. Analysts should consider other potential explanations for anomalies, such as gerrymandering, mobilization efforts, or strategic voting, in which voters support a less-preferred candidate to avoid an even less desirable outcome.REF

Regardless, Benford’s law is a long-standing statistical tool in the election forensics literature and should be viewed as one of many tools available for assessing the integrity of an election.

Bayesian Finite Mixture Modeling

Another approach to detecting anomalies in election data is a technique known as Bayesian finite mixture modeling. Developed primarily by Walter Mebane at the University of Michigan, this technique treats the percentage of votes garnered by each candidate on election day as an estimator of the true sentiment for each candidate. Mebane’s framework, known as eForensics, seeks to disentangle estimates of true sentiment from the sentiment itself by estimating probabilities of fraud.REF In doing so, the models also estimate the number of fraudulent ballots cast.
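
Mebane’s eForensics models are considerably more elaborate than anything that can be shown here, but the underlying finite-mixture idea can be sketched in a few lines of Python: fit a Bayesian mixture with a small number of components to precinct-level data and flag observations assigned to the minority component. The simulated data, the two-component setup, and the flagging rule below are illustrative assumptions, not Mebane’s implementation.

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(1)

    # Hypothetical precinct data: columns are turnout rate and one candidate's
    # vote share. Most precincts cluster around moderate values; a handful
    # (simulated here) sit in a high-turnout, high-share corner.
    clean = rng.normal(loc=[0.55, 0.50], scale=[0.07, 0.08], size=(480, 2))
    odd = rng.normal(loc=[0.92, 0.90], scale=[0.03, 0.04], size=(20, 2))
    X = np.clip(np.vstack([clean, odd]), 0.0, 1.0)

    # Fit a two-component Bayesian Gaussian mixture by variational inference.
    model = BayesianGaussianMixture(n_components=2, random_state=0).fit(X)
    labels = model.predict(X)

    # Treat the smaller component as the candidate "anomalous" regime.
    minority = np.argmin(np.bincount(labels, minlength=2))
    flagged = np.where(labels == minority)[0]
    print(f"{len(flagged)} precincts assigned to the minority component")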

Mebane has used these tools to flag ballots in the 2000, 2004, and 2016 U.S. presidential elections, raising concerns in Florida, Ohio, and Wisconsin, respectively.REF He has also applied the techniques to foreign elections, including legislative elections in Germany, Mexico, Canada, and Bangladesh, finding incremental fraud in each. In fact, his analysis suggests that malfeasance detected by the models may have influenced the outcome of the 2001 Bangladeshi legislative elections. These methods have been applied to many other elections as well, including elections in Venezuela, Turkey, Peru, Bolivia, and the Democratic Republic of the Congo.REF

Overall, Bayesian finite mixture modeling can be a useful tool for assessing the integrity of elections. The method is particularly useful because it provides estimates of the probability of fraud and of the expected number of fraudulent ballots cast. As a result, the approach can shed light on whether potential malfeasance could have altered the outcome of the elections analyzed.

That said, as with Benford’s law, analysis with these models does not definitively prove fraud, as there may be alternative explanations for anomalous results, including voter decisions about wasted votes and strategic behavior.REF Likewise, certain forms of fraud may be rendered undetectable by these tools. Regardless, Bayesian finite mixture modeling should be viewed as another tool for detecting potential malfeasance in elections.

Outlier Analysis

Statistical outlier analysis is also valuable for evaluating potential breaches in integrity, particularly when assessing the validity of absentee ballots. For example, the 2018 ninth district congressional election in North Carolina between Republican Mark Harris and Democrat Dan McCready was marred by allegations of fraud in the handling of such ballots. After hearing details of these allegations in an evidentiary hearing in February 2019, the North Carolina Board of Elections voted to call a new election due to fraud committed by Republican operatives.

Work by Dartmouth College political scientist Michael Herron, eventually published in Election Law Journal, was used as evidence in this hearing. Herron statistically examined irregularities in mail-in absentee ballots during this election.REF By comparing Mark Harris’s mail-in absentee vote share with his election day vote share, and contrasting this gap with similar data from other races and previous elections, Herron showed that Bladen County’s results differed markedly from those of other counties and other elections throughout the state. The study also includes comparisons with other states and finds Bladen County’s 2018 results to be the most anomalous. Herron’s analysis informed the North Carolina board’s decision to vacate the original election.
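
Herron’s published analysis is considerably richer, but its core comparison can be sketched in Python as follows. All county names and vote shares below are hypothetical; the point is simply that one county’s absentee-versus-election-day gap stands far outside the statewide pattern.

    import numpy as np

    # Hypothetical per-county data: a candidate's mail-in absentee vote share
    # and election day vote share (as fractions of ballots cast).
    counties = ["A", "B", "C", "D", "E", "Bladen-like"]
    absentee = np.array([0.42, 0.45, 0.40, 0.44, 0.43, 0.61])
    election_day = np.array([0.48, 0.50, 0.46, 0.49, 0.47, 0.52])

    # A candidate usually over- or under-performs by a similar margin across
    # counties, so flag counties whose absentee-minus-election-day gap is a
    # statistical outlier relative to the rest of the state.
    gap = absentee - election_day
    z = (gap - gap.mean()) / gap.std(ddof=1)

    for county, score in zip(counties, z):
        flag = "  <-- outlier" if abs(score) > 2 else ""
        print(f"{county:>12}: z = {score:+.2f}{flag}")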

The analytical tools in Herron’s study have broader applicability, offering a framework for identifying irregularities in absentee ballot patterns in other elections. By establishing baseline patterns in absentee voting, these methods can help to detect anomalies that may indicate irregularities or potential misconduct in future elections. Additionally, Herron’s approach gives election officials and policymakers a statistical basis for assessing whether absentee voting patterns are consistent with historical and district-wide norms, potentially informing more robust oversight and integrity measures. As such, Herron’s analysis should be viewed not merely as evidence that influenced the 2018 North Carolina congressional election outcome but as a valuable methodological contribution for safeguarding the integrity of other elections as well.

Statistical Analysis of Non-Fraudulent Breaches in Integrity

Of course, potential breaches of integrity are not necessarily due to fraud. For example, concerns in the 2000 presidential election between Republican George W. Bush and Democrat Al Gore were predicated not on fraud but on the allegedly confusing design of “butterfly ballots.” The issue arose in Palm Beach County, Florida, where the ballot listed candidates’ names on alternating sides with punch holes in the center. The design reportedly led a significant number of voters to vote for the wrong candidate or to cast multiple votes; in particular, many votes intended for Al Gore were mistakenly cast for Pat Buchanan instead.REF

Work published by Jonathan Wand and others in American Political Science Review statistically examined whether the allegedly disorienting design of these butterfly ballots meaningfully affected the election.REF The authors used widely applied statistical techniques to estimate the extent to which the ballot design may have led voters to mistakenly cast their votes for Buchanan instead of Gore. Comparing Buchanan’s vote share in Palm Beach County with his share in other counties nationwide, the authors find a significant discrepancy. Their analysis concludes that the butterfly ballot design directly contributed enough unexpected votes for Buchanan to potentially alter the election outcome in Florida.REF
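
Wand and colleagues used more sophisticated, outlier-resistant methods, but a bare-bones version of the idea can be sketched with an ordinary least squares regression and its residuals. The county shares below are simulated, with one Palm Beach-like anomaly injected deliberately.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical county data: Buchanan's share in a prior election and in
    # 2000. Most counties fall near a common line; the last one does not.
    prior_share = rng.uniform(0.002, 0.02, size=60)
    share_2000 = 0.4 * prior_share + rng.normal(0.0, 0.0008, size=60)
    share_2000[-1] += 0.03  # inject a Palm Beach-like anomaly

    # Ordinary least squares fit, then approximately standardized residuals.
    X = np.column_stack([np.ones_like(prior_share), prior_share])
    beta, *_ = np.linalg.lstsq(X, share_2000, rcond=None)
    resid = share_2000 - X @ beta
    std_resid = resid / resid.std(ddof=2)

    worst = np.argmax(np.abs(std_resid))
    print(f"county {worst} is {std_resid[worst]:+.1f} sd from the fitted line")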

Other Statistical Techniques

A number of other techniques have been used to assess anomalies in elections. For example, research published by Peter Klimek and co-authors in the Proceedings of the National Academy of Sciences analyzed the joint distribution of vote percentages and turnout.REF Comparing these “election fingerprints” across countries, the authors found that elections with alleged fraud—such as those in Russia and Uganda—exhibit distinct distributional patterns, including clusters of polling stations with both near-100 percent turnout and near-100 percent support for the winner, that do not appear in countries where fraud is not considered as prevalent. These tools can readily be adapted to American elections to assess potential concerns about local, state, and federal elections.
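
A minimal version of this election-fingerprint idea, on simulated polling-station data, looks like the following; the signature Klimek and colleagues flag is a second cluster of stations near 100 percent turnout and 100 percent support for the winner. All counts and parameters here are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical polling-station data. "Clean" stations vary around moderate
    # turnout and winner share; a small stuffed contingent (simulated here)
    # drifts toward the (100% turnout, 100% share) corner.
    turnout = np.concatenate([rng.normal(0.60, 0.10, 4800),
                              rng.normal(0.97, 0.02, 200)])
    share = np.concatenate([rng.normal(0.50, 0.10, 4800),
                            rng.normal(0.96, 0.02, 200)])
    turnout, share = np.clip(turnout, 0, 1), np.clip(share, 0, 1)

    # The "election fingerprint": a 2D histogram of turnout vs. winner share.
    # A second mode near (1, 1) is the pattern flagged in the paper.
    hist, _, _ = np.histogram2d(turnout, share, bins=20, range=[[0, 1], [0, 1]])
    corner = hist[-2:, -2:].sum()
    print(f"stations in the extreme corner: {corner:.0f} of {len(turnout)}")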

Research published by Mali Zhang and colleagues in PLOS ONE utilized predictive modeling to analyze the integrity of Argentina’s 2015 national elections. The authors simulated synthetic data that was either (a) fraud-free or (b) tainted by vote stealing and ballot-box stuffing, trained machine learning models on it, and then applied those models to the actual data to ascertain which mesas (polling stations) were at risk for fraud.REF They found slightly under 15 percent of mesas to be at risk.
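
A stripped-down version of this train-on-synthetic, score-the-real-data strategy might look like the following; the features, simulation parameters, and classifier choice are illustrative assumptions rather than the Zhang study’s actual specification.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(4)

    def simulate(n, stuffed):
        """Hypothetical per-mesa features: turnout and winner's vote share."""
        turnout = rng.normal(0.85 if stuffed else 0.60, 0.05, n)
        share = rng.normal(0.80 if stuffed else 0.50, 0.06, n)
        return np.clip(np.column_stack([turnout, share]), 0, 1)

    # Train on labeled synthetic mesas: fraud-free (0) vs. ballot-stuffed (1).
    X_train = np.vstack([simulate(2000, False), simulate(2000, True)])
    y_train = np.concatenate([np.zeros(2000), np.ones(2000)])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # Score "observed" mesas (here also simulated) by predicted fraud risk.
    X_obs = np.vstack([simulate(900, False), simulate(100, True)])
    risk = clf.predict_proba(X_obs)[:, 1]
    print(f"mesas with estimated risk > 0.5: {(risk > 0.5).sum()} of {len(risk)}")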

The techniques drawn on in the Zhang study have advantages that the other approaches mentioned earlier do not. In particular, these machine learning techniques scale more readily to large elections, and they are not as heavily constrained by the assumptions built into many standard statistical models.

Policy Implications

This Issue Brief offers an overview of statistical tools for assessing the integrity of elections. There are two main types of approaches:

  • Within-election analysis. This approach analyzes data for a particular election alone, making distributional assumptions about the data. Such assumptions, for example, are key to Benford’s law and finite mixture modeling.
  • Comparative election analysis. This approach compares elections to one another, either via time-series data for one particular type of election or via comparisons across countries. The distributional analysis by Klimek and colleagues in the Proceedings of the National Academy of Sciences is of this nature.

Election forensics is a quickly growing field that can be immensely useful in allowing the public to assess concerns about election integrity. Still, no single technique can serve as a definitive test for election misconduct. The methods presented in this Issue Brief—ranging from digit-based analysis to statistical modeling, outlier analysis, and machine learning—should be considered tools within a broader toolbox for detecting potential anomalies in election data.

For the public to use these and other statistical tools, policymakers should require that each state’s board of elections make election data publicly available in a machine-readable format such as plain text, comma-delimited (CSV), Microsoft Excel, or JavaScript Object Notation (JSON) files. At the same time, the data should protect the secrecy of ballots and ensure that individual voters’ choices cannot be disclosed or inferred from what is released. The data should be as localized as possible (at the polling station level or, failing that, the precinct level) and should include the number of registered voters as well as candidate vote totals. Then, should members of the public have any concerns, they can analyze the data themselves and, if compelling statistical evidence of malfeasance arises, inform the legal authorities associated with the election.
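
For concreteness, a hypothetical precinct-level extract in the comma-delimited format recommended here could be loaded with nothing more than Python’s standard library; the column names below are illustrative, not a proposed standard.

    import csv
    import io

    # A hypothetical precinct-level results extract in the comma-delimited
    # format recommended above. Column names are illustrative only.
    sample = "\n".join([
        "county,precinct,registered_voters,candidate,votes",
        "Adams,P-001,1250,Smith,412",
        "Adams,P-001,1250,Jones,388",
        "Adams,P-002,980,Smith,301",
        "Adams,P-002,980,Jones,276",
    ])

    # Any member of the public can load such a file with standard tools.
    rows = list(csv.DictReader(io.StringIO(sample)))
    total = sum(int(row["votes"]) for row in rows)
    print(f"{len(rows)} rows, {total} total votes")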

Leveraging statistical tools is essential for protecting election integrity. This Issue Brief offers straightforward policy recommendations—drawing upon the tools and insights discussed in this review—to empower the public in promoting fair election practices.

Kevin D. Dayaratna, PhD, is Chief Statistician, Data Scientist, and Senior Research Fellow in the Center for Data Analysis at The Heritage Foundation.
