Regulatory bodies around the world increasingly recognize that they need to regulate how governments use machine learning algorithms when making high-stakes decisions. This is a welcome development, but current approaches fall short.
As regulators develop policies, they must consider how human decisionmakers interact with algorithms. If they do not, regulations will provide a false sense of security to governments adopting algorithms.
In recent years, researchers and journalists have exposed how algorithmic systems used by courts, police, education departments, welfare agencies and other government bodies are rife with errors and biases. These reports have spurred increased regulatory attention to evaluating the accuracy and fairness of algorithms used (or proposed for use) by governments.
In the United States, Senate Bill 6280, passed by Washington State in 2020, and Assembly Bill 13, currently being debated in the California Legislature, mandate that agencies or vendors evaluate the accuracy and fairness of algorithms before public agencies can use them. Similarly, the Artificial Intelligence Act, proposed in April by the European Commission, requires providers of high-risk AI systems to conduct ex ante assessments of their systems' accuracy.
Although it is necessary to increase the scrutiny placed on the quality of government algorithms, this approach fails to fully consider how algorithmic predictions affect policy decisions. First, in most high-stakes and controversial settings, algorithms do not operate autonomously. Instead, they are provided as aids to people who make the final decisions. Second, most policy decisions require more than a straightforward prediction. Instead, decisionmakers must balance predictions of future outcomes with other, competing goals.
Consider the example of pretrial risk assessments, which many jurisdictions across the United States have adopted in recent years. These tools predict the likelihood that a pretrial defendant, if released before trial, will fail to appear in court for trial or will be rearrested before trial. To many policymakers and academics, pretrial risk assessments promise to replace flawed human predictions with more accurate and “objective” algorithmic predictions.
Yet even if pretrial risk assessments could make accurate and fair predictions (which many scholars and reform advocates doubt), this alone would not guarantee that these algorithms improve pretrial outcomes. Risk predictions are presented to judges, who must decide how to act on them. Furthermore, although judges must limit the risk of defendants being rearrested or not returning to court for trial, they must balance those goals with other interests, such as prioritizing the liberty of defendants.
A central question about the effects of algorithms like pretrial risk assessments is whether providing algorithmic risk predictions improves human decisions. Scholars and civil rights advocates have raised concerns that the emphasis these tools place on risk will prompt judges to treat defendants more harshly.
In a newly published journal article, my colleague Yiling Chen and I used an online experiment to evaluate how algorithms influence human decisionmaking. We found that even when risk assessments improve people’s predictions, they do not actually improve people’s decisions.
When participants in our experiment were presented with the predictions of risk assessments, they became more attentive to reducing risk at the expense of other values. This systematic change in decisionmaking counteracted the potential benefits of improved predictions. In the context of pretrial decisionmaking, the ultimate effect was to increase racial disparities in pretrial detention.
Our experimental results build on a growing body of evidence demonstrating that judges and other public officials use algorithms in unexpected ways in practice. These behaviors mean that government algorithms often fail to generate the expected policy benefits.
This empirical evidence demonstrates the limits of regulations that focus only on how an algorithm performs on its own. Even if an algorithm makes accurate predictions, it may not improve how government staff make decisions. Instead, tools like pretrial risk assessments can generate unexpected and undemocratic shifts in the normative balancing act that is central to decisionmaking in many areas of public policy.
It is necessary to expand regulations of government algorithms to account for how they influence human decisionmaking. Before an algorithm is deemed appropriate for a government body to use, there must be empirical evidence that the algorithm is likely to improve human decisionmaking.
Fortunately, there is a path for developing deeper knowledge about human interactions with algorithms before these tools are implemented in practice: conducting experimental evaluations of human-algorithm collaborations.
Prior to deployment, vendors and agencies should run experiments to test how people interact with a proposed algorithm. If an algorithm is deployed, its use should be continuously monitored to ensure that it generates the intended outcomes.
A proactive pipeline of evaluations along these lines should become a central component of regulations for how governments use algorithms. Before government agencies implement algorithms, there must be rigorous evidence about the impacts these tools are likely to have, along with democratic deliberation supporting those impacts.
Ben Green is a postdoctoral scholar in the Michigan Society of Fellows and an assistant professor in the Gerald R. Ford School of Public Policy. He is the author of “The Smart Enough City: Putting Technology in Its Place to Reclaim Our Urban Future.”