How AI can enable cross-border data sharing in a fragmenting world
Whether preparing for the next pandemic or monitoring the safety of generative AI, policymakers, business leaders, and academics need access to data from both within and outside their national borders. But instead of policies that enable data to flow more freely, constraints have become the norm. Globally, data flow restrictions more than doubled between 2017 and 2021. Late last year, the U.S. withdrew its long-standing request for the WTO to prohibit data localization requirements for e-commerce. This is a highly symbolic move from a country that has traditionally been one of the staunchest supporters of tearing down barriers in the digital world.
As a result of these shifts, the digital world has never been more fragmented. But we are not here to argue that all digital barriers must come down. As researchers from academic institutions in the U.S. (Harvard), Europe (INSEAD), and China (Tsinghua), and from a global company (Boston Consulting Group), we recognize that governments will continue to feel an obligation to protect their national security interests and citizens’ data. If anything, we may see more barriers erected in the years to come. But we shouldn’t—in fact, we can’t—give up on cross-border data sharing.
Recent events have illustrated the positive impact of sharing, not only within industries (as we have recently argued) but also across borders. For example, it took Mayo Clinic researchers in the United States just six weeks to calculate the increased risk of mortality from the COVID-19 Delta variant thanks to large-scale studies conducted on patient data from different national databases. This experience, though enabled by the exceptional circumstance of a global pandemic, still illustrates the power of sharing. But if the rise in data regulation continues at its current rate, such cross-border data sharing will become more and more difficult. This would have major implications, both for the global economy and for our collective ability to address issues that can only be solved by using data from multiple countries, such as anticipating natural disasters and coordinating responses and global aid, or identifying food safety issues in today’s weakening international supply chains.
Beyond the ‘raw data’ paradigm
One powerful solution is to be savvier about the different kinds of data currently available, and the appropriate policy response for each. Public discourse on cross-border data sharing has focused overwhelmingly on raw data. For example, a recent proposal from a Canadian think tank recommended using raw data to address issues such as global poverty and terrorism. The same focus can be observed in discussions on data sharing for trade agreements and in public health. It also shapes regulation, which makes sharing other forms of data unduly difficult. This is increasingly problematic because recent advances in AI have produced new forms of data that can be safer to transfer and share, and that in many contexts can create value without sharing raw data.
These new intermediary data types have emerged along the AI pipeline: the process of developing an AI model through a sequence of steps, moving from raw data to final AI solutions. At each step, data is transformed or created in ways that can both alleviate regulators’ worries and enable their problem-solving.
For example, raw data must first be transformed into a format that can be used effectively by machine learning models. The results of this transformation, called features and embeddings, often capture critical insights from raw data, and they get increasingly difficult to reverse-engineer as we move up the AI chain of data processing, especially as new privacy-preserving methods are being developed. This could have powerful implications in many sectors, including health care. Embeddings can represent raw medical records, minimizing the risk of patient reidentification and protecting confidentiality while enabling entities to share medical data across borders to, for example, accelerate responses to emerging global public health threats.
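To make the idea of turning raw data into features more concrete, here is a minimal sketch in Python. It uses the "hashing trick," a common feature-engineering technique that we have chosen purely for illustration. The record, its field names, and the function `hashed_features` are hypothetical. Hashing on its own is not a privacy guarantee, since low-variety fields can be guessed; the sketch only shows what it means for a record to leave the raw-data stage as a numeric vector.

```python
import hashlib

def hashed_features(record: dict, dim: int = 8) -> list:
    """Map a raw record to a fixed-length numeric vector (hashing trick)."""
    vec = [0.0] * dim
    for key, value in sorted(record.items()):
        token = f"{key}={value}".encode("utf-8")
        digest = hashlib.sha256(token).digest()
        # The first four bytes pick the slot, the fifth picks the sign.
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec

# Hypothetical record, invented for illustration only.
record = {"age_band": "60-69", "diagnosis": "J18.9", "country": "FR"}
features = hashed_features(record)
```

The resulting vector has a fixed length regardless of how many fields the record holds, and different fields can land in the same slot, which is part of why such representations are lossy.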
Valuable data can also be derived from the choices developers make when designing models, including hyperparameters (which guide how a machine learning model learns during training) and weights (the numerical values that help the model make its predictions). The sharing of such “model data” can accelerate the replication of models without sharing actual training data. For example, financial institutions in different countries seeking to improve their fraud prevention models could share these intermediate data without exposing sensitive information about their individual customers—resulting in a significantly more robust fraud detection system than if each bank relied only on its own data.
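As a rough illustration of what sharing "model data" could look like, the Python sketch below averages the weights of two separately trained models without touching either institution's training data. We do not prescribe a specific method here; plain unweighted averaging, in the spirit of federated averaging, is an assumption made for the example. The bank names, the weight values, and the function `average_weights` are all hypothetical.

```python
def average_weights(weight_sets):
    """Element-wise mean of equally sized weight vectors."""
    if not weight_sets:
        raise ValueError("need at least one set of weights")
    length = len(weight_sets[0])
    if any(len(w) != length for w in weight_sets):
        raise ValueError("all weight vectors must have the same length")
    n = len(weight_sets)
    # zip(*...) groups the i-th weight of every participant together.
    return [sum(column) / n for column in zip(*weight_sets)]

# Hypothetical weights from two banks' fraud models (same architecture assumed).
bank_a = [0.5, -1.0, 2.0]
bank_b = [1.5, 0.0, 1.0]
shared = average_weights([bank_a, bank_b])
```

Only the weight vectors cross the border in this sketch. A real deployment would also need to agree on a common model architecture and consider what the weights themselves might reveal.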
AI models are also able to create artificial data, so-called “synthetic data,” that can in turn be used to train other AI models in lieu of raw data. Because synthetic datasets are artificial, yet retain the patterns of the original raw data, they could be shared across borders without exposing sensitive information. Returning to the previous example, financial institutions could generate synthetic datasets composed of imaginary customers and transactions that still display their real customers’ collective behavioral patterns.
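A toy version of synthetic data generation can be sketched in a few lines of Python. The example fits a single normal distribution to a handful of transaction amounts and samples new, artificial amounts from it. This is a deliberate simplification on our part: real synthetic data generators use far richer models, and a fitted distribution can still leak information about small datasets. The amounts and the function `fit_and_sample` are hypothetical.

```python
import random
import statistics

def fit_and_sample(real_values, n, seed=0):
    """Fit a normal distribution to real values and draw n synthetic ones."""
    mean = statistics.mean(real_values)
    stdev = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical transaction amounts, invented for illustration only.
real_amounts = [12.0, 15.5, 9.8, 22.1, 18.3, 14.7, 11.2, 19.9]
synthetic_amounts = fit_and_sample(real_amounts, n=1000)
```

None of the synthetic amounts corresponds to a real transaction, yet their average and spread track those of the original list, which is the collective pattern a recipient would train on.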
The need for regulatory innovation
Sharing different types of data assets resulting from the AI pipeline can overcome some of the traditional barriers to data sharing. Of course, new challenges will likely emerge as the space of possibilities expands. But the crucial point is that such data assets will require different policies and sharing tools and frameworks tailored to their technical features.
However, today’s regulations do not account for all these new and emerging intermediary data categories. For instance, the global trade of certain data-driven services, such as in the financial or telecommunications spaces, is still regulated in part by agreements that predate the internet era—and, as such, don’t take into consideration new data categories. Instead, these categories tend to be treated like raw data—which means they are heavily restricted. And without urgent action, they are bound to be even more restricted over time.
With the advance of increasingly powerful AI, intermediary data types need to be regulated in a way that accounts for their specificities, such as their distinct use, value, or privacy-preserving features. Robust policies that make these distinctions will enable countries to share critical data on a larger scale, addressing pressing global issues while protecting citizens’ personal data. When it comes to data sharing, as with other innovations tied to the rapid development of AI, policymakers need to ensure that the rules of the game reflect the realities of the tech. There’s too much value at stake for a world confronted by global challenges and in ever-greater need of cross-border collaboration.
***
François Candelon is a partner at private equity firm Seven2 and the former global director of the BCG Henderson Institute.
I. Glenn Cohen is the James A. Attwood and Leslie Williams Professor of Law at Harvard Law School.
Theodoros Evgeniou is a professor of technology management at INSEAD and co-founder of trust-and-safety solutions provider Tremau.
Ke Rong is a professor at the Institute of Economics, School of Social Sciences, Tsinghua University in Beijing.
The authors would like to thank Guillaume Sajust de Bergues for his contribution to this piece.