Data Sharing: What is it, really?

Geneva, September 2020

By Maya Plentz

The UN, its agencies, and the Swiss government are leading a series of preparatory conferences to discuss the agenda of the upcoming UN World Data Forum. The UN Brief interviewed participants from the private sector as well as UN agencies and governments to look at some of the issues on the table. Our objective is to add to the discourse by hearing directly from the representatives of multilateral bodies, government officials, and private sector actors.

This month we are launching a series on Data Governance, with a guest post from Mr. Stephen MacFeely, Chief Statistician at the United Nations Conference on Trade and Development. Mr. MacFeely has published extensively on the subject of data gathering and assessment for populating the indicators of the UN Sustainable Development Goals.

Data Sharing: Challenges and Opportunities

By Stephen MacFeely, Chief Statistician, UNCTAD

Geneva, June 2020

When I hear these words, data sharing, my immediate reaction is to start asking questions – I can’t help it.

Firstly, what do we mean by the word ‘data’? Are we talking about aggregate statistics or are we talking about microdata, i.e., individual records? If it’s the latter, are they anonymised (and how well have they been pseudonymised), or do the records contain information that could identify persons or entities? The answers to those questions will have a profound impact on what I might say about ‘sharing’.

But even regarding the word ‘sharing’ I have questions. What do we mean by sharing? Does that mean public dissemination? Or does it mean giving selective, bilateral access? Does it require transmission of the data? Who are we sharing the data with – another unit within the same entity? With an external partner? When the data were first collected were any conditions attached that would prohibit sharing or giving access?

So, let’s explore a bit.

If we are talking about aggregate statistics, then there are relatively few complications, especially if the statistics are official. In that case, provided confidentiality is properly protected, official statistics are designed to be public goods and should by definition be accessible to (shared with) everyone at the same time. See principles 1 and 6 of the UN Fundamental Principles of Official Statistics[1] and the Principles Governing International Statistical Activities[2] which are the ‘constitutions’ for national and international statistical compilers respectively.

Of course, if we are talking about microdata, i.e., individual records, then it’s a whole different discussion, as safeguarding confidentiality is much more challenging. So, first things first – where did the data come from? If they are primary data, i.e., you collected them yourself, what was the stated purpose and what guarantees did you give to the respondents? Did you tell them the data would/could be shared, and if yes, with whom and for what purposes?

If respondents were told their data would not be shared with anyone, then that promise must be respected. If it was made clear to respondents that data would be shared, then whatever conditionality was set out must be respected. So this might mean, at a minimum, stripping all unique identifiers (names, addresses, social security numbers, etc.), but probably also aggregating some data into cohorts, e.g., if someone is aged 37, then we might replace their actual age with an age cohort, say 30 – 40. Ditto for income, or any other factors that, when combined, might reveal an individual identity. For example, if you combine sex, occupation and town, then someone might be able to determine who that person is, and what their income or health status is.
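The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recipe for safe disclosure: the field names, the sample records, the ten-year cohort width and the threshold k are all hypothetical, and the cohorts are written as non-overlapping bands (30-39 rather than 30 – 40) so that every age falls into exactly one of them. The sketch drops the direct identifiers, replaces the exact age with a cohort, and then flags any combination of sex, occupation, town and age cohort that is shared by fewer than k records, since those are the records most at risk of being re-identified.

```python
from collections import Counter

# Hypothetical field names: the article names these kinds of identifiers,
# but it does not prescribe any particular schema.
DIRECT_IDENTIFIERS = {"name", "address", "social_security_number"}
QUASI_IDENTIFIERS = ("sex", "occupation", "town", "age_cohort")


def age_cohort(age, width=10):
    """Replace an exact age with a non-overlapping cohort label, e.g. 37 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record):
    """Drop direct identifiers and generalise the exact age into a cohort."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["age_cohort"] = age_cohort(cleaned.pop("age"))
    return cleaned


def risky_combinations(records, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    return [combo for combo, n in counts.items() if n < k]


# Entirely made-up sample records, for illustration only.
raw = [
    {"name": "A", "address": "1 High St", "social_security_number": "000-00-0001",
     "sex": "F", "occupation": "nurse", "town": "Smalltown", "age": 37, "income": 52000},
    {"name": "B", "address": "2 High St", "social_security_number": "000-00-0002",
     "sex": "F", "occupation": "nurse", "town": "Smalltown", "age": 32, "income": 49000},
    {"name": "C", "address": "3 High St", "social_security_number": "000-00-0003",
     "sex": "M", "occupation": "doctor", "town": "Smalltown", "age": 58, "income": 98000},
]

shared = [deidentify(r) for r in raw]
risky = risky_combinations(shared, k=2)
print(risky)  # [('M', 'doctor', 'Smalltown', '50-59')]
```

In this made-up sample the two nurses share a combination and so hide behind each other, while the single doctor in the town remains unique even after names and exact ages are gone. Such a record would need further aggregation (a wider region, a broader occupation group) or suppression before sharing.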

If the data are secondary, i.e., repurposed data that were not collected for statistical purposes, e.g., tax records (an example of administrative data) or mobile phone call detail records, or CDRs (an example of big data), then things get even more complex. As above, there may already be strict conditions attached to the data (from the primary data collector), but there will also be conditions attached to your use of the secondary data – maybe you don’t even have permission to share it.

As if things weren’t already complex enough – what about recursive data, i.e., data produced from other data? Now the waters start to get very muddy, because the issue of ownership is less clear. If I create data from your data, do I need your permission to share it? After all, you didn’t give it to me – I made it.

Apart from the legal/contractual and ethical issues, there is typically a range of logistical issues surrounding the sharing of data. If sharing involves transmission, i.e., moving the data, then it may require encryption, or if the files are very large it may require sophisticated IT infrastructure. But maybe the data are so sensitive they can only be shared under strict lab conditions – which means putting in place physical infrastructure and security. It may also require harmonized data infrastructure – common classifications and codes that allow datasets to speak to each other.

Speaking of legal issues, an important and persistent problem across many organisations is the ambiguity surrounding the licensing of their data and the terms of reuse. I think this stops a lot of data sharing, either out of genuine fear about what is allowed legally, or deliberately, i.e., it is used as a justification or excuse to hoard data rather than share it.

These are critical issues for modern statistical offices, whether national or international. The strategic plans ‘Data Strategy of the Secretary-General for Action by Everyone, Everywhere with Insight, Impact and Integrity 2020-22’ and the ‘System-wide Roadmap for Innovating UN Data and Statistics’, which were both endorsed by the UN CEB in May 2020, will together be grappling with these and other issues.

In the context of COVID-19, there is one last, but very important, set of issues to consider. COVID-19 has, I think, exposed a tension between community and individual rights, as the threat of terrorism has done in the past. That tension is becoming white hot in the ‘dataveillance’ debate, i.e., the debate over using data for surveillance. How do we balance the right to individual privacy against the ‘common good’ (however that might be defined) – and who decides? Should governments be allowed to track citizens using data shared by social media platforms or telecoms to contain COVID-19?

A big question with no easy answer.  

Maya Plentz interviews Mr. Stephen MacFeely