UK Biobank health data keeps ending up on GitHub

adwf

That's the least of it: https://www.bbc.co.uk/news/articles/cpvxgl3n138o

All 500,000 participants for sale on Alibaba...

And the official response: https://www.ukbiobank.ac.uk/news/a-message-to-our-participan...

michaelt

> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.

To me it seems rather naive to have done that.

After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.

If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.

Too late to do anything about it now though :(

anitil

I've opted in to Australia's version of the biobank knowing that it's inevitable that it will be leaked some day. I think the data is so valuable in perpetuity that it's worth it. I remember Ben Goldacre has been working on how to make data more accessible in a safer way to (in part) avoid this very thing, but I haven't heard much of it since. [0]

[0] https://www.bennett.ox.ac.uk/blog/2025/02/opensafely-in-brie...

captn3m0

Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65... (Uses Date of Birth column).

And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...

> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.

> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.

https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
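
Since the quoted README warns that the tools may fail silently when disk space runs out, one cheap guard is to check free space before starting. This is only a sketch: the helper name `has_enough_space` and the 1.5x safety factor are assumptions for illustration, not anything documented by UK Biobank or the Broad repository, and the real space needed depends on the dataset and output format.

```python
import shutil


def has_enough_space(path: str, required_bytes: int, safety_factor: float = 1.5) -> bool:
    """Return True if the filesystem holding `path` has room for the job.

    `required_bytes` is the caller's own estimate of the unpacked size;
    `safety_factor` is an arbitrary margin (assumption, tune as needed).
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * safety_factor


if __name__ == "__main__":
    # Hypothetical estimate: 200 GiB for the decrypted and converted files.
    estimate = 200 * 1024**3
    if not has_enough_space(".", estimate):
        raise SystemExit("Not enough free disk space; refusing to start unpacking.")
    print("Enough space; safe to start the unpack step.")
```

A check like this only covers the start of the run, so it is still worth verifying the output files afterwards.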

dariosalvi78

The issue is with Jupyter notebooks, because they keep some of the data in the output (typically a few rows, but still). They should strongly recommend using regular Python scripts and keeping the Jupyter notebooks just for verification, which is also a very sane thing to do from a software engineering perspective.
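
For notebooks that do get committed, the saved outputs can be cleared first. A minimal sketch, assuming the standard nbformat 4 JSON layout in which each code cell carries an `outputs` list and an `execution_count`; the function name `strip_outputs` and the sample notebook are hypothetical, and existing tools such as `nbstripout` or `jupyter nbconvert --clear-output` cover this more thoroughly:

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered rows, plots, tracebacks
            cell["execution_count"] = None  # drop the run counter as well
    return cleaned


if __name__ == "__main__":
    # Hypothetical notebook with made-up values in its saved output.
    nb = {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "source": "df.head()",
                "metadata": {},
                "execution_count": 7,
                "outputs": [{"output_type": "stream", "name": "stdout",
                             "text": "id  date_of_birth\n1   1970-01-01\n"}],
            }
        ],
    }
    print(json.dumps(strip_outputs(nb), indent=1))
```

This only clears code-cell outputs. Data pasted into markdown cells, cell attachments, or the source itself would still need a manual check.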

mhh__

I haven't been paying attention to it, but wasn't there some kerfuffle over some people threatening to leak it over not being allowed to publish controversial findings?

mil22

The irony is, they don’t even provide the data to the participants themselves.

NGRhodes

Thank you for sharing. I work in a central RSE team and have raised this topic with the team, with a view to bringing attention to this issue and better educating our researchers (as part of our training offerings and documentation).

John7878781

What are the pros/cons of just open-sourcing everything for future biobank projects?

nxobject

From the perspective of someone who's worked with (biostatisticians who touch) Medicaid and Medicare billing data...

It looks like they've identified the institutions, at least... but aren't identifying them to the public for now. Are there going to be consequences? Are they going to be identified and sanctioned beyond "having their access suspended"?

In the US, HHS wouldn't hesitate to name, shame, and impose a sanction with corrective action plans. Not knowing much about how things work across the pond, I'm sure CMS PII gets used more often in research without these leaks left and right.

deknos

so.. where can i get this data? XD