Modern data management for biobanks and related research. The need for infrastructure for accessibility and security.

as a national resource for biomedical research
Vetenskapsrådet (Swedish Research Council), 18 September 2007
Jan-Eric Litton, Professor, Medical Epidemiology and Biostatistics, Karolinska Institutet

The current situation
- No infrastructure in the country
- No standardisation
- Limited information about the biobank sample
- Sample collections are strongly dependent on individual people today
- Informatics thinking is lacking

Agenda
- Biobank informatics
- BIMS
- Federated databases
- Biobanks in Europe

Which problem are we trying to solve?
                                 Number of studies   Number of participants (targeted)
Single country
  Europe                                41                  3,500,000
    United Kingdom                       8                    600,000
    Scandinavian countries              19                  2,000,000
    Others                              14                    900,000
  America                               30                  3,000,000
    United States                       23                  2,600,000
    Others                               7                    400,000
  Australia/New Zealand                  4                    210,000
  Asia                                   8                  1,100,000
Several countries
  Europe, America, Australia             8                  2,000,000

Biobank informatics
A meeting place for bioinformatics and health informatics

What are the problems? The good old days
- Research data
- Study participants
- Researchers

What are the problems? Many different data sources
- Biological sample data
- Genotype data
- Phenotype data
- Registers

Biobank Information Management System (BIMS)
- Biological sample data
- Genotype data
- Phenotype data
- Registers

[Diagram: BIMS and the components around it: other biobanks, lab robot, LIMS, databases, freezer, web interface]

Data integration
Nine challenges

Challenge #1: Data compatibility
- "How old are you?"
- "When were you born?"
- "Birth date?"
All three have to pass through integration.
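The three questions on this slide record the same fact in different forms. A minimal Python sketch of one way to harmonise them into a single variable, birth year, follows. The field names (`birth_date`, `age`, `interview_date`) are hypothetical and not taken from any real questionnaire.

```python
from datetime import date


def birth_year(record: dict) -> tuple[int, str]:
    """Return (birth_year, precision) from a source-specific record.

    Field names are hypothetical. An age alone only pins the birth year
    down to one of two years, so the precision is reported with the value.
    """
    if "birth_date" in record:
        return date.fromisoformat(record["birth_date"]).year, "exact"
    if "age" in record and "interview_date" in record:
        interviewed = date.fromisoformat(record["interview_date"])
        # Someone aged N at interview in year Y was born in Y-N or Y-N-1.
        # The later of the two years is returned and flagged as approximate.
        return interviewed.year - record["age"], "approximate"
    raise ValueError("record carries neither a birth date nor an age")
```

Keeping the precision flag next to the value lets later analyses decide whether an approximate birth year is good enough, instead of silently mixing exact and derived values.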

Challenge #2: Different data models
- Data source with one data model
- Data source with similar data, but in another data model
- After integration: data in an integrated data model
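As an illustration of mapping two data models onto an integrated one, here is a small Python sketch. The shared model (`Participant`) and both source layouts are invented for the example; they are assumptions, not the structures used in BIMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    """Hypothetical integrated data model."""
    participant_id: str
    sex: str          # "F" or "M"
    height_cm: float


def from_source_a(row: dict) -> Participant:
    # Source A (hypothetical): flat rows, sex coded 1/2, height in centimetres.
    sex = {1: "M", 2: "F"}[row["sex"]]
    return Participant(str(row["id"]), sex, float(row["height_cm"]))


def from_source_b(doc: dict) -> Participant:
    # Source B (hypothetical): nested documents, sex as a word, height in metres.
    subject = doc["subject"]
    height_cm = round(doc["measurements"]["height_m"] * 100, 1)
    return Participant(subject["code"], subject["sex"][0].upper(), height_cm)
```

Each source gets its own small mapping function, so adding a new source does not require touching the mappings that already work.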

Challenge #3: Different ontologies
Code for chronic ischaemic heart disease:
  Ontology     Code
  ICD-10       I25.9
  ICD-9        414.9
  SNOMED CT    84537008
  UMLS         448589
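The code table on this slide can be turned into a simple cross-ontology lookup. The sketch below uses only the four codes listed on the slide; the concept label and the function name are hypothetical.

```python
# Codes for chronic ischaemic heart disease, exactly as listed on the slide.
CHRONIC_IHD = {
    "ICD-10": "I25.9",
    "ICD-9": "414.9",
    "SNOMED CT": "84537008",
    "UMLS": "448589",
}

# Reverse index: (ontology, code) -> shared concept label (label is hypothetical).
CONCEPTS = {
    (ontology, code): "chronic_ischaemic_heart_disease"
    for ontology, code in CHRONIC_IHD.items()
}


def same_concept(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """True if two (ontology, code) pairs map to the same known concept."""
    concept = CONCEPTS.get(a)
    return concept is not None and concept == CONCEPTS.get(b)
```

Unknown codes deliberately compare as "not the same" rather than raising an error, so that unmapped data shows up as a gap in the mapping table instead of stopping the integration.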

Challenge #4: De-identification
Applied as part of integration.
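The slide does not say how de-identification is carried out. One common approach, shown here only as an assumption, is to drop direct identifiers and replace the personal identifier with a keyed hash (HMAC-SHA256). Strictly speaking this is pseudonymisation rather than full anonymisation, and the field names are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(personal_id: str, secret_key: bytes) -> str:
    """Replace a personal identifier with a keyed pseudonym.

    The same identifier and key always give the same pseudonym, so records
    can still be linked, but the identifier cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy with direct identifiers dropped and a pseudonym added."""
    direct_identifiers = {"personal_id", "name", "address"}  # hypothetical fields
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["pseudonym"] = pseudonymise(record["personal_id"], secret_key)
    return cleaned
```

In this sketch the secret key would have to be held by a trusted party outside the research repository, since anyone holding it can recompute pseudonyms from known identifiers.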

Challenge #5: Who has the right to what?
- Data source A
- Data source B
- After integration: data from data sources A and B

Challenge #6: Different data formats
- Excel
- Oracle
- XML
- SAS
- Access
After integration: a common format.
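A minimal sketch of bringing two formats into one common representation (a list of dictionaries). Reading Excel, Oracle, SAS or Access needs third-party drivers, so this example uses CSV as a stand-in for a spreadsheet export, together with XML, both from the Python standard library. The XML layout and the field names are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_from_csv(text: str) -> list[dict]:
    """Parse CSV text with a header row into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_xml(text: str) -> list[dict]:
    """Parse XML into the same shape.

    Assumes a hypothetical layout:
    <samples><sample><field>value</field>...</sample></samples>
    """
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in sample}
            for sample in root.iter("sample")]
```

Both readers return strings for every value; converting to numbers and dates is a separate step and belongs with the data compatibility challenge.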

Challenge #7: Different data quality
Integration?

Challenge #8: Ownership of data
"Mine!"

Challenge #9: Genotype data

[Diagram: data integration in BIMS. Labels: Data 1, Data 2 and Data 4 (each source-specific), shared data model, de-identification of sensitive data, BIMS data repository, researcher]

The data flow through BIMS

Data integration long ago
Merge results

The data warehouse
ODBC, JDBC

Federated database
ODBC, JDBC and more

Federated database
A federated database system is a type of database management system that transparently integrates multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. A federated database (or virtual database) is the fully integrated, logical composite of all constituent databases in a federated database system.
Data sources can be structured (relational databases, Excel, etc.) and/or unstructured, such as medical records. Because different database management systems use different query languages, a federated database system can apply wrappers to the subqueries to translate them into the appropriate query language.
ODBC, JDBC and more
Grid computing is an emerging computing model that provides higher-throughput computing by using many networked computers to model a virtual computer architecture that can distribute process execution across a parallel infrastructure. Grids use the resources of many separate computers connected by a network (usually the Internet) to solve large-scale computation problems. They make it possible to perform computations on large data sets by breaking them down into many smaller ones, or to perform many more computations at once than would be possible on a single computer, by modelling a parallel division of labour between processes.
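The wrapper idea described on this slide can be sketched in a few lines of Python. Two in-memory SQLite databases stand in for autonomous sites with different schemas, and each wrapper holds the source-specific SQL for the same logical query. All table names, columns and data are invented for the illustration; a real federation would reach remote systems through drivers such as ODBC or JDBC.

```python
import sqlite3


class Wrapper:
    """Translates the federation's query into one source's own SQL."""

    def __init__(self, conn: sqlite3.Connection, sql: str):
        self.conn = conn
        self.sql = sql  # source-specific SQL returning (twin_id, birth_year)

    def twins_born_before(self, year: int) -> list[tuple]:
        return self.conn.execute(self.sql, (year,)).fetchall()


def federated_twins_born_before(wrappers: dict, year: int) -> list[tuple]:
    """Fan the query out to every source and merge the results."""
    merged = []
    for site, wrapper in wrappers.items():
        merged += [(site, twin_id, born)
                   for twin_id, born in wrapper.twins_born_before(year)]
    return sorted(merged)


# Hypothetical site A stores the birth year as an integer.
site_a = sqlite3.connect(":memory:")
site_a.execute("CREATE TABLE twins (id TEXT, birth_year INTEGER)")
site_a.executemany("INSERT INTO twins VALUES (?, ?)",
                   [("A1", 1948), ("A2", 1962)])

# Hypothetical site B stores the full birth date as an ISO string.
site_b = sqlite3.connect(":memory:")
site_b.execute("CREATE TABLE subjects (code TEXT, born TEXT)")
site_b.executemany("INSERT INTO subjects VALUES (?, ?)",
                   [("B1", "1951-06-01"), ("B2", "1970-01-15")])

WRAPPERS = {
    "site_a": Wrapper(site_a,
                      "SELECT id, birth_year FROM twins WHERE birth_year < ?"),
    "site_b": Wrapper(site_b,
                      "SELECT code, CAST(substr(born, 1, 4) AS INTEGER) "
                      "FROM subjects "
                      "WHERE CAST(substr(born, 1, 4) AS INTEGER) < ?"),
}
```

Calling `federated_twins_born_before(WRAPPERS, 1960)` queries both sites and returns one merged list. The data stay in the source databases; only the query results travel.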

Linking 600,000 twin pairs
Aim: to identify critical genetic and lifestyle factors for common diseases in Europe

Twin cohorts
- Australian twins
- Danish twins
- English twins
- Finnish twins
- Italian twins
- Dutch twins
- Norwegian twins
- Swedish twins
Intellectual core facilities
- Epidemiological expertise (Odense)
- Genotyping & DNA (Helsinki, Uppsala)
- Database expertise (Stockholm)
- Biostatistics expertise (Leiden)
- Ethical & legal expertise (Oslo)

Muilu J, Peltonen L, Litton JE. The federated database - a basis for biobank-based post-genome studies, integrating phenome and genome data from 600 000 twin pairs in Europe. Eur J Hum Genet 2007.

Hub-and-Spoke
- No need to connect everyone to everyone at the network level
- The database federation routes the traffic (and the queries)
- Hubs provide the database service: a single access point
- Hubs can be federated
- We can have many hubs, geographically distributed: genotype hub, phenotype hub, sample hub, metadata hub, etc.
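The first point on this slide can be made concrete with a little arithmetic. The sketch below compares the number of network links in a full mesh with a hub-and-spoke layout and adds a toy routing table for the hub types named on the slide. The assumptions that each site connects to exactly one hub and that hubs are federated pairwise, as well as the hub names, are illustrative only.

```python
def full_mesh_links(sites: int) -> int:
    """Links needed if every site connects directly to every other site."""
    return sites * (sites - 1) // 2


def hub_and_spoke_links(sites: int, hubs: int = 1) -> int:
    """Links needed if each site connects to one hub.

    Assumes the hubs themselves are federated pairwise.
    """
    return sites + full_mesh_links(hubs)


# Hypothetical routing table: each query goes to the hub for its data type.
HUBS = {
    "genotype": "genotype-hub",
    "phenotype": "phenotype-hub",
    "sample": "sample-hub",
    "metadata": "metadata-hub",
}


def route(query_type: str) -> str:
    """Return the hub that should answer a query of the given type."""
    return HUBS[query_type]
```

For eight sites, a full mesh needs 28 links while a single hub needs 8; the gap widens quickly as more sites join.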

Hub-and-Spoke for biobanks in Europe

www.biobanks.eu
Proposal for a European research infrastructure: European Bio-Banking and Biomolecular Resources
- 50 organisations
- 23 give support
- 8 have written a letter of support

www.biobanks.eu
Unique conditions for taking the lead in Europe

Strong support from VR (Vetenskapsrådet) and the government

jan-eric.litton@ki.se