Text Mining for Biomedicine

Relevanta dokument
PubMed lathund Örebro universitetsbibliotek Medicinska biblioteket.

Sri Lanka Association for Artificial Intelligence

PubMed lathund Örebro universitetsbibliotek Medicinska biblioteket.

Room E3607 Protein bioinformatics Protein Bioinformatics. Computer lab Tuesday, May 17, 2005 Sean Prigge Jonathan Pevsner Ingo Ruczinski

Introduction to the Semantic Web. Eva Blomqvist

1. Compute the following matrix: (2 p) 2. Compute the determinant of the following matrix: (2 p)

Preschool Kindergarten

FORSKNINGSKOMMUNIKATION OCH PUBLICERINGS- MÖNSTER INOM UTBILDNINGSVETENSKAP

Goals for third cycle studies according to the Higher Education Ordinance of Sweden (Sw. "Högskoleförordningen")

Exam Molecular Bioinformatics X3 (1MB330) - 1 March, Page 1 of 6. Skriv svar på varje uppgift på separata blad. Lycka till!!

SVENSK STANDARD SS-EN ISO 19108:2005/AC:2015

Mapping sequence reads & Calling variants

Kursplan. JP1040 Japanska III: Språkfärdighet. 15 högskolepoäng, Grundnivå 1. Japanese III: Language Proficiency

The present situation on the application of ICT in precision agriculture in Sweden

RUP är en omfattande process, ett processramverk. RUP bör införas stegvis. RUP måste anpassas. till organisationen till projektet

Kursplan. FR1050 Franska: Skriftlig språkfärdighet I. 7,5 högskolepoäng, Grundnivå 1. French Written Proficiency I

Maskinöversättning 2008

Läkemedelsverkets Farmakovigilansdag

District Application for Partnership

Molecular Biology Primer

Isolda Purchase - EDI

Alias 1.0 Rollbaserad inloggning

PORTSECURITY IN SÖLVESBORG

Mis/trusting Open Access JUTTA

Isometries of the plane

Det här med levels.?

Introduktion till vetenskaplig metodik. Johan Åberg

8 < x 1 + x 2 x 3 = 1, x 1 +2x 2 + x 4 = 0, x 1 +2x 3 + x 4 = 2. x 1 2x 12 1A är inverterbar, och bestäm i så fall dess invers.

Skill-mix innovation in the Netherlands. dr. Marieke Kroezen Erasmus University Medical Centre, the Netherlands

Module 1: Functions, Limits, Continuity

en uppsatstävling om innovation Sammanfattning av de vinnande bidragen

Adding active and blended learning to an introductory mechanics course

Authentication Context QC Statement. Stefan Santesson, 3xA Security AB

Oförstörande provning (NDT) i Del M Subpart F/Del 145-organisationer

Schenker Privpak AB Telefon VAT Nr. SE Schenker ABs ansvarsbestämmelser, identiska med Box 905 Faxnr Säte: Borås

Kursplan. MT1051 3D CAD Grundläggande. 7,5 högskolepoäng, Grundnivå 1. 3D-CAD Basic Course

Uttagning för D21E och H21E

Kursplan. FÖ1038 Ledarskap och organisationsbeteende. 7,5 högskolepoäng, Grundnivå 1. Leadership and Organisational Behaviour

A metadata registry for Japanese construction field

Collaborative Product Development:

Kursplan. EN1088 Engelsk språkdidaktik. 7,5 högskolepoäng, Grundnivå 1. English Language Learning and Teaching

Support for Artist Residencies

översikt 1. informationsförädling är, typ: 2. Squirrelprototypen 3. möjligheter för framtiden [5] ICALL/2

CM FORUM. Introduktion till. Configuration Management (CM) / Konfigurationsledning. Tobias Ljungkvist

FÖRBERED UNDERLAG FÖR BEDÖMNING SÅ HÄR

1. Unpack content of zip-file to temporary folder and double click Setup

Michael Q. Jones & Matt B. Pedersen University of Nevada Las Vegas

ENTERPRISE WITHOUT BORDERS Stockholmsmässan, 17 maj 2016

Hållbar utveckling i kurser lå 16-17

Introduktion till vetenskaplig metodik. Johan Åberg

Second handbook of research on mathematics teaching and learning (NCTM)

Design Service Goal. Hantering av demonterbara delar som ingår i Fatigue Critical Baseline Structure List. Presentatör

Scaled Agile Framework

SVENSK STANDARD SS-EN 13612/AC:2016

Kursplan. NA1032 Makroekonomi, introduktion. 7,5 högskolepoäng, Grundnivå 1. Introductory Macroeconomics

Theory 1. Summer Term 2010

Kursplan. NA3009 Ekonomi och ledarskap. 7,5 högskolepoäng, Avancerad nivå 1. Economics of Leadership

SweLL & legal aspects. Elena Volodina

Kursplan. FÖ3032 Redovisning och styrning av internationellt verksamma företag. 15 högskolepoäng, Avancerad nivå 1

OPPOSITION FOR MASTER S PROJECT

Resultat av den utökade första planeringsövningen inför RRC september 2005

ISO STATUS. Prof. dr Vidosav D. MAJSTOROVIĆ 1/14. Mašinski fakultet u Beogradu - PM. Tuesday, December 09,

1. Varje bevissteg ska motiveras formellt (informella bevis ger 0 poang)

Kundfokus Kunden och kundens behov är centrala i alla våra projekt

medier som data några forskningsperspektiv

3rd September 2014 Sonali Raut, CA, CISA DGM-Internal Audit, Voltas Ltd.

EBBA2 European Breeding Bird Atlas

Health café. Self help groups. Learning café. Focus on support to people with chronic diseases and their families

Libers språklåda i engelska Grab n go lessons

Sociala medieströmmar metoder för analys och samarbete via nya medieformat. Pelle Snickars, Umeå universitet & Lars Degerstedt, Södertörns högskola

State Examinations Commission

Kursplan. IK1004 Java - Grafiska användargränssnitt med Swing. 7,5 högskolepoäng, Grundnivå 1. Java - GUI Programming with Swing - Undergraduate Level

Labokha AA et al. xlnup214 FG-like-1 xlnup214 FG-like-2 xlnup214 FG FGFG FGFG FGFG FGFG xtnup153 FG FGFG xtnup153 FG xlnup62 FG xlnup54 FG FGFG

Kursplan. PR1017 Portugisiska: Muntlig språkfärdighet II. 7,5 högskolepoäng, Grundnivå 1. Portuguese: Oral Proficiency II

Sara Skärhem Martin Jansson Dalarna Science Park

Writing with context. Att skriva med sammanhang

Ontologier. Cassandra Svensson

Pre-Test 1: M0030M - Linear Algebra.

Biblioteket.se. A library project, not a web project. Daniel Andersson. Biblioteket.se. New Communication Channels in Libraries Budapest Nov 19, 2007

Swedish adaptation of ISO TC 211 Quality principles. Erik Stenborg

Stad + Data = Makt. Kart/GIS-dag SamGIS Skåne 6 december 2017

Att analysera företagsdynamik med registerdata (FAD) Martin Andersson

Syns du, finns du? Examensarbete 15 hp kandidatnivå Medie- och kommunikationsvetenskap

Information technology Open Document Format for Office Applications (OpenDocument) v1.0 (ISO/IEC 26300:2006, IDT) SWEDISH STANDARDS INSTITUTE

Vässa kraven och förbättra samarbetet med hjälp av Behaviour Driven Development Anna Fallqvist Eriksson


KOL med primärvårdsperspektiv ERS Björn Ställberg Gagnef vårdcentral

Bridging the gap - state-of-the-art testing research, Explanea, and why you should care

Kvalitetsarbete I Landstinget i Kalmar län. 24 oktober 2007 Eva Arvidsson

Accomodations at Anfasteröd Gårdsvik, Ljungskile

Bates, M.J. (1989).The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5), pp

Transkript:

Why? Text Mining for Biomedicine Individual gene study large scale analysis He Tan Genomics Biology: increasing number of genomes, sequences, proteins Laboratory for Intelligent Information Systems Institutionen för datavetenskap Difficulty interpreting large scale experimental results, e.g. Y2H, microarrays, etc. 1 2 Why? Text Mining for Biomedicine 3 4 Disciplines Status Information retrieval finding the papers Entity recognition identifying the entities Information extraction formalizing the facts Text mining finding nuggets in the literature Integration combining text and biological data Literature mining for the biologist: from information retrieval to biological discovery Jensen et al., Nature Reviews Genetics, 2006 5 6 1

Challenges KDD Cup 2002 TREC Genomics Tracks 2003-2007 BioNLP/JNLPBA 2004 LLL 2005 BioCreative I (2004) & II (2006) Biomedical Language Heavy use of domain specific terminology e.g. chemoattractant, fibroblasts, angiogenesis No standard nomenclatures supported by scientific community HUGO Gene Nomenclature? Rapid growth of new names and new senses This disorder maps to chromosome 7q11-21, and this locus was named CLAM. [PMID:12771259 ] Most words with low frequency (data sparseness) 7 8 Biomedical Language short forms and abbreviations are often used e.g. vascular endothelial growth factor (VEGF) genes have often synonyms e.g. Thermoactinomyces candidus vs. Thermoactinomyces vulgaris Orthographic variants e.g. TNF α, TNF-alpha and TNF alpha (without hyphen) Biomedical Language A name may refer to a particular gene may include homologues of this gene in other organisms may denote an RNA, DNA, or the protein the gene encodes may be restricted to a specific splice variant Neurofibromatosis 2 [disease] NF2 Neurofibromin 2 [protein] Neurofibromatosis 2 gene [gene] 9 10 Text Mining and Natural language processing (NLP) NLP techniques to analyze, understand and generate natural language used by human, e.g. free text Text mining generally relies on NLP in e.g. IR, NER, IE The framework of NLP Tokenization text tokens Lemmatization word its lemma POS tagging e.g. NP-VP-NP B-I-O tagging define the begin and the end of the name entity Semantics interpretation Pragmatics analysis word syntax 11 12 2

Example Lexical Resources GENIA Tagger Collections of lexical items Additional information Part of speech Spelling variants Useful for entity recognition interaction Resources UMLS SPECIALIST Lexion, BioLexicon WordNet 13 14 UMLS SPECIALIST Lexicon SPECIALIST Text Tools Content English lexicon Many biomedical terms 360,000 lexical items Part of speech Variant information {base=hemoglobin base form spelling_variant=haemoglobin entry=e0031208 identifier cat=noun POS variants=uncount no plural variants=reg} plural (hemoglobins) 15 16 Annotated corpora Annotated Corpora in Biomedicine Analysis Determining features of domain language Development and evaluation gold standard Training data-driven methods Tasks POS, BIO, semantics (entities, relations/events) Useful for ER and IE Corpus LLL (train) HPRD50 PDG/PICorpus BioCreative I PPI BioInfer AIMed GENIA Genetag #sentence 77 145 285 486 1100 1955 18546/9372(term/event) 20000 # entities 240 400 1000 1500 4300 6300 24000 93000 Resources for Semantic Mining: Corpora and Ontologies, Tutorial, SMBM 2008 17 18 3

GENIA corpus GENIA corpus POS annotation Treebank Coreference annotation Term annotation Event annotation Cellular localization Disease-Gene association Pathway corpus 19 20 Acronyms Text Mining and Ontologies Many resources available AcroMine http://www.nactem.ac.uk/software/acromine ARGH: Biomedical Acronym http://lethargy.swmed.edu/argh/argh.asp Stanford Biomedical Abbreviation Server http://bionlp.stanford.edu/abbreviation/ AcroMed http://medstract.med.tufts.edu/acro1.1/index.htm SaRAD http://www.hpl.hp.com/research/idl/projects/abbrev.html GO UMLS ontologies OBO Knowledge extraction Semantics interpretation text 21 22 Text Mining and Ontologies The Gene Ontology Controlled vocabulary for the annotation of gene products Databases Semantic Interpretation of data Mathematical Models ontologies Semantic interpretation of models in Systems Biology Knowledge extraction Semantics interpretation text 26292 terms, 98.4% with definitions 15632 biological_process 2233 cellular_component 8427 molecular_function November 10, 2008 at 2:00 Pacific time five types of relationships: is_a, part_of, regulates, positively_regulates and negatively_regulates. 23 24 4

GO Annotations NF 2 neurofibromin 2 (merlin( merlin) ) [ Homo sapiens ] GoPubMed 25 26 Medical Subject Headings (MeSH) UMLS Metathesaurus Controlled vocabulary for indexing articles from the MEDLINE/PubMED database. 24,767 headings in 2008 MeSH arranged in hierarchical structure + 97,000 entry terms, e.g. "Vitamin C" is an entry term to "Ascorbic Acid." Content over 100 source vocabularies 6M names, 1.5M concepts 8M relations common presentation Other subdomains NCBI Clinical taxonomy Repositories SNOMED CT FMA UMLS Genetic Knowledge base GO OMIM MeSH Biomedical literature Model organisms Anatomy Genome annotations 27 28 UMLS Metathesaurus Neurofibromatosis type 2(NF2) is often not recognisedas a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutationin the genenamed merlin, located on chromosome 22. UMLS Semantic NetWork Content 135 high-level categories 7000 relations among them UMLS: C0027832 MeSH: D016518 NOMEDCT: 92503002S OMIM: 101000 Clinical Repositories Genetic Knowledge base Other subdomains NCBI taxonomy SNOMED CT FMA UMLS GO OMIM MeSH Biomedical literature Model organisms Anatomy Genome annotations 29 30 5

UMLSKS UMLS tools MetaMap entity recognition system exploits both the SPECIALIST lexicon and Metathesaurus SemRep exploits UMLS Semantic Network semantic interpretation information extraction 31 32 Are they sufficient for Bio-TM? Ontologies GO, SNOMED CT, UMLS, etc OBO ontologies GENIA, BioInfer ontologies, etc inconsistent and imprecise practice in the naming of biomedical concepts (terminology) Incomplete ontologies as a result of rapid knowledge expansion. Information Retrieval To find those documents from a collection of documents which satisfy a given information demand Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation 33 34 Information Retrieval Information Retrieval Ad hoc information retrieval The user enters a query/a set of keywords The system attempts to retrieve the relevant texts 35 36 6

Ad hoc IR systems Ad hoc IR systems Query is typically, Boolean model e.g. yeast AND cell cycle By automatically expanding queries with additional search terms, recall can be improved Stemming removes common endings (yeast / yeasts) Stop word removal, such as the and an Case folding (yeast / YEAST) Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae) 37 38 Ad hoc IR systems Document Similarity The similarity of two documents can be defined based on their word content Each document is represented in a vector of word. Word is weighted according to their frequency within the document and the document collection. The similarity is calculated based on those vectors 39 40 Document Clustering Document Clustering Unsupervised clustering algorithms All pairwise document similarities are calculated Clusters of similar documents In ad hoc IR systems Rather than matching the query against each document only, the N most similar documents are also considered 41 42 7

Document Classification Document Classification Statistical machine learning methods A pre-defined set of document classes Each class is defined by manually assigning a number of documents to it classified documents 43 44 Information Retrieval ihop (information Hyperlinked over Proteins ) The art is to find the relevant papers even if they do not actually match the query Biological database: Organism: [ Saccharomyces cerevisiae ] GO process: cell cycle Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28) 45 46 ihop ihop 47 48 8

Wikiprofessional WikiProfessional 49 50 Opinions from leading scientists Fusing literature and biological databases through text mining Interactivity and user interfaces Tool integration Text mining resources Text mining for biology - the way forward: opinions from leading scientists Russ B Altman, et al. Genome Biology 2008, 9(Suppl 2):S7 51 9