Hidden Markov Models and other Multiple-sequence Profile approaches

Relevanta dokument
Biochemistry 201 Advanced Molecular Biology (

Support Manual HoistLocatel Electronic Locks

Module 6: Integrals and applications

Exam Molecular Bioinformatics X3 (1MB330) - 1 March, Page 1 of 6. Skriv svar på varje uppgift på separata blad. Lycka till!!

Hur fattar samhället beslut när forskarna är oeniga?

Styrteknik: Binära tal, talsystem och koder D3:1

Writing with context. Att skriva med sammanhang

1. Compute the following matrix: (2 p) 2. Compute the determinant of the following matrix: (2 p)

F ξ (x) = f(y, x)dydx = 1. We say that a random variable ξ has a distribution F (x), if. F (x) =

Michael Q. Jones & Matt B. Pedersen University of Nevada Las Vegas

Isolda Purchase - EDI

Room E3607 Protein bioinformatics Protein Bioinformatics. Computer lab Tuesday, May 17, 2005 Sean Prigge Jonathan Pevsner Ingo Ruczinski

Preschool Kindergarten

Adding active and blended learning to an introductory mechanics course

Grafisk teknik IMCDP IMCDP IMCDP. IMCDP(filter) Sasan Gooran (HT 2006) Assumptions:

Module 1: Functions, Limits, Continuity

Mapping sequence reads & Calling variants

CUSTOMER READERSHIP HARRODS MAGAZINE CUSTOMER OVERVIEW. 63% of Harrods Magazine readers are mostly interested in reading about beauty

1. Unpack content of zip-file to temporary folder and double click Setup

SVENSK STANDARD SS-EN ISO 19108:2005/AC:2015

EXPERT SURVEY OF THE NEWS MEDIA

Beijer Electronics AB 2000, MA00336A,

Schenker Privpak AB Telefon VAT Nr. SE Schenker ABs ansvarsbestämmelser, identiska med Box 905 Faxnr Säte: Borås

Stad + Data = Makt. Kart/GIS-dag SamGIS Skåne 6 december 2017

Authentication Context QC Statement. Stefan Santesson, 3xA Security AB

Webbregistrering pa kurs och termin

Sannolikhetsteori. Tentamenskrivning: TMS145 - Grundkurs i matematisk statistik och bioinformatik,

Isometries of the plane

Rastercell. Digital Rastrering. AM & FM Raster. Rastercell. AM & FM Raster. Sasan Gooran (VT 2007) Rastrering. Rastercell. Konventionellt, AM

Grafisk teknik IMCDP. Sasan Gooran (HT 2006) Assumptions:

This exam consists of four problems. The maximum sum of points is 20. The marks 3, 4 and 5 require a minimum

Information technology Open Document Format for Office Applications (OpenDocument) v1.0 (ISO/IEC 26300:2006, IDT) SWEDISH STANDARDS INSTITUTE

Swedish adaptation of ISO TC 211 Quality principles. Erik Stenborg

LUNDS TEKNISKA HÖGSKOLA Institutionen för Elektro- och Informationsteknik

Consumer attitudes regarding durability and labelling

Grafisk teknik. Sasan Gooran (HT 2006)

Hidden Markov Model. Definition: V, X,{T k },π. hidden Markov model. X is an output alphabet. V is a finite set of states

Mönster. Ulf Cederling Växjö University Slide 1

Theory 1. Summer Term 2010

EXTERNAL ASSESSMENT SAMPLE TASKS SWEDISH BREAKTHROUGH LSPSWEB/0Y09

Viktig information för transmittrar med option /A1 Gold-Plated Diaphragm

Statistical Quality Control Statistisk kvalitetsstyrning. 7,5 högskolepoäng. Ladok code: 41T05A, Name: Personal number:

Automation Region. Affärsdriven systemutveckling genom agila metoder. Stefan Paulsson Thomas Öberg

Bridging the gap - state-of-the-art testing research, Explanea, and why you should care

Webbreg öppen: 26/ /

Översättning av galleriet. Hjälp till den som vill...

2.1 Installation of driver using Internet Installation of driver from disk... 3

Om oss DET PERFEKTA KOMPLEMENTET THE PERFECT COMPLETION 04 EN BINZ ÄR PRECIS SÅ BRA SOM DU FÖRVÄNTAR DIG A BINZ IS JUST AS GOOD AS YOU THINK 05

12.6 Heat equation, Wave equation

8 < x 1 + x 2 x 3 = 1, x 1 +2x 2 + x 4 = 0, x 1 +2x 3 + x 4 = 2. x 1 2x 12 1A är inverterbar, och bestäm i så fall dess invers.

MÅLSTYRNING OCH LÄRANDE: En problematisering av målstyrda graderade betyg

Annonsformat desktop. Startsida / områdesstartsidor. Artikel/nyhets-sidor. 1. Toppbanner, format 1050x180 pxl. Format 1060x180 px + 250x240 pxl.

RUP är en omfattande process, ett processramverk. RUP bör införas stegvis. RUP måste anpassas. till organisationen till projektet

Kurskod: TAMS11 Provkod: TENB 28 August 2014, 08:00-12:00. English Version

Datasäkerhet och integritet

Installation av F13 Bråvalla

Every visitor coming to the this website can subscribe for the newsletter by entering respective address and desired city.

EBBA2 European Breeding Bird Atlas

Labokha AA et al. xlnup214 FG-like-1 xlnup214 FG-like-2 xlnup214 FG FGFG FGFG FGFG FGFG xtnup153 FG FGFG xtnup153 FG xlnup62 FG xlnup54 FG FGFG

Schenker Privpak AB Telefon VAT Nr. SE Schenker ABs ansvarsbestämmelser, identiska med Box 905 Faxnr Säte: Borås

Questionnaire for visa applicants Appendix A

Second handbook of research on mathematics teaching and learning (NCTM)


Discovering!!!!! Swedish ÅÄÖ. EPISODE 6 Norrlänningar and numbers Misi.se

Quick Start Guide Snabbguide

Här kan du sova. Sleep here with a good conscience

Health café. Self help groups. Learning café. Focus on support to people with chronic diseases and their families

The Arctic boundary layer

Klicka här för att ändra format

The Finite Element Method, FHL064

Kursplan. NA3009 Ekonomi och ledarskap. 7,5 högskolepoäng, Avancerad nivå 1. Economics of Leadership

Protokoll Föreningsutskottet

Motif-based Hidden Markov Models for Multiple Sequence Alignment

Documentation SN 3102

Kursplan. MT1051 3D CAD Grundläggande. 7,5 högskolepoäng, Grundnivå 1. 3D-CAD Basic Course

Kurskod: TAIU06 MATEMATISK STATISTIK Provkod: TENA 17 August 2015, 8:00-12:00. English Version

How to format the different elements of a page in the CMS :

INSTALLATION INSTRUCTIONS

1. Förpackningsmaskin / Packaging machine

Högskolan i Skövde (SK, JS) Svensk version Tentamen i matematik

Kvalitetsarbete I Landstinget i Kalmar län. 24 oktober 2007 Eva Arvidsson

Klyvklingor / Ripping Blades.

STORSEMINARIET 3. Amplitud. frekvens. frekvens uppgift 9.4 (cylindriskt rör)

Measuring child participation in immunization registries: two national surveys, 2001

SUPPLEMENTARY FIGURE LEGENDS

Mid-Semester Evals. CS 188: Artificial Intelligence Spring Outline. Contest. Conditional Independence. Recap: Reasoning Over Time

Eternal Employment Financial Feasibility Study

Här kan du checka in. Check in here with a good conscience


Technique and expression 3: weave. 3.5 hp. Ladokcode: AX1 TE1 The exam is given to: Exchange Textile Design and Textile design 2.

Ett hållbart boende A sustainable living. Mikael Hassel. Handledare/ Supervisor. Examiner. Katarina Lundeberg/Fredric Benesch

För att justera TX finns det ett tool med namnet MMDVMCal. t.ex. /home/pi/applications/mmdvmcal/mmdvmcal /dev/ttyacm0

Robust och energieffektiv styrning av tågtrafik

Normalfördelning. Modeller Vi har alla stött på modeller i olika sammanhang. Ex:

Typografi, text & designperspektiv

Att fastställa krav. Annakarin Nyberg

denna del en poäng. 1. (Dugga 1.1) och v = (a) Beräkna u (2u 2u v) om u = . (1p) och som är parallell

Transkript:

Hidden Markov Models and other Multiple-sequence Profile approaches Mount, Chapter 4, pp. 185-192 Durbin et al, Chapter 5 (also earlier chapters) taken from Sean Eddy Dept. of Genetics Washington U., St. Louis HMMs, PSSMs, Motifs, and consensus representing conserved positions HMMs (Hidden Markov Models), PSSMs (Position Specific Scoring Matrices), BLOCKS and Profiles seek to effectively represent conserved positions in homologous sequences. Functional conservation after divergence Some (but not all) motifs and consensus sequences represent conserved positions in non-homologous sequences (promoters, modification sites, regulatory sites). Functional acquisition by convergence. 1

The best scores are: s-w bits E(14548) % id alen PWHU6 H+-trans. ATP syn. - human mito. 400 326.7 3.3e-90 1.000 226 PWBO6 H+-trans. ATP syn. - cow mito. 157 271.3 1.6e-73 0.779 226 PWMS6 H+-trans. ATP syn. - mouse mito. 118 262.4 7.6e-71 0.757 226 PWXL6 H+-trans. ATP syn. - frog mito. 745 177.3 3.1e5 0.533 229 PWFF6 H+-trans. ATP syn. - D. melanog. 471 114.8 2.0e-26 0.378 222 PWBY3 H+-trans. ATP syn. - yeast mito. 438 107.3 4.4e-24 0.362 232 PWAS6N H+-trans. ATP syn. - E. nidulans 365 90.6 4.4e-19 0.304 230 PWKQ6 H+-trans. ATP syn. - H. maydis 353 87.9 3.0e-18 0.313 214 PWWT6 H+-trans. ATP syn. - wheat mito. 309 77.8 4.9e-15 0.292 233 PWNT6M H+-trans. ATP syn. - tobacco 309 77.8 5.0e-15 0.283 233 PWZM6M H+-trans. ATP syn. - corn mito. 283 71.9 2.2e-13 0.311 180 LWEC6 H+-trans. ATP syn. - E. coli 178 48.0 3.3e-06 0.237 236 LWRZ6 H+-trans. ATP syn. - rice chloro. 144 40.2 0.00063 0.242 231 PWPMA6 H+-trans. ATP syn. - pea chloro. 143 40.0 0.00074 0.250 232 PWYBAA H+-trans. ATP syn. - Cyano. syn. 142 39.7 0.00099 0.265 170 PWSPA6 H+-trans. ATP syn. - spinach 138 38.9 0.0016 0.238 231 PWYCA6 H+-trans. ATP syn. - Synecho. 127 36.3 0.0099 0.263 167 LWNT6 H+-trans. ATP syn. - tobacco 126 36.1 0.011 0.221 231 LWLV6 H+-trans. ATP syn. - liverwort 126 36.1 0.011 0.244 168 PWEGAC H+-trans. ATP syn. - euglena 123 35.4 0.018 0.257 214 JQ0026 ATP/ADP translocase tlc1 - Ricket 122 35.1 0.045 0.247 154 S17420 ubiquinol--cytochrome-c reductase 113 33.1 0.14 0.228 158 QXBO2M NADH dehydrogenase (ubiquinone) 107 31.7 0.32 0.261 211 S17415 ubiquinol--cytochrome-c reductase 105 31.3 0.49 0.277 137 S17417 ubiquinol--cytochrome-c reductase 104 31.0 0.57 0.277 137 DNHUN2 NADH dehydrogenase (ubiquinone) 103 30.8 0.61 0.201 149 CBHU ubiquinol--cytochrome-c reductase 102 30.6 0.79 0.268 205 QRECAA aromatic amino acid trans. prot. 103 30.8 0.82 0.234 111 S17419 ubiquinol--cytochrome-c reductase 101 30.3 0.92 0.234 158 Human mito Bovine mito Mouse mito Frog mito Dros. mito 10-92 /10-6 10-75 /10-6 10-75 /10-6 10 2 /10-6 10-26 /0.003 Yeast mito. 10-24 /10-8 Cochliobolus mito. 10-18 /10-5 Aspergillus mito. 10-19 /10-6 Rice chloro. Tobacco chloro. Spinach chloro. Pea chloro. March. chloro. Cyanobacteria Synechocystis Euglena chloro. E. coli Corn mito. 10-13 /10-9 Wheat mito. 10-15 /10-8 0.0004/10-12 0.007/10-13 0.001/10-13 0.0004/10-12 0.007/10-11 0.006/10-13 0.0006/10-12 0.01/10-12 10-6 /10-90 2

sxl_drome (1sxl) PSI-Blast E() iteration 1: <7 iteration 2: 10-8 ru1a_human (u1a) SXL_DROME/127-198 LIVNYL----PQDMTDRELYALFRA-IGPINTCRIMRD----YKTGYSFGYAFVDFTSEMDSQRAIKVLNG-ITVRNKRLKV U2AF_HUMAN/261-332 LFIGGL----PNYLNDDQVKELLTS-FGPLKAFNLVKD----SATGLSKGYAFCEYVDINVTDQAIAGLNG-MQLGDKKLLV U2AF_SCHPO/312-383 IYISNL----PLNLGEDQVVELLKP-FGDLLSFQLIKN----IADGSSKGFCFCEFKNPSDAEVAISGLDG-KDTYGNKLHA SXLF_DROME/213-285 LYVTNL----PRTITDDQLDTIFGK-YGSIVQKNILRD----KLTGRPRGVAFVRYNKREEAQEAISALNNVIPEGGSQPLS ROC_HUMAN/18-82 VFIGNL---NTLVVKKSDVEAIFSK-YGKIVGCSVHK------------GFAFVQYVNERNARAAVAGEDG-RMIAGQVLDI RU1A_HUMAN/12-84 IYINNLNEKIKKDELKKSLYAIFSQ-FGQILDILVSR-------SLKMRGQAFVIFKEVSSATNALRSMQG-FPFYDKPMRI RU2B_HUMAN/9-81 IYINNMNDKIKKEELKRSLYALFSQ-FGHVVDIVALK-------TMKMRGQAFVIFKELGSSTNALRQLQG-FPFYGKPMRI SQD_DROME/138-208 IFVGGL----TTEISDEEIKTYFGQ-FGNIVEVEMPLD----KQKSQRKGFCFITFDSEQVVTDLLK-TPK-QKIAGKEVDV SQD_DROME/58-128 LFVGGL----SWETTEKELRDHFGK-YGEIESINVKTD----PQTGRSRGFAFIVFTNTEAIDKVSA-ADE-HIINSKKVDP SFR2_CHICK/16-87 LKVDNL----TYRTSPDTLRRVFEK-YGRVGDVYIPRD----RYTKESRGFAFVRFHDKRDAEDAMDAMDG-AVLDGRELRV SFR1_HUMAN/17-85 IYVGNL----PPDIRTKDIEDVFYK-YGAIRDIDLKNR-------RGGPPFAFVEFEDPRDAEDAVYGRDG-YDYDGYRLRV SR55_DROME/5-68 VYVGGL----PYGVRERDLERFFKG-YGRTRDILIKN------------GYGFVEFEDYRDADDAVYELNG-KELLGERVVV SFR3_HUMAN/12-78 VYVGNL----GNNGNKTELERAFGY-YGPLRSVWVARN---------PPGFAFVEFEDPRDAADAVRELDG-RTLCGCRVRV RNP1_YEAST/37-109 LYVGNL----PKNCRKQDLRDLFEPNYGKITINMLKKK----PLKKPLKRFAFIEFQEGVNLKKVKEKMNG-KIFMNEKIVI PES4_YEAST/305-374 IFIKNL----PTITTRDDILNFFSE-VGPIKSIYLSN------ATKVKYLWAFVTYKNSSDSEKAIKRYNN-FYFRGKKLLV YHH5_YEAST/315-384 ILVKNL----PSDTTQEEVLDYFST-IGPIKSVFISEK------QANTPHKAFVTYKNEEESKKAQKCLNK-TIFKNHTIWV YHC4_YEAST/34815 IFVGQL----DKETTREELNRRFST-HGKIQDINLIFK--------PTNIFAFIKYETEEAAAAALESENH-AIFLNKTMHV SFR1_HUMAN/122-186 VVVSGL----PPSGSWQDLKDHMRE-AGDVCYADVYRD-----------GTGVVEFVRKEDMTYAVRKLDN-TKFRSHEGET RU1A_HUMAN/210-276 LFLTNL----PEETNELMLSMLFNQ-FPGFKEVRLVPG---------RHDIAFVEFDNEVQAGAARDALQG-FKITQNNAMK RU2B_HUMAN/153-220 LFLNNL----PEETNEMMLSMLFNQ-FPGFKEVRLVPG---------RHDIAFVEFENDGQAGAARDALQGFKITPSHAMKI RU1A_YEAST/229-293 LLIQNL----PSGTTEQLLSQILGN-EALV-EIRLVSV----------RNLAFVEYETVADATKIKNQLGS-TYKLQNNDVT PR24_YEAST/43-111 VLVKNL----PKSYNQNKVYKYFKH-CGPIIHVDVAD------SLKKNFRFARIEFARYDGALAAIT-KTH-KVVGQNEIIV PR24_YEAST/212-284 IMIRNL---STELLDENLLRESFEG-FGSIEKINIPAG---QKEHSFNNCCAFMVFENKDSAERALQ-MNR-SLLGNREISV SSB1_YEAST/39-114 IFIGNV----AHECTEDDLKQLFVEEFGDEVSVEIPIK-EHTDGHIPASKHALVKFPTKIDFDNIKENYDT-KVVKDREIHI RN12_YEAST/200-267 IVIKFQ----GPALTEEEIYSLFRR-YGTI--IDIFP------PTAANNNVAKVRYRSFRGAISAKNCVSG-IEIHNTVLHI U2AG_HUMAN/67-142 CAVSDVEMQEHYDEFFEEVFTEMEEKYGEVEEMNVCDN-----LGDHLVGNVYVKFRREEDAEKAVIDLNN-RWFNGQPIHA LA_DROME/151-225 AYAKGF----PLDSQISELLDFTAN-YDKVVNLTMRNSYDKPTKSYKFKGSIFLTFETKDQAKAFLE-QEK-IVYKERELLR LA_HUMAN/113-182 VYIKGF----PTDATLDDIKEWLED-KGQVLNIQMRR-----TLHKAFKGSIFVVFDSIESAKKFVE-TPG-QKYKETDLLI 3

Alignments show conservation Profiles Profiles invented in the late eighties Barton and Sternberg Gribskov Taylor Uses a multiple alignment and rules to convert frequencies to scores 4

Profiles the problems Could not compare between two profiles user had to be involved in interpretation Gap penalties again the user had to be involved HMMs Hidden Markov Models have been successfully used for speech recognition passive sonar work other signal detection problems 5

profile-hmms Anders Krogh in David Haussler s group. Takes the standard profiles and uses HMM based standard mathematics to solve two problems Profile-HMM scores are comparable (*) Setting gap costs Theoretical framework for what we are doing. (* this is not really true. see later) Pairwise Alignment RU1A_HUMAN rrm2 PABP_DROME rrm3 VQAGAAR EAAEAAV +2 +2 +2 score matrices: 20x20, 210 parameters positionindependent Cys 12 Ser 0 2 Thr -2 1 3 Pro -1 1 0 6 Ala -2 1 1 1 2 Gly -3 1 0-1 1 5 Asn 1 0-1 0 0 2 Asp -5 0 0-1 0 1 2 4 Glu -5 0 0-1 0 0 1 3 4 Gln -5-1 -1 0 0-1 1 2 2 4 C S T P A G N D E Q 6

Profile Alignment RU1A_HUMAN rrm1 RU1A_HUMAN rrm2 SFR1_HUMAN rrm1 SXLF_DROME rrm1 PABP_DROME rrm3 SSATNAL VQAGAAR RDAEDAV MDSQRAI EAAEAAV +3 +4 0 query target profile: 20 scores per column position-dependent Where pairwise scores come from score(aa)=log P(A A) f(a) probability of A given an A the observed probability of seeing an A aligned to an A in real alignments frequency of A the expected frequency of A in any sequence Sc(AA) = log 0.64 0.04 = +4 2 Sc(AE) = log 0.01 0.04 = -2 2 7

Where profile scores (should) come from score(a x)=log P(A position x) f(a) probability of A at position x the observed probability of seeing an A in the consensus column x Sc(A 6) = log 1.00 0.04 = +4.6 Sc(A 5) = log 0.04 2 0.04 = 0 2 Sc(N 6) = log 0.00 0.06 = -inf Sc(N 5) = log 0.06 2 0.06 = 0 2 1. what about position-specific gap penalties? 2. how to estimate parameters from small numbers of observations? Hidden Markov Models (HMMs) t 1,1 t 2,2 t 1,2 t 2,end 1 2 end HMM p 1 (a) p 1 (b) p 2 (a) p 2 (b) 1 1 2 end hidden state sequence, π a b a observed symbol sequence, x t 1,1 t 1,2 t 2,end p 1 (a) p 1 (b) p 2 (a) = P(x,π HMM) 8

Figure 1 A simple hidden Markov model. A two-state HMM describing DNA sequence with a heterogeneous base composition is shown, following work by Churchill [10]. (a) State 1 (top left) generates AT-rich sequence, and state 2 (top right) generates CG-rich sequence. State transitions and their associated probabilities are indicated by arrows, and symbol emission probabilities for A,C,G and T for each state are indicated below the states. (For clarity, the begin and end states and associated state transitions necessary to model sequences of finite length have been omitted.) (b) This model generates a state sequence as a Markov chain and each state generates a symbol according to its own emission probability distribution (c). The probability of the sequence is the product of the state transitions and the symbol emissions. For a given observed DNA sequence, we are interested in inferring the hidden state sequence that 'generated' it, that is, whether this position is in a CG-rich segment or an AT-rich segment. Profile (protein family) HMMs C X FY i0 i1 i2 i3 1 2 3 C C C C C A G D V K F W Y F Y B X X X X M1 M2 M3 d1 d2 d3 E 9

HMM transitions and emissions are probabilities a - c g a - t a a - c c a t t t a - c - 0.0 1.0 0.75 0.75 0.25 1.0 1.0 0.25 1.0 a 1.0 c 0.0 g 0.0 t 0.0 a 0.0 c 0.4 g 0.0 t 0.6 a.25 c.25 g.25 t.25 Figure 2 An HMM-based profile. An example of an HMM-based profile of three positions is shown, following the model introduced by Krogh et al. [7 ]. Each important column of a multiple sequence alignment (top) is modeled by a triplet of states: match (M), insert (I), and delete (D). For each modeled column of the alignment, there are 49 parameters: nine state transition probabilities (arrows), 20 match state symbol emission probabilities (corresponding to the bracketed amino acid residues [single letter code]), and 20 insert state symbol emission probabilities (usually not learned from the alignment, but instead kept fixed at some background distribution). For the sake of clarity in this example, all the insert symbol emissions are shown set equally to 0.05, underlining the point that the inserts generate essentially 'random' residue sequence. The state transition probabilities will generally tend to favor a 'main line' through the match states (bold arrows) over the rarer paths containing insertions and deletions (dashed arrows). 10

HMMER- Plan 7 profile HMM S N B I1 I2 I3 M1 M3 M3 M4 D1 D2 D3 D4 E C T J Many approaches are HMM-based (or HMM-like) 1 2 3 4 5 6 BLOCKS PSI-Blast 1 2 3 4 5 6 Meta-MEME 1 2 3 4 Profile HMM Pfam 1 2 3 4 HMMER2 - Plan 7 11

HMM Algorithms 1. The scoring problem: P(seq model) "Forward" algorithm (sums over all alignments) 2. The alignment problem: max P(seq, statepath model) "Viterbi" algorithm 3. The training problem: "Forward-backward" algorithm and Baum-Welch expectation maximization For profile HMMs, all three algorithms use O(MN) dynamic programming -- same as "standard" Smith/Waterman and Needleman/Wunsch. Global Global and Local Alignment Paths A B D D E F G H I A \ \ \ \ \ \ \ \ \ 1 _-1-1 -1-1 -1-1 -1-1 B \! \ \ \ \ \ \ \ -1 2 _ 0 _-2-2 -2-2 -2-2 D \! \ \ \ \ \ \ -1 0 3 _ 1 _-1 _-3-3 -3-3 E \ \!! \ \ \ \ -1-2 1 2 2 _ 0 _-2 _ G \ \! \! \ \ \ -1-2 -1 0 1 1 1 _-1 _-3 K \ \ \! \! \! \ \ \ \ -1-2 -3-2 -1 0 0 0 _-2 H \ \ \ \! \! \! \ \ \ -1-2 -3-3 -2-1 1 _-1 I \ \ \ \ \! \! \!! \ -1-2 -3-5 -3-1 2 Local A B D D E F G H I A \ 1 0 0 0 0 0 0 0 0 B \ 0 2 _ 0 0 0 0 0 0 0 D! \ \ 0 0 3 _ 1 0 0 0 0 0 E! \ \ 0 0 1 2 2 _ 0 0 0 0 G \! \ \ \ 0 0 0 0 1 1 1 0 0 K \ \ \ 0 0 0 0 0 0 0 0 0 H \ 0 0 0 0 0 0 0 1 0 I \ 0 0 0 0 0 0 0 0 2 Optimum global alignment ( score: 2) A B D D E F G H I (top) A B D - E G K H I (side) or A B - D E G K H I Optimal local alignment (score 3): A B D (top) A B D (side) 12

+1 : match 1 : mis-match 2 : gap A C G T A C G G T

Algorithms for Global and Local Similarity Scores Global: Local: HMM Alignment a t a 24 a s a 0 0 0 24-10 -10-10 -6-2 -6-6 Needleman-Wunsch max log likelihood HMM Viterbi alignment 2-2 26-10 -10-10 -10 4-10 -10 4-6 -10 30 a a 26 23 4 M I 20 D 49-10 30+10+19 F M j (i) = log e M j (x i ) + log[a M q j 1 M j exp(f M j 1 (i 1)) xi I +a I j 1 M j exp(f j 1 (i 1)) + a D j 1 M j exp(f D j 1 (i 1))] HMM Forward (score) Σ probabilities 13

HMMER usage 4 hmmalign Alignment 1 hmmbuild hmmcalibrate Matches HMM 3 extractp.pl/ CHAPS OUTPUT 2 hmmsearch 1. Building HMMER models Can use HMMER as a black box! Usage: hmmbuild <hmm> <align> Usage: hmmcalibrate <hmm> 14

1. Global and Local models HMMER provides two main styles of model ls - Match whole model within sequence Most sensitive mode. Good for whole domains Default mode fs - Match part of model to part of sequence Good when you don t know domain structure ls mode HMM fs mode HMM Sequence 2. HMMER searches Search protein database Usage: hmmsearch <hmm> <database> Option -A Use to limit size of output Option -E Raise to get more distant matches Will take > 20 minutes Specialized hardware exists (TimeLogic, Paracel, Compugen) 15

Domains and Repeats A - Domain B - Metal stabilised domain C - 7 repeats form domain D - 9 repeats form domain could be unlimited number 3. Choosing a threshold HMMER has two types of threshold: per-sequence per-domain Per-domain 5 4 3 5 3 2 per-sequence scores score 22 16

3. HMMER output Header information hmmsearch - search a sequence database with a profile HMM HMMER 2.1.2 (May 1999) Copyright (C) 1992-1999 Washington University School of Medicine HMMER is freely distributed under the GNU General Public License (GPL). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM file: /nfs/somewhere/cbs/hmm [SEED] Sequence database: /nfs/somewhere/pfamseq per-sequence score cutoff: [none] per-domain score cutoff: [none] per-sequence Eval cutoff: 1e+03 per-domain Eval cutoff: [none] - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Query HMM: SEED [HMM has been calibrated; E-values are empirical estimates] 3. HMMER output Per-sequence scores Scores for complete sequences (score includes all domains): Sequence Description Score E-value N -------- ----------- ----- ------- --- YC25_METJA Q58622 HYPOTHETICAL PROTEIN MJ1225. 210.0 2.3e-58 4 O27291 O27291 INOSINE-5'-MONOPHOSPHATE DEHYDROGEN 198.1 9e-55 4 O27292 O27292 INOSINE-5'-MONOPHOSPHATE DEHYDROGEN 192.6 3.9e-53 4 Q9YDY4 Q9YDY4 HYPOTHETICAL 70.0 KDA PROTEIN APE07 185.5 5.5e-51 7 O29410 O29410 CONSERVED HYPOTHETICAL PROTEIN. 184.7 9.9e-51 4 YE04_METJA Q58799 HYPOTHETICAL PROTEIN MJ1404. 180.1 2.4e9 4 AAKG_HUMAN P54619 5'-AMP-ACTIVATED PROTEIN KINASE, GA 179.9 2.6e9 4 YR33_THEPE P15889 HYPOTHETICAL 33.4 KDA PROTEIN IN RI 179.0 5.1e9 4 AAKG_RAT P80385 5'-AMP-ACTIVATED PROTEIN KINASE, GA 173.2 2.7e7 4 Q9V1T3 Q9V1T3 HYPOTHETICAL 32.1 KDA PROTEIN. 170.2 2.2e6 4 Q9YFL7 Q9YFL7 HYPOTHETICAL 31.7 KDA PROTEIN APE02 168.0 9.9e6 4 17

3. HMMER output Per-domain scores Parsed for domains: Sequence Domain seq-f seq-t hmm-f hmm-t score E-value -------- ------- ----- ----- ----- ----- ----- ------- O26740 2/2 103 156.. 1 54 [] 75.0 1e-17 O26740 1/2 6 59.. 1 54 [] 72.3 6.7e-17 Q9X0M4 2/2 79 132.. 1 54 [] 69.3 5.4e-16 YE26_METJA 2/2 115 168.] 1 54 [] 69.0 6.5e-16 Q9X175 2/2 83 136.. 1 54 [] 68.9 6.8e-16 Q9UY49 2/2 154 207.. 1 54 [] 68.9 6.9e-16 IMDH_PYRFU 2/2 154 207.. 1 54 [] 68.8 7.6e-16 P74477 2/2 518 571.. 1 54 [] 68.7 8.1e-16 IMDH_METKA 2/2 74 127.. 1 54 [] 67.7 1.6e-15 IMDH_PYRHO 2/2 154 207.. 1 54 [] 67.7 1.6e-15 YC32_METJA 2/2 234 287.. 1 54 [] 66.4 3.9e-15 O58317 3/4 134 187.. 1 54 [] 66.3 4.2e-15 O29411 2/2 74 127.. 1 54 [] 66.1 4.7e-15 IMDH_AQUAE 1/2 94 147.. 1 54 [] 65.7 6.5e-15 ACUB_BACSU 1/2 5 58.. 1 54 [] 65.4 8e-15 O29915 2/2 294 346.. 1 54 [] 65.3 8.4e-15 Q9UYR4 3/4 134 187.. 1 54 [] 65.1 9.8e-15 4. Making an alignment HMMER can be used to align all the matches Usage: hmmalign <hmm> <sequences> Much faster compared to other methods 18

Profile-HMM alignments It is hard to manually align a large number of sequences (100-10000) Solution (The Pfam way): Build a good quality manual alignment of representative members Build profile-hmm Align all members automatically Pitfalls of HMMER More complex to run than PSI-BLAST, you must iterate yourself Slow compared to PSI-BLAST 19

Pitfalls of all profile methods Iterative scoring schemes Once a false positive is included all its friends come too! Limit to modeling For large families some members will be missed Profile wander Matches from earlier rounds can be lost Multiple Sequence/Profile/HMMs Identify larger fraction of protein family with 1 sequence (model) Position-specific gap penalties better alignments Rigorous statistical/ mathematical model fast/automatic (in PSI- Blast) Many parameters to estimate (100+ sequences preferred) Poor multiple alignments Accurate statistical estimates? Misled by nonhomologs 20

Conclusions Profile-HMMs formalise profile technology There are four steps in HMMER use build model search database choose threshold build alignment Learning more HMMs HMMER http://hmmer.wustl.edu/ Includes links to documentation and other software packages. The HMMER User's Guide (and tutorial) http://hmmer.wustl.edu/hmmer-html/ "Profile hidden Markov models" S.R. Eddy, Bioinformatics 14:755-763, 1998. "Multiple-alignment and -sequence searches" S.R. Eddy, Trends Guide to Bioinformatics, pp.15-18, 1998. http://www.genetics.wustl.edu/eddy/publications/tigs-9808/ Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Durbin, Eddy, Krogh, and Mitchison; Cambridge Univ. Press, 1998. 21