Biochemistry 201 Advanced Molecular Biology


Biochemistry 201 Advanced Molecular Biology (http://cmgm.stanford.edu/biochem201/)
Bioinformatics: Discovering Function from Sequence
Doug Brutlag, Departments of Biochemistry
June 4, 1999

Discovering Function from Protein Sequence

BLOCK, Weight Matrix or Position Specific Scoring Matrix

Position   1   2   3   4   5   6   7   8   9  10  11  12
A          2   1   3  13  10  12  67   4  13   9   1   2
R          7   5   8   9   4   0   1  16   7   0   1   0
N          0   8   0   1   0   0   0   2   1   1  10   0
D          0   1   0   1  13   0   0  12   1   0   4   0
C          0   0   1   0   0   0   0   0   0   2   2   1
Q          1   1  21   8  10   0   0   7   6   0   0   2
E          2   0   0   9  21   0   0  15   7   3   3   0
G          9   7   1   4   0   0   8   0   0   0  46   0
H          4   3   1   1   2   0   0   2   2   0   5   0
I         10   0  11   1   2  10   0   4   9   3   0  16
L         16   1  17   0   1  31   0   3  11  24   0  14
K          3   4   5  10  11   1   1  13  10   0   5   2
M          7   1   1   0   0   0   0   0   5   7   1   8
F          4   0   3   0   0   4   0   0   0  10   0   0
P          0   6   0   1   0   0   0   0   0   0   0   0
S          1  17   0   8   3   1   3   0   2   2   2   0
T          5  22   3  11   1   5   0   2   2   2   0   5
W          2   0   0   0   0   0   0   0   0   1   0   1
Y          1   0   4   2   0   1   0   0   2   4   0   1
V          6   3   1   1   2  15   0   0   2  12   0  28

Consensus Sequences
Zinc Finger (C2H2 type): C.{2,4} C.{12} H.{3,5} H

Sequences of Common Structure or Function

Sequence Alignments
  10 20 30 40 50
1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGS
  : : : : : : : : : : :
2 HLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
  10 20 30 40 50

Profiles, PSI-BLAST

Hidden Markov Models
[Diagram: linear hidden Markov model with states AA1 to AA6, delete states D2 to D5 and insert states I1 to I5.]
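The zinc finger consensus on this slide is written in regular-expression notation, so it can be tried directly as a pattern. The following is a minimal sketch, assuming the spaces in the slide's pattern are only for display; the function name and the test sequence are hypothetical and the sequence is not a real protein.

```python
import re

# Consensus pattern from the slide, with the display spaces removed.
ZINC_FINGER_C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_zinc_fingers(protein):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ZINC_FINGER_C2H2.finditer(protein)]

# Hypothetical sequence built by hand to satisfy the pattern.
example = "MK" + "C" + "AA" + "C" + "G" * 12 + "H" + "TTT" + "H" + "LL"
hits = find_zinc_fingers(example)
```

A consensus pattern either matches or it does not; it carries no notion of a partial or weighted match, which is the gap that weight matrices and profiles fill.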

Sequence Alignment Problem
T C A T G
C A T T G

Sequence Alignment (exact)
X 220 230 240 250 X
  F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X

Needleman-Wunsch Algorithm (1)

Needleman-Wunsch Algorithm (2)

Needleman-Wunsch Algorithm (3)

Needleman-Wunsch Algorithm (4)
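As an illustration of the global alignment these slides step through, here is a minimal sketch of the usual dynamic-programming form of Needleman-Wunsch applied to the two sequences from the Sequence Alignment Problem slide (TCATG and CATTG). The scores (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values from the lecture.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("TCATG", "CATTG")
```

Under these assumed scores the best alignment has four matches and two gap positions. Several alignments can share the optimal score, so the traceback returns one of them.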

Sequence Alignment
X 220 230 240 250 X
  F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
  : :: : : : : : ::::: ::
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X

Sequence Alignment and Typical Scoring Function
X 220 230 240 250 X
  F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
  : :: : : : : : ::::: ::
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X

\[
\text{Score} = \sum_{\text{Region Start}}^{\text{Region End}} \text{Similarity-weights} \;-\; \sum_{\text{Region Start}}^{\text{Region End}} \text{Penalties}
\]
where:
\[
\text{Penalty} = \text{Gap-penalty} + \text{Size-of-gap} \times \text{Gap-size-penalty}
\]
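The scoring function on this slide can be evaluated for any finished alignment. The sketch below assumes an alignment given as two equal-length strings with "-" for gaps and a caller-supplied similarity weight; the function name, the identity weight and the penalty values in the example are illustrative assumptions.

```python
def alignment_score(top, bottom, weight, gap_penalty, gap_size_penalty):
    """Sum of similarity weights minus the sum of gap penalties, where each
    gap costs gap_penalty + size_of_gap * gap_size_penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    similarity, penalties = 0.0, 0.0
    run_row, run_len = None, 0   # row holding the current gap, and its size
    # A '#' sentinel column closes a gap that runs to the end of the alignment.
    for x, y in zip(top + "#", bottom + "#"):
        row = "top" if x == "-" else "bottom" if y == "-" else None
        if row != run_row:
            if run_row is not None:
                penalties += gap_penalty + run_len * gap_size_penalty
            run_row, run_len = row, 0
        if row is None:
            if x != "#":
                similarity += weight(x, y)
        else:
            run_len += 1
    return similarity - penalties

# Assumed identity weight: 1 for identical residues, 0 otherwise.
identity = lambda x, y: 1 if x == y else 0
example_score = alignment_score("TCAT-G", "-CATTG", identity, 2, 0.5)
```

In the example, four identities contribute 4 and two gaps of size one each cost 2 + 1 x 0.5, giving a score of -1.0. Because the size term is separate, one long gap is cheaper than several short gaps of the same total length.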

Sequence Similarity vs Evolutionary Distance
[Plot: x-axis, Mutations Introduced per 100 Residues (0 to 400); y-axis scaled 0 to 100.]

Dayhoff's Acceptable Point Mutations

Ala A
Arg R  30
Asn N 109  17
Asp D 154   0 532
Cys C  33  10   0   0
Gln Q  93 120  50  76   0
Glu E 266   0  94 831   0 422
Gly G 579  10 156 162  10  30 112
His H  21 103 226  43  10 243  23  10
Ile I  66  30  36  13  17   8  35   0   3
Leu L  95  17  37   0   0  75  15  17  40 253
Lys K  57 477 322  85   0 147 104  60  23  43  39
Met M  29  17   0   0   0  20   7   7   0  57 207  90
Phe F  20   7   7   0   0   0   0  17  20  90 167   0  17
Pro P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
Ser S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
Thr T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
Trp W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Tyr Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
Val V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

Dayhoff's PAM 250 Matrix (Log-Odds Form)

A Ala  .18
R Arg -.15  .61
N Asn  .02    0  .20
D Asp  .03 -.13  .21  .39
C Cys -.20 -.36 -.36 -.51 1.19
Q Gln -.04  .13  .08  .16 -.54  .40
E Glu  .03 -.11  .14  .34 -.53  .25  .38
G Gly  .13 -.26  .03  .06 -.34 -.53  .25  .38
H His -.14  .16  .16  .07 -.34  .29  .07 -.21  .65
I Ile -.05 -.20 -.18 -.24 -.23 -.20 -.20 -.26 -.24  .45
L Leu -.19 -.30 -.29 -.40 -.60 -.18 -.34 -.41 -.21  .24  .59
K Lys -.12  .34  .10  .01 -.54  .07 -.01 -.17    0 -.19 -.29  .47
M Met -.11 -.04 -.17 -.26 -.52 -.10 -.21 -.28 -.21  .22  .37  .04  .64
F Phe -.35 -.45 -.35 -.56 -.43 -.47 -.54 -.48 -.18  .10  .18 -.53  .02  .91
P Pro  .11 -.02 -.05 -.10 -.28  .02 -.06 -.05 -.02 -.20 -.25 -.11 -.21 -.46  .59
S Ser  .11 -.03  .07  .03    0 -.05    0  .11 -.08 -.14 -.28 -.02 -.16 -.32  .09  .16
T Thr  .12 -.09  .04 -.01 -.22 -.08 -.04    0 -.13  .01 -.17    0 -.06 -.31  .03  .13  .26
W Trp -.58  .22 -.42 -.68 -.78 -.48 -.70 -.70 -.28 -.51 -.18 -.35 -.42  .04 -.56 -.25 -.52 1.73
Y Tyr -.35 -.42 -.21 -.43  .03 -.40 -.43 -.52 -.01 -.09 -.09 -.44 -.24  .70 -.49 -.28 -.27 -.02 1.01
V Val  .02 -.25 -.17 -.21 -.19 -.19 -.18 -.14 -.22  .37  .19 -.24  .18 -.12 -.12 -.10  .03 -.62 -.25  .43
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val

Mutation Data Matrix (MDM-78)

Cys C  12
Ser S   0   2
Thr T  -2   1   3
Pro P  -3   1   0   6
Ala A  -2   1   1   1   2
Gly G  -3   1   0  -1   1   5
Asn N  -4   1   0  -1   0   0   2
Asp D  -5   0   0  -1   0   1   2   4
Glu E  -5   0   0  -1   0   0   1   3   4
Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
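The diagonal entries of the two preceding matrices appear to differ by a factor of ten with rounding (for example Cys/Cys is 1.19 in log-odds form and 12 in MDM-78). The sketch below shows a log-odds score computed from frequencies and then scaled this way. It assumes a joint target frequency q_ij and background frequencies p_i and p_j as inputs; these names and the numeric example are hypothetical and not taken from the lecture.

```python
import math

def log_odds(q_ij, p_i, p_j):
    """log10 of how much more often residues i and j are paired than
    expected by chance from their background frequencies."""
    return math.log10(q_ij / (p_i * p_j))

def to_integer_units(log_odds_value, scale=10):
    """Scale a log-odds value and round it to an integer score."""
    return round(scale * log_odds_value)

# Hypothetical frequencies: the pair occurs twice as often as chance predicts.
pair_twice_as_likely = log_odds(0.02, 0.1, 0.1)

# Diagonal values read from the log-odds slide: Cys, Trp, Tyr, Phe.
diagonal = [to_integer_units(x) for x in (1.19, 1.73, 1.01, 0.91)]
```

Positive scores mark substitutions seen more often than chance in related proteins and negative scores mark those seen less often, so summing scores along an alignment adds up log-likelihood ratios.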

Comparison of Scoring Matrices

Scoring matrix used (columns) for each pair of sequences compared (rows):

Unitary  Genetic code  Amino acid  PAM 250  Sequences compared
    3.1           3.2         2.6      2.9  Antibacterial substance A Streptomyces vs. Neocarzinostatin Streptomyces
    0.1           1.6         1.8      3.4  Ferredoxin Clostridium vs. Ferredoxin Spirulina
    5.8           6.6         9.9     10.7  -Hemoglobin Human vs. Myoglobin Human
    2.0           2.4         3.2      3.5  -Hemoglobin Human vs. Globin CTT-III Midge
    4.5           4.3         7.3      6.1  Cytochrome C Horse vs. Cytochrome C6 Spirulina
    0.2           0.4         0.4      3.9  Cytochrome C Horse vs. Cytochrome C553 Desulfovibrio
    3.6           3.3         4.7      4.8  β2-microglobulin Human vs. IG Human chain C4 region
    4.7           9.0         9.2     12.1  Ig chain C4 region Human vs. Ig chain C4 Human

Significance of Alignments vs PAMs

Detecting Evolutionary Relationships
[Figure: divergence timeline marked 300 million years, 200 million years, 100 million years and Today; branch labels PAM 100 (four branches), PAM 200 and PAM 150.]

Block Signatures for a Protein Family (http://www.blocks.fhcrc.org/)
(After Henikoff and Henikoff)

INKHIQ
VSRVVN
ASRALM
VSHVIN
VSAILN
IRRDLN
THVRVE
GSSELA
MTRGSN
VGRILK
LSHLFR
LAHLFR
ISRLLG
LHRLFK

HSGEQLAETLGMSRAAINKHIQ
VTLYDVAEYAGVSYQTVSRVVN
AMIKDVALKAKVSTATVSRALM
ATIKDVAKRAGVSTTTVSHVIN
ITIYDLAELSGVSASAVSAILN
LHLKDAAALLGVSEMTIRRDLN
TAYAELAKQFGVSPGTIHVRVE
GSLTEAAHLLGTSQPTVSRELA
MSQRELKNELGAGIATITRGSN
ITRQEIGQIVGCSRETVGRILK
FDIASVAQHVCLSPSRLSHLFR
LRIDEVARHVCLSPSRLAHLFR
MTRGDIGNYLGLTVETISRLLG
VTLEALADQVGMSPFHLHRLFK

10-45 25-55 40

SRAAINKHIVA
VSYQTVSRVVN
VSTATVSRALA
GVTTTVSHVIN
SGVSAVSAILN
GVSEMTRRDLN
TAYATIHVRVE
GSQPTVSRELA
MSIATITRGSN
ISRETVGRILK
FDISRLSHLFR
LRPSRLAHLFR
MTVETISRLLG
TLEFHLHRLFK
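A block such as the first group of six-residue segments on this slide can be turned into a table of per-column residue counts, which is the raw material for a weight matrix. The sketch below assumes plain counts with no sequence weighting or pseudocounts (real block-derived matrices generally use both); the function names are hypothetical.

```python
from collections import Counter

# The 14 six-residue segments listed first on the slide.
BLOCK = [
    "INKHIQ", "VSRVVN", "ASRALM", "VSHVIN", "VSAILN", "IRRDLN", "THVRVE",
    "GSSELA", "MTRGSN", "VGRILK", "LSHLFR", "LAHLFR", "ISRLLG", "LHRLFK",
]

def column_counts(block):
    """One Counter per alignment column: residue -> number of occurrences."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in a block must have the same width")
    return [Counter(segment[k] for segment in block) for k in range(width)]

def consensus(counts):
    """Most frequent residue in each column."""
    return "".join(column.most_common(1)[0][0] for column in counts)

def count_score(counts, window):
    """Sum of the column counts for the residues of an equally wide window."""
    return sum(column[residue] for column, residue in zip(counts, window))

counts = column_counts(BLOCK)
```

Sliding `count_score` along a protein, one window at a time, gives a crude weight-matrix scan: windows resembling the block score high, and residues never seen in a column contribute zero.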

Smith-Waterman Similarity Search
Query: HU-NS1
Maximal Score: 452
PAM Matrix: 200
Gap Penalty: 5
Gap Extension: 0.5

No.  Score  Match  Length  DB  ID          Description            Pred. No.
  1    452  100.0      90   2  DBHB_ECOLI  DNA-BINDING PROTEIN H  8.74e-86
  2    451   99.8      90   2  DBHB_SALTY  DNA-BINDING PROTEIN H  1.54e-85
  3    336   74.3      90   2  DBHA_ECOLI  DNA-BINDING PROTEIN H  1.64e-57
  4    336   74.3      90   2  DBHA_SALTY  DNA-BINDING PROTEIN H  1.64e-57
  5    328   72.6      90   2  DBH_BACST   DNA-BINDING PROTEIN I  1.35e-55
  6    328   72.6      92   2  DBH_BACSU   DNA-BINDING PROTEIN I  1.35e-55
  7    327   72.3      90   2  DBH_VIBPR   DNA-BINDING PROTEIN H  2.35e-55
  8    302   66.8      90   2  DBH_PSEAE   DNA-BINDING PROTEIN H  2.14e-49
  9    273   60.4      91   2  DBH1_RHILE  DNA-BINDING PROTEIN H  1.47e-42
 10    272   60.2      91   2  DBH_CLOPA   DNA-BINDING PROTEIN H  2.52e-42
 11    263   58.2      90   2  DBH_RHIME   DNA-BINDING PROTEIN H  3.18e-40
 12    261   57.7      91   2  DBH5_RHILE  DNA-BINDING PROTEIN H  9.29e-40
 13    250   55.3      94   2  DBH_ANASP   DNA-BINDING PROTEIN H  3.32e-37
 14    233   51.5      93   2  DBH_CRYPH   DNA-BINDING PROTEIN H  2.70e-33
 15    226   50.0      95   2  DBH_THETH   DNA-BINDING PROTEIN I  1.07e-31
 16    210   46.5      99   3  IHFA_SERMA  INTEGRATION HOST FACT  4.46e-28
 17    206   45.6     100   3  IHFA_RHOCA  INTEGRATION HOST FACT  3.52e-27
 18    205   45.4      99   3  IHFA_SALTY  INTEGRATION HOST FACT  5.90e-27
 19    204   45.1      99   3  IHFA_ECOLI  INTEGRATION HOST FACT  9.87e-27
 20    200   44.2      94   3  IHFB_ECOLI  INTEGRATION HOST FACT  7.71e-26
 21    200   44.2      94   3  IHFB_SERMA  INTEGRATION HOST FACT  7.71e-26
 22    165   36.5      99   5  TF1_BPSP1   TRANSCRIPTION FACTOR   3.42e-18
 23    147   32.5      90   2  DBH_THEAC   DNA-BINDING PROTEIN H  2.12e-14
 24     76   16.8     477   2  GLGA_ECOLI  GLYCOGEN SYNTHASE (EC  3.80e-01
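The search above ranks database entries by their Smith-Waterman local alignment score. As a rough illustration of that score, here is a minimal sketch of Smith-Waterman with a simple match/mismatch scheme and a linear gap cost. These scores are assumptions for the example; the search on the slide used a PAM 200 matrix with separate gap and gap-extension penalties, which this sketch does not reproduce.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_ij = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start afresh anywhere (local alignment).
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_ij = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_ij
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

# Hypothetical sequences sharing the segment GATTACA.
result = smith_waterman("AAAGATTACATTT", "CCGATTACAGG")
```

Unlike global alignment, the flanking residues that do not match are simply left out of the reported alignment, which is why a short conserved domain can still score well inside a long unrelated protein.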

GAPPED BLAST Starts with a Two Hit Approach

GAPPED BLAST Extension of Two Hit HSP

GAPPED BLAST Alignment

Decypher Search Engine [http://decypher.stanford.edu/]

Decypher Database Search Engine (http://decypher2.stanford.edu/)

Extreme Value Distribution of Scores

Expectation of Extreme Values

\[
\mathrm{Prob}(S > X) \approx 1 - \exp\{-K e^{-\lambda X}\}
\]
where \(\lambda\) is the root of the equation:
\[
\sum_{i=1}^{r} \sum_{j=1}^{r} p_i \, p_j \exp\{\lambda s_{ij}\} = 1
\]
\(p_i\) and \(p_j\) are the probabilities of each residue in each sequence, and \(s_{ij}\) are the similarity scores of two residues.

If the expected value of the scores for random sequences is negative, i.e.
\[
\sum_{i=1}^{r} \sum_{j=1}^{r} p_i \, p_j \, s_{ij} < 0
\]
then there are two solutions for \(\lambda\): zero and one other positive root.

[Plot: Distribution of Scores > S (axis ticks 1, 0.1, 0.01, 0.001) against Score S.]
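The positive root of the equation for lambda generally has to be found numerically. The sketch below uses bisection, relying on the condition stated on the slide (negative expected score) plus the assumption that at least one score is positive, which together give exactly one positive root. The function names are hypothetical, and K is treated as a given constant because the slide does not say how it is obtained.

```python
import math

def expected_score(p, s):
    """Sum over i, j of p_i * p_j * s_ij."""
    n = len(p)
    return sum(p[i] * p[j] * s[i][j] for i in range(n) for j in range(n))

def solve_lambda(p, s, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1, by bisection."""
    n = len(p)
    if expected_score(p, s) >= 0:
        raise ValueError("expected score must be negative")
    if max(max(row) for row in s) <= 0:
        raise ValueError("at least one score must be positive")

    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    hi = 1.0
    while f(hi) <= 0:        # grow until f is positive (beyond the root)
        hi *= 2
    lo = hi
    while f(lo) >= 0:        # shrink until f is negative (between 0 and the root)
        lo /= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def prob_score_exceeds(x, K, lam):
    """1 - exp(-K * exp(-lambda * x)), the form given on the slide."""
    return 1.0 - math.exp(-K * math.exp(-lam * x))

# Hypothetical two-letter alphabet: match +1, mismatch -2, equal frequencies.
lam = solve_lambda([0.5, 0.5], [[1, -2], [-2, 1]])
```

For this toy scoring system the equation reduces to a cubic in exp(lambda) whose relevant root is the golden ratio, so lambda should come out as its natural logarithm, about 0.481.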

Dynamic Programming
Database: G L I V S R A D G I R E M T S P L K S G F V
Query: G V I L S K A E G I R D V S T

Generalized Dynamic Programming

Query (one row of scores per position; columns are the 20 residues):

  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  3  4  5  1 -5  2  7  1  9 -3  5  0 -6  1  2  5  6 -7  3  4
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  3  4  5  1 -5  2  7  1  9 -3  5  0 -6  1  2  5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4

Database: G L I V S R A D G I R E M T S P L K S G F V
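In this generalized form the query is no longer a sequence but a list of score rows, and the dynamic programming looks up each cell's score in the row for the current query position. The sketch below is a local-alignment version of that idea, assuming a linear gap cost of -4 (a value not given on the slide) and returning only the best score. The two rows in the example are the first two rows shown above.

```python
RESIDUES = "ARNDCQEGHILKMFPSTWYV"   # column order used on the slide

def profile_local_score(profile, sequence, gap=-4):
    """Best local alignment score of a profile against a sequence.
    profile: list of rows, each with 20 scores in RESIDUES order."""
    column = {residue: k for k, residue in enumerate(RESIDUES)}
    previous = [0] * (len(sequence) + 1)
    best = 0
    for row in profile:
        current = [0] * (len(sequence) + 1)
        for j, residue in enumerate(sequence, start=1):
            current[j] = max(0,
                             previous[j - 1] + row[column[residue]],  # match
                             previous[j] + gap,       # skip a profile position
                             current[j - 1] + gap)    # skip a sequence residue
            best = max(best, current[j])
        previous = current
    return best

ROW_1 = [3, 4, 5, 1, -5, 2, 7, 1, 9, -3, 5, 0, -6, 1, 2, 5, 6, -7, 3, 4]
ROW_2 = [3, 0, 5, 9, -5, -3, 2, 2, -3, -3, 2, 0, -2, 1, 1, -5, 6, -7, 3, 4]

best_hd = profile_local_score([ROW_1, ROW_2], "HD")
```

With these rows, H scores 9 at the first position and D scores 9 at the second, so the two-residue sequence HD reaches 18. An ordinary sequence query is the special case where every row is copied from a substitution matrix.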

Profiles & Hidden Markov Models (http://pfam.wustl.edu/)
[Diagram: linear hidden Markov model with states AA1 to AA6, delete states D2 to D5 and insert states I1 to I5.]

Discovering Function from Protein Sequence
BLOCK, Weight Matrix or Position Specific Scoring Matrix
Consensus Sequences
Sequence Alignments
Profiles, PSI-BLAST
Hidden Markov Models

Hidden Markov Models (after Haussler)
[Diagram: linear hidden Markov model with states AA1 to AA6, delete states D2 to D5 and insert states I1 to I5.]

Globin HMM Model

Decypher Database Search Engine (http://decypher2.stanford.edu/)

General DNA Similarity Search Principles
Search both strands
Translate ORFs
Use the most sensitive search possible
BLAST for infinite gap penalty
Smith-Waterman for cDNA/genome comparisons
cDNA => zero gap-length penalty
Consider transition matrices
Ensure that the expected value of the score is negative
Examine results with expectation values between 0.05 and 10
Reevaluate results of borderline significance using a limited query
Beware of long results
Limit query length to 1,000 bases
Segment the query if it is longer than 1,000 bases
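Two of the principles above, searching both strands and segmenting queries longer than 1,000 bases, are easy to prepare in code before a search is submitted. This is a minimal sketch; the function names are hypothetical, and the optional overlap between segments is an added assumption (the slide only gives the 1,000-base limit) intended to keep a hit from being split at a segment boundary.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """The opposite strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]

def segment_query(sequence, max_len=1000, overlap=0):
    """Split a query into pieces of at most max_len bases.
    Consecutive pieces share `overlap` bases (assumed option, default none)."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    segments, start = [], 0
    while True:
        segments.append(sequence[start:start + max_len])
        if start + max_len >= len(sequence):
            break
        start += step
    return segments

def both_strand_queries(dna, max_len=1000, overlap=0):
    """Segments of the given strand followed by segments of its reverse complement."""
    return (segment_query(dna, max_len, overlap)
            + segment_query(reverse_complement(dna), max_len, overlap))

pieces = segment_query("A" * 2500, max_len=1000, overlap=100)
```

Many search programs handle the second strand themselves, so `both_strand_queries` is only needed when the tool in use searches the query as given.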

General Protein Similarity Search Principles
Choose between a local and a global search algorithm
Use the most sensitive search algorithm available
Original BLAST for no gaps
Smith-Waterman for most flexibility
Gapped BLAST for well-delimited regions
PSI-BLAST for families
Initially BLOSUM62 and default gap penalties
If no significant results, use BLOSUM30 and lower gap penalties
Ensure the expected score is negative
Examine results with expectation values between 0.05 and 10 for biological significance
Beware of long hits or those with unusual amino acid composition
Reevaluate results of borderline significance using a limited query
Segment long queries (over 300 amino acids)
Segments around known motifs