Biochemistry 201 Advanced Molecular Biology (http://cmgm.stanford.edu/biochem201/)
Bioinformatics: Discovering Function from Sequence
Doug Brutlag
Departments of Biochemistry
June 4, 1999
Discovering Function from Protein Sequence

BLOCK, Weight Matrix or Position Specific Scoring Matrix

Position   1   2   3   4   5   6   7   8   9  10  11  12
A          2   1   3  13  10  12  67   4  13   9   1   2
R          7   5   8   9   4   0   1  16   7   0   1   0
N          0   8   0   1   0   0   0   2   1   1  10   0
D          0   1   0   1  13   0   0  12   1   0   4   0
C          0   0   1   0   0   0   0   0   0   2   2   1
Q          1   1  21   8  10   0   0   7   6   0   0   2
E          2   0   0   9  21   0   0  15   7   3   3   0
G          9   7   1   4   0   0   8   0   0   0  46   0
H          4   3   1   1   2   0   0   2   2   0   5   0
I         10   0  11   1   2  10   0   4   9   3   0  16
L         16   1  17   0   1  31   0   3  11  24   0  14
K          3   4   5  10  11   1   1  13  10   0   5   2
M          7   1   1   0   0   0   0   0   5   7   1   8
F          4   0   3   0   0   4   0   0   0  10   0   0
P          0   6   0   1   0   0   0   0   0   0   0   0
S          1  17   0   8   3   1   3   0   2   2   2   0
T          5  22   3  11   1   5   0   2   2   2   0   5
W          2   0   0   0   0   0   0   0   0   1   0   1
Y          1   0   4   2   0   1   0   0   2   4   0   1
V          6   3   1   1   2  15   0   0   2  12   0  28

Consensus Sequences
Zinc Finger (C2H2 type): C.{2,4} C.{12} H.{3,5} H

Sequences of Common Structure or Function

Sequence Alignments
1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGS
2 HLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN

Profiles, PSI-BLAST

Hidden Markov Models
[Figure: profile HMM with match states AA1 to AA6, insert states I1 to I5, delete states D2 to D5]
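The C2H2 zinc-finger consensus on this slide is already written in regular-expression syntax, so it can be tried directly. The following is a minimal sketch in Python; the function name and the test peptide are made up for illustration and are not part of the lecture.

```python
import re

# Consensus pattern from the slide with the spaces removed:
# C, 2-4 of any residue, C, 12 of any residue, H, 3-5 of any residue, H.
ZINC_FINGER_C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_zinc_fingers(protein):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ZINC_FINGER_C2H2.finditer(protein)]


# Hypothetical peptide constructed to fit the pattern (not a real protein).
example = "MK" + "C" + "AA" + "C" + "G" * 12 + "H" + "LLL" + "H" + "KR"
print(find_zinc_fingers(example))
```

A consensus pattern like this is all-or-nothing: a single substitution at a C or H position removes the match, which is one reason the deck moves on to weight matrices, profiles and HMMs.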
Sequence Alignment Problem
T C A T G
C A T T G
Sequence Alignment (exact)
X 220 230 240 250 X
F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X
Needleman-Wunsch Algorithm (1)
Needleman-Wunsch Algorithm (2)
Needleman-Wunsch Algorithm (3)
Needleman-Wunsch Algorithm (4)
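The four Needleman-Wunsch slides are figures only, so here is a minimal sketch of the algorithm in Python for reference. The scoring values (match +1, mismatch -1, linear gap cost -1) are assumptions chosen for illustration, not the values used in the lecture, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences from the "Sequence Alignment Problem" slide.
total, row1, row2 = needleman_wunsch("TCATG", "CATTG")
print(total)
print(row1)
print(row2)
```

With these assumed scores the best global alignment of TCATG and CATTG scores 2 (four matches, two single-residue gaps). Several alignments can tie for the optimum; the traceback returns one of them.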
Sequence Alignment
X 220 230 240 250 X
F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
: :: : : : : : ::::: ::
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X
Sequence Alignment and Typical Scoring Function
X 220 230 240 250 X
F--SGGNTHIYMNHVEQCKEILRREPKELCELVISGLPYKFRYLSTKE-QLK-Y
: :: : : : : : ::::: ::
GDFIHTLGDAHIYLNHIEPLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYN
X 260 270 280 290 X

\[ \text{Score} = \sum_{\text{Region Start}}^{\text{Region End}} \text{Similarity-weights} \; - \sum_{\text{Region Start}}^{\text{Region End}} \text{Penalties} \]

where:

\[ \text{Penalty} = \text{Gap-penalty} + \text{Size-of-gap} \times \text{Gap-size-penalty} \]
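The scoring function on this slide can be applied directly to an alignment that has already been written out with `-` for gaps. The sketch below is one way to do that in Python; the function names, the +1/-1 similarity weights and the penalty values in the example are assumptions for illustration only.

```python
def alignment_score(top, bottom, weight, gap_penalty, gap_size_penalty):
    """Score = sum of similarity weights - sum of gap penalties.

    Each run of '-' in one row is a single gap and costs
    gap_penalty + run_length * gap_size_penalty.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0.0
    run = 0          # length of the gap currently being read
    run_side = None  # which row the current gap is in
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            if run and side != run_side:
                # A gap in the other row starts: close the previous one.
                total -= gap_penalty + run * gap_size_penalty
                run = 0
            run_side = side
            run += 1
        else:
            if run:
                total -= gap_penalty + run * gap_size_penalty
                run = 0
            total += weight(x, y)
    if run:
        total -= gap_penalty + run * gap_size_penalty
    return total


def identity_weight(x, y):
    """Toy similarity weight (assumed): +1 for identical residues, else -1."""
    return 1 if x == y else -1


# Four matches (+4) and one gap of size 2 (2 + 2 * 0.5 = 3) give 1.0.
print(alignment_score("AC--GT", "ACTTGT", identity_weight, 2, 0.5))
```

In practice `weight` would look up a substitution matrix such as PAM 250 rather than test for identity.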
Sequence Similarity vs Evolutionary Distance
[Figure: plot against Mutations Introduced per 100 Residues (x-axis, 0 to 400); two vertical scales running 0 to 100]
Dayhoff's Acceptable Point Mutations

Ala A
Arg R   30
Asn N  109   17
Asp D  154    0  532
Cys C   33   10    0    0
Gln Q   93  120   50   76    0
Glu E  266    0   94  831    0  422
Gly G  579   10  156  162   10   30  112
His H   21  103  226   43   10  243   23   10
Ile I   66   30   36   13   17    8   35    0    3
Leu L   95   17   37    0    0   75   15   17   40  253
Lys K   57  477  322   85    0  147  104   60   23   43   39
Met M   29   17    0    0    0   20    7    7    0   57  207   90
Phe F   20    7    7    0    0    0    0   17   20   90  167    0   17
Pro P  345   67   27   10   10   93   40   49   50    7   43   43    4    7
Ser S  772  137  432   98  117   47   86  450   26   20   32  168   20   40  269
Thr T  590   20  169   57   10   37   31   50   14  129   52  200   28   10   73  696
Trp W    0   27    3    0    0    0    0    0    3    0   13    0    0   10    0   17    0
Tyr Y   20    3   36    0   30    0   10    0   40   13   23   10    0  260    0   22   23    6
Val V  365   20   13   17   33   27   37   97   30  661  303   17   77   10   50   43  186    0   17
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
Dayhoff's PAM 250 Matrix (Log-Odds Form)

A Ala   .18
R Arg  -.15   .61
N Asn   .02     0   .20
D Asp   .03  -.13   .21   .39
C Cys  -.20  -.36  -.36  -.51  1.19
Q Gln  -.04   .13   .08   .16  -.54   .40
E Glu   .03  -.11   .14   .34  -.53   .25   .38
G Gly   .13  -.26   .03   .06  -.34  -.53   .25   .38
H His  -.14   .16   .16   .07  -.34   .29   .07  -.21   .65
I Ile  -.05  -.20  -.18  -.24  -.23  -.20  -.20  -.26  -.24   .45
L Leu  -.19  -.30  -.29  -.40  -.60  -.18  -.34  -.41  -.21   .24   .59
K Lys  -.12   .34   .10   .01  -.54   .07  -.01  -.17     0  -.19  -.29   .47
M Met  -.11  -.04  -.17  -.26  -.52  -.10  -.21  -.28  -.21   .22   .37   .04   .64
F Phe  -.35  -.45  -.35  -.56  -.43  -.47  -.54  -.48  -.18   .10   .18  -.53   .02   .91
P Pro   .11  -.02  -.05  -.10  -.28   .02  -.06  -.05  -.02  -.20  -.25  -.11  -.21  -.46   .59
S Ser   .11  -.03   .07   .03     0  -.05     0   .11  -.08  -.14  -.28  -.02  -.16  -.32   .09   .16
T Thr   .12  -.09   .04  -.01  -.22  -.08  -.04     0  -.13   .01  -.17     0  -.06  -.31   .03   .13   .26
W Trp  -.58   .22  -.42  -.68  -.78  -.48  -.70  -.70  -.28  -.51  -.18  -.35  -.42   .04  -.56  -.25  -.52  1.73
Y Tyr  -.35  -.42  -.21  -.43   .03  -.40  -.43  -.52  -.01  -.09  -.09  -.44  -.24   .70  -.49  -.28  -.27  -.02  1.01
V Val   .02  -.25  -.17  -.21  -.19  -.19  -.18  -.14  -.22   .37   .19  -.24   .18  -.12  -.12  -.10   .03  -.62  -.25   .43
          A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His   Ile   Leu   Lys   Met   Phe   Pro   Ser   Thr   Trp   Tyr   Val
Mutation Data Matrix (MDM-78)

Cys C  12
Ser S   0   2
Thr T  -2   1   3
Pro P  -3   1   0   6
Ala A  -2   1   1   1   2
Gly G  -3   1   0  -1   1   5
Asn N  -4   1   0  -1   0   0   2
Asp D  -5   0   0  -1   0   1   2   4
Glu E  -5   0   0  -1   0   0   1   3   4
Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Comparison of Scoring Matrices

Sequences Compared                                                   Unitary  Genetic Code  Amino Acid  PAM 250
                                                                     Matrix   Matrix        Matrix      Matrix
Antibacterial substance A Streptomyces vs. Neocarzinostatin Streptomyces  3.1      3.2           2.6         2.9
Ferredoxin Clostridium vs. Ferredoxin Spirulina                      0.1      1.6           1.8         3.4
-Hemoglobin Human vs. Myoglobin Human                                5.8      6.6           9.9        10.7
-Hemoglobin Human vs. Globin CTT-III Midge                           2.0      2.4           3.2         3.5
Cytochrome C Horse vs. Cytochrome C6 Spirulina                       4.5      4.3           7.3         6.1
Cytochrome C Horse vs. Cytochrome C553 Desulfovibrio                 0.2      0.4           0.4         3.9
β2-microglobulin Human vs. Ig Human chain C4 region                  3.6      3.3           4.7         4.8
Ig chain C4 region Human vs. Ig chain C4 Human                       4.7      9.0           9.2        12.1
Significance of Alignments vs PAMs
Detecting Evolutionary Relationships
[Figure: evolutionary tree on a time axis marked 300 million years, 200 million years, 100 million years, Today; distance labels: PAM 100 (four times), PAM 200, PAM 150]
Block Signatures for a Protein Family (http://www.blocks.fhcrc.org/)
(After Henikoff and Henikoff)

INKHIQ   HSGEQLAETLGMSRAAINKHIQ   SRAAINKHIVA
VSRVVN   VTLYDVAEYAGVSYQTVSRVVN   VSYQTVSRVVN
ASRALM   AMIKDVALKAKVSTATVSRALM   VSTATVSRALA
VSHVIN   ATIKDVAKRAGVSTTTVSHVIN   GVTTTVSHVIN
VSAILN   ITIYDLAELSGVSASAVSAILN   SGVSAVSAILN
IRRDLN   LHLKDAAALLGVSEMTIRRDLN   GVSEMTRRDLN
THVRVE   TAYAELAKQFGVSPGTIHVRVE   TAYATIHVRVE
GSSELA   GSLTEAAHLLGTSQPTVSRELA   GSQPTVSRELA
MTRGSN   MSQRELKNELGAGIATITRGSN   MSIATITRGSN
VGRILK   ITRQEIGQIVGCSRETVGRILK   ISRETVGRILK
LSHLFR   FDIASVAQHVCLSPSRLSHLFR   FDISRLSHLFR
LAHLFR   LRIDEVARHVCLSPSRLAHLFR   LRPSRLAHLFR
ISRLLG   MTRGDIGNYLGLTVETISRLLG   MTVETISRLLG
LHRLFK   VTLEALADQVGMSPFHLHRLFK   TLEFHLHRLFK

10-45 25-55 40
Smith-Waterman Similarity Search

Query: HU-NS1   Maximal Score: 452   PAM Matrix: 200   Gap Penalty: 5   Gap Extension: 0.5

No.  Score  Match  Length  DB  ID          Description            Pred. No.
  1    452  100.0      90   2  DBHB_ECOLI  DNA-BINDING PROTEIN H  8.74e-86
  2    451   99.8      90   2  DBHB_SALTY  DNA-BINDING PROTEIN H  1.54e-85
  3    336   74.3      90   2  DBHA_ECOLI  DNA-BINDING PROTEIN H  1.64e-57
  4    336   74.3      90   2  DBHA_SALTY  DNA-BINDING PROTEIN H  1.64e-57
  5    328   72.6      90   2  DBH_BACST   DNA-BINDING PROTEIN I  1.35e-55
  6    328   72.6      92   2  DBH_BACSU   DNA-BINDING PROTEIN I  1.35e-55
  7    327   72.3      90   2  DBH_VIBPR   DNA-BINDING PROTEIN H  2.35e-55
  8    302   66.8      90   2  DBH_PSEAE   DNA-BINDING PROTEIN H  2.14e-49
  9    273   60.4      91   2  DBH1_RHILE  DNA-BINDING PROTEIN H  1.47e-42
 10    272   60.2      91   2  DBH_CLOPA   DNA-BINDING PROTEIN H  2.52e-42
 11    263   58.2      90   2  DBH_RHIME   DNA-BINDING PROTEIN H  3.18e-40
 12    261   57.7      91   2  DBH5_RHILE  DNA-BINDING PROTEIN H  9.29e-40
 13    250   55.3      94   2  DBH_ANASP   DNA-BINDING PROTEIN H  3.32e-37
 14    233   51.5      93   2  DBH_CRYPH   DNA-BINDING PROTEIN H  2.70e-33
 15    226   50.0      95   2  DBH_THETH   DNA-BINDING PROTEIN I  1.07e-31
 16    210   46.5      99   3  IHFA_SERMA  INTEGRATION HOST FACT  4.46e-28
 17    206   45.6     100   3  IHFA_RHOCA  INTEGRATION HOST FACT  3.52e-27
 18    205   45.4      99   3  IHFA_SALTY  INTEGRATION HOST FACT  5.90e-27
 19    204   45.1      99   3  IHFA_ECOLI  INTEGRATION HOST FACT  9.87e-27
 20    200   44.2      94   3  IHFB_ECOLI  INTEGRATION HOST FACT  7.71e-26
 21    200   44.2      94   3  IHFB_SERMA  INTEGRATION HOST FACT  7.71e-26
 22    165   36.5      99   5  TF1_BPSP1   TRANSCRIPTION FACTOR   3.42e-18
 23    147   32.5      90   2  DBH_THEAC   DNA-BINDING PROTEIN H  2.12e-14
 24     76   16.8     477   2  GLGA_ECOLI  GLYCOGEN SYNTHASE (EC  3.80e-01
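A search like the one above scores the query against every database entry with a local alignment and sorts the hits by score. The sketch below shows that idea in Python under simplifying assumptions: identity scoring (+2 match, -1 mismatch) and a linear gap cost of -2 stand in for the PAM 200 matrix and the separate gap and gap-extension penalties listed on the slide, and the sequences and entry names are invented.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere; this is what makes it local.
            cur[j] = max(0,
                         prev[j - 1] + s,  # extend diagonally
                         prev[j] + gap,    # gap in b
                         cur[j - 1] + gap) # gap in a
            best = max(best, cur[j])
        prev = cur
    return best


def rank_database(query, database):
    """Return (score, name) pairs, best hit first."""
    hits = [(smith_waterman_score(query, seq), name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Hypothetical toy data: both sequences share the stretch GATTACA.
toy_database = {"hit": "ZZZGATTACAWW", "miss": "TTTT"}
print(rank_database("XXGATTACAYY", toy_database))
```

Raw scores alone do not say whether a hit is meaningful. The "Pred. No." column in the table plays that role, and the extreme value distribution slides explain where such numbers come from.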
GAPPED BLAST Starts with a Two Hit Approach
GAPPED BLAST Extension of Two Hit HSP
GAPPED BLAST Alignment
Decypher Search Engine [http://decypher.stanford.edu/]
Decypher Database Search Engine (http://decypher2.stanford.edu/)
Extreme Value Distribution of Scores
Expectation of Extreme Values

\[ \mathrm{Prob}(S > X) \approx 1 - \exp\{-K e^{-\lambda X}\} \]

where \(\lambda\) is the root of the equation:

\[ \sum_{i=1}^{r} \sum_{j=1}^{r} p_i \, p_j \exp\{\lambda s_{ij}\} = 1 \]

\(p_i\) and \(p_j\) are the probabilities of each residue in each sequence, and \(s_{ij}\) are the similarity scores of two residues. If the expected value of the scores for random sequences is negative, i.e.

\[ \sum_{i=1}^{r} \sum_{j=1}^{r} p_i \, p_j \, s_{ij} < 0 \]

then there are two solutions for \(\lambda\): zero and one other positive root.

[Figure: Distribution of Scores > S; x-axis: Score S; y-axis ticks: 1, 0.1, 0.01, 0.001]
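The positive root of the equation on this slide has no closed form in general, but it is easy to find numerically. The sketch below uses bisection in Python; the function names are hypothetical, K is simply passed in by the caller (the slide does not show how it is obtained), and the DNA example with uniform base frequencies and +1/-1 scores is an assumption chosen because its root can be checked by hand.

```python
import math


def karlin_lambda(probs_a, probs_b, score):
    """Positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1, by bisection."""
    pairs = [(pa * pb, score(x, y))
             for x, pa in probs_a.items()
             for y, pb in probs_b.items()]
    if sum(p * s for p, s in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if max(s for _, s in pairs) <= 0:
        raise ValueError("at least one score must be positive")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0

    # f(0) = 0 and f dips below zero before rising again, so bracket the
    # positive root with f(lo) < 0 < f(hi).
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0 and lo > 1e-12:
        lo /= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def tail_probability(x, K, lam):
    """Prob(S > x) from the extreme value form 1 - exp(-K * exp(-lam * x))."""
    return 1.0 - math.exp(-K * math.exp(-lam * x))


# Assumed example: uniform base frequencies, +1 for a match, -1 otherwise.
uniform = {base: 0.25 for base in "ACGT"}
lam = karlin_lambda(uniform, uniform, lambda x, y: 1 if x == y else -1)
print(lam, math.log(3))
```

For this example the equation reduces to \(\tfrac{1}{4}e^{\lambda} + \tfrac{3}{4}e^{-\lambda} = 1\), whose solutions are \(\lambda = 0\) and \(\lambda = \ln 3\), so the numerical root can be compared with `math.log(3)`. The first `ValueError` corresponds to the slide's requirement that the expected score be negative.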
Dynamic Programming
[Figure: dynamic programming matrix with axes labeled Query and Database; sequences shown: G L I V S R A D G I R E M T S P L K S G F V and G V I L S K A E G I R D V S T]
Generalized Dynamic Programming
Axis labels: Database, Query

Score vectors (one row per position; columns are residues):

  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  3  4  5  1 -5  2  7  1  9 -3  5  0 -6  1  2  5  6 -7  3  4
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  3  4  5  1 -5  2  7  1  9 -3  5  0 -6  1  2  5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  2 -3  4 -2  5  2 -3  1 -1  0  2  5 -4  2 -3  4  5 -1  0  4
  6  4 -3  0  2 -1 -3 -1  4  3 -5  1 -3  3  4 -5  2 -3  2 -1
  3  0  5  9 -5 -3  2  2 -3 -3  2  0 -2  1  1 -5  6 -7  3  4
  1  3  5  2 -5 -3  2  2 -3  2  2  0 -2  1  1 -5  6 -7  3  4

Sequence: G L I V S R A D G I R E M T S P L K S G F V
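The generalization on this slide is that the score for a cell no longer comes from a fixed residue-pair matrix but from a per-position vector indexed by the residue on the other axis. The sketch below illustrates only that lookup, in Python, using the first three score vectors from the slide. It slides the vectors along a sequence without gaps, which is a simplification of the full dynamic programming shown in the figure; the function names are hypothetical.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # column order from the slide
COLUMN = {aa: k for k, aa in enumerate(ALPHABET)}

# First three score vectors from the slide.
PROFILE = [
    [3, 4, 5, 1, -5, 2, 7, 1, 9, -3, 5, 0, -6, 1, 2, 5, 6, -7, 3, 4],
    [3, 0, 5, 9, -5, -3, 2, 2, -3, -3, 2, 0, -2, 1, 1, -5, 6, -7, 3, 4],
    [1, 3, 5, 2, -5, -3, 2, 2, -3, 2, 2, 0, -2, 1, 1, -5, 6, -7, 3, 4],
]


def profile_score(profile, segment):
    """Sum the position-specific score of each residue in segment."""
    return sum(row[COLUMN[aa]] for row, aa in zip(profile, segment))


def best_window(profile, sequence):
    """Return (score, start) of the best ungapped placement of the profile."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        s = profile_score(profile, sequence[start:start + width])
        if best is None or s > best[0]:
            best = (s, start)
    return best


print(profile_score(PROFILE, "GLI"))
print(best_window(PROFILE, "GLIVS"))
```

To turn this into the gapped version, the same `row[COLUMN[aa]]` lookup would replace the substitution-matrix term inside a Needleman-Wunsch or Smith-Waterman recurrence.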
Profiles & Hidden Markov Models (http://pfam.wustl.edu/)
[Figure: profile HMM with match states AA1 to AA6, insert states I1 to I5, delete states D2 to D5]
Hidden Markov Models (after Haussler)
[Figure: profile HMM with match states AA1 to AA6, insert states I1 to I5, delete states D2 to D5]
Globin HMM Model
General DNA Similarity Search Principles
- Search both strands
- Translate ORFs
- Use most sensitive search possible
    - BLAST for infinite gap penalty
    - Smith-Waterman for cDNA/genome comparisons
    - cDNA => zero gap-length penalty
    - Consider transition matrices
    - Ensure that expected value of score is negative
- Examine results with expectation values between 0.05 and 10
- Reevaluate results of borderline significance using limited query
- Beware of long results
    - Limit query length to 1,000 bases
    - Segment query if longer than 1,000 bases
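Two of the principles above are mechanical enough to sketch: searching both strands means also submitting the reverse complement, and long queries are cut into pieces of at most 1,000 bases. The Python below is a minimal illustration; the function names are hypothetical, and cutting into non-overlapping pieces is an assumption, since the slide does not say whether segments should overlap.

```python
# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def segment(query, size=1000):
    """Cut a query into consecutive pieces of at most `size` bases."""
    return [query[i:i + size] for i in range(0, len(query), size)]


def search_inputs(query, size=1000):
    """All pieces to submit: both strands, each cut to the size limit."""
    pieces = []
    for strand in (query, reverse_complement(query)):
        pieces.extend(segment(strand, size))
    return pieces


print(reverse_complement("ATGC"))
print([len(p) for p in search_inputs("A" * 2500)])
```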
General Protein Similarity Search Principles
- Choose between local or global search algorithm
- Use most sensitive search algorithm available
    - Original BLAST for no gaps
    - Smith-Waterman for most flexibility
    - Gapped BLAST for well delimited regions
    - PSI-BLAST for families
- Initially BLOSUM62 and default gap penalties
    - If no significant results, use BLOSUM30 and lower gap penalties
- Ensure expected score is negative
- Examine results with expectation values between 0.05 and 10 for biological significance
- Beware of long hits or those with unusual amino acid composition
- Reevaluate results of borderline significance using limited query
    - Segment long queries (longer than 300 amino acids)
    - Segments around known motifs