Is it worth parameterizing sequence alignment with an explicit evolutionary model? Sean Eddy & E.R. p. 1/33
Channelrhodopsin-1 adapted from www.calvin.edu p. 2/33
Bacterial Rhodopsins
[Figure: multiple sequence alignment of 17 bacterial rhodopsin homologs (BACS2_HALSA, BACS2_NATPH, BACS2_HALVA, BACS1_HALSA, C7P1Y4_HALMD, D3SUL9_NATMM, BACH_NATPH, BACR_HALAR, BACR_HALSA, BACR1_HALSS, B6BSG6_9PROT, C4YF64_CANAW, B9W6Y7_CANDC, A3LUH9_PICST, B5RTR5_DEBHA, C5E3Q5_LACTC, C5DYF7_ZYGRC), shown in two blocks, each with the HMM consensus row beneath the sequences.]
HMM consensus, first block: WTVFSVGALLALVGTLLFFVTAR RVKDG EKRKLLVILLLIPAIAAVAYVTMA LGLGLTGVEAEFEH--------------RQVF-- YARYIDWLLTTPL----LVLVLAELA GAD-------------------
HMM consensus, second block: RRTLLVLVLADVVMIVGGLVGAL I--E STY-----RWVYFTISVAAQLVLLYLLL -GELARAAKSLSS EI--SLFNTLRNLVVVL---WLLYPVAWLL GEEGNGI-Q ADSEAVVYGILDLVAKVGFGLILLASATS NES--
static (fixed in time) HMM p. 3/33
Bacterial Rhodopsins
[Figure: the HMM consensus treated as the ancestral sequence, shown with one descendant from each of two clades and the state path of each descendant.]
Alphaproteobacteria
  path: MMMMMMMDDDDDDDMMMM / MMMMMMMMMMMMMIIIIMMMMMMMMM
  seq:  EVWVTTG-------DSPT / VYRYIDWLITVPLQMVEFYLILSAVG
HMM:    LGLGLTGVEAEFEHRQVF / YARYIDWLLTTPLLVLVLAELA
Halobacteria (Archaea)
  seq:  SGLTISVLEMPAGHFAEGSSVMLGGEEVDGVVTM / WGRYLTWALSTPMILLALGLLA
  path: MMMMMMMMMMMMMMIIIIIIIIIIIIIIMMMMII / MMMMMMMMMMMMMMMMMMMMMM
M: an evolved ancestral residue (a substitution)
D: a deleted ancestral residue
I: an insertion relative to the ancestral sequence
evolutionary (time-dependent) HMM p. 4/33
Homology is an evolutionary question
Homology detection is hypothesis testing.
Forward score: $F = \log \frac{P(s \mid H)}{P(s \mid R)}$
Posterior of H given s: $P(H \mid s) = \frac{e^{F+\rho}}{1 + e^{F+\rho}}$
Evolutionary distance is a nuisance parameter in $P(s \mid H)$.
Current approaches assume (implicitly or explicitly) a fixed evolutionary distance. p. 5/33
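The two formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that ρ plays the role of the log prior odds, log P(H)/P(R), and that scores are in nats. The function names and the numbers are hypothetical.

```python
import math


def forward_log_odds(log_p_s_given_h: float, log_p_s_given_r: float) -> float:
    """F = log P(s|H) - log P(s|R), with both inputs given as natural logs."""
    return log_p_s_given_h - log_p_s_given_r


def posterior_homolog(f_score: float, rho: float) -> float:
    """P(H|s) = e^(F+rho) / (1 + e^(F+rho)), evaluated without overflow."""
    z = f_score + rho
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical numbers, for illustration only.
F = forward_log_odds(-120.0, -135.0)      # 15 nats in favour of homology
rho = math.log(1e-6 / (1.0 - 1e-6))       # ASSUMPTION: rho is the log prior odds
print(posterior_homolog(F, rho))
```

Because P(s | H) depends on the evolutionary distance, any F computed this way inherits whatever distance was used to build the model of H.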
An explicit time parameterization allows us to:
Integrate over evolutionary distance: homology detection, homology coverage.
Optimize for evolutionary distance: alignment of homologs. p. 6/33
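The difference between integrating over and optimizing for evolutionary distance can be sketched as follows. This is an illustrative sketch under stated assumptions: the time grid, the uniform prior over grid points, and the toy log-likelihood are all hypothetical stand-ins for a real time-dependent model score.

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def integrate_over_time(loglik: Callable[[float], float],
                        times: Sequence[float],
                        log_prior: Optional[Sequence[float]] = None) -> float:
    """log sum_t P(t) P(s | H, t) on a discrete grid (uniform prior if none given)."""
    if log_prior is None:
        log_prior = [-math.log(len(times))] * len(times)
    return log_sum_exp([lp + loglik(t) for lp, t in zip(log_prior, times)])


def optimize_over_time(loglik: Callable[[float], float],
                       times: Sequence[float]) -> Tuple[float, float]:
    """Return the grid time with the highest log-likelihood, and that value."""
    best_t = max(times, key=loglik)
    return best_t, loglik(best_t)


# ASSUMPTION: a toy log-likelihood peaked at t = 0.8, standing in for log P(s | H, t).
def toy_loglik(t: float) -> float:
    return -50.0 - 10.0 * (t - 0.8) ** 2


grid = [0.1 * k for k in range(1, 31)]
print(integrate_over_time(toy_loglik, grid))
print(optimize_over_time(toy_loglik, grid))
```

With a normalized prior, the integrated score can never exceed the optimized one; it averages over distances rather than committing to the best one.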
Affine Gap Cost
A way of dealing with variability
A V G S P I V L - K
A H G - - - V L S K
Score = S(A,A) + S(V,H) + S(G,G) + β + η + η + S(V,V) + S(L,L) + β + S(K,K)
S: substitution matrix; β: gap open cost; η: gap extend cost (β and η together form the affine gap cost).
BLAST syncs (empirically) the choice of substitution matrix with that of the affine gap costs: substitution matrix BLOSUM62, gap open -11, gap extend -1. p. 7/33
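The scoring rule on this slide can be written as a short function. This sketch follows the slide's convention, in which the first gapped column of a run costs β and each further column costs η. The substitution function `toy_subst` is a hypothetical stand-in (match +2, mismatch -1), not BLOSUM62.

```python
from typing import Callable


def affine_alignment_score(row_x: str, row_y: str,
                           subst: Callable[[str, str], float],
                           gap_open: float, gap_extend: float) -> float:
    """Score a gapped pairwise alignment: first gap column costs gap_open,
    each following column of the same gap costs gap_extend."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    open_gap_in = None  # which row currently carries a gap: "x", "y" or None
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be gapped in both rows")
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            score += gap_extend if side == open_gap_in else gap_open
            open_gap_in = side
        else:
            score += subst(a, b)
            open_gap_in = None
    return score


def toy_subst(a: str, b: str) -> float:
    """ASSUMPTION: toy scores used only for illustration, not a real matrix."""
    return 2.0 if a == b else -1.0


x = "AVGSPIVL-K"
y = "AHG---VLSK"
print(affine_alignment_score(x, y, toy_subst, gap_open=-11.0, gap_extend=-1.0))
```

The three-column gap contributes β + η + η and the single-column gap contributes β, matching the sum on the slide.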
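HMMs formalize SW-like affine methods
From Smith-Waterman to an HMM
[Figure: state diagrams. Smith-Waterman: a match state scoring $\sigma(x_i, y_k)$, an insert state ($x_i$ against a gap) and a delete state (a gap against $y_k$), with gap-open cost β and gap-extend cost η, flanked by the unaligned $x_{1..i-1}$, $y_{1..k-1}$ and $x_{i+1..l}$, $y_{k+1..m}$. HMM: a match state emitting $\varepsilon_M(x_i, y_k)$, an insert state emitting $\varepsilon_I(x_i)$, a delete state, flanking emissions $\varepsilon_S(x_{1..i-1})$ and $\varepsilon_T(x_{i+1..l})$, and transitions $t_{MM}$, $t_{MI}$, $t_{IM}$, $t_{II}$, $t_{MD}$, $t_{DM}$, $t_{DD}$, $t_{SM_k}$, $t_{M_kT}$, $t_{D_kT}$.] Eddy & Castellano, unpublished
A probabilistic evolutionary model provides time-dependent HMM transitions. p. 8/33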
Is it worth parameterizing pair and profile HMMs with an explicit evolutionary model? p. 9/33
Alignment Accuracy Benchmark
Global Homology Set
[Figure, panel A: Score Efficiency (%), the ratio P(seqs | Model) / P(seqs | Optimal-branch Model), versus % ID of the trusted alignment.]
[Figure, panel B: F (%) versus % ID of the trusted alignment. NCBIBLAST AUC=78.9; phmmer (no filters) AUC=78.7; MSAProbs AUC=81.7; two further curves with AUC=71.4 and AUC=80.4.]
SEN = (aligned positions inferred correctly) / (true aligned positions)
PPV = (aligned positions inferred correctly) / (inferred aligned positions)
$F = \left[\frac{1}{2}\left(\frac{1}{\mathrm{SEN}} + \frac{1}{\mathrm{PPV}}\right)\right]^{-1}$ p. 10/33
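The three accuracy measures can be computed directly from sets of aligned position pairs. This is a minimal sketch: representing an alignment as a set of (i, k) index pairs is an assumption made here for illustration, and the example pairs are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (position in sequence x, position in sequence y)


def alignment_accuracy(true_pairs: Set[Pair],
                       inferred_pairs: Set[Pair]) -> Tuple[float, float, float]:
    """Return (SEN, PPV, F), where F is the harmonic mean of SEN and PPV."""
    correct = len(true_pairs & inferred_pairs)
    sen = correct / len(true_pairs) if true_pairs else 0.0
    ppv = correct / len(inferred_pairs) if inferred_pairs else 0.0
    f = 0.0 if correct == 0 else 2.0 / (1.0 / sen + 1.0 / ppv)
    return sen, ppv, f


# Hypothetical trusted and inferred alignments.
trusted = {(0, 0), (1, 1), (2, 2), (3, 4)}
inferred = {(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)}
print(alignment_accuracy(trusted, inferred))
```

An over-extended alignment adds inferred pairs that are not in the trusted set, which lowers PPV while leaving SEN unchanged.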
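Evolution of residue substitutions
Assume, for very small times ε (rates): A changes to each of C, G, T with probability $\alpha\epsilon$ (residue changes); A stays A with probability $1 - 3\alpha\epsilon$ (residue is unchanged).
Propose and solve differential equations.
Infer, for finite time t: $P_t(A \mid A) = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and $P_t(C \mid A) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$.
[Figure: substitution probability versus divergence time, plotted with α = .1; both curves converge to 0.25.]
Substitution matrix $P_t$. Jukes & Cantor (1969) p. 11/33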
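The finite-time Jukes-Cantor probabilities translate directly into code. In this sketch the rate and time values passed in are hypothetical; only the two closed-form expressions come from the slide.

```python
import math
from typing import Dict, Tuple


def jukes_cantor(t: float, alpha: float) -> Tuple[float, float]:
    """Return (P_t(A|A), P_t(C|A)) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 * (1.0 + 3.0 * decay)
    p_diff = 0.25 * (1.0 - decay)
    return p_same, p_diff


def jc_matrix(t: float, alpha: float,
              alphabet: str = "ACGT") -> Dict[str, Dict[str, float]]:
    """Full substitution matrix P_t as nested dicts: matrix[ancestor][descendant]."""
    p_same, p_diff = jukes_cantor(t, alpha)
    return {a: {b: (p_same if a == b else p_diff) for b in alphabet}
            for a in alphabet}


# ASSUMPTION: alpha = 0.01 and t = 50 are illustrative values only.
for ancestor, row in jc_matrix(50.0, 0.01).items():
    print(ancestor, {b: round(p, 4) for b, p in row.items()})
```

Each row sums to one at every t, and for small t the off-diagonal entries are approximately αt, which recovers the infinitesimal assumption.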
Evolution of Insertions compatible with affine models
Ancestral sequence: T T G P L L V L
Descendant sequence (after time t): S T - P M Q M V E F Y L
Substitutions: infinitesimal rate α
Insertions & Deletions:
rate for deleting an ancestral residue: $\mu_A$
rate for starting a new insert with n residues: $\lambda\,(1 - s_I)\,s_I^{\,n-1}$
rate for deleting a whole insert with n residues: $\mu\,(1 - s_D)\,s_D^{\,n-1}$
rate for adding x residues to an insert: $\lambda_I\,(1 - \nu_I)\,\nu_I^{\,x-1}$
rate for removing x residues from an insert: $\mu_I\,(1 - \nu_D)\,\nu_D^{\,x-1}$
$P_t(S \mid T)$, $P_t(\text{Descendant} \mid \text{Ancestral})$ p. 12/33
Not affine Models
[Figure: four panels (A-D), each a histogram of number of inserts versus insert length with a geometric maximum-likelihood (ML) fit. Simulation variables in every panel: L = 1, N = 1, time = PAM24.]
A: μ = .5, λ = .5, μ_I = .2, λ_I = 1.2, μ_A = .3, ν_I = s_I = .4, ν_D = s_D = .9. Geometric ML fit q = .947; G = 1348, p < 1e-6; χ² = 1585, p < 1e-6.
B: μ = .5, λ = .5, μ_I = .2, λ_I = 1.2, μ_A = .3, ν_I = s_I = 0, ν_D = s_D = 0. Geometric ML fit q = .859; G = 735, p < 1e-6; χ² = 833, p < 1e-6.
C: μ = μ_I = .35, λ = λ_I = .65, μ_A = .3, ν_I = s_I = .9, ν_D = s_D = .4. Geometric ML fit q = .94; G = 6638, p < 1e-6; χ² = 6231, p < 1e-6.
D: μ = μ_I = .35, λ = λ_I = .65, μ_A = .3, ν_I = s_I = 0, ν_D = s_D = 0. Geometric ML fit q = .669; G = 13.2, p = .59; χ² = 13.1, p = .664.
p. 13/33
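A geometric fit of the kind reported in these panels can be sketched as follows. This is an illustration, not the authors' procedure: it assumes insert lengths n ≥ 1 distributed as P(n) = (1 - q) q^(n-1), uses the textbook ML estimate and G statistic, and omits the p-value. The sample lengths are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def fit_geometric(lengths: Sequence[int]) -> float:
    """ML estimate of q for P(n) = (1 - q) * q**(n - 1), n >= 1: q = 1 - N / sum(n)."""
    if not lengths or min(lengths) < 1:
        raise ValueError("need at least one length, all >= 1")
    return 1.0 - len(lengths) / sum(lengths)


def g_statistic(lengths: Sequence[int], q: float) -> float:
    """G = 2 * sum O * ln(O / E) against the fitted geometric distribution.
    Lengths never observed have O = 0 and contribute nothing to the sum."""
    n = len(lengths)
    g = 0.0
    for length, observed in Counter(lengths).items():
        expected = n * (1.0 - q) * q ** (length - 1)
        g += 2.0 * observed * math.log(observed / expected)
    return g


# Hypothetical insert lengths, for illustration only.
sample = [1, 1, 2, 3]
q_hat = fit_geometric(sample)
print(q_hat, g_statistic(sample, q_hat))
```

A large G relative to its degrees of freedom indicates that the insert lengths are not geometric, which is what an affine gap model implicitly assumes.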
Analytical closed-form solutions
AIF (fragment) Model
Gap opens: $\beta_t = \frac{\lambda_I \left(1 - e^{(\lambda_I - \mu_I) t}\right)}{\mu_I - \lambda_I e^{(\lambda_I - \mu_I) t}}$
Gap extends: $\eta_t = \frac{\lambda_I (1 - r) + \mu_I r - \lambda_I e^{(\lambda_I - \mu_I) t}}{\mu_I - \lambda_I e^{(\lambda_I - \mu_I) t}}$
Ancestral residue dies: $\gamma_t = 1 - e^{-\mu_A t}$
More realistic microscopic models result in non-affine macroscopic solutions. p. 14/33
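The three closed-form expressions can be evaluated numerically as below. This is a sketch only: the rate values are hypothetical, the function names are not from any library, and the case λ_I = μ_I (where the formulas need a limit) is simply rejected.

```python
import math


def gap_open_prob(t: float, lam_i: float, mu_i: float) -> float:
    """beta_t = lam_I (1 - e^{(lam_I - mu_I) t}) / (mu_I - lam_I e^{(lam_I - mu_I) t})."""
    if lam_i == mu_i:
        raise ValueError("lam_i == mu_i needs the limiting form, not handled here")
    e = math.exp((lam_i - mu_i) * t)
    return lam_i * (1.0 - e) / (mu_i - lam_i * e)


def gap_extend_prob(t: float, lam_i: float, mu_i: float, r: float) -> float:
    """eta_t = (lam_I (1 - r) + mu_I r - lam_I e^{...}) / (mu_I - lam_I e^{...})."""
    if lam_i == mu_i:
        raise ValueError("lam_i == mu_i needs the limiting form, not handled here")
    e = math.exp((lam_i - mu_i) * t)
    return (lam_i * (1.0 - r) + mu_i * r - lam_i * e) / (mu_i - lam_i * e)


def ancestral_death_prob(t: float, mu_a: float) -> float:
    """gamma_t = 1 - e^{-mu_A t}."""
    return 1.0 - math.exp(-mu_a * t)


# ASSUMPTION: illustrative rates, not fitted values.
lam_i, mu_i, mu_a, r = 0.05, 0.06, 0.03, 0.4
for t in (0.5, 1.0, 2.0):
    print(t, gap_open_prob(t, lam_i, mu_i),
          gap_extend_prob(t, lam_i, mu_i, r),
          ancestral_death_prob(t, mu_a))
```

Algebraically, the gap-extend expression equals r + (1 - r) β_t, so at t = 0 it reduces to r while β_t and γ_t both start at zero.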
Evolved BLAST
[Figure: gap scores plotted across substitution matrices, from very similar (BLOSUM90, high cost for insertions) through BLOSUM62 to very divergent (BLOSUM45, lower cost for insertions).
Gap extend: -1.5, -1.0, -0.8.
Gap open: -11.7, -11.0, -10.7.
The BLOSUM62 values (-11.0 and -1.0) are the standard empirical values.] p. 15/33
An evolved HMM
[Figure: transition probability versus divergence time t, in two panels, with a marker at the time at which parameters were trained from data.
Position in a conserved region: M→M (ancestral alive / no insertions) = $(1 - \gamma_t)(1 - \beta_t)$; M→I (start insert) = $\beta_t$; values marked at the training time: .23 and .9966.
Position at start of an insertion: M→I (start insert) = $\beta_t$; M→M (ancestral alive / no insertions) = $(1 - \gamma_t)(1 - \beta_t)$; values marked at the training time: .45 and .2331.] p. 16/33
Affine Evolutionary Models: A Catalog
The first three data columns describe the microscopic model; the last two describe the macroscopic model.

Model   Free params  Rates                            Geometric params    States (minimal HMM)  Other properties

single-residue models
AALI    6            λ_I, μ_I, μ_A^{M,D,I}            p                   3                     not reversible in general
LI      4            λ_I, μ_I, μ_A                    p                   1                     not reversible in general
LR      2            λ_I, μ_A, (μ_I = λ_I + μ_A)      (p_LR = λ_I/μ_A)    1                     reversible
TKF91   2            λ, μ                             (p_TKF = λ/μ)       2                     reversible, ref. [?]

fragment models
AFGX    9            λ_I, μ_I, μ_A^{M,D,I}            r_M, r_D, r_I, p    3                     not reversible
AFG     7            λ_I, μ_I, μ_A                    r_M, r_D, r_I, p    3                     not reversible
AFGR    4            λ_I, μ_A                         r_M, r_DI, (p_LR)   3                     reversible
AFR     3            λ_I, μ_A                         r, (p_LR)           3                     reversible
TKF92   3            λ, μ                             r, (p_TKF92)        3                     reversible, ref. [?]
FID     2            λ                                r, (p = 1)          3                     reversible, ref. [?]

fragment affine model for profile HMMs
AIFX    7            λ_I, μ_I, μ_A^{M,D,I}            r_I, p              3                     not reversible
AIF     5            λ_I, μ_I, μ_A                    r_I, p              3                     not reversible

no-fragment affine model for profile HMMs (plan7 HMMER)
AGAX    9            λ^{M,D}, μ^{M,D}, μ_A^{M,D,I}    s_I, p              3                     not reversible
AGA     7            λ^{M,D}, μ^{M,D}, μ_A            s_I, p              3                     not reversible
p. 17/33
A fixed long-branch parameterization is sufficient to align global homologies of all degrees of conservation. p. 18/33
[Figure: Global Homology Set, four panels plotted against % ID of the trusted alignment.
A: Score Efficiency (%), the ratio P(seqs | Model) / P(seqs | Optimal-branch Model).
B: F (%). NCBIBLAST AUC=78.9; phmmer (no filters) AUC=78.7; MSAProbs AUC=81.7.
C: Sensitivity (%). NCBIBLAST AUC=77.6; phmmer (no filters) AUC=76.7; MSAProbs AUC=83.8.
D: Positive Predictive Value (%). NCBIBLAST AUC=81.4; phmmer (no filters) and MSAProbs also shown.
Additional unlabeled curves appear in panels B, C and D.]
SEN = (aligned positions inferred correctly) / (true aligned positions)
PPV = (aligned positions inferred correctly) / (inferred aligned positions)
$F = \left[\frac{1}{2}\left(\frac{1}{\mathrm{SEN}} + \frac{1}{\mathrm{PPV}}\right)\right]^{-1}$ p. 19/33
A fixed short-branch parameterization reduces non-homologous alignment overextension for high-identity local homologies. 5 amino acid homologies p. 20/33
[Figure: Local Homology Set. Two columns of panels, Alignment Accuracy and Homology Coverage; three rows, F (%), SEN (%) and PPV (%); all plotted against % ID of the aligned homologous domain. Methods: SSEARCH36 (BLOSUM62, -11/-1), NCBIBLAST, phmmer (no filters), plus additional unlabeled curves.
F, Alignment Accuracy: SSEARCH36 AUC=71.7; NCBIBLAST AUC=68.4; phmmer (no filters) AUC=72.9.
F, Homology Coverage: SSEARCH36 AUC=77.1; NCBIBLAST AUC=72.7; phmmer (no filters) AUC=77.4.
SEN, Alignment Accuracy: SSEARCH36 AUC=74.9; NCBIBLAST AUC=71.4; phmmer (no filters) AUC=74.3.
SEN, Homology Coverage: SSEARCH36 AUC=78.3; NCBIBLAST AUC=74.0; phmmer (no filters) AUC=76.5.
PPV: the same three methods are shown in both columns.] p. 21/33
Optimal branch parameterization A variable optimal-branch parameterization is best for aligning local homologies of any percentage identity. p. 22/33
e2msa - pair HMM aligner
Alignment Accuracy - Evolutionary pair HMM (e2msa)
[Figure: Score Efficiency (%), the ratio P(seqs | Model) / P(seqs | Optimal-branch Model), and F (%), for two benchmark sets.
Global Homology Set (versus % ID alignment), F: AUC=71.4, AUC=80.4, Optimal-branch AUC=80.3.
Local Homology Set (versus % ID aligned homologous domain), F: AUC=68.2, AUC=68.2, Optimal-branch AUC=73.6.] p. 23/33
ephmmer
Alignment Accuracy - Evolutionary phmmer (ephmmer)
[Figure: Score Efficiency (%), the ratio P(seqs | Model) / P(seqs | Optimal-branch Model), and F (%), for two benchmark sets.
Global Homology Set (versus % ID alignment), F: ephmmer AUC=66.7; ephmmer AUC=78.8; Optimal-branch ephmmer AUC=77.2; phmmer 3.1b1 AUC=79.4.
Local Homology Set (versus % ID aligned homologous domain), F: ephmmer AUC=68.3; ephmmer AUC=72.7; Optimal-branch ephmmer AUC=74.5; phmmer 3.1b1 AUC=72.8.] p. 24/33
Performance of different models
ALIGNMENT ACCURACY [ AUC for F measure (%) ]

                             Global Homology Set        Local Homology Set
                             PARAMETERIZATION           PARAMETERIZATION
Method                       SHORT   LONG   OPTIMAL     SHORT   LONG   OPTIMAL
e2msa.[?]                    71.4    80.4   80.3        68.2    68.2   73.6
e2msa.aga                    71.3    80.4   80.2        68.1    67.3   73.6
e2msa.aif                    71.3    80.4   80.2        68.1    68.3   73.3
e2msa.tkf92                  71.2    80.0   79.9        68.1    68.2   73.4
e2msa.li                     71.0    78.7   78.6        67.9    66.4   72.7
e2msa.tkf91                  69.5    75.4   74.5        66.2    69.1   70.7
ephmmer (no filters)         66.7    78.8   77.2        68.3    72.7   74.5

Method                       Global Homology Set        Local Homology Set
phmmer (no filters)          78.7                       72.9
SSEARCH (BLOSUM62, -11/-1)   80.0                       71.7
NCBIBLAST                    78.9                       68.4
MSAProbs                     81.7                       36.1
MUSCLE                       80.8                       33.5

Evolutionary models with more parameters tend to perform better. p. 25/33
The detection and coverage of embedded global homologies is robust with just one long-branch parameterization p. 26/33
Embedded Global Homologies
[Figure: two panels plotted against % average ID of test domain to query msa, comparing HMMER 3.1b1 with Optimal-branch.
Homology Detection: % of True Positives before 5 False Positives.
Homology Coverage: Homolog Residue Coverage (F measure %).] p. 27/33
Short Local Homologies The detection and coverage of embedded short local homologies improve with a variable optimal-branch parameterization. p. 28/33
[Figure: four panels plotted against % average ID of test domain to query msa, comparing HMMER 3.1b1 with Optimal-branch, for Embedded 5 aa Local Homologies and Embedded 3 aa Local Homologies.
Homology Detection: % of True Positives before 5 False Positives.
Homology Coverage: Homolog Residue Coverage (F measure %).] p. 29/33
Fragments The detection of very short naked local homologies improves with a short-branch or optimal-branch parameterization. p. 30/33
Naked Fragments
[Figure: four panels plotted against % average ID of test domain to query msa. Homology Detection is % of True Positives before 5 False Positives; Homology Coverage is Coverage (F measure %).
Naked 3 aa Homologies, detection: HMMER 3.1b1 AUC=72.7; Optimal-branch AUC=71.8; further curves AUC=72.5 and AUC=69.0.
Naked 3 aa Homologies, coverage: HMMER 3.1b1 AUC=76.9; Optimal-branch AUC=77.2; further curves AUC=72.6 and AUC=76.7.
Naked 15 aa Homologies, detection: HMMER 3.1b1 AUC=40.1; Optimal-branch AUC=42.8; further curves AUC=36.7 and AUC=42.6.
Naked 15 aa Homologies, coverage: HMMER 3.1b1 AUC=46.9; Optimal-branch AUC=51.2; further curves AUC=43.3 and AUC=50.2.] p. 31/33
Explicit evolutionary models??
It is nice to wind a model up and down without additional information.
For sensitivity: use a long-branch parameterization (12% id). Except for metagenomics < 3 aa, then use a short-branch parameterization (45% id).
For SEN/PPV: use an optimal-branch parameterization for short embedded homologies. For global embedded homologies, a long-branch parameterization is still OK.
Ancestral reconstruction p. 32/33