Is it worth to parameterize sequence alignment with an explicit evolutionary model?

Transcript:

Is it worth to parameterize sequence alignment with an explicit evolutionary model? Sean Eddy & E.R. p. 1/33

Channelrhodopsin-1 adapted from www.calvin.edu p. 2/33

Bacterial Rhodopsins
BACS2_HALSA   .TWFWVGAVGMLAGTVLPI..RD CIRHP SHRRYDLVLAGITGLAAIAYTTMG LGITATTVGD...RTVY.. LARYIDWLVTTPL...IVLYLAMLA RPG...
BACS2_NATPH   TTLFWLGAIGMLVGTLAFAWAGR DAGSG E.RRYYVTLVGISGIAAVAYVVMA LGVGWVPVAE...RTVF.. APRYIDWILTTPL...IVYFLGLLA GLD...
BACS2_HALVA   TTWFTLGLLGELLGTAVLAY.GY TLVPE ETRKRYLLLIAIPGIAIVAYALMA LGFGSIQSEG...HAVY.. VVRYVDWLLTTPL...NVWFLALLA GAS...
BACS1_HALSA   ATAYLGGAVALIVGVAFVWLLYR SLDGS PHQSALAPLAIIPVFAGLSYVGMA YDIGTVIVNG...NQIV.. GLRYIDWLVTTPI...LVGYVGYAA GAS...
C7P1Y4_HALMD  TTVYGLTAVVYAVALVVLWGWLR QV.SP EHRRFCTPIVLVVALAGVASAVVA AGVGTITVNG...SEVV.. VPLFVESMIAYGV...LYAVMARLA DVE...
D3SUL9_NATMM  FVLLVVSSIVFISAAAIFVGYSR TLPDG PNQYGYAAAVA.AGSMGLAYVVMA LVNGISG...ADTD.. LFRFLGYTAMWTV...IVLVVCSVA GVD...
BACH_NATPH    ASSLYINIALAGLSILLFVFMTR GLDDP RAKLIAVSTILVPVVSIASYTGLA SGLTISVLEMPAGHFAEGSSVMLGGEEVDGVVTM WGRYLTWALSTPM...ILLALGLLA GSN...
BACR_HALAR    AIWLWLGTAGMFLGMLYFIARGW GETDS RRQKFYIATILITAIAFVNYLAMA LGFGLTIVEFAGEE...HPIY.. WARYSDWLFTTPL...LLYDLGLLA GAD...
BACR_HALSA    WIWLALGTALMGLGTLYFLVKGM GVSDP DAKKFYAITTLVPAIAFTMYLSML LGYGLTMVPFGGEQ...NPIY.. WARYADWLFTTPL...LLLDLALLV DAD...
BACR1_HALSS   TLWLGIGTLLMLIGTFYFIVKGW GVTDK EAREYYSITILVPGIASAAYLSMF FGIGLTEVQVGSEM...LDIY.. YARYADWLFTTPL...LLLDLALLA KVD...
B6BSG6_9PROT  GISFWVISMGMLAATAFFFMETG NVAAG W.RTSVIVAGLVTGIAFIHYMYMR EVWVTTG...DSPT.. VYRYIDWLITVPLQMVEFYLILSAVG KAN...
C4YF64_CANAW  WAAFSVFLLLTIIHLLLFLYGNF R.KPG VKNSLLVIPLFTNAVFSVFYFTYA SNLGYAWQAVEFQH...AGTGLRQIF.. YAKFIAWFVGWPA...VLALFEIV TST.VLDRIEENPNIFKKFFLI
B9W6Y7_CANDC  WAVFSVFALFAIVHGFIYSFTDV R.KSG LKRALLTIPLFNSAVFAFAYYTYA SNLGYTWILAEFNH...AGTGFRQIF.. YAKFVAWFLGWPL...VLAIFQIV TNT.SFTTTEDESDLLKKFISL
A3LUH9_PICST  WALFSVFSLFAVVHAFVYGFTSS E.KKS LKKTLLVIPLFINAVMAYTYFTYA SNLGWTSTPTEFQH...VTTSEDLDVRQIF.. YVKWVGYFLTWPL...VLTIIEVT TQS...TDFFEEGDILTKFFSL
B5RTR5_DEBHA  WAVFSIFATLAVVHAFVFSFTSS R.THR LKKILFIVPLFTNAIMAYCYFTYA ANLGWTSTRVEFNH...VSTNRLLGVRQVF.. YVKYIGWFLAWPF...VLFAIEVA THTLESTNLADGGETVTGILSL
C5E3Q5_LACTC  WAVFSVFGLVSLIYAALFVVFEH R.GTK IHRYAVAGPLSISLVLAFSYFTMA SNLGWTAVQAEFNN..LTTPNQSEVPGIRQIF.. YAKYVAWFLTWPA...LLYLTELT GVV.TRDSSNILGPRPWSFYDL
C5DYF7_ZYGRC  WTVTAIFGLLAVVYVLLFFVTQV RNGSG LSRYSLAAPFLIAFFEFFAYFTYA SNLGWTGTNAEFHHISVSKPVTGESPGIRQVF.. YCKYIAWFLSWPI...VLFLQDLA ALS...TIKRDALGSASVLDL
HMM           WTVFSVGALLALVGTLLFFVTAR RVKDG EKRKLLVILLLIPAIAAVAYVTMA LGLGLTGVEAEFEH--------------RQVF-- YARYIDWLLTTPL----LVLVLAELA GAD-------------------
Alignment, continued (second block, one row per sequence, HMM row last):
HRTSAWLLAADVFVIAAGIAAAL T..T GVQ...RWLFFAVGAAGYAALLYGLL.GTLPRALGDDPR VR..SLFVTLRNITVVL...WTLYPVVWLL SPAGIGILQ TEMYTIVVVYLDFISKVAFVAFAVLGADA VSRLV
SREFGIVITLNTVVMLAGFAGAM V..P GIE...RYALFGMGAVAFLGLVYYLV.GPMTESASQRSS GIK.SLYVRLRNLTVIL...WAIYPFIWLL GPPGVALLT PTVDVALIVYLDLVTKVGFGFIALDAAAT LRAEH
REDTVKLVVLQALTIVFGFAGAV T..P SPV...SYALFAVGGALFGGVIYLLY.RNIAVAAKSTLS DIEVSLYRTLRNFVVVL...WLVYPVVWLL GAAGVGLMD VETATLVVVYLDVVTKVGFGVIALLAMID LGSAG
RRSIIGVMVADALMIAVGAGAVV T..D GTL...KWALFGVSSIFHLSLFAYLY.VIFPRVVPDVPE QI..GLFNLLKNHIGLL...WLAYPLVWLF GPAGIGEAT AAGVALTYVFLDVLAKVPYVYFFYARRRV FMHSE
GRALAAIVLTPVVQRIAFEVAAV S..G GIV...ALIGLVVVVGGHLAIAAYLL.GPVWTQTRGVPE QRR.LLHWKARNLVLFLIGMLIAYAVIALF GVF...D AFVSLAISQYMAVLIRVGFAGFLLANLDA VGSAS
RRLTLFLFAAVLGRLWITLGSWF V..D GTL...ALVATLGTFAALGFGLYLLF.GPFTRAAAALES ERR.LLFSKLKYLIVLG...WVGL.VATGI MAQGAGLAD DFVGQLVVIYVEVILILGFGAIVVRSRTA LSQTA
ATKLFTAITFDIAMCVTGLAAAL TTSS HLM...RWFWYAISCACFLVVLYILL.VEWAQDAKAAGT A...DMFNTLKLLTVVM...WLGYPIVWAL GVEGIAVLP VGVTSWGYSFLDIVAKYIFAFLLLNYLTS NESVV
RNTITSLVSLDVLMIGTGLVATL SPGS GVLSAGAERLVWWGISTAFLLVLLYFLF.SSLSGRVADLPS DTR.STFKTLRNLVTVV...WLVYPVWWLI GTEGIGLVG IGIETAGFMVIDLTAKVGFGIILLRSHGV LDGAA
QGTILALVGADGIMIGTGLVGAL T.KV YSY...RFVWWAISTAAMLYILYVLF.FGFTSKAESMRP EVA.STFKVLRNVTVVL...WSAYPVVWLI GSEGAGIVP LNIETLLFMVLDVSAKVGFGLILLRSRAI FGEAE
RVSIGTLVGVDALMIVTGLVGAL S.HT PLA...RYTWWLFSTICMIVVLYFLA.TSLRAAAKERGP EVA.STFNTLTALVLVL...WTAYPILWII GTEGAGVVG LGIETLLFMVLDVTAKVGFGFILLRSRAI LGDTE
SGMFWRLLLGSVVMLVGGYLGEA...G YIN...ATLGFIIGMAGWVYILYEVF SGEAGKAAAKSGN KALVTAFGAMRMIVTVG...WAIYPLGYVF GYLTGGV.D AESLNVVYNLADFVNKIAFGLVIWAAATS SSGKR
FQTWLVKFIFVEIYVLGLLIGSI I..F STY...KFGYFTFAVFFQLLLMVWVG.RDLHRSFKSPSH S...NIANFFLIFFYLV...WILYPVAWGL SEGGNVI.Q PDSEAVFYGILDLITFGLMPTILIFFAIK GCDEE
FEALFTRVLAIEVFVLGLLIGAL I..E STY...KWGYFTFAVVFQLFAIYLVI.NDVVVSFGSSSH S...VFGNALILAFVVV...WILYPVAWGL SEGGNVI.Q PDSEAVFYGILDLITFGVIPIILTWIAIN NVDEE
FSRLFAKILATEVFVIGLLIGAL I..E STY...KWGYFTFSVTAQLFAEIYIF.VNVMTAWRQSTQ...KLGLILVLCQLVI...WILYPIAWGL SEGGNKI.Q PDSEAAFYGVLDFFTFFFIPVGLTWLAIN NVDEE
LSGLIVKTFATEIYVLGLLIGIL I..P SSY...RWGYFTFAVSAQLFAMSLIL.VSMFSAAKSVHT N...KAAIIFIAFQLLV...WILYPICWGL SEGGNRI.Q PDSEAVFYGILDLITFSFVPIILTWINAS GVDED
VHGLFLQICGSWFFIIGLLVGSL I..H SSY...KWGYWTMAAFAQLLVTYLIF...KHQLVDLTI S...GIKLVLLVFTHVC...IYLYLVAWGL SDGGNVI.T VDSSHVFFGILDLLIFVLVPALLVATATS SGVMP
IHSLLVQIFGHYFWVIALLVGAL I..P STY...RWGYWTIGAFTMLVTEGLVL...QRQVQALRT R...GIYLILLMFMCLI...VWCYFIAWAV SEGGNKI.Q PDSEAVFYGILDLVVFAIYPSILVWIITV RGEWP
RRTLLVLVLADVVMIVGGLVGAL I--E STY-----RWVYFTISVAAQLVLLYLLL -GELARAAKSLSS EI--SLFNTLRNLVVVL---WLLYPVAWLL GEEGNGI-Q ADSEAVVYGILDLVAKVGFGLILLASATS NES--
static (fixed in time) HMM p. 3/33

Bacterial Rhodopsins
Alphaproteobacteria
  states:   MMMMMMMDDDDDDDMMMM   MMMMMMMMMMMMMIIIIMMMMMMMMM
  sequence: EVWVTTG-------DSPT   VYRYIDWLITVPLQMVEFYLILSAVG
HMM
  sequence: LGLGLTGVEAEFEHRQVF   YARYIDWLLTTPLLVLVLAELA
Halobacteria (Archaea)
  sequence: SGLTISVLEMPAGHFAEGSSVMLGGEEVDGVVTM   WGRYLTWALSTPMILLALGLLA
  states:   MMMMMMMMMMMMMMIIIIIIIIIIIIIIMMMMII   MMMMMMMMMMMMMMMMMMMMMM
Legend: M = an evolved ancestral residue (a substitution); D = a deleted ancestral residue; I = an insertion relative to the ancestral sequence.
evolutionary (time-dependent) HMM p. 4/33

Homology is an evolutionary question
Homology detection is hypothesis testing.
Forward score: $F = \log \frac{P(s \mid H)}{P(s \mid R)}$
Posterior of H given s: $P(H \mid s) = \frac{e^{F+\rho}}{1 + e^{F+\rho}}$
Evolutionary distance is a nuisance parameter in $P(s \mid H)$.
Current approaches assume (implicitly or explicitly) a fixed evolutionary distance. p. 5/33
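The posterior on this slide is a logistic function of the Forward score. Below is a minimal Python sketch of that conversion. It assumes F is a natural-log odds score and reads ρ as the log prior odds log P(H)/P(R); the slide does not define ρ, so that reading, along with the function name `posterior_homology`, is an assumption made for illustration.

```python
import math


def posterior_homology(forward_score: float, rho: float) -> float:
    """Return P(H | s) = e^(F + rho) / (1 + e^(F + rho)).

    forward_score: F = log P(s | H) / P(s | R), in nats (assumed).
    rho: assumed here to be the log prior odds log P(H) / P(R).
    The two branches avoid overflow for large |F + rho|.
    """
    x = forward_score + rho
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


# A likelihood ratio of 9 against prior odds of 1:9 gives even posterior odds.
print(posterior_homology(math.log(9.0), math.log(1.0 / 9.0)))
```

Under this reading, a fixed ρ turns any score threshold on F into a posterior threshold, which is why the evolutionary distance hidden inside P(s | H) matters for detection.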

An explicit time-parameterization allows us to:
Integrate over evolutionary distance: homology detection, homology coverage.
Optimize for evolutionary distance: alignment of homologs. p. 6/33

Affine Gap cost: a way of dealing with variability
  A V G S P I V L - K
  A H G - - - V L S K
Score = S(A,A) + S(V,H) + S(G,G) + β + η + η + S(V,V) + S(L,L) + β + S(K,K)
S: substitution matrix; β: gap open cost; η: gap extend cost (together, the affine gap cost).
BLAST syncs (empirically) the choice of substitution matrix with that of the affine gap costs: substitution matrix BLOSUM62, gap open -11, gap extend -1. p. 7/33
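The scoring rule on this slide can be sketched in a few lines of Python. The sketch follows the slide's convention that the first column of a gap costs β and each further column costs η. The function names and the toy substitution function (+2 match, -1 mismatch) are hypothetical placeholders, not BLOSUM62 values.

```python
def affine_alignment_score(row_x, row_y, subst, gap_open, gap_extend):
    """Score a pairwise alignment given as two equal-length gapped rows.

    The first column of each gap run costs gap_open (beta); every
    following column of the same run costs gap_extend (eta).
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap = None  # row holding the gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("column with gaps in both rows")
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            score += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            score += subst(a, b)
            prev_gap = None
    return score


def toy_subst(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if a == b else -1


# The alignment from the slide, with the BLAST gap costs -11 / -1.
print(affine_alignment_score("AVGSPIVL-K", "AHG---VLSK", toy_subst, -11, -1))
```

Passing the substitution lookup as a function keeps the gap logic separate from the choice of matrix, which is the pairing the slide says BLAST tunes empirically.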

HMMs formalize Smith-Waterman-like affine methods
From Smith-Waterman to an HMM
[Diagram: the Smith-Waterman recursion over prefixes x_{1..i-1}, y_{1..k-1} and suffixes x_{i+1..l}, y_{k+1..m}, with match score σ(x_i, y_k), insert (x_i, -), delete (-, y_k), gap open β and gap extend η, redrawn as a pair HMM with match, insert and delete states, emissions ε_M(x_i, y_k) and ε_I(x_i), flanking emissions ε_S(x_{1..i-1}) and ε_T(x_{i+1..l}), and transitions t_MM, t_MI, t_IM, t_MD, t_DM, t_II, t_DD, t_SMk, t_MkT, t_DkT.]
Eddy & Castellano, unpublished
A probabilistic evolutionary model provides time-dependent HMM transitions. p. 8/33

Is it worth to parameterize pair and profile HMMs with an explicit evolutionary model? p. 9/33

Alignment Accuracy Benchmark
Global Homology Set
[Figure, panel A: Score Efficiency (%) = P(seqs | Model) / P(seqs | Optimal-branch Model), versus % ID of trusted alignment. Panel B: F (%) versus % ID of trusted alignment.]
SEN = aligned positions inferred correctly / true aligned positions
PPV = aligned positions inferred correctly / inferred aligned positions
$F = \left[ \frac{1}{2} \left( \frac{1}{\mathrm{SEN}} + \frac{1}{\mathrm{PPV}} \right) \right]^{-1}$
Panel B curves: NCBIBLAST AUC=78.9; phmmer (no filters) AUC=78.7; MSAProbs AUC=81.7; two unlabeled curves, AUC=71.4 and AUC=80.4. p. 10/33
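The three accuracy measures defined on this slide can be computed from two alignments of the same sequence pair. The following is a minimal Python sketch: it represents each alignment as the set of residue-index pairs it aligns. The helper names `aligned_pairs` and `alignment_accuracy` are hypothetical, and the benchmark in the talk may count positions differently.

```python
def aligned_pairs(row_x, row_y):
    """Return the set of (i, j) residue-index pairs aligned to each other."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def alignment_accuracy(true_pairs, inferred_pairs):
    """Return (SEN, PPV, F), with F the harmonic mean of SEN and PPV."""
    correct = len(true_pairs & inferred_pairs)
    sen = correct / len(true_pairs) if true_pairs else 0.0
    ppv = correct / len(inferred_pairs) if inferred_pairs else 0.0
    f = 2.0 * sen * ppv / (sen + ppv) if correct else 0.0
    return sen, ppv, f


# Trusted alignment versus an inferred one for x = ACGT, y = ACT.
trusted = aligned_pairs("ACGT", "AC-T")
inferred = aligned_pairs("ACGT", "ACT-")
print(alignment_accuracy(trusted, inferred))
```

Writing F as 2·SEN·PPV / (SEN + PPV) is algebraically the same as the slide's reciprocal form and avoids dividing by zero when nothing is aligned correctly.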

Evolution of residue substitutions
Assume, for very small times ε (RATES): A changes to each of C, G, T with probability αε (residue changes, 3αε in total); A stays A with probability 1 - 3αε (residue is unchanged).
Propose and solve differential equations.
Infer, for finite time t (substitution probability):
$P_t(A \mid A) = \frac{1}{4} \left( 1 + 3 e^{-4 \alpha t} \right)$
$P_t(C \mid A) = \frac{1}{4} \left( 1 - e^{-4 \alpha t} \right)$
[Plot: substitution probability versus divergence time for α = .1; both curves approach .25.]
Substitution Matrix P_t. Jukes & Cantor (1969) p. 11/33
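The two closed-form probabilities above define the whole Jukes-Cantor substitution matrix. The sketch below builds it in Python; the function name `jc69_matrix` and the dictionary layout are illustrative choices, not something taken from the slides.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_matrix(alpha: float, t: float) -> dict:
    """Return P_t as nested dicts: matrix[a][b] = P_t(b | a).

    P_t(a | a) = (1 + 3 e^{-4 alpha t}) / 4
    P_t(b | a) = (1 - e^{-4 alpha t}) / 4   for b != a
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    diff = 0.25 * (1.0 - decay)
    return {a: {b: (same if a == b else diff) for b in NUCLEOTIDES}
            for a in NUCLEOTIDES}


# Rows sum to 1 at any time; at t = 0 the matrix is the identity and
# for large t every entry tends to 1/4.
p = jc69_matrix(alpha=0.1, t=5.0)
print(p["A"]["A"], p["A"]["C"], sum(p["A"].values()))
```

Because the matrix depends on α and t only through the product αt, a single rate parameter is enough to generate a substitution matrix for any divergence time.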

Evolution of Insertions compatible with affine models
Ancestral sequence:  T T G P L L V L
Descendant sequence (after time t): S T - P M Q M V E F Y L
Substitutions: infinitesimal rate α
Insertions & Deletions:
  rate for deleting an ancestral residue: $\mu_A$
  rate for starting a new insert with n residues: $\lambda (1 - s_I) s_I^{n-1}$
  rate for deleting a whole insert with n residues: $\mu (1 - s_D) s_D^{n-1}$
  rate for adding x residues to an insert: $\lambda_I (1 - \nu_I) \nu_I^{x-1}$
  rate for removing x residues from an insert: $\mu_I (1 - \nu_D) \nu_D^{x-1}$
$P_t(S \mid T)$, $P_t(\text{Descendant} \mid \text{Ancestral})$ p. 12/33

Not affine Models
[Figure: four panels (A to D), number of inserts (log scale) versus insert length, each with a geometric ML fit. Simulation variables in every panel: L = 1, N = 1, time = PAM24.]
A: μ = .5, λ = .5, μ_I = .2, λ_I = 1.2, μ_A = .3, v_I = s_I = .4, v_D = s_D = .9. Geometric ML fit (q = .947): G = 1348, p < 1e-6; χ² = 1585, p < 1e-6.
B: μ = .5, λ = .5, μ_I = .2, λ_I = 1.2, μ_A = .3, v_I = s_I = 0, v_D = s_D = 0. Geometric ML fit (q = .859): G = 735, p < 1e-6; χ² = 833, p < 1e-6.
C: μ = μ_I = .35, λ = λ_I = .65, μ_A = .3, v_I = s_I = .9, v_D = s_D = .4. Geometric ML fit (q = .94): G = 6638, p < 1e-6; χ² = 6231, p < 1e-6.
D: μ = μ_I = .35, λ = λ_I = .65, μ_A = .3, v_I = s_I = 0, v_D = s_D = 0. Geometric ML fit (q = .669): G = 13.2, p = .59; χ² = 13.1, p = .664.
p. 13/33

Analytical closed-form solutions
AIF (fragment) Model
Gap opens: $\beta_t = \frac{\lambda_I \left( 1 - e^{(\lambda_I - \mu_I) t} \right)}{\mu_I - \lambda_I e^{(\lambda_I - \mu_I) t}}$
Gap extends: $\eta_t = \frac{\lambda_I (1 - r) + \mu_I r - \lambda_I e^{(\lambda_I - \mu_I) t}}{\mu_I - \lambda_I e^{(\lambda_I - \mu_I) t}}$
Ancestral residue dies: $\gamma_t = 1 - e^{-\mu_A t}$
More realistic microscopic models result in non-affine macroscopic solutions. p. 14/33
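To see how these closed forms behave with divergence time, here is a short Python sketch. The slide text for the gap-extend formula is garbled in this transcript, so the expression for η_t used below, η_t = [λ_I(1 - r) + μ_I r - λ_I e^{(λ_I - μ_I)t}] / [μ_I - λ_I e^{(λ_I - μ_I)t}], is a reconstruction and should be treated as an assumption. It is algebraically equal to r + (1 - r)β_t, so it reduces to r at t = 0. The function name and argument names are hypothetical.

```python
import math


def aif_parameters(t, lam_i, mu_i, mu_a, r):
    """Return (beta_t, eta_t, gamma_t) for the AIF model at time t.

    beta_t  : gap-open probability
    eta_t   : gap-extend probability (reconstructed formula, an assumption)
    gamma_t : probability that an ancestral residue has died
    """
    if lam_i == mu_i:
        raise ValueError("closed form is 0/0 when lam_i == mu_i")
    e = math.exp((lam_i - mu_i) * t)
    denom = mu_i - lam_i * e
    beta = lam_i * (1.0 - e) / denom
    eta = (lam_i * (1.0 - r) + mu_i * r - lam_i * e) / denom
    gamma = 1.0 - math.exp(-mu_a * t)
    return beta, eta, gamma


# Illustrative rates only; they are not values from the talk.
for t in (0.0, 1.0, 5.0):
    print(t, aif_parameters(t, lam_i=0.1, mu_i=0.2, mu_a=0.05, r=0.4))
```

With λ_I < μ_I, β_t rises from 0 toward λ_I/μ_I as t grows, which is the sense in which a single set of rates yields a family of time-dependent gap costs.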

Evolved BLAST
[Figure: gap score along the series blosum90, blosum62, blosum45, from very similar (high cost for insertions) to very divergent (lower cost for insertions).]
Gap extend: -1.5, -1.0, -0.8
Gap open: -11.7, -11.0, -10.7
The middle values (gap open -11.0, gap extend -1.0) are the standard empirical values. p. 15/33

An evolved HMM
[Figure: two panels of transition probability versus divergence time t. The time at which parameters were trained from data, t = 1, is marked.]
Position in a conserved region: M→M (ancestral alive / no insertions) = $(1 - \gamma_t)(1 - \beta_t)$; M→I (start insert) = $\beta_t$. Values marked on the plot: .23 and .9966.
Position at start of an insertion: M→I (start insert) = $\beta_t$; M→M (ancestral alive / no insertions) = $(1 - \gamma_t)(1 - \beta_t)$. Values marked on the plot: .45 and .2331.
p. 16/33

Affine Evolutionary Models: A Catalog
Columns: evolutionary model | total # free parameters | rates (microscopic model) | geometric / other parameters | # states of minimal HMM (macroscopic model) | properties

single-residue models
AALI  | 6 | λ_I, µ_I, µ_A^{M,D,I}       | p                      | 3 | not reversible in general
LI    | 4 | λ_I, µ_I, µ_A               | p                      | 1 | not reversible in general
LR    | 2 | λ_I, µ_A, (µ_I = λ_I + µ_A) | (p_LR = λ_I/µ_A)       | 1 | reversible
TKF91 | 2 | λ, µ                        | (p_TKF = λ/µ)          | 2 | reversible, ref. [?]

fragment models
AFGX  | 9 | λ_I, µ_I, µ_A^{M,D,I}       | r_M, r_D, r_I, p       | 3 | not reversible
AFG   | 7 | λ_I, µ_I, µ_A               | r_M, r_D, r_I, p       | 3 | not reversible
AFGR  | 4 | λ_I, µ_A                    | r_M, r_DI, (p_LR)      | 3 | reversible
AFR   | 3 | λ_I, µ_A                    | r, (p_LR)              | 3 | reversible
TKF92 | 3 | λ, µ                        | r, (p_TKF92)           | 3 | reversible, ref. [?]
FID   | 2 | λ                           | r, (p = 1)             | 3 | reversible, ref. [?]

fragment affine model for profile HMMs
AIFX  | 7 | λ_I, µ_I, µ_A^{M,D,I}       | r_I, p                 | 3 | not reversible
AIF   | 5 | λ_I, µ_I, µ_A               | r_I, p                 | 3 | not reversible

no-fragment affine model for profile HMMs (plan7 HMMER)
AGAX  | 9 | λ^{M,D}, µ^{M,D}, µ_A^{M,D,I} | s_I, p               | 3 | not reversible
AGA   | 7 | λ^{M,D}, µ^{M,D}, µ_A         | s_I, p               | 3 | not reversible
p. 17/33

A fixed long-branch parameterization is sufficient to align global homologies of all degrees of conservation. p. 18/33

Global Homology Set
[Figure, four panels versus % ID of trusted alignment: A, Score Efficiency (%) = P(seqs | Model) / P(seqs | Optimal-branch Model); B, F (%); C, Sensitivity (%); D, Positive Predictive Value (%). Curves in panels B to D: NCBIBLAST, phmmer (no filters), MSAProbs, and the evolutionary pair HMM parameterizations, each annotated with its AUC.]
SEN = aligned positions inferred correctly / true aligned positions
PPV = aligned positions inferred correctly / inferred aligned positions
$F = \left[ \frac{1}{2} \left( \frac{1}{\mathrm{SEN}} + \frac{1}{\mathrm{PPV}} \right) \right]^{-1}$
p. 19/33

A fixed short-branch parameterization reduces non-homologous alignment overextension for high-identity local homologies. 5 amino acid homologies p. 20/33

Local Homology Set
[Figure, two columns of panels versus % ID of aligned homologous domain: A, Alignment Accuracy; B, Homology Coverage. Rows: F (%), SEN (%), PPV (%). Curves: SSEARCH36 (BLOSUM62, -11/-1), NCBIBLAST, phmmer (no filters), and two unlabeled curves, each annotated with its AUC.]
F (%), Alignment Accuracy: SSEARCH36 (BLOSUM62, -11/-1) AUC=71.7; NCBIBLAST AUC=68.4; phmmer (no filters) AUC=72.9; unlabeled curves AUC=68.2 and AUC=68.2.
F (%), Homology Coverage: SSEARCH36 (BLOSUM62, -11/-1) AUC=77.1; NCBIBLAST AUC=72.7; phmmer (no filters) AUC=77.4; unlabeled curves AUC=73.9 and AUC=69.5.
p. 21/33

Optimal branch parameterization
A variable optimal-branch parameterization is best for aligning local homologies of any percentage identity. p. 22/33
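The idea of an optimal-branch parameterization, choosing the divergence time t that maximizes P(seqs | Model at t), can be illustrated with a deliberately simplified stand-in. The Python sketch below is not the pair-HMM optimization used in the talk: it assumes an ungapped DNA alignment scored under the Jukes-Cantor model with a uniform ancestral distribution, and it finds t by a plain grid search. All names and the grid itself are hypothetical.

```python
import math


def jc69_log_likelihood(x, y, alpha, t):
    """log P(x, y | t) for an ungapped pair under Jukes-Cantor, t > 0."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 * (1.0 + 3.0 * decay)
    p_diff = 0.25 * (1.0 - decay)
    total = 0.0
    for a, b in zip(x, y):
        # Uniform prior 1/4 on the first residue, then P_t(b | a).
        total += math.log(0.25) + math.log(p_same if a == b else p_diff)
    return total


def optimal_branch(x, y, alpha=1.0, step=0.001, n_steps=5000):
    """Grid search for the time that maximizes the pair likelihood."""
    grid = [step * k for k in range(1, n_steps + 1)]
    return max(grid, key=lambda t: jc69_log_likelihood(x, y, alpha, t))


x = "ACGTACGTAC"
y = "ACGTACGTTT"  # 2 differences out of 10 sites
print(optimal_branch(x, y))
```

For this toy model the optimum is known in closed form, 4αt = -ln(1 - 4p/3) with p the observed fraction of differing sites, so the grid result can be checked directly. A pair HMM with gaps has no such shortcut, and the time would have to be optimized numerically.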

e2msa - pair HMM aligner
Alignment Accuracy - Evolutionary pair HMM (e2msa)
[Figure, Global Homology Set and Local Homology Set: Score Efficiency (%) = P(seqs | Model) / P(seqs | Optimal-branch Model) and F (%), versus % ID of alignment (global) or % ID of aligned homologous domain (local).]
Global Homology Set, F (%): Optimal-branch AUC=80.3; two unlabeled curves, AUC=80.4 and AUC=71.4.
Local Homology Set, F (%): Optimal-branch AUC=73.6; two unlabeled curves, AUC=68.2 and AUC=68.2.
p. 23/33

ephmmer
Alignment Accuracy - Evolutionary phmmer (ephmmer)
[Figure, Global Homology Set and Local Homology Set: Score Efficiency (%) = P(seqs | Model) / P(seqs | Optimal-branch Model) and F (%), versus % ID of alignment (global) or % ID of aligned homologous domain (local).]
Global Homology Set, F (%): ephmmer AUC=78.8; ephmmer AUC=66.7; Optimal-branch ephmmer AUC=77.2; phmmer 3.1b1 AUC=79.4.
Local Homology Set, F (%): ephmmer AUC=72.7; ephmmer AUC=68.3; Optimal-branch ephmmer AUC=74.5; phmmer 3.1b1 AUC=72.8.
p. 24/33

Performance of different models
ALIGNMENT ACCURACY [ AUC for F measure (%) ]

Method               | Global Homology Set       | Local Homology Set
                     | SHORT   LONG    OPTIMAL   | SHORT   LONG    OPTIMAL
(label missing)      | 71.4    80.4    80.3      | 68.2    68.2    73.6
e2msa.aga            | 71.3    80.4    80.2      | 68.1    67.3    73.6
e2msa.aif            | 71.3    80.4    80.2      | 68.1    68.3    73.3
e2msa.tkf92          | 71.2    80.0    79.9      | 68.1    68.2    73.4
e2msa.li             | 71.0    78.7    78.6      | 67.9    66.4    72.7
e2msa.tkf91          | 69.5    75.4    74.5      | 66.2    69.1    70.7
ephmmer (no filters) | 66.7    78.8    77.2      | 68.3    72.7    74.5

Method (single parameterization) | Global | Local
phmmer (no filters)              | 78.7   | 72.9
SSEARCH (BLOSUM62, -11/-1)       | 80.0   | 71.7
NCBIBLAST                        | 78.9   | 68.4
MSAProbs                         | 81.7   | 36.1
MUSCLE                           | 80.8   | 33.5

Evolutionary models with more parameters tend to perform better. p. 25/33

The detection and coverage of embedded global homologies is robust with just one long-branch parameterization p. 26/33

Embedded Global Homologies
[Figure, two panels versus % average ID of test domain to query msa. Homology Detection: % of True Positives before 5 False Positives. Homology Coverage: Homolog Residue Coverage (F measure %). Curves: HMMER 3.1b1 and Optimal-branch.]
p. 27/33

Short Local Homologies The detection and coverage of embedded short local homologies improves with a variable optimal-branch parameterization p. 28/33

[Figure, four panels versus % average ID of test domain to query msa, for Embedded 5 aa Local Homologies and Embedded 3 aa Local Homologies. Homology Detection: % of True Positives before 5 False Positives. Homology Coverage: Homolog Residue Coverage (F measure %). Curves: HMMER 3.1b1 and Optimal-branch.]
p. 29/33

Fragments
The detection of very short naked local homologies improves with a short-branch or optimal-branch parameterization. p. 30/33

Naked Fragments
[Figure, four panels versus % average ID of test domain to query msa, for Naked 3 aa Homologies and Naked 15 aa Homologies. Homology Detection: % of True Positives before 5 False Positives. Homology Coverage: Coverage (F measure %). Curves: HMMER 3.1b1, Optimal-branch, and two unlabeled curves, each annotated with its AUC.]
Naked 3 aa Homologies, detection: HMMER 3.1b1 AUC=72.7; Optimal-branch AUC=71.8; unlabeled AUC=72.5 and AUC=69.0. Coverage: HMMER 3.1b1 AUC=76.9; Optimal-branch AUC=77.2; unlabeled AUC=72.6 and AUC=76.7.
Naked 15 aa Homologies, detection: HMMER 3.1b1 AUC=40.1; Optimal-branch AUC=42.8; unlabeled AUC=36.7 and AUC=42.6. Coverage: HMMER 3.1b1 AUC=46.9; Optimal-branch AUC=51.2; unlabeled AUC=43.3 and AUC=50.2.
p. 31/33

Explicit evolutionary models??
It is nice to wind up and down a model without additional information.
For Sensitivity > For SEN/PPV >
Use a long-branch parameterization (12% id), except for metagenomics < 3 aa; then use a short-branch parameterization (45% id).
Use an optimal-branch parameterization for short embedded homologies.
For global embedded homologies, a long-branch parameterization is still OK.
Ancestral reconstruction p. 32/33

p. 33/33