Motif-based Hidden Markov Models for Multiple Sequence Alignment


William N. Grundy and Charles P. Elkan
Dept. of Computer Science & Engineering
University of California, San Diego

Abstract

Protein families are well characterized by a collection of motifs (Sonnhammer & Kahn 1994), sometimes referred to as the common core (Chothia & Lesk 1986). These motifs can have structural and functional significance, and diverse evolutionary mechanisms may frequently operate upon them as units. The quality of a multiple alignment depends upon how accurately it identifies an ordered series of motifs (McClure et al. 1994, 1995). Hidden Markov models (HMMs) provide a theoretically sound paradigm, supported by efficient algorithms, for modeling collections of motifs. Meta-MEME (Grundy et al. 1996, 1997) is a software toolkit that builds left-to-right, motif-based HMMs that focus upon the common core. Meta-MEME has been shown to detect remote homologies using smaller training sets than are required by standard HMMs. In an analysis of four protein families, Meta-MEME alignments are shown to be of higher quality than those produced by standard HMMs. In a previous analysis of nine other multiple alignment methods (McClure et al. 1994), only one method yields significantly higher quality alignments than Meta-MEME.

Meta-MEME

Step 1: MEME builds motif models with maximal posterior probability, given a set of related sequences.
Step 2: MAST uses only the motifs that appear in more than half the training sequences to search a database for the typical order and spacing of those motifs.
Step 3: Meta-MEME constructs a hidden Markov model that combines the motif models in a linear fashion according to the typical schema.
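The three steps can be sketched in a few lines of Python. This is a minimal illustration of a linear, motif-based topology, not the Meta-MEME implementation: the `State` class, the function names, and the choice of a single self-looping spacer state (with an expected number of visits equal to the typical spacing) are all assumptions made for the example. Emission probabilities, which would come from the MEME motif models, are omitted.

```python
from dataclasses import dataclass


@dataclass
class State:
    name: str         # e.g. "spacer0" or "motif1_pos3" (hypothetical naming)
    kind: str         # "spacer" or "motif"
    self_loop: float  # probability of staying in this state
    advance: float    # probability of moving to the next state


def select_motifs(occurrences, n_sequences):
    """Keep motifs found in more than half of the training sequences (Step 2)."""
    return [m for m, count in occurrences.items() if count > n_sequences / 2]


def build_linear_hmm(motif_widths, spacer_lengths):
    """Chain spacer and motif states left to right (Step 3).

    motif_widths: width of each motif, in the typical order.
    spacer_lengths: typical spacing before each motif plus one trailing
        spacer, so len(spacer_lengths) == len(motif_widths) + 1.
    """
    if len(spacer_lengths) != len(motif_widths) + 1:
        raise ValueError("need one spacer before each motif and one after the last")
    states = []
    for i, spacing in enumerate(spacer_lengths):
        # Assumption: one self-looping state per spacer, whose expected
        # number of visits (1 / advance) equals the typical spacing.
        advance = 1.0 / max(spacing, 1)
        states.append(State(f"spacer{i}", "spacer", 1.0 - advance, advance))
        if i < len(motif_widths):
            for pos in range(motif_widths[i]):
                # Motif positions never loop: each emits exactly one residue.
                states.append(State(f"motif{i}_pos{pos}", "motif", 0.0, 1.0))
    return states
```

For example, `build_linear_hmm([3, 2], [5, 10, 4])` yields eight states: three spacers interleaved with two motifs of widths 3 and 2. A spacer built this way always consumes at least one residue, which is a limitation of the sketch rather than a property claimed for Meta-MEME.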

Methods

Four families were analyzed: globins, eukaryotic kinases, aspartic acid proteases, and the RH domain of the RNA-directed DNA polymerase (reverse transcriptase). Each family was represented by a test set of twelve sequences (McClure et al. 1994). Pairwise sequence identity was 10-70% for the globins and 8-30% for all other families. Standard HMMs (Krogh et al. 1994) were created using the HMMER v1.8 tool hmmt (Eddy 1995). The test sequences were aligned to both the Meta-MEME and the HMMER models using hmma with default parameter settings.

Multiple Alignments

Pre-determined test motifs are indicated by color. The x's above the alignments indicate Meta-MEME motif positions or HMMER match state positions. Amino acids are unaligned outside of motifs in Meta-MEME alignments or within insertions in HMMER alignments.

Meta-MEME Alignment of Globin Sequences

        ...xxxxxxxx...xxxxxxxxxxxxxxxxx...
HAHU    vlspa...dktnvkaawgkvgahag...eygaealermflsfpttktyfphfd...
HAOR    mltda...ekkevtalwgkaaghge...eygaealerlfqafpttktyfshfd...
HADK    vlsaa...dktnvkgvfskigghae...eygaetlermfiaypqtktyfphfd...
HBHU    vhltp...eeksavtalwgkvnvde...vggealgrllvvypwtqrffesfgdls
HBOR    vhlsg...geksavtnlwgkvnine...lggealgrllvvypwtqrffeafgdls
HBDK    vhwta...eekqlitglwgkvnvad...cgaealarllivypwtqrffasfgnls
MYHU    glsdg...ewqlvlnvwgkveadip...ghgqevlirlfkghpetlekfdkfkhlk
MYOR    glsdg...ewqlvlkvwgkvegdlp...ghgqevlirlfkthpetlekfdkfkglk
IGLOB   mkffavlalcivgaiaspltadeaslvqsswkavshn...eveilaavfaaypdiqnkfsqf...
GPUGNI  altek...qeallkqswevlkqnip...ahslrlfaliieaapeskyvfsflkds.
GPYL    gvl...tdvqvalvkssfeefnanipknthrfftlvleiapgakdlfsflkgs.
GGZLB   mldqqtin...iikatvpvlkehgvtit...ttfyknlfakhpevrplfdmg...

        ...xxxxxxxxxxxx...xxxxxxxxx...xxxxxxxxxxxxxxx...
HAHU    ...lshGSAQVKGHGKKVadalt...navaHVDDMPNALsa...lSDLHAHKLRVDPVNFk...
HAOR    ...lshGSAQIKAHGKKVadals...taagHFDDMDSALsa...lSDLHAHKLRVDPVNFk...
HADK    ...lshGSAQIKAHGKKVaaalv...eavnHVDDIAGALsk...lSDLHAQKLRVDPVNFk...
HBHU    tpdavmgnpkvkahgkkvlgafs...dglahldnlkgtfat...lselhcdklhvdpenfr...
HBOR    sagavmgnpkvkahgakvltsfg...dalknlddlkgtfak...lselhcdklhvdpenfn...
HBDK    sptailgnpmvrahgkkvltsfg...davknldnikntfaq...lselhcdklhvdpenfr...
MYHU    sedemkasedlkkhgatvltalggi..lkkkghheaeikplaq...shatkhkipvkylef...
MYOR    tedemkasadlkkhggtvltalgni..lkkkgqheaelkplaq...shatkhkisikfley...
IGLOB   ...agkDLASIKDTGAFAt...HATRIVSFLse...vIALSGNTSNAAAVNSlvsklg
GPUGNI  .neipeNNPKLKAHAAVIfkticesatelrqkgHAVWDNNTLkr...lGSIHLKNKITDPHFEvmk...
GPYL    .sevpqNNPDLQAHAGKVfklt...yeaaIQLEVNGAVasdatlkslGSVHVSKGVVDAHFPvvk...
GGZLB   ...rqeSLEQPKALAMTVlaa...aqNIENLPAILpav...kkiAVKHCQAGVAAAHYPi...

        ...xxxxxxxxxxxxxxxxxxxxxxxxxxx...xxxxxxxxxx
HAHU    ...lLSHCLLVTLAAHLPAEFTPAVHASLDKfl...asVSTVLTSKYR
HAOR    ...lLAHCILVVLARHCPGEFTPSAHAAMDKfl...skVATVLTSKYR
HADK    ...fLGHCFLVVVAIHHPAALTPEVHASLDKfm...caVGAVLTAKYR
HBHU    ...lLGNVLVCVLAHHFGKEFTPPVQAAYQKvv...agVANALAHKYH
HBOR    ...rLGNVLIVVLARHFSKDFSPEVQAAWQKlv...sgVAHALGHKYH
HBDK    ...lLGDILIIVLAAHFTKDFTPECQAAWQKlv...rvVAHALARKYH
MYHU    ...ISECIIQVLQSKHPGDFGADAQGAMNKalelf..rkdmaSNYKELGFQG
MYOR    ...ISEAIIHVLQSKHSADFGADAQAAMGKalelf..rndmaAKYKEFGFQG
IGLOB   ddhkargvsaaqfgefrtalvaylqanvswgdnvaaawnkal...dntfaivvprl
GPUGNI  ...gaLLGTIKEAIKENWSDEMGQAWTEAYNQl...VATIKAEMKE
GPYL    ...eaILKTIKEVVGDKWSEELNTAWTIAYDEla...iIIKKEMKDAA
GGZLB   ...VGQELLGAIKEVLGDAATDDILDAWGKaygviadvfiqvEADLYAQAVE

HMMER Alignment of Globin Sequences

        xxxxxxxxxxx.xxxxxxxxx.xxxxx...xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU    V.LSPADKTN..VKAAWGKVG.AHAGE...YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR    M.LTDAEKKE..VTALWGKAA.GHGEE...YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK    V.LSAADKTN..VKGVFSKIG.GHAEE...YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU    VHLTPEEKSA..VTALWGKVN.VDEVG...G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR    VHLSGGEKSA..VTNLWGKVN.INELG...G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK    VHWTAEEKQL..ITGLWGKVNvAD.CG...A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU    G.LSDGEWQL..VLNVWGKVE.ADIPG...HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR    G.LSDGEWQL..VLKVWGKVE.GDLPG...HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB   M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI  A.LTEKQEAL..LKQSWEVLK.QNIPA...HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL    GVLTDVQVAL..VKSSFEEFN.ANIPK...N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB   M.L.DQQTIN..IIKATVPVLkEHGVT...ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

        xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU    QVKGH.GKKVADA.LTN...AVA.HVDDMPNA...LSALS.D.LHAHKL...RVDPVNF.KLLSHCLL
HAOR    QIKAH.GKKVADA.L.S...TAAGHFDDMDSA...LSALS.D.LHAHKL...RVDPVNF.KLLAHCIL
HADK    QIKAH.GKKVAAA.LVE...AVN.HVDDIAGA...LSKLS.D.LHAQKL...RVDPVNF.KFLGHCFL
HBHU    AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL...HVDPENF.RL.LGNVL
HBOR    AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL...HVDPENFNRL..GNVL
HBDK    AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL...HVDPENF.RL.LGDIL
MYHU    EMKASeDLKKHGA.TVL...TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR    EMKASaDLKKHGG.TVL...TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB   T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA...R.GVSAA.QF..GEFR
GPUGNI  NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK...N.KITDP.HF.EVMKG
GPYL    NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS...K.GVVDA.HF.PVVKE
GGZLB   Q...PKALAM.TVL...AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

        xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU    VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR    VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK    VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU    VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR    IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK    IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU    QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR    HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB   TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI  ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL    AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB   EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Meta-MEME Alignment of RH Sequences

        ...xxxxxxxx...xxxxxxxx
HTLV-II ldt...apclfsdgspqkaayvlwdqtilqqd...itplpshethsaqkgellalicg
SRV-I   lnn...allvftdgsstgmaaytladtti...kfqtnlnsaqlvelqaliav
RSV     pvp...gptvftdasssthkgvvvwregprw...eikeiadlgasvqqlearavama
HIV-II  ipg...aetfytdgscnrqskegkagyvtdrg...kdkvkkleqttnqqaeleafama
MoMLV   pda...dhtwytdgssllqegqrkagaavttet...eviwakaldagtsaqraelialtqa
Ingi    pre...hyklwtdgsvslgeklgaaallhrnntl...icapktgagelscsyraecvaleig
CAMV    pee...kliietdasddywggmlkaikinegtntelicryasgsfkaaeknyhsndketlavint
17.6    ftk...kftlttdasdvalgavlsqdghplsyi...srtlneheinystiekellaivwa
MAUP    fnnstnlqepsdsrllyrkgswvnirfaay...lysklseekhglvpk
HBV     rpg...lcqvfadatptgwglvmghqrmr...gtfsaplpihtaellaacfa
Copia   fen...kiigyvdsdwagseidrksttgylfkmfdf.nlicwntkrqnsvaassteaeymalfea
E.coli  mlk...qveiftdgsclgnpgpggygailryrg...rektfsagytrttnnrmelmaaiva

        x...xxxxxxxxxx...
HTLV-II Lraak...pwpsLNIFLDSKYLikylhslaigaflgtsahqtlqaalp...
SRV-I   Lsafp...nqpLNIYTDSAYLahsiplletvaqikhisetaklflqcqq...
RSV     Lllwp...ttpTNVVTDSAFVakmllkmgqegvpstaaafiledal...
HIV-II  Ltds...gpkVNIIVDSQYVmgisasqpteseskivnqiie...
MoMLV   Lkmae...gkkLNVYTDSRYAfatahihgeiyrrrglltsegkeiknkdeil...
Ingi    Lqrllkwlp...ryrstpsrLSIFSDSLSMltalqtgplavtdpilrrlwrll...
CAMV    Ikkfsiy...ltpvhFLIRTDNTHFksfvnlnykgdsklgrnir...
17.6    Tktfrhy...llgrhFEISSDHQPLswlyrmkdpnskltrwr...
MAUP    Flek...lreINFALDKVDVteidsklsrlmkfsvsaaydevgtlalkslfkfrnseres
HBV     Rsrs...gaNIIGTDNSVVlsrkytsfpwllgcaanwilrgtsfvyvpsa...
Copia   VrealwlkflltsiniklenpIKIYEDNQGCisiannpschkrakhidiky...
E.coli  Lealk...ehceVILSTDSQYVrqgitqwihnwkkrgwktadkkpvknvd...

        ...xxxxxxxxx...
HTLV-II ...pllqgktiylhhvrshtnlpdpistfNEYTDSLILapl...
SRV-I   ...liynrsipfyighvrahsglpgpiahgNQKADLATKtvasn...
RSV     ...sqrsamaavlhvrshsevpgfftegNDVADSQATfqay...
HIV-II  ...emikkeaiyvawvpahkgiggNQEVDHLVSqgirqvl...
MoMLV   ...allkalflpkrlsiihcpghqkghsaeargNRMADQAARkaaitetpdtstll...
Ingi    ...lqvqrrkirirlqfvfdhcgvkrNEVCDEMAKkaadlpql...
CAMV    ...wqawlshysfdvehikgtdNHFADFLSRefnkvns...
17.6    ...vklsefdfdikyikgkeNCVADALSRikleety...
MAUP    ikasfkqlrengkiaefsearrlwfeilklirldlfnasslacddllshlqdrrsi...
HBV     ...lnpaddpsrgrlglsrpllrlpfrpttgrtSLYADSPSVpshlpdrvh...
Copia   ...hfareqvqnnvicleyipteNQLADIFTKplpaarfve...
E.coli  ...lwqrldaalgqhqikwewvkghaghpeNERCDELARaaamnptledtgyqvev

HMMER Alignment of RH Sequences

#=RF    xxxxxxxxxxxxxx.xxxxxxx..xxxxx.xxxx..x...xxxx...xxxxxxxx...xxxxx..x
HTLV-II LDTAPCLFSDGSPQ.KAAYVLW..DQTIL.QQDI..T...PLP...SH.ETHS...AQ...K
SRV-I   LNNALLVFTDGSS...TGMAA...YTL.A..D..T...TIKF...QT.NLNS...AQ...L
RSV     PVPGPTVFTDASSS.THKGVVV...W.R.E.GP..R...WEIK...EIADLGA...SVQQ...
HIV-II  IPGAETFYTDGSCN.RQSKEGK..AGY.V.T.DR..G...KDKV...KKLEQTT...NQQ...
MoMLV   PDADHTWYTDGSSL.LQEGQRK..AGAAV.T.TE..T...EVIW...AK.ALDAG...TSAQ...R
Ingi    PREHYKLWTDGSVS.LGEKLGA...AALL.H.RN..N...TLIC...AP.KTGA...GELSCsyR
CAMV    PEEKLIIETDASDD.YWGGMLK..AIKIN.EGTN..T...ELICryasgsfkAA.EKNY...HSND...
17.6    FTKKFTLTTDASDVaLGAVLSQ..DGHPL.S.YIs.R...TLN...EH.EINY...STI...E
MAUP    FNNSTNLQEP.SDS.RLLYRKG...SWVN.I.RFaaY...LYSK...LSEEKHGLvpkfLEKL...R
HBV     RPGLCQVFADATP..TGWGLVM...GHQR.M..R..G...TFSA...PL.PIHT...AELL...
Copia   FENKIIGYVDSDWA.GSEIDRKstTGYLFkMFDF..N...LICW...NTKRQNSVa...ASST...E
E.coli  MLKQVEIFTDGSCL.GNPGPGG..YGAIL.R.YR..GrekTFSA...GYTRTTNNr...MELM...

#=RF    xxxxxxx...xxxxxx.xxxxxxx.xxx.xxxxxxx..xxxx.xxxxxxxxxxxx...xxxxxxxx..
HTLV-II GELLAL...ICGLRAaKPWPSL..NI..FLD.S...KYLI.KYL.HSLAIGAFl...GTSAHQ...
SRV-I   VELQAL...IAVLSA.FPNQPL..NI..YTD.S...AYLA.HSI.PLLETVAQikhiSETAKL...
RSV     LEARAV...AMAL...LLWPTTP.TNV.VTDSAF...VAKM.LLK.MGQEGV.P...STAAAF...
HIV-II  AELEAF...AMAL...TDSGPKV.NI..IVD.S...QYVM.GIS.ASQPTESE...SKIVNQ...
MoMLV   AELIALTqa...lKMAEGK.KLNVYTD.SRYaFATAHI...HGEI.YRR.RGLLTS.E...GKEIKN...
Ingi    AECVALE...IGLQRL.LKWLPRYrST..PSRLSIFs.DSLS.MLT.ALQTGPLA...VTDPIL...
CAMV    KETLAV...INTI.K.KFSIYL..TPVhFLI.R...TDNT.HFK.SFVNLN.Y...KGDSKL...
17.6    KELLAI...VWAT.K.TFRHYLL.GRH.FEISSD...HQPL.SWL.YR.M.KD...PNSKL...
MAUP    EINFALDkvdvteIDSKLS.RLMKFSV.SA..AYDEVGTl.ALKS.LFKFRNSERESI...KASFKQLRen
HBV     AACFARSr...SGANIIgTDNSVVL.SR..KYTSFPWllGCAA.NWILRGTSFV.Y...VPSALNPA..
Copia   AEYMAL...FEAVREaLWLKFLL.TS..INIKL...ENPI.KIYEDNQGCISIa..nNPSCHKRAk.
E.coli  AAIVALE...ALKEHC.EVILSTD.SQ..YVRQGI...TQWIhNWKKRGWKTADK...KPVKNV...

#=RF    ...xxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxx..xxxxxxxxxxxx...xxxx.xxx...xxxx
HTLV-II ...TLQA..ALPPLLQGKT...IYLH.HVRSHT...N.LPDPISTFN...EYTDsLIL...APL.
SRV-I   ...FLQCQ.QLIYNRSIPF...YIGH.V.RAHS...G.LPGPIAHGN...QKAD.LATktvASN.
RSV     ...ILED..ALS..QRSAM...AAVLH.V.RSHSEVPgfFT.EGNDVADS...QAT..F...QAY.
HIV-II  ...IIEEM.IKK..EAIYV...AWVPA.H.KGIGG...NQEV.D.HLV...SQG...IR...QVL.
MoMLV   ...KDEIL.ALLKALFLPKRlSIIHCPgHQKGHSAEAr.GN.RMADQAARKaaiTET..PDT...STLL
Ingi    ...RRLWR.LLLQVQRRKI...RIRLQ.FVFDHCGVK..RN.EVCD.EMA...KKA..ADL...PQL.
CAMV    ...GRNIRWQAW..LSHYS...FDVEH.I.KGTD...N.HFAD.FLS...REF..N...KVNS
17.6    ...TRWRV.KLS..EFDFD...IKY..I.KGKE...N.CVAD.ALSR...IKL..E...ETY.
MAUP    gkiaefsear.rlw.feilkl...irld.l.fnas...s.lacd.dll...shl..qdr...rsi.
HBV     ...DDPSRGRLG..LSRPL...LRLP.F.RPTTGR...TSLYAD.SPSV...PSHL.PD...RVH.
Copia   ...hIDIKYHFAR.EQVQNN...VICLE.Y.IPTE...N.QLADIFTK...PLP..AAR...FVE.
E.coli  ...DLWQRLDAALGQHQIKW.EWVKGH.AGHPENER...CDELAR.AAAM...NPTL.EDTg.yQVEV

Alignment Scores

Globins    1     2     3     4     5     Total
META-MEME  12    11    11*   11*   12*   57
HMMER      11    12*   11*   10*   11*   55
AMULT      12    12    12    12    12    60
ASSEMBLE   12    11    12    12    12    59
CLUSTALV   12    11    12    12    12    59
DFALIGN    12    12    12    12    12    60
GENALIGN   11*'  12    12    10'   11    56
MULTAL     12    11    12    12    12    59
MACAW      9     11    9     8     8     45
PIMA       12    12    12    12    12    60
PRALIGN    8     8*    9*    8*    10    43

Kinases    1     2     3     4     5     6     7     8     Total
META-MEME  12    11    10    12    12    12    12    11    92
HMMER      12    12*   12*   12    12    12    12    12    96
AMULT      12    10    11    12    12    12    12    12    93
ASSEMBLE   11    7     10    12    12    12    12    12*   87
CLUSTALV   12    11    11*   12    12    12    12    12*   94
DFALIGN    12    12    12    12    12    12    12    12    96
GENALIGN   12'   9*    10    12    12    12    12*   11*   90
MULTAL     12    9*    10*   12    12    12*   12    12    91
MACAW      8     0     9     12    12    10    12    0     63
PIMA       12    11    11    12    12    12    12    12    94
PRALIGN    12    10*   6*    4     9*    9*    4     4     58

Proteases  1     2     3     Total
META-MEME  12    4*'   9*    25
HMMER      11    8     6*    25
AMULT      11    7     10    28
ASSEMBLE                     0
CLUSTALV   12    9*    6*    27
DFALIGN    12    12*   12    36
GENALIGN   11    8*'   7*    26
MULTAL     10    7*    9*    26
MACAW      12    4     8     24
PIMA       12    5*    5*    22
PRALIGN    8*    4*    8*    20

RHs        1     2     3     4     Total
META-MEME  11    9     11    12    43
HMMER      11    10*   5*'   7*    33
AMULT      11    9*    8*    7*    35
ASSEMBLE                           0
CLUSTALV   12    9     9*    9*    39
DFALIGN    12    12    10    12    46
GENALIGN   12*'  7     8*'   9*'   36
MULTAL     11*   11*   9*    10    41
MACAW      7     5     7     3     22
PIMA       10    9     8*    11*   38
PRALIGN    9     8*    6*    3     26

Scores represent the number of sequences in which each motif was correctly aligned. Scores for non-HMM methods are from (McClure et al. 1994). A * indicates that the motif was correctly aligned in two or more misaligned subsets of the test sequences. A ' indicates that a gap was inserted into the motif.
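To make the scoring concrete, here is a small Python sketch of one plausible way to count the sequences in which a motif is aligned consistently. It is a simplification and an assumption, not the procedure of McClure et al.: it takes the known start of the motif in each ungapped sequence, finds the alignment column that residue lands in, and reports the size of the largest group of sequences sharing one column. The `*` and `'` annotations in the tables are not modeled, and the function names are hypothetical.

```python
from collections import Counter


def motif_start_column(aligned_row, motif_start, gap_chars=".-"):
    """Return the alignment column that holds residue number motif_start (0-based)."""
    residue_index = -1
    for column, char in enumerate(aligned_row):
        if char not in gap_chars:
            residue_index += 1
            if residue_index == motif_start:
                return column
    raise ValueError("motif_start is beyond the end of the sequence")


def motif_score(aligned_rows, motif_starts):
    """Count the sequences whose motif starts in the most common column."""
    columns = [
        motif_start_column(row, start)
        for row, start in zip(aligned_rows, motif_starts)
    ]
    return Counter(columns).most_common(1)[0][1]


# Toy example: the motif "GHK" starts at residue 2 in all three sequences,
# but the third row places it one column to the left of the other two.
rows = ["AC.GHK", "A.CGHK", "ACGHK."]
score = motif_score(rows, [2, 2, 2])
```

In the toy example two of the three rows agree on the motif column, so `score` is 2. With twelve test sequences per family, a perfectly aligned motif would score 12 under this reading.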

Results

           Globins  Kinases  Proteases  RHs  Total
DFALIGN    60       96       36         46   238
CLUSTALV   59       94       27         39   219
META-MEME  57       92       25         43   217
MULTAL     59       91       26         41   217
AMULT      60       93       28         35   216
PIMA       60       94       22         38   214
HMMER      55       96       25         33   209
GENALIGN   56       90       26         36   208
MACAW      45       63       24         22   154
PRALIGN    43       58       20         26   147
ASSEMBLE   59       87       0          0    146

Meta-MEME ranks third overall, tied with MULTAL at 217. MEME discovers 19 of the 20 motifs in the test set. Only DFALIGN (Feng & Doolittle, 1987) significantly outperforms other methods.
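The overall totals and ranking can be reproduced from the per-family totals with a short Python script. The numbers are copied from the results table; the variable names and the tie-handling rule (a method's rank is one plus the number of methods with a strictly higher total) are choices made for this sketch.

```python
# Per-family totals: (globins, kinases, proteases, RHs), from the results table.
family_totals = {
    "DFALIGN": (60, 96, 36, 46),
    "CLUSTALV": (59, 94, 27, 39),
    "META-MEME": (57, 92, 25, 43),
    "MULTAL": (59, 91, 26, 41),
    "AMULT": (60, 93, 28, 35),
    "PIMA": (60, 94, 22, 38),
    "HMMER": (55, 96, 25, 33),
    "GENALIGN": (56, 90, 26, 36),
    "MACAW": (45, 63, 24, 22),
    "PRALIGN": (43, 58, 20, 26),
    "ASSEMBLE": (59, 87, 0, 0),
}

# Sum the four family totals for each method.
overall = {method: sum(scores) for method, scores in family_totals.items()}


def rank(method):
    """Competition rank: 1 + number of methods with a strictly higher total."""
    return 1 + sum(1 for total in overall.values() if total > overall[method])


# Print methods from best to worst; ties share a rank.
for method in sorted(overall, key=lambda m: (-overall[m], m)):
    print(f"{rank(method):2d}  {method:<10s} {overall[method]}")
```

Under this rule both META-MEME and MULTAL receive rank 3, behind DFALIGN and CLUSTALV.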

Discussion

Meta-MEME's focus on the common core allows accurate models to be trained from fewer sequences than are required by standard HMMs. By focusing its models on highly conserved regions of the training set, Meta-MEME effectively ignores noisy portions of the data, thereby allowing the software to properly align distant homologs. Meta-MEME alignments could be extended by running another alignment program on the unaligned, non-motif regions. A server for MEME and MAST, as well as the source code for Meta-MEME, is available at http://www.sdsc.edu/meme.
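As a rough illustration of that last suggestion, the following Python sketch pulls the unaligned regions out of one row of a Meta-MEME alignment so that they could be handed to a second alignment program. It assumes the case convention visible in the alignments above (upper case for residues inside motifs, lower case for unaligned residues, `.` for padding); the function name is hypothetical and the choice of downstream program is left open.

```python
import re


def unaligned_regions(row):
    """Return (start, residues) for each lower-case run in an alignment row.

    start is the 0-based position of the run in the ungapped sequence.
    """
    ungapped = row.replace(".", "")
    return [(m.start(), m.group()) for m in re.finditer(r"[a-z]+", ungapped)]


# Second block of the HAHU row from the Meta-MEME globin alignment.
regions = unaligned_regions("...lshGSAQVKGHGKKVadalt...")
# regions == [(0, "lsh"), (15, "adalt")]
```

Collecting the corresponding regions from every sequence (for instance, everything between the same pair of motifs) would give the input for the secondary alignment.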

References

Bailey, T. L. and C. P. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. 2nd ISMB, 1994.
Chothia, C. and A. M. Lesk. The relation between the divergence of sequence and structure in proteins. EMBO, 5:823-826, 1986.
Eddy, S. R. Multiple Alignment Using Hidden Markov Models. 3rd ISMB, 114-120, 1995.
Feng, D.-F. and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25:351-360, 1987.
Grundy, W. N., T. L. Bailey, C. P. Elkan and M. E. Baker. Meta-MEME: Motif-based hidden Markov models of protein families. CABIOS, 1997. To appear.
Grundy, W. N., T. L. Bailey, C. P. Elkan and M. E. Baker. Hidden Markov model analysis of motifs in steroid dehydrogenases and their homologs. BBRC, 231(3):760-766, 1997.
Krogh, A., M. Brown, I. Mian, K. Sjolander and D. Haussler. Hidden Markov Models in Computational Biology: Applications to Protein Modeling. JMB, 234:1501-1531, 1994.
McClure, M. A., T. K. Vasi and W. M. Fitch. Comparative analysis of multiple protein-sequence alignment methods. Mol. Bio. Evol., 11(4):571-592, 1994.
McClure, M. A. and R. Raman. Parametrization studies of hidden Markov models representing highly divergent protein sequences. 28th HICSS, 184-193, 1995.
Sonnhammer, E. and D. Kahn. The modular arrangement of proteins as inferred from analysis of homology. Protein Science, 3:482-492, 1994.