Statistical modelling and alignment of protein sequences


Martin Weigt
Laboratoire de Biologie Computationnelle et Quantitative, Université Pierre et Marie Curie, Paris
ENS Paris, 11 July 2016

What is the information in
-LNQFADDLAHELRTPVNILLGKNQVMLS-QERSAEEYQQALVDNIEELEGLSRLTENILFLARAEH-
ALGELTAGIAHEINNPTAVILGNTELIRFLGADASRV-EEEIDAILLQIERIRNITRSLLQYSRQG--
SQRQFVTNASHELKTPIAIISANTEVLEI----TMGK-NQWTETILKQVKRLSGLVNDMVALAKLEE-
---AFVSNASHELRTPVTSIKGFAETIKG-MSAEEEAKDDFLDIIYKESLRLEHIVEHLLTLSKAQ--
-VGQLTGGIAHDFNNMLTGVIGSLDLIKLS----GRLVERFMDAALISAQRAASLTDRLLAFSRRQS-
---RMTHQVSHEVGNMIGIITGSLGLLERETGFNDRQ-KRHIARIRKAADRGRSLASSMLTIGS----
ALGEMLDHIAHQWKQPINSISLIAQDMADYGELTDGDVQTTIDKIMSLLEHMSQTVDVFRGFYR----
-VGRLAGGVAHDFNNLLSVINGYCEMLAA-QVSDRPQALREVSEIHRAGLRAAGLTRQLLAFGRRQ--
SLGELAAGVAHEINNPNAVILLNVDLVKKWSEMSEEL-PLLLTEMEEGAGRIKRIVDDLKDFARGD--
-MGEFAAYIAHEINQPLSAIMTNANAGTRNEPSNIPEAKEALARIIRDSDRAAEIIRMVRSFLKRQ--
--GQLAGGIAHDFNNILQIISGNTQILQYQTNPDPP----QLLEILKAVERGTALTRSMLAFSRKQT-
--GQLTGGIAHDFNNLLQVILGNLEFVRAKLDGDAK-LQTRIERAAWAAQRGATLTGQLLAFARKQ--
AKTDFLSNMSHEIRTPLNAILGFIQVLKD-AEMKPKD-REYLELMDESSKNLLSLVNDIIEIDLIESG
--GREVLHLVHDLKTPLATIEGLVSLMET-RWPDPKM-QEYCQTIYGSITSMSKMVSEILY-------
-RARLLADVAHELRTPVATLTGYLEAVEDVRPLDAST----IAVLRDQAVRLTRLAQDLADVTHAEGG
SMKRMLTNMSHDLKTPLTVILGYIETIQSDPNMPDEERERLLGKLRQKTNELIQMINSFFDLAKLES-
AKSEFLANMSHELRTPLNAIIGFSEMIQAFGPLGSDRYEEYINDIHTSGNFLLNVINDILDMSKIEAG
-MQRFIADATHQLRTPLAAIDAEVELLTD-QTRDPKA----LDKLRGRIADLARLASQLLDHAM----
-RKKAVHTITHELRTPLTAITGYAGLIRK-EQCEDKS-GQYIQNILQSSDRMRDMLNTLLDFFRLDNG
-REEFMNMTSHELMNPLSAAVQAAHTMISLHDDNSKSNIEIAKIILACGEHQQKLVEDARMMSKLD--
-KSRYVVGLSHELRSPLNAISGYAQLLEQDTSLAPKP-RDQVRVVRRSADHLSGLIDGILDISKIEAG
----AFSYMRHAINNPLSGMLYSRKALKN-TDLNEEQ-MRQIHVSDNCHHQLNKILADL---------
-QENFIDMTSHEMRNPLSAILQCSDEITST------LCLEAANTIALCASHQKRIVDDILTFSKLDS-
SQRTLTNAIAHDLRQPLYRIRFALEMFND-SLLSIEQRQQYRQSIENSLRDLDHLINQSLQLSRYT--
--KLLLLSLSHDIKTPLSAIKLNAKALSRLYKDAEKQ-REAAEHINARADEIENFVSRITKASSE---
--HAFIADAAHELRTPLTALKLQLQLTER---ATSDVREVGFVKLNERLDRSIHLVKQLLTLARSES-
-QKNFISNASHELNTPLTSIIVTADLALS-KQRTDEEYRTALSRIMDAAGHLE---------------
-RGALLTSISHDLRTPLASILGATSSLESGEELDENARKELLSTIHDEADRLNRFVANLLDMTRLEAG
-KSEFLANMSHELRTPLNGVIGFTRLTLK-TELTPTQ-RDHLNTIERSANNLLAIINDVLDFSKLEAG
AKSEFLANMSHDIRTPMNAITGMTAIATA-HIDDPKQVKNCLRKIALSSRHLLGLINDVLDMSKIESG
-LSQFSADLAHDFRTPLANLIGQTEVTLA-HPRSAEEYRAVLESSLEEYARLSRMIEDMLFLARADH-
SKSMFLATVSHELRTPLYGIIGNLDLLQT-KELPKGV-DRLVTAMNNSSSLLLKIISDILDFSKIES-
AKTAFLATLSHEIRTPMNGVLGTAQILLK-TPLSTEQ-EKHLKSLYDSGDHMMTLLNEILDFSKIEQG
SKKQLIDGIAHELRTPLVRLRYRLEMSEN---LTPPE----SQALNRDIGQLEALIEELLTYARLDR-
-KTQFFINTAHDIRTPLTLIKAPLEELLEEETLTDNG-ITRTNIALRNVEVLLRLVSNLINFERT---
...?

Sequence data are accumulating
[Figure: growth of the UniProt database from 2004 to 2016, in millions of sequence entries (logarithmic axis, 0.1 to 100): UniProtKB/TrEMBL (without manual annotation) and UniProtKB/SwissProt (with manual annotation)]

Proteins can be classified into families
Families of homologous proteins share
- common evolutionary ancestry
- conserved structure and function
- diverged sequences (20-30% sequence identity)
Questions:
- Can we identify and align homologous proteins?
- Can we extract a family-specific signal from the alignment?
- What are the underlying principles relating protein evolution and protein structure / function?

Proteins can be classified into families
Pfam 29.0 (2015) vs. 30.0 (2016):
- 16295 vs. 16306 families (22 new, 11 deleted)
- 116 domains of unknown function (DUF) newly annotated (with >3750 remaining unknown)
- 11.9 million proteins vs. 17.7 million proteins
- families contain protein domains

Domains as modular building blocks
domains = structural and functional modules [Casino et al. 09]

Pfam provides multiple-sequence alignments
[Figure: the same multiple-sequence alignment as on the opening slide "What is the information in ...?"]

Proteins can be classified into homologous families
If we assign a sequence to a family, we can predict its structure and function

Aligning two sequences
How to compare / align two amino-acid sequences $(a_1, \dots, a_{L_a})$ and $(b_1, \dots, b_{L_b})$?
- take inspiration from evolution; underlying evolutionary processes: mutation, insertion, deletion
- assume independent evolution of distinct positions for simplicity
Two ingredients:
- similarity between amino acids
  - based on physico-chemical properties
  - based on pre-existing sequence alignments: substitution matrix (e.g. BLOSUM) from frequency counts of aligned positions,
    $$S(a, b) = \log \frac{f(a, b)}{f(a)\, f(b)}$$
- gap penalty for a gap of length $k$, i.e. $a_{i+1}, \dots, a_{i+k}$ left unaligned between the aligned pairs $(a_i, b_j)$ and $(a_{i+k+1}, b_{j+1})$
  - affine gap penalty $d + (k-1)\, e$ with $d > e > 0$ (gap opening more costly than gap extension)
Total alignment score = sum of substitution scores - sum of gap penalties
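The two ingredients above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the aligned residue pairs in `toy_pairs` are invented, the function names are hypothetical, and real substitution matrices such as BLOSUM are built from large curated alignments with additional clustering steps that are not reproduced here.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """S(a, b) = log[f(a, b) / (f(a) f(b))] from counts of aligned residue pairs."""
    pair_counts, single_counts = Counter(), Counter()
    for a, b in aligned_pairs:
        # Count every pair in both orders so that S(a, b) == S(b, a).
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        single_counts[a] += 1
        single_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_single = sum(single_counts.values())
    return {
        (a, b): math.log((c / n_pairs) /
                         ((single_counts[a] / n_single) * (single_counts[b] / n_single)))
        for (a, b), c in pair_counts.items()
    }

def affine_gap_penalty(k, d, e):
    """Penalty d + (k - 1) e for a gap of length k."""
    return d + (k - 1) * e

def alignment_score(row_a, row_b, scores, d, e):
    """Sum of substitution scores minus affine gap penalties of a given alignment."""
    total = 0.0
    gap_row = None  # row that holds the gap run currently being extended
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Opening a gap costs d, every further position of the same run costs e.
            total -= e if row == gap_row else d
            gap_row = row
        else:
            total += scores[(x, y)]
            gap_row = None
    return total

# Invented toy data: ten aligned residue pairs.
toy_pairs = [("A", "A")] * 6 + [("L", "L")] * 3 + [("A", "L")]
S = log_odds_scores(toy_pairs)
print(alignment_score("AL-A", "ALLA", S, d=10, e=1))
```

Frequently co-aligned pairs get positive scores and rarely co-aligned pairs negative ones, which is what lets the total score separate plausible from implausible alignments.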

Needleman-Wunsch algorithm (1970)
- global alignment: maximise the total alignment score by dynamic programming
- iterative construction of the alignment score $F(i, j) = \mathrm{Score}(a_1, \dots, a_i;\ b_1, \dots, b_j)$
- initialisation: $F(0, 0) = 0$
- recursion by adding two aligned amino acids, or one amino acid and one gap (written here for a linear gap penalty $d > 0$ per gap position), until $F(L_a, L_b)$ is reached:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + S(a_i, b_j) & \text{adding } a_i \text{ aligned with } b_j \\ F(i-1, j) - d & \text{adding } a_i \text{ against a gap} \\ F(i, j-1) - d & \text{adding } b_j \text{ against a gap} \end{cases}$$
- traceback: follow backwards the path leading from $(0, 0)$ to $(L_a, L_b)$
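A minimal sketch of the Needleman-Wunsch recursion and traceback, assuming a linear gap penalty d > 0 that is subtracted for every gap position, boundary values F(i, 0) = -i d and F(0, j) = -j d, and a toy match/mismatch score instead of a real substitution matrix. The function names are hypothetical.

```python
def toy_score(x, y):
    """Assumed toy substitution score: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1

def needleman_wunsch(a, b, score=toy_score, d=2):
    la, lb = len(a), len(b)
    # F[i][j] = best score of aligning a[:i] with b[:j]
    F = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        F[i][0] = -d * i
    for j in range(1, lb + 1):
        F[0][j] = -d * j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # a_i aligned with b_j
                F[i - 1][j] - d,                              # a_i against a gap
                F[i][j - 1] - d,                              # b_j against a gap
            )
    # Traceback from (la, lb) to (0, 0); exact comparison is safe for integer scores.
    out_a, out_b = [], []
    i, j = la, lb
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[la][lb], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("STAR", "STR"))
```

Filling the table costs O(L_a L_b) time and memory, which is fine for a pair of sequences but motivates the heuristics used for database searches.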

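Smith-Waterman algorithm (1981)
- local alignment: find similar sub-sequences (e.g. common domains)
- reset negative scores to zero (written here for a linear gap penalty $d > 0$ per gap position):
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + S(a_i, b_j) & \text{adding } a_i \text{ aligned with } b_j \\ F(i-1, j) - d & \text{adding } a_i \text{ against a gap} \\ F(i, j-1) - d & \text{adding } b_j \text{ against a gap} \\ 0 & \text{restart local alignment} \end{cases}$$
- traceback: start from the maximal score and trace back until a zero score is hit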
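The local variant differs from the global one only in the zero option and in where the traceback starts and stops. A minimal sketch, assuming a linear gap penalty d > 0, a toy match/mismatch score, and hypothetical function names:

```python
def toy_score(x, y):
    """Assumed toy substitution score: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1

def smith_waterman(a, b, score=toy_score, d=2):
    la, lb = len(a), len(b)
    F = [[0] * (lb + 1) for _ in range(la + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            F[i][j] = max(
                0,                                            # restart local alignment
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback: start at the maximal score, stop as soon as a zero is hit.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("XXHELRTPYY", "QQQHELRTPW"))
```

The dissimilar flanks of the two toy sequences are ignored, and only the shared segment is reported.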

BLAST (Altschul et al. 1990): Basic Local Alignment Search Tool
- dynamic programming is too slow when searching one sequence against a large sequence database (e.g. UniProt)
- heuristic speedup, based on the idea that alignments typically contain highly similar subsequences:
  - construct all 3-letter subsequences from the query sequence
  - construct a list of similar 3-letter sequences
  - locate them in the search database
  - extend the alignment around hits
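The seed-and-extend idea can be sketched as follows. This is a toy illustration and not the BLAST implementation: the match/mismatch score, the neighbourhood threshold, the drop-off rule for the ungapped extension, and all function names are assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(x, y):
    """Assumed toy substitution score: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1

def neighbourhood(word, threshold, score=toy_score):
    """All words of the same length that score at least `threshold` against `word`."""
    return {
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    }

def index_words(seq, k=3):
    """Map every k-letter word of `seq` to the list of its start positions."""
    index = {}
    for pos in range(len(seq) - k + 1):
        index.setdefault(seq[pos:pos + k], []).append(pos)
    return index

def extend_hit(query, target, q_pos, t_pos, k=3, score=toy_score, drop=3):
    """Ungapped extension of a seed in both directions.

    Each side stops once the running score falls `drop` below its best value.
    Returns (score, query start, query end) of the extended segment.
    """
    def extend(step, q, t):
        run = best = best_len = n = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            run += score(query[q], target[t])
            n += 1
            if run > best:
                best, best_len = run, n
            if best - run >= drop:
                break
            q += step
            t += step
        return best, best_len

    seed = sum(score(query[q_pos + i], target[t_pos + i]) for i in range(k))
    right, r_len = extend(+1, q_pos + k, t_pos + k)
    left, l_len = extend(-1, q_pos - 1, t_pos - 1)
    return seed + left + right, q_pos - l_len, q_pos + k + r_len

def search(query, target, k=3, threshold=3):
    """Best extended hit of `query` in `target`, or None if there is no seed."""
    index = index_words(target, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for word in neighbourhood(query[q_pos:q_pos + k], threshold):
            for t_pos in index.get(word, []):
                hits.append(extend_hit(query, target, q_pos, t_pos, k))
    return max(hits, default=None)

print(search("WWHELRTPWW", "CCCHELRTPCC"))
```

The word index is built once per database sequence, so most of the database is never touched by the expensive extension step.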

Multiple-sequence alignments
How to align $M$ sequences $(a^1_1, a^1_2, \dots, a^1_{L_1})$, $(a^2_1, a^2_2, \dots, a^2_{L_2})$, ..., $(a^M_1, a^M_2, \dots, a^M_{L_M})$?
- dynamic programming: exact, but time $O(L_1 L_2 \cdots L_M)$
- need heuristic methods for up to $10^6$ sequences
- basic idea (Feng & Doolittle 1987):
  - organise data hierarchically
  - align closest sequences first
  - align alignments when proceeding into the tree
  - possibly iteratively refined

Multiple-sequence alignments
[Figure: guide tree over the sequences 1, 2, 3, 4]
- $A_{34} = \mathrm{align}(a^3, a^4)$, $A_{12} = \mathrm{align}(a^1, a^2)$, $A_{1234} = \mathrm{align}(A_{12}, A_{34})$
- need to align alignments, e.g. aligning
  STAR
  STIR
  SKAT
  and
  PIT
  PIG
  gives
  STAR
  STIR
  SKAT
  P-IT
  P-IG
- insertion of a column of gaps into the input alignments
- substitution score for two columns = sum over pairwise substitution scores, e.g. for the last column: S(R,T) + S(R,G) + S(R,T) + S(R,G) + S(T,T) + S(T,G)
- standard pairwise alignment algorithms can be used
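The column-against-column score used when aligning two alignments can be sketched like this. The toy match/mismatch score, the choice of scoring residue-gap pairs as 0, and the function names are assumptions for illustration only.

```python
def toy_score(x, y):
    """Assumed toy substitution score: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1

def columns(alignment):
    """Turn a list of equally long rows into a list of columns."""
    return list(zip(*alignment))

def column_score(col_a, col_b, score=toy_score, gap_score=0):
    """Sum of pairwise scores between the residues of two alignment columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            total += gap_score if "-" in (x, y) else score(x, y)
    return total

first = columns(["STAR", "STIR", "SKAT"])
second = columns(["PIT", "PIG"])
# Last columns: (R, R, T) against (T, G), i.e. the six pairs listed above.
print(column_score(first[-1], second[-1]))
```

With such a column score in place of S(a, b), a pairwise dynamic-programming routine can treat each input alignment as a sequence of columns.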

What is the information in
[the same multiple-sequence alignment as on the opening slide]
...?

Profile models
Sequence profiles assume independent residue positions:
$$P(A_1, \dots, A_L) = \prod_{i=1}^{L} f_i(A_i)$$
Information in a column = amino-acid conservation score:
$$I_i = \log_2(21) + \sum_{A} f_i(A) \log_2 f_i(A)$$
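Both formulas are easy to evaluate on a small alignment. The sketch below assumes plain frequency counts without pseudocounts or sequence reweighting, a 21-letter alphabet (20 amino acids plus the gap), and an invented four-sequence toy alignment; the function names are hypothetical.

```python
import math
from collections import Counter

Q = 21  # 20 amino acids + gap symbol

def column_frequencies(alignment):
    """f_i(A): frequency of symbol A in column i, one dict per column."""
    n = len(alignment)
    return [{a: c / n for a, c in Counter(col).items()} for col in zip(*alignment)]

def conservation(freqs):
    """I_i = log2(21) + sum_A f_i(A) log2 f_i(A), in bits, for every column."""
    return [math.log2(Q) + sum(f * math.log2(f) for f in col.values()) for col in freqs]

def profile_probability(seq, freqs):
    """P(A_1, ..., A_L) = prod_i f_i(A_i) under the independent-site profile model."""
    p = 1.0
    for a, col in zip(seq, freqs):
        p *= col.get(a, 0.0)  # unseen symbols get probability 0 without pseudocounts
    return p

toy_alignment = ["AC", "AD", "AE", "AC"]  # invented data
f = column_frequencies(toy_alignment)
print(conservation(f))
print(profile_probability("AC", f))
```

A fully conserved column reaches the maximum of log2(21) bits, while a column with uniform frequencies over all 21 symbols scores 0.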

Profile Hidden Markov Models (pHMM)
S. Eddy - HMMER
- D: amino-acid deletion
- M: amino-acid match
- I: amino-acid insertion
- parameters (transition & emission probabilities) inferred from the seed alignment
- alignment of a query sequence to the pHMM = path from START to END (e.g. seq. HMMPATH aligned as hmmpath)

Profile models
Sequence profiles = one of the most frequently used tools in bioinformatics
- detection of conserved residues
- multiple-sequence alignments
- homology detection
- structural modelling and functional annotation
BUT: a profile treats residues independently, so it is
- intrinsically unable to provide structural information
- intrinsically unable to detect protein-protein interaction
- intrinsically unable to detect epistasis between mutations
What can we learn from residue-residue correlations?

From sequence variability to phenotype
Sequence alignment [Figure: the multiple-sequence alignment shown on the opening slide]
Phenotype, using ONLY sequence information:
- protein structure
- protein function [Figure: phosphotransfer (P) between histidine kinase (HK) and response regulator (RR), ATP to ADP, RR target gene] [Casino et al. 09]
- mutational effects [Podgornaia et al. 15]

First observation: Residue contacts induce residue coevolution
- contact in 3D -> co-evolution -> correlation, detectable by statistical analysis of the alignment
[Figure: toy alignment with two columns that vary in a correlated way]
- inverse question: are sequence correlations indicative of residue-residue contacts? [Göbel et al. 94, Neher 94, Ranganathan et al. 99]
- mutual information measures pair correlation:
$$MI_{ij} = \sum_{A,B} f_{ij}(A, B) \ln \frac{f_{ij}(A, B)}{f_i(A)\, f_j(B)}$$
  with single-column frequencies $f_i(A)$, $f_j(B)$ and pair frequencies $f_{ij}(A, B)$
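The mutual information of two alignment columns can be computed directly from frequency counts. A minimal sketch, assuming plain counts without pseudocounts or sequence reweighting and two invented toy alignments; the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """MI_ij = sum_{A,B} f_ij(A,B) ln[f_ij(A,B) / (f_i(A) f_j(B))] in nats."""
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in f_ij.items():
        # Pairs that never occur contribute nothing, so only observed pairs are summed.
        mi += (c / n) * math.log((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

covarying = ["RD", "RD", "KE", "KE"]    # column 1 is determined by column 0
independent = ["RD", "RE", "KD", "KE"]  # all combinations equally frequent
print(mutual_information(covarying, 0, 1))
print(mutual_information(independent, 0, 1))
```

Perfectly co-varying columns with two equally frequent states give ln 2, independent columns give 0.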

Strong correlations and residue contacts
[Figure: trypsin inhibitor, residue pairs with $|i - j| > 4$; the 30 strongest correlations, each marked as contact or no contact]

Second observation: Correlation is not coupling
[Figure: residue pairs $i$, $j$ linked directly or via intermediate residues]
- correlations are mediated by a network of direct couplings
- direct-coupling analysis: disentangle direct and indirect couplings by modelling the full sequence distribution $P(A_1, \dots, A_L)$
- contact pair prediction: only direct coupling
- inter-protein correlation: direct + indirect coupling

Direct coupling analysis (DCA): maximum-entropy modeling
(I) coherence with data: the model generates the empirical correlations
$$P_{ij}(A_i, A_j) = \sum_{\{A_k \mid k \neq i, j\}} P(A_1, \dots, A_L) \overset{!}{=} f_{ij}(A_i, A_j)$$
(II) minimally constrained statistical model: maximum entropy
$$-\sum_{\{A_i\}} P(A_1, \dots, A_L) \ln P(A_1, \dots, A_L) \to \max$$
Result: Potts model / Markov random field
$$P(A_1, \dots, A_L) \propto \exp\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big\}$$
where $e_{ij}(A_i, A_j)$ is the direct coupling of residues $i$ and $j$
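For a very small system the Potts model can be evaluated by brute-force enumeration, which makes the difference between coupling and correlation concrete. The sketch assumes three positions with two states each, zero fields, and invented couplings on the pairs (0,1) and (1,2) only; positions 0 and 2 are not coupled directly. All names and numbers are illustrative.

```python
import math
from itertools import product

def potts_distribution(L, q, e, h):
    """P(A) proportional to exp{sum_{i<j} e_ij(A_i, A_j) + sum_i h_i(A_i)}, by enumeration."""
    weights = {}
    for seq in product(range(q), repeat=L):
        exponent = sum(h[i][seq[i]] for i in range(L))
        exponent += sum(table[seq[i]][seq[j]] for (i, j), table in e.items())
        weights[seq] = math.exp(exponent)
    Z = sum(weights.values())  # partition function: q**L terms
    return {seq: w / Z for seq, w in weights.items()}

def site_marginal(P, i, a):
    return sum(p for seq, p in P.items() if seq[i] == a)

def pair_marginal(P, i, j, a, b):
    return sum(p for seq, p in P.items() if seq[i] == a and seq[j] == b)

L, q = 3, 2
favour_equal = [[1.0, 0.0], [0.0, 1.0]]
e = {(0, 1): favour_equal, (1, 2): favour_equal}  # no direct coupling e_02
h = [[0.0, 0.0] for _ in range(L)]
P = potts_distribution(L, q, e, h)

# Connected correlation between the uncoupled positions 0 and 2.
c02 = pair_marginal(P, 0, 2, 0, 0) - site_marginal(P, 0, 0) * site_marginal(P, 2, 0)
print(c02)
```

The correlation between positions 0 and 2 is positive although their coupling is zero: it is mediated by position 1. The enumeration itself visits q^L sequences, which is what rules it out for realistic protein lengths.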

Direct coupling analysis (DCA)
Determining the correlations generated by the model,
$$P_{ij}(A_i, A_j) = \sum_{\{A_k \mid k \neq i, j\}} P(A_1, \dots, A_L) \overset{!}{=} f_{ij}(A_i, A_j),$$
has exponential time complexity $\sim 21^L$.
Our approximations:
- the first: belief propagation [Weigt et al., PNAS 09]
- the fastest: naive mean-field [Morcos et al., PNAS 11]
- the most accurate: pseudo-likelihood maximization [Ekeberg et al., Phys Rev E 13]
- less overfitting: dimensional reduction [Cocco et al., PLoS CB 13]
Approximations by others:
- MCMC sampling [Lapedes et al., LANL preprint 02]
- Bayesian networks [Burger et al., PLoS Comp Biol 10]
- pseudo-likelihood maximization [Balakrishnan et al., Proteins 11]
- sparse inverse covariance (PSICOV) [Jones et al., Bioinformatics 12]
- meta classification [Skwark et al., Bioinformatics 13]

DCA strongly improves contact prediction
[Figure: trypsin inhibitor, residue pairs with $|i - j| > 4$; the 30 strongest correlations vs. the 30 strongest couplings, each marked as contact or no contact]
- works across numerous protein families
- accurate prediction requires >1000 sufficiently diverged sequences

Not all contacts co-vary, but...
[Figure: Ras, predictions from correlation vs. from DCA]
DCA can guide
- complex assembly: [Schug, MW, Onuchic, Hwa, Szurmant, PNAS 09] [Dago, Schug, Procaccini, Hoch, MW, Szurmant, PNAS 12] [Ovchinnikov et al., eLife 14]
- protein structure prediction: [Marks et al., PLoS ONE 11] [Sadowski et al., Comp Biol Chem 11] [Sulkowska, Morcos, MW, Hwa, Onuchic, PNAS 12] [Hopf et al., Cell 12] [Nugent, Jones, PNAS 12] [Ovchinnikov et al., eLife 15]
- RNA structure prediction: [De Leonardis, Lutz, Cocco, Monasson, Schug, MW, NAR 15]

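From contacts to 3D structure [Sulkowska, Morcos, MW, Hwa, Onuchic, PNAS 12]
- ab initio protein folding simulations: molecular-dynamics simulations of structure-based models (Go-models)
- use only DCA contacts
$$V = V_{\text{bond}} + V_{\text{torsion}} + V_{\text{contact}}$$
with
$$V_{\text{bond}} = k_b \sum_{\text{bonds}} (r - r_0)^2$$
$$V_{\text{torsion}} = k_a \sum_{\text{angles}} (\theta - \theta_0)^2 + k_d \sum_{\text{dihedral}} \Big\{ [1 - \cos(\phi - \phi_0)] + \tfrac{1}{2} [1 - \cos 3(\phi - \phi_0)] \Big\}$$
$$V_{\text{contact}} = \epsilon_c \sum_{\text{contacts}} \Big[ \Big(\frac{\sigma_{ij}}{r_{ij}}\Big)^{12} - 2 \Big(\frac{\sigma_{ij}}{r_{ij}}\Big)^{6} \Big]$$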
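The energy terms of such a structure-based model are simple to write down. The sketch below is a reading of the potential under assumptions: theta and phi denote bond and dihedral angles, sigma_ij the native contact distance, and all force constants and coordinates are invented placeholders rather than the values used in the cited work.

```python
import math

def bond_energy(r, r0, k_b):
    return k_b * (r - r0) ** 2

def angle_energy(theta, theta0, k_a):
    return k_a * (theta - theta0) ** 2

def dihedral_energy(phi, phi0, k_d):
    delta = phi - phi0
    return k_d * ((1 - math.cos(delta)) + 0.5 * (1 - math.cos(3 * delta)))

def contact_energy(r, sigma, eps_c):
    """eps_c [(sigma/r)^12 - 2 (sigma/r)^6]: minimum -eps_c at r == sigma."""
    x = (sigma / r) ** 6
    return eps_c * (x * x - 2 * x)

def go_model_energy(bonds, angles, dihedrals, contacts, k_b, k_a, k_d, eps_c):
    """Total energy; every list holds (current value, reference value) pairs."""
    return (sum(bond_energy(r, r0, k_b) for r, r0 in bonds)
            + sum(angle_energy(t, t0, k_a) for t, t0 in angles)
            + sum(dihedral_energy(p, p0, k_d) for p, p0 in dihedrals)
            + sum(contact_energy(r, s, eps_c) for r, s in contacts))

# Placeholder geometry in which every term sits at its reference value.
native = go_model_energy(
    bonds=[(3.8, 3.8)], angles=[(1.9, 1.9)], dihedrals=[(0.8, 0.8)],
    contacts=[(6.0, 6.0), (7.5, 7.5)],
    k_b=100.0, k_a=20.0, k_d=1.0, eps_c=1.0,
)
print(native)
```

At the reference geometry only the contact term contributes, with -eps_c per contact, so a predicted contact map alone fixes where the energy minimum lies.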

DCA for protein-protein interaction
How to detect inter-protein residue contacts in protein complexes?
- DCA on a joint multiple-sequence alignment of protein family 1 and protein family 2: each row contains a pair of interacting proteins
- consider the strongest inter-protein residue couplings
- response regulator / histidine sensor kinase [Weigt et al., PNAS 09] [Schug et al., PNAS 09]
- [Ovchinnikov et al., eLife 14]: 29 known complexes, 36 predictions
- [Hopf et al., eLife 14]: 76 known complexes, 32 predictions
- [Uguzzoni et al., in preparation 16]: ~750 homo-dimeric proteins
Question: Can we discriminate between interacting & non-interacting protein families?
cf. talk by AF Bitbol, poster by T. Gueudré

Inference of protein-protein interaction networks [Feinauer, Szurmant, MW, Pagnani, PLoS ONE 16]
Bacterial ribosomal proteins:
- small ribosomal subunit: 20 proteins, 21 interactions (11% of 190 pairs), 5.8% of contacts between proteins
- large ribosomal subunit: 29 proteins, 29 interactions (7% of 406 pairs), 4.5% of contacts between proteins
- sparse interaction network, modular contact map

Inference of protein-protein interaction networks [Feinauer, Szurmant, MW, Pagnani, PLoS ONE 16]
Bacterial ribosomal proteins:
- pairwise alignments (1000-3000 seqs.)
- top 10 predictions for each subunit: 16 true positive interactions (80% TP vs. 8% in random prediction)
- finds most large interfaces, fails to detect small interfaces
- false predictions appear in smaller alignments: larger alignments needed
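The percentages quoted for the ribosomal subunits follow from counting protein pairs. The short check below uses only the numbers given above; treating the "random prediction" baseline as the pooled fraction of interacting pairs over both subunits is an assumption about how the 8% was obtained.

```python
from math import comb

subunits = {
    "small": {"proteins": 20, "interactions": 21},
    "large": {"proteins": 29, "interactions": 29},
}

for name, s in subunits.items():
    s["pairs"] = comb(s["proteins"], 2)           # number of unordered protein pairs
    s["density"] = s["interactions"] / s["pairs"]  # fraction of pairs that interact
    print(name, s["pairs"], round(100 * s["density"], 1))

# Assumed baseline: chance of hitting an interacting pair when picking pairs at random.
total_pairs = sum(s["pairs"] for s in subunits.values())
total_interactions = sum(s["interactions"] for s in subunits.values())
random_precision = total_interactions / total_pairs

# Top 10 predictions per subunit, 16 of them true positives.
n_predictions = 10 * len(subunits)
precision = 16 / n_predictions
print(round(100 * random_precision, 1), round(100 * precision, 1))
```

The gap between roughly 8% and 80% is what makes the sparse interaction network recoverable from sequence data alone.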

Predicting mutational effects in proteins
Quantifying the fitness effect of mutations is crucial for
- understanding the determinants of genetic disease
- understanding the mechanism of evolution of drug resistance
- understanding the onset and proliferation of cancer
- helping to develop novel diagnostic and therapeutic tools
The most common approach:
- supervised feature extraction using case/control studies (e.g. genome-wide association studies)
Our approach:
- unsupervised modelling of evolutionary sequence data
- Bayesian integration of complementary knowledge (structure, mutagenesis)
cf. talks by M Kardar, M Lässig

Measuring mutational effects in proteins [PNAS 110 (2013) 13067]
Quantitative high-throughput mutagenesis:
- TEM-1 protein causes antibiotic resistance
- generated ~$10^4$ random mutants: 1,700 without mutation, 990 distinct single AA changes
- measured resistance to amoxicillin
- minimum inhibitory concentration (MIC) as proxy for fitness

Landscape inference by Direct-Coupling Analysis [Figliuzzi, Jacquier, Schug, Tenaillon, MW, Mol Biol Evol 16]
- data: Beta-lactamase2 family (PF13354), which contains TEM-1; ~2,500 diverged sequences
- statistical landscape inference (DCA):
$$P(A_1, \dots, A_L) \propto \exp\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i=1}^{L} h_i(A_i) \Big\}$$
- score for mutant AA sequences, with $\phi(A_1, \dots, A_L) = \sum_{i<j} e_{ij}(A_i, A_j) + \sum_i h_i(A_i)$:
$$\Delta\phi(\text{mutant}) = \phi(\text{mutant}) - \phi(\text{wildtype}) = \log \frac{P(\text{mutant})}{P(\text{wildtype})}$$
- question: can evolutionary constraints across diverged homologs explain the MIC changes of TEM-1 due to single-AA changes?
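For a single amino-acid change the score only involves the field at the mutated position and the couplings of that position to the rest of the wildtype sequence. A minimal sketch, assuming couplings stored for i < j, missing entries treated as 0, and invented toy parameters on a three-residue sequence; the function names are hypothetical.

```python
def phi(seq, e, h):
    """phi(A) = sum_{i<j} e_ij(A_i, A_j) + sum_i h_i(A_i), i.e. log P up to a constant."""
    total = sum(h[i].get(a, 0.0) for i, a in enumerate(seq))
    for (i, j), table in e.items():
        total += table.get((seq[i], seq[j]), 0.0)
    return total

def delta_phi(wildtype, pos, new_aa, e, h):
    """phi(mutant) - phi(wildtype) for the single change wildtype[pos] -> new_aa."""
    old_aa = wildtype[pos]
    delta = h[pos].get(new_aa, 0.0) - h[pos].get(old_aa, 0.0)
    for (i, j), table in e.items():
        # Only couplings that involve the mutated position change.
        if i == pos:
            other = wildtype[j]
            delta += table.get((new_aa, other), 0.0) - table.get((old_aa, other), 0.0)
        elif j == pos:
            other = wildtype[i]
            delta += table.get((other, new_aa), 0.0) - table.get((other, old_aa), 0.0)
    return delta

# Invented toy parameters for a sequence of length 3 over the letters A and G.
h = [{"A": 0.5, "G": -0.5}, {"A": 0.2, "G": -0.1}, {"A": 0.0, "G": 0.3}]
e = {(0, 1): {("A", "A"): 1.0}, (1, 2): {("G", "A"): -0.7, ("A", "A"): 0.4}}

print(delta_phi("AAA", 1, "G", e, h))
```

Because the normalisation of P cancels in the ratio, the score needs no partition function, and the same substitution can score differently in different sequence backgrounds through the coupling terms.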

Predicting mutational effects in proteins [Figliuzzi, Jacquier, Schug, Tenaillon, MW, Mol Biol Evol 16]
[Figure: comparison of predictors: profile model, DCA model, SIFT, PolyPhen2, Popmusic, Imut+, MUpro, Imut, force fields, solvent accessibility, Blosum62; grouped as evolution based and structural-stability based]

Capturing the context dependence of mutations
[Figure, panels A and B: $R^2$ and residue fraction as functions of the cutoff distance, for direct contacts vs. all residue pairs of residue $i$, selected from the 3D structure and from the MSA]

Is there more information in
ACSLPKVQGPCSGKHSYYYFNSANQQCETFVYGGCLGNTNRFATIEECNARC-
VCLLPKSAGPCTGFTKKWYFDVDRNRCEEFQYGGCYGTNNRFDSLEQCQGTC-
VCAMPPDAGVCTNYTPRWFFNSQTGQCEQFAYGSCGGNENNFFDRNTCERKCM
TCSLSPSPGTCGPGVFKYHYNPQTQECESFEYLGCDGNSNTFASRAECENYCG
-CHTEHSSGACPGAVTMFYHDPRTKKCTPFTFLGCGGNSNKFDTRPQCERFCK
PCMLPSDKGNCQDILTRWYFDSQKHQCRAFLYSGCRGNANNFLTKTDCRNACM
-----RLVGYCSPYLRRYFFNRTTEKCVLFIPERCEKDGNNFPNRKVCMKTCM
PCSLKEDYGIGRAYYERWYFNTTTANCTRFIWGGNHKEWQQFR----------
PCKQDLDQGHGKTLQARYYFNKYAKVCEQFDYRGIDGNRNNFESLQECQQQC-
-CFLKPDEGVGRAILKAFYYNPKNRRCEEFEYGGLGGNENNFETMEKCEEECK
-CSQPAASGHGEQYLSRYFYSPEYRQCLHFIYSGERGNLNNFESLTDCLETCV
LCNLKYDSGVGGEKSDKYFWVPKYTTCMRFSFYGTLGNANNFPNYNSCMATCG
---------RGADTIQRWYWDTNDLTCRTFKYHGQGGNFNNFGDKQGCLDFC-
PCEQAIEEGIGNVLLRRWYFDPATRLCQPFYYKGFKGNQNNFMSFDTCNRACG
PCGQPLDRGVGGSQLSRWYWNQQSQCCLPFSYCGQKGTQNNFLTKQDCDRTC-
VCIQPLESGD-EPSVPRWWYNSATGTCVQFMWDPDTTNANNFRTAEHCESYCR
TCVQPTATGP-NPTEPRWWYNSITGMCQQFLWDPTASGPNNFRTVEHCESFCR
-CDQQLMLGVGGASMERFYYDTTDDACLVFNYSGVGGNENNFLTKAECQIAC-
PCSVPLAPGTGNAGLARYYYNPDDRQCLPFQYNGKRGNQNNFENQADCERTC-
----PESEGVTGAPTSRWYYDQTDMQCKQFTYNGRRGNQNNFLTQEDCAATC-
ACKMPLSVGIGGAPANRWYYDAAASTCKTFEYNGRKGNQNNFISEADCAATC-
VCNLPMSTGEGNANLDRFYYDQQSKTCRPFVYNGLKGNQNNFISLRACQLSC-
ICQQPMAVGTGGATLPRWYYNAQTMQCVQFNYAGRMGNQNNFQSQQACEQTC-
PCSLPMFSGEGTGNLTRWYADSCSRQCKSFTYNGSKGNQNNFLTKQQCESKCK
PCEEEMTQGEGSAALTRFYYDALQRKCLAFNYLGLKGNRNNFQSKEHCESTC-
TCELPMTKGYGNSHLTRWHFDKNLNKCVKFIYSGEGGNQNMFLTQEDCLTVC-
TCELTMTKGYGNSHLTRWHFDKNLNKCVKFIYSGEGGNQNMFLTQEDCLSVC-
RCHLPPAVGYGKQRMRRFYFDWKTDACHELQYSGIGGNENIFMDYEQCERVCR
-CMESLDRGSCEAMSNRYYFNKRARQCKGFHYTGCGKSGNNFLTKEECQTKC-
PCQQPLQRGNCSQRIPLFYYNIHNHKCRKFMYRGCNGNENRFSNRRQCQAKCG
?