Exam in 2D1396 Bioinformatics, June 2, 2006

Course coordinator: Lars Arvestad

No aids beyond writing equipment are allowed. Write clearly! Use only one side of each sheet of paper and address only one question per sheet. Justify your answers clearly. Leave room for comments during grading.

A passing grade requires 5 points, 20 points are required for grade 4, and 25 points for grade 5. Bonus points from the homework assignments may be counted toward this exam as well.

Suggested solutions will be available from the course web page. Exam results will be posted next to the main entrance of SBC's corridor.

Good luck!

Part 1

1. Why are there so many subprograms in the Blast package: blastp, blastn, tblastn, blastx, and tblastx? What are the main differences between them? (2p)

Please start the next question on a new sheet of paper.

2. Figure 1 shows a phylogenetic tree with branch lengths. Write down a distance matrix that agrees perfectly with the branch lengths! (2p)

[Figure 1. Phylogenetic tree with leaves a, b, c, d, and e and labeled branch lengths.]

3. Please explain the following terms. (5p)
(a) Homology
(b) Secondary structure
(c) Synonymous codons
(d) Positive inside rule
(e) Molecular clock

4. Explain in words what the HMM in Figure 2 is modelling. Give a simple example of DNA data that the model fits well. (2p)

Figure 2.

5. Figure 3 displays two different alignments. Which one will get the higher Z-value, and why? We align using a scoring function that sets + for identical letter pairs and - for non-identical letters. Each indel character (-) has score -. (2p)

(a) ACGTACGTACGTACGT
    AAGTAAG---GTAGGT
(b) AAAAAAAAAAAAAACC
    AAAAAAA---CAAAAA
Figure 3.

6. How do SCOP and Pfam, respectively, define the term domain? (2p)

Part 2

6. SCOP and CATH are two systems for hierarchical classification of protein domains. They differ somewhat, but one thing they have in common is that they divide domain families into classes according to secondary structure. Examples of classes are α, β, α + β, and α/β. Suggest a method for automatic classification of protein domains into these classes. Give a basic overview of what is necessary and discuss potential weaknesses of your suggestion. (2p)

7. Researchers at the Biotech department have recently looked at a set of genes in Populus and other plants that are interesting because they are involved in the regulation of cellulose production. A very well conserved domain was found to be characteristic of these genes, while the rest of the genes show very few similarities. Figure 4 shows an alignment of the domain family, with the other parts of the genes removed.

(a) One of the Arabidopsis genes, At5g37478, evidently lacks an important part of the domain. Given how well conserved the domain family is otherwise, this was surprising, and a bad gene prediction was suspected to be the root of the problem. It was therefore investigated whether a complete domain could be found in the contig that the gene/domain came from. Suggest two different methods for doing this and give one advantage and one disadvantage for each method. (4p)

(b) The differences between the domain sequences are clear, despite their conservation, and there is reason to expect a phylogenetic tree reconstructed from the domain (with the rest of the genes discarded) to be fairly reliable. But a bootstrap analysis gives very weak support for almost all edges. How come? (p)

Ptr592874     62 LHTGQRALKRAMFNYSVATKIYMNE-QQKRQIERIQKIIEE--EEVRTMRKEMVPRAQLM 8
At5g37478     39 -----------MFNYSVATNYYIQK-LQKKQEERLQKMIEE--EEIRMLRKEMVPKAQLM 84
Os09g3650.    89 LHTEERAIKRAGFNYQVASKINTNE-IIRRFEEKLSKVIEE--REIKMMRKEMVHKAQLM 45
AT3G005.     376 LHSDIRAVERAEFDYQVTEKINLVE-QYKTERERQQKLAEE--EEIRRLRKEFVPKAQPM 432
AT5G550.     386 LHSDVRAVERAEFDYQVAEKMSFIE-QYKMERERQQKFAEE--EEIRRLRKEFVPKAQPM 442
Ptr66865      83 LHSDIRAVERADFDHQVSEKMSLIE-QYKMERERQRKLAEE--EEIRRLRKELVPKAQPM 39
Ptr546644    387 LRSDIRAVERADFDHQVSEKMSLIE-QYKMERERQQKLAEE--EEVRRLRKELVPKAQPM 443
Os2g38790.   326 LHSEIRSVGRARFDHQVAERNSFLE-KLNMERERQQKLDEE--LEIKQLRKEQVPRAHPM 382
Os03g400.     92 LHSDVRAIERAEFDQYVSERNKFAE-QLRLERERQQKLEEE--EMIKQLRKELVPKAQPM 248
AT5G44270.    20 LHVDHRPIERADFDHKIKEKEMMYK-RHLEEAEAAKMVEEE--RALKQLRRTIVPQTRPV 266
ATG03780.2    66 LHVEHRAVERADFDHKIKEKENQYK-RYREESEAAKMVEEE--RALKQMRKTMVPHARPV 672
Ptr753       728 LNADHRAVGRAEFDQKVKEKEMLYK-RYREESETARMMEEE--KALKQLRRTMVPHARPV 784
Ptr594654    705 LHADQRAVERAEFDHKVKEKEMLYK-RYREESETAKMMEEE--KALKQLRRTMVPHARPV 76
Os07g32390.  698 LHVDERAVQRSEFDNMVKEKEITYK-RFREENEFAQKIEEE--KAFKQLRRTFVPQARPL 754
Ptr658207     87 FRSEERVAKRKEFFQKLGEKNNAKEDTEKKHLHARPKEKAE--HDLKKLRQSAVFRGKPS 244
AT3G26050.   403 FRSDERAEKRKEFFKKVEEKNKKEK-EDKFSCGFKANQNTNLASEEHKNPQVGGFQVTPM 46

Figure 4. Numbers indicate start and stop residues for the domain in the gene product. Grey levels are proportional to column conservation, with amino acid properties taken into account (some replacements are more dramatic than others).

8. The abstract of a paper by Ternes et al. (J. Biol. Chem. 2006) reads as follows.

    Fungal glucosylceramides play an important role in plant-pathogen interactions enabling plants to recognize the fungal attack and initiate specific defense responses. A prime structural feature distinguishing fungal glucosylceramides from those of plants and animals is a methyl group at the C9-position of the sphingoid base, the biosynthesis of which has never been investigated. Using information on the presence or absence of C9-methylated glucosylceramides in different fungal species, we developed a bioinformatics strategy to identify the gene responsible for the biosynthesis of this C9-methyl group. This phylogenetic profiling allowed the selection of a single candidate out of 24 7 methyltransferase sequences present in each of the fungal species with C9-methylated glucosylceramides. A Pichia pastoris knock-out strain lacking the candidate sphingolipid C9-methyltransferase was generated, and indeed, this strain contained only non-methylated glucosylceramides.

Describe how Ternes (most likely) used phylogenetic profiling to find the gene responsible for producing the methyl group. (3p)

9. We have mentioned scoring matrices for sequence comparisons in the course, for example Blosum62 and PAM250, but you have not had to worry about which one you have actually used. Tools such as Blast use reasonable default settings, and in the case of scoring matrices the default is Blosum62. These matrices are however very general and there are examples of data where they are far from optimal.

(a) One example of data for which specialized scoring matrices have been developed is transmembrane proteins. Why would Blosum62 be unsuitable when it is specifically transmembrane regions that are compared? (2p)

(b) Suppose you were asked to compute a scoring matrix for comparisons of proteins from two different classes of bacteria, X and Y. You have access to a computer program that computes a scoring matrix given an input matrix that describes how often a residue a has been replaced by residue b (for all pairs of a and b), and how often a stayed conserved. Your working material is protein sequences from one X genome and one Y genome. Describe your approach. (3p)

Suggested solutions to the exam of June 2, 2006, in 2D1396 Bioinformatics
Lars Arvestad

Part 1

1. The programs handle two types of data, proteins and DNA. For example, blastp compares protein sequences with protein databases, while blastn compares DNA sequences with DNA databases. Translation of DNA is the other main point sought in this question: tblastn compares a protein query with a translated DNA database, and blastx compares a translated DNA query with a protein database, so in both cases the DNA is translated to make it comparable with proteins. Finally, tblastx compares DNA with DNA while translating both to amino acid sequences.

2.
D    a   b   c   d   e
a    0
b    4   0
c   25  25   0
d    8   8   3   0
e    8   8   3   2   0

3. Look it up!

4. The HMM describes repetitions of a 10 bp region in which all positions but two are perfectly conserved. Example data: ACACATACGT ACACGTGCGT. Comment: This is a so-called tandem repeat that is found in the mitochondrial DNA of dogs and wolves (see Savolainen et al, Mol Biol Evol, 2000) and it is due to a mechanism called replication slippage.

5. Alignment (a) will get the highest Z-score. The reason is that the complexity of the sequences in (b) is so low that any alignment of the shuffled sequences, which is what the Z-score methodology compares against, will get a score similar to that of the given alignment. According to a FASTA program, alignment (a) has Z 75 and the value for (b) is so insignificant that it is not reported.

6. SCOP defines a protein domain to be an independent structural unit. Regardless of its context, i.e., other domains present, it will fold to the same molecular structure. In Pfam a domain is an independently evolving unit and no reference to structure is made. In practice, this means that domains are sets of subsequences that are all very similar to each other.

Part 2

6. A simple approach to solving this is to do a secondary structure prediction, say using PSIpred, and use that as a basis for classification. For instance, if there are lots of α elements and no β elements are reported, we call it an all-α. A similar rule is needed for the all-β class. For α + β we check whether one part (beginning or end) of the protein contains α elements (and coil, etc.) while the rest contains β elements. For α/β we check that the groups of α and β predictions take turns. An important problem with this solution is that it is sensitive to errors in the predictions. In practice, an all-α protein may be predicted to contain some β elements, which would ruin the classification.

7. (a) One could use Blast or Fasta and search the contig with another domain sequence. This is easy to do and you don't have to rely on having a good gene prediction at all. On the other hand, you might get a spurious hit and/or fail to see that the domain you find is not part of a gene at all. Also, if the domain is split across different exons we might not find the similarity. Using an HMM for the domain family on all ORFs in the contig has the same advantages and disadvantages.

One could also try different gene prediction programs and see if they make other predictions. That way, we could actually determine whether a full, protein-coding gene is there. This could reveal other interesting similarities that do not involve the domains. The drawback is that gene prediction is hard and all gene prediction programs may make serious mistakes.

Henrik Aspeborg at the Biotech department did use this latter method successfully, and At5g37478 was shown to have a full domain within a gene which was similar to the closest Populus homologue.

(b) Since the domain family is both conserved and short, the bootstrapping procedure is handicapped by having few informative columns. Because columns are resampled with replacement, only about 63% of the columns are actually used in each bootstrap iteration, which means that 37% of the data is not used. In this case, that is too much data to throw away, and a lot of edges in the tree cannot be resolved in the bootstrap procedure. Using another method (not covered in the course) that does not throw away data, we could actually show that the phylogeny is quite trustworthy.

8. Ternes and his coauthors looked at the gene content of all known fungal genomes to see if there was a gene that was present in all genomes that had the mentioned C9-methylation and absent from those without C9-methylation. The reasoning was that the methylation gene would not be necessary in the latter genomes and therefore redundant, which during evolution would cause it to disappear. In the former genomes, where the gene is necessary, selection would make sure it stays. Ternes did find such a gene, and some other candidates with almost as good a correlation. A knockout experiment showed that they had immediately found the right gene.

9. (a) The elements of scoring matrices contain log-odds scores, i.e., logarithms of ratios of probabilities. These probabilities reflect the (expected) amino acid composition and replacement frequencies of the sequences compared. If you are working with data with very different amino acid frequencies, the scoring matrix will give you the wrong idea about what is a common replacement and what is not. This is the case in transmembrane regions, where the typically hydrophobic amino acid composition differs a lot from what you find on average in proteins.

(b) A brief suggestion: find pairs of protein sequences, one from X and one from Y, that are homologous and preferably orthologous. This can be done using Blast, perhaps together with a phylogenetic analysis. Then, when we have the pairs of homologous proteins, we align them pairwise and count replacements. This will be the input to that program of ours.
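The counting step in solution 9(b), and the log-odds idea from 9(a), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, the alignments are assumed to be given as equal-length gapped strings, and the log-odds formula is the usual BLOSUM-style one, which may or may not be what the program mentioned in the question actually computes.

```python
import math
from collections import Counter


def count_replacements(alignments):
    """Count aligned residue pairs over pairwise alignments.

    `alignments` is an iterable of (x_seq, y_seq) strings of equal length,
    one sequence from genome X and one from genome Y, with '-' for gaps.
    Columns containing a gap are skipped. Pairs are stored with their two
    letters sorted, so (a, b) and (b, a) share one tally.
    """
    counts = Counter()
    for x_seq, y_seq in alignments:
        if len(x_seq) != len(y_seq):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(x_seq, y_seq):
            if a == "-" or b == "-":
                continue
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(counts):
    """Assumed BLOSUM-style scores: log2(observed / expected pair frequency).

    Pairs that were never observed get no entry; a real matrix would need
    pseudocounts or much more data.
    """
    total = sum(counts.values())
    # Residue frequencies implied by the counted pairs (two residues per pair).
    freq = Counter()
    for (a, b), n in counts.items():
        freq[a] += n
        freq[b] += n
    for a in freq:
        freq[a] /= 2 * total
    scores = {}
    for (a, b), n in counts.items():
        observed = n / total
        expected = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

With realistic data the input would be thousands of aligned ortholog pairs. Frequent replacements then get positive scores and rare ones negative scores, relative to the composition of the X and Y proteins themselves rather than of proteins in general.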
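The shuffling argument in solution 5 can be tried out with a small sketch. Several things here are assumptions rather than part of the exam: the scores +1 for a match and -1 for a mismatch or indel, the use of global (Needleman-Wunsch) alignment instead of the local alignment and statistics a FASTA program uses, and the number of shuffles. The resulting numbers are therefore not comparable to the Z-value quoted in the solution.

```python
import random
import statistics


def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap cost (assumed scores)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def z_score(s, t, n_shuffles=200, seed=0):
    """(observed score - mean shuffled score) / sd of shuffled scores.

    The second sequence is shuffled, which keeps its composition but
    destroys the order of its letters.
    """
    rng = random.Random(seed)
    observed = nw_score(s, t)
    letters = list(t)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled.append(nw_score(s, "".join(letters)))
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        return 0.0  # all shuffles score the same: the observed score tells us nothing
    return (observed - statistics.mean(shuffled)) / sd


# The ungapped sequences from Figure 3.
pair_a = ("ACGTACGTACGTACGT", "AAGTAAGGTAGGT")
pair_b = ("AAAAAAAAAAAAAACC", "AAAAAAACAAAAA")
```

Calling `z_score(*pair_a)` and `z_score(*pair_b)` gives a feel for the effect: shuffling a sequence that is almost all A barely changes what it can align to, so the observed score hardly stands out from the shuffled ones.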
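The classification rules in the Part 2 solution to question 6 can be written down as a short sketch. It assumes the secondary structure prediction is available as a per-residue string with H for helix, E for strand, and C for coil. The minimum element length and the exact cut between α + β and α/β are arbitrary choices made for this illustration, not part of the solution.

```python
from itertools import groupby


def elements(ss, min_len=3):
    """Ordered helix ('H') and strand ('E') elements in a prediction string.

    Coil and runs shorter than `min_len` residues are ignored, which gives
    some (assumed) protection against short spurious predictions.
    """
    return [state for state, run in groupby(ss)
            if state in ("H", "E") and len(list(run)) >= min_len]


def classify(ss, min_len=3):
    """Assign all-alpha, all-beta, alpha+beta or alpha/beta from a prediction."""
    elems = elements(ss, min_len)
    kinds = set(elems)
    if not kinds:
        return "unclassified"
    if kinds == {"H"}:
        return "all-alpha"
    if kinds == {"E"}:
        return "all-beta"
    # Count how often the element type changes along the chain.
    switches = sum(1 for a, b in zip(elems, elems[1:]) if a != b)
    # One change: helices and strands are segregated. More: they alternate.
    return "alpha+beta" if switches == 1 else "alpha/beta"
```

The weakness noted in the solution is visible here: a single wrongly predicted strand of three or more residues turns an all-alpha domain into one of the mixed classes.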
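The 63% figure in solution 7(b) follows from resampling with replacement: a given column is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e, or about 0.37, as n grows. The sketch below checks this both by formula and by simulation. The function names are made up for the illustration, and n = 60 is simply the width of the alignment as printed in Figure 4.

```python
import random


def expected_fraction_used(n_columns):
    """Expected fraction of distinct columns in one bootstrap resample."""
    return 1.0 - (1.0 - 1.0 / n_columns) ** n_columns


def simulated_fraction_used(n_columns, n_replicates=1000, seed=0):
    """Average fraction of distinct columns over simulated resamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_replicates):
        # Draw n column indices with replacement, as the bootstrap does.
        sample = [rng.randrange(n_columns) for _ in range(n_columns)]
        total += len(set(sample)) / n_columns
    return total / n_replicates


if __name__ == "__main__":
    print(expected_fraction_used(60))
    print(simulated_fraction_used(60))
```

For a 60-column alignment this means that around 22 columns are missing from each replicate, which matters when only a few columns separate the sequences to begin with.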
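The selection step described in solution 8 amounts to comparing presence/absence profiles. A minimal sketch, assuming the gene families have already been identified in each genome (for example by sequence similarity searches); the data layout and all names in it are hypothetical, and the real study may have worked differently in its details.

```python
def profile_candidates(gene_presence, has_trait):
    """Gene families whose presence profile matches the trait profile.

    gene_presence: dict mapping a gene family name to the set of species
                   in which a member of the family was found.
    has_trait:     dict mapping each species to True if it has the trait
                   (here, C9-methylated glucosylceramides), else False.
    A family qualifies if it occurs in every species with the trait and
    in no species without it.
    """
    with_trait = {sp for sp, trait in has_trait.items() if trait}
    without_trait = set(has_trait) - with_trait
    return sorted(
        family for family, species in gene_presence.items()
        if with_trait <= species and not (species & without_trait)
    )


# Hypothetical toy data: three families in four species.
example_trait = {"sp1": True, "sp2": True, "sp3": False, "sp4": False}
example_presence = {
    "famA": {"sp1", "sp2", "sp3", "sp4"},  # present everywhere: uninformative
    "famB": {"sp1", "sp2"},                # matches the trait profile
    "famC": {"sp1", "sp3"},                # no correlation with the trait
}
```

A less strict version would rank families by how well the two profiles correlate, which corresponds to the "other candidates with almost as good a correlation" mentioned in the solution.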