Exam in DD2396 Bioinformatics, 3 March 2008

Course coordinator: Lars Arvestad

No aids other than writing tools are allowed. Write clearly! Justify your answers! Leave room for comments during grading. Please use only one side of each sheet of paper, and don't address more than one question per page.

A passing grade (E) requires 15 points, grade C requires 20 points, and grade A requires 25 points. Each successful homework gives one bonus point for Part 1, but the maximum score for Part 1 is 15 points. See the course webpage for an explanation of how the exam grade is combined with extra assignments to give the course grade!

Note: Part 2 will only be graded if Part 1 has been awarded at least 10 points (including bonus).

Suggested solutions will be posted on the course webpage. Exam results will be emailed to the participants.

Good luck!

Part 1

1. (a) What is meant by protein secondary structure?
(b) What is neighbor joining used for?
(c) What is meant by coverage in the context of genomics and genome assembly?
(d) What is the difference between a scaffold and a contig?
(e) In what way is Kimura's model of DNA evolution superior to the Jukes-Cantor model?
(f) Name one measure of sequence similarity and explain, in a single sentence, what it means.

2. Name two common ways of representing a sequence motif.

3. (a) Why is it usually better to align using affine gap costs rather than linear gap costs?
(b) Does Blast compute local or global alignments? Why did the Blast developers make that choice?

4. (3p) Make an illustration of eukaryotic gene structure and identify three typical features of gene/genome sequences that can be used for ab initio gene prediction.

5. Consider the unrooted gene tree (((a1, b1), a2), b2, c).
(a) Root the gene tree with a1 and b1 as an outgroup.
(b) Root the tree so that the number of duplications is minimized. Assume that (A, (B, C)) is the species tree, with the mapping from genes to species given by the letters. We assume that only duplications, losses, and speciations occur during evolution.
Illustrate the rootings with phylogenies drawn such that the root is on the left and the leaves are on the right.

Part 2

6. Figure 1 shows the domain architecture of two sequences, A and B. The only notable similarity between them consists of the repeated domain Nebulin (rectangle). Illustrate what a dotplot for sequences A and B probably looks like!

Figure 1: Two sequences containing Nebulin domains (white rectangles) and two other domains (circles, only present in B), presented in Pfam-style schematics.

7. One of the best multialignment programs today is called MAFFT. It works much like ClustalW, i.e., the program uses the most common principle for multialignment, which was also discussed during the course. A significant difference, however, is that MAFFT iterates its procedure.
(a) Explain why iterating would be beneficial in this context.
(b) Alignment scoring is usually applied uniformly across the sequences. One exception is ClustalW, which uses simple rules to make some positions, such as those in an alpha helix, less prone to gaps. Experience tells us, however, that ClustalW does not really benefit from this method. Maybe the rules are too simple? Argue for or against basing the alignment score on modern secondary structure prediction!

8. A common way of reconstructing species trees is to gather several carefully chosen gene families, align each family separately, and concatenate the alignments into one long multialignment; see figure 2. This multialignment consists of one super gene per species. A tree is then computed from the concatenated alignment. This is called a super gene method.

Figure 2: Illustration of super genes. Starting with unaligned sequences in different gene families, we align and concatenate them into a superalignment. (Labels in the figure: Fam 1, Fam 2, Fam 3; Unaligned; Aligned; "Super gene alignment".)

In a paper from last year, Rasmussen and Kellis (Genome Research, 2007) evaluate different phylogenetic methods and also show the strength of the super gene method. They make use of the fact that we now have 12 complete fly genomes with a reliably determined species tree. This species tree is taken as the true answer in their evaluation.
(a) Name one advantage and one disadvantage of the super gene method!
(b) The authors put considerable effort into using orthologous genes from syntenic regions in their analysis. Why?
(c) As can be seen in figure 3, and as the figure legend points out, analysis based on the genes' DNA sequences is much more reliable than analysis based on their translations to amino acids. Explain why!
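The concatenation step of the super gene method described in question 8 can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by Rasmussen and Kellis: the function name `concatenate_alignments`, the dictionary layout (species name mapped to one aligned sequence per family), and the choice to pad a species that is missing from a family with gap characters are all assumptions made for the example. The toy sequences are invented.

```python
from typing import Dict, List


def concatenate_alignments(families: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Concatenate per-family alignments into one super gene per species.

    Each family is a dict mapping species name -> aligned sequence
    (hypothetical layout: exactly one gene per species and family).
    """
    # Collect every species that occurs in at least one family.
    species = sorted({name for fam in families for name in fam})
    parts: Dict[str, List[str]] = {name: [] for name in species}

    for fam in families:
        # All rows of one alignment must have the same number of columns.
        lengths = {len(seq) for seq in fam.values()}
        if len(lengths) != 1:
            raise ValueError("all rows in an alignment must have the same length")
        width = lengths.pop()
        for name in species:
            # Assumption: a species missing from a family is padded with gaps.
            parts[name].append(fam.get(name, gap * width))

    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy data (invented): three aligned families for species A, B and C.
fam1 = {"A": "ACG-T", "B": "ACGGT", "C": "AC--T"}
fam2 = {"A": "TTGA", "B": "TT-A"}  # species C has no gene in this family
fam3 = {"A": "GG", "B": "GC", "C": "G-"}

superalignment = concatenate_alignments([fam1, fam2, fam3])
for name, row in superalignment.items():
    print(name, row)
```

Every row of the result has the same length, so the superalignment can be passed to any tree-building method that accepts a multiple alignment.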

Figure 3: From Rasmussen and Kellis (Genome Res., 2007). Percentage of gene trees congruent to the species phylogeny correlates with gene length, regardless of the method used. DNA-based reconstruction consistently outperformed protein-based reconstruction across all methods. (By "congruent to", the authors mean "same as".)

9. We have mentioned PSIpred in the course as a good method for secondary structure prediction. Describe, in general terms, how PSIpred works.

10. The phylogeny in figure 4 is made from some genes for the voltage-gated calcium channel's subunit α1 (VGCC) in human, rat, puffer fish (Fugu), C. elegans (worm), and Drosophila melanogaster (fruit fly). The VGCC gene family is important for calcium regulation and is involved in several cellular functions. Since several diseases are connected to mutations in VGCC genes, it is interesting to understand how these genes have evolved in vertebrates and how our VGCC genes relate to those of the model organisms. Assuming the phylogeny is correct, what can be said about the number of VGCC genes in the first vertebrate?

Figure 4: From Wong et al., Gene, 2005: Phylogenetic analysis of VGCC α1-subunits from Fugu, C. elegans, Drosophila, rat, rabbit and human. Only protein sequence spanning domains I-IV of each subunit (1400-1800 amino acids) was used for comparison. Six of the Fugu α1-subunit partial sequences did not contain the complete domains I-IV and hence were not included in the analysis. Protein sequences were aligned using ClustalW and the phylogenetic tree was generated based on the neighbor-joining algorithm. (The phylogeny has been reduced for this exam.)