Database course summary

Based on old exams and exam answers from the Database Technology course (Databasteknik). April 26, 2011

1 Terminology

Metadata, or the database schema, is data about data, i.e. a description of the database stored in the system catalog. Metadata consists of information about the structure of files, the type and storage format of each data item, various constraints on the data, and other information about the data such as authorization privileges and access statistics. For the relational model this includes descriptions of relation names, attribute names, data types, primary keys, secondary keys, foreign keys, other constraints, views, storage structures and indexes, and security and authorization information.

The participation constraint states whether an entity has to be a member of a relationship type or not. Total participation states that every entity has to take part in at least one relationship of the type; partial participation means that not every entity must take part in a relationship of the type.

A primary index (sv. primärindex) is an ordered file of records with two fields. The first field is of the same type as the ordering field (indexing field) of the data file, and the second field is a pointer to a data block (block pointer). A primary index is a sparse index, since it has one index record per block of the data file rather than one per record. A primary index requires much less space than the corresponding data file and can be used to speed up the search for records in the data file with respect to the indexing field.

Referential integrity (sv. referensintegritet) requires that if a tuple in one relation refers to another relation, it must refer to an existing tuple.

Full functional dependency (sv. fullt funktionellt beroende). A functional dependency in which X determines Y, written $X \rightarrow Y$, holds if, for every instance r(R) and every pair of tuples $t_1, t_2 \in r(R)$, the following is true: if $t_1[X] = t_2[X]$ then $t_1[Y] = t_2[Y]$. A functional dependency $X \rightarrow Y$ is full if there is no attribute $A \in X$ such that $(X - \{A\}) \rightarrow Y$.
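The definition above can be tried out on a small relation instance. The following is a minimal sketch, not a definitive implementation: the function names `fd_holds` and `is_full_fd` and the sample relation (attributes `emp`, `proj`, `hours`, `ename`) are assumptions made for illustration. Note that a check against one instance can only refute a dependency; a functional dependency is a property of the schema and must hold in every instance.

```python
def fd_holds(rows, X, Y):
    """Return True if X -> Y holds in this instance (a list of dicts)."""
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        # Two tuples that agree on X must also agree on Y.
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


def is_full_fd(rows, X, Y):
    """Return True if X -> Y holds and no single attribute of X can be dropped."""
    if not fd_holds(rows, X, Y):
        return False
    return not any(fd_holds(rows, [b for b in X if b != a], Y) for a in X)


# Hypothetical sample instance: hours depends on (emp, proj), ename on emp only.
rows = [
    {"emp": 1, "proj": "A", "hours": 10, "ename": "Ann"},
    {"emp": 1, "proj": "B", "hours": 5, "ename": "Ann"},
    {"emp": 2, "proj": "A", "hours": 7, "ename": "Bo"},
]

full = is_full_fd(rows, ["emp", "proj"], ["hours"])
partial = is_full_fd(rows, ["emp", "proj"], ["ename"])
```

In this instance `{emp, proj} -> hours` is full, whereas `{emp, proj} -> ename` holds but is not full, because `emp -> ename` already holds on its own.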

Here R is a relation schema and r(R) is an instance of the schema R with attributes $A_1, \ldots, A_n$, and $X, Y \subseteq \{A_1, \ldots, A_n\}$. In other words, a full functional dependency is a functional dependency that has no unnecessary attribute in its determinant (the left-hand side of the dependency).

Entity integrity (sv. entitetsintegritet). To preserve entity integrity, i.e. to guarantee that all tuples in a relation can be uniquely identified, no primary key may be assigned a NULL value.

Deadlock (sv. dödlig låsning) is a situation that can arise when every transaction in a set of two or more transactions is waiting for access to some data item that is locked by another transaction in the set.

A primary key is a minimal superkey, chosen among the candidate keys to serve as the key of a relation. A minimal superkey consists of a minimal subset of the relation's attributes that uniquely identifies all tuples in the relation.

A transaction is a logical unit of database processing that is performed in its entirety or not at all.

A secondary index (sv. sekundärindex) is an ordered file of records with two fields, where the first field is of the same type as the indexing field, which may be any field of the data file. The second field is a block pointer. The indexing field can be a non-key field or a secondary key field, and the data file is not sorted on the indexing field. Secondary indexes can be sparse or dense.

Indexes make searching for records considerably more efficient. When the data file is updated, the associated indexes must be updated as well, which adds some cost to these operations.

Recovery is the process of restoring a database to the last consistent state before a transaction failure.

An unclustered index (sv. oklustrat index) is an index whose keys have a different sort order from the rows of the table.

A superkey (sv. supernyckel) is any subset of a relation's attributes that uniquely identifies all tuples in the relation (note that there is normally more than one superkey for the same relation).
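The relationship between superkeys and minimal superkeys (candidate keys) can be sketched on a sample instance. This is an illustrative sketch under stated assumptions: the helper names `is_superkey` and `candidate_keys` and the sample attributes `ssn`, `email` and `name` are hypothetical. As with functional dependencies, one instance can only show that an attribute set is not a key; key constraints are declared on the schema.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two rows of this instance agree on all attributes in attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)


def candidate_keys(rows, attributes):
    """Minimal superkeys of the instance, found by trying small subsets first."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip any set that contains an already found key: it is not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_superkey(rows, combo):
                keys.append(combo)
    return keys


# Hypothetical sample instance.
people = [
    {"ssn": 1, "email": "a@x", "name": "Ann"},
    {"ssn": 2, "email": "b@x", "name": "Ann"},
    {"ssn": 3, "email": "c@x", "name": "Bo"},
]

keys = candidate_keys(people, ["ssn", "email", "name"])
```

Here `("ssn", "name")` is a superkey but not a candidate key, since `ssn` alone already identifies every row; the primary key would be chosen among the minimal sets returned in `keys`.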

A natural join is a join operation between two relations (tables) in which a combination of two tuples (one from each relation) is included in the resulting relation when an equality condition holds on one or more attributes common to both relations. The attributes of the right-hand relation that take part in the equality condition are not included in the resulting table, i.e. redundant attributes are eliminated.

2 Data models

Physical data independence: the possibility to change the internal schema without influencing the conceptual schema. E.g. the effects of a physical reorganization of the database, such as adding an access path, are hidden from the conceptual schema.

Logical data independence: the possibility to change the conceptual schema without influencing the external schemas (views). E.g. adding another field to a conceptual schema.

2.1 The three-schema architecture

The three-schema architecture is a multi-level architecture in which each level represents one abstraction level; it was introduced in 1978 as the standard architecture (the ANSI/SPARC architecture) for databases. It consists of three levels, where each level introduces one abstraction layer and has a schema that describes how representations are mapped to the next lower abstraction level:

1. The internal level, or internal schema, describes storage structures and access paths for the physical database. Abstraction level: files, index files etc. It is usually defined through the data definition language (DDL) of the DBMS.
2. The conceptual level, or conceptual schema, is an abstract description of the physical database. It constitutes one basic model, common to all users, of the logical content of the database. This abstraction level corresponds to the real world: objects, characteristics, relationships between objects etc. The schema is created in the DDL according to a specific data model.
3. The external level, external schemas, or views. A typical DB has several users with varying needs, demands, access privileges etc. External schemas describe different views of the conceptual database with respect to what different user groups would like to see or are allowed to see. Some DBMSs have a specific language for view definitions (otherwise the DDL is used).

2.2 The relational model

The relational data model represents a database as a collection of relations (or tables). Each table has a name and represents a physical or abstract concept or relationship. The properties of the concept or relationship are represented by the columns (or attributes) of the table, each with a column name and a value domain. The value domain specifies which values the attribute is allowed to take. Each row (or tuple) of the table represents a specific instance of the concept or relationship and comprises a set of related values, one value for each attribute of the table. Each row is furthermore unique, and is distinguished by one or more attributes whose values are unique for each row. This attribute (or these attributes) is said to form the key of the table and is used to uniquely identify each row. A table thus comprises a set of rows where each row represents an individual instance of the concept or relationship. A relation schema describes the common structure of a table in the form of the name of the relation/table and its set of attributes. The order among attributes or among tuples has no significance in the relational model.

3 ER and EER

Specialization is the process of conceptually refining a general entity type, called a superclass, by specifying a set of subclasses. The subclasses are created by identifying distinguishing characteristics among subsets of the entities of the superclass; these characteristics are the basis for forming the subclasses.

Generalization is the process of specifying a superclass by identifying a number of common characteristics among a set of (sub)classes. These characteristics can be extracted and defined as the attributes of a common superclass, from which they are inherited by the subclasses.

Aggregation is an abstraction concept for building composite objects from their component objects. Aggregation can be related to the EER model in three cases. The 1st case is an aggregation of attribute values of an object to form the whole object. The 2nd case is the representation of an aggregation relationship using an ordinary relationship.
The 3rd case is not explicitly supported in EER but involves the possibility of combining objects that are related through a particular relationship instance into a higher-level aggregate object.

How are the concepts entity type and attribute in the ER (entity-relationship) model represented in the following implementation data models?
1. The relational data model
2. The object-oriented data model

Answer:
1. An entity type in the ER model is represented as a table (relation), and its attributes as the columns of that table.
2. Object types/classes and object attributes.
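The two representations in the answer can be put side by side. The sketch below is only an illustration: the entity type EMPLOYEE and its attributes `ssn`, `name` and `salary` are hypothetical, and Python's built-in `sqlite3` module stands in for a relational DBMS. The same entity type appears once as a table with columns and once as a class with object attributes.

```python
import sqlite3
from dataclasses import dataclass


# Object-oriented representation: entity type -> class, attributes -> fields.
@dataclass
class Employee:
    ssn: str
    name: str
    salary: int


# Relational representation: entity type -> table, attributes -> columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (ssn TEXT PRIMARY KEY, name TEXT, salary INTEGER)"
)

# One entity (instance) stored as a row and read back as an object.
ann = Employee("1234", "Ann", 30000)
conn.execute(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)", (ann.ssn, ann.name, ann.salary)
)
row = conn.execute("SELECT ssn, name, salary FROM EMPLOYEE").fetchone()
restored = Employee(*row)
```

Each row of the table corresponds to one object of the class, and the key attribute `ssn` plays the role that object identity plays in the object-oriented model.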

4 SQL and relational algebra

Assume that we have a literature database with two relations (tables) with the following schemas:

BOOK(BID, BNAME)
CHAPTER(CID, CNAME, LENGTH, BOID)

where the xID attributes represent keys.

1. Formulate a query in relational algebra that retrieves book id, book name, chapter id, chapter name and the length of the chapters for the book "Guide Uppsala".
2. Formulate an SQL query that retrieves the book id, book name, and the number of chapters for each book, i.e. how many chapters each book consists of.

Solution:

1. $\pi_{BID, BNAME, CID, CNAME, LENGTH}(\sigma_{BNAME = \text{'Guide Uppsala'}}(BOOK \bowtie_{BID = BOID} CHAPTER))$

2. SELECT B.BID, B.BNAME, COUNT(*) AS NO_OF_CHAPTERS
   FROM BOOK B, CHAPTER C
   WHERE B.BID = C.BOID
   GROUP BY B.BID, B.BNAME

5 Normalization

5.1 Functional dependencies

A partial functional dependency is a functional dependency $X \rightarrow Y$ where some attribute $A \in X$ can be removed from X and the dependency still holds, i.e. for some $A \in X$, $(X - \{A\}) \rightarrow Y$.

A transitive functional dependency is a functional dependency $X \rightarrow Y$ where there is a set of non-prime attributes Z such that both $X \rightarrow Z$ and $Z \rightarrow Y$ hold.

5.2 Normal forms

First normal form (1NF) states that all values in a relation/table must be atomic. That is, every value is regarded as indivisible, so composite or multiple values are not allowed.

Boyce-Codd Normal Form (BCNF) states that a relation should, in addition to fulfilling 1st normal form, fulfil that all determinants are candidate keys, i.e. every non-trivial full functional dependency should originate from a candidate key.
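The BCNF condition can be checked mechanically from a set of functional dependencies by computing attribute closures. The following is a minimal sketch, not a definitive implementation: it tests the commonly used equivalent condition that the left-hand side of every non-trivial dependency is a superkey (its closure covers the whole schema). The function names and the example schema R(student, course, teacher) are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is already covered, add the right-hand side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Non-trivial dependencies whose determinant is not a superkey."""
    schema = set(schema)
    bad = []
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):
            continue  # trivial dependency, never a violation
        if closure(lhs, fds) != schema:
            bad.append((lhs, rhs))
    return bad


# Hypothetical example: each teacher teaches exactly one course.
schema = ["student", "course", "teacher"]
fds = [
    (("student", "course"), ("teacher",)),
    (("teacher",), ("course",)),
]

violations = bcnf_violations(schema, fds)
```

In this example `teacher -> course` is reported, because `teacher` alone does not determine `student` and is therefore not a superkey, so the relation is not in BCNF.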

6 Transactions and Concurrency

A database transaction is an atomic, logical unit of database processing that accesses and possibly updates various data items. A transaction is always carried out either in its entirety or not at all (this is guaranteed by the transaction manager, which ensures that transactions are handled as an indivisible set of operations).

6.1 Two-phase locking

A two-phase locking protocol guarantees serializable transaction schedules but does not guarantee freedom from deadlocks. A transaction is said to follow a two-phase locking protocol if all of its locking operations precede the first unlock operation in the transaction. Such a transaction thus goes through an expanding phase, in which new locks can be acquired but no locks can be released, and a shrinking phase, in which existing locks can be released but no new locks can be acquired.

6.2 ACID

To preserve the integrity of data, the DBMS must ensure the ACID properties:

Atomicity (atomic or indivisible): a logical processing unit (all operations of the transaction) is carried out in its entirety or not at all.

Consistency (preservation): a correct execution of a transaction in isolation should preserve the consistency of the database (taking it from one consistent state to another).

Isolation: although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. The updates of a transaction shall be isolated from other transactions until after the commit point.

Durability (or permanency): if a transaction completes successfully, the changes it has made to the database must persist and should not be lost in a later system failure.

7 Physical design

Describe the basic principles of external hashing and how it can be used to store and retrieve data records in files.

Answer: Hashing for disk files is called external hashing. The hash function maps a key into a relative bucket number. A table in the file header converts the bucket number into a block address (see Figure 13.9 in Elmasri/Navathe). A typical hash function has the form $h(K) = K \bmod M$, where M is the number of buckets that the file is divided into. Overflow buckets and chaining can be used to handle bucket overflows. To insert a record in the file, the hash function is applied to the hash field of the record, and it returns the number of the bucket in which to insert the record. Searching for a record with a specific value of the hash field works similarly: the hash function is applied to the value and returns the number of the bucket where the record is stored.

Explain the organization and functionality of hash files (sv. hash-filer). The answer should include how to retrieve a data record (sv. datapost) with regard to a specific search key (sv. söknyckel) of the hash file.

Answer: A hash file consists of a static or dynamic number of data blocks that are managed by different kinds of hashing techniques. Hash files handle the addressing of records to data blocks by applying a hash function to the hash field (i.e. the search field), which returns the address of a data block for insertion or retrieval of the record. A common hash function has the form $h(f(p)) = f(p) \bmod M$, where h(f(p)) gives the address of the data block in which record p is to be stored, computed as the hash field f(p) modulo the number of data blocks M. One thus finds where (in which block) the record for a specific search key is located by computing the hash function of the key, which gives the address of the block.

Explain for which types of database queries the following indexes can, and cannot, make execution more efficient:
1. hash index
2. B+-tree

Answer:
1. Hash indexes are efficient for retrieving arbitrary records by the value of the hash field. They are less suitable (comparable to searching an unordered file) for searching on any field other than the indexing field. Nor are they normally suitable for retrieving records in order, since one disk access may be required for each record.
2. B+-trees are efficient for retrieving records in order of the indexing field and for queries with search conditions based on the indexing field.
For example, conditions involving <, >, ≤ and ≥ mean that the records satisfying the condition are stored contiguously, one after another. Queries that access arbitrary records, or records ordered by some field other than the indexing field, gain no particular advantage from a tree index.

8 APIs

JDBC (Java DataBase Connectivity) is a standard interface between the Java programming language and one or more simultaneously available SQL-based relational databases. JDBC handles query results as result streams through a ResultSet object, which represents a result table in which one row at a time can be produced and processed by stepping through the result table. This makes it easier to handle very large data sets.

ODBC (Open DataBase Connectivity) is a programming-language-independent interface to SQL-based relational databases. The O stands for "open" and refers to independence of programming language and operating system.

Prepare precompiles a parameterized query so that it does not need to be compiled again at later (repeated) execution, which eliminates unnecessary query optimization.

9 Recovery

9.1 Recovery according to the deferred update model

1. Start from the last record in the log file and traverse backwards until a checkpoint is reached. Create two lists:
   (a) C: transactions that have reached their commit points
   (b) NC: transactions that have not reached their commit points.
2. Start from the position after the checkpoint in the log file and redo all (Write, T, ...) operations for all transactions T in the list C.
3. Restart all transactions in the list NC.

9.2 Recovery according to the immediate update model

1. Start from the last record in the log file and traverse backwards until a checkpoint is reached. Create two lists:
   (a) C: transactions that have reached their commit points
   (b) NC: transactions that have not reached their commit points.
2. Start from the last record in the log file and, traversing backwards, apply the UNDO procedure to all (Write, T, ...) operations where $T \in NC$.
3. Start from the checkpoint and REDO all (Write, T, ...) operations such that $T \in C$.
4. Restart all failed transactions.
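The immediate update procedure can be sketched on an in-memory log. This is a simplified illustration under several assumptions that are not part of the course material: the log record layout (`("write", T, item, old_value, new_value)`, `("commit", T)`, `("checkpoint",)`), the function name `recover_immediate`, and the rule that UNDO walks the whole log backwards for transactions in NC. Transactions with no log records after the checkpoint are assumed to have finished already.

```python
def recover_immediate(log, db):
    """Apply UNDO/REDO recovery to db (a dict) and return the set NC."""
    # Step 1: traverse backwards to the last checkpoint, building C and NC.
    redo_from = 0  # position just after the checkpoint (0 if there is none)
    committed, seen = set(), set()
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec[0] == "checkpoint":
            redo_from = i + 1
            break
        seen.add(rec[1])
        if rec[0] == "commit":
            committed.add(rec[1])
    not_committed = seen - committed

    # Step 2: UNDO, backwards from the last record, every write by T in NC.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] in not_committed:
            _, _, item, old_value, _ = rec
            db[item] = old_value

    # Step 3: REDO, forwards from the checkpoint, every write by T in C.
    for rec in log[redo_from:]:
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, _, new_value = rec
            db[item] = new_value

    # Step 4: the caller restarts the transactions in NC.
    return not_committed


# Hypothetical log: T1 finished before the checkpoint, T2 committed after it,
# T3 was still running at the time of the failure.
log = [
    ("start", "T1"), ("write", "T1", "x", 1, 10), ("commit", "T1"),
    ("checkpoint",),
    ("start", "T2"), ("write", "T2", "y", 2, 20), ("commit", "T2"),
    ("start", "T3"), ("write", "T3", "x", 10, 99),
]
# State on disk at the failure: T3's write reached disk, T2's did not.
db = {"x": 99, "y": 2}
to_restart = recover_immediate(log, db)
```

Under the deferred update model the UNDO pass would be unnecessary, since uncommitted transactions have not written to the database, so only steps 1 and 3 and the restart remain.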