"!#"$ %'&(*)+), %'-/.102340.15 36.10879#::<; 9#=.2340>@?::1ACB DE $"!"F GH$ IKJLEH DE $"!#FR GHK$ I"!#"$ M NOQP ST T U'TWVX'YUZ#U\[XI]^U'T _a`'b UcEdbfe"ghiT U'Uj] k T ldm'eon [@hi`'t#yu'plp qrjklshi t RuIGv w PP xy zig w {} w u w!"h$ (~00.% >5. A4A ƒ=340 M =.19 #ˆ*=0' :Š; 340> '3CB #34-/B RŒW B.1AC. Œ*0'3 5=9#B3 Ž : P % PI ŒW B.1AC. % = 2 =0 ˆ*.13 A4.10'0.1 A 340> O ' O Bf= š J# I œ#w ž~a6>.ÿ = 2 # /=9 )+.1ˆ~ =A A K W f!" 1 73C=9f9#= 7.1 'A^% : 0 2.> ~ 'ª«w š Hw # G =9 9 :19W Ž : AC:> Ž A6.10>.> =W=9 9#: 9#B % =f23cb ±²³²µ µ " '¹+ µºj»ņ¼j½+¾i±¼ ²À ¾ Á±¼Â¾º º µ»'¾ã¼j½â² ÄaÁºj¾ ½Å½Â¼ Æ"¼ Á ¾ Ç "¼Â µã«½âŗ½â "²ÀÈÉ µæw¾º ºÊ'¼ à Ľ+ µæºj¾ã»ë ¾» ²³ ²µºj¾ #² Äa²µ µ ½ÌÆ µëìãwä ¼ ÃrÁ1 µã #²µÈs¹+ µ ¾ ŅÍ^Îϲ ļj½Â±Ïà ²µÎE½j¹+¾¹Ì²À *¾ "¼ Áº4² ½ÅÐ K± ²i± ¼ ²µ ¾ Ç Á± @Á µã ½Â¼j½j ½ µæ Æ µëì º ²ÀÑ ²Àºj½ÅÐÒI¾Á±Ïº ²ÀÑ ²Àº¼j½+» ¼4Ñ ²Àë¾IÓ Á±W¾ ¾Á "²À Á1 IJµÔ ²½jË̺4 "¼ ÃW» ¼4ë¾ÃÏÕ Á± ¾ ¾Á #²µ ²µ µ # '¹Ì²iÁ1 IJi iö ² ¾ ½ Ç ½Â¼j»Ã ² Äa i²µñ²µ Ҳµ µ ²Àà " Ҽ Ãr "± ²i²µ µ *ľ ¾ÖK¾ ½j²Ð ±²³Èؾ¼ ÃrÆ µáë+½+ µæ #±²i # '¹+ µºj» @±W¾ ½Ö ²À²µÃE µãï "± ²i ²ÀÁ1» ü Ç "¼Â µã«µæ²à µ ½jÔ µãrîï±w¾ ¼ Ã Æ µ ÈØ¾ #¼j µãï¼â½ã ²µ² Ä²Ä Æ µ *IJµ "²ÀÁ "¼4û ¾Ã Äa ²ÀÁ1»Ã¼j½Â¼ ûNļ Æ#Æ"²µ ²Àà º ¾Ã»Ë+¾»²i²À µ ½\Ð ±²³Á1 µ ²µÁ "¼j µã ÈØ¾ IJ³Ö @ #±²³¹Ì µæ Ç ²¾ IJµ *¾Ã Äa #±²i¹Ì²À Æ µ ÈÙ¾ÃÁ²a µæ "± ² ¹Ì µæ Ç ² ¾ ļ ÃW» µº± ¾ ½+¾ºj½Å iö ²µ²ÀÃÏ ¾Ê'²ÀÃϼ à iá1 µãw½j¼jä² ¾ "¼j µã Îϱ ¼ º ²³Á1 µã ½Â " ËÌÁ #¼ û8 #±²i²À µ " '¹+ µºj» Ð ±²iÁ ¾Ë+½j²a µæ "± ² ²µ µ ±W¾ ½Ö ²À²µÃE» ¼4Ñ ²ÀÃÏ "± ²iºj µîr² ½Â ¹Ì ¼ µ ¼ " Ð
Executive Summary

The error typology is a hierarchically organised classification system for all kinds of language-related errors found in contemporary Swedish newspaper articles. The error typology is to be used in the development of a proof-reading tool for Danish, Norwegian, and Swedish in the SCARRIE project. Specifically, the typology forms a basis for the error type code attached to each entry in the Error Corpora Database (ECD), and for the parser in the resulting proof-reading system.

It is of great importance in the development of the proof-reading tool to know what types of errors in fact occur in newspapers, and to have these systematised in an appropriate manner. Potential errors have not been considered, which means that the typology is based solely on errors that have actually occurred and not on hypothetical ones. The Swedish newspapers Svenska Dagbladet and Upsala Nya Tidning have supplied material for the development of the error typology and the ECD, where all the error instances are stored with their corrections and error type codes. The language errors have been detected and corrected by professional proof-readers at the newspapers. The typology is descriptive, not normative.

There are at least four possible dimensions according to which a division between errors could be made: the nature of the error, the cause of the error, the context in which the error appears, and the correction of the error. An error must be recognised before it can be corrected. Therefore, the erroneous feature and the context are the most important characteristics. The principle is thus that two errors of the same kind appearing in a similar context may be given the same error type code even if there might be differences in how the errors could be corrected. The cause of the error has been given the lowest priority. For automatic proof-reading purposes, the cause was found to be of less interest than it would be for pedagogical purposes.
The strategy of the proof-reading tool has been taken into consideration while constructing the error typology. The grammar checker will use a combined approach of linguistic analyses and the application of rules of anticipated errors. Correction will be based on a grammar of foreseen errors. Consistency with regard to standard or style will also be checked. Style checking will concentrate on lexical choice, variation in inflection and, to some extent, syntax.

Errors in newspapers may be of many different types. To capture this variety, the typology needs to be quite elaborate. The hierarchy consists of four levels, which are given the following terms: group, category, subcategory, and specification. There are five groups: spelling errors, grammar problems, punctuation problems, graphical problems, and style, meaning, and reference problems. Each group contains a number of categories, which in turn are divided into subcategories. A more detailed level may occur within the subcategories for further specification of the errors.

The basic division between error types is based on how much context is needed for an error to be recognised. Spelling errors require the smallest context, especially misspellings resulting in non-lexical words. A word is a sequence of characters separated by space, punctuation marks, or graphical signs except hyphens and, on certain occasions, colons and apostrophes. This string-based definition of a word is important for the classification. Spelling errors resulting in existing words can only be recognised in a wider context. Such an error belongs to the grammar problems group if it can be detected by means of grammatical features. Otherwise it belongs to the style, meaning, and reference group. Errors for which a context wider than one sentence is needed belong to this last group, as do problems involving a choice between alternative correct word forms.
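As a rough illustration of the division just described, the sketch below tokenises text according to the string-based word definition and assigns a word-level error to one of three groups. It is a sketch under stated assumptions, not the SCARRIE implementation: the function names, the toy lexicon, and the boolean flags are hypothetical, and treating colons and apostrophes as word-internal only when they stand between letters or digits is just one possible reading of the exception for those signs. The group codes SE, GP, and SP are the ones used in the typology.

```python
import re

# Assumption: hyphens, colons and apostrophes belong to a word only when
# they have a letter or digit on both sides (e.g. "EU:s", "e-post").
WORD_RE = re.compile(r"[^\W_]+(?:[-:'’][^\W_]+)*")


def words(text):
    """Split text into words following the string-based definition."""
    return WORD_RE.findall(text)


def assign_group(error_word, lexicon, detectable_by_grammar,
                 needs_wider_than_sentence=False,
                 choice_between_correct_forms=False):
    """Return the group code (SE, GP or SP) for a word-level error.

    Punctuation and graphical problems are outside this sketch.
    All parameters are hypothetical inputs supplied by the caller.
    """
    # Context beyond the sentence, or a choice between correct forms,
    # always leads to the style, meaning, and reference group.
    if needs_wider_than_sentence or choice_between_correct_forms:
        return "SP"
    # A non-lexical word can be recognised without any context.
    if error_word.lower() not in lexicon:
        return "SE"
    # An existing word used wrongly: grammar group if grammatical
    # features are enough to detect it, otherwise style/meaning.
    return "GP" if detectable_by_grammar else "SP"
```

For instance, with a lexicon containing only "brottsutredning", the misspelling "brottsutrednig" would be assigned SE, while the existing word "brottsutredning" used with the wrong definiteness would be assigned GP.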
The punctuation problems group comprises erroneous end of sentence punctuation and erroneous comma usage, but also a missing capital letter at the beginning of the sentence. Graphical problems are problems related to the typographical representation of texts and signs such as dashes, quotation marks, and space signs.

The three highest levels in the hierarchy have each been assigned a two-letter code; the specification level has been given a two-digit code, starting with 01, in the order in which the specifications appeared in the material. Concatenated, these codes form the error type code assigned to every language error in the Error Corpora Database. The resulting error type code thus consists of eight characters, and each level has its own position in the code.

An error typology built on errors that actually occur in newspapers will perhaps never be complete. During application and revision the typology was found to be adequate for describing Swedish errors. So far, approximately 9,000 error instances have been processed. Continued work with the Error Corpora Database will show how well the typology conforms to Danish and Norwegian. The typology is open to modifications. Specifically, new types can readily be added at the lower levels of the typology. If there is a need for a less detailed typology, only the higher levels may be used.
Uppsala university
Department of Linguistics
SCARRIE
21 January 1998

Error Typology for Automatic Proof-reading Purposes
Olga Wedbjer Rambell

SCARRIE DEL 2.1
FINAL VERSION 1.1
Preface

This report is an updated version of the first version, produced as a result of the on-going revision of the Error Corpora Database. The main difference is that approximately 30 new specifications have been added, almost all of them punctuation problems or graphical problems. A few subcategories have been added as well. For a handful of error types, the text has undergone some minor changes.
Contents

1 Introduction
2 Method
3 Material
4 Error Typology
4.1 Spelling Errors (SE)
4.1.1 Capital Letter Errors (CP): Proper nouns (PN); Compounds with proper nouns (CC); Derivations of proper nouns (DC); Personal titles (PT); Foreign names (FT)
4.1.2 Word Formation Errors (WF): Binding -s- missing (SM); Binding -s- incorrect (SI); Hyphen missing (HM); Hyphen incorrect (HI); Split words (SW); Concatenated words (CW); Misplaced space (MS); Coordination with common word part (CO); Abbreviations (AB); Other word formation errors (OP)
4.1.3 End of Line Hyphenation Errors (HY): Proper according to the morpheme boundary rule and to the one consonant rule (MC); Proper according to the morpheme boundary rule only (MB); Proper according to the one consonant rule only (CR); Proper according to pronunciation and other problems (PR)
4.1.4 (Other) Spelling Errors (OS): Proper nouns (PN); Foreign words (FW); Number expressions (NB); Other words (OW)
4.2 Grammar Problems (GP)
4.2.1 Noun Phrase (NP): Agreement (AG); Gender (GE); Number (NB); Species (SS); Case (CA); Adjective phrase (AP); Participles (PE); Numerals (NL); Nouns (NN); Pronouns (PN); Choice of preposition after a noun (CP); Preposition missing after a noun (MP); Other noun valency problems (NV); Coordination (CO); Word order (WO); Other problems (OP)
4.2.2 Adjective Phrase (AP): Wrong word category (WC); Choice of preposition after an adjective (CP); Comparing än (CM)
4.2.3 Adverb Phrase (AB): Word missing (WM); Doubled word (DW); Word order (WO); Other problems (OP)
4.2.4 Prepositional Phrase (PP): Prepositions (PR); Complements (CO)
4.2.5 Conjunctions and Conjunctive Adverbs (CN): Conjunction or conjunctive adverb missing (CM); Complex conjunction (CC); Doubled conjunctions (DW); Erroneous conjunction (EC); Wrong word category (WC)
4.2.6 Verb Phrase in the Limited Sense (VF): Main verb in the finite form (MF); Temporal auxiliary verb in the finite form + Main verb in the supine (TS); Existential auxiliary verb in the finite form + Main verb in the perfect participle (EP); Auxiliary verb in the finite form + Main verb in the infinitive (AI); Combination of auxiliary verbs + Main verb (AM); Coordination of verbs (CO); Infinitive in infinitive phrase (IP); Other problems (OP)
4.2.7 Verb Valency (VV): Intransitivity (IN); Transitivity (TR); Copula (CO); Reflexivity (RE); Passive constructions (PC); Object with infinitive (OI); Prepositional phrase (PP); Infinitive phrase (IP); Clause (CL); Position holding det (ID); VF missing (VM); NP missing (NM); Choice of preposition/adverb after verbs (CP); Preposition/adverb missing after verbs (MP); Repetition of preposition/adverb (RP)
4.2.8 Pronoun Case (PC): Subjective form correct (SF); Objective form correct (OF)
4.2.9 Agreement (AG): NP and AP subject and complement (NA); NP and AP object and complement (NO); AP and AP subject and complement (AA); NP and perfect participle subject and complement (NE); NP and pronoun subject and complement (PN); NP and NP subject and complement (NP); NP and NP in som phrases subject and complement (NS); NP and NP object and complement (NN)
4.2.10 Referential Problems (RP): Pronoun reference (PN); Choice of VF (VF)
4.2.11 Word Order (WO): Inversion (IN); Inserted phrase (IP); Adverb phrase (AB); Noun phrase (NP); Prepositional phrase (PP); Other word order problems (OP)
4.2.12 Wrong Word Category (WC): Adjective (AV); Adverb (AB); Pronoun (PN)
4.2.13 Other Grammar Problems (OG): Coordinations (CO); Word missing (WM); Doubled words (DW); Heading (HE); Strange syntax and other grammatical problems (OP)
4.3 Punctuation Problems (PU)
4.3.1 End of Sentence Punctuation (ES): Punctuation mark missing (PM); Choice of end of sentence punctuation (EC); Full stop together with quotation marks or parentheses (FS); One punctuation mark too many (PT); Not end of sentence (NE); Other end of sentence punctuation problems (OP)
4.3.2 Capital Letter (CP): Point (PT); Colon (CN); Quotation (QN); Not beginning of sentence (NO)
4.3.3 Comma (CO): Main clauses (MC); Subordinate clause (SC); Phrases / units (PH); Parts of phrases / units (PA); Clarity criteria (CC); Comma instead of word (IW); Comma correct (CO); Other problems with commas (OP)
4.3.4 Dash within the Sentence (DW): Phrases / units (PH); Dash correct (DC)
4.3.5 Colon (CN): Colon correct (CC); Colon missing (CM); Incorrect usage of colon (IC)
4.3.6 Semicolon (SN): Semicolon correct (CS); Semicolon missing (SM); Incorrect usage of semicolon (IS)
4.3.7 Other Punctuation Problems (OP): Erroneous punctuation in certain text types (EP); Other erroneous punctuation marks (EM)
4.4 Graphical Problems (GR)
4.4.1 Space (SC): Missing space around signs (BA); Missing space before signs (SB); Missing space after signs (SM); Too little space (SL); Too much space (ST)
4.4.2 New Line / Paragraph (NL): New line / paragraph to be removed (NR); Erroneously placed line break (AB); New line / paragraph to be inserted (NI)
4.4.3 Dash before Direct Speech (DS): Dash missing (DM); Incorrect hyphen (IH); Incorrect dash (ID); Incorrect underscore (IU)
4.4.4 Dash within the Sentence (DW): Incorrect hyphen (IH); Incorrect underscore (IU); Incorrect dash (ID)
4.4.5 Quotation Marks (QM): Quotation within a quotation (WQ); Incorrect usage of single quotation marks (IS); Quotation marks around titles, names etc (TI); Quotation marks around citations etc (CI); Quotation after så kallade etc (SK); Other incorrect quotation marks (OP)
4.4.6 Parentheses (PA): Parentheses not in pair (PP); Parentheses to be removed (PR); Parentheses missing (PM)
4.4.7 Typographical Errors (TY): Lower case and upper case characters (GC); Italic (IT); Bold (BO); Font size (FS); Other font problems (FO); Margins (MA)
4.4.8 Other Graphical Problems (OP): Hyphens (HY); Accent (AC); Apostrophe (AP); Other signs (OS)
4.5 Style, Meaning, and Reference (SP)
4.5.1 Preferred Spelling (PS)
4.5.2 Abbreviation (AB): Choice of abbreviated form (CA); Full expression preferred (FE)
4.5.3 Number Style (NS): Number beginning the sentence (BS); Small numbers (SN); Decimal numbers (DN); Large numbers (LN); Approximate figures (AF); Ordinals (OR); Year, date, time etc (YD); Other problems (OP)
4.5.4 Correct Word Category but Wrong Word (WN): Adjectives (AV); Adverbs (AB); Conjunctions and Conjunctional Adverbs (CN); Nouns (NN); Prepositions (PR); Pronouns (PN); Verbs (VB); Interjections (IN)
4.5.5 Choice of Words and Expressions (CW)
4.5.6 Choice of Signs (CS): Dash => Colon (CD); Colon => Dash(es) (DS); Dash => Slash (SL); Points in lists (PE)
4.5.7 Choice of Sentence Boundaries (CB): One sentence => Two sentences (OT); Two sentences => One sentence (TO)
4.5.8 Choice of Syntactic Construction (SC): Omitted auxiliary ha (OM); Omission of relative pronoun (OR); The adverb så (SR)
4.5.9 Consistency (CN): Number (NB); Spelling / Word form (SP); Number style (NS)
4.5.10 Redundancy (RD)
4.5.11 Referential Problems (RP): NP and NP (NP); NP and AP (NA); Clause and pronoun (CR); General and specific reference (GS)
5 Closing Remarks
Literature
Appendix A: ECD Error Corpora Database Specification
1 Introduction

The error typology is a classification system of language errors to be used in the development of a proof-reading tool for Danish, Norwegian, and Swedish in the SCARRIE project. The typology forms a basis for the error type code attached to each entry in the Error Corpora Database (ECD)¹, and for the parser in the resulting proof-reading system.

The starting point was to create a distinct and easily used system of error types for describing and classifying Swedish language errors, especially grammatical ones. For this purpose errors have been collected from intended users of the proof-reading tool, such as newspapers. It is of great importance to know what types of errors in fact occur in newspapers, and to have these systematised in an appropriate manner. Potential errors have not been considered, which means that the typology is based solely on errors that have actually occurred and not on hypothetical ones. Hopefully, the error typology will prove useful for Danish and Norwegian as well.

The performance of the proof-reading tool has been taken into consideration while constructing the error typology. The grammar checker will use a combined approach of linguistic analyses and application of rules of anticipated errors. It will recognise phrase constituents but probably not syntactic functions, sentence structure or verb phrases. Correction will be based on a grammar of local rules of foreseen errors. Consistency with regard to standard or style will also be checked. Style checking will concentrate on lexical choice, variation in inflection and, to some extent, syntax.

The error typology is a part of work package 2 of the SCARRIE project, which is funded by the Language Engineering Sector in the Telematics Application Programme of the European Union.² The SCARRIE consortium consists of a co-ordinating partner, four project partners, and nine sub-contractors.
Center for Sprogteknologi in Copenhagen will develop the Danish part of the SCARRIE pilot application, Humanistisk Datasenter in Bergen will develop the Norwegian part, and the Department of Linguistics at Uppsala university will develop the Swedish part. One of the sub-contractors, Stichting Cognitieve Technologie, has already developed a proof-reading tool for Dutch that will be used in the SCARRIE project.

Newspapers and publishing houses in Sweden (Svenska Dagbladet, Upsala Nya Tidning), Norway (Bergen Trykk AS), and Denmark (Berlingske Tidende, Munksgaard International Publishers) have contributed to the project by defining user demands on an automated proof-reading tool. They are also the main suppliers of text material for the dictionaries and the error corpora. In the final phase, these users will act as test beds for the SCARRIE proof-reading software.

After the project, the co-ordinating partner of the project, WordFinder Software, will package the SCARRIE results into its own interface and market it as a product. The ultimate goal for WordFinder Software is to develop a proof-reading tool for everyone who uses a word processor when writing in Swedish, Danish or Norwegian.

In this report the error typology is described in chapter 4, but first the method and the material used are presented in chapters 2 and 3. Nearly every error type is accompanied by examples, in which the first version is the incorrect one and the second is the result of the proof-reader's corrections. The two sentences are separated by a slash (/). Notes about the sources of the examples are given in parentheses.

1 The ECD specification can be found in Appendix A.
2 More information about the SCARRIE project can be found on the Internet: http://www2.echo.lu/langeng/en/le3/scarrie/scarrie.html http://www.scarrie.com
2 Method

The main focus of the typology has been on the recognition of errors, that is, on what information is needed for detecting different language errors. The correction made by the proof-reader has also been taken into consideration. More detailed discussions of the typology guidelines are found in chapter 4.

If the incorrect sentence has been corrected in more than one respect, it is first established whether the corrections depend on each other or not. If they do, they are treated as one error, otherwise as separate errors.

All errors detected and all corrections made in the material used are the work of professional proof-readers at the user sites. The typology is descriptive, not normative. Therefore, when creating the typology, errors that might seem to be correct and corrections that might seem to be erroneous must also be given an appropriate error type code.

The error typology has been constructed in three steps or phases. In the first phase, a preliminary typology was developed simultaneously with the collection of a limited number of errors from Svenska Dagbladet (SvD) and Upsala Nya Tidning (UNT), and was presented to the Danish and Norwegian partners for comments. In the second phase, the preliminary typology was tested by applying it to an extended material from SvD and UNT. In the third and last phase, the typology was revised, and changes were made in the Error Corpora Database, where all the error instances are stored with their corrections and error type codes. No new material was added at this stage.

Construction

A thousand errors from SvD were systematised in a preliminary version of the error typology. The typology was constructed at the same time as the errors were analysed, and is thus very much dependent on the material on which it has been based. The question whether the material is representative or not has to be considered.
The answer will be found when testing more material and classifying new error instances: if the errors can easily be classified without having to add or in any other way alter the classes, the typology embraces representative errors.

Application

Material from SvD and UNT was classified according to the preliminary typology. The typology was expanded to cover new error types by making the existing error type codes cover more problems, and by introducing new error type codes. This phase was carried out by students at the Department of Linguistics at Uppsala university. Three students classified the errors, and three students typed the sentences, the error type codes and complementary information into files forming the preliminary Error Corpora Database.

Revision

The outcome of the application was evaluated. The structure of the typology seemed to be appropriate. Problems that had arisen during the application phase were addressed, and the typology was changed to become more detailed and more consistent. The main changes concerned verb-related problems. The report has also been extended and rewritten as a result of the application and revision phases. The Error Corpora Database was then revised in accordance with the improved typology.
3 Material

The Swedish newspapers Svenska Dagbladet (SvD) and Upsala Nya Tidning (UNT) have supplied material for the development of the error typology. The language errors have been detected and corrected by professional proof-readers at the newspapers, following the norms in force there. Both SvD and UNT have language norms of their own printed in booklets.

Different material has been used in each of the three phases of the development of the error typology. Each source has been given a source code, to which the source notes at the examples in this report refer.

Construction

The newspaper articles from which the errors were collected in the construction phase cover mainly two text types: domestic news, and political debate articles not written by journalists.

1. Domestic news articles, SvD (GS)
In 1994, Gabriella Sandström made a study of errors found in 29 domestic news articles in 3 versions (from script to printed text). The total number of errors is 512 (the same error remaining in a following version of the article has only been counted once).

2. Minor study, SvD (MS)
A minor study was carried out some years ago by the proof-readers at SvD. The material that has been passed on to the SCARRIE project consists of a concluding summary of their findings plus 25 example texts. The total number of errors in the example texts is 26.

3. Survey made by a reader, SvD (RS)
A reader with an interest in language went through the SvD of May 25, 1996, and found 65 errors. This is the only material used that was not corrected by a professional proof-reader.

4. Collection of SvD articles 1993-1996 (CS)
This collection consists of nearly 50 texts with about 1,700 correction marks, of which 300 errors originating from 5 articles have been used in the creation of the error typology. The articles originate from the headings Samtider and Brännpunkt, and cover political debate and other contemporary issues where a more personal style is allowed.
The articles have been saved by the proof-readers on their own initiative. It is important to note that the vast majority of the articles were not written by professional journalists, although the writers most often have an academic education and/or occupation.

5. Upsala Nya Tidning, October 1996 (UNT)
Upsala Nya Tidning has supplied the Department of Linguistics with the proof-readers' paper copies on which they have marked the corrections to be made. The articles are of all genres that are usually proof-read at the newspaper. Five days' production has been covered.

Application

Language errors were supplied from Svenska Dagbladet and Upsala Nya Tidning for the Error Corpora Database. The material from SvD came in electronic form, while UNT supplied the department with paper copies as in the previous material delivery (5 above).³

6. Svenska Dagbladet, 1997 (SvD)
The reporters' versions of 734 articles were proof-read. The articles represent seven different text genres: editorials, domestic affairs, foreign affairs, local news, economy, culture, and sports. Two weeks' production was covered, except for the sports pages, which cover more than two weeks. The articles were all written during the first eight months of 1997. There were 1,965 erroneous sentences in total, containing 2,143 errors. Unlike the other materials, the non-proof-read and the proof-read articles were delivered in electronic form. The two versions were compared automatically. Pairs of sentences in which differences were discovered were picked out and manually examined in the same way as the other material.

3 See Wedbjer Rambell et al (1998): An Error Database of Swedish
7. Upsala Nya Tidning, February-May 1997 (UNT)
From February to May, errors from 25 days' normal production were analysed and classified. Just as for material 5 above, this material covered the genres normally proof-read at the paper. It contained nearly 6,900 errors. There are no statistics on how many articles are included in the material, nor on their distribution among the sections in the paper.

Revision

The material in the Error Corpora Database (i.e. material 6 and 7 above) was analysed again, resulting in a revised typology. No new material was added at this stage.
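The report does not describe how the automatic comparison of the two versions in material 6 was carried out. The following minimal sketch shows one way such a comparison could be done, using Python's standard difflib module to align the sentences of the reporter's version with those of the proof-read version and to pick out the pairs that differ. The function names and the naive sentence splitter are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def changed_sentence_pairs(original, proofread):
    """Return (before, after) pairs for every stretch of sentences that differs."""
    before = split_sentences(original)
    after = split_sentences(proofread)
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Join each changed stretch so that merged or split sentences
        # stay together for the manual examination that follows.
        pairs.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return pairs
```

Applied to an article in which only one sentence was corrected, the function returns a single pair holding the uncorrected and the corrected sentence; each such pair would then be examined and classified by hand.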
4 Error Typology

This chapter focuses on the error typology for automatic proof-reading purposes. The chapter is divided into five sections, one for each problem group. Examples of nearly every error type code are given. Before the typology is presented in more detail, the basic division lines and fundamental guidelines are discussed.

There are at least four possible characteristics according to which a division between errors could be made: the nature of the error, the cause of the error, the context in which the error appears, and the correction of the error. An error must be recognised before it can be corrected. Therefore, the erroneous feature and the context are the most important characteristics. The principle is thus that two errors of the same kind appearing in a similar context are given the same error type code even if there might be differences in how the errors could be corrected. The cause of the error has been given the lowest priority. For automatic proof-reading purposes, the cause is of less interest than it would be for pedagogical purposes.

The error typology is a hierarchically organised classification system of all kinds of language-related errors found in contemporary Swedish newspaper articles. The hierarchy consists of four levels, which are given the following terms: group, category, subcategory, and specification. Each level is typographically marked in the report as follows:

x.x Group
x.x.x Category
Subcategory
xx specification

There are five groups: spelling errors, grammar problems, punctuation problems, graphical problems, and style, meaning, and reference problems. Each group contains a number of categories, which in turn are divided into subcategories. A more detailed level may occur within the subcategories for further specification of the errors. The main idea is that the three higher levels (i.e. the groups, the categories, and the subcategories) state the proper or correct usage while the error is specified on the lowest level.
(On some occasions this principle is violated, as will be shown below.)

The basic divisions between error types are based on how much context is needed for an error to be detected. Spelling errors require the smallest context, especially misspellings resulting in non-lexical words. A word is a sequence of characters separated by space, punctuation marks, or graphical signs except hyphens and, on certain occasions, colons and apostrophes. This typographical definition of a word is important for the classification. Spelling errors resulting in existing words can only be recognised by looking at a wider context. Such an error belongs to the grammar problems group if it can be detected by means of grammatical features. Otherwise it belongs to the style, meaning, and reference group. Errors for which a context wider than one sentence is needed always belong to this last group, as do problems involving a choice between correct word forms.

The punctuation problems group contains erroneous end of sentence punctuation and erroneous comma usage, but also a missing capital letter at the beginning of the sentence. Graphical problems are problems related to signs such as dashes, quotation marks, and space signs.

The division between the five groups is, however, not always clear-cut. The dash within the sentence occurs as a category both in the punctuation problems group (dealing with errors related to the function of the dash in the sentence) and in the graphical problems group (dealing with errors related to how the dash is graphically represented). The choice between a comma and a dash is seen as a style problem and thus belongs to the style, meaning, and reference group. Problematic issues will be discussed in more detail in the following sections.

The three highest levels in the hierarchy have each been assigned a two-letter code; the specification level has been given a two-digit code, starting with 01, in the order in which the specifications appeared in the material.
Concatenated, these codes form the error type code assigned to every language error in the Error Corpora Database. The resulting error type code thus consists of eight characters, and each level has its own position in the code.

Although the lower levels have not yet been presented, an example may give an idea of the structure of the typology and its error type codes. Let GP stand for the group of grammar problems, NP for the noun phrase category, and AG for the agreement error subcategory within the NP. Finally, let 02 represent erroneous species agreement between the premodifier and the noun. The error type code GPNPAG02 will then be assigned to the following example:
Polisen avblåser nu den stora brottsutredning. / Polisen avblåser nu den stora brottsutredningen. (GS11A)

The specification level is the one that will expand the most when more material is examined. It is easier to give an additional error specification a new sequential number than a two-letter combination. In those subcategories lacking a specification level, the error can be assigned the specification type code 00 as default.

In the work with the typology, Swedish language guides have been consulted, such as Svenska skrivregler by Svenska språknämnden (1991), Nationalencyklopedins ordbok (1995-1996), and Svenska Akademiens ordlista (1986). Svensk grammatik by Olof Thorell (1997) and Allmän grammatik by Magnus Ljung and Sölve Ohlander (1982) have been very helpful on grammatical issues. See also the list of literature.
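The positional structure of the error type code, as described earlier in this chapter, can be sketched in a few lines of Python. This is only an illustration: the names ErrorTypeCode, parse_code, build_code, and truncate are hypothetical and do not come from the ECD specification, and the sketch assumes that the three letter codes are always upper case and that every code has all four positions filled.

```python
import re
from typing import NamedTuple


class ErrorTypeCode(NamedTuple):
    group: str          # e.g. "GP" (grammar problems)
    category: str       # e.g. "NP" (noun phrase)
    subcategory: str    # e.g. "AG" (agreement)
    specification: str  # e.g. "02"; "00" when no specification exists


# Three two-letter codes followed by a two-digit specification.
CODE_RE = re.compile(r"([A-Z]{2})([A-Z]{2})([A-Z]{2})(\d{2})")


def parse_code(code):
    """Split an eight-character error type code into its four levels."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid error type code: {code!r}")
    return ErrorTypeCode(*match.groups())


def build_code(group, category, subcategory, specification=0):
    """Concatenate the levels; the specification defaults to 00."""
    code = f"{group}{category}{subcategory}{specification:02d}"
    parse_code(code)  # raises ValueError if any part is malformed
    return code


def truncate(code, levels):
    """Keep only the first `levels` levels for a less detailed typology."""
    if not 1 <= levels <= 4:
        raise ValueError("levels must be between 1 and 4")
    return "".join(parse_code(code)[:levels])
```

With these helpers, parse_code("GPNPAG02") yields the group GP, the category NP, the subcategory AG, and the specification 02, while truncate("GPNPAG02", 2) keeps only GPNP for a coarser classification.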
4.1 Spelling Errors (SE)

The vast majority of the spelling errors can be recognised, and perhaps even corrected, independently of the context in which they appear. A spelling error usually involves only one word. A word is defined as a sequence of letters separated by spaces, punctuation marks, or graphical signs except hyphens and, in certain cases, colons and apostrophes. Multiword expressions constitute a closed context, and errors in such expressions (for instance names consisting of more than one word) belong to the spelling errors group. Errors in idiomatic expressions fall outside this group, as do spelling errors resulting in existing words. These errors are addressed in the grammar problems group and in the style, meaning, and reference group.

The spelling errors group consists of four categories:

Capital Letter Errors (CP)
Word Formation Errors (WF)
End of Line Hyphenation Errors (HY)
(Other) Spelling Errors (OS)

4.1.1 Capital Letter Errors (CP)

There are different types of capital letter errors. The vast majority of capital letter errors are context independent and can therefore be corrected by using a dictionary; this holds for proper nouns and for compounds and derivations of proper nouns. However, one type is context dependent: if a sentence starts with an erroneous lower case letter, the error depends on the context and forms a category within the punctuation problems group. Ordinary words that start with a capital letter without being the first word in the sentence are also dealt with in the punctuation problems group, since ordinary words are not proper nouns.

Proper nouns are, however, not always easily distinguishable from ordinary words. For instance, a proper noun does not necessarily begin with a capital letter. In cases where the status of a word is unclear, it is dealt with as if it were a proper noun, thus belonging to the spelling errors group.
Capital letter problems in abbreviations belong either to the word formation category or to the style, meaning, and reference group: the former classification is used if the form of the abbreviation is incorrect, and the latter if it is a question of choice between two correct forms. Erroneous capital letters that are not the first letter of the word are treated as ordinary spelling errors. Capital letter problems may coexist with word formation problems. Such error combinations belong to the appropriate word formation category and are not dealt with in the capital letter category.

The subcategorisation of capital letter errors is based on what type of word is involved: proper noun, compound with proper noun, derivation of proper noun, personal title, and foreign name. The specifications state the erroneous form and the correct form of the letter.

Proper nouns (PN)

The proper noun subcategory contains all capital letter errors related to names of different kinds: persons, organisations, companies, countries, cities, etc. Depending on what the proper nouns denote, different rules apply. These rules are newspaper specific; not all of them are in accordance with the recommendations Svenska språknämnden gives in Svenska skrivregler (1991). The use of capital letters is a matter of norms rather than rigid rules, and it ought to be possible for each user to adjust these norms in the final proof-reading program.

01 lower case letter => upper case letter
Lars Hjalmarsson, ridsportförbundets nye generalsekreterare, hoppas på en snar förbättring av förbundets sponsorsituation. / Lars Hjalmarsson, Ridsportförbundets nye generalsekreterare, hoppas på en snar förbättring av förbundets sponsorsituation. (SvD Sport)
02 upper case letter => lower case letter
Förutsättningen är att Sjöfartsverket bedömer åtgärden som lämplig... / Förutsättningen är att sjöfartsverket bedömer åtgärden som lämplig... (GS2A)

Compounds with proper nouns (CC)

In Swedish, a compound containing a proper noun ought to have a capital letter, no matter where in the compound the name occurs, if it has the character of a name rather than denoting a species. This rule is not consistently applied by the proof-readers. Capital letter problems may coexist with incorrect hyphenation of compounds. These error combinations are addressed in the hyphen incorrect subcategory of the word formation category. However, if the hyphenation is correct but there is a capital letter problem in the second part of the compound, the error is dealt with as a capital letter error.

01 lower case letter => upper case letter
I lerumfallet har dessutom Kommunalanställdas förbund medgivit en visstidsdispens som medfört att ambulansförarna kunnat komma upp i de här, som vi tror, extrema övertidssummorna. / I Lerumfallet har dessutom Kommunalanställdas förbund medgivit en visstidsdispens som medfört att ambulansförarna kunnat komma upp i de här, som vi tror, extrema övertidssummorna. (GS1A)

02 upper case letter => lower case letter
Karlskrona förväntas bli en mötesplats för företag i Östersjöstaterna med mer än 50 miljoner människor. / Karlskrona förväntas bli en mötesplats för företag i östersjöstaterna med mer än 50 miljoner människor. (SvD Inrikes)

När det ställs samman blir intrycket rätt beklämmande, ty vid sidan av Palmes entydiga fördömanden av Tjeckoslovakiens Husakregim, Franco-Spanien, grekjuntan, apartheid i Sydafrika och Pinochetregeringen i Chile samt uppbackningen av frigörelse från kolonialväldena står mycken tvetydighet eller värre i andra fall.
/ När det ställs samman blir intrycket rätt beklämmande, ty vid sidan av Palmes entydiga fördömanden av Tjeckoslovakiens Husakregim, Franco-spanien, grekjuntan, apartheid i Sydafrika och Pinochetregeringen i Chile samt uppbackningen av frigörelse från kolonialväldena står mycken tvetydighet eller värre i andra fall. (SvD Ledare)

Derivations of proper nouns (DC)

A derivation is not usually written with a capital letter even though the original proper noun is.

01 lower case letter => upper case letter
... den omtalade elliottska rävfarmen... / ... den omtalade Elliottska rävfarmen... (SvD Kultur)

02 upper case letter => lower case letter
Nu är Wayne Roques i Stockholm för att ge sina argument mot legaliseringen på konferensen Svensk narkotikapolitik i ett Europeiskt perspektiv. / Nu är Wayne Roques i Stockholm för att ge sina argument mot legaliseringen på konferensen Svensk narkotikapolitik i ett europeiskt perspektiv. (GS5A)

Personal titles (PT)

Titles normally take a lower case initial, but there are instances where proof-readers have changed a lower case letter to an upper case letter.
01 lower case letter => upper case letter
professor Sören Berg. / Professor Sören Berg. (UNT 970419 Uppsala)

02 upper case letter => lower case letter
Fallet Silje har nått ända till Seargent Steve Stonehill, Merseyside Police i Liverpool, och han hittar egentligen inga viktiga likheter mellan fallen. / Fallet Silje har nått ända till seargent Steve Stonehill, Merseyside Police i Liverpool, och han hittar egentligen inga viktiga likheter mellan fallen. (GS6A)

Foreign names (FT)

Titles of conferences, plays, etc. written in English ought to follow the rules for English and not Swedish, which means that every content word in a title should be written with a capital letter, not only the first word. All foreign names and terms with capital letter problems are dealt with here, since they are not ordinary words in Swedish. Whether it is appropriate to mix another language and its norms into a Scandinavian proof-reading system is open to discussion, but nevertheless foreign words and expressions (especially English ones) are not uncommon in Swedish newspaper articles.

01 lower case letter => upper case letter
Han är oerhörd glad över bildandet av European cities against drugs (ett initiativ av Stockholms förra finansborgarråd Carl Cederschiöld). / Han är oerhörd glad över bildandet av European Cities Against Drugs (ett initiativ av Stockholms förra finansborgarråd Carl Cederschiöld). (GS5BC)

02 upper case letter => lower case letter
Nej, men jag tycker min musik är Hard Listening, ha! / Nej, men jag tycker min musik är Hard listening, ha! (SvD Kultur)

4.1.2 Word Formation Errors (WF)

Word formation errors are more or less restricted to compounding errors, for instance problems with binding morphemes and hyphens.
Other subcategories deal with split words (two or more words should be written together as one word), concatenated words (one word should be written as two or more separate words), misplaced space (a space should be moved, not removed as in split words, or inserted as in concatenated words), and coordination with a common word part (erroneous hyphenation when coordinating words that share a word part). Problems with abbreviations form a separate subcategory.

Hyphens are seen as parts of words. Colons and apostrophes may also be included in words: a colon may stand between an abbreviation and its inflection, and an apostrophe may signal the genitive form. When these signs are used for other purposes, and are thus not part of a word, errors involving them are not word formation errors but punctuation problems or graphical problems, along with problems with other signs. A problem in choosing the proper word form is not a word formation error; it is either a style problem (when two forms are correct but one is preferred over the other) or a grammar problem (when the context within the sentence decides which form to use).

Binding -s- missing (SM)

When concatenating two words into a compound, an s is often needed between the words.

Norrmän och danskar brukar anse att svenskarna har storebrorfasoner, att vi inte anstränger oss lika mycket för att samtalspartnern ska förstå. / Norrmän och danskar brukar anse att svenskarna har storebrorsfasoner, att vi inte anstränger oss lika mycket för att samtalspartnern ska förstå. (GS18ABC)
Binding -s- incorrect (SI)

A concatenation has been made with an erroneous binding s which thus should be removed. If the binding s should be replaced by a hyphen, the error belongs to the next subcategory: hyphen missing.

Eddie Irvine ligger tvåa när han går i depå för däcksbyte och bränslepåfyllning. / Eddie Irvine ligger tvåa när han går i depå för däckbyte och bränslepåfyllning. (SvD Sport)

Hyphen missing (HM)

In other cases, a hyphen should be put between the two words. A further specification is made based on the existence of an incorrect binding s or a capital letter problem. When a hyphen should be moved within a word, the problem is dealt with as one error belonging to this subcategory.

01 without an incorrect binding -s- and capital letter problem
... några enstaka funktioner, exempelvis se kurserna på New Yorkbörsen. / ... några enstaka funktioner, exempelvis se kurserna på New York-börsen. (MS54)

02 with an incorrect binding -s-
Vid Jarl Hjalmarsonsstiftelsens nyligen avhållna seminarium i Stockholm var besvikelsen bland de baltiska deltagarna tydlig. / Vid Jarl Hjalmarson-stiftelsens nyligen avhållna seminarium i Stockholm var besvikelsen bland de baltiska deltagarna tydlig. (CS1)

03 with capital letter problem
Saltsjöboo / Saltsjö-Boo (UNT 970502 Familjenytt)

04 hyphen to be moved
mag-tarmkanalen / magtarm-kanalen (UNT 970414 Debatt)

Hyphen incorrect (HI)

Sometimes a hyphen is put between concatenated words where there should be no hyphen. Capital letter problems may occur together with erroneous hyphenation in compounds, a problem addressed on the specification level. Occasionally, when a binding hyphen is removed, a consonant is tripled and one of them is to be removed.

01 without capital letter problem
Boston-forskarna har även identifierat ett ämne som produceras av ursprungstumören och som i djurförsök visat sig kunna förhindra uppkomsten av metastaser.
/ Bostonforskarna har även identifierat ett ämne som produceras av ursprungstumören och som i djurförsök visat sig kunna förhindra uppkomsten av metastaser. (GS16A)

02 with capital letter problem
fastlands-kina / Fastlandskina (UNT 970220 Uppsala)

03 consonant to be removed
cigarett-tändare / cigarettändare (UNT 970410 För Dagen)
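The four subcategories dealing with the binding s and the hyphen differ in what the proof-reader added or removed, so the simplest cases can be told apart by comparing the erroneous form with its correction. The following Python sketch illustrates this under strong assumptions: the pair is already known to be a compounding error, the function names and the returned labels (such as "HM 01") are hypothetical, and the specifications involving capital letters or a moved hyphen (HM 03, HM 04, HI 02) are not covered and yield None.

```python
from typing import Optional


def inserted_char(shorter: str, longer: str) -> Optional[str]:
    """Return the one character whose insertion turns shorter into longer."""
    if len(longer) != len(shorter) + 1:
        return None
    for i, ch in enumerate(longer):
        if longer[:i] + longer[i + 1:] == shorter:
            return ch
    return None


def classify_pair(error: str, correction: str) -> Optional[str]:
    """Label an (error, correction) pair; None if the sketch cannot decide."""
    added = inserted_char(error, correction)    # added by the proof-reader
    removed = inserted_char(correction, error)  # removed by the proof-reader
    if added == "s":
        return "SM"
    if removed == "s":
        return "SI"
    if added == "-":
        return "HM 01"
    if removed == "-":
        return "HI 01"
    # HM 02: an incorrect binding s is replaced by a hyphen.
    if len(error) == len(correction):
        diffs = [(e, c) for e, c in zip(error, correction) if e != c]
        if diffs == [("s", "-")]:
            return "HM 02"
    # HI 03: removing the hyphen triples a letter, one of which is dropped.
    if error.count("-") == 1:
        joined = error.replace("-", "")
        dropped = inserted_char(correction, joined)
        if dropped is not None and dropped * 3 in joined:
            return "HI 03"
    return None


print(classify_pair("däcksbyte", "däckbyte"))
print(classify_pair("cigarett-tändare", "cigarettändare"))
```

A comparison of this kind only describes the surface change; it cannot tell a binding s from any other inserted s, which is why the sketch presupposes that the error has already been identified as a compounding error.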
Split words (SW)

A split word error occurs when several words ought to be written together as one word. The error specification states how many words are to be concatenated. Whether the words may appear on their own or not is also taken into consideration on the specification level. In certain cases, the concatenation should be made by replacing the space with a hyphen.

01 2 lexical words
I och med den fruktansvärda katastrofen som jag har gått i genom har jag också fått en erfarenhet som jag kan använda mig av. / I och med den fruktansvärda katastrofen som jag har gått igenom har jag också fått en erfarenhet som jag kan använda mig av. (GS4BC)

02 3 lexical words
Där i genom blir det omöjligt för någon annan att döma pojkarna. / Därigenom blir det omöjligt för någon annan att döma pojkarna. (GS6BC)

04 2 words, at least one word is non-lexical
Läkaren fastslo g att brännmärkena hade orsakats genom tortyr. / Läkaren fastslog att brännmärkena hade orsakats genom tortyr. (SvD Inrikes)

12 2 words, at least one needs correction
hip hopare / hiphoppare (UNT 970421 Nöje)

05 2 lexical words + a hyphen after the first word or before the second word
parad- exempel / paradexempel (UNT 970430 Signerat)
20 -talet / 20-talet (UNT 970417 Uppland)

08 2 lexical words + a hyphen between them
facklig - politisk samverkan / facklig-politisk samverkan (UNT 970415 Debatt)

10 2 words + hyphen removed in the last word
Villan i Totebo var helt övertänd när brandkåren var på plats vid halv fyra-tiden. / Villan i Totebo var helt övertänd när brandkåren var på plats vid halvfyratiden. (SvD Inrikes)

09 2 words + capital letter problem
... i Förstamaj demonstrationen i Östervåla. / ... i förstamajdemonstrationen i Östervåla. (UNT 970502 Uppland)

07 compound with hyphen, common word
Det innebär fem sex procent, ungefär. / Det innebär fem-sex procent, ungefär. (SvD Inrikes)
03 compound with hyphen, proper noun
... den anständiga, toleranta och allmänt humana ateism som vi finner hos en Bertrand Russell, en Jean Paul Sartre eller en Ingemar Hedenius, för att nu ta tre varianter,... / ... den anständiga, toleranta och allmänt humana ateism som vi finner hos en Bertrand Russell, en Jean-Paul Sartre eller en Ingemar Hedenius, för att nu ta tre varianter,... (CS3)

11 compound with hyphen + capital letter problem
Drive in-besiktning / drive-in-besiktning (UNT 970427 Ettan)

06 compound with hyphen, with figures, abbreviations etc.
Som grund för beslutet ligger en studie där barnhälsovården i Uppsala gått igenom alla avvikelser som hittats på en hel årskull 1,5 åringar satt i relation till den grupp som diagnostiserat fynden. / Som grund för beslutet ligger en studie där barnhälsovården i Uppsala gått igenom alla avvikelser som hittats på en hel årskull 1,5-åringar satt i relation till den grupp som diagnostiserat fynden. (SvD Inrikes)

Concatenated words (CW)

The opposite of the split words subcategory is the concatenated words subcategory: one single word should be divided into several words. When abbreviations are formed incorrectly with a missing space, the error falls within the abbreviations subcategory. The specifications concern how many words the erroneous word consists of, whether there are any other problems in the concatenated word, and what types of words are involved.

01 2 words, both correct
Den estniska regeringen har hittills inte tagit ställning ifrågan. / Den estniska regeringen har hittills inte tagit ställning i frågan. (GS2A)

09 3 words, all correct
... och förklarade vidare att "hur vi framstår spelar egentligen inte så stor roll längre eftersom vi iallafall kommer att få sparken och sedan hamnar vi i helvetet och får leva på äppelskrutt i evighet som hämnd för att vi lät oss luras av den där ormen för länge sedan". / ...
och förklarade vidare att "hur vi framstår spelar egentligen inte så stor roll längre eftersom vi i alla fall kommer att få sparken och sedan hamnar vi i helvetet och får leva på äppelskrutt i evighet som hämnd för att vi lät oss luras av den där ormen för länge sedan". (SvD Kultur)

02 2 words, one word needs correction
De exempellösa ekonomiska framgångar som snabbt lyfte Japan till positionen som världens näst starkaste ekonomi kom självfallet också att högst avsevärt stärka LPD:ställning och göra väljarna mer benägna att överse med en tvivelaktig politisk moral. / De exempellösa ekonomiska framgångar som snabbt lyfte Japan till positionen som världens näst starkaste ekonomi kom självfallet också att högst avsevärt stärka LPD:s ställning och göra väljarna mer benägna att överse med en tvivelaktig politisk moral. (UNT 961022 Ledare)

04 2 words, with figures, letters etc.
Kl 14.00100 miljoner transistorer på en... / Kl 14.00 100 miljoner transistorer på en... (UNT 970419 Uppsala)
07 2 words, foreign words
Varför inte en performance eller standup: / Varför inte en performance eller stand up: (SvD Kultur)

Words with common word parts form a separate subcategory in which problems with hyphens are addressed. Problems with missing space, however, belong here.

05 words with a common word part
Det menar gatu-och fastighetskontoret. / Det menar gatu- och fastighetskontoret. (SvD Stockholm)

Erroneous hyphens are also addressed. The hyphen should be replaced by a space. The type of words involved is taken into consideration. For an expression to be labelled proper noun, at least one of the words should be a proper noun.

03 erroneous compound with a hyphen, proper noun
Men också förklara konsekvenserna av den politik som Maj-Britt Theorin, Per Gahrton och Jörn Svensson företräder. / Men också förklara konsekvenserna av den politik som Maj Britt Theorin, Per Gahrton och Jörn Svensson företräder. (CS1)

06 erroneous compound with a hyphen, common word
Det är mycket kvar att göra med just den formen, kanske vi kan hålla på fyra-fem år till. / Det är mycket kvar att göra med just den formen, kanske vi kan hålla på fyra fem år till. (SvD Kultur)

10 erroneous compound with hyphen, comma to be inserted
Gruppspykologi-dynamik och... / Gruppspykologi, dynamik och... (UNT 970429 Debatt)

08 other
KS har inget eget kapital trots att det normala för ett sjukhus bör vara att 50av tillgångarna finansieras med egna medel. / KS har inget eget kapital trots att det normala för ett sjukhus bör vara att 50 procent av tillgångarna finansieras med egna medel. (SvD Stockholm)

Misplaced space (MS)

When a space should be moved, not inserted or removed, the error falls within the misplaced space subcategory. No specification is made.

Den so mbara... / Den som bara...
(UNT 970306 Kultur)

Coordination with common word part (CO)

Hyphens can also be used when coordinating two words ending in (or beginning with) the same word, which then can be replaced with a hyphen in the first (or last) compound. Problems with missing space are addressed in the concatenated words subcategory.

01 hyphen missing
Där finner man bland annat: tidigarelagd momsinbetalning, återinförd rätt till blockad mot enmans och familjeföretag, höjda arbetsgivaravgifter och återinförd facklig vetorätt mot anlitande av entreprenörer. /