Technical Report Series on Corpus Building




Technical Report Series on Corpus Building
Vol. 9 (June 2013)

Swedish Corpora

Uwe Quasthoff, Dirk Goldhahn
Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

Affiliation of the authors:
Uwe Quasthoff, Dirk Goldhahn: Institut für Informatik, Universität Leipzig
{quasthoff, dgoldhahn}@informatik.uni-leipzig.de

Copyright: Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, http://asv.informatik.uni-leipzig.de/

Technical Report Series on Corpus Building
Vol. 1: Deutscher Wortschatz 2013
Vol. 2: Danish Corpora
Vol. 3: Dutch Corpora
Vol. 4: Icelandic Corpora
Vol. 5: Hungarian Corpora
Vol. 6: Ukrainian Corpora
Vol. 7: Indonesian Corpora
Vol. 8: Czech Corpora
Vol. 9: Swedish Corpora

This PDF document was created using the open source tool mwlib. For more information, see http://code.pediapress.com/
PDF generated at: 26. June 2013

Contents

Swedish corpora
- Introduction to corpus creation
- SWE - a processing related language description
- SWE corpora
- SWE corpus comparison

The appendices are listed below once per appendix type. Unless noted otherwise, each type has one appendix for each of the 14 corpora: swe news 2007 to 2012, swe newscrawl 2011 and 2012, swe web 2002, 2011 and 2012, swe wikipedia 2007 and 2012, and swe mixed 2012.

Processing details
- Database summary

Content details
- Size of different TLDs (news, newscrawl, web and mixed corpora only)
- Size of largest domains (news, newscrawl, web and mixed corpora only)
- Number of sources by time period (news corpora only)

Word details
- Words by length without multiplicity (all corpora except swe wikipedia 2007)
- Words by length with multiplicity
- The most frequent 50 words
- Longest words in top-1.000 by rank

Character N-gram details
- Alphabet as used in the top-100.000 words

Abbreviation details
- Most frequent abbreviations
- Left neighbors of the full stop
- Left neighbors of the full stop with additional internal full stops

Sentences details
- Shortest sentences
- Longest sentences
- Length of sentences in characters
- Length of sentences in words

Oddities details
- Longest words
- Sentences with high average word length
- Problems with sentence segmentation - words ending in a stopword

Swedish corpora

Introduction to corpus creation

The Leipzig Corpora Collection (LCC) collects Web-based corpora for many different languages. The main text genres are newspaper texts, Wikipedias and randomly collected web pages. All corpora are processed in the same way:

- Crawling of Web pages
- HTML stripping
- Language identification
- Sentence segmentation
- Cleaning: removal of ill-formed sentences
- Duplicate removal
- Calculation of word frequencies and word co-occurrences

As a result, we have a corpus containing only well-formed sentences in the language under consideration. The sentences are in random order; hence, sharing the corpus does not violate copyright law because it is impossible to reconstruct the original texts.

The pre-processing contains both language-independent steps (like HTML stripping and duplicate removal) and language-dependent steps (like language identification and sentence segmentation). The language-specific parts in particular are vulnerable to processing problems. The aim of the paper is to identify possible problems and evaluate the results. The following problems are addressed:

- A processing-focused language description
- Language size: How much text is available for this language? What are the biggest sources?
- Corpus description: genre, size, crawling and processing date.
- Possible problems in language identification: Which languages are similar?
- Character set and alphabet
- Inspecting the word list: most frequent words, longer high-frequency words and the longest words overall. Word length distribution.
- Can abbreviations confuse sentence segmentation? Information about the abbreviation list.
- Inspecting sentences: inspect the shortest and longest sentences to identify possible segmentation problems. Sentence length distribution.

The paper describes the results of these inspections; the appendices show the exact results for the different corpora. This helps to compare the corpora with respect to quality. In the section Quality Overview, an overall quality description for each corpus is given. All corpora contain only minor problems, which are irrelevant for most applications; where larger problems were found, the corpus creation was repeated.
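The last processing steps described above (duplicate removal, random ordering and word frequency counting) can be illustrated with a minimal Python sketch. This is not the LCC toolchain: the function name `build_corpus`, the whitespace tokenization and the fixed random seed are assumptions made for illustration only.

```python
import random
from collections import Counter


def build_corpus(sentences, seed=0):
    """Deduplicate sentences, put them in random order, and count word frequencies.

    Hypothetical helper for illustration; the real LCC pipeline is not shown in the report.
    """
    # Exact duplicate removal; empty or whitespace-only lines are dropped as ill-formed.
    unique = list(dict.fromkeys(s.strip() for s in sentences if s.strip()))

    # Random order, so the original texts cannot be reconstructed from the corpus.
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Naive tokenization: split on whitespace and strip surrounding punctuation.
    freq = Counter()
    for sentence in unique:
        for token in sentence.split():
            word = token.strip(".,;:!?\"'()")
            if word:
                freq[word] += 1
    return unique, freq


sentences = ["Det regnar i dag.", "Jag bor i Lund.", "Det regnar i dag.", "  "]
corpus, freq = build_corpus(sentences)
print(len(corpus), freq["i"])
```

Near-duplicate detection and co-occurrence counting, which the report also mentions, are deliberately left out of this sketch.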

SWE - a processing related language description

General properties of the Swedish language

- Native name: Svenska
- Classification: Indo-European, Germanic, North, East Scandinavian, Danish-Swedish, Swedish
- Total number of speakers: 8.4M
- Largest countries with number of speakers: Sweden (8.0M). Also spoken in parts of Finland, where it has equal legal standing with Finnish.
- Largely mutually intelligible with Norwegian and Danish.

Source: http://www.ethnologue.com/language/swe

Processing summary

- Latin alphabet with some additional characters
- The full stop is used as sentence boundary and for abbreviations
- Apostrophes are used rarely

Properties important for processing

Alphabet and punctuation

The alphabet is Latin based, with the following specialities (source: http://en.wikipedia.org/wiki/Swedish_alphabet):

- Swedish includes all 26 base letters and Å, Ä, Ö. In the alphabetic ordering, the letters Å, Ä, Ö follow Z at the end of the alphabet.
- Usual Latin punctuation
- Usage of uppercase letters: at sentence beginnings and for proper names (of persons, organisations, countries etc.).

Sentence segmentation and word tokenization

- Sentence beginnings: Sentences begin with a capitalized first word.
- Abbreviations: Abbreviations can be confused with sentence boundaries, so a special abbreviation list has to be inspected. Sources for abbreviations: ??? Abbreviations with a full stop may appear in the word list without the full stop.
- Apostrophes: The use of apostrophes is infrequent.
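The interaction between abbreviations and sentence boundaries can be sketched as follows. This is a simplified rule-based splitter, not the segmenter used for the corpora; the abbreviation set is a small hand-picked sample of common Swedish abbreviations (not the LCC abbreviation list), and the function name `split_sentences` is hypothetical.

```python
# Illustrative sample only; the abbreviation list used for the corpora is not given in the report.
ABBREVIATIONS = {"t.ex.", "bl.a.", "dvs.", "kl.", "ca."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at ., ! or ? when the next token starts with an uppercase letter.

    Tokens found in `abbreviations` never end a sentence.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        if token.lower() in abbreviations:
            continue  # full stop belongs to an abbreviation
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        # Sentences begin with a capitalized first word.
        if next_token is None or next_token[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Vi ses kl. 10 i morgon. Ta med t.ex. Anna och Erik!"
with_list = split_sentences(text)
without_list = split_sentences(text, abbreviations=set())
print(with_list)
print(without_list)
```

With the sample list the text yields two sentences. With an empty list, the full stop of "t.ex." before the capitalized name "Anna" is taken as a boundary and a third, faulty sentence appears, which is the kind of over-generation an incomplete abbreviation list causes.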

Sources and ranking (2012)

Estimated number of webpages containing text:

- Google.com, top-5 words: 337,000,000 results for "och" "i" "att" "som" "på"
- Google.com, top-10 words: 232,000,000 results for "och" "i" "att" "som" "på" "är" "en" "av" "för" "med"

Rankings:

- Rank according to number of speakers (Ethnologue): 86
- Rank according to Wikipedia size (see http://de.wikipedia.org/wiki/Wikipedia:Sprachen): rank 5 with 1,054,845 articles (2013-06-20).
- Rank according to number of newspapers as found by AbyZ (5/2012): 256 newspapers, rank 10.
- Rank according to number of newspapers with RSS feeds (5/2012): 122 newspapers, rank 13.
- Rank according to our corpus size (9/2012): 13

SWE corpora

Quality Overview

Quality Ratings

- A: Very good quality. Ready to use (or already used) for a frequency dictionary.
  - Size as large as possible
  - Only minimal errors
  - Multiple genres (if possible)
- A-: Small problems identified. They should not affect usage.
- B: Native speaker quality.
  - Information about abbreviations and sentence boundaries provided by a native speaker
  - Resulting statistics checked by a native speaker, possible errors corrected
- C: Non-native speaker quality.
  - Obvious problems shown in corpus statistics are corrected
- D: First version.
  - Pre-processing with default abbreviation list and default sentence boundaries
- E: Poor quality: old, outdated or faulty.

Corpus Quality

The quality of the corpora differs slightly because the corpus processing toolchain changed slightly over several years. Moreover, the original data are often no longer available. Hence, improving quality often means removing incomplete or doubtful sentences. Forthcoming editions of all corpora thus might have a slightly smaller number of sentences. This especially applies to near-duplicate sentences, which are removed only sparingly.

The following table shows the quality of the corpora. Minimal errors are still possible and are described in the sections below. All possible major improvements are mentioned here.

Corpus              Quality rating  Known problems                        To-dos
swe_news_2007       A               -                                     -
swe_news_2008       A               -                                     -
swe_news_2009       A-              Some duplicate sentences              -
swe_news_2010       A               -                                     -
swe_news_2011       A               -                                     -
swe_news_2012       A               -                                     -
swe_newscrawl_2011  A-              several near duplicate peaks          -
swe_newscrawl_2012  A               -                                     -
swe_web_2002        A-              max. 255 bytes instead of characters  -
swe_web_2011        A               -                                     -
swe_web_2012        A               -                                     -
swe_wikipedia_2007  A-              max. 255 bytes instead of characters  -
swe_wikipedia_2012  A               -                                     -
swe_mixed_2012      A               -                                     -

Processing Overview

For more details, see Appendix: Database Summary and Appendix: Number of sources by time period.

Corpus              Size (M sentences)  Size (M running words)  Multiwords  Crawling date                 Production date
swe_news_2007       2.6                 38                      0           mainly 2005 and 2007          2010
swe_news_2008       1.0                 15                      21846       daily 2008, 17% without date  2011
swe_news_2009       1.9                 22                      26049       daily 2009                    2011
swe_news_2010       0.9                 13                      20178       daily 2010                    2011
swe_news_2011       0.9                 15                      18397       daily 2011                    2012
swe_news_2012       0.8                 11                      18013       daily 2012                    2013
swe_newscrawl_2011  4.0                 66                      40410       04/2012                       2012
swe_newscrawl_2012  4.5                 68                      41079       04/2013                       2013
swe_web_2002        7.5                 107                     0           batch crawl 2002              2007
swe_web_2011        5.9                 90                      42161       12/2010-12/2011               2012
swe_web_2012        6.1                 90                      41968       1/2012-12/2012                2013
swe_wikipedia_2007  1.3                 21                      0           10/2007                       2010
swe_wikipedia_2012  2.2                 37                      103466      01/2012                       2012
swe_mixed_2012      33.9                498                     153459      see above                     2013

Content Overview

For more details, see Appendix: Size of different TLDs and Appendix: Size of largest domains.

Corpus              Type of sources  Countries                                 Number of sources  Publishing date        Biggest source
swe_news_2007       News             .se (93%), .fi (3%), .com (2%)            113                mainly 5/2007-12/2007  www.aftonbladet.se/
swe_news_2008       News             .se                                       48                 2008                   www.aftonbladet.se/
swe_news_2009       News             .se                                       66                 2009                   www.aftonbladet.se/
swe_news_2010       News             .se                                       55                 2010                   www.aftonbladet.se/
swe_news_2011       News             .se                                       12                 2011                   www.aftonbladet.se/
swe_news_2012       News             .se (95%), .ax (5%)                       39                 2012                   www.aftonbladet.se/
swe_newscrawl_2011  News             .se (80%), .com (18%)                     50                 2011 and before        www.webfinanser.com/
swe_newscrawl_2012  News             .se (82%), .fi (8%), .com (7%), .nu (2%)  106                2012 and before        www.sourze.se/
swe_web_2002        Web              .se                                       18839              2002 and before        www.genealogi.se/
swe_web_2011        Web              .se (88%), .com (5%), .fi (4%)            68214              2011 and before        www.rfsl.se/
swe_web_2012        Web              .se (86%), .com (7%), .fi (3%)            77855              2012 and before        www.omtv.se/
swe_wikipedia_2007  Wikipedia        -                                         1                  2007 and before        wikipedia.org
swe_wikipedia_2012  Wikipedia        -                                         1                  2012 and before        wikipedia.org
swe_mixed_2012      Mixed Sources    .se (80%), .com (6%), .fi (3%)            105215             2012 and before        www.aftonbladet.se/

Words

- Appendix: Words by Length without multiplicity shows a plot of the length distribution in which every distinct word is counted once. A smooth, asymmetric bell-shaped curve is expected.
- Appendix: Words by Length with multiplicity shows a plot of the length distribution in which every occurrence of a word is counted. A smooth, asymmetric bell-shaped curve is expected.
- Appendix: The Most Frequent 50 Words shows the most frequent stopwords as well as one or more words related to the region.
- Appendix: Longest Words in Top-1000 by rank shows the 25 longest words within the top-1000. They usually give an impression of the main topics treated in the corpus.
- Appendix: Longest Words with minimum frequency 2 should give an idea of very long words. In the case of processing problems, different types of non-words may appear. This might help to improve the word definition.
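The difference between the two word length distributions can be made concrete with a short Python sketch. The function name and the toy token list are assumptions for illustration; the appendices are computed from the full corpora.

```python
from collections import Counter


def word_length_distributions(tokens):
    """Return (without_multiplicity, with_multiplicity) word length counts."""
    # With multiplicity: every running word contributes its length.
    with_mult = Counter(len(t) for t in tokens)
    # Without multiplicity: every distinct word contributes its length once.
    without_mult = Counter(len(t) for t in set(tokens))
    return without_mult, with_mult


tokens = "och det var en gång en katt och en hund".split()
without_mult, with_mult = word_length_distributions(tokens)
for length in sorted(with_mult):
    print(length, without_mult[length], with_mult[length])
```

Short, frequent stopwords such as "och" and "en" raise the counts for short lengths in the distribution with multiplicity, while the distribution without multiplicity counts each of them once. Lengths are measured in characters, so "gång" has length 4.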

Corpus              Word length graph without multiplicity  Word length graph with multiplicity  Most Frequent 50 Words  Longest Words in Top-1000       Longest Words with minimum frequency 2
swe_news_2007       okay                                    okay                                 okay                    okay                            URLs, missing blanks
swe_news_2008       okay                                    okay                                 okay                    okay                            missing blanks
swe_news_2009       okay                                    okay                                 okay                    okay                            missing blanks, routes
swe_news_2010       okay                                    okay                                 okay                    okay                            missing blanks, junk
swe_news_2011       okay                                    okay                                 okay                    Rank 636: 71000@aftonbladet.se  URLs, missing blanks
swe_news_2012       okay                                    okay                                 okay                    okay                            okay
swe_newscrawl_2011  okay                                    okay                                 okay                    okay                            missing blanks, routes, junk, URLs
swe_newscrawl_2012  okay                                    okay                                 okay                    okay                            URLs, missing blanks, junk, etc.
swe_web_2002        okay                                    okay                                 okay                    okay                            URLs, missing blanks, chemicals
swe_web_2011        okay                                    okay                                 okay                    okay                            routes, URLs, missing blanks, junk
swe_web_2012        okay                                    okay                                 okay                    okay                            routes, missing blanks, URLs, junk
swe_wikipedia_2007  okay                                    okay                                 okay                    Rank 971: RobotQuistnix         routes, URLs
swe_wikipedia_2012  okay                                    okay                                 okay                    okay                            URLs
swe_mixed_2012      okay                                    okay                                 okay                    okay                            all of the above

Abbreviations

A full stop that belongs to a known abbreviation is usually not treated as a sentence boundary. Conversely, abbreviations missing from the list can cause sentence boundaries to be overgenerated. Due to limitations in the processing chain, the list of abbreviations used for sentence boundary detection can differ from the abbreviations in the word list. Appendix: Most Frequent Abbreviations shows possible under-generation of sentence boundaries caused by wrong abbreviations (i.e. words ending in a full stop) in the word list.

Sentences

Appendix: Shortest sentences shows the shortest declarative, exclamatory and interrogative sentences. In preprocessing, a minimal length for sentences might be specified. Missing abbreviations are often visible as faulty sentence endings.

Appendix: Longest sentences shows the longest declarative, exclamatory and interrogative sentences. Usually, the maximum sentence length is defined as 256 characters (not 256 bytes). Very long exclamatory or interrogative sentences often contain an overlooked sentence boundary.

Appendix: Length of sentences in characters shows the distribution of the sentence length. A large and balanced corpus will result in a smooth, bell-shaped curve. Isolated local maxima usually result from large sets of near-duplicate sentences.
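The difference between a character limit and a byte limit matters for Swedish because å, ä and ö each take two bytes in UTF-8. The following sketch illustrates this under the assumption that sentences are stored as UTF-8; the helper names and the sample sentence are hypothetical and do not come from the report's tooling.

```python
def length_in_bytes(sentence, encoding="utf-8"):
    """Length of the encoded sentence in bytes (assumes UTF-8 storage)."""
    return len(sentence.encode(encoding))

def truncate_to_bytes(sentence, limit=255, encoding="utf-8"):
    """Cut after `limit` bytes, dropping a character that the cut would split."""
    return sentence.encode(encoding)[:limit].decode(encoding, errors="ignore")

# Hypothetical sample: 240 characters, which is within a 256-character limit.
sentence = "Räksmörgås är gott. " * 12
print(len(sentence))               # length in characters
print(length_in_bytes(sentence))   # length in bytes is larger
truncated = truncate_to_bytes(sentence)
print(len(truncated))              # characters that survive a 255-byte cut
```

A sentence that fits a character-based limit can therefore still be cut short by a byte-based one, which is one plausible way a byte limit would show up as an irregularity among the longest sentences.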

Corpus              Shortest sentences        Longest sentences                     Length distribution (in characters)   Length distribution (in words)
swe_news_2007       okay                      max. 255 bytes instead of characters  okay                                  okay
swe_news_2008       okay                      okay                                  okay                                  okay
swe_news_2009       some duplicate sentences  okay                                  okay                                  okay
swe_news_2010       okay                      okay                                  okay                                  okay
swe_news_2011       okay                      okay                                  okay                                  okay
swe_news_2012       okay                      okay                                  okay                                  okay
swe_newscrawl_2011  okay                      okay                                  several near-duplicate peaks          okay
swe_newscrawl_2012  okay                      okay                                  okay                                  okay
swe_web_2002        okay                      okay                                  max. 255 bytes instead of characters  okay
swe_web_2011        okay                      okay                                  okay                                  okay
swe_web_2012        okay                      okay                                  okay                                  okay
swe_wikipedia_2007  okay                      okay                                  max. 255 bytes instead of characters  okay
swe_wikipedia_2012  okay                      okay                                  okay                                  okay
swe_mixed_2012      okay                      okay                                  okay                                  okay

Oddities

Appendix: Sentences with high average word length: Average sentences contain many stopwords, and these stopwords are usually short. Hence, they limit the average word length in a sentence. Conversely, sentences with a high average word length are often ill-formed. They may be used to improve pre-processing.

Appendix: Problems with sentence segmentation - Words ending in a stopword: If there are many ill-formed word or sentence boundaries without a blank between two words, they will generate new ill-formed words. The appendix shows the most frequent words ending in an uppercase stopword. If they are infrequent, the data are of high quality.
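Both oddity checks can be sketched in a few lines. This is only an illustration under stated assumptions: tokens are split on whitespace, an "uppercase stopword" is taken to be a stopword with a capitalised first letter, and the maxfreq value is read as the frequency of the most frequent such word. The function names, the stopword list and the sample tokens are hypothetical.

```python
from collections import Counter

def average_word_length(sentence):
    """Mean token length of a whitespace-tokenised sentence."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def words_ending_in_stopword(tokens, stopwords):
    """Count tokens that end in a capitalised stopword glued to preceding text."""
    capitalised = [s.capitalize() for s in stopwords]
    hits = Counter()
    for token in tokens:
        for stop in capitalised:
            # The token must be longer than the stopword itself.
            if token.endswith(stop) and len(token) > len(stop):
                hits[token] += 1
                break
    return hits

# Hypothetical data: "igårDet" and "hemOch" lack a blank before the stopword.
tokens = ["igårDet", "Det", "bilen", "hemOch", "igårDet"]
hits = words_ending_in_stopword(tokens, ["det", "och"])
maxfreq = max(hits.values()) if hits else 0
print(hits.most_common(), maxfreq)
print(average_word_length("en bil"))
```

Sorting sentences by `average_word_length` in descending order surfaces candidates such as URLs or runs of words without blanks, while a low `maxfreq` suggests that missing blanks are rare.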
Corpus              Sentences with high average word length  Words ending in a stopword
swe_news_2007       missing blanks                           maxfreq=48
swe_news_2008       routes, proper names                     okay, maxfreq=8
swe_news_2009       okay                                     okay, maxfreq=11
swe_news_2010       URLs, missing blanks, routes             maxfreq=19
swe_news_2011       okay                                     maxfreq=17
swe_news_2012       okay                                     okay, maxfreq=4
swe_newscrawl_2011  URLs, missing blanks, junk               maxfreq=203
swe_newscrawl_2012  missing blanks, junk                     maxfreq=94
swe_web_2002        URLs, junk, special characters           maxfreq=33
swe_web_2011        URLs, missing blanks, routes, junk       maxfreq=32
swe_web_2012        URLs, missing blanks, routes, junk       okay, maxfreq=12
swe_wikipedia_2007  URLs, chemicals, routes                  okay
swe_wikipedia_2012  URLs, Japanese, routes                   okay
swe_mixed_2012      as above                                 maxfreq=203

SWE corpus comparison

Automated Corpus comparison

The comparisons below are based on the top-1000 words. For each analysed language, a vector is built from the frequencies of its top-1000 words, and the cosine of the angle between two such vectors is computed. The result is reported as a distance: identical languages receive a value of 0, distinct languages a value of 1. The same analysis is conducted using the frequencies of the top-1000 typical letter trigrams of the languages.

Monolingual word list comparison (top-1000 words)

As one can expect, the comparisons show:

- The different news corpora have different word lists, with a maximum distance of 0.23 (swe_newscrawl_2011 and swe_news_2011).
- The Wikipedia corpora are similar, with a maximum distance of 0.09.
- The web corpora have a maximum distance of 0.18 (swe_web_2002 and swe_web_2012).
- The mixed corpus swe_mixed_2012 holds a central position, with a maximum distance of 0.32 to the other corpora.

Multilingual word list comparison (top-1000 words)

Both the comparison of the top-1000 words and the comparison of the letter trigrams used in these words show that there are similar languages in our data, mainly members of the North Germanic family. The distance of the mixed corpus to the nearest language, Danish, is 0.47 for the words and 0.54 for the letter trigrams. Both distances are below average. The average value for the most similar language is 0.58 for trigrams.
The most similar languages based on words: Danish, Norwegian (Bokmål), Norwegian (Nynorsk)

+--------+---------------------+--------------------------+-------------+
| source | language_short_name | language_name            | cos_logfreq |
+--------+---------------------+--------------------------+-------------+
| swe    | dan                 | Danish                   |    0.469093 |
| swe    | nob                 | Norwegian, Bokmål        |    0.694608 |
| swe    | nno                 | Norwegian, Nynorsk       |    0.708683 |
| swe    | loy                 | Loke                     |    0.888261 |
| swe    | cat                 | Catalan-Valencian-Balear |    0.892844 |
+--------+---------------------+--------------------------+-------------+

The most similar languages based on letter trigrams: Danish, Norwegian (Bokmål), Norwegian (Nynorsk)

+--------+---------------------+--------------------+-------------+
| source | language_short_name | language_name      | cos_logfreq |
+--------+---------------------+--------------------+-------------+
| swe    | dan                 | Danish             |    0.544641 |
| swe    | nob                 | Norwegian, Bokmål  |    0.605218 |
| swe    | nno                 | Norwegian, Nynorsk |    0.610377 |
| swe    | eng                 | English            |    0.707715 |
| swe    | nld                 | Dutch              |    0.710454 |
+--------+---------------------+--------------------+-------------+
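The comparison values can be approximated with a short sketch. It rests on assumptions that the report does not confirm: the reported value is taken to be one minus the cosine similarity (so that identical lists yield 0), and the column name cos_logfreq is read as a hint that logarithmic frequencies are used, here as log(1 + f). The function name and the toy frequency lists are hypothetical.

```python
import math

def cosine_distance(freq_a, freq_b):
    """1 - cosine similarity of two log-frequency vectors.

    freq_a, freq_b: dicts mapping an item (word or letter trigram) to its
    frequency. Items missing from one list count as frequency 0 there.
    """
    vocab = set(freq_a) | set(freq_b)
    a = [math.log(1 + freq_a.get(item, 0)) for item in vocab]
    b = [math.log(1 + freq_b.get(item, 0)) for item in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical toy lists standing in for top-1000 word lists.
swedish = {"och": 900, "i": 800, "att": 700, "det": 600}
danish = {"og": 900, "i": 850, "at": 700, "det": 650}
finnish = {"ja": 900, "on": 800, "ei": 700, "että": 600}

print(cosine_distance(swedish, swedish))  # identical lists: close to 0
print(cosine_distance(swedish, danish))   # some shared words: between 0 and 1
print(cosine_distance(swedish, finnish))  # no shared words: 1.0
```

The same function applies unchanged to letter trigram frequencies, since it only depends on the two frequency dictionaries.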

Processing details

Appendix to swe_news_2007: Database summary

Values for some general parameters

Parameter                                Value
Number of sentences                      2610317
Number of running word forms             37720016
Number of distinct word forms            977940
Number of multiwords                     0
Percentage of words with frequency=1     58.2179
Number of sentence based co-occurrences  5967570
Number of neighbour co-occurrences       918800

Appendix to swe_news_2008: Database summary

Values for some general parameters

Parameter                                Value
Number of sentences                      1019572
Number of running word forms             15387045
Number of distinct word forms            497335
Number of multiwords                     21846
Percentage of words with frequency=1     54.4257
Number of sentence based co-occurrences  2838010
Number of neighbour co-occurrences       436791
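As an illustration of the "Percentage of words with frequency=1" parameter, the sketch below computes the share of distinct word forms that occur exactly once. It assumes the percentage is taken relative to the number of distinct word forms rather than running word forms, which the summary does not state explicitly; the function name and the sample tokens are hypothetical.

```python
from collections import Counter

def hapax_percentage(tokens):
    """Percentage of distinct word forms that occur exactly once."""
    freq = Counter(tokens)
    if not freq:
        return 0.0
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return 100.0 * hapaxes / len(freq)

# Hypothetical sample: 6 distinct word forms, 3 of which occur once.
tokens = "det är en bil och det är en båt".split()
print(hapax_percentage(tokens))
```

Values well above one half, as in the two summaries above, are common for word lists of this kind, because most distinct forms in a large corpus are rare names, compounds, or typos.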