Revisited. Robert Koch. Norbert Blum. Informatik IV, Universitat Bonn. Romerstr. 164, D Bonn, Germany. Abstract

Greibach Normal Form Transformation, Revisited Robert Koch Norbert Blum Informatik IV, Universitat Bonn Romerstr. 164, D-53117 Bonn, Germany email: blum@cs.uni-bonn.de Abstract We develop a direct method for placing a given context-free grammar into Greibach normal form with only polynomial increase of its size; i.e., we don't use any algebraic concept like formal power series. Starting with a cfg G in Chomsky normal form, we will use standard methods for the construction of an equivalent context-free grammar from a nite automaton and vice versa for transformation of G into an equivalent cfg G 0 in Greibach normal form. The size of G 0 will be O(jGj 3 ), where jgj is the size of G. Moreover, we show that it would be more ecient to apply the algorithm to a context-free grammar in canonical two form, obtaining a context-free grammar where, up to chain rules, the productions fulll the Greibach normal form properties, and then to use the standard method for chain rule elimination for the transformation of this grammar into Greibach normal form. The size of the constructed grammar is O(jGj 4 ) instead of O(jGj 6 ), which we would obtain if we transform G into Chomsky normal form and then into Greibach normal form.

1 Introduction and denitions We assume that the reader is familiar with the elementary theory of nite automata and context-free grammars as written in standard text books, e.g. [1, 4, 6, 9]. First, we will review the notations used in the subsequence. A context-free grammar (cfg) G is a 4-tuple (V; ; P; S) where V is a nite, nonempty set of symbols called the total vocabular, V a nite set of terminal symbols, N = V n the set of nonterminal symbols (or variables), P a nite set of rules ( or productions), and S 2 N is the start sympol. The productions are of the form A!, where A 2 N and 2 V. is called alternative of A. L(G) denotes the context-free language generated by G. The size jgj of the cfg G is dened by jgj = X A!2P lg(a); where lg(a) is the length of the string A. Two context-free grammars G and G 0 are equivalent if both grammars generate the same language; i.e., L(G) = L(G 0 ). Let " denote the empty word. A production A! " is called "-rule. A production A! B with B 2 N is called chain rule. A leftmost derivation is a derivation where, at every step, the variable replaced has no variable to its left in the sentential form from which the replacement is made. A cfg G = (V; ; P; S) is in canonical two form if each production is of the form i) A! BC with B; C 2 N n fsg, ii) A! B with B 2 N n fsg, iii) A! a with a 2, or iv) S! ". A cfg G is in Chomsky normal form if G is in canonical two form and G contains no chain rule. A cfg G = (V; ; P; S) is in extended Greibach normal form if each production is of the form 1

i) A! ab 1 B 2 : : : B t with B 1 ; B 2 ; : : : B t 2 N n fsg; a 2, ii) A! B with B 2 N n S, iii) A! a with a 2, or iv) S! ". A cfg G is in extended 2-standard form if G is in extended Greibach normal form and with respect to rules of kind (i), always t 2. A cfg G = (V; ; P; S) is in Greibach normal form (Gnf) if G is in extended Greibach normal form and G contains no chain rule. G is in 2- standard form (2-Gnf) if G is in extended 2-standard form and G contains no chain rule. A nondeterministic nite automaton M is a 5-tuple (Q; ; ; q 0 ; q f ), where Q is a nite set of states, is a nite, nonempty set of input symbols, is a transition function mapping Q to 2 Q, q 0 2 Q is the initial state, and q f 2 Q is the nite state. Given an arbritrary cfg G = (V; ; P; S), it is well known that G can be transformed into an equivalent cfg G 0 which is in Gnf [3, 4, 6, 9]. But the usual algorithms possibly construct a cfg G 0, where the size of G 0 is exponential in the size of G (see [4], pp. 113{115 for an example). Given a cfg G without "- rules and without chain rules, Rosenkrantz [8] has given an algorithm which produces an equivalent cfg G 0 in Gnf such that jg 0 j = O(jGj 3 ). Rosenkrantz gave no analysis of the size of G 0. For an analysis, see [4], pp. 129{130 or [7]. Rosenkrantz's algorithm uses formal power series. We will develop a direct method for placing a given cfg into Gnf with only polynomial increase of its size; i.e., we don't use any algebraic concept like formal power series. Given any cfg G, in a rst step the grammar will be transformed into an equivalent cfg G 0 in Cnf. Then, we will use standard methods for the construction of an equivalent cfg from an nfa and vice versa for transformation of G 0 into an equivalent cfg G 00 in 2-Gnf. The size of G 00 will be O(jG 0 j 3 ). Moreover, we show that it would be more ecient to apply the algorithm to a cfg in canonical two form, obtaining a cfg in extended 2-standard form and then to use the standard method for chain rule elimination for the transformation of the grammar into 2-Gnf. The size of the constructed grammar is O(jGj 4 ) instead of O(jGj 6 ), which we would obtain if we transform G into Cnf and then into Gnf. 2

2 The method Let G = (V; ; P; S) be an arbitrary cfg in Cnf. Productions of type A! a with a 2 already fulll the Greibach normal form properties. Our goal is now to replace the productions of type A! BC, B; C 2 N n fsg by productions which fulll the Greibach normal form properties. Let B 2 N n fsg. We have an interest in leftmost derivations of the form B ) a or B ) lm B 0B 1 : : : B t ) ab 1 B 2 : : : B t ; where a 2 and B 0 ; B 1 ; : : : ; B t 2 N n fsg. Up to the last replacement, only alternatives from (N n fsg) 2 are chosen. The last replacement chooses for B 0 an alternative in. Such a leftmost derivation is called terminal leftmost derivation and is denoted by B ) tlm a and B ) tlm ab 1B 2 : : : B t ; respectively: Let L B = fa 2 (N n fsg) j B ) tlm G B = (V B ; V; P B ; S B ) such that a) L(G B ) = L B, and ag. Our goal is to construct a cfg b) each alternative of a variable begins with a symbol in. N B denotes the set of nonterminals of G B ; i.e. N B = V B n V. Assume that for B; C 2 N n fsg, B 6= C, the grammars G B and G C are constructed such that i) N B \ N C = ;, or ii) each production with a variable from N B \ N C on the left side is contained in both set of rules P B and P C. If we replace each production A! BC 2 P by the set fa! ac j a is an alternative of S B in G B g; then we obtain a cfg G 0 = (V 0 ; ; P 0 ; S) in Gnf, where V 0 = P 0 = V [ fn B j B 2 N n fsgg; and fa! a 2 P j a 2 g [fa S! ac j A! BC 2 P and a is an alternative of S B in G B g [ B2NnfSg P B n fs B! j 2 N g B 3

The conditions (i) and (ii) above ensure that in a derivation in G 0 no illegal changes between two grammars G B and G C are possible. It remains the construction of the cfg G B = (V B ; V; P B ; S B ), B 2 N n fsg. The central observation is that L B is a regular set. Let i.e., L R B is L B reversed. L R B = fb t B t 1 : : : B 1 a j ab 1 B 2 : : : B t 2 L B g; First, we will construct a nfa M B = (Q; V; ; B B ; S B ) with L(M B ) = L R B. Using a standard method (see [4], pp. 55{56), it is easy to construct from M B a nfa M 0 B with L(M 0 B) = L B. For doing this, S B will be the initial state, B B will be the unique nite state, and we turn around the transitions of M B. Then, we will use for each B 2 N n fsg the nfa M 0 B for the construction of a cfg G 0 B = (V B ; V; P 0 B ; S B) which fullls 1. L(G 0 B) = L B, 2. G 0 B is in Gnf. (Note that V is the terminal alphabet of G 0 B.) 3. 2 N B for each production S B! 2 P 0 B. 4. The right side of every other production begins with a variable of G; i.e., a symbol in N n fsg. 5. N B \ N C = ; for all B; C 2 N n fsg, B 6= C. Now, G B will be constructed from G 0 B by the replacement of every rule of the form D B! EC B and D B! E, respectively in P 0 B by a set of productions, analogously to the construction of P 0 from P. Altogether, we obtain the following method for the construction of the cfg G B = (V B ; V; P B ; S B ). (1) Dene M B = (Q; V; ; B B ; S B ) by Q = (C B ; E) = (C B ; a) = fa B j A 2 Ng [ fs B g; and ( fd B j C! DE 2 P g fsb g if C! a 2 P ; otherwise for all C B 2 Q, E 2 N n fsg, a 2. 4

L(M B ) = L R B can easily be proven by induction on the length of the terminal left derivation and on the length of the computation of the nfa, respectively. (2) Dene M 0 B = (Q; V; 0 ; S B ; B B ) by 0 (S B ; a) = 0 (D B ; E) = fc B j fs B g = (C B ; a)g; and fc B j D B 2 (C B ; E)g for all D B 2 Q, E 2 N n fsg and a 2. Also L(M 0 B ) = L B can easily be proven by induction. Using the standard method for the construction of an equivalent cfg from a given nfa (see [4], pp. 61{62), we obtain the cfg G 0 B. (3) Dene G 0 B = (V B; V; P 0 B ; S B) by V B = fa B j A 2 N n fsgg [ fs B g [ V; and P 0 B = fs B! ac B j C B 2 0 (S B ; a) and (C B 6= B B or 0 (fb B g N n fsg 6= ;)g [fs B! a j B B 2 0 (S B ; a)g [fd B! EC B j C B 2 0 (D B ; E) and (C B 6= B B or 0 (fb B g N n fsg 6= ;)g [fd B! E j B B 2 0 (D B ; E)g for all D B 2 V 0 B, E 2 N n fsg, a 2. By inspection, Properties 1 { 5 can easily be proven. It is easy to show that L(G 0 B) = L B. For obtaining only productions fullling the Greibach normal properties with respect to the terminal alphabet, simultaneously to all G 0 B, B 2 N nfsg, we apply the trick used for the construction of P 0 from P again. Note that for all grammars G 0 E, E 2 N n fsg, the alternatives of the start symbol S E fulll the Greibach normal form properties with respect to the terminal alphabet. (4) For each B 2 N n fsg we dene G B = (V B ; ; P B ; S B ) by P B = fs B! ac B j S B! ac B 2 P 0 Bg fs B! a j S B! a 2 P 0 Bg [ S DB!ECB2P 0 B fd B! C B j is an alternative of S E in G 0 Eg [ S DB!E2P 0 B fd B! j is an alternative of S E in G 0 Eg 5

Note that L(G B ) 6= L B, since the same trick was applied twice and hence, the terminal alphabet of G B is and not V. We have to add to P B all productions in G E with left side 6= S E. Since we use for each E 2 N n fsg the grammar G E for the construction of the cfg G 0 = (V 0 ; ; P 0 ; S) and all these productions are added to P 0, we do not have to add these productions explicitely to P B. For the same reasons, we have not to extend N B explicitely. By construction, the cfg G B fullls the Greibach normal form properties with respect to the terminal alphabet for all B 2 N n fsg. Furthermore, for all B; C 2 N n fsg, B 6= C the grammar G B and G C are constructed, such that i) N B \ N C = ;, or ii) each production with a variable from N B \ N C on the left side is contained implicitely in both set of rules P B and P C. Since L(G 0 B ) = L B for all B 2 NnfSg, L(G 0 ) = L(G) follows immediately. Furthermore, G 0 is in Gnf. Moreover, since every alternative of the start symbol S B of the grammars G 0 B and G B are of the form = ae, a 2 ; E 2 N n fsg, the grammar G 0 is in 2-Gnf. Let us analyze the size of G 0 in dependence of the size of G. Since each transition of M B and hence, each transition of M 0 B corresponds to a rule of P and vice versa, it is easy to see that jg 0 Bj 5jGj for all B 2 N n fsg. Constructing G B from G 0 B, every production D B! EC B and D B! E, respectively of P 0 B is replaced by at most jg 0 Ej rules. Hence, jg B j jg 0 BjjGj = O(jGj 2 ). Similary, each production of type A! BC of P is replaced by at most jgj productions. Note that for all B 2 N n fsg, the number of alternatives of the start symbol S B of G B is bounded by jgj. Altogether, we have obtained X jg 0 j 5jGj 2 + jg B j = O(jGj 3 ): B2NnfSg Hence, we have proven the following theorem. Theorem 1 Let G = (V; ; P; S) be a cfg in Cnf. Then there is a cfg G 0 = (V 0 ; ; P 0 ; S) in 2-Gnf such that L(G 0 ) = L(G) and jg 0 j = O(jGj 3 ). 6

Given an arbitrary cfg G = (V; ; P; S), the usual algorithm for the transformation of G into Cnf can square the size of the grammar (see [4], pp. 102 for an example). No better algorithm is known. This observation leads directly to the following corollary. Corollary 1 Let G = (V; ; P; S) be a cfg. Then there exists a cfg G 0 = (V 0 ; ; P 0 ; S) in 2-Gnf such that L(G 0 ) = L(G) and jg 0 j = O(jGj 6 ). In [2], it is shown that for all > 0 and suciently large n there is a contextfree language CL n with the following properties: a) CL n has a cfg of size O(n). b) Each chain rule free cfg for CL n has size O( 3 n 3=2 ). 3 Improving the size of the Gnf grammar Harrison and Yehudai [5] have developed an algorithm which eliminates for a given cfg G = (V; ; P; S) the "-rules in linear time. Moreover, the constructed cfg G 0 = (V 0 ; ; P 0 ; S) is in canonical two form and jg 0 j 12jGj. For obtaining an equivalent cfg G 00 in Cnf from G 0, it would suce to eliminate the chain rules from G 0. Hence, the expensive part of the transformation of an arbitrary cfg G = (V; ; P; S) into Cnf is the elimination of the chain rules. The standard method for chain rule elimination computes for all A 2 N the set W (A) = fb 2 N j A ) Bg and deletes the chain rules from P and adds the set P (A) = fa! j 2 V + n N and 9B 2 W (A) : B! 2 P g of productions to P. Note that jp (A)j jgj. Hence, jg 00 j = O(n 2 ). Given an arbitrary cfg G = (V; ; P; S), our goal is now to construct an equivalent cfg G 0 in Gnf such that jg 0 j = O(jGj 4 ). Since the transformation of G into canonical two form enlarges the size of G at most by the factor 12, we can assume that G is already in canonical two form. First, we will show that an appropriate extension of our algorithm will produce an equivalent cfg G in extended 2-standard form. Then we will show that applying the 7

standard method for chain rule elimination produces a cfg G 0 in 2-standard form with L(G 0 ) = L(G) and jg 0 j = O(jGj 4 ). Let G = (V; ; P; S) be a cfg in canonical two form. Then the chain rules already fulll the properties for extended 2-standard form. Hence, the grammar G = (V 0 ; ; P ; S), where V 0 = P = V [ fn B j B 2 N n fsgg; and fa! a 2 P j a 2 g fa! B 2 P j B 2 Ng [fa S! ac j A! BC 2 P and a is an alternative of S B in G B g [ B2NnfSg P B n fs B! j 2 N g B would be in extended 2-standard form, assumed that the grammars G B, B 2 N n fsg are constructed correctly. It remains to discuss, how to treat the chain rules during the construction of G B. For doing this, we will extend the four steps of the algorithm in an appropriate manner. (1) We add the following transitions to. for all C B 2 Q. (C B ; ") = fd B j C! D 2 P g (2) For the correct denition of M 0 B, we have to add the following transitions to 0. 0 (D B ; ") = fc B j D B 2 (C B ; ")g for all D B 2 Q. (3) The standard algorithm for the construction of a cfg from a given nfa add the following extra rules to P 0 B. fd B! C B j C B 2 0 (D B ; ") and (C B 6= B B or 0 (fb B gnnfsg) 6= ;)g for all D B 2 V 0 B. Step 4 has not to be extended. For the elimination of the chain rules A! B from P, we add the set fa! a j 9B 2 W (A) and a is an alternative of S B in G B g 8

of productions to P 0. Since jw (A)j jnj for all A 2 N and since every start symbol S B of a grammar G B has at most jgj alternatives, this enlarges the size of G at most by an amount of O(jGj 2 ). Now, we apply the standard method for chain rule elimination to G B for all B 2 N n fsg. Note that for all C B 2 N B, jw (C B )j jn B j and hence, jw (C B )j jgj. Furthermore, the number of distinct alternatives of variables in V B is bounded by O(jGj 2 ). Hence, the standard method for chain rule elimination enlarges the size of G B at most by a factor jgj. Hence, jg B j = O(jGj 3 ). Altogether, we have obtained the following theorem. Theorem 2 Let G = (V; ; P; S) be a cfg. Then there exits a cfg G 0 = (V 0 ; ; P 0 ; S) in 2-Gnf such that L(G 0 ) = L(G) and jg 0 j = O(jGj 4 ). Remark: Given an "-rule free cfg G = (V; ; P; S), the transformation of G into canonical two form can enlarge the number of production considerably. If one does not wish to enlarge the number of productions in such a way, one can transform G directly into extended Greibach normal form. Let A! B be any production of G. We consider as one symbol and built in Steps 1 and 2 nfa's M B = (Q; V ext ; ; B B ; S B ) and M 0 = B (Q; V ext; 0 ; S B ; B B ), respectively, where V ext = V [ f j 9E 2 V with E is an alternative of an variableg. Then we construct the cfg G 0 B from MB. 0 During Step 4, will be considered as an element of V. Acknowledgment: We thank Claus Rick for helpful comments. References [1] A. V. Aho, and J. D. Ullman, The Theory of Parsing, Translation, and Compiling, Vol. I: Parsing, Prentice-Hall (1972). [2] N. Blum, More on the power of chain rules in context-free grammars, TCS 27 (1983), 287{295. [3] S. A. Greibach, A new normal-form theorem for context-free, phrasestructure grammars, JACM 12 (1965), 42{52. [4] M. A. Harrison, Introduction to Formal Language Theory, Addison- Wesley (1978). 9

[5] M. A. Harrison, and A. Yehudai, Eliminating null rules in linear time, The Computer Journal 24 (1981), 156{161. [6] J. E. Hopcroft, and J. D. Ullman, Introduction to Autmata Theory, Languages, and Computation, Addison-Wesley (1979). [7] A. Kelemenova, Complexity of normal form grammars, TCS 28 (1984), 299{314. [8] D. J. Rosenkrantz, Matrix equations and normal forms for context-free grammers, JACM 14 (1967), 501{507. [9] D. Wood, Theory of Computation, Harper & Row (1987). 10