Hidden Markov Model Definition: a hidden Markov model is a tuple (V, X, {T_k}, π) in which
X is an output alphabet,
V is a finite set of states,
{T_k} = {T_k : k ∈ X} are transition matrices,
each T_k is a |V| x |V| matrix with T_k[i,j] ∈ [0,1] for all i, j, k and Σ_j Σ_k T_k[i,j] = 1 for every i,
π is a row vector with π_i ∈ [0,1], Σ_i π_i = 1, and π = Σ_k π T_k (π is stationary).
Shorthand: λ = (π, {T_k})
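To make the notation concrete, here is a minimal sketch, assuming Python with NumPy, of one such model stored as plain arrays; the matrices are those of the Even Process that reappears in the Mathematica example later, and the names π, T, and V are the ones the pseudocode below uses.

import numpy as np

# Even Process: two states, binary alphabet X = {0, 1}
# T[k][i, j] = Pr(emit symbol k and move from state i to state j)
T = {
    0: np.array([[0.5, 0.0],
                 [0.0, 0.0]]),
    1: np.array([[0.0, 0.5],
                 [1.0, 0.0]]),
}
V = 2                                # number of states
π = np.array([2/3, 1/3])             # stationary distribution

# sanity checks against the definition
assert np.allclose((T[0] + T[1]).sum(axis=1), 1.0)   # Σ_j Σ_k T_k[i,j] = 1
assert np.allclose(π @ (T[0] + T[1]), π)             # π = Σ_k π T_k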
Word Probabilities
Notation and Definition. Alphabet: X = {0, 1, 2, ..., N-1} with |X| = N. States: V = {0, 1, 2, ..., V-1} with |V| = V (we reuse V for the number of states). Word: w = w_0 w_1 ... w_{L-1} with |w| = L. By definition,
Pr(w) = π T^(w_0) T^(w_1) ... T^(w_{L-1}) 1
where 1 is a column vector of ones. For example, for a word of length 3,
Pr(w_0 w_1 w_2) = Σ_i Σ_j Σ_k Σ_l π_i T^(w_0)_{ij} T^(w_1)_{jk} T^(w_2)_{kl}
How do we compute Pr(w) efficiently?
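Written as code, the defining formula is just a chain of matrix products. A minimal sketch, assuming the arrays from the earlier Even Process example; the next slides contrast a term-by-term expansion of this sum (brute force) with the forward algorithm, which organizes the work exactly like this product.

def wordprob_matrix(w, π, T):
    # Pr(w) = π T^(w_0) T^(w_1) ... T^(w_{L-1}) 1
    v = π.copy()              # running row vector
    for symbol in w:
        v = v @ T[symbol]     # multiply by the transition matrix for this symbol
    return v.sum()            # dot with the all-ones column vector

# e.g. wordprob_matrix([0, 1, 1], π, T)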
Brute Force Algorithm. Method: add each term in the summation.

def wordprob(w):
    L = len(w)
    s = [0] * (L + 1)    # L+1 zeros, these are state indices
    prob = 0
    keepgoing = True
    while keepgoing:
        term = π[s[0]]
        for i in range(L):
            term = term * T[w[i]][s[i], s[i+1]]
        prob += term
        # increment the indices in lexicographic order
        keepgoing = incrementindices(s)
    return prob

This algorithm is O(L V^L): there are |V|^(L+1) state sequences and each term costs L multiplications. For any reasonable L, the algorithm is too slow, and it calculates the same quantities repeatedly. For a length-2 word over a 2-state model,
Pr(w_0 w_1) = π_0 T^(w_0)_{00} T^(w_1)_{00} + π_0 T^(w_0)_{00} T^(w_1)_{01} + π_0 T^(w_0)_{01} T^(w_1)_{10} + π_0 T^(w_0)_{01} T^(w_1)_{11} + π_1 T^(w_0)_{10} T^(w_1)_{00} + π_1 T^(w_0)_{10} T^(w_1)_{01} + π_1 T^(w_0)_{11} T^(w_1)_{10} + π_1 T^(w_0)_{11} T^(w_1)_{11}
For example, the quantity π_0 T^(w_0)_{00} is calculated twice, once in each of the first two terms.
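The pseudocode leans on an incrementindices helper that the slides never define. One hedged way to write it (the name and the default of two states are assumptions of this sketch) is to treat the index list as a base-|V| counter and report False once it wraps back to all zeros.

def incrementindices(s, V=2):
    # step s through all |V|^len(s) assignments in lexicographic order
    for pos in range(len(s) - 1, -1, -1):
        s[pos] += 1
        if s[pos] < V:
            return True
        s[pos] = 0
    return False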
Forward Probabilities. Definition: α_t(j) = Pr(seeing w_0 ... w_t and ending up in state j). For example, let w = w_0 w_1 and N = V = 2. Then,
α_0(0) = Pr(seeing w_0 and ending up in state 0) = π_0 T^(w_0)_{00} + π_1 T^(w_0)_{10}
α_0(1) = Pr(seeing w_0 and ending up in state 1) = π_0 T^(w_0)_{01} + π_1 T^(w_0)_{11}
Notice, Pr(w_0) = α_0(0) + α_0(1).
Forward Probabilities. Definition: α_t(j) = Pr(seeing w_0 ... w_t and ending up in state j). Continuing with w = w_0 w_1 and N = V = 2,
α_0(0) = π_0 T^(w_0)_{00} + π_1 T^(w_0)_{10}
α_0(1) = π_0 T^(w_0)_{01} + π_1 T^(w_0)_{11}
α_1(0) = Pr(seeing w_0 w_1 and ending up in state 0)
       = π_0 T^(w_0)_{00} T^(w_1)_{00} + π_0 T^(w_0)_{01} T^(w_1)_{10} + π_1 T^(w_0)_{10} T^(w_1)_{00} + π_1 T^(w_0)_{11} T^(w_1)_{10}
       = α_0(0) T^(w_1)_{00} + α_0(1) T^(w_1)_{10}
α_1(1) = Pr(seeing w_0 w_1 and ending up in state 1)
       = α_0(0) T^(w_1)_{01} + α_0(1) T^(w_1)_{11}
Forward Probabilities. Definition: α_t(j) = Pr(seeing w_0 ... w_t and ending up in state j). In general,
α_t(j) = Σ_i π_i T^(w_0)_{ij}            if t = 0
α_t(j) = Σ_i α_{t-1}(i) T^(w_t)_{ij}     if 0 < t < L
Pr(w_0 ... w_t) = Σ_j α_t(j)
Notice, w_0 ... w_t is represented as w[:t+1] in Python.
Forward Algorithm. Method: use forward probabilities.

from numpy import zeros

def wordprob(w):
    L = len(w)
    α = zeros((L, V), float)    # an L x V matrix of zeros
    for j in range(V):
        for i in range(V):
            α[0, j] += π[i] * T[w[0]][i, j]
    for t in range(1, L):
        for j in range(V):
            for i in range(V):
                α[t, j] += α[t-1, i] * T[w[t]][i, j]
    prob = 0
    for j in range(V):
        prob += α[L-1, j]
    return prob

This algorithm is O(L V^2). In some cases, the algorithm can be improved to be linear in V. See "Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis" by Felzenszwalb, Huttenlocher, and Kleinberg.
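The same pass can be written with one vector-matrix product per step. A hedged NumPy sketch that keeps the whole L x V table of forward probabilities (the Baum-Welch sketches later assume this function):

import numpy as np

def forward(w, π, T, V=2):
    # α[t, j] = Pr(seeing w_0 ... w_t and ending in state j)
    L = len(w)
    α = np.zeros((L, V))
    α[0] = π @ T[w[0]]
    for t in range(1, L):
        α[t] = α[t-1] @ T[w[t]]    # α_t(j) = Σ_i α_{t-1}(i) T^(w_t)_{ij}
    return α

# Pr(w) is then forward(w, π, T)[-1].sum()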
Complete Sets of Word Probabilities. Suppose we needed word probabilities for every word of length L. Is there a good way to do this? Notice: this problem is inherently exponential in L. Using the brute-force method on each of the N^L words gives an algorithm in O(N^L V^L L). Using the forward algorithm on each of the N^L words gives an algorithm in O(N^L V^2 L). Using the forward algorithm efficiently gives an algorithm in O(N^L V^2 log L) that gives probabilities for words of every length up to L. The key observation: to compute Pr(w_0 w_1 w_2), we also compute Pr(w_0) and Pr(w_0 w_1). So we can store these values and reuse them when computing the probability of any longer word that starts with the same prefix.
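A hedged sketch of the "use the forward algorithm efficiently" idea: grow words one symbol at a time and carry each prefix's forward vector along, so extending a prefix by one symbol costs only one vector-matrix product. The dictionary interface and names are assumptions of this sketch.

def all_word_probs(L, π, T, alphabet=(0, 1)):
    # returns {word (as a tuple): Pr(word)} for every word of length 1..L
    probs = {}
    level = [((), π.copy())]          # pairs of (prefix, forward row vector)
    for _ in range(L):
        next_level = []
        for prefix, α in level:
            for k in alphabet:
                α_k = α @ T[k]        # extend the prefix by symbol k
                word = prefix + (k,)
                probs[word] = α_k.sum()
                next_level.append((word, α_k))
        level = next_level
    return probs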
Complete Sets of Word Probabilities. (Figure slides: diagrams for a model λ illustrating the reuse of shared prefixes described above.)
Mathematica Code. The Even Process in Mathematica:

T[0] = {{1/2, 0}, {0, 0}};
T[1] = {{0, 1/2}, {1, 0}};
n = Table[{1}, {i, 1, Dimensions[T[0]][[1]]}];
A = {0, 1};
ue = Table[a[i], {i, 1, Length[n]}];
evec = Solve[{ue.(T[0] + T[1]) == ue, Sum[a[i], {i, 1, Length[n]}] == 1}, ue];
e = Table[evec[[1, i, 2]], {i, 1, Length[n]}];
wordprob[L_] := Module[{currentMatrices, i, words},
  currentMatrices := Fold[Dot, T[i[1]], Table[T[i[j]], {j, 2, L}]];
  words := Flatten[
    Fold[Table,
      {MyStringJoin[Table[MyToString[i[k]], {k, 1, L}]], (e.currentMatrices.n)},
      Table[{i[k], 0, 1}, {k, L, 1, -1}]
    ]] /. MyToString -> ToString /. MyStringJoin -> StringJoin;
  Return[words];
]

For appropriate |V| x |V| transition matrices, wordprob[L] computes the probabilities for every word of length L using the brute-force method for each of the N^L words. For L ≤ 5, Mathematica computes wordprob[L] as fast as Python does when using the forward algorithm intelligently!
Viterbi Path
Question and Example. Given an HMM and an output sequence, what sequence of states most likely caused the output sequence? (Figure: a two-state machine that stays in its current state with probability 0.99 and switches with probability 0.01.) If we observe a short output sequence, the possible internal state sequences are every assignment of a state to each time step. Since we are in one state most of the time, the most likely state sequence stays in that state for nearly every step.
Viterbi Path - Brute Force. The Viterbi path, ρ ∈ V^(L+1), is given by:
ρ(w) = argmax_s Pr(s | w)
Once again, we can compute this with brute force. Given a word w,
1. calculate Pr(w, s) = π_{s_0} T^(w_0)_{s_0 s_1} ... T^(w_{L-1})_{s_{L-1} s_L}, where s is a sequence of states
2. do this for all V^(L+1) possible state sequences s
3. return the s which maximizes Pr(w, s) (equivalently Pr(s | w), since Pr(w) does not depend on s)
This algorithm is nearly identical to the brute-force method for computing word probabilities and, thus, is also O(L V^L). As before, dynamic programming techniques should be used.
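A minimal brute-force sketch, assuming itertools and the arrays from the earlier Even Process example; it is only usable for tiny L, but it is handy as a check against the dynamic-programming version that follows.

from itertools import product

def viterbi_bruteforce(w, π, T, V=2):
    best_prob, best_path = 0.0, None
    for s in product(range(V), repeat=len(w) + 1):
        prob = π[s[0]]
        for i in range(len(w)):
            prob *= T[w[i]][s[i], s[i+1]]
        if prob > best_prob:
            best_prob, best_path = prob, list(s)
    return best_path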
Viterbi Algorithm. δ_t(j) = (probability, path). The first component is the probability of the Viterbi path for w_0 ... w_t that ends in state j:
probability of δ_t(j) = max_i π_i T^(w_0)_{ij}                                  if t = 0
probability of δ_t(j) = max_i [probability of δ_{t-1}(i)] T^(w_t)_{ij}          if 0 < t < L
The second component is a list of the states along the path mentioned above:
path of δ_t(j) = { argmax_i π_i T^(w_0)_{ij} }                                  if t = 0
path of δ_t(j) = [path of δ_{t-1}(i*)] ∪ { i* }                                 if 0 < t < L, where i* = argmax_i [probability of δ_{t-1}(i)] T^(w_t)_{ij}
The union operator should be understood as "append to the list".
Viterbi Algorithm. For each j ∈ V, δ_{L-1}(j) records a possible Viterbi path for w. The actual Viterbi path ρ is:
ρ = [path of δ_{L-1}(j*)] ∪ { j* },   where j* = argmax_j [probability of δ_{L-1}(j)]
That is, the correct path is the one with maximum likelihood.
Viterbi Example. Consider a word w = w_0 w_1 w_2 of length 3 for a 2-state HMM with alphabet X = {0, 1}.
Viterbi Example. First level (t = 0), state 0:
δ_0(0) = max( π_0 T^(w_0)_{00}, π_1 T^(w_0)_{10} )
For demonstration purposes, let's pick one of these paths to be more probable, say the one that starts in state 0. Notice, the algorithm assumes they are not equal. Thus, we now have
δ_0(0) = ( π_0 T^(w_0)_{00}, {0} )
Viterbi Example. First level (t = 0), state 1:
δ_0(1) = max( π_0 T^(w_0)_{01}, π_1 T^(w_0)_{11} )
Once again, let's pick one to be the maximum, say the path that starts in state 1. Thus,
δ_0(1) = ( π_1 T^(w_0)_{11}, {1} )
So far, we have:
δ_0(0) = ( π_0 T^(w_0)_{00}, {0} )
δ_0(1) = ( π_1 T^(w_0)_{11}, {1} )
Viterbi Example. Second level (t = 1):
δ_1(0) = max( δ_0(0) T^(w_1)_{00}, δ_0(1) T^(w_1)_{10} )
       = max( π_0 T^(w_0)_{00} T^(w_1)_{00}, π_1 T^(w_0)_{11} T^(w_1)_{10} )
Picking a maximum, say the first term, gives:
δ_1(0) = ( π_0 T^(w_0)_{00} T^(w_1)_{00}, {0, 0} )
δ_1(1) = max( δ_0(0) T^(w_1)_{01}, δ_0(1) T^(w_1)_{11} )
       = max( π_0 T^(w_0)_{00} T^(w_1)_{01}, π_1 T^(w_0)_{11} T^(w_1)_{11} )
Computing the maximum, say the first term again, gives:
δ_1(1) = ( π_0 T^(w_0)_{00} T^(w_1)_{01}, {0, 0} )
Now we have:
δ_0(0) = ( π_0 T^(w_0)_{00}, {0} )
δ_0(1) = ( π_1 T^(w_0)_{11}, {1} )
δ_1(0) = ( π_0 T^(w_0)_{00} T^(w_1)_{00}, {0, 0} )
δ_1(1) = ( π_0 T^(w_0)_{00} T^(w_1)_{01}, {0, 0} )
Viterbi Example. Third level (t = 2):
δ_2(0) = max( δ_1(0) T^(w_2)_{00}, δ_1(1) T^(w_2)_{10} )
       = max( π_0 T^(w_0)_{00} T^(w_1)_{00} T^(w_2)_{00}, π_0 T^(w_0)_{00} T^(w_1)_{01} T^(w_2)_{10} )
Hypothetically, we compute the maximum, say the first term, and obtain:
δ_2(0) = ( π_0 T^(w_0)_{00} T^(w_1)_{00} T^(w_2)_{00}, {0, 0, 0} )
δ_2(1) = max( δ_1(0) T^(w_2)_{01}, δ_1(1) T^(w_2)_{11} )
       = max( π_0 T^(w_0)_{00} T^(w_1)_{00} T^(w_2)_{01}, π_0 T^(w_0)_{00} T^(w_1)_{01} T^(w_2)_{11} )
Finally, we compute the maximum, say the second term, and obtain:
δ_2(1) = ( π_0 T^(w_0)_{00} T^(w_1)_{01} T^(w_2)_{11}, {0, 0, 1} )
In total, we have:
δ_0(0) = ( π_0 T^(w_0)_{00}, {0} )
δ_0(1) = ( π_1 T^(w_0)_{11}, {1} )
δ_1(0) = ( π_0 T^(w_0)_{00} T^(w_1)_{00}, {0, 0} )
δ_1(1) = ( π_0 T^(w_0)_{00} T^(w_1)_{01}, {0, 0} )
δ_2(0) = ( π_0 T^(w_0)_{00} T^(w_1)_{00} T^(w_2)_{00}, {0, 0, 0} )
δ_2(1) = ( π_0 T^(w_0)_{00} T^(w_1)_{01} T^(w_2)_{11}, {0, 0, 1} )
Viterbi Example. To find the Viterbi path for w = w_0 w_1 w_2, first we find the j* that maximizes δ_2(j):
j* = argmax{ π_0 T^(w_0)_{00} T^(w_1)_{00} T^(w_2)_{00} (the j = 0 term), π_0 T^(w_0)_{00} T^(w_1)_{01} T^(w_2)_{11} (the j = 1 term) }
Suppose the j = 0 term is the larger of the two. Then the Viterbi path is:
ρ = δ_2(0) ∪ {0} = {0, 0, 0} ∪ {0} = {0, 0, 0, 0}
Viterbi Example Code.

def viterbi(w):
    L = len(w)
    δ = {}
    for j in range(V):
        (v_prob, v_path) = (0, None)
        for i in range(V):
            prob = π[i] * T[w[0]][i, j]
            if prob > v_prob:
                (v_prob, v_path) = (prob, [i])
        δ[0, j] = (v_prob, v_path)
    for t in range(1, L):
        for j in range(V):
            (v_prob, v_path) = (0, None)
            for i in range(V):
                (prior_prob, prior_path) = δ[t-1, i]
                prob = prior_prob * T[w[t]][i, j]
                if prob > v_prob:
                    (v_prob, v_path) = (prob, prior_path + [i])
            δ[t, j] = (v_prob, v_path)
    value_max = 0
    argmax = None
    for j in range(V):
        if δ[L-1, j][0] > value_max:
            value_max = δ[L-1, j][0]
            argmax = j
    path = δ[L-1, argmax][1] + [argmax]
    return path

Like the forward algorithm, this algorithm is O(L V^2).
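A usage sketch, assuming the π, T, and V of the earlier Even Process arrays. For that machine a 0 can only be emitted from state 0, and a 1 emitted from state 1 must return to state 0, so the decoded path is pinned down by the word.

w = [0, 1, 1]
print(viterbi(w))    # [0, 0, 1, 0]: stay in state 0 for the 0, move to state 1 on the first 1, return on the second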
HMM Inference
The Idea. Given w and an assumed number of states, adjust λ = (π, {T_k}) to maximize Pr(w | λ). That is,
λ* = argmax_λ Pr(w | λ)
One common method to use is the Baum-Welch algorithm. In practice, only local maxima can be found. The method is iterative, producing a series of models λ_1, λ_2, ... such that
Pr(w | λ_{r+1}) > Pr(w | λ_r)
Eventually, the improvements on λ decrease to zero. This is a maximum-likelihood method via the expectation-maximization (EM) algorithm. (Figure: the chain of hidden states q_0, q_1, ..., q_t, q_{t+1}, ..., q_L, with the symbols of w emitted on the transitions between consecutive states.)
Backward Probabilities. Recall the forward probabilities:
α_t(j) = Pr(w_0 ... w_t, q_{t+1} = j | λ)
       = Σ_i π_i T^(w_0)_{ij}             if t = 0
       = Σ_i α_{t-1}(i) T^(w_t)_{ij}      if 0 < t < L
Now, the backward probabilities:
β_t(i) = Pr(w_{t+1} w_{t+2} ... w_{L-1} | q_{t+1} = i, λ)
       = 1                                  if t = L-1
       = Σ_j T^(w_{t+1})_{ij} β_{t+1}(j)    if t < L-1
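A hedged NumPy sketch of the backward pass, mirroring the forward table kept earlier; the base case β_{L-1}(i) = 1 says there is nothing left to see.

import numpy as np

def backward(w, T, V=2):
    # β[t, i] = Pr(w_{t+1} ... w_{L-1} | q_{t+1} = i)
    L = len(w)
    β = np.zeros((L, V))
    β[L-1] = 1.0
    for t in range(L-2, -1, -1):
        β[t] = T[w[t+1]] @ β[t+1]    # β_t(i) = Σ_j T^(w_{t+1})_{ij} β_{t+1}(j)
    return β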
Another Definition. The probability of being in state i at time t+1, given w:
γ_t(i) = Pr(q_{t+1} = i | w, λ)
Notice, Pr(w, q_{t+1} = i | λ) = α_t(i) β_t(i). Using Pr(A ∩ B) = Pr(A | B) Pr(B),
γ_t(i) = Pr(w, q_{t+1} = i | λ) / Pr(w | λ) = α_t(i) β_t(i) / Σ_i α_t(i) β_t(i)
Thus, we compute the forward and backward probabilities, and then we calculate γ_t(i). Given w, one way to estimate π is:
π̂_i = γ_0(i)
Yet Another Definition. The probability of being in state i at time t+1 and state j at time t+2, given w (for 0 ≤ t ≤ L-2):
ξ_t(i, j) = Pr(q_{t+1} = i, q_{t+2} = j | w, λ)
          = Pr(q_{t+1} = i, q_{t+2} = j, w | λ) / Pr(w | λ)
          = α_t(i) T^(w_{t+1})_{ij} β_{t+1}(j) / Σ_i Σ_j α_t(i) T^(w_{t+1})_{ij} β_{t+1}(j)
Notice, γ_t(i) is the marginal distribution:
γ_t(i) = Σ_j ξ_t(i, j)
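Putting the last two slides into code: a hedged sketch of the E-step that computes γ and ξ, assuming the forward and backward functions sketched earlier.

import numpy as np

def gammas_and_xis(w, π, T, V=2):
    # γ[t, i] = Pr(q_{t+1} = i | w), ξ[t, i, j] = Pr(q_{t+1} = i, q_{t+2} = j | w)
    L = len(w)
    α, β = forward(w, π, T, V), backward(w, T, V)
    prob_w = α[L-1].sum()                                  # Pr(w | λ)
    γ = α * β / prob_w
    ξ = np.zeros((L-1, V, V))
    for t in range(L-1):
        ξ[t] = α[t][:, None] * T[w[t+1]] * β[t+1][None, :]
        ξ[t] /= ξ[t].sum()                                 # the denominator also equals Pr(w | λ)
    return γ, ξ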
Baum-Welch Algorithm. Let λ and λ̄ be two different HMM specifications. Define
Q(λ, λ̄) = Σ_{q ∈ V^(L+1)} Pr(w, q | λ) log Pr(w, q | λ̄)
For state-output HMMs, it was shown² that:
Q(λ, λ̄) > Q(λ, λ)  implies  Pr(w | λ̄) > Pr(w | λ)
We can generate an edge-output HMM from a state-output HMM that describes the same process, so the results should hold (with slight modifications). This is also known as the forward-backward algorithm.
² Baum, Petrie, Soules, and Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.
Baum-Welch Algorithm. We can build a new λ̄ from λ and w:
π̄_i = prob. of being in state i at time t = 0 = γ_0(i)
T̄^(k)_{ij} = (expected number of transitions from state i to state j that emit k) / (expected number of transitions out of state i)
           = Σ_{t=0, w_{t+1}=k}^{L-2} ξ_t(i, j) / Σ_{t=0}^{L-2} γ_t(i)
So, we take λ̄ = (π̄, {T̄_k}).
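A hedged sketch of this re-estimation step, assuming the gammas_and_xis function above; the names and the single-word interface are assumptions of this sketch, and it assumes every state has a nonzero expected number of visits.

import numpy as np

def reestimate(w, π, T, V=2, alphabet=(0, 1)):
    # one Baum-Welch update (π̄, {T̄_k}) computed from a single word w
    γ, ξ = gammas_and_xis(w, π, T, V)
    π_new = γ[0].copy()
    visits = γ[:-1].sum(axis=0)            # expected number of transitions out of each state
    T_new = {}
    for k in alphabet:
        num = np.zeros((V, V))
        for t in range(len(w) - 1):
            if w[t+1] == k:
                num += ξ[t]                # transitions i -> j that emit symbol k
        T_new[k] = num / visits[:, None]
    return π_new, T_new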
Baum-Welch Procedure. Procedure:
# assume the number of states
# choose some initial λ, perhaps a uniform λ
# observe w
# using w, generate λ̄
while not λ ≈ λ̄:
    λ = λ̄
    # regenerate λ̄ from λ and w
Example: the re-estimated models improve Pr(w | λ) at every step and (eventually) converge.
Sources. Most sources use state-output HMMs.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, Vol. 77, No. 2, Feb 1989.
Wikipedia. Viterbi algorithm. http://en.wikipedia.org/wiki/viterbi_algorithm
Roger Boyle. Hidden Markov Models. http://www.comp.leeds.ac.uk/roger/hiddenmarkovmodels/html_dev/main.html
Benjamin Taitelbaum. The Uses of Hidden Markov Models and Adaptive Time-Delay Neural Networks in Speech Recognition. http://occs.cs.oberlin.edu/~btaitelb/projects/honors/paper.html
Narada Dilp Warakagoda. A Hybrid ANN-HMM ASR system with NN based adaptive preprocessing. http://jedlik.phy.bme.hu/~gerjanos/hmm/hoved.html