HMM, MEMM, and CRF: A Comparative Analysis of Statistical Modeling Methods

简介: HMM, MEMM, and CRF are three popular statistical modeling methods, often applied to pattern recognition and machine learning problems.

This article presents a comparison analysis of Hidden Markov Model (HMM), Maximum Entropy Markov Models (MEMM), and Conditional Random Fields (CRF). HMM, MEMM, and CRF are three popular statistical modeling methods, often applied to pattern recognition and machine learning problems. Let us explore each method in further detail.

Hidden Markov Model (HMM)

The word “Hidden” symbolizes the fact that only the symbols released by the system are observable, while the user cannot view the underlying random walk between states. Many in this field recognize HMM as a finite state machine.

1
Hidden Markov Model (HMM)

Advantages of HMM

HMM has a strong statistical foundation with efficient learning algorithms where learning can take place directly from raw sequence data. It allows consistent treatment of insertion and deletion penalties in the form of locally learnable methods and can handle inputs of variable length. They are the most flexible generalization of sequence profiles. It can also perform a wide variety of operations including multiple alignment, data mining and classification, structural analysis, and pattern discovery. It is also easy to combine into libraries.

Disadvantages of HMM

  • HMM is only dependent on every state and its corresponding observed object:

    The sequence labeling, in addition to having a relationship with individual words, also relates to such aspects as the observed sequence length, word context and others.


  • The target function and the predicted target function do not match:
    HMM acquires the joint distribution P(Y, X) of the state and the observed sequence, while in the estimation issue, we need a conditional probability P(Y|X).

Maximum Entropy Markov Models (MEMM)

2
Maximum Entropy Markov Models

MEMM takes into account the dependencies between neighboring states and the entire observed sequence, hence a better expression ability. MEMM does not consider P(X), which reduces the modeling workload and learns the consistency between the target function and the estimated function.

MEMM Labeling Bias

3
Viterbi algorithm decoding of MEMM

In Figure 3, State 1 tends to convert to State 2, and State 2 tends to stay at State 2 at the same time.

P(1-> 1-> 1-> 1)= 0.4 x 0.45 x 0.5 = 0.09, P(2->2->2->2)= 0.2 X 0.3 X 0.3 = 0.018,

P(1->2->1->2)= 0.6 X 0.2 X 0.5 = 0.06,P(1->1->2->2)= 0.4 X 0.55 X 0.3 = 0.066.

However, the optimal state conversion path is 1 > 1 > 1 >1. Why?

It is because State 2 has more convertible states than State 1 does, hence reducing the conversion probability – MEMM tends to select the state with fewer convertible states. Such selection is termed the labeling bias issue. CRF well addresses the labeling bias issue.

Conditional Random Fields (CRF model)

4

The CRF model has addressed the labeling bias issue and eliminated two unreasonable hypotheses in HMM. Of course, the model has also become more complicated.

MEMM adopts local variance normalization while CRF adopts global variance normalization.

On the other hand, MEMMs cannot find the corresponding parameters to meet the following distribution:

a b c --> a/A b/B c/C              p(A B C | a b c) = 1
a b e --> a/A b/D e/E             p(A D E | a b e) = 1
p(A|a)p(B|b,A)p(C|c,B) = 1                                     
p(A|a)p(D|b,A)p(E|e,D) = 1                                    
But CRFs can.                                                        

Generative model or discriminative model

Suppose o is the observed value, and m is the model.

a) Generative model: Infinite samples > Probability density model = Generative model > Prediction

If you model P(o|m), it is a generative model. The basic idea is, first, to establish the probability density model of the sample, and then use the model for inference prediction. The requirement of the samples being infinite or as large as possible is common knowledge. The method draws from statistical mechanics and Bayes theory.

HMM directly models the transition probability and phenotype probability, and calculates the probability of co-occurrence. Thus, it is a generative model.

b) Discriminative model: Finite samples > Discriminative function = Discriminative model > Prediction

If you model on the conditional probability P(m|o), it is the discriminative model. The basic idea is to establish the discriminant function with finite samples, and directly study the prediction model without considering the generative model of samples. Its representative theory is the statistical learning theory.

CRF is a discriminant model. MEMM is not a generative model, but a model with finite states based on state classification.

Topological structure

HMM and MEMM are a directed graph, while CRF is an undirected graph.

Global optimum or local optimum

HMM directly models the transition probability and the phenotype probability, and calculates the probability of co-occurrence.

MEMM establishes the probability of co-occurrence based on the transition probability and the phenotype probability. It calculates the conditional probability, and only adopts the local variance normalization, making it easy to fall into a local optimum.

CRF calculates the normalization probability in the global scope, rather than in the local scope as is the case with MEMM. It is an optimal global solution and resolves the labeling bias issue in MEMM.

Advantages and Disadvantages of CRF

Advantages

  • Compared with HMM: Since CRF does not have as strict independence assumptions as HMM does, it can accommodate any context information. Its feature design is flexible (the same as ME).
  • Compared with MEMM: Since CRF computes the conditional probability of global optimal output nodes, it overcomes the drawbacks of label bias in MEMM.
  • Compared with ME: CRF computes the joint probability distribution of the entire label sequence when an observation sequence intended for labeling is available, rather than defining the state distribution of the next state under the current state conditions given.

Disadvantages

CRF is highly computationally complex at the training stage of the algorithm. It makes it very difficult to re-train the model when newer data becomes available.

Conclusion:

This post detailed out a comparative analysis between Hidden Markov Model (HMM), Maximum Entropy Markov Models (MEMM), and Conditional Random Fields (CRF). In this post we categorically learnt that CRFs and MEMMS are mainly discriminative sequence models whereas HMMs are primarily generative sequence models. It is Bayes Rule that forms the basis of HMM. On the contrary, CRF and MEMM’s based on MaxEnt models over transition and observable features.

目录
相关文章
|
8月前
|
机器学习/深度学习 人工智能 自然语言处理
OneIE:A Joint Neural Model for Information Extraction with Global Features论文解读
大多数现有的用于信息抽取(IE)的联合神经网络模型使用局部任务特定的分类器来预测单个实例(例如,触发词,关系)的标签,而不管它们之间的交互。
103 0
|
6月前
|
自然语言处理 数据挖掘 数据处理
【提示学习】Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
目前流行的第四大范式Prompt的主流思路是PVP,即Pattern-Verbalizer-Pair,主打的就是Pattern(模板)与Verbalizer(标签映射器)。   本文基于PVP,提出PET与iPET,但是关注点在利用半监督扩充自己的数据集,让最终模型学习很多样本,从而达到好效果。
|
6月前
|
机器学习/深度学习 数据挖掘
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
30 1
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
|
8月前
|
编解码 计算机视觉
NeRF系列(3): Semantic-aware Occlusion Filtering Neural Radiance Fields in the Wild 论文解读
NeRF系列(3): Semantic-aware Occlusion Filtering Neural Radiance Fields in the Wild 论文解读
146 2
|
8月前
|
机器学习/深度学习 资源调度 算法
【RLchina第四讲】Model-Based Reinforcement Learning(上)
【RLchina第四讲】Model-Based Reinforcement Learning(上)
213 0
|
8月前
|
机器学习/深度学习 算法
【RLchina第四讲】Model-Based Reinforcement Learning(下)
【RLchina第四讲】Model-Based Reinforcement Learning(下)
108 0
|
8月前
|
机器学习/深度学习 开发框架 数据建模
HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction
远程监督假设任何包含相同实体对的句子都反映了相同的关系。先前的远程监督关系抽取(DSRE)任务通常独立地关注sentence-level或bag-level去噪技术
98 0
|
8月前
|
机器学习/深度学习 编解码 自然语言处理
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers论文解读
我们提出了SegFormer,一个简单,高效而强大的语义分割框架,它将transformer与轻量级多层感知器(MLP)解码器统一起来。
426 0
|
8月前
|
存储 自然语言处理 测试技术
LASS: Joint Language Semantic and Structure Embedding for Knowledge Graph Completion 论文解读
补全知识三元组的任务具有广泛的下游应用。结构信息和语义信息在知识图补全中都起着重要作用。与以往依赖知识图谱的结构或语义的方法不同
67 0
|
8月前
|
自然语言处理 算法 知识图谱
DEGREE: A Data-Efficient Generation-Based Event Extraction Model论文解读
事件抽取需要专家进行高质量的人工标注,这通常很昂贵。因此,学习一个仅用少数标记示例就能训练的数据高效事件抽取模型已成为一个至关重要的挑战。
73 0

热门文章

最新文章