Implementing BiLSTM+CNN+CRF Sequence Tagging for NER with Keras



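# Note: keras_contrib targets standalone Keras 2.x, so the layers come from `keras` rather than `tf.keras`
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Conv1D
from keras_contrib.layers.crf import CRF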

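Named entity recognition (NER) is an important branch of natural language processing (NLP). It aims to find entities in text and assign them to categories such as person names, organizations, locations, and times. Early NER systems were mostly rule-based, but because languages are complex and hand-written rules never cover every case, machine learning methods have become the mainstream approach. BiLSTM-CRF is one of the most widely used architectures for the task. A CNN layer can add local features to the BiLSTM representation, which may improve accuracy, so this article walks through an implementation of a BiLSTM-CNN-CRF model.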

1. Data processing

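First, we need to load the dataset. We use the CoNLL-2003 dataset, which contains fully annotated English text. NER datasets are usually stored one token per line, with a blank line between sentences. Each line below holds the word, its part-of-speech tag, its chunk tag, and its NER tag:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

The B- and I- prefixes mark the token that begins an entity and the tokens inside an entity, respectively. In the original CoNLL-2003 files, B- appears only when an entity directly follows another entity of the same type, which is why U.N. is tagged I-ORG above. The code below loads the CoNLL-2003 dataset: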


import os
import codecs
import numpy as np

def load_data():
    # Paths to the dataset files
    TRAIN_FILE = './data/train.txt'
    DEV_FILE = './data/dev.txt'
    TEST_FILE = './data/test.txt'
    # Read and return the three splits
    train_sentences, train_tags = read_conll2003(TRAIN_FILE)
    dev_sentences, dev_tags = read_conll2003(DEV_FILE)
    test_sentences, test_tags = read_conll2003(TEST_FILE)
    return train_sentences, train_tags, dev_sentences, dev_tags, test_sentences, test_tags

def read_conll2003(filename):
    """
    Read a CoNLL-2003 file in which each line holds one token and its annotations,
    separated by whitespace, and sentences are separated by blank lines.
    :param filename: path to the file
    :return: (sentences, labels), where sentences[i] is the list of words of
             sentence i and labels[i] is the list of its NER tags
    """
    sentences = []
    labels = []
    with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        words = []
        tags = []
        for line in f:
            line = line.strip()  # strip surrounding whitespace, including the newline
            if len(line) == 0 or line.startswith('-DOCSTART-'):  # blank line or document marker: sentence boundary
                if len(words) > 0:
                    sentences.append(words)
                    labels.append(tags)
                    words = []
                    tags = []
            else:
                splits = line.split()
                if len(splits) < 2:
                    continue  # skip malformed lines so words and tags stay aligned
                words.append(splits[0])
                tags.append(splits[-1])  # the NER tag is the last column
        # Flush the last sentence if the file does not end with a blank line
        if len(words) > 0:
            sentences.append(words)
            labels.append(tags)
    return sentences, labels
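
To make the B-/I- prefixes described above concrete, here is a minimal sketch that turns one sentence's tag list into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the article's code), and the sketch assumes that an I- tag following O or a different entity type also opens a new entity, as in the original CoNLL-2003 files.

```python
def bio_to_spans(tags):
    """Return (entity_type, start, end_exclusive) spans from a BIO/IOB tag list."""
    spans = []
    start, current = None, None
    # A trailing 'O' sentinel flushes an entity that reaches the end of the sentence.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, etype = tag.partition('-')
        # An open entity ends on 'O', on a 'B-' tag, or when the entity type changes.
        if current is not None and (tag == 'O' or prefix == 'B' or etype != current):
            spans.append((current, start, i))
            start, current = None, None
        # Any non-'O' tag opens a new entity when none is open.
        if tag != 'O' and current is None:
            start, current = i, etype
    return spans


# Tags of the sample sentence "U.N. official Ekeus heads for Baghdad ."
print(bio_to_spans(['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']))
# [('ORG', 0, 1), ('PER', 2, 3), ('LOC', 5, 6)]
```

Spans like these are also the unit that entity-level evaluation compares, which is why keeping words and tags aligned while reading the file matters.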

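To feed text into a neural network, every word has to be converted into numbers. This is usually done with word embeddings, which map each word to a vector in a continuous vector space. In this example we use pre-trained GloVe embeddings. The glove.6B vectors used here were trained on Wikipedia and Gigaword newswire text, and they can be used directly as general-purpose word representations (much like word2vec).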


def load_glove_embeddings():
    # Load the pre-trained GloVe vectors into a {word: vector} dict
    embeddings = {}
    with codecs.open('./data/glove.6B.100d.txt', 'r',
                     encoding='utf8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load and prepare the data (the word-to-vector mapping could also be defined by hand)
train_sentences, train_tags, dev_sentences, dev_tags, test_sentences, test_tags = load_data()
print('Number of training sentences: %d' % len(train_sentences))
train_word_embeddings = load_glove_embeddings()
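
The dictionary above maps words to vectors, but an embedding layer consumes integer word ids, so a word-to-id table is needed as well. The following is a minimal sketch of that step. The helper names `build_word_index` and `encode_sentence` and the toy vectors are assumptions made for illustration; id 0 is shared by padding and unknown words, and words are lowercased because the glove.6B vectors are uncased.

```python
def build_word_index(embeddings):
    """Assign ids 1..N to the words of an embedding dict, in insertion order."""
    return {word: i for i, word in enumerate(embeddings, start=1)}


def encode_sentence(words, word2index, unknown_id=0):
    """Turn a list of words into ids; words without a vector get `unknown_id`."""
    # glove.6B vectors are lowercased, so look every word up in lowercase.
    return [word2index.get(w.lower(), unknown_id) for w in words]


# Toy stand-in for the dict returned by load_glove_embeddings() (2-d vectors for brevity).
toy_embeddings = {'the': [0.1, 0.2], 'official': [0.3, 0.4], 'heads': [0.5, 0.6]}
word2index = build_word_index(toy_embeddings)
ids = encode_sentence(['The', 'official', 'Ekeus', 'heads'], word2index)
print(ids)  # [1, 2, 0, 3]
```

Row i of the embedding weight matrix then has to hold the vector of the word with id i, so the matrix and the id table must be built from the same ordering.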

2. Building the model

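Before building the model, we need to pad the sentences so that they all have the same length. For shorter sentences this means appending zeros at the end; for sentences that are too long it means cutting off the beginning (or the end). We use Keras's `pad_sequences` function for this.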
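
As an illustration of what padding and truncation do, here is a plain-Python sketch that mirrors the behavior of `pad_sequences(..., padding='post', truncating='post')`. The function name `pad_post` is hypothetical; in practice the Keras function is used, and note that its defaults for both options are `'pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate at the end so every sequence has length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post'
    return padded


batch = [[4, 7], [3, 9, 2, 8, 5, 1]]
print(pad_post(batch, maxlen=4))  # [[4, 7, 0, 0], [3, 9, 2, 8]]
```

Because the padding value is 0, the id 0 should not be assigned to any real word or tag.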

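A bidirectional LSTM is commonly used for NER because it takes both the preceding and the following tokens into account: the tag of the current token is inferred from the context on both sides. We therefore place a bidirectional LSTM right after the embedding layer.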


def get_bilstm_cnn_crf_model(word_embeddings, num_tags, train=False):
    # Input: a batch of sentences, already padded to a common length and encoded as word ids
    input_layer = Input(shape=(None,), dtype='int32', name='words_input')

    # Word embeddings: row 0 is reserved for padding and unknown words;
    # the i-th word of `word_embeddings` (in insertion order) gets id i + 1
    embedding_weights = np.zeros((len(word_embeddings) + 1, 100))
    for index, vector in enumerate(word_embeddings.values(), start=1):
        embedding_weights[index] = vector
    embedding_layer = Embedding(len(word_embeddings) + 1,
                                100,
                                weights=[embedding_weights],
                                trainable=train,
                                # Conv1D in standalone Keras 2.x rejects masked input, so masking stays off
                                mask_zero=False,
                                name='word_embedding_layer')(input_layer)
    dropout_1 = Dropout(0.5, name='dropout_1')(embedding_layer)
    # Bidirectional LSTM
    bilstm_layer = Bidirectional(LSTM(units=256, return_sequences=True, recurrent_dropout=0.1), name='bilstm_layer')(dropout_1)
    # 1D convolution
    cnn_1d = Conv1D(filters=100, kernel_size=1, activation='relu')(bilstm_layer)
    # CRF layer; sparse_target=True expects integer labels of shape (batch, length, 1)
    crf_layer = CRF(num_tags, sparse_target=True)
    out = crf_layer(cnn_1d)
    return Model([input_layer], [out])

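We add a 1D convolutional (CNN) layer on top of the BiLSTM. The idea is inspired by the paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF"; note that the paper applies its CNN to the characters of each word to build the representation fed into the BiLSTM, whereas here the convolution is applied to the BiLSTM output. Compared with an LSTM, a CNN extracts local features of the sequence, so combining the two can help performance. The kernel size is set to 1, which means the convolution looks only at the current token: it acts as a per-position projection of the BiLSTM features, and a larger kernel would be needed to mix in neighboring tokens. Finally, we add a CRF as the last layer to further improve accuracy. A CRF is usually better suited to sequence labeling than a softmax classifier because it takes the dependencies between tags into account.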
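
To illustrate how a CRF uses dependencies between tags, the following is a minimal plain-Python sketch of Viterbi decoding, the standard algorithm for finding the best tag sequence given per-token scores and tag-to-tag transition scores. It is not the keras_contrib implementation; the function name `viterbi_decode` and all scores in the example are made up for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(num_tags):
            # Best previous tag for landing on tag k at this position.
            best_prev = max(range(num_tags), key=lambda j: scores[j] + transitions[j][k])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + emit[k])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(num_tags), key=lambda k: scores[k])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Tags: 0 = 'O', 1 = 'B-PER', 2 = 'I-PER'. The transition O -> I-PER is heavily penalized.
emissions = [[2.0, 1.0, 0.0],
             [0.0, 0.5, 1.0],
             [1.0, 0.0, 0.0]]
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.5]]
print(viterbi_decode(emissions, transitions))  # [1, 2, 0]
```

Picking the best tag at each position independently would give O, I-PER, O, which is not a valid tag sequence; with the transition scores the decoder returns B-PER, I-PER, O instead.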

3. Training and evaluating the model

We now train the model above and report its performance on the validation and test sets:


# Load the dataset and the embeddings
train_sentences, train_tags, dev_sentences, dev_tags, test_sentences, test_tags = load_data()
train_word_embeddings = load_glove_embeddings()
# Word ids follow the insertion order of the embedding dict; id 0 is shared by padding and unknown words
word2index = {word: i for i, word in enumerate(train_word_embeddings, start=1)}
# Tag ids: 0 is the padding tag, 'O' is 1, the remaining tags are numbered in sorted order
tag_set = set(tag for doc in train_tags + dev_tags + test_tags for tag in doc)
tag2index = {'': 0, 'O': 1}
for tag in sorted(tag_set):
    if tag not in tag2index:
        tag2index[tag] = len(tag2index)
index2tag = {index: tag for tag, index in tag2index.items()}
MAX_LEN = 200
# Convert the dataset into the form the model expects
def sentences_to_indices(data_sentences, mapping, lowercase=False):
    data_idx = []
    for sentence in data_sentences:
        # tokens missing from the mapping get id 0
        data_idx.append([mapping.get(token.lower() if lowercase else token, 0) for token in sentence])
    return data_idx
def pad(sequences):
    return tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post', value=0)
# glove.6B vectors are lowercased, so words are lowercased before the lookup
x_train = pad(sentences_to_indices(train_sentences, word2index, lowercase=True))
x_dev = pad(sentences_to_indices(dev_sentences, word2index, lowercase=True))
x_test = pad(sentences_to_indices(test_sentences, word2index, lowercase=True))
# sparse_target=True in the CRF layer expects integer labels of shape (batch, length, 1)
y_train = np.expand_dims(pad(sentences_to_indices(train_tags, tag2index)), -1)
y_dev = np.expand_dims(pad(sentences_to_indices(dev_tags, tag2index)), -1)
y_test = np.expand_dims(pad(sentences_to_indices(test_tags, tag2index)), -1)
model = get_bilstm_cnn_crf_model(train_word_embeddings, num_tags=len(tag2index))
crf_layer = model.layers[-1]  # the CRF layer is the last layer of the model
# Train the model
from keras.optimizers import Adam
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=optimizer, loss=crf_layer.loss_function, metrics=[crf_layer.accuracy])
model.fit(x_train, y_train, validation_data=(x_dev, y_dev), batch_size=32, epochs=10)
# Evaluate the model
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
print('Evaluating the model')
# In its default 'join' mode the keras_contrib CRF layer returns the Viterbi-decoded path as one-hot vectors at prediction time
pred = model.predict(x_test, verbose=1)
pred_ids = np.argmax(pred, axis=-1)
test_tags2, pred_tags = [], []
for sent_tags, sent_pred in zip(test_tags, pred_ids):
    length = min(len(sent_tags), MAX_LEN)  # drop the padded positions
    test_tags2.append(sent_tags[:length])
    # a predicted padding tag (id 0) is counted as 'O'
    pred_tags.append([index2tag[i] or 'O' for i in sent_pred[:length]])
print('F1:', f1_score(test_tags2, pred_tags))
print(classification_report(test_tags2, pred_tags))

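After training, we evaluate the model: we obtain predicted tags for the test set and score them against the gold tags with the functions provided by the seqeval library, which reports entity-level precision, recall, and F1 rather than token-level accuracy.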
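
To show what an entity-level score measures, here is a simplified plain-Python sketch of micro-averaged precision, recall, and F1 over exact entity matches. It is an illustration of the idea, not seqeval's code; the helper names `extract_entities` and `entity_prf` are hypothetical.

```python
def extract_entities(tags):
    """Collect (type, start, end_exclusive) spans from one BIO/IOB-tagged sentence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel flushes the last entity
        prefix, _, etype = tag.partition('-')
        if current is not None and (tag == 'O' or prefix == 'B' or etype != current):
            entities.add((current, start, i))
            current = None
        if tag != 'O' and current is None:
            start, current = i, etype
    return entities


def entity_prf(true_sentences, pred_sentences):
    """Micro-averaged entity-level precision, recall and F1."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(true_sentences, pred_sentences):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        n_true += len(gold)
        n_pred += len(pred)
        n_correct += len(gold & pred)  # an entity counts only if type and boundaries match
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'I-PER', 'O', 'O']]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall)  # 1.0 0.5
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities is found, which is why entity-level recall is lower than token accuracy would suggest.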

4. Summary

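In this article we showed how to build a named entity recognition (NER) model called BiLSTM-CNN-CRF with Python and Keras. NER is a common NLP task that involves identifying named entities in text, such as people, organizations, and locations.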

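Unlike traditional approaches, our model combines a bidirectional LSTM, a CNN, and a CRF to extract features from the sequence and improve accuracy. We also showed how to load pre-trained GloVe embeddings so that words can be converted into numeric vectors.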
