TextCNN Text Classification Explained: Building a Simple TextCNN Step by Step with TensorFlow



  • Preface

    Our project team recently used TextCNN for text classification and achieved good accuracy. To get a clearer picture of the TextCNN architecture, this post translates and walks through a TensorFlow implementation of TextCNN.


    I. What is TextCNN

    TextCNN is an algorithm that applies convolutional neural networks to text classification. It was proposed by Yoon Kim in "Convolutional Neural Networks for Sentence Classification".

    1. Model

    [Figure 1: TextCNN architecture]

    The first layer embeds words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors with several filter sizes, e.g. sliding over 3, 4, or 5 words at a time. The convolution outputs are then max-pooled into one long feature vector, dropout regularization is applied, and the result is classified with a softmax layer.
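
    To make the data flow concrete, here is a minimal shape walk-through with hypothetical numbers (batch size 64, sentence length 20, embedding size 128, filter sizes 3/4/5 with 100 filters each; these values are illustrative, not taken from the original post):

    # Illustrative shape arithmetic for one batch (all values are made up for this example)
    batch_size, sequence_length, embedding_size = 64, 20, 128
    filter_sizes, num_filters = [3, 4, 5], 100

    # embedding lookup:   [batch_size, sequence_length, embedding_size]
    # after expand_dims:  [batch_size, sequence_length, embedding_size, 1]
    for k in filter_sizes:
        conv_len = sequence_length - k + 1  # "VALID" (narrow) convolution output length
        print("filter %d: conv [%d, %d, 1, %d] -> max-pool [%d, 1, 1, %d]"
              % (k, batch_size, conv_len, num_filters, batch_size, num_filters))
    # concatenating the pooled outputs gives [batch_size, len(filter_sizes) * num_filters] = [64, 300]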

    II. Building a Simple TextCNN Step by Step

    The code is dennybritz's implementation, [Github] cnn-text-classification-tf; the following walks through it in detail.

    1. The TextCNN class

    The TextCNN class is initialized as follows:

    class TextCNN(object):
        def __init__(self, sequence_length, num_classes, vocab_size,
                     embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
    

    The parameters are:

    Parameter        Meaning
    sequence_length  Sentence length: the maximum length of an input sentence, e.g. 20
    num_classes      Number of output classes, e.g. 2400
    vocab_size       Vocabulary size, e.g. 3000
    embedding_size   Word embedding dimension, e.g. 128
    filter_sizes     Convolution filter sizes, e.g. [3, 4, 5]: filters of width 3, 4 and 5
                     are used, each size with num_filters filters, so this example has
                     3 * num_filters filters in total
    num_filters      Number of filters per filter size
    l2_reg_lambda    L2 regularization weight, e.g. 0.0

    1.1 Input Placeholders

    # Placeholders for input, output and dropout
    self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
    self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
    self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
    

    Here we create placeholder variables that will hold the training and test inputs. In the shape argument of each placeholder, the first dimension is the batch size; None means that dimension can take any value, leaving it to be determined at run time.
    The probability of keeping a neuron in the dropout layer is also an input to the network, because we disable dropout at test time.
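
    For instance, feeding these placeholders might look like the following sketch, where cnn is a TextCNN instance (created later in train.py) and x_batch / y_batch are hypothetical arrays shaped as described above:

    # Hypothetical feeds (not part of the original code):
    # x_batch: [batch_size, sequence_length] word ids
    # y_batch: [batch_size, num_classes] one-hot labels
    train_feed = {cnn.input_x: x_batch, cnn.input_y: y_batch, cnn.dropout_keep_prob: 0.5}
    eval_feed  = {cnn.input_x: x_batch, cnn.input_y: y_batch, cnn.dropout_keep_prob: 1.0}  # dropout disabled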

    1.2 Embedding Layer

    The embedding layer maps each word to a lower-dimensional vector representation.

    with tf.device('/cpu:0'), tf.name_scope("embedding"):
        W = tf.Variable(
            tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
            name="W")
        self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
        self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
    

    A couple of features are used here:

    • tf.device('/cpu:0') forces the embedding operation onto the CPU. By default TensorFlow would place it on the GPU (if one is available), but the embedding op currently raises an error when run on the GPU.
    • tf.name_scope("embedding") puts the operations into a name scope. The name scope groups all of these ops under a top-level node called "embedding", which gives a clean hierarchy when visualizing the network in TensorBoard.

    W is the embedding matrix that we learn during training; we initialize it from a random uniform distribution. tf.nn.embedding_lookup creates the actual embedding operation. Its result is a 3-dimensional tensor of shape [None, sequence_length, embedding_size].
    TensorFlow's convolution op expects a 4-dimensional tensor with dimensions corresponding to batch, width, height and channel. Our embedding result has no channel dimension, so we add it manually, leaving a tensor of shape [None, sequence_length, embedding_size, 1].
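
    These shapes can be checked directly; a small standalone sketch (TF 1.x API, toy sizes chosen for illustration) might look like this:

    import tensorflow as tf

    vocab_size, embedding_size, sequence_length = 3000, 128, 20
    W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
    input_x = tf.placeholder(tf.int32, [None, sequence_length])
    embedded = tf.nn.embedding_lookup(W, input_x)       # look up rows of W by word id
    embedded_expanded = tf.expand_dims(embedded, -1)    # add the channel dimension
    print(embedded.get_shape())            # (?, 20, 128)
    print(embedded_expanded.get_shape())   # (?, 20, 128, 1)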

    1.3 Convolution and Max-Pooling Layers

    Now we build the convolutional layers followed by max-pooling. Because each filter size produces tensors of a different shape, we create a layer for each of them and then merge the results into one big feature vector.

    pooled_outputs = []  # pooled outputs, one per filter size
    for i, filter_size in enumerate(filter_sizes):
        # iterate over the filter sizes
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            # Convolution Layer
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
            conv = tf.nn.conv2d(
                self.embedded_chars_expanded,
                W,
                strides=[1, 1, 1, 1],
                padding="VALID",
                name="conv")
            # Apply nonlinearity
            h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
            # Max-pooling over the outputs
            pooled = tf.nn.max_pool(
                h,
                ksize=[1, sequence_length - filter_size + 1, 1, 1],
                strides=[1, 1, 1, 1],
                padding='VALID',
                name="pool")
            pooled_outputs.append(pooled)
     
    # Combine all the pooled features
    num_filters_total = num_filters * len(filter_sizes)
    self.h_pool = tf.concat(pooled_outputs, 3)
    self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
    
    

    Here W is the filter matrix and h is the result of applying the nonlinearity to the convolution output. Each filter slides over the whole embedding, but differs in how many words it covers. "VALID" padding means we slide the filter over the sentence without padding the edges, performing a narrow convolution that gives an output of shape [1, sequence_length - filter_size + 1, 1, 1]. Max-pooling over the output of a specific filter size leaves a tensor of shape [batch_size, 1, 1, num_filters]. This is essentially a feature vector, where the last dimension corresponds to our features.

    Once we have all the pooled output tensors from each filter size, we combine them into one long feature vector of shape [batch_size, num_filters_total]. Using -1 in tf.reshape tells TensorFlow to flatten the dimension where possible.
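
    To make those shapes concrete, here is a small standalone check of one conv/max-pool branch (TF 1.x API, toy sizes; not part of the original code):

    import tensorflow as tf

    sequence_length, embedding_size, filter_size, num_filters = 20, 128, 3, 100
    x = tf.placeholder(tf.float32, [None, sequence_length, embedding_size, 1])
    W = tf.Variable(tf.truncated_normal([filter_size, embedding_size, 1, num_filters], stddev=0.1))
    conv = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="VALID")
    pooled = tf.nn.max_pool(conv, ksize=[1, sequence_length - filter_size + 1, 1, 1],
                            strides=[1, 1, 1, 1], padding="VALID")
    print(conv.get_shape())    # (?, 18, 1, 100)  -- 18 = sequence_length - filter_size + 1
    print(pooled.get_shape())  # (?, 1, 1, 100)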


    1.4 Dropout Layer

    Dropout is probably the most popular way to regularize convolutional neural networks, and the idea is simple: the dropout layer randomly "disables" a fraction of the neurons, which prevents them from co-adapting and forces them to learn useful features independently. The fraction of neurons kept enabled is controlled by the dropout_keep_prob parameter passed at initialization; we set it to 0.5 during training and to 1 (dropout disabled) during evaluation.

    # Add dropout
    with tf.name_scope("dropout"):
        self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
    

    1.5 Scores and Predictions

    Using the feature vector from max-pooling (with dropout applied), we can generate predictions with a matrix multiplication and pick the class with the highest score. We could also apply softmax to turn the raw scores into normalized probabilities, but that would not change the prediction.

    with tf.name_scope("output"):
        W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
        self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
        self.predictions = tf.argmax(self.scores, 1, name="predictions")
    

    Here, tf.nn.xw_plus_b is a convenience wrapper that performs the Wx + b matrix multiplication.
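
    In other words, tf.nn.xw_plus_b(x, W, b) is equivalent to tf.matmul(x, W) + b; a quick sanity check (illustrative only, not part of the original code):

    import tensorflow as tf
    import numpy as np

    x = tf.constant(np.random.rand(2, 4), dtype=tf.float32)   # [batch, num_filters_total]
    W = tf.constant(np.random.rand(4, 3), dtype=tf.float32)   # [num_filters_total, num_classes]
    b = tf.constant(np.zeros(3), dtype=tf.float32)             # [num_classes]
    with tf.Session() as sess:
        a = sess.run(tf.nn.xw_plus_b(x, W, b))
        m = sess.run(tf.matmul(x, W) + b)
        print(np.allclose(a, m))  # True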

    1.6 Loss and Accuracy

    Using the scores from 1.5 we can define the loss function. The standard loss for classification problems is the cross-entropy loss.

    # Calculate mean cross-entropy loss
    with tf.name_scope("loss"):
        losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
        self.loss = tf.reduce_mean(losses)
    

    Here, tf.nn.softmax_cross_entropy_with_logits is a convenience function that computes the cross-entropy loss for each example, given the scores and the correct labels; taking the mean of these losses gives the average loss.
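
    As a quick numeric illustration of what this loss computes for a single example (pure NumPy, values chosen arbitrarily):

    import numpy as np

    scores = np.array([2.0, 1.0, 0.1])                 # unnormalized scores for one example
    label  = np.array([1.0, 0.0, 0.0])                 # one-hot ground truth
    probs  = np.exp(scores) / np.sum(np.exp(scores))   # softmax
    loss   = -np.sum(label * np.log(probs))            # cross-entropy for this example
    print(probs, loss)   # probs ~ [0.659, 0.242, 0.099], loss ~ 0.417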

    We also define a function for the accuracy.

    # Calculate Accuracy
    with tf.name_scope("accuracy"):
        correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
    

    1.7 Visualizing the Network

    That completes the network definition. To get an overview of it, we can visualize the graph with TensorBoard.

    [Figure 4: Network graph visualized in TensorBoard]

    1.8 Training

    FLAGS = tf.flags.FLAGS
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=FLAGS.allow_soft_placement,
            log_device_placement=FLAGS.log_device_placement
        )
        sess = tf.Session(config=session_conf)
        with sess.as_default():
    

    Explicitly creating a graph and session makes it easier to release resources once training is finished. The allow_soft_placement setting lets TensorFlow fall back to a device that has the operation implemented when the preferred device does not exist, and log_device_placement logs which device each operation is placed on.

    1.9 Instantiating the CNN and Minimizing the Loss

    When we instantiate the TextCNN model, all of its variables and operations are placed into the default graph and session created above.

    cnn = TextCNN(
        sequence_length=x_train.shape[1],
        num_classes=y_train.shape[1],
        vocab_size=len(vocab_processor.vocabulary_),
        embedding_size=FLAGS.embedding_dim,
        filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
        num_filters=FLAGS.num_filters)
    

    Next, we define how to optimize the network's loss function. TensorFlow has several built-in optimizers; here we use the Adam optimizer.

    # Define Training procedure
    global_step = tf.Variable(0,name="global_step",trainable=False)
    optimizer = tf.train.AdamOptimizer(1e-4)
    grads_and_vars = optimizer.compute_gradients(cnn.loss)
    train_op = optimizer.apply_gradients(grads_and_vars,global_step=global_step)
    

    1.10 Summaries

    TensorFlow has a notion of summaries, which let you track and visualize various quantities during training and evaluation. For example, you probably want to track how your loss and accuracy evolve over time, but you can also track more complex quantities such as histograms of layer activations. Summaries are serialized objects that are written to disk with a summary writer (tf.summary.FileWriter).

    # Output directory for models and summaries
    timestamp = str(int(time.time()))
    out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
    print("Writing to {}\n".format(out_dir))
    
    # Summaries for loss and accuracy
    loss_summary = tf.summary.scalar("loss", cnn.loss)
    acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
    
    # Train Summaries
    train_summary_op = tf.summary.merge([loss_summary, acc_summary])
    train_summary_dir = os.path.join(out_dir, "summaries", "train")
    train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
    
    # Dev summaries
    dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
    dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
    dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
    

    Here we track summaries for training and evaluation separately. In our case they are the same quantities, but you may have quantities you only want to track during training (such as parameter update values). tf.summary.merge is a convenience function that merges several summary operations into a single op that can be executed.

    1.11 Checkpointing

    Another TensorFlow feature you typically want to use is checkpointing: saving the model's parameters so they can be restored later. Checkpoints can be used to resume training at a later point, or to pick the best parameter setting with early stopping. Checkpoints are created with a Saver object.

    # Checkpointing
    checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
    checkpoint_prefix = os.path.join(checkpoint_dir, "model")
    # Tensorflow assumes this directory already exists so we need to create it
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    saver = tf.train.Saver(tf.global_variables())
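
    For completeness, restoring from the most recent checkpoint later (for example to resume training or to evaluate the best model) could look roughly like this sketch (not part of the original code):

    # Hypothetical restore sketch: load the latest checkpoint saved in checkpoint_dir
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    if ckpt:
        saver.restore(sess, ckpt)  # variables now hold the saved values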
    

    1.12 Initializing the Variables

    Before we can train the model we need to initialize the variables in the graph.

    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    

    1.13 Defining a Single Training Step

    Now we define a function for a single training step: it evaluates the model on a batch of data and updates the model parameters.

    def train_step(x_batch, y_batch):
        """
        A single training step
        """
        feed_dict = {
            cnn.input_x: x_batch,
            cnn.input_y: y_batch,
            cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
        }
        _, step, summaries, loss, accuracy = sess.run(
            [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
            feed_dict)
        time_str = datetime.datetime.now().isoformat()
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
        train_summary_writer.add_summary(summaries, step)
    
    def dev_step(x_batch, y_batch, writer=None):
        """
        Evaluates model on a dev set
        """
        feed_dict = {
            cnn.input_x: x_batch,
            cnn.input_y: y_batch,
            cnn.dropout_keep_prob: 1.0
        }
        step, summaries, loss, accuracy = sess.run(
            [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
            feed_dict)
        time_str = datetime.datetime.now().isoformat()
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
        if writer:
            writer.add_summary(summaries, step)
    

    1.14 Training Loop

    Finally, we write the training loop.

    # Generate batches
    batches = data_helpers.batch_iter(
        list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
    # Training loop. For each batch...
    for batch in batches:
        x_batch, y_batch = zip(*batch)
        train_step(x_batch, y_batch)
        current_step = tf.train.global_step(sess, global_step)
        if current_step % FLAGS.evaluate_every == 0:
            print("\nEvaluation:")
            dev_step(x_dev, y_dev, writer=dev_summary_writer)
            print("")
        if current_step % FLAGS.checkpoint_every == 0:
            path = saver.save(sess, checkpoint_prefix, global_step=current_step)
            print("Saved model checkpoint to {}\n".format(path))
    

    2. Possible Model Improvements

    • Initialize the embedding layer with word2vec vectors; to see a real gain you will likely need embeddings of 300 dimensions or more.
    • Constrain the L2 norm of the last layer's weight vectors, as in the original paper. This can be done by defining a new operation that updates the weights after each training step (see the sketch after this list).
    • Add L2 regularization to the network to prevent overfitting, and also try a higher dropout rate. (The code on GitHub already includes L2 regularization, but it is disabled by default.)
    • Add histogram summaries for weight updates and layer activations, and visualize them in TensorBoard.
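
    A minimal sketch of the weight-norm constraint mentioned above (illustrative only; W stands for the output layer's weight variable, and max_l2_norm is a hypothetical hyperparameter, e.g. 3.0 as in Kim's paper):

    # Clip each class's weight vector (a column of W) to a maximum L2 norm
    max_l2_norm = 3.0
    weight_clip_op = tf.assign(W, tf.clip_by_norm(W, max_l2_norm, axes=[0]))
    # after each training step:
    #     sess.run(weight_clip_op)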

    More content to follow.


    References

    Yoon Kim. Convolutional Neural Networks for Sentence Classification. EMNLP 2014.
    Denny Britz. Implementing a CNN for Text Classification in TensorFlow (cnn-text-classification-tf).

  • Update 2018-05-05: training with word2vec as input

    Using Word2Vec

    The word2vec training code was actually finished long ago; I just never got around to writing it up. For now I am posting the modified code here and will describe the changes when I have time.

    • text_cnn.py
    import tensorflow as tf
    import numpy as np
    class TextCNN(object):
    
       def __init__(
         self, sequence_length, num_classes, vocab_size,
         embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
    
           # Placeholders for input, output and dropout
           self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
           self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
           self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
           self.learning_rate = tf.placeholder(tf.float32, name='learning_rate')
    
           # Keeping track of l2 regularization loss (optional)
           l2_loss = tf.constant(0.0)
    
           # Embedding layer
           with tf.device('/cpu:0'), tf.name_scope("embedding"):
               self.W = tf.Variable(
                   tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                   name="W")
               self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
               self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
    
           # Create a convolution + maxpool layer for each filter size
           pooled_outputs = []
           for i, filter_size in enumerate(filter_sizes):
               with tf.name_scope("conv-maxpool-%s" % filter_size):
                   # Convolution Layer
                   filter_shape = [filter_size, embedding_size, 1, num_filters]
                   W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                   b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                   conv = tf.nn.conv2d(
                       self.embedded_chars_expanded,
                       W,
                       strides=[1, 1, 1, 1],
                       padding="VALID",
                       name="conv")
                   # Apply nonlinearity
                   h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                   # Maxpooling over the outputs
                   pooled = tf.nn.max_pool(
                       h,
                       ksize=[1, sequence_length - filter_size + 1, 1, 1],
                       strides=[1, 1, 1, 1],
                       padding='VALID',
                       name="pool")
                   pooled_outputs.append(pooled)
    
           # Combine all the pooled features
           num_filters_total = num_filters * len(filter_sizes)
           self.h_pool = tf.concat(pooled_outputs, 3)
           self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
    
           # Add dropout
           with tf.name_scope("dropout"):
               self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
    
           # Final (unnormalized) scores and predictions
           with tf.name_scope("output"):
               W = tf.get_variable(
                   "W",
                   shape=[num_filters_total, num_classes],
                   initializer=tf.contrib.layers.xavier_initializer())
               b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
               l2_loss += tf.nn.l2_loss(W)
               l2_loss += tf.nn.l2_loss(b)
               self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
               self.predictions = tf.argmax(self.scores, 1, name="predictions")
    
           # Calculate mean cross-entropy loss
           with tf.name_scope("loss"):
               losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
               self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
    
           # Accuracy
           with tf.name_scope("accuracy"):
               correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
               self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
    
    • data_helpers.py
    import numpy as np
    import re
    import jieba
    import itertools
    from collections import Counter
    from tensorflow.contrib import learn
    from gensim.models import Word2Vec
    
    
    def clean_str(string):
        """
        Tokenization/string cleaning for all datasets except for SST.
        Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
        """
        string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
        string = re.sub(r"\'s", " \'s", string)
        string = re.sub(r"\'ve", " \'ve", string)
        string = re.sub(r"n\'t", " n\'t", string)
        string = re.sub(r"\'re", " \'re", string)
        string = re.sub(r"\'d", " \'d", string)
        string = re.sub(r"\'ll", " \'ll", string)
        string = re.sub(r",", " , ", string)
        string = re.sub(r"!", " ! ", string)
        string = re.sub(r"\(", " \( ", string)
        string = re.sub(r"\)", " \) ", string)
        string = re.sub(r"\?", " \? ", string)
        string = re.sub(r"\s{2,}", " ", string)
        return string.strip().lower()
    
    
    def load_data_and_labels(train_file_path, test_file_path):
        """
        Load the training and test data.
        :param train_file_path:
        :param test_file_path:
        :return: tokenized training sentences, training labels, tokenized test sentences, test labels
        """
    
        # Load data from files
        x_train = list()       # training sentences
        y_train_data = list()  # training class labels (raw)
        x_test = list()        # test sentences
        y_test_data = list()   # test class labels (raw)
        y_labels = list()      # set of class labels

        # Read the training data
        with open(train_file_path, 'r', encoding='utf-8') as train_file:
            for line in train_file.read().split('\n'):
                sp = line.split('||')
                if len(sp) != 2:
                    continue
                x_train.append(' '.join(jieba.cut(sp[0])))
                y_train_data.append(sp[1])
    
        # Read the test data
        with open(test_file_path, 'r', encoding='utf-8') as test_file:
            for line in test_file.read().split('\n'):
                sp = line.split('||')
                if len(sp) != 2:
                    continue
                x_test.append(' '.join(jieba.cut(sp[0])))
                y_test_data.append(sp[1])
    
        # Build the list of class labels
        for item in y_train_data:
            if item not in y_labels:
                y_labels.append(item)
    
        labels_len = len(y_labels)
        print('Number of classes: ', labels_len)
    
        # Build the one-hot training labels y_train
        y_train = np.zeros((len(y_train_data), labels_len), dtype=np.int)
        for index in range(len(y_train_data)):
            y_train[index][y_labels.index(y_train_data[index])] = 1
    
        # Build the one-hot test labels y_test
        y_test = np.zeros((len(y_test_data), labels_len), dtype=np.int)
        for index in range(len(y_test_data)):
            y_test[index][y_labels.index(y_test_data[index])] = 1
    
        return [x_train, y_train, x_test, y_test, y_labels]
    
    
    def load_train_dev_data(train_file_path, test_file_path):
        x_train_text, y_train, x_test_text, y_test, _ = load_data_and_labels(train_file_path, test_file_path)
        # Load data
        print("Loading data...")
    
        # Build vocabulary
        max_train_document_length = max([len(x.split(" ")) for x in x_train_text])
        max_test_document_length = max([len(x.split(" ")) for x in x_test_text])
        max_document_length = max_test_document_length \
            if max_test_document_length > max_train_document_length \
            else max_train_document_length
    
        # Map the input text to word-id sequences with VocabularyProcessor
        vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
        x_train = np.array(list(vocab_processor.fit_transform(x_train_text)))
        x_test = np.array(list(vocab_processor.fit_transform(x_test_text)))
    
        # Randomly shuffle the training data
        np.random.seed(10)
        shuffle_indices = np.random.permutation(np.arange(len(y_train)))
        x_train = x_train[shuffle_indices]
        y_train = y_train[shuffle_indices]
    
        print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
        print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_test)))
        return x_train, y_train, x_test, y_test, vocab_processor
    
    
    def load_embedding_vectors_word2vec(vocabulary, filename, binary):
        word2vec_model = Word2Vec.load(filename)
        embedding_vectors = np.random.uniform(-0.25, 0.25, (len(vocabulary), 200))
    
        for word in word2vec_model.wv.vocab:
            idx = vocabulary.get(word)
            if idx != 0:
                embedding_vectors[idx] = word2vec_model[word]
    
        return embedding_vectors
    
    
    def batch_iter(data, batch_size, num_epochs, shuffle=True):
        """
        Generates a batch iterator for a dataset.
        """
        data = np.array(data)
        data_size = len(data)
        num_batches_per_epoch = int((len(data)-1)/batch_size) + 1
        for epoch in range(num_epochs):
            # Shuffle the data at each epoch
            if shuffle:
                shuffle_indices = np.random.permutation(np.arange(data_size))
                shuffled_data = data[shuffle_indices]
            else:
                shuffled_data = data
            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                yield shuffled_data[start_index:end_index]
    
    
    • train.py
    #! /usr/bin/env python
    
    import tensorflow as tf
    import numpy as np
    import os
    import time
    import datetime
    import data_helpers
    from text_cnn import TextCNN
    from tensorflow.contrib import learn
    import yaml
    
    # Parameters
    # ==================================================
    
    # Data loading params
    tf.app.flags.DEFINE_float("dev_sample_percentage", .1, "Percentage of the training data to use for validation")
    tf.app.flags.DEFINE_string("train_file", "../data/train_data.txt", "Train file source.")
    tf.app.flags.DEFINE_string("test_file", "../data/test_data.txt", "Test file source.")
    
    
    # Model Hyperparameters
    tf.app.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")
    tf.app.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
    tf.app.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")
    tf.app.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
    tf.app.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")
    
    # Training parameters
    tf.app.flags.DEFINE_integer("batch_size", 128, "Batch Size (default: 64)")
    tf.app.flags.DEFINE_integer("num_epochs", 100, "Number of training epochs (default: 200)")
    tf.app.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
    tf.app.flags.DEFINE_integer("checkpoint_every", 1000, "Save model after this many steps (default: 100)")
    tf.app.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")
    # Misc Parameters
    tf.app.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
    tf.app.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
    
    FLAGS = tf.app.flags.FLAGS
    print("\nParameters:")
    for attr, value in sorted(FLAGS.__flags.items()):
        print("{}={}".format(attr.upper(), value))
    print("")
    
    
    # Data Preparation
    # ==================================================
    
    # Load data
    with open("config.yml", 'r') as ymlfile:
        cfg = yaml.load(ymlfile)
    
    print("Loading data...")
    x_train, y_train, x_test, y_test, vocab_processor = data_helpers.load_train_dev_data(FLAGS.train_file, FLAGS.test_file)
    
    embedding_name = cfg['word_embeddings']['default']
    embedding_dimension = cfg['word_embeddings'][embedding_name]['dimension']
    
    
    # Training
    # ==================================================
    
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
          allow_soft_placement=FLAGS.allow_soft_placement,
          log_device_placement=FLAGS.log_device_placement)
        sess = tf.Session(config=session_conf)
    
        with sess.as_default():
            cnn = TextCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                embedding_size=embedding_dimension,
                filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
                num_filters=FLAGS.num_filters,
                l2_reg_lambda=FLAGS.l2_reg_lambda)
    
            cnn.learning_rate = 0.01
            # Define Training procedure
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(cnn.learning_rate)
            grads_and_vars = optimizer.compute_gradients(cnn.loss)
            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
    
            # Keep track of gradient values and sparsity (optional)
            grad_summaries = []
            for g, v in grads_and_vars:
                if g is not None:
                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                    grad_summaries.append(grad_hist_summary)
                    grad_summaries.append(sparsity_summary)
            grad_summaries_merged = tf.summary.merge(grad_summaries)
    
            # Output directory for models and summaries
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))
    
            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", cnn.loss)
            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
    
            # Train Summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
    
            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
    
            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
    
            # Write vocabulary
            vocab_processor.save(os.path.join(out_dir, "vocab"))
    
            # Initialize all variables
            sess.run(tf.global_variables_initializer())
    
            vocabulary = vocab_processor.vocabulary_
            initW = data_helpers.load_embedding_vectors_word2vec(vocabulary,
                                                                 cfg['word_embeddings']['word2vec']['path'],
                                                                 cfg['word_embeddings']['word2vec']['binary'])
            print(initW.shape)
            sess.run(cnn.W.assign(initW))
    
    
            def train_step(x_batch, y_batch):
                """
                A single training step
                """
                feed_dict = {
                  cnn.input_x: x_batch,
                  cnn.input_y: y_batch,
                  cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
                }
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                train_summary_writer.add_summary(summaries, step)
    
            def dev_step(x_batch, y_batch, writer=None):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                  cnn.input_x: x_batch,
                  cnn.input_y: y_batch,
                  cnn.dropout_keep_prob: 1.0
                }
                step, summaries, loss, accuracy = sess.run(
                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                if writer:
                    writer.add_summary(summaries, step)
                if step % FLAGS.batch_size == 0:
                    print('epoch ', step // FLAGS.batch_size)
    
            # Generate batches
            batches = data_helpers.batch_iter(
                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
            # Training loop. For each batch...
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                train_step(x_batch, y_batch)
                current_step = tf.train.global_step(sess, global_step)
                if current_step % FLAGS.evaluate_every == 0:
                    print("\nEvaluation:")
                    dev_step(x_test, y_test, writer=dev_summary_writer)
                    print("")
                if current_step % FLAGS.checkpoint_every == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print("Saved model checkpoint to {}\n".format(path))
    


  • Reserving this spot for future updates.


  • Core member

    Big support for the OP!



  • 666666666666666, coming back to study this later.



  • It was only after taking CS224n that I truly understood TextCNN. It really does pay to build up the fundamentals step by step; looking at it again now, I can better understand why the author did things this way and why it works better.

    CS224n notes: the TextCNN part



  • Hi, may I ask whether the data you used with word2vec is also the dataset from GitHub, or your own?



  • It can be a model you trained yourself or a pre-trained one from someone else, depending on your situation.



  • @hunto So did you use that English dataset from GitHub?



  • No...



  • @hunto Your word2vec data_helper code isn't complete; could you send me the full version?



  • 582, back to have a look around.


 
