TensorFlow Study Notes (4): A Detailed Walkthrough of the CNN-in-NLP Code

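This article walks through the most basic application of a CNN to NLP, explaining step by step how each function and variable relates to the others.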

1. Conceptual overview of the overall architecture:

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

2. Walkthrough of the concrete implementation:

http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

This article covers:

1. How to build a CNN model for NLP

2. How to train the model

3. How to use pre-trained word embeddings as input

4. An experiment with a two-layer CNN model

1. How to build a CNN model for NLP

First, let's look at the overall structure of the code:

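The first layer embeds words into low-dimensional vectors.
The second layer convolves over the vectors output by the first layer, using several filter sizes.
The third layer applies max-pooling and concatenates all the features into one long feature vector, to which we add dropout regularization.
The last layer uses softmax to classify the result.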
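
To make the four layers concrete, the sketch below traces the tensor shapes through the network in plain Python. The function name `textcnn_shapes` and the hyperparameter values (batch size 64, embedding size 128, filter sizes 3/4/5, 128 filters per size) are assumptions chosen for illustration; only the sentence length of 59 and the two output classes come from this article.

```python
def textcnn_shapes(batch, sequence_length, embedding_size,
                   filter_sizes, num_filters, num_classes):
    """Return the output shape of each stage of the text CNN."""
    shapes = {}
    # Embedding lookup: one vector per word, plus a trailing channel dimension.
    shapes["embedding"] = (batch, sequence_length, embedding_size, 1)
    # A VALID convolution whose filter spans the full embedding width
    # leaves sequence_length - filter_size + 1 positions per filter.
    shapes["conv"] = {
        fs: (batch, sequence_length - fs + 1, 1, num_filters)
        for fs in filter_sizes
    }
    # Max-pooling over all positions keeps one value per filter.
    shapes["pooled"] = {fs: (batch, 1, 1, num_filters) for fs in filter_sizes}
    # Concatenating every filter size gives one long feature vector.
    shapes["features"] = (batch, num_filters * len(filter_sizes))
    # The softmax layer produces one score per class.
    shapes["scores"] = (batch, num_classes)
    return shapes


shapes = textcnn_shapes(batch=64, sequence_length=59, embedding_size=128,
                        filter_sizes=[3, 4, 5], num_filters=128, num_classes=2)
print(shapes["conv"][3])     # (64, 57, 1, 128)
print(shapes["features"])    # (64, 384)
```

The batch size is arbitrary here; in the real graph it is left open and fixed only when data is fed in.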

1. First, declare a TextCNN class

import tensorflow as tf  # the snippets in this article use the TensorFlow 1.x graph API


class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    The model is wrapped in the TextCNN class so that it can accept the various
    hyperparameters; the model graph is built in __init__.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        '''
        sequence_length – The length of our sentences. All sentences are padded to the same length (59).
        num_classes – Number of classes in the output layer, two in our case (positive and negative).
        vocab_size – The size of our vocabulary. This is needed to define the size of our embedding layer,
        which will have shape [vocabulary_size, embedding_size].
        embedding_size – The dimensionality of our embeddings.
        filter_sizes – A list of filter sizes, i.e. the number of words we want our convolutional filters to cover.
        num_filters – The number of filters per filter size.
        '''
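
Because `sequence_length` is fixed when the graph is built, every sentence has to be padded to that length before it is fed in. Below is a minimal sketch of such padding in plain Python. The helper name `pad_sentences` and the `<PAD>` token are assumptions made for this illustration, not part of the code above.

```python
def pad_sentences(sentences, padding_word="<PAD>"):
    """Pad every tokenized sentence to the length of the longest one."""
    sequence_length = max(len(s) for s in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]


sentences = [["a", "great", "movie"], ["boring"]]
padded = pad_sentences(sentences)
print(padded[1])  # ['boring', '<PAD>', '<PAD>']
```

After padding, the length of any row is the value that would be passed as `sequence_length`.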

2. Declare placeholders for the input data, the labels, and dropout

# Placeholders for input, output and dropout (three placeholders in total)
# input_x: word indices of shape [batch, sequence_length]; sequence_length is the
# padded sentence length, e.g. 59 (the maximum number of words in the dataset)
self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
# input_y: labels for the output layer, of shape [batch, num_classes]; num_classes is 2 here
self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
# dropout_keep_prob: probability of keeping a neuron in the dropout layer
self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
# tf.placeholder creates a placeholder variable that we feed to the network
# when we execute it at train or test time.

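(1) The role of placeholder was covered in an earlier post, so I won't repeat it here.

(2) The first argument is the data type and the second is the shape of the input tensor. As you can see, input_x is int32 while input_y is float32.

(3) None means that this dimension can take any size. Here it is the batch-size dimension, so we can choose the batch size at training time.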
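
To illustrate what is fed into these placeholders: `input_x` receives integer word indices, `input_y` receives one-hot label vectors, and the batch size is simply the number of rows passed in. The sketch below builds such a batch in plain Python. The helper name `make_batch`, the tiny vocabulary, and the sample sentences are all made up for this example.

```python
def make_batch(padded_sentences, labels, vocab, num_classes):
    """Turn padded sentences and integer labels into (x, y) batch lists."""
    # x: each word replaced by its integer index (matches the int32 input_x).
    x = [[vocab[word] for word in sentence] for sentence in padded_sentences]
    # y: one-hot float vectors (matches the float32 input_y).
    y = [[1.0 if c == label else 0.0 for c in range(num_classes)]
         for label in labels]
    return x, y


vocab = {"<PAD>": 0, "a": 1, "great": 2, "movie": 3, "boring": 4}
x, y = make_batch(
    [["a", "great", "movie"], ["boring", "<PAD>", "<PAD>"]],
    labels=[1, 0], vocab=vocab, num_classes=2)
print(x)  # [[1, 2, 3], [4, 0, 0]]
print(y)  # [[0.0, 1.0], [1.0, 0.0]]
```

With the placeholders above, `x` and `y` would be the values supplied for `input_x` and `input_y` in the feed dictionary, together with a keep probability for `dropout_keep_prob`.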

3. Build the first layer: the embedding layer

Here we convert words into low-dimensional vectors. (Later we will show how to use pre-trained vectors as our input vectors.)

# Embedding layer
# The first layer is the embedding layer, which maps vocabulary word indices
# into low-dimensional vector representations.
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    # tf.device("/cpu:0") forces an operation to be executed on the CPU.
    # tf.name_scope creates a new name scope with the name "embedding".
    # W is our embedding matrix that we learn during training.
    # We initialize it using a random uniform distribution.
    self.W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
    # generates a tensor of random values drawn uniformly from [minval, maxval).
    # tf.nn.embedding_lookup performs the actual embedding operation; its result
    # is a 3-D tensor of shape [None, sequence_length, embedding_size].
    self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
    # conv2d expects a 4-D tensor, so we add one more dimension at the end for
    # in_channels. Our embedding has no channels, so its size is simply 1.
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

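Let's first look at how Python's with statement is used.

Many tasks need some setup beforehand and some cleanup afterwards. Python's with statement handles this pattern well.

The with statement solves two problems (it makes the code more concise and makes cleanup simpler when an exception occurs):

(1) Cleanup can no longer be forgotten, because it happens automatically when the block ends.

(2) If an exception occurs (for example while working with a file), the cleanup still runs and the exception is then raised as usual.

A with ... as ... block behaves roughly like try ... except ... finally ...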

How to use it:
Write the call you need directly after with. If you need several, separate them with ",".

with tf.device('/cpu:0'), tf.name_scope("embedding"):
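
The behaviour described above can be checked with a small standard-library example. The `scope` context manager and the `events` list are hypothetical stand-ins for `tf.device` and `tf.name_scope`. They only record when each scope is entered and left, including the case where the body raises.

```python
from contextlib import contextmanager

events = []


@contextmanager
def scope(name):
    """Record when a named scope is entered and left."""
    events.append("enter " + name)
    try:
        yield name
    finally:
        # Runs even if the body of the with block raises.
        events.append("exit " + name)


try:
    with scope("device"), scope("embedding"):
        raise ValueError("something went wrong")
except ValueError:
    events.append("error propagated")

print(events)
# ['enter device', 'enter embedding', 'exit embedding', 'exit device', 'error propagated']
```

Scopes separated by commas are entered left to right and left in reverse order, exactly as if the with statements were nested.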

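Declaring the W Variable
W is declared here as the embedding matrix that will be learned during training. It is initialized from a random uniform distribution.
embedded_chars is the actual embedding operation. Its result is a 3-D tensor.
embedded_chars_expanded adds a fourth dimension to the 3-D tensor (because the conv2d function used later requires 4-D input).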

self.W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),name="W")
                
self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
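
What the lookup and the extra dimension do can be mimicked with nested Python lists. This is only a sketch of the idea, not how TensorFlow implements it. The helper names `embedding_lookup` and `expand_last_dim` and the tiny sizes are assumptions for illustration.

```python
import random


def embedding_lookup(W, input_x):
    """Replace each word index with its row of W: [batch, seq_len, emb]."""
    return [[W[idx] for idx in sentence] for sentence in input_x]


def expand_last_dim(embedded):
    """Wrap every scalar in a list: [batch, seq_len, emb, 1]."""
    return [[[[value] for value in vector] for vector in sentence]
            for sentence in embedded]


random.seed(0)
vocab_size, embedding_size = 5, 3
# Uniform initialisation between -1 and 1, mirroring the TensorFlow code.
W = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
     for _ in range(vocab_size)]

input_x = [[1, 2, 3], [4, 0, 0]]          # batch of 2 sentences, 3 words each
embedded = embedding_lookup(W, input_x)
expanded = expand_last_dim(embedded)

print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 3
print(len(expanded[0][0][0]))                                # 1
```

The lookup is just row selection from W, which is why W has one row per vocabulary entry.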

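4. Build the second layer: the convolution and max-pooling layers

Now let's set up our second layer, the convolution layer. We use filters of several different sizes here, and each filter size produces a tensor of a different shape when it convolves over the whole input.
We therefore create a layer for each filter size and then use max-pooling to merge the results into one big feature vector.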

pooled_outputs = []  # collects the max-pooled output of each filter size
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # tf.nn.conv2d computes a 2-D convolution given 4-D input and filter tensors:
        # conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)
        #   input:   the data to convolve, a tensor of shape
        #            [batch, in_height, in_width, in_channels]
        #            (batch size, image height, image width, number of input channels).
        #   filter:  the convolution kernel, of shape
        #            [filter_height, filter_width, in_channels, out_channels]
        #            (kernel height, kernel width, input channels, output channels).
        #   strides: a list of length 4 giving how far the window slides along each
        #            input dimension [batch, height, width, channels].
        #   padding: "SAME" or "VALID". "SAME" zero-pads the input so that border
        #            positions the filter only partly covers are kept; "VALID" drops them.
        #   use_cudnn_on_gpu: whether to use cuDNN acceleration. Defaults to True.

        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Max-pooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
        # tf.nn.max_pool performs max pooling (avg_pool performs average pooling):
        # max_pool(value, ksize, strides, padding, data_format="NHWC", name=None)
        #   value:   a 4-D tensor of shape [batch, height, width, channels],
        #            the same layout as the input of conv2d.
        #   ksize:   a list of length 4 giving the size of the pooling window
        #            [batch, height, width, channels].
        #   strides: how far the pooling window slides, as in conv2d
        #            [batch, height, width, channels].
        #   padding: used the same way as in conv2d.
        # Here value is h, the output of our convolution layer.
        pooled_outputs.append(pooled)  # append this filter size's result to pooled_outputs

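Here W is the filter matrix (not the same as the W in the embedding layer: that one is the embedding matrix, this one holds the filter weights). h is the output for the filter_size of the current loop iteration, after the ReLU nonlinearity has been applied.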
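
For a single sentence and a single filter, the convolution, ReLU and max-pooling steps reduce to a few loops. The sketch below is a plain-Python illustration under that simplification. The function name `conv_relu_maxpool` and the toy numbers are assumptions, and it handles one filter only rather than the batched `num_filters` case that TensorFlow computes.

```python
def conv_relu_maxpool(sentence, filt, bias):
    """Narrow (VALID) convolution over a sentence, then ReLU and max-pooling.

    sentence: list of word vectors, shape [sequence_length][embedding_size]
    filt:     one filter, shape [filter_size][embedding_size]
    """
    filter_size = len(filt)
    activations = []
    # Slide the filter over every window of filter_size consecutive words.
    for start in range(len(sentence) - filter_size + 1):
        total = bias
        for i in range(filter_size):
            for j, weight in enumerate(filt[i]):
                total += weight * sentence[start + i][j]
        activations.append(max(0.0, total))  # ReLU
    # Max-pooling over all positions keeps a single value for this filter.
    return activations, max(activations)


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # 4 words, 2 dims
filt = [[1.0, 0.0], [0.0, 1.0]]                                # filter_size = 2
activations, pooled = conv_relu_maxpool(sentence, filt, bias=0.5)
print(activations)  # [2.5, 1.5, 1.5]
print(pooled)       # 2.5
```

The number of activations is sequence_length - filter_size + 1 (here 4 - 2 + 1 = 3), which is exactly the pooling window height used in ksize, so pooling leaves one value per filter.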

The max-pooling results for all filter sizes are collected in the pooled_outputs list.
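
Joining these per-filter-size results into the long feature vector mentioned in the overview amounts to a concatenation. The sketch below shows the idea for one sentence in plain Python. The helper name `concat_pooled` and the pooled values are made up for illustration. In the TensorFlow graph the same effect would come from concatenating the pooled tensors along the channel axis and flattening the result.

```python
def concat_pooled(pooled_outputs):
    """Join the per-filter-size pooled values into one flat feature vector."""
    features = []
    for pooled in pooled_outputs:   # one entry per filter size
        features.extend(pooled)     # each entry holds num_filters values
    return features


# Made-up pooled values for filter sizes 3, 4 and 5 with num_filters = 2.
pooled_outputs = [[0.9, 0.1], [0.4, 0.7], [0.0, 0.3]]
features = concat_pooled(pooled_outputs)
print(features)       # [0.9, 0.1, 0.4, 0.7, 0.0, 0.3]
print(len(features))  # 6 = num_filters * len(filter_sizes)
```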