Study Notes TF018: Word Vectors, Training a Word-Vector Model on a Wikipedia Corpus

cralison · 2017-06-03 22:41:10 +08:00
Word-vector embeddings require processing large text corpora efficiently; this is what word2vec does. The simplest approach feeds words into the learning system as one-hot encodings: a vector as long as the vocabulary, with a 1 at the word's position and 0 everywhere else. Such vectors are extremely high-dimensional and cannot capture semantic relations between words. Representing words by their co-occurrences addresses this: traverse a large corpus, count the words that appear within a certain distance of each word, and represent each word by the normalized counts of its neighbours. Words used in similar contexts tend to have similar meanings. Reducing the occurrence vectors with PCA or a similar method yields a denser representation. This works well, but it requires tracking the full co-occurrence matrix, whose width and height both equal the vocabulary size.

In 2013, Mikolov et al. proposed computing word representations from context: "Efficient Estimation of Word Representations in Vector Space" (arXiv preprint arXiv:1301.3781, 2013). In the skip-gram model, we start from random representations and use a simple classifier to predict a context word from the current word; the error is propagated through both the classifier weights and the word representation, and both are adjusted to reduce the prediction error. Trained over a large corpus, the representation vectors come to approximate compressed co-occurrence vectors.
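To make the contrast concrete, here is a minimal NumPy sketch (the toy vocabulary is purely illustrative): with one-hot encoding, any two distinct words are equally dissimilar, which is exactly what dense, trained embeddings are meant to fix.

    import numpy as np

    vocabulary = ['the', 'cat', 'dog', 'sat']                 # toy vocabulary
    indices = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        # a vector of vocabulary length: 1 at the word's index, 0 elsewhere
        vector = np.zeros(len(vocabulary))
        vector[indices[word]] = 1.0
        return vector

    # 'cat' and 'dog' are semantically related, yet their one-hot vectors are
    # orthogonal: the dot product of any two distinct words is always 0.
    print(one_hot('cat') @ one_hot('dog'))                    # 0.0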

The data set is a dump of the English Wikipedia. Dumps containing the complete revision history of every page are available; the current page versions alone come to roughly 100 GB: https://dumps.wikimedia.org/backup-index.html.

Download the dump file and extract the words of each page. Count word occurrences to build a vocabulary of common words, then encode the extracted pages with that vocabulary. The files are read line by line and results are written to disk immediately; checkpoints are saved between the individual steps so that a crash does not force the program to start over.

__iter__ iterates over the pages as lists of word indices. encode returns the vocabulary index of a word string, and decode returns the word string for a vocabulary index. _read_pages extracts the words from the Wikipedia dump (compressed XML) and saves them to a pages file, one page per line with the words separated by spaces. The bz2 module's open function reads and writes the files, so intermediate results are stored compressed. A regular expression captures runs of consecutive letters as well as individual special characters. _build_vocabulary counts the words in the pages file and writes the most frequent ones to a vocabulary file; one-hot encoding requires such a vocabulary, and words are encoded by their vocabulary index. Misspellings and extremely rare words are dropped, so the vocabulary holds only the vocabulary_size - 1 most common words. Every word not in the vocabulary is mapped to the <unk> token, whose vector also stands in for unseen words.
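As a quick illustration (not part of the book's code), the tokenizer regex splits text into letter runs and individual punctuation marks, silently dropping anything else, such as digits:

    import re

    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')
    print(TOKEN_REGEX.findall('Hello, world! (Founded in 2001.)'))
    # ['Hello', ',', 'world', '!', '(', 'Founded', 'in', '.', ')']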

Training examples are formed on the fly and organized into large batches, so the classifier does not need to hold huge amounts of data in memory. The skip-gram model predicts the context words of the current word: we traverse the text and create training examples with the current word as the data and its surrounding words as the targets. With a context size of R, each word would yield 2R examples (the R words to its left and right). Semantically, nearby context matters more than distant context, so we create fewer training examples for distant context words by drawing each word's context size at random from the range [1, D = 10]. Training pairs are formed according to the skip-gram model, and a batching generator turns this stream of numbers into NumPy array batches; a sketch of both generators is given below.
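The skipgrams and batched generators imported in the training script further down are not listed in these notes. The following is only a minimal sketch of what they might look like, written to match the description above (random context radius in [1, max_context], batches as NumPy arrays); the book's actual helpers may differ in detail.

    import random
    import numpy as np

    def skipgrams(pages, max_context):
        """Yield (current word, context word) index pairs from encoded pages."""
        for words in pages:
            for index, current in enumerate(words):
                # draw a per-word context radius so distant words are sampled less often
                context = random.randint(1, max_context)
                for target in words[max(0, index - context):index]:
                    yield current, target
                for target in words[index + 1:index + context + 1]:
                    yield current, target

    def batched(iterator, batch_size):
        """Group the stream of pairs into NumPy arrays of batch_size examples."""
        iterator = iter(iterator)
        while True:
            data = np.zeros(batch_size, dtype=np.int32)
            target = np.zeros(batch_size, dtype=np.int32)
            try:
                for index in range(batch_size):
                    data[index], target[index] = next(iterator)
            except StopIteration:
                return
            yield data, target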

Initially, words are represented by random vectors. A classifier predicts the current representation of context words from this intermediate representation; the error is then propagated back to fine-tune both the classifier weights and the input word representations. The model is optimized with MomentumOptimizer, which is not particularly clever but is efficient.

The classifier is the core of the model. The noise-contrastive estimation (NCE) loss performs very well here. It models a softmax classifier: tf.nn.nce_loss draws new random vectors as negative (contrastive) examples and uses them to approximate the softmax classifier.

When training finishes, the final word vectors are written to a file. On a subset of the Wikipedia corpus, training takes about 5 hours on an ordinary CPU, and the result is the embedding matrix as a NumPy array. The full corpus is available at https://dumps.wikimedia.org/enwiki/20160501/enwiki-20160501-pages-meta-current.xml.bz2. The AttrDict class is equivalent to a Python dict whose keys can also be accessed as attributes.
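The helpers module (download, lazy_property, AttrDict) used by the code below is likewise not reproduced in these notes. A minimal sketch consistent with how the code uses it could look like this (the implementation details are assumptions, not the book's exact code):

    import functools
    import os
    import shutil
    import urllib.request

    class AttrDict(dict):
        # a dict whose keys can also be read and written as attributes
        def __getattr__(self, key):
            if key not in self:
                raise AttributeError(key)
            return self[key]

        def __setattr__(self, key, value):
            self[key] = value

    def lazy_property(function):
        # evaluate the property once on first access and cache the result; this
        # lets EmbeddingModel attach each graph node to the instance exactly once
        attribute = '_lazy_' + function.__name__

        @property
        @functools.wraps(function)
        def wrapper(self):
            if not hasattr(self, attribute):
                setattr(self, attribute, function(self))
            return getattr(self, attribute)
        return wrapper

    def download(url, cache_dir):
        # fetch the file into cache_dir (unless it is already there) and return its path
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, url.split('/')[-1])
        if not os.path.isfile(path):
            with urllib.request.urlopen(url) as response, open(path, 'wb') as local:
                shutil.copyfileobj(response, local)
        return path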

    import bz2
    import collections
    import os
    import re

    from lxml import etree
    from helpers import download


    class Wikipedia:

        TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

        def __init__(self, url, cache_dir, vocabulary_size=10000):
            self._cache_dir = os.path.expanduser(cache_dir)
            self._pages_path = os.path.join(self._cache_dir, 'pages.bz2')
            self._vocabulary_path = os.path.join(self._cache_dir, 'vocabulary.bz2')
            if not os.path.isfile(self._pages_path):
                print('Read pages')
                self._read_pages(url)
            if not os.path.isfile(self._vocabulary_path):
                print('Build vocabulary')
                self._build_vocabulary(vocabulary_size)
            with bz2.open(self._vocabulary_path, 'rt') as vocabulary:
                print('Read vocabulary')
                self._vocabulary = [x.strip() for x in vocabulary]
            self._indices = {x: i for i, x in enumerate(self._vocabulary)}

        def __iter__(self):
            """Iterate over the pages as lists of word indices."""
            with bz2.open(self._pages_path, 'rt') as pages:
                for page in pages:
                    words = page.strip().split()
                    words = [self.encode(x) for x in words]
                    yield words

        @property
        def vocabulary_size(self):
            return len(self._vocabulary)

        def encode(self, word):
            """Return the vocabulary index of a word (0, i.e. <unk>, if unknown)."""
            return self._indices.get(word, 0)

        def decode(self, index):
            """Return the word string for a vocabulary index."""
            return self._vocabulary[index]

        def _read_pages(self, url):
            """Extract words from the compressed XML dump; one page per line."""
            wikipedia_path = download(url, self._cache_dir)
            with bz2.open(wikipedia_path) as wikipedia, \
                    bz2.open(self._pages_path, 'wt') as pages:
                for _, element in etree.iterparse(wikipedia, tag='{*}page'):
                    if element.find('./{*}redirect') is not None:
                        continue
                    page = element.findtext('./{*}revision/{*}text')
                    words = self._tokenize(page)
                    pages.write(' '.join(words) + '\n')
                    element.clear()

        def _build_vocabulary(self, vocabulary_size):
            """Count words in the pages file and keep the most common ones."""
            counter = collections.Counter()
            with bz2.open(self._pages_path, 'rt') as pages:
                for page in pages:
                    words = page.strip().split()
                    counter.update(words)
            # <unk> occupies index 0; keep the vocabulary_size - 1 most common words
            common = ['<unk>'] + [x[0] for x in counter.most_common(vocabulary_size - 1)]
            with bz2.open(self._vocabulary_path, 'wt') as vocabulary:
                for word in common:
                    vocabulary.write(word + '\n')

        @classmethod
        def _tokenize(cls, page):
            words = cls.TOKEN_REGEX.findall(page)
            words = [x.lower() for x in words]
            return words

    import tensorflow as tf
    import numpy as np
    from helpers import lazy_property


    class EmbeddingModel:

        def __init__(self, data, target, params):
            self.data = data
            self.target = target
            self.params = params
            self.embeddings
            self.cost
            self.optimize

        @lazy_property
        def embeddings(self):
            # word representations start as random uniform vectors in [-1, 1]
            initial = tf.random_uniform(
                [self.params.vocabulary_size, self.params.embedding_size],
                -1.0, 1.0)
            return tf.Variable(initial)

        @lazy_property
        def optimize(self):
            optimizer = tf.train.MomentumOptimizer(
                self.params.learning_rate, self.params.momentum)
            return optimizer.minimize(self.cost)

        @lazy_property
        def cost(self):
            # noise-contrastive estimation: approximate the softmax classifier
            # with randomly drawn negative (contrastive) examples
            embedded = tf.nn.embedding_lookup(self.embeddings, self.data)
            weight = tf.Variable(tf.truncated_normal(
                [self.params.vocabulary_size, self.params.embedding_size],
                stddev=1.0 / self.params.embedding_size ** 0.5))
            bias = tf.Variable(tf.zeros([self.params.vocabulary_size]))
            target = tf.expand_dims(self.target, 1)
            # keyword arguments avoid the labels/inputs positional order change
            # between older and newer TensorFlow releases
            return tf.reduce_mean(tf.nn.nce_loss(
                weights=weight, biases=bias,
                labels=target, inputs=embedded,
                num_sampled=self.params.contrastive_examples,
                num_classes=self.params.vocabulary_size))

    import collections
    import tensorflow as tf
    import numpy as np

    from batched import batched
    from EmbeddingModel import EmbeddingModel
    from skipgrams import skipgrams
    from Wikipedia import Wikipedia
    from helpers import AttrDict

    WIKI_DOWNLOAD_DIR = './wikipedia'

    params = AttrDict(
        vocabulary_size=10000,
        max_context=10,
        embedding_size=200,
        contrastive_examples=100,
        learning_rate=0.5,
        momentum=0.5,
        batch_size=1000,
    )

    data = tf.placeholder(tf.int32, [None])
    target = tf.placeholder(tf.int32, [None])
    model = EmbeddingModel(data, target, params)

    corpus = Wikipedia(
        'https://dumps.wikimedia.org/enwiki/20160501/'
        'enwiki-20160501-pages-meta-current1.xml-p000000010p000030303.bz2',
        WIKI_DOWNLOAD_DIR,
        params.vocabulary_size)
    examples = skipgrams(corpus, params.max_context)
    batches = batched(examples, params.batch_size)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    average = collections.deque(maxlen=100)
    for index, batch in enumerate(batches):
        feed_dict = {data: batch[0], target: batch[1]}
        cost, _ = sess.run([model.cost, model.optimize], feed_dict)
        average.append(cost)
        # report the running average cost over the last 100 batches
        print('{}: {:5.1f}'.format(index + 1, sum(average) / len(average)))
        if index > 100000:
            break

    embeddings = sess.run(model.embeddings)
    np.save(WIKI_DOWNLOAD_DIR + '/embeddings.npy', embeddings)
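Once embeddings.npy has been written, the vectors can be inspected without TensorFlow. A small, purely illustrative sketch that looks up nearest neighbours by cosine similarity (it reuses the corpus object from the script above for encoding and decoding; the query word 'king' is just an example):

    import numpy as np

    embeddings = np.load(WIKI_DOWNLOAD_DIR + '/embeddings.npy')   # [vocabulary_size, embedding_size]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def most_similar(index, count=5):
        # cosine similarity of one word vector against the whole vocabulary
        scores = normed @ normed[index]
        return np.argsort(-scores)[1:count + 1]                   # skip the word itself

    for neighbour in most_similar(corpus.encode('king')):
        print(corpus.decode(neighbour))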

Reference:
TensorFlow for Machine Intelligence (《面向机器智能的 TensorFlow 实践》)

Feel free to add me on WeChat: qingxingfengzi
My WeChat official account: qingxingfengzigz
My wife Zhang Xingqing's WeChat official account: qingqingfeifangz