
src-d/wmd-relax: Calculates Word Mover's Distance
June 6, 2017 – 09:44 am

The high-level logic is written in Python; the low-level functions related to linear programming are offloaded to the bundled native extension. The native extension can also be built as a generic shared library that does not depend on Python at all. Python 2.7 and older are not supported. The heavy lifting is done by google/or-tools.
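For orientation, the linear program behind Word Mover's Distance, as it is usually formulated in the WMD literature, can be written as below. The symbols are introduced here for illustration and are not taken from the package: $d_i$ and $d'_j$ are the normalized weights of the items in two samples, $c(i, j)$ is the distance between the embeddings of items $i$ and $j$, and $T_{ij}$ is the amount of weight moved from item $i$ to item $j$. How the package normalizes weights or relaxes the problem internally is not stated in this text.

```latex
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j
```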


pip3 install wmd

Tested on Linux and macOS.


You should have the embeddings numpy array and the nbow model - that is, every sample is a weighted set of items, and every item is embedded.

import numpy
from wmd import WMD

embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
nbow = {"first": ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))

[('second', 0.9123001)]

embeddings must support __getitem__, which returns an item by its identifier; in particular, numpy.ndarray matches that interface. nbow must be iterable, yielding sample identifiers, and must support __getitem__ by those identifiers, returning tuples of length 3. The first element is the human-readable name of the sample, the second is an iterable of item identifiers, and the third is a numpy.ndarray with the corresponding weights. All numpy arrays must be float32. The return value is a list of tuples with sample identifiers and relevancy indices (the lower, the better).
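The nbow layout described above can be sketched with a small helper that converts tokenized documents into the expected mapping. This is only an illustration: `build_nbow`, the `vocabulary` dictionary, the toy embeddings and the choice of raw token counts as weights are all assumptions made for this example and are not part of the wmd package.

```python
from collections import Counter

import numpy


def build_nbow(docs, vocabulary):
    """Hypothetical helper (not part of wmd): build the nbow mapping.

    docs maps a sample identifier to (human-readable name, list of tokens);
    vocabulary maps a token to its row index in the embeddings array.
    """
    nbow = {}
    for key, (name, tokens) in docs.items():
        # Count only the tokens that have an embedding row.
        counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
        item_ids = sorted(counts)
        # Assumption: raw counts serve as weights; they must be float32.
        weights = numpy.array([counts[i] for i in item_ids],
                              dtype=numpy.float32)
        nbow[key] = (name, item_ids, weights)
    return nbow


# Toy data, invented for the example.
vocabulary = {"president": 0, "press": 1, "media": 2}
embeddings = numpy.array([[0.1, 1.0], [1.0, 0.1], [0.9, 0.2]],
                         dtype=numpy.float32)
docs = {
    "first": ("#1", ["president", "press", "press"]),
    "second": ("#2", ["president", "media"]),
}
nbow = build_nbow(docs, vocabulary)
```

A plain dict already satisfies the interface: iterating over it yields the sample identifiers, and indexing by an identifier returns the 3-tuple. The resulting embeddings and nbow objects have the same shape as those passed to WMD in the example above.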

import spacy
import wmd

nlp = spacy.load('en', create_pipeline=wmd.WMD.create_spacy_pipeline)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

Building from source

Either build it as a Python package:

pip3 install

or use CMake:

git clone --recursive
cmake -D CMAKE_BUILD_TYPE=Release .
make -j

Please note the --recursive flag for git clone. This project uses source{d}'s fork of google/or-tools as a git submodule.


Tests are in test.py and use the stock unittest package.


Source: github.com