Embedding Processor

Classes:

Character()

Generates random character embeddings matrix with values in the range [-1, 1].

Dependency()

Generates Dependency-based embeddings matrix.

EmbeddingProcessor()

EmbeddingProcessor abstract class.

FastText()

Generates FastText embedding matrix.

Glove()

Generates GloVe embedding matrix.

Numberbatch()

Generates ConceptNet Numberbatch embeddings matrix.

Random()

Generates random embeddings matrix with values in the range [-1, 1].

Word2Vec()

Generates Word2Vec embedding matrix.

Functions:

get_embedding(embedding_dir, embedding_type)

Utility function for returning an embedding matrix generated by an EmbeddingProcessor.

class embedding_processor.Character

Bases: embedding_processor.EmbeddingProcessor

Generates random character embeddings matrix with values in the range [-1, 1].

Uses the ELMo special character vocabulary. Specifically, char ids 0-255 are the byte values of the UTF-8 encoding. Ids 256 and above are reserved for special tokens:

<bos> (256) – The index of beginning of the sentence character is 256 in ELMo.

<eos> (257) – The index of end of the sentence character is 257 in ELMo.

<bow> (258) – The index of beginning of the word character is 258 in ELMo.

<eow> (259) – The index of end of the word character is 259 in ELMo.

<pad> (260) – The index of padding character is 260 in ELMo. Encoded as 0’s.

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)
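The character vocabulary described for this class can be illustrated with a minimal sketch. The helper names `word_to_char_ids` and `random_char_matrix`, the `max_word_length` parameter, and the zeroed `<pad>` row are assumptions made for illustration; they are not taken from the module itself.

```python
import numpy as np

# Special character ids as listed in the class description (ELMo convention).
BOS_CHAR, EOS_CHAR, BOW_CHAR, EOW_CHAR, PAD_CHAR = 256, 257, 258, 259, 260
NUM_CHARS = 261  # 256 byte values plus 5 special tokens


def word_to_char_ids(word, max_word_length=10):
    """Hypothetical helper: encode one word as a fixed-length list of char ids."""
    # Truncate the bytes so that <bow> and <eow> always fit.
    encoded = word.encode("utf-8", "ignore")[: max_word_length - 2]
    ids = [BOW_CHAR] + list(encoded) + [EOW_CHAR]
    # Fill the remaining positions with the padding id.
    return ids + [PAD_CHAR] * (max_word_length - len(ids))


def random_char_matrix(embedding_dim, seed=0):
    """Hypothetical helper: random matrix with one row per char id."""
    rng = np.random.default_rng(seed)
    # rng.uniform samples from the half-open interval [-1, 1).
    matrix = rng.uniform(-1.0, 1.0, size=(NUM_CHARS, embedding_dim))
    matrix[PAD_CHAR] = 0.0  # assumption: one reading of "encoded as 0's"
    return matrix


char_ids = word_to_char_ids("cat")
char_matrix = random_char_matrix(embedding_dim=16)
word_vectors = char_matrix[char_ids]  # shape (max_word_length, embedding_dim)
```

Indexing the matrix with the id list yields one vector per character position, so a padded word always maps to an array of the same shape.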

class embedding_processor.Dependency

Bases: embedding_processor.EmbeddingProcessor

Generates Dependency-based embeddings matrix. Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 302–308.

Embedding source available from: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

Valid Dependency source files: ‘deps’

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

class embedding_processor.EmbeddingProcessor

Bases: abc.ABC

EmbeddingProcessor abstract class. Contains functions for mapping word tokens to embedding vectors.

Methods:

copy_embedding_to_matrix(embedding_dim, …)

Copies embeddings that have been attached to a GluonNLP Vocabulary into a NumPy matrix.

get_embedding_matrix(embedding_dir, …)

Loads embeddings and maps word tokens to embedding vectors.

static copy_embedding_to_matrix(embedding_dim, vocabulary)

Copies embeddings that have been attached to a GluonNLP Vocabulary into a NumPy matrix. The vectors for the vocabulary <pad> and <unk> tokens are set to zeros. Words that appear in the vocabulary but not in the original embedding are assigned randomly generated vectors.

Parameters
  • embedding_dim (int) – Length of the vectors that tokens are mapped to. Raises an error if it is longer than the vectors in the loaded source file.

  • vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors.

Returns

A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.

Return type

embedding_matrix (numpy array)
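The copy step described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: a plain token list and a dict stand in for the GluonNLP Vocabulary and its attached embeddings, the function name `copy_to_matrix_sketch` is hypothetical, and the [-1, 1) range for missing words and the truncation of longer source vectors are assumed behaviours.

```python
import numpy as np


def copy_to_matrix_sketch(embedding_dim, idx_to_token, pretrained,
                          pad_token="<pad>", unk_token="<unk>", seed=0):
    """Build a (vocabulary_size, embedding_dim) matrix from a token list.

    idx_to_token: list of tokens, position = row index (stand-in for a vocabulary).
    pretrained:   dict of token -> 1-D vector (stand-in for attached embeddings).
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(idx_to_token), embedding_dim), dtype=np.float32)
    for idx, token in enumerate(idx_to_token):
        if token in (pad_token, unk_token):
            continue  # special tokens keep their all-zero rows
        vector = pretrained.get(token)
        if vector is None:
            # Word missing from the source embedding: random vector (range assumed).
            matrix[idx] = rng.uniform(-1.0, 1.0, size=embedding_dim)
        else:
            if embedding_dim > len(vector):
                raise ValueError("embedding_dim exceeds the source vector length")
            # Assumption: longer source vectors are truncated to embedding_dim.
            matrix[idx] = vector[:embedding_dim]
    return matrix


tokens = ["<pad>", "<unk>", "hello", "zyzzyva"]
source = {"hello": np.arange(4, dtype=np.float32)}
matrix = copy_to_matrix_sketch(3, tokens, source)
```

Row order follows the token list, so the row index of a token in the matrix matches its index in the vocabulary.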

abstract get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

Loads embeddings and maps word tokens to embedding vectors.

Parameters
  • embedding_dir (str) – The location to store and load embedding files.

  • embedding_source (str) – Specifies which embedding source file to load.

  • embedding_dim (int) – Length of the vectors that tokens are mapped to. Raises an error if it is longer than the vectors in the loaded source file.

  • vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors.

Returns

A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.

Return type

embedding_matrix (numpy array)

class embedding_processor.FastText

Bases: embedding_processor.EmbeddingProcessor

Generates FastText embedding matrix. Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2017) Bag of Tricks for Efficient Text Classification. In: the Association for Computational Linguistics [online]. 2017 Valencia, Spain: ACL. pp. 427–431.

Valid FastText source files: ‘crawl-300d-2M’, ‘crawl-300d-2M-subword’, ‘wiki.simple’

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

class embedding_processor.Glove

Bases: embedding_processor.EmbeddingProcessor

Generates GloVe embedding matrix. Pennington, J., Socher, R. and Manning, C.D. (2014) GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Valid GloVe source files: ‘glove.42B.300d’, ‘glove.6B.100d’, ‘glove.6B.200d’, ‘glove.6B.300d’, ‘glove.6B.50d’, ‘glove.840B.300d’, ‘glove.twitter.27B.100d’, ‘glove.twitter.27B.200d’, ‘glove.twitter.27B.25d’, ‘glove.twitter.27B.50d’

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

class embedding_processor.Numberbatch

Bases: embedding_processor.EmbeddingProcessor

Generates ConceptNet Numberbatch embeddings matrix. Speer, R., Chin, J., & Havasi, C. (2016). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 4444–4451.

Embedding source available from: https://github.com/commonsense/conceptnet-numberbatch

Valid Numberbatch source files: ‘numberbatch-en’

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

class embedding_processor.Random

Bases: embedding_processor.EmbeddingProcessor

Generates random embeddings matrix with values in the range [-1, 1].

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

class embedding_processor.Word2Vec

Bases: embedding_processor.EmbeddingProcessor

Generates Word2Vec embedding matrix. Mikolov, T., Yih, W.-T. and Zweig, G. (2013) Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT [online]. (June), pp. 746–751.

Valid Word2Vec source files: ‘GoogleNews-vectors-negative300’, ‘freebase-vectors-skipgram1000-en’, ‘freebase-vectors-skipgram1000’

Methods:

get_embedding_matrix(embedding_dir, …)

get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)

embedding_processor.get_embedding(embedding_dir, embedding_type, embedding_source=None, embedding_dim=300, vocabulary=None)

Utility function for returning an embedding matrix generated by an EmbeddingProcessor. Valid embedding types: ‘char’, ‘random’, ‘glove’, ‘word2vec’, ‘fasttext’, ‘numberbatch’, ‘deps’

Parameters
  • embedding_dir (str) – The location to store and load embedding files. If it doesn’t exist it will be created.

  • embedding_type (str) – The name of the EmbeddingProcessor.

  • embedding_source (str) – Specifies which embedding source file to load, or None for char embeddings.

  • embedding_dim (int) – Length of the vectors that tokens are mapped to. Raises an error if it is longer than the vectors in the loaded source file.

  • vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors, or None for char embeddings.

Returns

A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.

Return type

embedding_matrix (numpy array)

embedding_processor.embedding_types

Dictionary mapping embedding_type strings to EmbeddingProcessor class.

Type

dict
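The relationship between get_embedding and the embedding_types dictionary can be sketched as a simple registry lookup. Everything below is a stand-in written for illustration: `ProcessorSketch`, `RandomSketch`, `embedding_types_sketch`, and `get_embedding_sketch` are hypothetical names, the placeholder processor returns zeros rather than real embeddings, and the error raised for an unknown type is an assumption. Creation of the embedding directory is omitted.

```python
from abc import ABC, abstractmethod


class ProcessorSketch(ABC):
    """Stand-in for the abstract EmbeddingProcessor interface."""

    @abstractmethod
    def get_embedding_matrix(self, embedding_dir, embedding_source,
                             embedding_dim, vocabulary):
        ...


class RandomSketch(ProcessorSketch):
    """Placeholder processor used only to demonstrate the dispatch."""

    def get_embedding_matrix(self, embedding_dir, embedding_source,
                             embedding_dim, vocabulary):
        # Rows of zeros instead of real embeddings, one row per token.
        return [[0.0] * embedding_dim for _ in vocabulary]


# Registry mapping type strings to processor classes (one entry in this sketch).
embedding_types_sketch = {"random": RandomSketch}


def get_embedding_sketch(embedding_dir, embedding_type, embedding_source=None,
                         embedding_dim=300, vocabulary=None):
    """Look up the processor class by name, instantiate it, and delegate."""
    if embedding_type not in embedding_types_sketch:
        raise ValueError(f"Unknown embedding type: {embedding_type!r}")
    processor = embedding_types_sketch[embedding_type]()
    return processor.get_embedding_matrix(
        embedding_dir, embedding_source, embedding_dim, vocabulary)


matrix = get_embedding_sketch("embeddings", "random",
                              embedding_dim=4, vocabulary=["a", "b", "c"])
```

Keeping the mapping in a dictionary means a new processor only needs a new entry, and the utility function itself stays unchanged.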