Embedding Processor
Classes:
Character – Generates a random character embedding matrix with values in the range [-1, 1].
Dependency – Generates a Dependency-based embedding matrix.
EmbeddingProcessor – EmbeddingProcessor abstract class.
FastText – Generates a FastText embedding matrix.
Glove – Generates a GloVe embedding matrix.
Numberbatch – Generates a ConceptNet Numberbatch embedding matrix.
Random – Generates a random embedding matrix with values in the range [-1, 1].
Word2Vec – Generates a Word2Vec embedding matrix.
Functions:
get_embedding – Utility function for returning an embedding matrix generated by an EmbeddingProcessor.
class embedding_processor.Character
Bases: embedding_processor.EmbeddingProcessor
Generates a random character embedding matrix with values in the range [-1, 1].
Uses the ELMo special character vocabulary. Specifically, character ids 0-255 come from UTF-8 encoded bytes. Ids 256 and above are reserved for special tokens:
<bos> (256) – Beginning-of-sentence character.
<eos> (257) – End-of-sentence character.
<bow> (258) – Beginning-of-word character.
<eow> (259) – End-of-word character.
<pad> (260) – Padding character. Encoded as 0’s.
Methods:
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)
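The special character vocabulary described for the Character class can be illustrated with a small sketch. The helper name word_to_char_ids and the max_word_length parameter are hypothetical and not part of embedding_processor; truncating long words and padding with id 260 are assumptions made for illustration only.

```python
# Special character ids as listed for the ELMo character vocabulary.
BOS_ID, EOS_ID, BOW_ID, EOW_ID, PAD_ID = 256, 257, 258, 259, 260


def word_to_char_ids(word, max_word_length=10):
    """Hypothetical helper: map one word to a fixed-length list of character ids."""
    # Ids 0-255 are the raw UTF-8 bytes of the word.
    byte_ids = list(word.encode("utf-8"))
    # Leave room for the <bow> and <eow> markers; longer words are truncated (assumption).
    byte_ids = byte_ids[: max_word_length - 2]
    ids = [BOW_ID] + byte_ids + [EOW_ID]
    # Fill the remaining positions with the padding id.
    return ids + [PAD_ID] * (max_word_length - len(ids))


print(word_to_char_ids("cat"))
# [258, 99, 97, 116, 259, 260, 260, 260, 260, 260]
```

Each resulting id would then index one row of the character embedding matrix.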
class
embedding_processor.Dependency¶ Bases:
embedding_processor.EmbeddingProcessorGenerates Dependency-based embeddings matrix. Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. Proceedings Of the 52nd Annual Meeting Of the Association for Computational Linguistics, 302–308.
Embedding source available from: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
Valid Dependency source files: ‘deps’
Methods:
get_embedding_matrix(embedding_dir, …)-
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)¶
-
-
class
embedding_processor.EmbeddingProcessor¶ Bases:
abc.ABCEmbeddingProcessor abstract class. Contains function for mapping word tokens to embedding vectors.
Methods:
copy_embedding_to_matrix(embedding_dim, …)Copies embeddings that have been attached to a Gluonnlp Vocabulary into a numpy matrix.
get_embedding_matrix(embedding_dir, …)Loads embeddings and maps word tokens to embedding vectors.
-
static copy_embedding_to_matrix(embedding_dim, vocabulary)
Copies embeddings that have been attached to a GluonNLP Vocabulary into a numpy matrix. The rows for the vocabulary <pad> and <unk> tokens are set to 0’s. Words that appear in the vocabulary but not in the original embedding are assigned randomly generated vectors.
- Parameters
embedding_dim (int) – Length of the vector to map each token to. Raises an error if it is longer than the vectors in the loaded source files.
vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors.
- Returns
A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.
- Return type
embedding_matrix (numpy array)
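The copying behaviour described for copy_embedding_to_matrix can be sketched without GluonNLP. In this minimal illustration a plain list (idx_to_token) and a dict (vectors) stand in for the Vocabulary and its attached embedding; the function name, the [-1, 1] range for missing words, and the truncation of longer source vectors are all assumptions rather than the library's confirmed behaviour.

```python
import numpy as np


def copy_to_matrix_sketch(embedding_dim, idx_to_token, vectors, seed=0):
    """Hypothetical stand-in: build a (vocabulary_size, embedding_dim) matrix."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(idx_to_token), embedding_dim), dtype=np.float32)
    for row, token in enumerate(idx_to_token):
        if token in ("<pad>", "<unk>"):
            continue  # special tokens keep their all-zero rows
        vector = vectors.get(token)
        if vector is None:
            # Token missing from the source embedding: random vector.
            # The [-1, 1] range is an assumption of this sketch.
            matrix[row] = rng.uniform(-1.0, 1.0, embedding_dim)
            continue
        if embedding_dim > len(vector):
            raise ValueError("embedding_dim exceeds the source vector length")
        # Keep the first embedding_dim components (assumption).
        matrix[row] = np.asarray(vector[:embedding_dim], dtype=np.float32)
    return matrix


tokens = ["<pad>", "<unk>", "cat", "dog"]
source = {"cat": [0.1, 0.2, 0.3]}  # "dog" is deliberately missing
matrix = copy_to_matrix_sketch(3, tokens, source)
print(matrix.shape)  # (4, 3)
```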
abstract get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)
Loads embeddings and maps word tokens to embedding vectors.
- Parameters
embedding_dir (str) – The location to store and load embedding files.
embedding_source (str) – Specifies which embedding source file to load.
embedding_dim (int) – Length of the vector to map each token to. Raises an error if it is longer than the vectors in the loaded source files.
vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors.
- Returns
A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.
- Return type
embedding_matrix (numpy array)
class embedding_processor.FastText
Bases: embedding_processor.EmbeddingProcessor
Generates a FastText embedding matrix.
Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2017) Bag of Tricks for Efficient Text Classification. In: the Association for Computational Linguistics [online]. 2017 Valencia, Spain: ACL. pp. 427–431.
Valid FastText source files: ‘crawl-300d-2M’, ‘crawl-300d-2M-subword’, ‘wiki.simple’
Methods:
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)
class
embedding_processor.Glove¶ Bases:
embedding_processor.EmbeddingProcessorGenerates GloVe embedding matrix. Pennington, J., Socher, R. and Manning, C.D. (2014) GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Valid GloVe source files: ‘glove.42B.300d’, ‘glove.6B.100d’, ‘glove.6B.200d’, ‘glove.6B.300d’, ‘glove.6B.50d’, ‘glove.840B.300d’, ‘glove.twitter.27B.100d’, ‘glove.twitter.27B.200d’, ‘glove.twitter.27B.25d’, ‘glove.twitter.27B.50d’
Methods:
get_embedding_matrix(embedding_dir, …)-
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)¶
-
-
class
embedding_processor.Numberbatch¶ Bases:
embedding_processor.EmbeddingProcessorGenerates ConceptNet Numberbatch embeddings matrix. Speer, R., Chin, J., & Havasi, C. (2016). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) ConceptNet, 4444–4451.
Embedding source available from: https://github.com/commonsense/conceptnet-numberbatch
Valid Numberbatch source files: ‘numberbatch-en’
Methods:
get_embedding_matrix(embedding_dir, …)-
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)¶
-
-
class
embedding_processor.Random¶ Bases:
embedding_processor.EmbeddingProcessorGenerates random embeddings matrix with values in in the range [-1, 1].
Methods:
get_embedding_matrix(embedding_dir, …)-
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)¶
-
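A matrix of the kind the Random class describes could be produced as in the sketch below. The function name is hypothetical, and drawing from a uniform distribution is an assumption, since the documentation states only the value range. Note that NumPy's uniform samples from the half-open interval [-1, 1).

```python
import numpy as np


def random_embedding_matrix(vocabulary_size, embedding_dim, seed=None):
    """Hypothetical helper: random matrix of shape (vocabulary_size, embedding_dim)."""
    rng = np.random.default_rng(seed)
    # Uniform sampling is an assumption; values fall in [-1, 1).
    return rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_dim))


random_matrix = random_embedding_matrix(vocabulary_size=5, embedding_dim=300, seed=42)
print(random_matrix.shape)  # (5, 300)
```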
class embedding_processor.Word2Vec
Bases: embedding_processor.EmbeddingProcessor
Generates a Word2Vec embedding matrix.
Mikolov, T., Yih, W.-T. and Zweig, G. (2013) Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT [online]. (June), pp. 746–751.
Valid Word2Vec source files: ‘GoogleNews-vectors-negative300’, ‘freebase-vectors-skipgram1000-en’, ‘freebase-vectors-skipgram1000’
Methods:
get_embedding_matrix(embedding_dir, embedding_source, embedding_dim, vocabulary)
embedding_processor.get_embedding(embedding_dir, embedding_type, embedding_source=None, embedding_dim=300, vocabulary=None)
Utility function for returning an embedding matrix generated by an EmbeddingProcessor.
Valid embedding types: ‘char’, ‘random’, ‘glove’, ‘word2vec’, ‘fasttext’, ‘numberbatch’, ‘deps’
- Parameters
embedding_dir (str) – The location to store and load embedding files. If it doesn’t exist it will be created.
embedding_type (str) – The name of the EmbeddingProcessor.
embedding_source (str) – Specifies which embedding source file to load, or None for char embeddings.
embedding_dim (int) – Length of the vector to map each token to. Raises an error if it is longer than the vectors in the loaded source files.
vocabulary (GluonNLP Vocabulary) – Maps word tokens to attached embedding vectors, or None for char embeddings.
- Returns
A matrix of shape (vocabulary_size, embedding_dim) mapping words to embeddings.
- Return type
embedding_matrix (numpy array)
embedding_processor.embedding_types
Dictionary mapping embedding_type strings to EmbeddingProcessor classes.
- Type
dict
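The relationship between get_embedding and the embedding_types dictionary can be pictured as a lookup followed by a call to get_embedding_matrix. The sketch below uses stand-in classes and names (ProcessorSketch, RandomSketch, embedding_types_sketch, get_embedding_sketch) that are hypothetical; it does not import embedding_processor, does not touch the filesystem, and the error raised for an unknown type is an assumption.

```python
from abc import ABC, abstractmethod

import numpy as np


class ProcessorSketch(ABC):
    """Stand-in for the EmbeddingProcessor abstract class."""

    @abstractmethod
    def get_embedding_matrix(self, embedding_dir, embedding_source, embedding_dim, vocabulary):
        """Return a (vocabulary_size, embedding_dim) matrix."""


class RandomSketch(ProcessorSketch):
    """Stand-in processor that ignores the source files and samples random values."""

    def get_embedding_matrix(self, embedding_dir, embedding_source, embedding_dim, vocabulary):
        rng = np.random.default_rng(0)
        return rng.uniform(-1.0, 1.0, size=(len(vocabulary), embedding_dim))


# Maps embedding_type strings to processor classes, mirroring embedding_types.
embedding_types_sketch = {"random": RandomSketch}


def get_embedding_sketch(embedding_dir, embedding_type, embedding_source=None,
                         embedding_dim=300, vocabulary=None):
    """Look up the processor class by name and delegate to it."""
    try:
        processor_class = embedding_types_sketch[embedding_type]
    except KeyError:
        # Raising ValueError here is an assumption of this sketch.
        raise ValueError(f"Unknown embedding type: {embedding_type!r}") from None
    return processor_class().get_embedding_matrix(
        embedding_dir, embedding_source, embedding_dim, vocabulary
    )


# A plain list stands in for the GluonNLP Vocabulary.
dispatch_matrix = get_embedding_sketch(
    "embeddings/", "random", embedding_dim=50, vocabulary=["<pad>", "<unk>", "cat"]
)
print(dispatch_matrix.shape)  # (3, 50)
```

Keeping the mapping in a dictionary means a new embedding type only needs a new subclass and one extra dictionary entry.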