Data Processor
Classes:
| DataProcessor | Converts sentences for dialogue act classification into data sets. |
| InputExample | A single training/test example for dialogue act classification. |
Functions:
| batch | Yield successive batch_size chunks from input_arr. |
| batch_and_pad | Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length. |
| from_one_hot | Converts one-hot encoded label list into its string representation. |
| join_punctuation | |
| to_one_hot | Converts label string representation into one-hot encoded list. |
-
class data_processor.DataProcessor(set_name, output_dir, max_seq_length, vocab_size=None, to_tokens=True, to_indices=True, pad_seq=True, to_lower=True, use_punct=False, label_index=2)
Bases: object
Converts sentences for dialogue act classification into data sets.
Methods:
| build_dataset_for_bert(set_type, …[, …]) | Creates a NumPy dataset for BERT from the specified .npz file. |
| build_dataset_from_numpy(set_type, batch_size) | Creates a NumPy dataset from the specified .npz file. |
| build_dataset_from_record(set_type, batch_size) | Creates an iterable dataset from the specified TFRecord file. |
| convert_examples_to_numpy(set_type, …) | Converts InputExamples to features and saves as .npz file. |
| convert_examples_to_record(set_type, …) | Converts InputExamples to features and saves as TFRecord file. |
| get_dataset() | Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file. |
| get_metadata() | Generates a vocabulary and label list from the whole dataset. |
| get_test_examples() | Gets the test set from the GitHub repository. |
| get_train_examples() | Gets the training set from the GitHub repository. |
| get_val_examples() | Gets the validation set from the GitHub repository. |
| load_labels() | Loads labels from the metadata file. |
| load_metadata() | Loads vocabulary and labels from the metadata file. |
| load_vocabulary() | Loads vocabulary from the metadata file. |
-
build_dataset_for_bert(set_type, bert_tokenizer, batch_size, is_training=True)
Creates a NumPy dataset for BERT from the specified .npz file.
- Parameters
set_type (str) – Specifies if this is the training, validation or test data
bert_tokenizer (FullTokeniser) – The BERT tokeniser
batch_size (int) – The number of examples per batch
is_training (bool) – Flag determines if training set is shuffled
- Returns
input_ids (np.array): NumPy array of BERT input ids
input_masks (np.array): NumPy array of BERT input masks
segment_ids (np.array): NumPy array of BERT segment ids
labels (np.array): NumPy array of target labels
-
build_dataset_from_numpy(set_type, batch_size, is_training=True, use_crf=False)
Creates a NumPy dataset from the specified .npz file.
- Parameters
set_type (str) – Specifies if this is the training, validation or test data
batch_size (int) – The number of examples per batch
is_training (bool) – Flag determines if training set is shuffled
use_crf (bool) – Set when a CRF is the final layer, which requires labels of shape [batch_size, num_labels, 1]
- Returns
text (np.array): NumPy array of input text
labels (np.array): NumPy array of target labels
-
build_dataset_from_record(set_type, batch_size, repeat=None, is_training=True, drop_remainder=False)
Creates an iterable dataset from the specified TFRecord file.
- Parameters
set_type (str) – Specifies if this is the training, validation or test data
batch_size (int) – The number of examples per batch
repeat (int) – Number of times the dataset repeats before it is exhausted; if ‘None’, it repeats forever
is_training (bool) – Flag determines if training set is shuffled
drop_remainder (bool) – Flag determines if the last batch is dropped when it has fewer than batch_size examples
- Returns
dataset (TF Dataset): Iterable dataset of two tensors ‘text’ and ‘label’
-
convert_examples_to_numpy(set_type, examples, vocabulary, labels)
Converts InputExamples to features and saves as .npz file.
- If to_tokens is True:
Tokenizes all text and strips whitespace.
Converts to lowercase if to_lower=True.
Removes punctuation if use_punct=False.
Pads sentence with <unk> tokens to max_seq_length if pad_seq=True.
Converts sentence tokens to indices.
- Converts labels to indices.
- Saves as .npz file.
- Parameters
set_type (str) – Specifies if this is the training, validation or test data
examples (list) – List of InputExamples
vocabulary (Gluonnlp Vocab) – Dataset's vocabulary
labels (list) – Dataset's labels list
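The per-sentence steps listed above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name sentence_to_indices_sketch is hypothetical, whitespace tokenisation, truncation to max_seq_length and a plain list standing in for the Gluonnlp Vocab are all assumptions.

```python
import string


def sentence_to_indices_sketch(sentence, vocabulary, max_seq_length,
                               to_lower=True, use_punct=False, pad_seq=True):
    """Hypothetical helper mirroring the documented preprocessing steps."""
    # Assumption: split on whitespace, which also strips surrounding whitespace.
    tokens = sentence.split()
    if to_lower:
        tokens = [token.lower() for token in tokens]
    if not use_punct:
        # Strip punctuation characters and drop tokens that become empty.
        tokens = [token.strip(string.punctuation) for token in tokens]
        tokens = [token for token in tokens if token]
    # Assumption: sentences longer than max_seq_length are truncated.
    tokens = tokens[:max_seq_length]
    if pad_seq:
        # The documentation states that padding uses <unk> tokens.
        tokens = tokens + ["<unk>"] * (max_seq_length - len(tokens))
    # Assumption: a plain list stands in for the vocabulary object.
    token_to_index = {token: index for index, token in enumerate(vocabulary)}
    unknown = token_to_index["<unk>"]
    return [token_to_index.get(token, unknown) for token in tokens]


vocabulary = ["<unk>", "hello", "there"]
indices = sentence_to_indices_sketch("Hello there , friend", vocabulary, 5)
print(indices)  # [1, 2, 0, 0, 0]
```

Here "friend" is out of vocabulary and maps to the <unk> index, as do the two padding positions.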
-
convert_examples_to_record(set_type, examples, vocabulary, labels)
Converts InputExamples to features and saves as TFRecord file.
- If to_tokens is True:
Tokenizes all text and strips whitespace.
Converts to lowercase if to_lower=True.
Removes punctuation if use_punct=False.
Pads sentence with <unk> tokens to max_seq_length if pad_seq=True.
Converts sentence tokens to indices.
- Converts labels to indices.
- Saves as TFRecord file.
- Parameters
set_type (str) – Specifies if this is the training, validation or test data
examples (list) – List of InputExamples
vocabulary (Gluonnlp Vocab) – Dataset's vocabulary
labels (list) – Dataset's labels list
-
get_dataset()
Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file.
-
get_metadata()
Generates a vocabulary and label list from the whole dataset.
Tokenizes all text and strips whitespace. Converts to lowercase if to_lower=True. Removes punctuation if use_punct=False. Keeps only vocab_size words.
Counts labels and creates a list of label strings sorted in descending order of frequency.
Saves the vocabulary and labels to a pickle file.
- Returns
vocabulary (Gluonnlp Vocab): Dataset's vocabulary
labels (list): Dataset's labels
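The counting steps described for get_metadata can be illustrated with collections.Counter. This is a rough sketch under stated assumptions, not the actual implementation: build_metadata_sketch is a hypothetical name, the vocabulary is returned as a plain list rather than a Gluonnlp Vocab, tokenisation is a whitespace split, and keeping the most frequent vocab_size words is an assumption about how the vocabulary is truncated. Saving to a pickle file is omitted.

```python
import string
from collections import Counter


def build_metadata_sketch(sentences, sentence_labels, vocab_size=None,
                          to_lower=True, use_punct=False):
    """Hypothetical helper: build a word list and a frequency-sorted label list."""
    word_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenisation
        if to_lower:
            tokens = [token.lower() for token in tokens]
        if not use_punct:
            tokens = [token.strip(string.punctuation) for token in tokens]
            tokens = [token for token in tokens if token]
        word_counts.update(tokens)
    # Assumption: "keeps only vocab_size words" means the most frequent ones.
    # most_common(None) returns every word.
    vocabulary = [word for word, _ in word_counts.most_common(vocab_size)]
    # Labels sorted in descending order of frequency.
    labels = [label for label, _ in Counter(sentence_labels).most_common()]
    return vocabulary, labels


sentences = ["Hello there .", "hello again !", "Yes ."]
sentence_labels = ["greet", "greet", "answer"]
vocabulary, labels = build_metadata_sketch(sentences, sentence_labels, vocab_size=2)
print(labels)  # ['greet', 'answer']
```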
-
get_test_examples()
Gets the test set from the GitHub repository. Used to make predictions.
- Returns
examples (list): A list of InputExamples for the test set
-
get_train_examples()
Gets the training set from the GitHub repository.
- Returns
examples (list): A list of InputExamples for the training set
-
get_val_examples()
Gets the validation set from the GitHub repository. Used to evaluate training.
- Returns
examples (list): A list of InputExamples for the validation set
-
load_labels()
Loads labels from the metadata file.
- Returns
labels (list): Dataset's labels list
-
load_metadata()
Loads vocabulary and labels from the metadata file.
- Returns
vocabulary (Gluonnlp Vocab): Dataset's vocabulary
labels (list): Dataset's labels list
-
load_vocabulary()
Loads vocabulary from the metadata file.
- Returns
vocabulary (Gluonnlp Vocab): Dataset's vocabulary
-
-
class data_processor.InputExample(example_id, text, label)
Bases: object
A single training/test example for dialogue act classification.
-
data_processor.batch(input_arr, batch_size)
Yield successive batch_size chunks from input_arr.
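A generator of this kind is commonly written as a stride over slice offsets. The sketch below is an illustrative stand-in (batch_sketch is a hypothetical name) and may differ from the library's own code, for example in how it treats a non-positive batch_size.

```python
def batch_sketch(input_arr, batch_size):
    """Yield successive batch_size chunks from input_arr (illustrative stand-in)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Step through the input in strides of batch_size; the last chunk may be shorter.
    for start in range(0, len(input_arr), batch_size):
        yield input_arr[start:start + batch_size]


chunks = list(batch_sketch(["a", "b", "c", "d", "e"], 2))
print(chunks)  # [['a', 'b'], ['c', 'd'], ['e']]
```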
-
data_processor.batch_and_pad(text, labels, batch_size, max_seq_length, min_seq_length=5, pad_value=1)
Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length.
- Parameters
text (list) – List of tokenised sentences to batch.
labels (list) – List of labels to batch.
batch_size (int) – Number of sentences to put in each batch.
max_seq_length (int) – Maximum length of any sequence.
min_seq_length (int) – Minimum length of any sequence.
pad_value (int/str) – Value to pad sequences with.
- Returns
text_batches (list): List of batches (lists) of sentences.
labels_batches (list): List of batches (lists) of labels.
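One way to realise the sort-then-pad behaviour is sketched below. It is not the library's implementation: batch_and_pad_sketch is a hypothetical name, and the choice to pad each batch to its longest sentence, clamped between min_seq_length and max_seq_length (truncating anything longer), is an assumption about how the two length parameters interact.

```python
def batch_and_pad_sketch(text, labels, batch_size, max_seq_length,
                         min_seq_length=5, pad_value=1):
    """Hypothetical helper: length-sorted batches padded to a common length."""
    # Sort sentence/label pairs by sentence length so each batch holds similar lengths.
    pairs = sorted(zip(text, labels), key=lambda pair: len(pair[0]))
    text_batches, labels_batches = [], []
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        longest = max(len(sentence) for sentence, _ in chunk)
        # Assumption: the batch length is clamped to [min_seq_length, max_seq_length].
        target = min(max(longest, min_seq_length), max_seq_length)
        padded = []
        for sentence, _ in chunk:
            clipped = list(sentence[:target])
            padded.append(clipped + [pad_value] * (target - len(clipped)))
        text_batches.append(padded)
        labels_batches.append([label for _, label in chunk])
    return text_batches, labels_batches


text = [[5, 6, 7], [8], [9, 10]]
labels = ["a", "b", "c"]
text_batches, labels_batches = batch_and_pad_sketch(
    text, labels, batch_size=2, max_seq_length=4, min_seq_length=2)
print(text_batches)    # [[[8, 1], [9, 10]], [[5, 6, 7]]]
print(labels_batches)  # [['b', 'c'], ['a']]
```

Sorting first keeps sentences of similar length together, so each batch needs little padding.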
-
data_processor.from_one_hot(one_hot, labels)
Converts one-hot encoded label list into its string representation.
-
data_processor.join_punctuation(tokens, characters='.,;?!')
-
data_processor.to_one_hot(label, labels)
Converts label string representation into one-hot encoded list.
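The two one-hot conversions are inverses of each other, which a short sketch makes concrete. The function names with the _sketch suffix and the example label strings are hypothetical, and the sketch assumes that a label's position in the labels list is its index in the encoding.

```python
def to_one_hot_sketch(label, labels):
    """Hypothetical helper: label string -> one-hot encoded list."""
    one_hot = [0] * len(labels)
    # Assumption: the position of the label in the labels list is its index.
    one_hot[labels.index(label)] = 1
    return one_hot


def from_one_hot_sketch(one_hot, labels):
    """Hypothetical helper: one-hot encoded list -> label string."""
    # Pick the position holding the largest value (the 1 in a one-hot vector).
    index = max(range(len(one_hot)), key=lambda i: one_hot[i])
    return labels[index]


labels = ["statement", "question", "backchannel"]  # hypothetical label strings
encoded = to_one_hot_sketch("question", labels)
decoded = from_one_hot_sketch(encoded, labels)
print(encoded)  # [0, 1, 0]
print(decoded)  # question
```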