Data Processor¶

Classes:

`DataProcessor`(set_name, output_dir, …[, …])	Converts sentences for dialogue act classification into data sets.
`InputExample`(example_id, text, label)	A single training/test example for dialogue act classification.

Functions:

`batch`(input_arr, batch_size)	Yield successive batch_size chunks from input_arr.
`batch_and_pad`(text, labels, batch_size, …)	Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length.
`from_one_hot`(one_hot, labels)	Converts one-hot encoded label list into its string representation.
`join_punctuation`(tokens[, characters])
`to_one_hot`(label, labels)	Converts label string representation into one-hot encoded list.

class data_processor.DataProcessor(set_name, output_dir, max_seq_length, vocab_size=None, to_tokens=True, to_indices=True, pad_seq=True, to_lower=True, use_punct=False, label_index=2)¶

Bases: object

Converts sentences for dialogue act classification into data sets.

Methods:

`build_dataset_for_bert`(set_type, …[, …])	Creates a NumPy dataset for BERT from the specified .npz file.
`build_dataset_from_numpy`(set_type, batch_size)	Creates a NumPy dataset from the specified .npz file.
`build_dataset_from_record`(set_type, batch_size)	Creates an iterable dataset from the specified TFRecord file.
`convert_examples_to_numpy`(set_type, …)	Converts InputExamples to features and saves as .npz file.
`convert_examples_to_record`(set_type, …)	Converts InputExamples to features and saves as TFRecord file.
`get_dataset`()	Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file.
`get_metadata`()	Generate a Vocabulary and label list from the whole dataset.
`get_test_examples`()	Gets the Test set from the GitHub repository.
`get_train_examples`()	Gets the Training set from the GitHub repository.
`get_val_examples`()	Gets the Validation set from the GitHub repository.
`load_labels`()	Load Labels from metadata file.
`load_metadata`()	Load Vocabulary and Labels from metadata file.
`load_vocabulary`()	Load Vocabulary from metadata file.

build_dataset_for_bert(set_type, bert_tokenizer, batch_size, is_training=True)¶

Creates a NumPy dataset for BERT from the specified .npz file.

Parameters

set_type (str) – Specifies if this is the training, validation or test data
bert_tokenizer (FullTokeniser) – The BERT tokeniser
batch_size (int) – The number of examples per batch
is_training (bool) – Flag determines if training set is shuffled

Returns

input_ids (np.array) – NumPy array of BERT input ids
input_masks (np.array) – NumPy array of BERT input masks
segment_ids (np.array) – NumPy array of BERT segment ids
labels (np.array) – NumPy array of target labels

build_dataset_from_numpy(set_type, batch_size, is_training=True, use_crf=False)¶

Creates a NumPy dataset from the specified .npz file.

Parameters

set_type (str) – Specifies if this is the training, validation or test data
batch_size (int) – The number of examples per batch
is_training (bool) – Flag determines if training set is shuffled
use_crf (bool) – Using CRF as final layer requires labels shape [batch_size, num_labels, 1]

Returns

text (np.array) – NumPy array of input text
labels (np.array) – NumPy array of target labels

build_dataset_from_record(set_type, batch_size, repeat=None, is_training=True, drop_remainder=False)¶

Creates an iterable dataset from the specified TFRecord file.

Parameters

set_type (str) – Specifies if this is the training, validation or test data
batch_size (int) – The number of examples per batch
repeat (int) – Number of times the dataset repeats before it is exhausted; if None, it repeats forever
is_training (bool) – Flag determines if training set is shuffled
drop_remainder (bool) – Flag determines if the last batch is dropped when it has fewer than batch_size examples

Returns

dataset (TF Dataset) – Iterable dataset of two tensors, ‘text’ and ‘label’

convert_examples_to_numpy(set_type, examples, vocabulary, labels)¶

Converts InputExamples to features and saves as .npz file.

If to_tokens is True:

Tokenizes all text and strips whitespace.
Converts to lowercase if to_lower=True.
Removes punctuation if use_punct=False.
Pads sentences with <unk> tokens to max_seq_length if pad_seq=True.
Converts sentence tokens to indices.

Converts labels to indices.

Saves as .npz file.

Parameters

set_type (str) – Specifies if this is the training, validation or test data
examples (list) – List of InputExamples
vocabulary (Gluonnlp Vocab) – The dataset's vocabulary
labels (list) – The dataset's labels list
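The per-sentence steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `sentence_to_indices`, the whitespace tokeniser, the toy vocabulary dict, the truncation to max_seq_length and the fallback to the `<unk>` index for unknown words are all hypothetical choices made for the example.

```python
import string


def sentence_to_indices(text, vocab, max_seq_length,
                        to_lower=True, use_punct=False, pad_seq=True,
                        unk_token="<unk>"):
    """Hypothetical helper mirroring the documented steps for one sentence."""
    if to_lower:
        text = text.lower()
    # Whitespace tokenisation stands in for the real tokeniser (assumption).
    tokens = text.split()
    if not use_punct:
        # Strip punctuation characters and drop tokens that become empty.
        table = str.maketrans("", "", string.punctuation)
        tokens = [t.translate(table) for t in tokens]
        tokens = [t for t in tokens if t]
    # Truncate (assumption), then pad with <unk> as the description states.
    tokens = tokens[:max_seq_length]
    if pad_seq:
        tokens = tokens + [unk_token] * (max_seq_length - len(tokens))
    # Map tokens to indices; unknown words fall back to the <unk> index.
    return [vocab.get(t, vocab[unk_token]) for t in tokens]


toy_vocab = {"<unk>": 0, "how": 1, "are": 2, "you": 3}
indices = sentence_to_indices("How are you, Bob?", toy_vocab, max_seq_length=6)
print(indices)  # [1, 2, 3, 0, 0, 0]
```

In this sketch "bob" is out of vocabulary, so it shares index 0 with the two padding positions.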

convert_examples_to_record(set_type, examples, vocabulary, labels)¶

Converts InputExamples to features and saves as TFRecord file.

If to_tokens is True:

Tokenizes all text and strips whitespace.
Converts to lowercase if to_lower=True.
Removes punctuation if use_punct=False.
Pads sentences with <unk> tokens to max_seq_length if pad_seq=True.
Converts sentence tokens to indices.

Converts labels to indices.

Saves as TFRecord file.

Parameters

set_type (str) – Specifies if this is the training, validation or test data
examples (list) – List of InputExamples
vocabulary (Gluonnlp Vocab) – The dataset's vocabulary
labels (list) – The dataset's labels list

get_dataset()¶

Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file.

get_metadata()¶

Generate a Vocabulary and label list from the whole dataset.

Tokenizes all text and strips whitespace.
Converts to lowercase if to_lower=True.
Removes punctuation if use_punct=False.
Keeps only vocab_size number of words.

Counts labels and creates a list of strings sorted in descending order of frequency.

Saves the vocabulary and labels to a pickle file.

Returns

vocabulary (Gluonnlp Vocab) – The dataset's vocabulary
labels (list) – The dataset's labels
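The counting logic described for get_metadata() can be approximated with the standard library. The sketch below is an assumption-laden illustration: `build_metadata` is a hypothetical name, it returns plain lists instead of a Gluonnlp Vocab, it uses whitespace tokenisation, and it skips punctuation handling and the pickle step.

```python
from collections import Counter


def build_metadata(sentences, labels, vocab_size=None):
    """Hypothetical helper: returns (vocabulary words, label list)."""
    word_counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace (stand-in for the real tokeniser).
        word_counts.update(sentence.lower().split())
    # most_common(None) returns every word; otherwise keep the top vocab_size.
    vocabulary = [word for word, _ in word_counts.most_common(vocab_size)]
    # Labels sorted in descending order of frequency.
    label_list = [label for label, _ in Counter(labels).most_common()]
    return vocabulary, label_list


vocabulary, label_list = build_metadata(
    ["yes I do", "do you", "do"], ["sd", "qy", "sd"], vocab_size=1)
print(vocabulary)  # ['do']
print(label_list)  # ['sd', 'qy']
```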

get_test_examples()¶

Gets the Test set from the GitHub repository. Used to make predictions.

Returns: A list of InputExamples for the test set
Return type: examples (list)

get_train_examples()¶

Gets the Training set from the GitHub repository.

Returns: A list of InputExamples for the training set
Return type: examples (list)

get_val_examples()¶

Gets the Validation set from the GitHub repository. Used to evaluate training.

Returns: A list of InputExamples for the validation set
Return type: examples (list)

load_labels()¶

Load Labels from metadata file.

Returns: The dataset's labels list
Return type: labels (list)

load_metadata()¶

Load Vocabulary and Labels from metadata file.

Returns

vocabulary (Gluonnlp Vocab) – The dataset's vocabulary
labels (list) – The dataset's labels list

load_vocabulary()¶

Load Vocabulary from metadata file.

Returns: The dataset's vocabulary
Return type: vocabulary (Gluonnlp Vocab)

class data_processor.InputExample(example_id, text, label)¶

Bases: object

A single training/test example for dialogue act classification.

data_processor.batch(input_arr, batch_size)¶

Yield successive batch_size chunks from input_arr.
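A generator with this behaviour can be written in a few lines. The sketch below is one plausible reading of the description, named `batch_sketch` to make clear it is not the library's code; whether the real function also yields a final chunk shorter than batch_size is an assumption.

```python
def batch_sketch(input_arr, batch_size):
    """Yield successive batch_size chunks from input_arr (illustrative)."""
    for start in range(0, len(input_arr), batch_size):
        # Slicing past the end is safe, so the last chunk may be shorter.
        yield input_arr[start:start + batch_size]


chunks = list(batch_sketch([1, 2, 3, 4, 5], 2))
print(chunks)  # [[1, 2], [3, 4], [5]]
```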

data_processor.batch_and_pad(text, labels, batch_size, max_seq_length, min_seq_length=5, pad_value=1)¶

Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length.

Parameters

text (list) – List of tokenised sentences to batch.
labels (list) – List of labels to batch.
batch_size (int) – Number of sentences to put in each batch.
max_seq_length (int) – Maximum length of any sequence.
min_seq_length (int) – Minimum length of any sequence.
pad_value (int/str) – Value to pad sequences with.

Returns

text_batches (list) – List of batches (lists) of sentences.
labels_batches (list) – List of batches (lists) of labels.
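The sort-then-pad idea can be illustrated as follows. This is a hypothetical re-implementation, not the library's code: in particular, padding each batch to its longest sentence, clamping that length between min_seq_length and max_seq_length, and truncating longer sentences are assumptions about how the length bounds are applied.

```python
def batch_and_pad_sketch(text, labels, batch_size, max_seq_length,
                         min_seq_length=5, pad_value=1):
    """Hypothetical re-implementation; the real function may differ."""
    # Sort sentence/label pairs by sentence length so each batch holds
    # sentences of similar length.
    pairs = sorted(zip(text, labels), key=lambda pair: len(pair[0]))
    text_batches, labels_batches = [], []
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        # Pad to the longest sentence in the batch, clamped to the bounds.
        longest = max(len(sentence) for sentence, _ in chunk)
        target = min(max(longest, min_seq_length), max_seq_length)
        padded = [list(sentence[:target]) + [pad_value] * (target - len(sentence))
                  for sentence, _ in chunk]
        text_batches.append(padded)
        labels_batches.append([label for _, label in chunk])
    return text_batches, labels_batches


text_batches, labels_batches = batch_and_pad_sketch(
    [[4, 5, 6], [7], [8, 9]], ["a", "b", "c"],
    batch_size=2, max_seq_length=4, min_seq_length=2)
print(text_batches)    # [[[7, 1], [8, 9]], [[4, 5, 6]]]
print(labels_batches)  # [['b', 'c'], ['a']]
```

Sorting first keeps the amount of padding per batch small, since sentences of similar length end up together.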

data_processor.from_one_hot(one_hot, labels)¶

Converts one-hot encoded label list into its string representation.

data_processor.join_punctuation(tokens, characters='.,;?!')¶
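This function has no description in the generated documentation. Judging only by its name and its default `characters` argument, it presumably attaches stand-alone punctuation tokens to the preceding token; the sketch below illustrates that assumed behaviour and should not be read as the actual implementation.

```python
def join_punctuation_sketch(tokens, characters='.,;?!'):
    """Assumed behaviour: glue punctuation tokens onto the preceding token."""
    joined = []
    for token in tokens:
        # A single-character token found in `characters` is treated as
        # punctuation and appended to the previous token, if there is one.
        if joined and len(token) == 1 and token in characters:
            joined[-1] += token
        else:
            joined.append(token)
    return joined


result = join_punctuation_sketch(["hello", ",", "world", "!"])
print(result)  # ['hello,', 'world!']
```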

data_processor.to_one_hot(label, labels)¶

Converts label string representation into one-hot encoded list.
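The two one-hot helpers are inverses of each other. A minimal sketch of the round trip, assuming `labels` is an ordered list of unique label strings (the `_sketch` names are hypothetical and the library's versions may handle errors or array types differently):

```python
def to_one_hot_sketch(label, labels):
    """Return a list with 1 at the position of `label` and 0 elsewhere."""
    one_hot = [0] * len(labels)
    one_hot[labels.index(label)] = 1
    return one_hot


def from_one_hot_sketch(one_hot, labels):
    """Return the label string at the position of the largest entry."""
    return labels[one_hot.index(max(one_hot))]


label_names = ["sd", "qy", "b"]
encoded = to_one_hot_sketch("qy", label_names)
decoded = from_one_hot_sketch(encoded, label_names)
print(encoded)  # [0, 1, 0]
print(decoded)  # qy
```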