Data Processor

Classes:

DataProcessor(set_name, output_dir, …[, …])

Converts sentences for dialogue act classification into data sets.

InputExample(example_id, text, label)

A single training/test example for dialogue act classification.

Functions:

batch(input_arr, batch_size)

Yield successive batch_size chunks from input_arr.

batch_and_pad(text, labels, batch_size, …)

Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length.

from_one_hot(one_hot, labels)

Converts one-hot encoded label list into its string representation.

join_punctuation(tokens[, characters])

to_one_hot(label, labels)

Converts label string representation into one-hot encoded list.

class data_processor.DataProcessor(set_name, output_dir, max_seq_length, vocab_size=None, to_tokens=True, to_indices=True, pad_seq=True, to_lower=True, use_punct=False, label_index=2)

Bases: object

Converts sentences for dialogue act classification into data sets.

Methods:

build_dataset_for_bert(set_type, …[, …])

Creates a NumPy dataset for BERT from the specified .npz file.

build_dataset_from_numpy(set_type, batch_size)

Creates a NumPy dataset from the specified .npz file.

build_dataset_from_record(set_type, batch_size)

Creates an iterable dataset from the specified TFRecord file.

convert_examples_to_numpy(set_type, …)

Converts InputExamples to features and saves as .npz file.

convert_examples_to_record(set_type, …)

Converts InputExamples to features and saves as TFRecord file.

get_dataset()

Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file.

get_metadata()

Generate a Vocabulary and label list from the whole dataset.

get_test_examples()

Gets the test set from the GitHub repository.

get_train_examples()

Gets the training set from the GitHub repository.

get_val_examples()

Gets the validation set from the GitHub repository.

load_labels()

Load Labels from metadata file.

load_metadata()

Load Vocabulary and Labels from metadata file.

load_vocabulary()

Load Vocabulary from metadata file.

build_dataset_for_bert(set_type, bert_tokenizer, batch_size, is_training=True)

Creates a NumPy dataset for BERT from the specified .npz file.

Parameters
  • set_type (str) – Specifies if this is the training, validation or test data

  • bert_tokenizer (FullTokeniser) – The BERT tokeniser

  • batch_size (int) – The number of examples per batch

  • is_training (bool) – Flag determines if training set is shuffled

Returns
  • input_ids (np.array) – NumPy array of BERT input ids

  • input_masks (np.array) – NumPy array of BERT input masks

  • segment_ids (np.array) – NumPy array of BERT segment ids

  • labels (np.array) – NumPy array of target labels

build_dataset_from_numpy(set_type, batch_size, is_training=True, use_crf=False)

Creates a NumPy dataset from the specified .npz file.

Parameters
  • set_type (str) – Specifies if this is the training, validation or test data

  • batch_size (int) – The number of examples per batch

  • is_training (bool) – Flag determines if training set is shuffled

  • use_crf (bool) – Set when using a CRF as the final layer, which requires labels of shape [batch_size, num_labels, 1]

Returns
  • text (np.array) – NumPy array of input text

  • labels (np.array) – NumPy array of target labels

build_dataset_from_record(set_type, batch_size, repeat=None, is_training=True, drop_remainder=False)

Creates an iterable dataset from the specified TFRecord file.

Parameters
  • set_type (str) – Specifies if this is the training, validation or test data

  • batch_size (int) – The number of examples per batch

  • repeat (int) – How many times the dataset will repeat before it is exhausted; if ‘None’, it repeats forever

  • is_training (bool) – Flag determines if training set is shuffled

  • drop_remainder (bool) – Flag determines if the last batch is dropped when it has fewer than batch_size examples

Returns

Iterable dataset of two tensors ‘text’ and ‘label’

Return type

dataset (TF Dataset)
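
The interaction of repeat, is_training and drop_remainder can be illustrated with a small pure-Python sketch. It is not the TensorFlow implementation: iterate_batches and its seed argument are hypothetical, and it assumes batches never span two passes over the data, which may differ from the real pipeline.

```python
import itertools
import random


def iterate_batches(examples, batch_size, repeat=None, is_training=True,
                    drop_remainder=False, seed=0):
    """Hypothetical stand-in that mimics the documented parameters."""
    rng = random.Random(seed)
    # repeat=None means "repeat forever"; otherwise make `repeat` passes.
    passes = itertools.count() if repeat is None else range(repeat)
    for _ in passes:
        order = list(examples)
        if is_training:
            rng.shuffle(order)  # reshuffle on every pass
        for start in range(0, len(order), batch_size):
            chunk = order[start:start + batch_size]
            if drop_remainder and len(chunk) < batch_size:
                continue  # skip the short final batch
            yield chunk


examples = [("hello", 0), ("how are you", 1), ("fine", 0)]

one_pass = list(iterate_batches(examples, 2, repeat=1, is_training=False))
dropped = list(iterate_batches(examples, 2, repeat=1, is_training=False,
                               drop_remainder=True))
# With repeat=None the generator never ends, so take a finite slice.
forever = list(itertools.islice(
    iterate_batches(examples, 2, is_training=False), 5))
```

Here one_pass holds a full batch followed by a batch of one example, dropped holds only the full batch, and forever has to be cut off with islice because the generator never terminates on its own.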

convert_examples_to_numpy(set_type, examples, vocabulary, labels)

Converts InputExamples to features and saves as .npz file.

If to_tokens is True:
  • Tokenizes all text and strips whitespace.

  • Converts to lowercase if to_lower=True.

  • Removes punctuation if use_punct=False.

  • Pads sentence with <unk> tokens to max_seq_length if pad_seq=True.

  • Converts sentence tokens to indices.

Converts labels to indices.

Saves as .npz file.

Parameters
  • set_type (str) – Specifies if this is the training, validation or test data

  • examples (list) – List of InputExamples

  • vocabulary (Gluonnlp Vocab) – Dataset's vocabulary

  • labels (list) – Dataset's labels list
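
The per-sentence steps described for this method can be sketched in plain Python. This is only an illustration under stated assumptions: preprocess is a hypothetical helper, a plain dict stands in for the Gluonnlp Vocab, tokenisation is a simple whitespace split, and truncation to max_seq_length is assumed rather than documented.

```python
import string

PAD_TOKEN = "<unk>"  # the documentation says sentences are padded with <unk>


def preprocess(text, vocabulary, max_seq_length,
               to_lower=True, use_punct=False, pad_seq=True):
    """Hypothetical helper mirroring the documented steps."""
    tokens = text.split()  # whitespace split also strips surrounding whitespace
    if to_lower:
        tokens = [t.lower() for t in tokens]
    if not use_punct:
        # Strip leading/trailing punctuation and drop tokens left empty.
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    tokens = tokens[:max_seq_length]  # assumption: longer sentences are cut
    if pad_seq:
        tokens += [PAD_TOKEN] * (max_seq_length - len(tokens))
    unk_index = vocabulary[PAD_TOKEN]
    # Unknown words fall back to the <unk> index.
    return [vocabulary.get(t, unk_index) for t in tokens]


# Toy vocabulary and label list; a dict stands in for the Gluonnlp Vocab.
vocabulary = {"<unk>": 0, "hello": 1, "how": 2, "are": 3, "you": 4}
labels = ["statement", "question"]

indices = preprocess("Hello , how are you ?", vocabulary, max_seq_length=8)
label_index = labels.index("question")  # labels are converted to indices too
```

With the toy vocabulary above, the sentence becomes four word indices followed by four padding indices, and the label becomes its position in the label list.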

convert_examples_to_record(set_type, examples, vocabulary, labels)

Converts InputExamples to features and saves as TFRecord file.

If to_tokens is True:
  • Tokenizes all text and strips whitespace.

  • Converts to lowercase if to_lower=True.

  • Removes punctuation if use_punct=False.

  • Pads sentence with <unk> tokens to max_seq_length if pad_seq=True.

  • Converts sentence tokens to indices.

Converts labels to indices.

Saves as TFRecord file.

Parameters
  • set_type (str) – Specifies if this is the training, validation or test data

  • examples (list) – List of InputExamples

  • vocabulary (Gluonnlp Vocab) – Dataset's vocabulary

  • labels (list) – Dataset's labels list

get_dataset()

Gets the metadata and all datasets (train, test, val) from the GitHub repository and saves them to file.

get_metadata()

Generate a Vocabulary and label list from the whole dataset.

Tokenizes all text and strips whitespace. Converts to lowercase if to_lower=True. Removes punctuation if use_punct=False. Keeps only vocab_size number of words.

Counts labels and creates a list of label strings sorted in descending order of frequency.

Saves the vocabulary and labels to a pickle file.

Returns
  • vocabulary (Gluonnlp Vocab) – Dataset's vocabulary

  • labels (list) – Dataset's labels
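
The counting steps can be sketched with collections.Counter. This is a rough illustration only: build_metadata is a hypothetical helper, it returns a plain list of words instead of a Gluonnlp Vocab, it expects sentences that are already tokenised, and it skips the pickle step.

```python
from collections import Counter


def build_metadata(sentences, sentence_labels, vocab_size=None):
    """Hypothetical helper: keep the most frequent words and rank the labels."""
    word_counts = Counter(token for sentence in sentences for token in sentence)
    # most_common(None) returns every word, so vocab_size=None keeps them all.
    vocabulary = [word for word, _ in word_counts.most_common(vocab_size)]
    # Labels sorted in descending order of frequency.
    labels = [label for label, _ in Counter(sentence_labels).most_common()]
    return vocabulary, labels


sentences = [["how", "are", "you"], ["you", "are", "late"], ["you", "win"]]
sentence_labels = ["question", "statement", "statement"]

vocabulary, labels = build_metadata(sentences, sentence_labels, vocab_size=2)
```

In this toy data "you" and "are" are the two most frequent words, and "statement" is listed before "question" because it occurs more often.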

get_test_examples()

Gets the test set from the GitHub repository. Used to make predictions.

Returns

A list of InputExamples for the test set

Return type

examples (list)

get_train_examples()

Gets the training set from the GitHub repository.

Returns

A list of InputExamples for the training set

Return type

examples (list)

get_val_examples()

Gets the validation set from the GitHub repository. Used to evaluate training.

Returns

A list of InputExamples for the validation set

Return type

examples (list)

load_labels()

Load Labels from metadata file.

Returns

Dataset's labels list

Return type

labels (list)

load_metadata()

Load Vocabulary and Labels from metadata file.

Returns
  • vocabulary (Gluonnlp Vocab) – Dataset's vocabulary

  • labels (list) – Dataset's labels list

load_vocabulary()

Load Vocabulary from metadata file.

Returns

Dataset's vocabulary

Return type

vocabulary (Gluonnlp Vocab)

class data_processor.InputExample(example_id, text, label)

Bases: object

A single training/test example for dialogue act classification.

data_processor.batch(input_arr, batch_size)

Yield successive batch_size chunks from input_arr.
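
A generator of this kind is commonly written with slicing. The following is a minimal sketch of that pattern, not necessarily the module's own code; batch_sketch is a hypothetical name.

```python
def batch_sketch(input_arr, batch_size):
    """Yield successive batch_size chunks; the last chunk may be shorter."""
    for start in range(0, len(input_arr), batch_size):
        yield input_arr[start:start + batch_size]


chunks = list(batch_sketch([1, 2, 3, 4, 5], 2))
```

Because slicing past the end of a list is safe in Python, the final chunk simply contains whatever items remain.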

data_processor.batch_and_pad(text, labels, batch_size, max_seq_length, min_seq_length=5, pad_value=1)

Sorts tokenised sentences by length and pads them so that sentences in each batch have the same length.

Parameters
  • text (list) – List of tokenised sentences to batch.

  • labels (list) – List of labels to batch.

  • batch_size (int) – Number of sentences to put in each batch.

  • max_seq_length (int) – Maximum length of any sequence.

  • min_seq_length (int) – Minimum length of any sequence.

  • pad_value (int/str) – Value to pad sequences with.

Returns
  • text_batches (list) – List of batches (lists) of sentences.

  • labels_batches (list) – List of batches (lists) of labels.
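
The sort-then-pad idea can be sketched as follows. This is a guess at the behaviour rather than the module's code: batch_and_pad_sketch is a hypothetical name, and it assumes each batch is padded to its longest sentence, clamped between min_seq_length and max_seq_length, with longer sentences truncated.

```python
def batch_and_pad_sketch(text, labels, batch_size, max_seq_length,
                         min_seq_length=5, pad_value=1):
    """Hypothetical sketch: sort by length, chunk, pad each chunk evenly."""
    # Sort sentences and labels together so they stay aligned.
    pairs = sorted(zip(text, labels), key=lambda pair: len(pair[0]))
    text_batches, labels_batches = [], []
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        longest = max(len(sentence) for sentence, _ in chunk)
        # Assumed clamping of the per-batch length to [min, max].
        target = max(min_seq_length, min(longest, max_seq_length))
        padded = []
        for sentence, _ in chunk:
            cut = list(sentence[:target])
            padded.append(cut + [pad_value] * (target - len(cut)))
        text_batches.append(padded)
        labels_batches.append([label for _, label in chunk])
    return text_batches, labels_batches


text = [[7, 8, 9, 10], [5], [4, 6]]
labels = ["a", "b", "c"]

text_batches, labels_batches = batch_and_pad_sketch(
    text, labels, batch_size=2, max_seq_length=3, min_seq_length=2)
```

Sorting first keeps sentences of similar length together, so each batch needs little padding compared with padding everything to max_seq_length.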

data_processor.from_one_hot(one_hot, labels)

Converts one-hot encoded label list into its string representation.

data_processor.join_punctuation(tokens, characters='.,;?!')

data_processor.to_one_hot(label, labels)

Converts label string representation into one-hot encoded list.
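
The two one-hot helpers are inverses of each other. A minimal sketch of the pair, assuming labels is a list of unique strings and using hypothetical function names, could look like this:

```python
def to_one_hot_sketch(label, labels):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    one_hot = [0] * len(labels)
    one_hot[labels.index(label)] = 1
    return one_hot


def from_one_hot_sketch(one_hot, labels):
    """Return the label string at the position of the 1."""
    return labels[one_hot.index(1)]


labels = ["statement", "question", "backchannel"]

encoded = to_one_hot_sketch("question", labels)
decoded = from_one_hot_sketch(encoded, labels)
```

Encoding and then decoding returns the original label string, which makes a quick sanity check for a label list.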