Dataset

Alignment Datasets

AlignmentDatasetSample

Container for a single Alignment Dataset Sample.

This representation is faithful to the "TRL Preference Format with explicit prompt". See: https://huggingface.co/docs/trl/en/dataset_formats.

Parameters:

  • prompt (str) –

    The prompt associated with the sample.

  • chosen (str) –

    The winning response associated with the sample.

  • rejected (str) –

    The losing response associated with the sample.
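
As a rough illustration, a sample can be pictured as a small dataclass whose dictionary form carries the prompt, chosen, and rejected fields. The Sample class below is a hypothetical stand-in written for this sketch, not the library's AlignmentDatasetSample; only the three field names come from the parameter list above.

```python
from dataclasses import asdict, dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for AlignmentDatasetSample (illustration only)."""

    prompt: str
    chosen: str
    rejected: str


sample = Sample(
    prompt='What is 2 + 2?',
    chosen='2 + 2 equals 4.',
    rejected='2 + 2 equals 5.',
)

# One record in the prompt/chosen/rejected layout.
record = asdict(sample)
print(record)
```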

AlignmentDataset

Container object for an Alignment Dataset.

Parameters:

  • task (AlignmentTask) –

    The AlignmentTask associated with the dataset.

  • samples (List[AlignmentDatasetSample]) –

    The samples in this AlignmentDataset.

  • train_frac (float) –

    Fraction of samples that belong to the training split.

Raises:

  • ValueError

    If train_frac is not in the interval [0, 1.0]
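
The sketch below shows one way a train_frac value could be validated and used to partition a list of samples. The split_samples helper and its rounding rule are assumptions made for illustration; the page does not show how AlignmentDataset itself derives the train and test splits.

```python
from typing import List, Sequence, Tuple


def split_samples(
    samples: Sequence[str], train_frac: float
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split samples into (train, test) by train_frac."""
    if not 0.0 <= train_frac <= 1.0:
        raise ValueError(f'train_frac must be in [0, 1], got {train_frac}')
    # Assumption: the first round(train_frac * n) samples form the training split.
    num_train = round(train_frac * len(samples))
    return list(samples[:num_train]), list(samples[num_train:])


train, test = split_samples(['a', 'b', 'c', 'd'], train_frac=0.75)
print(train, test)
```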

Methods:

  • from_dict

    Construct an AlignmentDataset from dictionary representation.

  • from_json

    Load the AlignmentDataset from a json file.

  • to_dict

    Convert the AlignmentDataset to dictionary representation.

  • to_hf_compatible

    Convert the AlignmentDataset to a dictionary compatible with HuggingFace datasets.

  • to_json

    Save the AlignmentDataset to a json file.

Attributes:

num_samples property

num_samples: int

int: The number of samples associated with the AlignmentDataset.

num_test_samples property

num_test_samples: int

int: The number of test samples associated with the AlignmentDataset.

num_train_samples property

num_train_samples: int

int: The number of training samples associated with the AlignmentDataset.

test property

List[AlignmentDatasetSample]: The list of testing samples associated with the AlignmentDataset.

test_frac property

test_frac: float

Fraction of samples that belong to the testing split.

train property

List[AlignmentDatasetSample]: The list of training samples associated with the AlignmentDataset.

from_dict classmethod

from_dict(dataset_dict: Dict[str, Any]) -> AlignmentDataset

Construct an AlignmentDataset from dictionary representation.

Note

Expects the 'task', 'train', and 'test' keys to be present in the dictionary. The 'task' value should be parsable by AlignmentTask.from_dict(). The 'train' and 'test' values should each be a list of dictionaries, each of which is parsable by AlignmentDatasetSample.

Parameters:

  • dataset_dict (Dict[str, Any]) –

    The dictionary that encodes the AlignmentDataset.

Returns:

  • AlignmentDataset ( AlignmentDataset ) –

    The newly constructed AlignmentDataset.

Raises:

  • ValueError

    If the input dictionary is missing any required keys.

Source code in aif_gen/dataset/alignment_dataset.py
@classmethod
def from_dict(cls, dataset_dict: Dict[str, Any]) -> AlignmentDataset:
    r"""Construct an AlignmentDataset from dictionary representation.

    Note:
        Expects the 'task', 'train', and 'test' keys to be present in the dictionary.
        The 'task' value should be parsable by AlignmentTask.from_dict().
        The 'train' and 'test' values should each be a list of dictionaries, each of
        which is parsable by AlignmentDatasetSample.

    Args:
        dataset_dict (Dict[str, Any]): The dictionary that encodes the AlignmentDataset.

    Returns:
        AlignmentDataset: The newly constructed AlignmentDataset.

    Raises:
        ValueError: If the input dictionary is missing any required keys.
    """
    task = AlignmentTask.from_dict(dataset_dict['task'])
    samples = []
    for sample in dataset_dict['train']:
        samples.append(AlignmentDatasetSample(**sample))
    num_train_samples = len(samples)

    for sample in dataset_dict['test']:
        samples.append(AlignmentDatasetSample(**sample))

    train_frac = num_train_samples / len(samples)
    return cls(task, samples, train_frac)
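
A minimal sketch of the dictionary layout described in the note, together with the train_frac arithmetic from the listing above. The 'task' value is left as an empty placeholder because the AlignmentTask schema is not described on this page, and the sample texts are made up.

```python
from typing import Any, Dict

# Example payload in the layout from_dict expects.
dataset_dict: Dict[str, Any] = {
    'task': {},  # placeholder: the AlignmentTask schema is not shown here
    'train': [
        {'prompt': 'p1', 'chosen': 'c1', 'rejected': 'r1'},
        {'prompt': 'p2', 'chosen': 'c2', 'rejected': 'r2'},
        {'prompt': 'p3', 'chosen': 'c3', 'rejected': 'r3'},
    ],
    'test': [
        {'prompt': 'p4', 'chosen': 'c4', 'rejected': 'r4'},
    ],
}

# Same arithmetic as the listing: training samples over all samples.
num_train = len(dataset_dict['train'])
num_total = num_train + len(dataset_dict['test'])
train_frac = num_train / num_total
print(train_frac)  # 0.75
```

If both the 'train' and 'test' lists are empty, the denominator in this division is zero, so the listing as written would raise ZeroDivisionError.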

from_json classmethod

from_json(file_path: str | Path) -> AlignmentDataset

Load the AlignmentDataset from a json file.

Note: Uses AlignmentDataset.from_dict() under the hood to parse the representation.

Parameters:

  • file_path (Union[str, Path]) –

    The os.pathlike object to read from.

Returns:

  • AlignmentDataset ( AlignmentDataset ) –

    The newly constructed AlignmentDataset.

Source code in aif_gen/dataset/alignment_dataset.py
@classmethod
def from_json(cls, file_path: str | pathlib.Path) -> AlignmentDataset:
    r"""Load the AlignmentDataset from a json file.

    Note: Uses AlignmentDataset.from_dict() under the hood to parse the representation.

    Args:
        file_path (Union[str, pathlib.Path]): The os.pathlike object to read from.

    Returns:
        AlignmentDataset: The newly constructed AlignmentDataset.
    """
    with open(file_path, 'r') as f:
        dataset_dict = json.load(f)
    return cls.from_dict(dataset_dict)

to_dict

to_dict() -> Dict[str, Any]

Convert the AlignmentDataset to dictionary representation.

Returns:

  • Dict[str, Any]

    Dict[str, Any]: The dictionary representation of the AlignmentDataset.

Source code in aif_gen/dataset/alignment_dataset.py
def to_dict(self) -> Dict[str, Any]:
    r"""Convert the AlignmentDataset to dictionary representation.

    Returns:
        Dict[str, Any]: The dictionary representation of the AlignmentDataset.
    """
    dataset_dict: Dict[str, Any] = {}
    dataset_dict['task'] = self.task.to_dict()
    dataset_dict['train'] = [asdict(sample) for sample in self.train]
    dataset_dict['test'] = [asdict(sample) for sample in self.test]
    return dataset_dict

to_hf_compatible

to_hf_compatible() -> Dict[str, Dataset]

Convert the AlignmentDataset to a dictionary compatible with HuggingFace datasets.

Returns:

  • Dict[str, Dataset]

    Dict[str, Dataset]: The dictionary compatible with HuggingFace datasets.

Source code in aif_gen/dataset/alignment_dataset.py
def to_hf_compatible(self) -> Dict[str, Dataset]:
    r"""Convert the AlignmentDataset to a dictionary compatible with HuggingFace datasets.

    Returns:
        Dict[str, Dataset]: The dictionary compatible with HuggingFace datasets.
    """
    hf_dict: Dict[str, Dataset] = {
        'train': Dataset.from_dict(
            {
                'prompt': [sample.prompt for sample in self.train],
                'chosen': [sample.chosen for sample in self.train],
                'rejected': [sample.rejected for sample in self.train],
            },
            split='train',
        ),
        'test': Dataset.from_dict(
            {
                'prompt': [sample.prompt for sample in self.test],
                'chosen': [sample.chosen for sample in self.test],
                'rejected': [sample.rejected for sample in self.test],
            },
            split='test',
        ),
    }
    return hf_dict
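
The conversion above pivots row-style samples into one list per column before handing them to Dataset.from_dict. The standard-library sketch below shows only that pivot; to_columns is a hypothetical helper and the rows are made-up data.

```python
from typing import Dict, List

rows = [
    {'prompt': 'p1', 'chosen': 'c1', 'rejected': 'r1'},
    {'prompt': 'p2', 'chosen': 'c2', 'rejected': 'r2'},
]


def to_columns(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: pivot row records into one list per column."""
    keys = ('prompt', 'chosen', 'rejected')
    return {key: [row[key] for row in rows] for key in keys}


columns = to_columns(rows)
print(columns)
# With the `datasets` package installed, a mapping of this shape is what the
# listing passes to Dataset.from_dict(..., split='train').
```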

to_json

to_json(file_path: str | Path) -> None

Save the AlignmentDataset to a json file.

Note: Uses to_dict() under the hood to get a dictionary representation.

Parameters:

  • file_path (Union[str, Path]) –

    The os.pathlike object to write to.

Source code in aif_gen/dataset/alignment_dataset.py
def to_json(self, file_path: str | pathlib.Path) -> None:
    r"""Save the AlignmentDataset to a json file.

    Note: Uses to_dict() under the hood to get a dictionary representation.

    Args:
        file_path (Union[str, pathlib.Path]): The os.pathlike object to write to.
    """
    dataset_dict = self.to_dict()
    with open(file_path, 'w') as f:
        json.dump(dataset_dict, f)
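
Because to_json and from_json go through a plain dictionary, the file round trip can be sketched with the json module alone. The payload below is made-up data in the 'task'/'train'/'test' layout, with an empty placeholder for the task; it does not use the library classes.

```python
import json
import tempfile
from pathlib import Path

dataset_dict = {
    'task': {},  # placeholder: the AlignmentTask schema is not shown here
    'train': [{'prompt': 'p1', 'chosen': 'c1', 'rejected': 'r1'}],
    'test': [{'prompt': 'p2', 'chosen': 'c2', 'rejected': 'r2'}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / 'dataset.json'

    # Write the dictionary the same way the to_json listing does.
    with open(file_path, 'w') as f:
        json.dump(dataset_dict, f)

    # Read it back the same way the from_json listing does.
    with open(file_path, 'r') as f:
        restored = json.load(f)

print(restored == dataset_dict)  # True
```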

ContinualAlignmentDataset

Container object for a Continual Alignment Dataset.

Parameters:

Methods:

  • append

    Append a single AlignmentDataset to the ContinualAlignmentDataset.

  • extend

    Append multiple AlignmentDatasets to the ContinualAlignmentDataset.

  • from_dict

    Construct a ContinualAlignmentDataset from dictionary representation.

  • from_json

    Load the ContinualAlignmentDataset from a json file.

  • to_dict

    Convert the ContinualAlignmentDataset to dictionary representation.

  • to_hf_compatible

    Convert the ContinualAlignmentDataset to a list of dictionaries compatible with HuggingFace datasets.

  • to_json

    Save the ContinualAlignmentDataset to a json file.

Attributes:

  • num_datasets (int) –

    int: The number of AlignmentDataset constituents.

  • num_samples (int) –

    int: The total number of samples across all AlignmentDataset constituents.

num_datasets property

num_datasets: int

int: The number of AlignmentDataset constituents.

num_samples property

num_samples: int

int: The total number of samples across all AlignmentDataset constituents.

append

append(dataset: AlignmentDataset) -> None

Append a single AlignmentDataset to the ContinualAlignmentDataset.

Parameters:

  • dataset (AlignmentDataset) –

    The new dataset to add.

Raises:

  • TypeError

    If the dataset is not of type AlignmentDataset.

Source code in aif_gen/dataset/continual_alignment_dataset.py
def append(self, dataset: AlignmentDataset) -> None:
    r"""Append a single AlignmentDataset to the ContinualAlignmentDataset.

    Args:
        dataset (AlignmentDataset): The new dataset to add.

    Raises:
        TypeError: If the dataset is not of type AlignmentDataset.
    """
    if isinstance(dataset, AlignmentDataset):
        self.datasets.append(dataset)
    else:
        raise TypeError(
            f'Dataset: {dataset} must be of type AlignmentDataset but got {dataset.__class__.__name__}'
        )

extend

extend(datasets: List[AlignmentDataset]) -> None

Append multiple AlignmentDatasets to the ContinualAlignmentDataset.

Parameters:

  • datasets (List[AlignmentDataset]) –

    The new datasets to add.

Raises:

  • TypeError

    If any dataset is not of type AlignmentDataset.

Source code in aif_gen/dataset/continual_alignment_dataset.py
def extend(self, datasets: List[AlignmentDataset]) -> None:
    r"""Append multiple AlignmentDatasets to the ContinualAlignmentDataset.

    Args:
        datasets (List[AlignmentDataset]): The new datasets to add.

    Raises:
        TypeError: if any dataset is not of type AlignmentDataset.
    """
    for dataset in datasets:
        self.append(dataset)
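
The following sketch mirrors the append and extend logic from the listings using hypothetical stand-in classes (StubDataset and StubContinualDataset are not part of the library). It illustrates that, with extend implemented as a loop over append, items that precede an invalid entry are already added by the time TypeError is raised.

```python
from typing import Any, List


class StubDataset:
    """Hypothetical stand-in for AlignmentDataset."""


class StubContinualDataset:
    """Hypothetical stand-in mirroring the append/extend logic shown above."""

    def __init__(self) -> None:
        self.datasets: List[StubDataset] = []

    def append(self, dataset: Any) -> None:
        # Reject anything that is not the expected dataset type.
        if not isinstance(dataset, StubDataset):
            raise TypeError(
                f'expected StubDataset but got {dataset.__class__.__name__}'
            )
        self.datasets.append(dataset)

    def extend(self, datasets: List[Any]) -> None:
        # Delegates to append, so validation happens item by item.
        for dataset in datasets:
            self.append(dataset)


continual = StubContinualDataset()
continual.append(StubDataset())

try:
    continual.extend([StubDataset(), 'not a dataset', StubDataset()])
except TypeError as err:
    print(err)

# The first item of the failed extend call was kept; the last was never reached.
print(len(continual.datasets))  # 2
```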

from_dict classmethod

from_dict(
    dataset_dict: Dict[str, Any],
) -> ContinualAlignmentDataset

Construct a ContinualAlignmentDataset from dictionary representation.

Note

Expects 'datasets' key to be present in the dictionary. The value is a list of dictionaries, each parsable by AlignmentDataset.from_dict().

Parameters:

  • dataset_dict (Dict[str, Any]) –

    The dictionary that encodes the ContinualAlignmentDataset.

Returns:

  • ContinualAlignmentDataset

    The newly constructed ContinualAlignmentDataset.

Raises:

  • ValueError

    If the input dictionary is missing any required keys.

Source code in aif_gen/dataset/continual_alignment_dataset.py
@classmethod
def from_dict(cls, dataset_dict: Dict[str, Any]) -> ContinualAlignmentDataset:
    r"""Construct a ContinualAlignmentDataset from dictionary representation.

    Note:
        Expects 'datasets' key to be present in the dictionary. The value is a list
        of dictionaries, each parsable by AlignmentDataset.from_dict().

    Args:
        dataset_dict (Dict[str, Any]): The dictionary that encodes the ContinualAlignmentDataset.

    Returns:
        ContinualAlignmentDataset: The newly constructed ContinualAlignmentDataset.

    Raises:
        ValueError: If the input dictionary is missing any required keys.
    """
    datasets = []
    for dataset in dataset_dict['datasets']:
        datasets.append(AlignmentDataset.from_dict(dataset))
    return cls(datasets)
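
A small sketch of the nested layout this method reads: a 'datasets' list whose entries each follow the 'task'/'train'/'test' layout of a single AlignmentDataset. The make_sample helper, the empty 'task' placeholders, and the sample texts are assumptions for illustration; the counts are computed from the raw dictionary rather than through the library.

```python
from typing import Any, Dict


def make_sample(i: int) -> Dict[str, str]:
    """Hypothetical helper producing a made-up sample record."""
    return {'prompt': f'p{i}', 'chosen': f'c{i}', 'rejected': f'r{i}'}


continual_dict: Dict[str, Any] = {
    'datasets': [
        {'task': {}, 'train': [make_sample(1), make_sample(2)], 'test': [make_sample(3)]},
        {'task': {}, 'train': [make_sample(4)], 'test': [make_sample(5)]},
    ]
}

# Counts analogous to the num_datasets and num_samples attributes.
num_datasets = len(continual_dict['datasets'])
num_samples = sum(
    len(d['train']) + len(d['test']) for d in continual_dict['datasets']
)
print(num_datasets, num_samples)  # 2 5
```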

from_json classmethod

from_json(
    file_path: str | Path,
) -> ContinualAlignmentDataset

Load the ContinualAlignmentDataset from a json file.

Note: Uses ContinualAlignmentDataset.from_dict() under the hood to parse the representation.

Parameters:

  • file_path (Union[str, Path]) –

    The os.pathlike object to read from.

Returns:

  • ContinualAlignmentDataset

    The newly constructed ContinualAlignmentDataset.

Source code in aif_gen/dataset/continual_alignment_dataset.py
@classmethod
def from_json(cls, file_path: str | pathlib.Path) -> ContinualAlignmentDataset:
    r"""Load the ContinualAlignmentDataset from a json file.

    Note: Uses ContinualAlignmentDataset.from_dict() under the hood to parse the representation.

    Args:
        file_path (Union[str, pathlib.Path]): The os.pathlike object to read from.

    Returns:
        ContinualAlignmentDataset: The newly constructed ContinualAlignmentDataset.
    """
    with open(file_path, 'r') as f:
        dataset_dict = json.load(f)
    return cls.from_dict(dataset_dict)

to_dict

to_dict() -> Dict[str, Any]

Convert the ContinualAlignmentDataset to dictionary representation.

Returns:

  • Dict[str, Any]

    Dict[str, Any]: The dictionary representation of the ContinualAlignmentDataset.

Source code in aif_gen/dataset/continual_alignment_dataset.py
def to_dict(self) -> Dict[str, Any]:
    r"""Convert the ContinualAlignmentDataset to dictionary representation.

    Returns:
        Dict[str, Any]: The dictionary representation of the ContinualAlignmentDataset.
    """
    dataset_dict: Dict[str, List[Any]] = {'datasets': []}
    for dataset in self.datasets:
        dataset_dict['datasets'].append(dataset.to_dict())
    return dataset_dict

to_hf_compatible

to_hf_compatible() -> List[Dict[str, Dataset]]

Convert the ContinualAlignmentDataset to a list of dictionaries compatible with HuggingFace datasets.

Returns:

  • List[Dict[str, Dataset]]

    List[Dict[str, Dataset]]: The list of dictionaries compatible with HuggingFace datasets.

Source code in aif_gen/dataset/continual_alignment_dataset.py
def to_hf_compatible(self) -> List[Dict[str, Dataset]]:
    r"""Convert the ContinualAlignmentDataset to a list of dictionaries compatible with HuggingFace datasets.

    Returns:
        List[Dict[str, Dataset]]: The list of dictionaries compatible with HuggingFace datasets.
    """
    return [dataset.to_hf_compatible() for dataset in self.datasets]

to_json

to_json(file_path: str | Path) -> None

Save the ContinualAlignmentDataset to a json file.

Note: Uses to_dict() under the hood to get a dictionary representation.

Parameters:

  • file_path (Union[str, Path]) –

    The os.pathlike object to write to.

Source code in aif_gen/dataset/continual_alignment_dataset.py
def to_json(self, file_path: str | pathlib.Path) -> None:
    r"""Save the ContinualAlignmentDataset to a json file.

    Note: Uses to_dict() under the hood to get a dictionary representation.

    Args:
        file_path (Union[str, pathlib.Path]): The os.pathlike object to write to.
    """
    dataset_dict = self.to_dict()
    with open(file_path, 'w') as f:
        json.dump(dataset_dict, f)