data_juicer.ops.grouper

class data_juicer.ops.grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)

Bases: Grouper

Group samples into batched samples according to the values of the given keys.

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • group_by_keys – group samples according to the values of these keys. Nested keys such as "__dj__stats__.text_len" are supported. Defaults to [self.text_key].

  • args – extra positional args

  • kwargs – extra keyword args
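To illustrate what a nested key such as "__dj__stats__.text_len" refers to, here is a minimal pure-Python sketch. It assumes the key is a dot-separated path through nested dicts; `get_nested_value` is a hypothetical helper written for this example, not a function of the library.

```python
from typing import Any, Dict


def get_nested_value(sample: Dict[str, Any], key: str) -> Any:
    """Follow a dot-separated key path through nested dicts (hypothetical helper)."""
    value: Any = sample
    for part in key.split("."):
        # Descend one level per path component.
        value = value[part]
    return value


sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}

# A nested key reaches into the stats dict.
text_len = get_nested_value(sample, "__dj__stats__.text_len")
# A key without dots is a plain top-level lookup.
top_level = get_nested_value(sample, "text")
```

A key without a dot behaves like an ordinary top-level lookup, so the same helper covers both cases.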

process(dataset)

Dataset -> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.
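The grouping described above can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: it assumes a dataset is a list of dicts, that a batched sample maps each field name to a list of values, and that the grouping values are hashable. `group_by_keys` is a hypothetical function, and nested keys are left out for brevity.

```python
from typing import Any, Dict, List, Tuple


def group_by_keys(samples: List[Dict[str, Any]], keys: List[str]) -> List[Dict[str, List[Any]]]:
    """Group samples sharing the same values for `keys` (hypothetical sketch)."""
    groups: Dict[Tuple[Any, ...], List[Dict[str, Any]]] = {}
    for sample in samples:
        # The tuple of key values identifies the group; values must be hashable.
        group_id = tuple(sample[k] for k in keys)
        groups.setdefault(group_id, []).append(sample)

    batched = []
    for members in groups.values():
        # Turn a list of dicts into one dict of lists (one batched sample).
        batched.append({field: [m[field] for m in members] for field in members[0]})
    return batched


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
    {"text": "c", "lang": "en"},
]
batched = group_by_keys(samples, ["lang"])
```

In this sketch, groups appear in the order their first member was seen, because Python dicts preserve insertion order.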

class data_juicer.ops.grouper.NaiveGrouper(*args, **kwargs)

Bases: Grouper

Group all samples into one batched sample.

__init__(*args, **kwargs)

Initialization method.

Parameters:
  • args – extra positional args

  • kwargs – extra keyword args

process(dataset)

Dataset -> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.
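As a rough illustration of collapsing a whole dataset into one batched sample, the following pure-Python sketch assumes a dataset is a list of dicts that share the same fields. `group_all` is a hypothetical name used only here; how the library handles an empty dataset is not stated above, so returning an empty list is an assumption.

```python
from typing import Any, Dict, List


def group_all(samples: List[Dict[str, Any]]) -> List[Dict[str, List[Any]]]:
    """Merge every sample into a single batched sample (hypothetical sketch)."""
    if not samples:
        # Assumption: an empty dataset yields an empty result.
        return []
    # One dict of lists, with fields taken from the first sample.
    return [{field: [s[field] for s in samples] for field in samples[0]}]


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
]
batched = group_all(samples)
```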

class data_juicer.ops.grouper.NaiveReverseGrouper(batch_meta_export_path=None, *args, **kwargs)

Bases: Grouper

Split batched samples back into individual samples.

__init__(batch_meta_export_path=None, *args, **kwargs)

Initialization method.

Parameters:
  • batch_meta_export_path – the path to export the batch meta to. If it is None, the batch meta is dropped.

  • args – extra positional args

  • kwargs – extra keyword args

process(dataset)

Dataset -> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of individual (unbatched) samples.
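The reverse operation can be sketched in the same style. This is a hedged illustration, not the library's code: it assumes a batched sample is a dict of equal-length lists, that any batch-level metadata lives under a field named "batch_meta" (a hypothetical field name), and that exported metadata is written as one JSON object per line (an assumed file format). `split_batches` is a hypothetical function.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Optional

BATCH_META_FIELD = "batch_meta"  # hypothetical field name, chosen for this sketch


def split_batches(
    batched_samples: List[Dict[str, Any]],
    batch_meta_export_path: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Split batched samples into individual samples (hypothetical sketch)."""
    samples: List[Dict[str, Any]] = []
    batch_metas: List[Any] = []
    for batched in batched_samples:
        fields = dict(batched)  # shallow copy so the input is not mutated
        meta = fields.pop(BATCH_META_FIELD, None)
        if meta is not None:
            batch_metas.append(meta)
        # All remaining fields are assumed to be lists of the same length.
        size = len(next(iter(fields.values()))) if fields else 0
        for i in range(size):
            samples.append({name: values[i] for name, values in fields.items()})

    # The batch meta is written out only when a path is given; otherwise it is dropped.
    if batch_meta_export_path is not None:
        with open(batch_meta_export_path, "w", encoding="utf-8") as f:
            for meta in batch_metas:
                f.write(json.dumps(meta, ensure_ascii=False) + "\n")
    return samples


batched = [
    {"text": ["a", "c"], "lang": ["en", "en"], "batch_meta": {"size": 2}},
    {"text": ["b"], "lang": ["zh"]},
]

# With no export path, the batch meta is simply dropped.
samples = split_batches(batched)

# With an export path, the batch meta is written to a file.
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "batch_meta.jsonl")
    split_batches(batched, batch_meta_export_path=export_path)
    with open(export_path, encoding="utf-8") as f:
        exported = [json.loads(line) for line in f]
```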