data_juicer.utils.common_utils module
- data_juicer.utils.common_utils.stats_to_number(s, reverse=True)
Convert a stats value, which can be a string or a list, to a float.
- data_juicer.utils.common_utils.dict_to_hash(input_dict: dict, hash_length=None)
Hash a dict to a string of length hash_length.
- Parameters:
input_dict – the given dict.
hash_length – the length of the resulting hash string.
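As a rough illustration of hashing a dict to a fixed-length string, here is a minimal sketch. It is not the library's implementation: the function name `dict_to_hash_sketch`, the use of SHA-256, the JSON serialization, and returning the full digest when `hash_length` is `None` are all assumptions made for this example.

```python
import hashlib
import json


def dict_to_hash_sketch(input_dict: dict, hash_length=None) -> str:
    """Hypothetical stand-in for dict_to_hash, for illustration only."""
    # Serialize with sorted keys so that key order does not change the hash.
    serialized = json.dumps(input_dict, sort_keys=True, default=str)
    # SHA-256 is an assumption; the hex digest is 64 characters long.
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    # Assumption: truncate only when a length is requested.
    return digest if hash_length is None else digest[:hash_length]


config = {"op": "language_filter", "lang": "en"}
short_id = dict_to_hash_sketch(config, hash_length=8)
print(len(short_id))  # 8
```

Sorting the keys before hashing means two dicts with the same content but different insertion order map to the same string, which is usually what a cache key or fingerprint needs.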
- data_juicer.utils.common_utils.nested_access(data, path, digit_allowed=True)
Access nested data using a dot-separated path.
- Parameters:
data – A dictionary or a list to access the nested data from.
path – A dot-separated string representing the path to access. This can include numeric indices when accessing list elements.
digit_allowed – Whether to convert all-digit path segments from strings to integers, so that they can be used as list indices.
- Returns:
The value located at the specified path. Raises a KeyError or IndexError if the path does not exist.
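The dot-separated path lookup described for nested_access can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the name `nested_access_sketch` is hypothetical, and it assumes every all-digit segment is converted to an integer when `digit_allowed` is true.

```python
def nested_access_sketch(data, path, digit_allowed=True):
    """Hypothetical stand-in for nested_access, for illustration only."""
    current = data
    for key in path.split("."):
        # Assumption: all-digit segments become integer indices when allowed.
        if digit_allowed and key.isdigit():
            key = int(key)
        # Raises KeyError (dict) or IndexError (list) if the step is missing.
        current = current[key]
    return current


sample = {"model": {"layers": [{"size": 128}, {"size": 256}]}}
print(nested_access_sketch(sample, "model.layers.1.size"))  # 256

# With digit_allowed=False, "0" stays a string and can be used as a dict key.
by_name = {"ids": {"0": "first"}}
print(nested_access_sketch(by_name, "ids.0", digit_allowed=False))  # first
```

Under this assumption, a dict whose keys are digit strings needs `digit_allowed=False`; otherwise the segment is turned into an integer and the lookup fails with a KeyError.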
- data_juicer.utils.common_utils.is_string_list(var)
Return whether var is a list of strings.
- Parameters:
var – the input variable.
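A check of this kind can be sketched in a single expression. The name `is_string_list_sketch` is hypothetical, and treating an empty list as a valid list of strings is an assumption of this example rather than documented behavior.

```python
def is_string_list_sketch(var) -> bool:
    """Hypothetical stand-in for is_string_list, for illustration only."""
    # True only for a list whose items are all str; an empty list passes
    # because all() of an empty iterable is True (an assumption here).
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(is_string_list_sketch(["a", "b"]))  # True
print(is_string_list_sketch(["a", 1]))    # False
print(is_string_list_sketch("ab"))        # False
```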
- data_juicer.utils.common_utils.avg_split_string_list_under_limit(str_list: list, token_nums: list, max_token_num=None)
Split the string list into several sub-lists such that the total token count of each sub-list is less than max_token_num, while keeping the total token counts of the sub-lists similar.
- Parameters:
str_list – the input string list.
token_nums – the token count of each string in str_list.
max_token_num – the maximum total token count of each sub-list.
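One way to picture a balanced split under a token limit is the greedy sketch below. It is an assumption-laden illustration, not the library's algorithm: the name `split_under_limit_sketch` is hypothetical, the groups are kept contiguous, a group is allowed to reach the limit exactly, and a single string larger than the limit is simply placed in a group of its own.

```python
import math


def split_under_limit_sketch(str_list, token_nums, max_token_num=None):
    """Hypothetical greedy split, for illustration only."""
    total = sum(token_nums)
    # Assumption: no limit, or everything fits, means a single group.
    if max_token_num is None or total <= max_token_num:
        return [list(str_list)]

    # Aim for the smallest number of groups, each near the same total.
    group_count = math.ceil(total / max_token_num)
    target = total / group_count

    groups, current, current_sum = [], [], 0
    for text, num in zip(str_list, token_nums):
        over_limit = current_sum + num > max_token_num
        reached_target = current_sum >= target
        # Close the current group before it exceeds the limit, or once it
        # has already reached the per-group target.
        if current and (over_limit or reached_target):
            groups.append(current)
            current, current_sum = [], 0
        current.append(text)
        current_sum += num
    if current:
        groups.append(current)
    return groups


chunks = split_under_limit_sketch(["a", "b", "c", "d"], [3, 3, 3, 3], max_token_num=8)
print(chunks)  # [['a', 'b'], ['c', 'd']]
```

Filling each group only up to the limit would give groups of very different sizes in some inputs; using a per-group target derived from the total is one simple way to keep the sub-lists similar in size.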