data_juicer.ops.mapper.agent_dialog_normalize_mapper module#

class data_juicer.ops.mapper.agent_dialog_normalize_mapper.AgentDialogNormalizeMapper(*args, **kwargs)[source]#

Bases: Mapper

Normalize agent format (messages + choices) to DJ fields.

Outputs: text, dialog_history, query, response; optionally meta tags agent_tool_types, agent_skill_types, agent_turn_count. When copy_lineage_fields is True, also copies request_model, pt, total_cost_time, and (when copy_request_id) the first non-empty id among request_id_keys from the sample root into meta for cohort analysis and stable drill-down links. Always records last user/assistant message indices (in the raw messages list) when present. Supports multi-format tool_calls (e.g. tool_calls[].function.name as in OpenAI / demos/local/demo-agent-data-content.json) and configurable user/assistant labels. Optional history_*_max_chars caps keep head+tail with an explicit middle-omitted marker so dialog_history, flattened text, and last query / response stay aligned; meta.agent_dialog_history_compressed is set when any cap fires.

__init__(messages_key: str = 'messages', choices_key: str = 'choices', text_key: str = 'text', history_key: str = 'dialog_history', query_key: str = 'query', response_key: str = 'response', extract_tool_skill_tags: bool = True, include_system_in_first_user: bool = False, user_label: str = 'User', assistant_label: str = 'Assistant', copy_lineage_fields: bool = True, copy_request_id: bool = True, request_id_keys: List[str] = ['request_id', 'trace_id', 'id'], history_tool_result_max_chars: int = 10000, history_max_assistant_trace_chars: int = 0, history_max_user_chars: int = 0, history_compress_head_ratio: float = 0.62, **kwargs)[source]#

Base class that conducts data editing.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed.

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • image_bytes_key – the key name of field that stores sample image bytes list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_single(sample)[source]#

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample