Data-Juicer Q&A Copilot#

Q&A Copilot is the intelligent question-answering component of the Data-Juicer Agents system, a professional Data-Juicer AI assistant built on the AgentScope framework.

You can chat with our Q&A Copilot, Juicer, on the official Data-Juicer documentation site. Feel free to ask Juicer anything related to the Data-Juicer ecosystem.

Core Components#

  • Agent: Intelligent Q&A agent based on ReActAgent

  • FAQ RAG System: Fast and accurate FAQ retrieval powered by Qdrant vector database and DashScope text embedding model

  • MCP Integration: Online GitHub search capabilities through GitHub MCP Server

  • Redis Storage: Supports session history and feedback data persistence

  • Web API: Provides RESTful interfaces for frontend integration

Quick Start#

Prerequisites#

  • 3.10 <= Python <= 3.12

  • Docker (for running Qdrant vector database)

  • Redis server (optional, activated by SESSION_STORE_TYPE=redis)

  • DashScope API Key (for large language model calls and text embedding)

Installation#

  1. Install dependencies

    cd ..
    uv pip install '.[qa]'  # quoted so shells such as zsh do not expand the brackets
    cd qa-copilot
    
  2. Install Docker (for Qdrant vector database)

    # Ubuntu/Debian
    sudo apt-get install docker.io
    sudo systemctl start docker
    
    # macOS
    brew install --cask docker  # Docker Desktop; the plain "docker" formula installs only the CLI
    

    Note: The system will automatically check and start the Qdrant Docker container on startup. If FAQ data is not initialized, the system will automatically read from qa-copilot/rag_utils/faq.txt and initialize the RAG data.

  3. Install and start Redis (optional - skip if using the default SESSION_STORE_TYPE=json)

    # Ubuntu/Debian
    sudo apt-get install redis-server
    redis-server --daemonize yes
    
    # macOS
    brew install redis
    brew services start redis
    

    Note:

    • If you set SESSION_STORE_TYPE=json (default), session history will be stored as JSON files in the SESSION_STORE_DIR directory with automatic TTL-based cleanup.

    • If you set SESSION_STORE_TYPE=redis, you need to have Redis server running. Session state is automatically managed by RedisMemory, and TTL is handled by Redis server configuration.

Configuration#

  1. Set required environment variables

    export DASHSCOPE_API_KEY="your_dashscope_api_key"
    export GITHUB_TOKEN="your_github_token"  # Required: for GitHub MCP integration
    
  2. Set optional environment variables

    Session Storage Configuration:

    # Session store type: "json" (default) or "redis"
    export SESSION_STORE_TYPE="json"  # or "redis"
    
    # For JSON mode (default):
    export SESSION_STORE_DIR="./sessions"  # Session file storage directory (default: "./sessions")
    export SESSION_TTL_SECONDS="21600"  # Session TTL in seconds (default: 21600 = 6 hours)
    export SESSION_CLEANUP_INTERVAL="1800"  # Cleanup interval in seconds (default: 1800 = 30 minutes)
    
    # For Redis mode:
    export REDIS_HOST="localhost"  # Redis server host (default: "localhost")
    export REDIS_PORT="6379"  # Redis server port (default: 6379)
    export REDIS_DB="0"  # Redis database number (default: 0)
    export REDIS_PASSWORD=""  # Redis password (default: None, optional)
    export REDIS_MAX_CONNECTIONS="10"  # Redis max connections (default: 10)
    # Note: Redis TTL is handled by Redis server configuration, not by application
    

    Model Configuration:

    export MAX_TOKENS="200000"  # Maximum tokens for context window (default: 200000)
    # Note: This value is multiplied by 3 when passed to DashScopeChatFormatter
    # because CharTokenCounter counts characters, and ~3 chars ≈ 1 token for mixed CHN & ENG text
    

    Qdrant Vector Database:

    export QDRANT_HOST="127.0.0.1"  # Qdrant server host (default: "127.0.0.1")
    export QDRANT_PORT="6333"  # Qdrant server port (default: 6333)
    

    Service Configuration:

    export DJ_COPILOT_SERVICE_HOST="127.0.0.1"  # Service host address (default: "127.0.0.1")
    export DJ_COPILOT_ENABLE_LOGGING="true"  # Enable session logging (default: "true")
    export DJ_COPILOT_LOG_DIR="./logs"  # Log directory (default: "./logs")
    

    Advanced Configuration:

    export FASTAPI_CONFIG_PATH=""  # Path to FastAPI config JSON file (optional)
    export SAFE_CHECK_HANDLER_PATH=""  # Path to custom safe check handler module (optional)
    
  3. Configure FAQ file (optional)

    The system uses qa-copilot/rag_utils/faq.txt as the FAQ data source by default. You can edit this file to customize FAQ content. FAQ file format example:

    'id': 'FAQ_001', 'question': 'What is Data-Juicer?', 'answer': 'Data-Juicer is a...'
    'id': 'FAQ_002', 'question': 'How to install?', 'answer': 'You can install by...'
    
  4. Start the service

    bash setup_server.sh
    

    On first startup, the system will automatically:

    • Check and start the Qdrant Docker container (port 6333)

    • Initialize FAQ RAG data (if not already initialized)

    • Start the Web API service

Usage#

Web API Interfaces#

After starting the service, the system provides the following API interfaces:

1. Q&A Conversation#

POST /process
Content-Type: application/json

{
  "input": [
    {
      "role": "user", 
      "content": [{"type": "text", "text": "How to use Data-Juicer for data cleaning?"}]
    }
  ],
  "session_id": "your_session_id",
  "user_id": "user_id"
}
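As a rough client-side sketch, the request above could be assembled and sent from Python using only the standard library. The base URL `http://localhost:8080` is taken from the WebUI command in this document. The helper names `build_process_payload` and `post_json` are illustrative and are not part of the Copilot codebase. The response is returned as raw bytes because its format is not specified here.

```python
import json
import urllib.request

# Assumption: same host and port as the WebUI command in this document.
BASE_URL = "http://localhost:8080"


def build_process_payload(question: str, session_id: str, user_id: str) -> dict:
    """Build the JSON body for POST /process in the documented shape."""
    return {
        "input": [
            {
                "role": "user",
                "content": [{"type": "text", "text": question}],
            }
        ],
        "session_id": session_id,
        "user_id": user_id,
    }


def post_json(path: str, payload: dict, timeout: float = 60.0) -> bytes:
    """POST a JSON payload to the service and return the raw response body."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the service running, `post_json("/process", build_process_payload("How to use Data-Juicer for data cleaning?", "your_session_id", "user_id"))` sends the request shown above. The same `post_json` helper should also fit the `/memory` and `/clear` bodies, which only carry `session_id` and `user_id`.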

2. Get Session History#

POST /memory
Content-Type: application/json

{
  "session_id": "your_session_id",
  "user_id": "user_id"
}

3. Clear Session History#

POST /clear
Content-Type: application/json

{
  "session_id": "your_session_id",
  "user_id": "user_id"
}

4. Submit User Feedback#

POST /feedback
Content-Type: application/json

{
  "data": {
    "message_id": "message_id_here",
    "feedback_type": "like",
    "comment": "optional user comment"
  },
  "session_id": "your_session_id",
  "user_id": "user_id"
}

Parameters:

  • message_id: The ID of the message to provide feedback on (required)

  • feedback_type: Type of feedback, either "like" or "dislike" (required)

  • comment: Free-text user comment (optional)

Response example:

{
  "status": "ok",
  "message": "Feedback recorded successfully"
}
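The feedback body can be validated on the client before it is sent. The sketch below is a minimal illustration under stated assumptions: `build_feedback_payload` is a hypothetical helper name, and leaving out the `comment` key when no comment is given is a guess based on the field being documented as optional.

```python
from typing import Optional

# The two feedback types documented above.
VALID_FEEDBACK_TYPES = ("like", "dislike")


def build_feedback_payload(
    message_id: str,
    feedback_type: str,
    session_id: str,
    user_id: str,
    comment: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /feedback, rejecting invalid input early."""
    if not message_id:
        raise ValueError("message_id is required")
    if feedback_type not in VALID_FEEDBACK_TYPES:
        raise ValueError(f"feedback_type must be one of {VALID_FEEDBACK_TYPES}")

    data = {"message_id": message_id, "feedback_type": feedback_type}
    if comment is not None:
        # Assumption: an absent comment is simply left out of the body.
        data["comment"] = comment

    return {"data": data, "session_id": session_id, "user_id": user_id}
```

Checking `feedback_type` locally avoids a round trip for a value the service is documented not to accept.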

WebUI#

To launch the WebUI, run the following command in your terminal:

npx @agentscope-ai/chat agentscope-runtime-webui --url http://localhost:8080/process

Refer to AgentScope Runtime WebUI for more information.

Configuration Details#

Environment Variables Summary#

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DASHSCOPE_API_KEY | ✅ Yes | - | DashScope API key for LLM and embedding |
| GITHUB_TOKEN | ✅ Yes | - | GitHub token for MCP integration |
| SESSION_STORE_TYPE | ❌ No | "json" | Session storage type: "json" or "redis" |
| SESSION_STORE_DIR | ❌ No | "./sessions" | Session file directory (JSON mode only) |
| SESSION_TTL_SECONDS | ❌ No | 21600 | Session TTL in seconds (JSON mode only, 6 hours) |
| SESSION_CLEANUP_INTERVAL | ❌ No | 1800 | Cleanup interval in seconds (JSON mode only, 30 minutes) |
| REDIS_HOST | ❌ No | "localhost" | Redis server host (Redis mode only) |
| REDIS_PORT | ❌ No | 6379 | Redis server port (Redis mode only) |
| REDIS_DB | ❌ No | 0 | Redis database number (Redis mode only) |
| REDIS_PASSWORD | ❌ No | None | Redis password (Redis mode only, optional) |
| REDIS_MAX_CONNECTIONS | ❌ No | 10 | Redis max connections (Redis mode only) |
| QDRANT_HOST | ❌ No | "127.0.0.1" | Qdrant server host |
| QDRANT_PORT | ❌ No | 6333 | Qdrant server port |
| MAX_TOKENS | ❌ No | 200000 | Maximum tokens for context window (multiplied by 3 for CharTokenCounter) |
| DJ_COPILOT_SERVICE_HOST | ❌ No | "127.0.0.1" | Service host address |
| DJ_COPILOT_ENABLE_LOGGING | ❌ No | "true" | Enable session logging |
| DJ_COPILOT_LOG_DIR | ❌ No | "./logs" | Log directory |
| FASTAPI_CONFIG_PATH | ❌ No | "" | Path to FastAPI config JSON file |
| SAFE_CHECK_HANDLER_PATH | ❌ No | "" | Path to custom safe check handler |

Model Configuration#

In app_deploy.py, you can configure the language model to use:

model=DashScopeChatModel(
    "qwen3-max-2026-01-23",  # Model name
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    stream=True,  # Enable streaming response
    enable_thinking=True,  # Enable thinking mode
)

The formatter uses the MAX_TOKENS environment variable (default: 200000) to limit the context window size. CharTokenCounter counts characters rather than tokens, and roughly 3 characters correspond to 1 token for mixed Chinese and English text, so the value is multiplied by 3 before it is passed to DashScopeChatFormatter.
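The arithmetic behind that conversion can be sketched in a few lines. This is only an illustration of the stated rule and does not reproduce app_deploy.py. The function name `char_budget` is hypothetical, and how the result is handed to DashScopeChatFormatter is not shown here.

```python
import os

# Ratio stated in this document for mixed Chinese and English text.
CHARS_PER_TOKEN = 3


def char_budget(env=None) -> int:
    """Convert the MAX_TOKENS setting into a character budget."""
    env = os.environ if env is None else env
    max_tokens = int(env.get("MAX_TOKENS", "200000"))
    return max_tokens * CHARS_PER_TOKEN
```

With the default setting, the budget works out to 200000 × 3 = 600000 characters.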

Session Storage Configuration#

JSON Mode (Default):

  • Session history is stored as JSON files in the SESSION_STORE_DIR directory

  • Automatic TTL-based cleanup runs every SESSION_CLEANUP_INTERVAL seconds

  • Sessions expire after SESSION_TTL_SECONDS seconds of inactivity

  • No external dependencies required

Redis Mode:

  • Session history is stored in Redis

  • Session state is automatically managed by RedisMemory

  • TTL is handled by Redis server configuration (not application-level)

  • Requires Redis server to be running
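The settings for the two modes can be pictured as a small configuration object read from the environment. The sketch below uses the variable names and defaults documented above. The class `SessionStoreConfig`, the functions `load_session_config` and `is_expired`, and the choice to reject unknown store types are assumptions made for illustration, not the project's actual implementation.

```python
import os
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionStoreConfig:
    store_type: str        # "json" or "redis"
    store_dir: str         # used in JSON mode only
    ttl_seconds: int       # used in JSON mode only
    cleanup_interval: int  # used in JSON mode only


def load_session_config(env=None) -> SessionStoreConfig:
    """Read session storage settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    store_type = env.get("SESSION_STORE_TYPE", "json")
    if store_type not in ("json", "redis"):
        # Assumption: unknown values are treated as a configuration error.
        raise ValueError(f"unsupported SESSION_STORE_TYPE: {store_type!r}")
    return SessionStoreConfig(
        store_type=store_type,
        store_dir=env.get("SESSION_STORE_DIR", "./sessions"),
        ttl_seconds=int(env.get("SESSION_TTL_SECONDS", "21600")),
        cleanup_interval=int(env.get("SESSION_CLEANUP_INTERVAL", "1800")),
    )


def is_expired(last_active: float, ttl_seconds: int, now: Optional[float] = None) -> bool:
    """Return True once a session has been inactive for longer than its TTL."""
    now = time.time() if now is None else now
    return now - last_active > ttl_seconds
```

In JSON mode, a cleanup pass that runs every `cleanup_interval` seconds could apply `is_expired` to each session file's last-activity time. In Redis mode this check would not apply, since expiry is left to the Redis server.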

FAQ RAG Configuration#

The FAQ RAG system uses the following configuration:

  • Vector Database: Qdrant (running in Docker container)

  • Embedding Model: DashScope text-embedding-v4

  • Vector Dimension: 1024

  • Data Source: qa-copilot/rag_utils/faq.txt

  • Storage Location: qa-copilot/rag_utils/qdrant_storage

  • Qdrant Host: Configurable via QDRANT_HOST (default: 127.0.0.1)

  • Qdrant Port: Configurable via QDRANT_PORT (default: 6333)

The system automatically checks if RAG data is initialized on startup. If not initialized, it will automatically read the FAQ file and create vector indexes.
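Before restarting the service with an edited FAQ file, it can help to check that every entry still parses. The sketch below assumes each non-empty line of faq.txt is the body of a Python dict literal without the surrounding braces, as in the format example given in the Configuration section. The project's real loader may read the file differently, and `parse_faq_line` and `parse_faq_text` are hypothetical helper names.

```python
import ast

REQUIRED_KEYS = ("id", "question", "answer")


def parse_faq_line(line: str) -> dict:
    """Parse one FAQ entry written as "'id': ..., 'question': ..., 'answer': ...".

    A line that is not valid literal syntax raises SyntaxError or ValueError.
    """
    # Assumption: wrapping the line in braces yields a valid dict literal.
    entry = ast.literal_eval("{" + line.strip() + "}")
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"FAQ entry is missing keys: {missing}")
    return entry


def parse_faq_text(text: str) -> list:
    """Parse every non-empty line of a FAQ file's contents."""
    return [parse_faq_line(line) for line in text.splitlines() if line.strip()]
```

`ast.literal_eval` only evaluates literals, so a malformed or unexpected line raises an error instead of executing code.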

Troubleshooting#

Common Issues#

  1. Docker/Qdrant Issues

    • Ensure Docker service is running: docker info (docker --version only confirms that the CLI is installed)

    • Check Qdrant container status: docker ps | grep qdrant

    • Manually start Qdrant container: docker start qdrant

    • Check if Qdrant port is occupied: netstat -tlnp | grep 6333

    • To reinitialize RAG data, delete the qa-copilot/rag_utils/qdrant_storage directory and restart the service

  2. Redis connection failure (when using SESSION_STORE_TYPE=redis)

    • Ensure Redis service is running: redis-cli ping

    • Check if Redis port is occupied: netstat -tlnp | grep 6379 (or your configured REDIS_PORT)

    • Verify Redis configuration: Check REDIS_HOST, REDIS_PORT, REDIS_DB, and REDIS_PASSWORD environment variables

    • Note: Redis TTL is managed by Redis server, not by the application

  3. MCP service startup failure

    • Ensure GITHUB_TOKEN is set and correct (required environment variable)

    • Verify GitHub token has necessary permissions for MCP integration

  4. API Key error

    • Verify DASHSCOPE_API_KEY environment variable is correctly set

    • Confirm API Key is valid and has sufficient quota

  5. FAQ retrieval returns no results

    • Confirm FAQ file qa-copilot/rag_utils/faq.txt exists and is properly formatted

    • Check if Qdrant container is running normally

    • Review logs to confirm RAG data was successfully initialized
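Several of the checks above come down to whether a backend port is reachable. A portable alternative to netstat is a short TCP probe, sketched below with the documented host and port variables and their defaults. The function names `port_open` and `check_backends` are illustrative. A successful connection only shows that something is listening on the port, not that it is Qdrant or Redis.

```python
import os
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_backends(env=None) -> dict:
    """Probe Qdrant, and Redis when SESSION_STORE_TYPE=redis, using documented defaults."""
    env = os.environ if env is None else env
    results = {
        "qdrant": port_open(
            env.get("QDRANT_HOST", "127.0.0.1"),
            int(env.get("QDRANT_PORT", "6333")),
        )
    }
    if env.get("SESSION_STORE_TYPE", "json") == "redis":
        results["redis"] = port_open(
            env.get("REDIS_HOST", "localhost"),
            int(env.get("REDIS_PORT", "6379")),
        )
    return results
```

Calling `check_backends()` with no arguments reads the current environment and returns a mapping such as `{"qdrant": True}`, where each value reflects whether the port accepted a connection.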

Acknowledgments#

Parts of this project’s code are adapted from the following open-source projects:

Special thanks to the AgentScope team for their excellent framework and sample code!

License#

This project uses the same license as the main project. For details, please refer to the LICENSE file.