# Data-Juicer Q&A Copilot Q&A Copilot is the intelligent question-answering component of the Data-Juicer Agents system, a professional Data-Juicer AI assistant built on the AgentScope framework. You can chat with our [Q&A Copilot](./README.md) ***Juicer*** on the official [documentation site](https://datajuicer.github.io/data-juicer/en/main/index.html) of Data-Juicer! Feel free to ask ***Juicer*** anything related to Data-Juicer ecosystem.

### Core Components - **Agent**: Intelligent Q&A agent based on ReActAgent - **FAQ RAG System**: Fast and accurate FAQ retrieval powered by Qdrant vector database and DashScope text embedding model - **MCP Integration**: Online GitHub search capabilities through GitHub MCP Server - **Redis Storage**: Supports session history and feedback data persistence - **Web API**: Provides RESTful interfaces for frontend integration ## Quick Start ### Prerequisites - 3.10 <= Python <= 3.12 - Docker (for running Qdrant vector database) - Redis server (optional, activated by `SESSION_STORE_TYPE=redis`) - DashScope API Key (for large language model calls and text embedding) ### Installation 1. Install dependencies ```bash cd .. uv pip install .[qa] cd qa-copilot ``` 2. Install Docker (for Qdrant vector database) ```bash # Ubuntu/Debian sudo apt-get install docker.io sudo systemctl start docker # macOS brew install docker ``` **Note**: The system will automatically check and start the Qdrant Docker container on startup. If FAQ data is not initialized, the system will automatically read from `qa-copilot/rag_utils/faq.txt` and initialize the RAG data. 3. Install and start Redis (optional - skip if using the default `SESSION_STORE_TYPE=json`) ```bash # Ubuntu/Debian sudo apt-get install redis-server redis-server --daemonize yes # macOS brew install redis brew services start redis ``` **Note**: - If you set `SESSION_STORE_TYPE=json` (default), session history will be stored as JSON files in the `SESSION_STORE_DIR` directory with automatic TTL-based cleanup. - If you set `SESSION_STORE_TYPE=redis`, you need to have Redis server running. Session state is automatically managed by RedisMemory, and TTL is handled by Redis server configuration. ### Configuration 1. Set required environment variables ```bash export DASHSCOPE_API_KEY="your_dashscope_api_key" export GITHUB_TOKEN="your_github_token" # Required: for GitHub MCP integration ``` 2. Set optional environment variables **Session Storage Configuration:** ```bash # Session store type: "json" (default) or "redis" export SESSION_STORE_TYPE="json" # or "redis" # For JSON mode (default): export SESSION_STORE_DIR="./sessions" # Session file storage directory (default: "./sessions") export SESSION_TTL_SECONDS="21600" # Session TTL in seconds (default: 21600 = 6 hours) export SESSION_CLEANUP_INTERVAL="1800" # Cleanup interval in seconds (default: 1800 = 30 minutes) # For Redis mode: export REDIS_HOST="localhost" # Redis server host (default: "localhost") export REDIS_PORT="6379" # Redis server port (default: 6379) export REDIS_DB="0" # Redis database number (default: 0) export REDIS_PASSWORD="" # Redis password (default: None, optional) export REDIS_MAX_CONNECTIONS="10" # Redis max connections (default: 10) # Note: Redis TTL is handled by Redis server configuration, not by application ``` **Model Configuration:** ```bash export MAX_TOKENS="200000" # Maximum tokens for context window (default: 200000) # Note: This value is multiplied by 3 when passed to DashScopeChatFormatter # because CharTokenCounter counts characters, and ~3 chars ≈ 1 token for mixed CHN & ENG text ``` **Qdrant Vector Database:** ```bash export QDRANT_HOST="127.0.0.1" # Qdrant server host (default: "127.0.0.1") export QDRANT_PORT="6333" # Qdrant server port (default: 6333) ``` **Service Configuration:** ```bash export DJ_COPILOT_SERVICE_HOST="127.0.0.1" # Service host address (default: "127.0.0.1") export DJ_COPILOT_ENABLE_LOGGING="true" # Enable session logging (default: "true") export DJ_COPILOT_LOG_DIR="./logs" # Log directory (default: "./logs") ``` **Advanced Configuration:** ```bash export FASTAPI_CONFIG_PATH="" # Path to FastAPI config JSON file (optional) export SAFE_CHECK_HANDLER_PATH="" # Path to custom safe check handler module (optional) ``` 2. Configure FAQ file (optional) The system uses `qa-copilot/rag_utils/faq.txt` as the FAQ data source by default. You can edit this file to customize FAQ content. FAQ file format example: ``` 'id': 'FAQ_001', 'question': 'What is Data-Juicer?', 'answer': 'Data-Juicer is a...' 'id': 'FAQ_002', 'question': 'How to install?', 'answer': 'You can install by...' ``` 3. Start the service ```bash bash setup_server.sh ``` On first startup, the system will automatically: - Check and start the Qdrant Docker container (port 6333) - Initialize FAQ RAG data (if not already initialized) - Start the Web API service ## Usage ### Web API Interfaces After starting the service, the system provides the following API interfaces: #### 1. Q&A Conversation ```http POST /process Content-Type: application/json { "input": [ { "role": "user", "content": [{"type": "text", "text": "How to use Data-Juicer for data cleaning?"}] } ], "session_id": "your_session_id", "user_id": "user_id" } ``` #### 2. Get Session History ```http POST /memory Content-Type: application/json { "session_id": "your_session_id", "user_id": "user_id" } ``` #### 3. Clear Session History ```http POST /clear Content-Type: application/json { "session_id": "your_session_id", "user_id": "user_id" } ``` #### 4. Submit User Feedback ```http POST /feedback Content-Type: application/json { "data": { "message_id": "message_id_here", "feedback_type": "like", "comment": "optional user comment" }, "session_id": "your_session_id", "user_id": "user_id" } ``` **Parameters:** - `message_id`: The ID of the message to provide feedback on (required) - `feedback_type`: Type of feedback, either `"like"` or `"dislike"` (required) - `comment`: Optional user comment text (optional) **Response example:** ```json { "status": "ok", "message": "Feedback recorded successfully" } ``` ### WebUI you can simply run the following command in your terminal: ```bash npx @agentscope-ai/chat agentscope-runtime-webui --url http://localhost:8080/process ``` Refer to [AgentScope Runtime WebUI](https://runtime.agentscope.io/en/webui.html#method-2-quick-start-via-npx) for more information. ## Configuration Details ### Environment Variables Summary | Variable | Required | Default | Description | |----------|----------|---------|-------------| | `DASHSCOPE_API_KEY` | ✅ Yes | - | DashScope API key for LLM and embedding | | `GITHUB_TOKEN` | ✅ Yes | - | GitHub token for MCP integration | | `SESSION_STORE_TYPE` | ❌ No | `"json"` | Session storage type: `"json"` or `"redis"` | | `SESSION_STORE_DIR` | ❌ No | `"./sessions"` | Session file directory (JSON mode only) | | `SESSION_TTL_SECONDS` | ❌ No | `21600` | Session TTL in seconds (JSON mode only, 6 hours) | | `SESSION_CLEANUP_INTERVAL` | ❌ No | `1800` | Cleanup interval in seconds (JSON mode only, 30 minutes) | | `REDIS_HOST` | ❌ No | `"localhost"` | Redis server host (Redis mode only) | | `REDIS_PORT` | ❌ No | `6379` | Redis server port (Redis mode only) | | `REDIS_DB` | ❌ No | `0` | Redis database number (Redis mode only) | | `REDIS_PASSWORD` | ❌ No | `None` | Redis password (Redis mode only, optional) | | `REDIS_MAX_CONNECTIONS` | ❌ No | `10` | Redis max connections (Redis mode only) | | `QDRANT_HOST` | ❌ No | `"127.0.0.1"` | Qdrant server host | | `QDRANT_PORT` | ❌ No | `6333` | Qdrant server port | | `MAX_TOKENS` | ❌ No | `200000` | Maximum tokens for context window (multiplied by 3 for CharTokenCounter) | | `DJ_COPILOT_SERVICE_HOST` | ❌ No | `"127.0.0.1"` | Service host address | | `DJ_COPILOT_ENABLE_LOGGING` | ❌ No | `"true"` | Enable session logging | | `DJ_COPILOT_LOG_DIR` | ❌ No | `"./logs"` | Log directory | | `FASTAPI_CONFIG_PATH` | ❌ No | `""` | Path to FastAPI config JSON file | | `SAFE_CHECK_HANDLER_PATH` | ❌ No | `""` | Path to custom safe check handler | ### Model Configuration In `app_deploy.py`, you can configure the language model to use: ```python model=DashScopeChatModel( "qwen3-max-2026-01-23", # Model name api_key=os.getenv("DASHSCOPE_API_KEY"), stream=True, # Enable streaming response enable_thinking=True, # Enable thinking mode ) ``` The formatter uses `MAX_TOKENS` environment variable (default: 200000) to limit the context window size. Since `CharTokenCounter` counts characters and approximately 3 characters ≈ 1 token for mixed Chinese and English text, the value is multiplied by 3 when passed to `DashScopeChatFormatter`. ### Session Storage Configuration **JSON Mode (Default):** - Session history is stored as JSON files in `SESSION_STORE_DIR` directory - Automatic TTL-based cleanup runs every `SESSION_CLEANUP_INTERVAL` seconds - Sessions expire after `SESSION_TTL_SECONDS` seconds of inactivity - No external dependencies required **Redis Mode:** - Session history is stored in Redis - Session state is automatically managed by `RedisMemory` - TTL is handled by Redis server configuration (not application-level) - Requires Redis server to be running ### FAQ RAG Configuration The FAQ RAG system uses the following configuration: - **Vector Database**: Qdrant (running in Docker container) - **Embedding Model**: DashScope text-embedding-v4 - **Vector Dimension**: 1024 - **Data Source**: `qa-copilot/rag_utils/faq.txt` - **Storage Location**: `qa-copilot/rag_utils/qdrant_storage` - **Qdrant Host**: Configurable via `QDRANT_HOST` (default: `127.0.0.1`) - **Qdrant Port**: Configurable via `QDRANT_PORT` (default: `6333`) The system automatically checks if RAG data is initialized on startup. If not initialized, it will automatically read the FAQ file and create vector indexes. ## Troubleshooting ### Common Issues 1. **Docker/Qdrant Issues** - Ensure Docker service is running: `docker --version` - Check Qdrant container status: `docker ps | grep qdrant` - Manually start Qdrant container: `docker start qdrant` - Check if Qdrant port is occupied: `netstat -tlnp | grep 6333` - To reinitialize RAG data, delete the `qa-copilot/rag_utils/qdrant_storage` directory and restart the service 2. **Redis connection failure** (when using `SESSION_STORE_TYPE=redis`) - Ensure Redis service is running: `redis-cli ping` - Check if Redis port is occupied: `netstat -tlnp | grep 6379` (or your configured `REDIS_PORT`) - Verify Redis configuration: Check `REDIS_HOST`, `REDIS_PORT`, `REDIS_DB`, and `REDIS_PASSWORD` environment variables - Note: Redis TTL is managed by Redis server, not by the application 3. **MCP service startup failure** - Ensure `GITHUB_TOKEN` is set and correct (required environment variable) - Verify GitHub token has necessary permissions for MCP integration 4. **API Key error** - Verify `DASHSCOPE_API_KEY` environment variable is correctly set - Confirm API Key is valid and has sufficient quota 5. **FAQ retrieval returns no results** - Confirm FAQ file `qa-copilot/rag_utils/faq.txt` exists and is properly formatted - Check if Qdrant container is running normally - Review logs to confirm RAG data was successfully initialized ## Acknowledgments Parts of this project's code are adapted from the following open-source projects: - **FAQ RAG System & GitHub MCP Integration**: Adapted from the implementation in [AgentScope Samples - Alias](https://github.com/agentscope-ai/agentscope-samples/tree/main/alias) Special thanks to the AgentScope team for their excellent framework and sample code! ## License This project uses the same license as the main project. For details, please refer to the [LICENSE](../LICENSE) file. ## Related Links - [Data-Juicer Official Repository](https://github.com/datajuicer/data-juicer) - [AgentScope Framework](https://github.com/agentscope-ai/agentscope) - [AgentScope Samples](https://github.com/agentscope-ai/agentscope-samples) - [GitHub MCP Server](https://github.com/github/github-mcp-server)