Configuration Reference

The server loads configuration from a YAML file. You can specify the config file path using the -c/--config CLI flag or the DOCRAG_CONFIG environment variable.

Environment Variables Override

All configuration fields can be overridden using environment variables prefixed with DOCRAG_ and using a double underscore __ for nesting.

Examples: - Override storage directory: DOCRAG_STORAGE__DATA_DIR="/tmp/data" - Override embedding model: DOCRAG_EMBEDDING__MODEL="text-embedding-3-small" - Override vision enabled: DOCRAG_VISION__ENABLED="true"

Detailed Configuration Options

Collections Settings

Configure folders and matching patterns: - name: Unique name of the collection (conform to alphanumeric rules). - paths: Absolute paths to folders/files to index. - file_patterns: Glob patterns, e.g. ["*.txt", "*.md", "*.pdf"].

Embedding Settings

Configure the OpenAI-compatible embedding API: - base_url: Target endpoint (e.g. http://localhost:8080/v1 for lemonade). - api_key: Authorization API Key (use "unused" if endpoint does not require one). - model: Embedding model name. - dimensions: Vector dimensions size (e.g., 768 for Gemma). - batch_size: Maximum texts sent in a single batch request.

Local Chunking Settings

Configure document splitting limits: - max_chunk_size: Maximum number of tokens per chunk. - similarity_threshold: Threshold for semantic boundary splits. - local_model: Local model name (e.g. all-MiniLM-L6-v2) used for chunking boundaries.

Vision Settings (Optional)

Configure scanned PDF page-to-image extraction: - enabled: Set true to enable multimodal OCR fallback. - base_url: OpenAI-compatible vision completion endpoint. - model: Multimodal model name (e.g. gpt-4o).