Oboyu Query Engine Architecture
Overview
The Query Engine is the user-facing component of Oboyu, handling search requests, executing queries against the indexes, and delivering relevant results. It supports multiple search modes and provides specialized processing for Japanese queries.
Design Goals
- Process search queries with language-appropriate handling
- Support multiple search modes (vector, BM25, hybrid)
- Deliver relevant results with appropriate ranking
- Provide effective snippet generation for context
- Offer flexible integration through CLI and MCP interfaces
Component Structure
Query Processor
The Query Processor handles:
- Query analysis and normalization
- Language detection for query text
- Japanese-specific query processing
- Query expansion and refinement
def process_query(query_text):
    """Analyze and normalize a raw query; return it ready for search."""
    # Implementation details omitted
    ...
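As a rough illustration of the language-detection step listed above, the sketch below flags a query as Japanese when it contains hiragana, katakana, or kanji. The names `contains_japanese` and `detect_language` and the Unicode-range heuristic are assumptions made for this example, not Oboyu's actual detector. Kanji-only text could equally be Chinese, so a real detector would likely need more signal.

```python
# Unicode ranges used as a simple heuristic (an assumption for illustration).
_JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji)
)


def contains_japanese(text: str) -> bool:
    """Return True if any character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in _JAPANESE_RANGES)


def detect_language(query_text: str) -> str:
    """Return "ja" for queries containing Japanese script, otherwise "en"."""
    return "ja" if contains_japanese(query_text) else "en"
```

With this heuristic a mixed query such as "REST APIの設計パターン" is treated as Japanese, so it would be routed through the Japanese-specific processing path.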
Search Engine
The Search Engine implements multiple search strategies:
- Vector search using HNSW index
- BM25 search using full-text index
- Hybrid search combining both approaches
- Result ranking and scoring
def search(query, mode="hybrid", **options):
    """Run the selected search strategy; return ranked search results."""
    # Implementation details omitted
    ...
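One way the mode dispatch could look is sketched below. It is not Oboyu's implementation: `run_search`, the `SearchFn` alias, and the idea of passing the two search functions and the fusion function as arguments are all assumptions chosen to keep the example self-contained. For hybrid mode, both searches are submitted to a thread pool so they can overlap before their ranked lists are fused.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# A search function takes the query and returns document IDs, best first.
SearchFn = Callable[[str], List[str]]
FuseFn = Callable[[List[str], List[str]], List[str]]


def run_search(query: str, mode: str, vector_fn: SearchFn,
               bm25_fn: SearchFn, fuse_fn: FuseFn) -> List[str]:
    """Dispatch to one strategy, or run both in parallel for hybrid mode."""
    if mode == "vector":
        return vector_fn(query)
    if mode == "bm25":
        return bm25_fn(query)
    if mode == "hybrid":
        # Submit both searches so they can overlap, then fuse the two lists.
        with ThreadPoolExecutor(max_workers=2) as pool:
            vector_future = pool.submit(vector_fn, query)
            bm25_future = pool.submit(bm25_fn, query)
            return fuse_fn(vector_future.result(), bm25_future.result())
    raise ValueError(f"unknown search mode: {mode!r}")
```

Passing the strategies in as callables keeps the dispatcher independent of any particular index, which also makes it easy to test with stand-in functions.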
Result Formatter
The Result Formatter prepares search results for presentation:
- Formats documents with metadata
- Generates contextual snippets
- Highlights matching terms
- Provides relevance explanations
def format_results(results, query):
    """Attach metadata, snippets, and highlights; return results for presentation."""
    # Implementation details omitted
    ...
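Snippet generation and highlighting could be approximated as follows. This is a minimal sketch under stated assumptions: `make_snippet`, `highlight`, the `**` marker, and the "window centered on the first match" strategy are illustrative choices, not Oboyu's formatter. The default length of 160 mirrors the `snippet_length` setting described in the configuration section.

```python
import re
from typing import List, Optional, Pattern


def _term_pattern(terms: List[str]) -> Optional[Pattern[str]]:
    """Compile a case-insensitive pattern matching any non-empty term."""
    cleaned = sorted((t for t in terms if t), key=len, reverse=True)
    if not cleaned:
        return None
    return re.compile("|".join(re.escape(t) for t in cleaned), re.IGNORECASE)


def make_snippet(text: str, terms: List[str], length: int = 160) -> str:
    """Return up to `length` characters centered on the first matching term."""
    pattern = _term_pattern(terms)
    match = pattern.search(text) if pattern else None
    # Fall back to the start of the text when nothing matches.
    center = match.start() if match else 0
    start = max(0, center - length // 2)
    return text[start:start + length]


def highlight(snippet: str, terms: List[str], marker: str = "**") -> str:
    """Wrap every term match in `marker`, preserving the original casing."""
    pattern = _term_pattern(terms)
    if pattern is None:
        return snippet
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", snippet)
```

For Japanese text the terms would normally be the tokens produced by the tokenizer, since there are no spaces to split on.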
Search Modes
Oboyu supports three distinct search modes, each optimized for different types of queries:
Vector Search
Vector search performs semantic similarity matching using embeddings:
- Converts query to high-dimensional vector representation
- Performs approximate nearest neighbor search via HNSW index
- Ranks results by cosine similarity
- Best for: conceptual queries, semantic understanding, synonyms
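The ranking step can be illustrated with an exhaustive cosine-similarity comparison, which is the exact computation that an HNSW index approximates at scale. The function names and the in-memory `doc_vecs` dictionary below are assumptions for this sketch. A real deployment would obtain the vectors from an embedding model and query the index instead of scanning every document.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]],
                   top_k: int = 5) -> List[Tuple[str, float]]:
    """Score every document against the query and return the best `top_k`."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The brute-force scan is linear in the number of documents. The approximate index trades a small amount of recall for much faster lookups.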
Example CLI usage:
# Search for conceptually similar content
oboyu query --mode vector "機械学習の基本概念"
oboyu query --mode vector "What are design patterns?"
When to use:
- Queries about concepts rather than specific terms
- When looking for semantically related content
- Cross-lingual or synonym matching needs
BM25 Search
BM25 search performs keyword-based matching with term frequency analysis:
- Tokenizes query using language-appropriate processing (MeCab for Japanese)
- Executes BM25 ranking algorithm (k1=1.2, b=0.75)
- Scores documents based on term frequency and document length normalization
- Best for: exact keyword matching, specific terminology, precise queries
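To make the scoring concrete, here is a small sketch of BM25 over pre-tokenized documents, using the k1=1.2 and b=0.75 values quoted above. The function name, the dictionary-based corpus, and the choice of the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` are assumptions for illustration. Oboyu's full-text index may compute IDF differently.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query_terms: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Return a BM25 score per document for already-tokenized input."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: how many documents contain each query term.
    doc_freq = {t: sum(1 for tokens in docs.values() if t in tokens)
                for t in set(query_terms)}
    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized in proportion to b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

In this sketch a document matching both query terms outranks one matching a single term, and documents with no matching terms score zero.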
Example CLI usage:
# Search for specific keywords
oboyu query --mode bm25 "データベース設計"
oboyu query --mode bm25 "REST API implementation"
When to use:
- Looking for specific terms or keywords
- Technical documentation with precise terminology
- When exact word matching is important
Hybrid Search (Default)
Hybrid search combines both approaches for optimal results using RRF (Reciprocal Rank Fusion), a proven rank-based fusion method:
- Parallel Execution: Executes both vector and BM25 searches simultaneously for efficiency
- Rank-based Fusion: Uses RRF algorithm instead of score-based weighting for more robust results
- Configurable RRF Parameter: Uses a configurable k parameter (default: 60) to control fusion behavior
- Result Fusion: Merges and re-ranks results using ranks rather than scores for better handling of different scoring systems
- Best for: most general queries, balanced precision and recall, complex information needs, proper nouns and technical terms
How RRF Hybrid Search Works:
- Query Processing: The same query is processed for both search methods
- Parallel Search: Vector and BM25 searches execute simultaneously
- Rank Assignment: Each method assigns ranks to documents (1st, 2nd, 3rd, etc.)
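- RRF Calculation: Each document's final score = 1/(k + rank_vector) + 1/(k + rank_bm25), where k=60 by default
- Result Combination: The two result lists are merged so that each document carries a single RRF score
- Ranking: Results are sorted by RRF score (higher is better) and the highest-scoring results are returned, up to the requested result count (which is unrelated to the RRF parameter k)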
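The fusion steps above can be sketched in a few lines. The function name `rrf_fuse` and its `limit` argument are hypothetical. The sketch also assumes that a document missing from one ranked list simply receives no contribution from that list, which the description above does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60,
             limit: int = 10) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best document in a list contributes 1/(k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:limit]


vector_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
print([doc_id for doc_id, _ in rrf_fuse([vector_ranking, bm25_ranking])])
# ['b', 'a', 'd', 'c']
```

Document "b" wins because it ranks near the top of both lists, while "c" and "d" each appear in only one list and fall behind.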
RRF Configuration:
# Default RRF hybrid search (recommended for most use cases)
oboyu query "Pythonでの非同期処理の実装方法"
# Custom RRF parameter for different fusion behavior
oboyu query --rrf-k 30 "database optimization techniques" # More aggressive fusion
oboyu query --rrf-k 100 "REST API status codes" # More conservative fusion
RRF Parameter Effects:
- Lower k (e.g., 30): More aggressive fusion, higher weight to top-ranked results from each method
- Higher k (e.g., 100): More conservative fusion, more balanced contribution from all ranks
- Default k=60: A balanced setting that suits most content types and query patterns
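A quick calculation shows why lower k values favor top-ranked results. The helper below (a hypothetical name, used only for this illustration) compares how much a rank-1 result contributes relative to a rank-10 result under the RRF term 1/(k + rank).

```python
def top_rank_advantage(k: int, low_rank: int = 10) -> float:
    """Ratio of the RRF contribution at rank 1 to that at `low_rank`."""
    return (1.0 / (k + 1)) / (1.0 / (k + low_rank))


for k in (30, 60, 100):
    print(f"k={k}: rank 1 counts {top_rank_advantage(k):.2f}x as much as rank 10")
# k=30: rank 1 counts 1.29x as much as rank 10
# k=60: rank 1 counts 1.15x as much as rank 10
# k=100: rank 1 counts 1.09x as much as rank 10
```

As k grows, the ratio approaches 1, so contributions flatten out across ranks and agreement between the two methods matters more than any single top position.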
Performance Benefits:
- Comprehensive Coverage: Finds both semantically similar and keyword-matching documents
- Robustness: Rank-based fusion is more stable than score-based weighting across different content types
- Better Term Handling: Superior performance on proper nouns and technical terminology
- Efficiency: Parallel execution means minimal performance penalty over single methods
- Parameter Simplicity: Single k parameter is easier to tune than dual weight system
Japanese Query Support
Oboyu provides advanced Japanese language support through specialized processing:
Tokenization
Japanese text is processed using the MeCab morphological analyzer:
- MeCab with fugashi: Primary tokenizer for Japanese text
- Part-of-speech filtering: Extracts content words (nouns, verbs, adjectives)
- Stop word removal: Filters common particles and auxiliary words
- Fallback tokenizer: Simple character-based tokenization when MeCab is unavailable
Example tokenization:
Input: "機械学習ではPythonがよく使われています"
Output: ["機械学習", "Python", "使わ", "れる"]
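The fallback behavior might be structured as in the sketch below. The function names, the regular expression, and the rule "keep ASCII word runs whole, emit every other non-space character as its own token" are assumptions about what a simple character-based tokenizer could look like. Part-of-speech filtering and stop word removal are omitted because the feature names depend on the installed MeCab dictionary.

```python
import re
from typing import List

try:
    # fugashi wraps MeCab; building a Tagger also requires an installed dictionary.
    from fugashi import Tagger
    _tagger = Tagger()
except Exception:
    # Covers both a missing package and a missing dictionary.
    _tagger = None

# ASCII word runs stay whole; every other non-space character is one token.
_FALLBACK_RE = re.compile(r"[A-Za-z0-9_]+|\S")


def fallback_tokenize(text: str) -> List[str]:
    """Character-based tokenization used when MeCab is unavailable."""
    return _FALLBACK_RE.findall(text)


def tokenize(text: str) -> List[str]:
    """Use MeCab surface forms when available, else the fallback tokenizer."""
    if _tagger is not None:
        return [word.surface for word in _tagger(text)]
    return fallback_tokenize(text)
```

Character-level tokens match far more loosely than morphemes, so BM25 precision is likely to drop when the fallback is in use.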
Character Normalization
Text normalization ensures consistent matching:
- Unicode normalization: Converts to NFKC form
- Character variant handling: Hiragana/katakana conversion when needed
- Width normalization: Full-width ↔ half-width character conversion
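These normalization steps map directly onto Python's standard library. In the sketch below, `normalize_query` and `katakana_to_hiragana` are hypothetical names. NFKC normalization already folds full-width ASCII to half-width and half-width katakana to full-width, and the kana conversion relies on the fixed 0x60 offset between the katakana and hiragana blocks.

```python
import unicodedata


def normalize_query(text: str) -> str:
    """Apply NFKC, which also unifies full-width and half-width variants."""
    return unicodedata.normalize("NFKC", text)


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters (U+30A1 to U+30F6) to their hiragana forms."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


print(normalize_query("Ｐｙｔｈｏｎ　ﾃﾞｰﾀ"))  # Python データ
print(katakana_to_hiragana("カタカナ"))       # かたかな
```

The same normalization has to be applied at indexing time and at query time, otherwise normalized queries will not match unnormalized index terms.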
Query Processing Examples
# Natural Japanese queries
oboyu query "機械学習のアルゴリズムについて教えて"
oboyu query "Pythonでのデータ処理方法"
# Mixed Japanese-English queries
oboyu query "REST APIの設計パターン"
oboyu query "データベースのNormalization理論"
# Technical terminology
oboyu query "非同期処理とPromise"
oboyu query "マイクロサービスアーキテクチャの利点"
Search Mode Recommendations for Japanese
- Vector search: Best for conceptual Japanese queries
- BM25 search: Excellent for specific Japanese technical terms
- Hybrid search: Optimal balance for most Japanese content
Data Flow
- Receive query from user via CLI or MCP
- Process query with language-specific handling
- Execute search using appropriate mode
- Rank and format results
- Return formatted results to user
Configuration Options
The Query Engine is configured through the following settings in config.yaml:
query:
  default_mode: "hybrid"    # Default search mode
  rrf_k: 60                 # RRF parameter for hybrid search (default: 60)
  top_k: 5                  # Number of results to return
  snippet_length: 160       # Character length for snippets
  highlight_matches: true   # Whether to highlight matching terms
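Once the YAML file has been parsed into a dictionary, the settings could be represented and validated as in the following sketch. `QueryConfig`, `load_query_config`, and the validation rules (known mode, positive integers) are assumptions for illustration. Only the key names and default values come from the configuration shown above.

```python
from dataclasses import dataclass
from typing import Any, Dict

VALID_MODES = ("vector", "bm25", "hybrid")


@dataclass(frozen=True)
class QueryConfig:
    """Query settings, with defaults matching the config.yaml example."""
    default_mode: str = "hybrid"
    rrf_k: int = 60
    top_k: int = 5
    snippet_length: int = 160
    highlight_matches: bool = True

    def __post_init__(self) -> None:
        if self.default_mode not in VALID_MODES:
            raise ValueError(f"default_mode must be one of {VALID_MODES}")
        if min(self.rrf_k, self.top_k, self.snippet_length) <= 0:
            raise ValueError("rrf_k, top_k, and snippet_length must be positive")


def load_query_config(raw: Dict[str, Any]) -> QueryConfig:
    """Build a QueryConfig from parsed YAML, using defaults for missing keys."""
    return QueryConfig(**raw.get("query", {}))
```

Validating at load time means a misspelled mode or a non-positive value is reported before the first query runs, not in the middle of a search.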
Command Line Interface
The Query Engine provides a comprehensive command-line interface:
Basic Usage
# Default hybrid search
oboyu query "システムの設計原則について教えてください"
# Specify search mode
oboyu query --mode vector "What are the key concepts?"
oboyu query --mode bm25 "データベース設計"
oboyu query --mode hybrid "machine learning algorithms"
Advanced Options
# Control number of results
oboyu query --limit 10 "Python programming best practices"
# Adjust RRF parameter for hybrid search
oboyu query --rrf-k 30 "システム設計" # More aggressive fusion
# Use different language settings
oboyu query --language ja "英語ドキュメントの日本語検索"
# Enable reranker for improved results
oboyu query --use-reranker "complex technical concepts"
Complete Options
oboyu query [OPTIONS] QUERY
Options:
--mode [vector|bm25|hybrid] Search mode (default: hybrid)
--limit INTEGER Number of results (default: 10)
--rrf-k INTEGER RRF parameter for hybrid search (default: 60)
--language TEXT Language hint for processing
--use-reranker / --no-reranker Enable reranking (default: auto)
--help Show this message and exit
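For readers who want to experiment with the same option surface, the listing above can be mirrored with the standard `argparse` module. This is only an illustrative stand-in: the actual `oboyu` CLI may be built with a different library, and `build_parser` is a hypothetical name. The reranker flags share one destination so that the unset state (`None`) can represent the "auto" default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented `oboyu query` options."""
    parser = argparse.ArgumentParser(prog="oboyu query")
    parser.add_argument("query", help="Search query text")
    parser.add_argument("--mode", choices=["vector", "bm25", "hybrid"],
                        default="hybrid", help="Search mode")
    parser.add_argument("--limit", type=int, default=10, help="Number of results")
    parser.add_argument("--rrf-k", type=int, default=60,
                        help="RRF parameter for hybrid search")
    parser.add_argument("--language", help="Language hint for processing")
    # Both flags write to `use_reranker`; None means neither flag was given.
    parser.add_argument("--use-reranker", dest="use_reranker",
                        action="store_true", default=None)
    parser.add_argument("--no-reranker", dest="use_reranker",
                        action="store_false", default=None)
    return parser


args = build_parser().parse_args(["--mode", "bm25", "--rrf-k", "30", "データベース設計"])
print(args.mode, args.rrf_k, args.limit, args.use_reranker)
# bm25 30 10 None
```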
Real-world Examples
# Research query with semantic understanding
oboyu query --mode vector "distributed systems consistency models"
# Exact terminology lookup
oboyu query --mode bm25 "REST API status codes 404"
# Balanced search for documentation
oboyu query "Pythonでの例外処理のベストプラクティス"
# Technical documentation with custom RRF parameter
oboyu query --rrf-k 30 --limit 15 "database normalization rules"
Performance Considerations
Search Mode Performance
- Vector search: Fast for small-medium datasets (< 100K documents)
- BM25 search: Scales well with large datasets, fast keyword lookups
- Hybrid search: Slightly slower but provides best quality results
Optimization Strategies
- Parallel processing: Vector and BM25 searches run in parallel during hybrid mode
- Index optimization: HNSW parameters tuned for Japanese content
- Tokenization caching: MeCab tokenization results cached for common queries
- Rank-based fusion: RRF combines ranks rather than raw scores, so no score normalization step is needed and neither method's score scale can dominate
Japanese-specific Optimizations
- Tokenizer selection: Automatic fallback from MeCab to simple tokenizer
- Character normalization: Preprocessing reduces search space
- Stop word filtering: Removes common Japanese particles for better performance
- Memory management: Efficient handling of Japanese Unicode strings
Tuning Recommendations
For large Japanese document collections (>50K docs):
# Use a higher k (more conservative fusion) so lower-ranked results from each method contribute more evenly
oboyu query --rrf-k 100 "検索クエリ"
For precise technical documentation:
# Use aggressive RRF for better top-result fusion
oboyu query --rrf-k 30 "API documentation patterns"
For general purpose searches:
# Default RRF parameter works best
oboyu query --mode hybrid "technical concepts"
Integration with Other Components
- Accesses indexed data via the database created by the Indexer
- Maintains clean separation from indexing and crawling processes
- Provides well-defined interfaces for external integration