Oboyu Query Engine Architecture

Overview

The Query Engine is the user-facing component of Oboyu, handling search requests, executing queries against the indexes, and delivering relevant results. It supports multiple search modes and provides specialized processing for Japanese queries.

Design Goals

Process search queries with language-appropriate handling
Support multiple search modes (vector, BM25, hybrid)
Deliver relevant results with appropriate ranking
Provide effective snippet generation for context
Offer flexible integration through CLI and MCP interfaces

Component Structure

Query Processor

The Query Processor handles:

Query analysis and normalization
Language detection for query text
Japanese-specific query processing
Query expansion and refinement

def process_query(query_text):
    # Implementation details
    # Returns processed query ready for search

Search Engine

The Search Engine implements multiple search strategies:

Vector search using HNSW index
BM25 search using full-text index
Hybrid search combining both approaches
Result ranking and scoring

def search(query, mode="hybrid", **options):
    # Implementation details
    # Returns ranked search results

Result Formatter

The Result Formatter prepares search results for presentation:

Formats documents with metadata
Generates contextual snippets
Highlights matching terms
Provides relevance explanations

def format_results(results, query):
    # Implementation details
    # Returns formatted results for presentation

Search Modes

Oboyu supports three distinct search modes, each optimized for different types of queries:

Vector Search

Vector search performs semantic similarity matching using embeddings:

Converts query to high-dimensional vector representation
Performs approximate nearest neighbor search via HNSW index
Ranks results by cosine similarity
Best for: conceptual queries, semantic understanding, synonyms

Example CLI usage:

# Search for conceptually similar content
oboyu query --mode vector "機械学習の基本概念"
oboyu query --mode vector "What are design patterns?"

When to use:

Queries about concepts rather than specific terms
When looking for semantically related content
Cross-lingual or synonym matching needs

BM25 Search

BM25 search performs keyword-based matching with term frequency analysis:

Tokenizes query using language-appropriate processing (MeCab for Japanese)
Executes BM25 ranking algorithm (k1=1.2, b=0.75)
Scores documents based on term frequency and document length normalization
Best for: exact keyword matching, specific terminology, precise queries

Example CLI usage:

# Search for specific keywords
oboyu query --mode bm25 "データベース設計"
oboyu query --mode bm25 "REST API implementation"

When to use:

Looking for specific terms or keywords
Technical documentation with precise terminology
When exact word matching is important

Hybrid Search (Default)

Hybrid search combines both approaches for optimal results using RRF (Reciprocal Rank Fusion), a proven rank-based fusion method:

Parallel Execution: Executes both vector and BM25 searches simultaneously for efficiency
Rank-based Fusion: Uses RRF algorithm instead of score-based weighting for more robust results
Configurable RRF Parameter: Uses a configurable k parameter (default: 60) to control fusion behavior
Result Fusion: Merges and re-ranks results using ranks rather than scores for better handling of different scoring systems
Best for: most general queries, balanced precision and recall, complex information needs, proper nouns and technical terms

How RRF Hybrid Search Works:

Query Processing: The same query is processed for both search methods
Parallel Search: Vector and BM25 searches execute simultaneously
Rank Assignment: Each method assigns ranks to documents (1st, 2nd, 3rd, etc.)
RRF Calculation: Final score = 1/(k + rank_vector) + 1/(k + rank_bm25)
Result Combination: Documents are scored using RRF formula where k=60 by default
Ranking: Results are sorted by RRF score (higher is better) and top-k selected

RRF Configuration:

# Default RRF hybrid search (recommended for most use cases)
oboyu query --query "Pythonでの非同期処理の実装方法"

# Custom RRF parameter for different fusion behavior
oboyu query --rrf-k 30 "database optimization techniques"  # More aggressive fusion
oboyu query --rrf-k 100 "REST API status codes"           # More conservative fusion

RRF Parameter Effects:

Lower k (e.g., 30): More aggressive fusion, higher weight to top-ranked results from each method
Higher k (e.g., 100): More conservative fusion, more balanced contribution from all ranks
Default k=60: Optimal balance for most content types and query patterns

Performance Benefits:

Comprehensive Coverage: Finds both semantically similar and keyword-matching documents
Robustness: Rank-based fusion is more stable than score-based weighting across different content types
Better Term Handling: Superior performance on proper nouns and technical terminology
Efficiency: Parallel execution means minimal performance penalty over single methods
Parameter Simplicity: Single k parameter is easier to tune than dual weight system

Japanese Query Support

Oboyu provides advanced Japanese language support through specialized processing:

Tokenization

Japanese text is processed using MeCab morphological analyzer:

MeCab with fugashi: Primary tokenizer for Japanese text
Part-of-speech filtering: Extracts content words (nouns, verbs, adjectives)
Stop word removal: Filters common particles and auxiliary words
Fallback tokenizer: Simple character-based tokenization when MeCab is unavailable

Example tokenization:

Input:  "機械学習ではPythonがよく使われています"
Output: ["機械学習", "Python", "使わ", "れる"]

Character Normalization

Text normalization ensures consistent matching:

Unicode normalization: Converts to NFKC form
Character variant handling: ひらがな/カタカナ conversion when needed
Width normalization: Full-width ↔ half-width character conversion

Query Processing Examples

# Natural Japanese queries
oboyu query --query "機械学習のアルゴリズムについて教えて"
oboyu query --query "Pythonでのデータ処理方法"

# Mixed Japanese-English queries
oboyu query --query "REST APIの設計パターン"
oboyu query --query "データベースのNormalization理論"

# Technical terminology
oboyu query --query "非同期処理とPromise"
oboyu query --query "マイクロサービスアーキテクチャの利点"

Search Mode Recommendations for Japanese

Vector search: Best for conceptual Japanese queries
BM25 search: Excellent for specific Japanese technical terms
Hybrid search: Optimal balance for most Japanese content

Data Flow

Receive query from user via CLI or MCP
Process query with language-specific handling
Execute search using appropriate mode
Rank and format results
Return formatted results to user

Configuration Options

The Query Engine is configured through the following settings in config.yaml:

query:
  default_mode: "hybrid"         # Default search mode
  rrf_k: 60                      # RRF parameter for hybrid search (default: 60)
  top_k: 5                       # Number of results to return
  snippet_length: 160            # Character length for snippets
  highlight_matches: true        # Whether to highlight matching terms

Command Line Interface

The Query Engine provides a comprehensive command-line interface:

Basic Usage

# Default hybrid search
oboyu query --query "システムの設計原則について教えてください"

# Specify search mode
oboyu query --mode vector "What are the key concepts?"
oboyu query --mode bm25 "データベース設計"
oboyu query --mode hybrid "machine learning algorithms"

Advanced Options

# Control number of results
oboyu query --limit 10 "Python programming best practices"

# Adjust RRF parameter for hybrid search
oboyu query --rrf-k 30 "システム設計"  # More aggressive fusion

# Use different language settings
oboyu query --language ja "英語ドキュメントの日本語検索"

# Enable reranker for improved results
oboyu query --use-reranker "complex technical concepts"

Complete Options

oboyu query [OPTIONS] QUERY

Options:
  --mode [vector|bm25|hybrid]     Search mode (default: hybrid)
  --limit INTEGER                 Number of results (default: 10)
  --rrf-k INTEGER                RRF parameter for hybrid search (default: 60)
  --language TEXT                Language hint for processing
  --use-reranker / --no-reranker Enable reranking (default: auto)
  --help                         Show this message and exit

Real-world Examples

# Research query with semantic understanding
oboyu query --mode vector "distributed systems consistency models"

# Exact terminology lookup
oboyu query --mode bm25 "REST API status codes 404"

# Balanced search for documentation
oboyu query --query "Pythonでの例外処理のベストプラクティス"

# Technical documentation with custom RRF parameter
oboyu query --rrf-k 30 --limit 15 "database normalization rules"

Performance Considerations

Search Mode Performance

Vector search: Fast for small-medium datasets (< 100K documents)
BM25 search: Scales well with large datasets, fast keyword lookups
Hybrid search: Slightly slower but provides best quality results

Optimization Strategies

Parallel processing: Vector and BM25 searches run in parallel during hybrid mode
Index optimization: HNSW parameters tuned for Japanese content
Tokenization caching: MeCab tokenization results cached for common queries
Score normalization: Efficient min-max normalization prevents score dominance

Japanese-specific Optimizations

Tokenizer selection: Automatic fallback from MeCab to simple tokenizer
Character normalization: Preprocessing reduces search space
Stop word filtering: Removes common Japanese particles for better performance
Memory management: Efficient handling of Japanese Unicode strings

Tuning Recommendations

For large Japanese document collections (>50K docs):

# Use conservative RRF for better scaling
oboyu query --rrf-k 100 "検索クエリ"

For precise technical documentation:

# Use aggressive RRF for better top-result fusion
oboyu query --rrf-k 30 "API documentation patterns"

For general purpose searches:

# Default RRF parameter works best
oboyu query --mode hybrid "technical concepts"

Integration with Other Components

Accesses indexed data via the database created by the Indexer
Maintains clean separation from indexing and crawling processes
Provides well-defined interfaces for external integration

Overview​

Design Goals​

Component Structure​

Query Processor​

Search Engine​

Result Formatter​

Search Modes​

Vector Search​

BM25 Search​

Hybrid Search (Default)​

Japanese Query Support​

Tokenization​

Character Normalization​

Query Processing Examples​

Search Mode Recommendations for Japanese​

Data Flow​

Configuration Options​

Command Line Interface​

Basic Usage​

Advanced Options​

Complete Options​

Real-world Examples​

Performance Considerations​

Search Mode Performance​

Optimization Strategies​

Japanese-specific Optimizations​

Tuning Recommendations​

Integration with Other Components​