Japanese Language Support in Oboyu
Oboyu provides first-class support for Japanese text processing, with features designed for the specific characteristics of the Japanese language: automatic encoding detection, intelligent text normalization, morphological analysis, and search algorithms that understand Japanese grammar and writing systems.
Table of Contents
- Core Features
- Encoding Detection
- Text Normalization
- Tokenization and Morphological Analysis
- Search Optimization
- Mixed Language Support
- Performance Benchmarks
- Best Practices
- Troubleshooting
- Advanced Configuration
Core Features
Oboyu's Japanese support includes:
- 🔍 Automatic Encoding Detection: Handles UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP
- 🔤 Intelligent Normalization: Full/half-width conversion, variant unification
- 🔬 Morphological Analysis: MeCab and SentencePiece integration
- 🎯 Optimized Models: Japanese-specific embedding models (Ruri v3)
- 🌏 Mixed Language: Seamless Japanese-English document handling
- ⚡ High Performance: Optimized for Japanese text characteristics
Encoding Detection
Overview
Japanese text files often use various character encodings for historical reasons. Oboyu automatically detects and handles multiple Japanese encodings so that all documents are indexed correctly.
Supported Encodings
Oboyu supports automatic detection and conversion of common Japanese encodings:
- UTF-8: Modern standard encoding (default)
- Shift-JIS: Windows Japanese encoding (CP932)
- EUC-JP: Unix/Linux Japanese encoding
- ISO-2022-JP: Email and older systems encoding
How It Works
When processing files, Oboyu:
- Attempts UTF-8 first as the modern standard
- Detects encoding using byte patterns specific to Japanese encodings
- Converts to UTF-8 internally for consistent processing
- Handles mixed encodings in different files within the same directory
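A minimal sketch of this try-first, fall-back decoding flow is shown below. It is illustrative only; the function name and candidate list are assumptions based on the supported encodings above, not Oboyu's actual implementation:
def read_japanese_text(path):
    # Candidate encodings in priority order (mirrors the supported list above)
    candidates = ["utf-8", "shift_jis", "euc_jp", "cp932", "iso2022_jp"]
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            return raw.decode(encoding)  # success: return the decoded text
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    return raw.decode("utf-8", errors="replace")  # last resort: replace undecodable bytes
In practice, bytes that are valid in one Japanese encoding can sometimes decode without error under another, which is why real detectors also score byte patterns rather than stopping at the first decode that succeeds.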
Automatic Detection
Oboyu automatically detects and handles Japanese encodings without requiring configuration. The system intelligently identifies the encoding of each file and converts it to UTF-8 for consistent processing.
Text Normalization
Unicode Normalization
All Japanese text undergoes Unicode NFKC (Normalization Form KC) normalization to ensure consistent matching:
Input: "hello ワールド" (mixed width)
Output: "hello ワールド" (normalized)
This normalization handles:
- Full-width to half-width conversion for ASCII characters
- Character variant unification (異体字の統一)
- Combining character normalization
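NFKC normalization is available directly in Python's standard library, so its effect is easy to observe (the example string is illustrative):
import unicodedata
text = "ｈｅｌｌｏ ﾜｰﾙﾄﾞ １２３"  # full-width Latin, half-width katakana, full-width digits
print(unicodedata.normalize("NFKC", text))  # -> "hello ワールド 123"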
Character Width Normalization
Oboyu normalizes character widths for consistent search:
Full-width ASCII: ＡＢＣ１２３ → Half-width: ABC123
Half-width katakana: ｶﾀｶﾅ → Full-width: カタカナ
Benefits
- Consistent search results regardless of input method
- Improved matching between different document sources
- Reduced index size through normalization
Japanese Text Processing Pipeline
Processing Steps
- File Reading
  - Detect encoding using configured encoding list
  - Read file with detected encoding
- Text Extraction
  - Convert to UTF-8 if needed
  - Extract text content
- Normalization
  - Apply Unicode NFKC normalization
  - Normalize character widths
  - Clean up whitespace
- Tokenization (for BM25 search)
  - Use MeCab for morphological analysis
  - Extract meaningful tokens
Example Pipeline
# Original file (Shift-JIS encoded)
"データベース 設計 パターン"
# After encoding detection and conversion
"データベース 設計 パターン" (UTF-8)
# After normalization
"データベース 設計 パターン" (normalized spaces)
# After tokenization
["データベース", "設計", "パターン"]
Tokenization and Morphological Analysis
MeCab Integration
Oboyu integrates with MeCab for accurate Japanese morphological analysis:
# Install MeCab (optional but recommended)
# macOS
brew install mecab mecab-ipadic
# Ubuntu/Debian
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
# Install Python binding
pip install mecab-python3
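Once installed, mecab-python3 can be used directly to inspect how a sentence is segmented (a minimal standalone example; the exact output depends on the installed dictionary):
import MeCab
tagger = MeCab.Tagger("-Owakati")  # wakati-gaki output: surface forms separated by spaces
tokens = tagger.parse("自然言語処理を学習する").split()
print(tokens)  # e.g. ['自然', '言語', '処理', 'を', '学習', 'する'] with IPAdic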
Tokenization Examples
Basic Tokenization
# Input text
"自然言語処理を学習する"
# MeCab tokenization
["自然", "言語", "処理", "を", "学習", "する"]
# With part-of-speech filtering (nouns and verbs only)
["自然", "言語", "処理", "学習"]
Complex Sentences
# Input
"機械学習モデルの性能を向上させるための最適化手法について説明します。"
# Tokenized output
["機械", "学習", "モデル", "性能", "向上", "最適", "化", "手法", "説明"]
Named Entity Recognition
# Company names and products
"株式会社OpenAIのGPT-4は革新的なAIモデルです。"
# Intelligent tokenization preserves entities
["株式会社", "OpenAI", "GPT-4", "革新", "的", "AI", "モデル"]
SentencePiece Fallback
When MeCab is not available, Oboyu uses SentencePiece:
# SentencePiece subword tokenization
"未知の専門用語も適切に処理"
["▁未知", "の", "▁専門", "用語", "も", "▁適切", "に", "▁処理"]
Search Optimization
Japanese-Specific Ranking
Oboyu optimizes search ranking for Japanese characteristics:
1. Compound Word Handling
# Query: "データベース"
# Matches both:
- "データベース" (exact match)
- "データ" + "ベース" (compound components)
2. Reading Variations
# Handles different readings
- "会議" (kaigi)
- "打ち合わせ" (uchiawase)
- "ミーティング" (meeting)
# All treated as related concepts
3. Script Mixing
# Query: "AI技術"
# Intelligently matches:
- "AI技術"
- "人工知能技術"
- "エーアイ技術"
Search Examples
Technical Documentation Search
# Search for machine learning concepts
oboyu query --query "機械学習アルゴリズム"
# Results ranked by:
# 1. Exact phrase matches
# 2. All terms present
# 3. Partial term matches
# 4. Semantic similarity
Business Document Search
# Search for meeting notes
oboyu query --query "営業会議 議事録 2024年"
# Intelligent date parsing and entity recognition
Code Documentation
# Mixed language search
oboyu query --query "async/await 非同期処理"
# Handles technical terms in both languages
Mixed Language Support
Seamless Language Detection
Oboyu automatically detects and handles mixed language content:
# Document with mixed content
"""
API設計のベストプラクティス
1. RESTful Design Principles
- リソース指向アーキテクチャ
- HTTPメソッドの適切な使用
2. Error Handling
- エラーコードの統一
- 日本語エラーメッセージの提供
"""
Language-Aware Chunking
# Intelligent chunk boundaries
- Respects sentence boundaries in both languages
- Maintains context across language switches
- Preserves code blocks and technical terms
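A simplified sketch of sentence-boundary splitting that works across Japanese and English punctuation (an illustration of the idea, not Oboyu's chunker):
import re
def split_sentences(text):
    # Split after Japanese (。！？) or English (.!?) sentence-ending punctuation
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in parts if p]
mixed = "API設計のベストプラクティスを説明します。Use consistent error codes. エラーメッセージは日本語で提供します。"
print(split_sentences(mixed))
# -> ['API設計のベストプラクティスを説明します。', 'Use consistent error codes.', 'エラーメッセージは日本語で提供します。']
A real chunker would then pack consecutive sentences into chunks up to the configured chunk_size, carrying over chunk_overlap characters between adjacent chunks.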
Search Across Languages
# Japanese query finding English content
oboyu query --query "認証システム"
# Also finds: "authentication system", "auth module"
# English query finding Japanese content
oboyu query --query "database optimization"
# Also finds: "データベース最適化", "DB高速化"
Performance Benchmarks
Encoding Detection Speed
| File Size | UTF-8 | Shift-JIS | EUC-JP | Mixed |
|---|---|---|---|---|
| 1KB | 0.1ms | 0.3ms | 0.3ms | 0.5ms |
| 100KB | 0.5ms | 1.2ms | 1.1ms | 2.1ms |
| 1MB | 2.1ms | 4.8ms | 4.5ms | 8.2ms |
| 10MB | 18ms | 42ms | 39ms | 71ms |
Tokenization Performance
| Text Length | MeCab | SentencePiece | Hybrid |
|---|---|---|---|
| 100 chars | 0.8ms | 1.2ms | 0.9ms |
| 1,000 chars | 3.2ms | 4.8ms | 3.5ms |
| 10,000 chars | 28ms | 41ms | 30ms |
Search Performance (10,000 documents)
| Query Type | BM25 | Vector | Hybrid |
|---|---|---|---|
| Single term | 12ms | 45ms | 52ms |
| Phrase | 18ms | 46ms | 58ms |
| Complex | 25ms | 48ms | 65ms |
Best Practices
Document Preparation
- Consistent Encoding
  # Convert legacy files to UTF-8
  iconv -f SHIFT-JIS -t UTF-8 old_file.txt > new_file.txt
- Structured Content
  # プロジェクト仕様書
  ## 概要
  本プロジェクトは...
  ## Technical Requirements
  - Python 3.10+
  - PostgreSQL 14+
- Metadata Usage
  ---
  title: システム設計書
  date: 2024-01-15
  tags: [設計, アーキテクチャ, API]
  ---
Query Optimization
- Use Natural Phrases
  # Good: Natural Japanese
  oboyu query --query "ユーザー認証の実装方法"
  # Less optimal: Keyword list
  oboyu query --query "ユーザー 認証 実装 方法"
- Leverage Filters
  # Filter by file type for technical docs
  oboyu query --query "API設計" --filter "*.md"
  # Filter by date for recent content
  oboyu query --query "会議録" --filter "2024-*"
- Mode Selection
  # Semantic for concepts
  oboyu query --query "オブジェクト指向の原則" --mode semantic
  # Keyword for exact terms
  oboyu query --query "エラーコード E001" --mode keyword
Encoding Best Practices
Oboyu automatically handles all encoding scenarios:
# Modern UTF-8 documents
oboyu index ./modern_docs
# Legacy systems with Shift-JIS or EUC-JP
oboyu index ./legacy_system
# Mixed encoding environments
oboyu index ./mixed_docs
The system will automatically detect and handle UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP encodings.
Troubleshooting
Encoding Detection Issues
Symptoms:
- Garbled characters (文字化け)
- Missing Japanese text
- Indexing errors
Solutions:
- Check file encoding:
file -i filename.txt
- Ensure the file is in a supported encoding (UTF-8, Shift-JIS, EUC-JP, CP932, or ISO-2022-JP)
- Convert unsupported encodings to UTF-8 before indexing
Normalization Issues
Symptoms:
- Search misses documents with different character widths
- Inconsistent search results
Solutions:
- Ensure all documents are re-indexed after configuration changes
- Use consistent input methods for queries
- Check normalization in debug mode
Debug Mode
Enable debug logging for Japanese processing:
# Index with debug output
oboyu index ./docs --debug
Performance Considerations
Encoding Detection Overhead
- Minimal impact: ~1-5ms per file
- Cached results: Encoding detected once per file
- Parallel processing: Multiple files processed concurrently
Normalization Performance
- Fast processing: <1ms per document
- In-memory operation: No disk I/O required
- Single-pass: Normalization done during initial processing
Advanced Configuration
Tokenizer Selection
# ~/.config/oboyu/config.yaml
indexer:
  language:
    japanese_tokenizer: "mecab"    # or "sentencepiece"
    compound_splitting: true       # Split compound words
    reading_normalization: true    # Normalize readings
crawler:
  encoding_detection:
    enabled: true                  # Default: true
    confidence_threshold: 0.8
    fallback_encoding: "utf-8"
    # Custom encoding priority (optional)
    encoding_priority:
      - "utf-8"
      - "shift_jis"
      - "euc_jp"
      - "cp932"
      - "iso2022_jp"
Search Optimization
query_engine:
  japanese:
    # Boost exact matches in Japanese
    exact_match_boost: 2.0
    # Enable reading variation matching
    reading_variations: true
    # Compound word handling
    compound_word_search: true
    # Japanese-specific reranker
    reranker:
      model: "hotchpotch/japanese-reranker-cross-encoder-small-v1"
      enabled: true
Performance Tuning
indexer:
  processing:
    # Larger chunks for Japanese (due to character density)
    chunk_size: 800      # characters, not tokens
    chunk_overlap: 200
  # Parallel processing
  japanese_processing:
    max_workers: 4
    batch_size: 100
Integration with Search
The encoding detection and normalization ensure that:
- Vector search works with properly encoded text
- BM25 search matches normalized tokens correctly
- Hybrid search combines both effectively
All search modes benefit from consistent text processing, ensuring reliable results regardless of the original file encoding or character variations.
Common Issues and Solutions
Issue: Poor Search Results for Japanese Queries
Symptoms:
- English results for Japanese queries
- Missing obvious matches
- Irrelevant results
Solutions:
- Check tokenization:
  # Verify MeCab is installed
  which mecab
  # Test tokenization
  echo "テスト文章" | mecab
- Use appropriate search mode:
  # For technical terms
  oboyu query --query "機械学習" --mode hybrid
  # For concepts
  oboyu query --query "人工知能の応用" --mode semantic
- Rebuild index:
  # Clear and rebuild with debug info
  oboyu index --clear
  oboyu index /path/to/docs --debug
Issue: Encoding Problems
Symptoms:
- 文字化け (mojibake/garbled text)
- Question marks or squares
- Missing content
Solutions:
- Check file encoding:
  # Detect file encoding
  file -i problematic_file.txt
  # Or use nkf
  nkf -g problematic_file.txt
- Convert to UTF-8:
  # Using iconv
  iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
  # Using nkf
  nkf -w --overwrite *.txt
- Batch conversion:
  # Convert all Shift-JIS files in directory
  find . -name "*.txt" -exec sh -c '
    encoding=$(nkf -g "$1")
    if [ "$encoding" = "Shift_JIS" ]; then
      nkf -w --overwrite "$1"
      echo "Converted: $1"
    fi
  ' _ {} \;
Tips for Optimal Japanese Search
1. Document Structure
- Use clear headings in Japanese
- Include ruby annotations for difficult kanji
- Add English translations for key terms
- Use consistent terminology
2. Query Formulation
- Use full sentences for semantic search
- Include context in queries
- Try both kanji and hiragana versions
- Use synonyms and related terms
3. Index Optimization
- Separate Japanese and English content when possible
- Use metadata for categorization
- Regular index maintenance
- Monitor search performance
Future Enhancements
Planned improvements for Japanese support:
- Advanced NLP Features
  - Named entity recognition (NER)
  - Dependency parsing
  - Sentiment analysis
- Dictionary Integration
  - Custom domain dictionaries
  - Synonym expansion
  - Technical term databases
- Enhanced Models
  - Larger Japanese embedding models
  - Domain-specific fine-tuning
  - Multilingual model options