Japanese Language Support in Oboyu
Oboyu provides first-class support for Japanese text processing, with features designed for the specific characteristics of the Japanese language: automatic encoding detection, intelligent text normalization, morphological analysis, and search algorithms that understand Japanese grammar and writing systems.
Table of Contents
- Core Features
- Encoding Detection
- Text Normalization
- Tokenization and Morphological Analysis
- Search Optimization
- Mixed Language Support
- Performance Benchmarks
- Best Practices
- Troubleshooting
- Advanced Configuration
Core Features
Oboyu's Japanese support includes:
- 🔍 Automatic Encoding Detection: Handles UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP
- 🔤 Intelligent Normalization: Full/half-width conversion, variant unification
- 🔬 Morphological Analysis: MeCab and SentencePiece integration
- 🎯 Optimized Models: Japanese-specific embedding models (Ruri v3)
- 🌏 Mixed Language: Seamless Japanese-English document handling
- ⚡ High Performance: Optimized for Japanese text characteristics
Encoding Detection
Overview
Japanese text files often use various character encodings for historical reasons. Oboyu automatically detects and handles multiple Japanese encodings so that all documents are indexed correctly.
Supported Encodings
Oboyu supports automatic detection and conversion of common Japanese encodings:
- UTF-8: Modern standard encoding (default)
- Shift-JIS: Windows Japanese encoding (CP932)
- EUC-JP: Unix/Linux Japanese encoding
- ISO-2022-JP: Email and older systems encoding
How It Works
When processing files, Oboyu:
- Attempts UTF-8 first as the modern standard
- Detects encoding using byte patterns specific to Japanese encodings
- Converts to UTF-8 internally for consistent processing
- Handles mixed encodings in different files within the same directory
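A minimal sketch of this try-first, fall-back decoding flow is shown below. It is illustrative only; the function name and candidate list are assumptions based on the supported encodings above, not Oboyu's actual implementation:
def read_japanese_text(path):
    # Candidate encodings in priority order (mirrors the supported list above)
    candidates = ["utf-8", "shift_jis", "euc_jp", "cp932", "iso2022_jp"]
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            return raw.decode(encoding)  # success: return the decoded text
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    return raw.decode("utf-8", errors="replace")  # last resort: replace undecodable bytes
In practice, bytes that are valid in one Japanese encoding can sometimes decode without error under another, which is why real detectors also score byte patterns rather than stopping at the first decode that succeeds.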
Automatic Detection
Oboyu automatically detects and handles Japanese encodings without requiring configuration. The system intelligently identifies the encoding of each file and converts it to UTF-8 for consistent processing.
Text Normalization
Unicode Normalization
All Japanese text undergoes Unicode NFKC (Normalization Form KC) normalization to ensure consistent matching:
Input: "hello ワールド" (mixed width)
Output: "hello ワールド" (normalized)
This normalization handles:
- Full-width to half-width conversion for ASCII characters
- Character variant unification (異体字の統一)
- Combining character normalization
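NFKC normalization is available directly in Python's standard library, so its effect is easy to observe (the example string is illustrative):
import unicodedata
text = "ｈｅｌｌｏ ﾜｰﾙﾄﾞ １２３"  # full-width Latin, half-width katakana, full-width digits
print(unicodedata.normalize("NFKC", text))  # -> "hello ワールド 123"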
Character Width Normalization
Oboyu normalizes character widths for consistent search:
Full-width ASCII: ＡＢＣ１２３ → Half-width: ABC123
Half-width katakana: ｶﾀｶﾅ → Full-width: カタカナ
Benefits
- Consistent search results regardless of input method
- Improved matching between different document sources
- Reduced index size through normalization
Japanese Text Processing Pipeline
Processing Steps
- File Reading
  - Detect encoding using configured encoding list
  - Read file with detected encoding
- Text Extraction
  - Convert to UTF-8 if needed
  - Extract text content
- Normalization
  - Apply Unicode NFKC normalization
  - Normalize character widths
  - Clean up whitespace
- Tokenization (for BM25 search)
  - Use MeCab for morphological analysis
  - Extract meaningful tokens
Example Pipeline
# Original file (Shift-JIS encoded)
"データベース 設計 パターン"
# After encoding detection and conversion
"データベース 設計 パターン" (UTF-8)
# After normalization
"データベース 設計 パターン" (normalized spaces)
# After tokenization
["データベース", "設計", "パターン"]
Tokenization and Morphological Analysis
MeCab Integration
Oboyu integrates with MeCab for accurate Japanese morphological analysis:
# Install MeCab (optional but recommended)
# macOS
brew install mecab mecab-ipadic
# Ubuntu/Debian
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
# Install Python binding
pip install mecab-python3
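Once installed, mecab-python3 can be used directly to inspect how a sentence is segmented (a minimal standalone example; the exact output depends on the installed dictionary):
import MeCab
tagger = MeCab.Tagger("-Owakati")  # wakati-gaki output: surface forms separated by spaces
tokens = tagger.parse("自然言語処理を学習する").split()
print(tokens)  # e.g. ['自然', '言語', '処理', 'を', '学習', 'する'] with IPAdic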
Tokenization Examples
Basic Tokenization
# Input text
"自然言語処理を学習する"
# MeCab tokenization
["自然", "言語", "処理", "を", "学習", "する"]
# With part-of-speech filtering (nouns and verbs only)
["自然", "言語", "処理", "学習"]
Complex Sentences
# Input
"機械学習モデルの性能を向上させるための最適化手法について説明します。"
# Tokenized output
["機械", "学習", "モデル", "性能", "向上", "最適", "化", "手法", "説明"]
Named Entity Recognition
# Company names and products
"株式会社OpenAIのGPT-4は革新的なAIモデルです。"
# Intelligent tokenization preserves entities
["株式会社", "OpenAI", "GPT-4", "革新", "的", "AI", "モデル"]
SentencePiece Fallback
When MeCab is not available, Oboyu uses SentencePiece:
# SentencePiece subword tokenization
"未知の専門用語も適切に処理"
["▁未知", "の", "▁専門", "用語", "も", "▁適切", "に", "▁処理"]
Search Optimization
Japanese-Specific Ranking
Oboyu optimizes search ranking for Japanese characteristics:
1. Compound Word Handling
# Query: "データベース"
# Matches both:
- "データベース" (exact match)
- "データ" + "ベース" (compound components)
2. Reading Variations
# Handles different readings
- "会議" (kaigi)
- "打ち合わせ" (uchiawase)
- "ミーティング" (meeting)
# All treated as related concepts
3. Script Mixing
# Query: "AI技術"
# Intelligently matches:
- "AI技術"
- "人工知能技術"
- "エーアイ技術"
Search Examples
Technical Documentation Search
# Search for machine learning concepts
oboyu query --query "機械学習アルゴリズム"
# Results ranked by:
# 1. Exact phrase matches
# 2. All terms present
# 3. Partial term matches
# 4. Semantic similarity
Business Document Search
# Search for meeting notes
oboyu query --query "営業会議 議事録 2024年"
# Intelligent date parsing and entity recognition
Code Documentation
# Mixed language search
oboyu query --query "async/await 非同期処理"
# Handles technical terms in both languages
Mixed Language Support
Seamless Language Detection
Oboyu automatically detects and handles mixed language content:
# Document with mixed content
"""
API設計のベストプラクティス
1. RESTful Design Principles
- リソース指向アーキテクチャ
- HTTPメソッドの適切な使用
2. Error Handling
- エラーコードの統一
- 日本語エラーメッセージの提供
"""
Language-Aware Chunking
# Intelligent chunk boundaries
- Respects sentence boundaries in both languages
- Maintains context across language switches
- Preserves code blocks and technical terms
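A simplified sketch of sentence-boundary splitting that works across Japanese and English punctuation (an illustration of the idea, not Oboyu's chunker):
import re
def split_sentences(text):
    # Split after Japanese (。！？) or English (.!?) sentence-ending punctuation
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in parts if p]
mixed = "API設計のベストプラクティスを説明します。Use consistent error codes. エラーメッセージは日本語で提供します。"
print(split_sentences(mixed))
# -> ['API設計のベストプラクティスを説明します。', 'Use consistent error codes.', 'エラーメッセージは日本語で提供します。']
A real chunker would then pack consecutive sentences into chunks up to the configured chunk_size, carrying over chunk_overlap characters between adjacent chunks.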
Search Across Languages
# Japanese query finding English content
oboyu query --query "認証システム"
# Also finds: "authentication system", "auth module"
# English query finding Japanese content
oboyu query --query "database optimization"
# Also finds: "データベース最適化", "DB高速化"
Performance Benchmarks
Encoding Detection Speed
| File Size | UTF-8 | Shift-JIS | EUC-JP | Mixed |
|---|---|---|---|---|
| 1KB | 0.1ms | 0.3ms | 0.3ms | 0.5ms |
| 100KB | 0.5ms | 1.2ms | 1.1ms | 2.1ms |
| 1MB | 2.1ms | 4.8ms | 4.5ms | 8.2ms |
| 10MB | 18ms | 42ms | 39ms | 71ms |
Tokenization Performance
| Text Length | MeCab | SentencePiece | Hybrid |
|---|---|---|---|
| 100 chars | 0.8ms | 1.2ms | 0.9ms |
| 1,000 chars | 3.2ms | 4.8ms | 3.5ms |
| 10,000 chars | 28ms | 41ms | 30ms |
Search Performance (10,000 documents)
| Query Type | BM25 | Vector | Hybrid |
|---|---|---|---|
| Single term | 12ms | 45ms | 52ms |
| Phrase | 18ms | 46ms | 58ms |
| Complex | 25ms | 48ms | 65ms |
Best Practices
Document Preparation
- Consistent Encoding
  # Convert legacy files to UTF-8
  iconv -f SHIFT-JIS -t UTF-8 old_file.txt > new_file.txt
- Structured Content
  # プロジェクト仕様書
  ## 概要
  本プロジェクトは...
  ## Technical Requirements
  - Python 3.10+
  - PostgreSQL 14+
- Metadata Usage
  ---
  title: システム設計書
  date: 2024-01-15
  tags: [設計, アーキテクチャ, API]
  ---
Query Optimization
- Use Natural Phrases
  # Good: Natural Japanese
  oboyu query --query "ユーザー認証の実装方法"
  # Less optimal: Keyword list
  oboyu query --query "ユーザー 認証 実装 方法"
- Leverage Filters
  # Filter by file type for technical docs
  oboyu query --query "API設計" --filter "*.md"
  # Filter by date for recent content
  oboyu query --query "会議録" --filter "2024-*"
- Mode Selection
  # Semantic for concepts
  oboyu query --query "オブジェクト指向の原則" --mode semantic
  # Keyword for exact terms
  oboyu query --query "エラーコード E001" --mode keyword
Encoding Best Practices
Oboyu automatically handles all encoding scenarios:
# Modern UTF-8 documents
oboyu index ./modern_docs
# Legacy systems with Shift-JIS or EUC-JP
oboyu index ./legacy_system
# Mixed encoding environments
oboyu index ./mixed_docs
The system will automatically detect and handle UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP encodings.
Troubleshooting
Encoding Detection Issues
Symptoms:
- Garbled characters (文字化け)
- Missing Japanese text
- Indexing errors
Solutions:
- Check file encoding:
file -i filename.txt
- Ensure the file is in a supported encoding (UTF-8, Shift-JIS, EUC-JP, CP932, or ISO-2022-JP)
- Convert unsupported encodings to UTF-8 before indexing
Normalization Issues
Symptoms:
- Search misses documents with different character widths
- Inconsistent search results
Solutions:
- Ensure all documents are re-indexed after configuration changes
- Use consistent input methods for queries
- Check normalization in debug mode
Debug Mode
Enable debug logging for Japanese processing:
# Index with debug output
oboyu index ./docs --debug
Performance Considerations
Encoding Detection Overhead
- Minimal impact: ~1-5ms per file
- Cached results: Encoding detected once per file
- Parallel processing: Multiple files processed concurrently
Normalization Performance
- Fast processing: <1ms per document
- In-memory operation: No disk I/O required
- Single-pass: Normalization done during initial processing
Advanced Configuration
Tokenizer Selection
# ~/.config/oboyu/config.yaml
indexer:
  language:
    japanese_tokenizer: "mecab"    # or "sentencepiece"
    compound_splitting: true       # Split compound words
    reading_normalization: true    # Normalize readings
crawler:
  encoding_detection:
    enabled: true                  # Default: true
    confidence_threshold: 0.8
    fallback_encoding: "utf-8"
    # Custom encoding priority (optional)
    encoding_priority:
      - "utf-8"
      - "shift_jis"
      - "euc_jp"
      - "cp932"
      - "iso2022_jp"
Search Optimization
query_engine:
  japanese:
    # Boost exact matches in Japanese
    exact_match_boost: 2.0
    # Enable reading variation matching
    reading_variations: true
    # Compound word handling
    compound_word_search: true
    # Japanese-specific reranker
    reranker:
      model: "hotchpotch/japanese-reranker-cross-encoder-small-v1"
      enabled: true
Performance Tuning
indexer:
  processing:
    # Larger chunks for Japanese (due to character density)
    chunk_size: 800      # characters, not tokens
    chunk_overlap: 200
  # Parallel processing
  japanese_processing:
    max_workers: 4
    batch_size: 100
Integration with Search
The encoding detection and normalization ensure that:
- Vector search works with properly encoded text
- BM25 search matches normalized tokens correctly
- Hybrid search combines both effectively
All search modes benefit from consistent text processing, ensuring reliable results regardless of the original file encoding or character variations.
Common Issues and Solutions
Issue: Poor Search Results for Japanese Queries
Symptoms:
- English results for Japanese queries
- Missing obvious matches
- Irrelevant results
Solutions:
- Check tokenization:
  # Verify MeCab is installed
  which mecab
  # Test tokenization
  echo "テスト文章" | mecab
- Use appropriate search mode:
  # For technical terms
  oboyu query --query "機械学習" --mode hybrid
  # For concepts
  oboyu query --query "人工知能の応用" --mode semantic
- Rebuild index:
  # Clear and rebuild with debug info
  oboyu index --clear
  oboyu index /path/to/docs --debug
Issue: Encoding Problems
Symptoms:
- 文字化け (mojibake/garbled text)
- Question marks or squares
- Missing content
Solutions:
- Check file encoding:
  # Detect file encoding
  file -i problematic_file.txt
  # Or use nkf
  nkf -g problematic_file.txt
- Convert to UTF-8:
  # Using iconv
  iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
  # Using nkf
  nkf -w --overwrite *.txt
- Batch conversion:
  # Convert all Shift-JIS files in directory
  find . -name "*.txt" -exec sh -c '
    encoding=$(nkf -g "$1")
    if [ "$encoding" = "Shift_JIS" ]; then
      nkf -w --overwrite "$1"
      echo "Converted: $1"
    fi
  ' _ {} \;
Tips for Optimal Japanese Search
1. Document Structure
- Use clear headings in Japanese
- Include ruby annotations for difficult kanji
- Add English translations for key terms
- Use consistent terminology
2. Query Formulation
- Use full sentences for semantic search
- Include context in queries
- Try both kanji and hiragana versions
- Use synonyms and related terms
3. Index Optimization
- Separate Japanese and English content when possible
- Use metadata for categorization
- Regular index maintenance
- Monitor search performance
Future Enhancements
Planned improvements for Japanese support:
- Advanced NLP Features
  - Named entity recognition (NER)
  - Dependency parsing
  - Sentiment analysis
- Dictionary Integration
  - Custom domain dictionaries
  - Synonym expansion
  - Technical term databases
- Enhanced Models
  - Larger Japanese embedding models
  - Domain-specific fine-tuning
  - Multilingual model options