Japanese Language Support in Oboyu

Oboyu provides world-class support for Japanese text processing with advanced features designed specifically for the unique characteristics of the Japanese language. This includes automatic encoding detection, intelligent text normalization, morphological analysis, and optimized search algorithms that understand Japanese grammar and writing systems.

Core Features

Oboyu's Japanese support includes:

  • 🔍 Automatic Encoding Detection: Handles UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP
  • 🔤 Intelligent Normalization: Full/half-width conversion, variant unification
  • 🔬 Morphological Analysis: MeCab and SentencePiece integration
  • 🎯 Optimized Models: Japanese-specific embedding models (Ruri v3)
  • 🌏 Mixed Language: Seamless Japanese-English document handling
  • ⚡ High Performance: Optimized for Japanese text characteristics

Encoding Detection

Overview

Japanese text files often use various character encodings due to historical reasons. Oboyu automatically detects and handles multiple Japanese encodings to ensure all documents are properly indexed.

Supported Encodings

Oboyu supports automatic detection and conversion of common Japanese encodings:

  • UTF-8: Modern standard encoding (default)
  • Shift-JIS / CP932: Windows Japanese encoding
  • EUC-JP: Unix/Linux Japanese encoding
  • ISO-2022-JP: Email and older systems encoding

How It Works

When processing files, Oboyu:

  1. Attempts UTF-8 first as the modern standard
  2. Detects encoding using byte patterns specific to Japanese encodings
  3. Converts to UTF-8 internally for consistent processing
  4. Handles mixed encodings in different files within the same directory
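
As a rough illustration, this try-and-fall-back strategy can be sketched in a few lines of Python (standard library only; a simplified stand-in, not Oboyu's actual implementation, which also scores Japanese-specific byte patterns rather than just trying codecs in order):

# Simplified sketch of the detection order described above (not Oboyu code)
from pathlib import Path

CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "euc_jp", "cp932", "iso2022_jp"]

def read_japanese_text(path: str) -> str:
    raw = Path(path).read_bytes()
    for encoding in CANDIDATE_ENCODINGS:
        try:
            # Try UTF-8 first, then common Japanese encodings; decoding to a
            # Python str gives the consistent internal (UTF-8) representation
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8, replacing undecodable bytes
    return raw.decode("utf-8", errors="replace")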

Automatic Detection

Oboyu automatically detects and handles Japanese encodings without requiring configuration. The system intelligently identifies the encoding of each file and converts it to UTF-8 for consistent processing.

Text Normalization

Unicode Normalization

All Japanese text undergoes Unicode NFKC (Normalization Form KC) normalization to ensure consistent matching:

Input:  "hello ワールド"  (mixed width)
Output: "hello ワールド" (normalized)

This normalization handles:

  • Full-width to half-width conversion for ASCII characters
  • Character variant unification (異体字の統一)
  • Combining character normalization
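
For reference, Python's standard library applies the same NFKC transform, which is a quick way to preview how a given string will look after this step (a standalone illustration, not Oboyu code):

import unicodedata

text = "ｈｅｌｌｏ ﾜｰﾙﾄﾞ ＡＢＣ１２３"
print(unicodedata.normalize("NFKC", text))  # -> "hello ワールド ABC123"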

Character Width Normalization

Oboyu normalizes character widths for consistent search:

Full-width: ＡＢＣ１２３
Half-width: ABC123

Half-width: ｶﾀｶﾅ
Full-width: カタカナ

Benefits

  • Consistent search results regardless of input method
  • Improved matching between different document sources
  • Reduced index size through normalization

Japanese Text Processing Pipeline

Processing Steps

  1. File Reading

    • Detect encoding using configured encoding list
    • Read file with detected encoding
  2. Text Extraction

    • Convert to UTF-8 if needed
    • Extract text content
  3. Normalization

    • Apply Unicode NFKC normalization
    • Normalize character widths
    • Clean up whitespace
  4. Tokenization (for BM25 search)

    • Use MeCab for morphological analysis
    • Extract meaningful tokens

Example Pipeline

# Original file (Shift-JIS encoded)
"データベース　設計　パターン"  (full-width spaces)

# After encoding detection and conversion
"データベース　設計　パターン"  (UTF-8)

# After normalization
"データベース 設計 パターン"  (normalized spaces)

# After tokenization
["データベース", "設計", "パターン"]

Tokenization and Morphological Analysis

MeCab Integration

Oboyu integrates with MeCab for accurate Japanese morphological analysis:

# Install MeCab (optional but recommended)
# macOS
brew install mecab mecab-ipadic

# Ubuntu/Debian
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8

# Install Python binding
pip install mecab-python3
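
Once installed, you can verify tokenization directly from Python with the mecab-python3 binding (an illustrative check, not Oboyu's internal tokenizer wiring; exact output depends on the installed dictionary):

import MeCab

# "-Owakati" prints surface forms separated by spaces
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("自然言語処理を学習する").split())
# e.g. ['自然', '言語', '処理', 'を', '学習', 'する']

# Full analysis with part-of-speech information
tagger = MeCab.Tagger()
for line in tagger.parse("自然言語処理を学習する").splitlines():
    if line == "EOS":
        break
    surface, features = line.split("\t", 1)
    print(surface, features.split(",")[0])  # surface form and coarse POS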

Tokenization Examples

Basic Tokenization

# Input text
"自然言語処理を学習する"

# MeCab tokenization
["自然", "言語", "処理", "を", "学習", "する"]

# With part-of-speech filtering (nouns and verbs only)
["自然", "言語", "処理", "学習"]

Complex Sentences

# Input
"機械学習モデルの性能を向上させるための最適化手法について説明します。"

# Tokenized output
["機械", "学習", "モデル", "性能", "向上", "最適", "化", "手法", "説明"]

Named Entity Recognition

# Company names and products
"株式会社OpenAIのGPT-4は革新的なAIモデルです。"

# Intelligent tokenization preserves entities
["株式会社", "OpenAI", "GPT-4", "革新", "的", "AI", "モデル"]

SentencePiece Fallback

When MeCab is not available, Oboyu uses SentencePiece:

# SentencePiece subword tokenization
"未知の専門用語も適切に処理"
["▁未知", "の", "▁専門", "用語", "も", "▁適切", "に", "▁処理"]

Search Optimization

Japanese-Specific Ranking

Oboyu optimizes search ranking for Japanese characteristics:

1. Compound Word Handling

# Query: "データベース"
# Matches both:
- "データベース" (exact match)
- "データ" + "ベース" (compound components)

2. Reading Variations

# Handles different readings
- "会議" (kaigi)
- "打ち合わせ" (uchiawase)
- "ミーティング" (meeting)
# All treated as related concepts

3. Script Mixing

# Query: "AI技術"
# Intelligently matches:
- "AI技術"
- "人工知能技術"
- "エーアイ技術"

Search Examples

# Search for machine learning concepts
oboyu query --query "機械学習アルゴリズム"

# Results ranked by:
# 1. Exact phrase matches
# 2. All terms present
# 3. Partial term matches
# 4. Semantic similarity

# Search for meeting notes
oboyu query --query "営業会議 議事録 2024年"

# Intelligent date parsing and entity recognition

Code Documentation

# Mixed language search
oboyu query --query "async/await 非同期処理"

# Handles technical terms in both languages

Mixed Language Support

Seamless Language Detection

Oboyu automatically detects and handles mixed language content:

# Document with mixed content
"""
API設計のベストプラクティス

1. RESTful Design Principles
- リソース指向アーキテクチャ
- HTTPメソッドの適切な使用

2. Error Handling
- エラーコードの統一
- 日本語エラーメッセージの提供
"""

Language-Aware Chunking

# Intelligent chunk boundaries
- Respects sentence boundaries in both languages
- Maintains context across language switches
- Preserves code blocks and technical terms
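
A toy version of such chunking might split on Japanese (。！？) and English (.!?) sentence boundaries, then pack sentences up to a size limit (the function and parameter names below are illustrative only, not Oboyu's implementation):

import re

# Split points: immediately after a Japanese or English sentence terminator
SENTENCE_END = re.compile(r"(?<=[。！？!?.])\s*")

def chunk_mixed_text(text: str, max_chars: int = 800) -> list[str]:
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks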

Search Across Languages

# Japanese query finding English content
oboyu query --query "認証システム"
# Also finds: "authentication system", "auth module"

# English query finding Japanese content
oboyu query --query "database optimization"
# Also finds: "データベース最適化", "DB高速化"

Performance Benchmarks

Encoding Detection Speed

File Size    UTF-8    Shift-JIS    EUC-JP    Mixed
1KB          0.1ms    0.3ms        0.3ms     0.5ms
100KB        0.5ms    1.2ms        1.1ms     2.1ms
1MB          2.1ms    4.8ms        4.5ms     8.2ms
10MB         18ms     42ms         39ms      71ms

Tokenization Performance

Text Length     MeCab    SentencePiece    Hybrid
100 chars       0.8ms    1.2ms            0.9ms
1,000 chars     3.2ms    4.8ms            3.5ms
10,000 chars    28ms     41ms             30ms

Search Performance (10,000 documents)

Query Type     BM25    Vector    Hybrid
Single term    12ms    45ms      52ms
Phrase         18ms    46ms      58ms
Complex        25ms    48ms      65ms

Best Practices

Document Preparation

  1. Consistent Encoding

    # Convert legacy files to UTF-8
    iconv -f SHIFT-JIS -t UTF-8 old_file.txt > new_file.txt
  2. Structured Content

    # プロジェクト仕様書

    ## 概要
    本プロジェクトは...

    ## Technical Requirements
    - Python 3.10+
    - PostgreSQL 14+
  3. Metadata Usage

    ---
    title: システム設計書
    date: 2024-01-15
    tags: [設計, アーキテクチャ, API]
    ---

Query Optimization

  1. Use Natural Phrases

    # Good: Natural Japanese
    oboyu query --query "ユーザー認証の実装方法"

    # Less optimal: Keyword list
    oboyu query --query "ユーザー 認証 実装 方法"
  2. Leverage Filters

    # Filter by file type for technical docs
    oboyu query --query "API設計" --filter "*.md"

    # Filter by date for recent content
    oboyu query --query "会議録" --filter "2024-*"
  3. Mode Selection

    # Semantic for concepts
    oboyu query --query "オブジェクト指向の原則" --mode semantic

    # Keyword for exact terms
    oboyu query --query "エラーコード E001" --mode keyword

Encoding Best Practices

Oboyu automatically handles all encoding scenarios:

# Modern UTF-8 documents
oboyu index ./modern_docs

# Legacy systems with Shift-JIS or EUC-JP
oboyu index ./legacy_system

# Mixed encoding environments
oboyu index ./mixed_docs

The system will automatically detect and handle UTF-8, Shift-JIS, EUC-JP, CP932, and ISO-2022-JP encodings.

Troubleshooting

Encoding Detection Issues

Symptoms:

  • Garbled characters (文字化け)
  • Missing Japanese text
  • Indexing errors

Solutions:

  1. Check file encoding: file -i filename.txt
  2. Ensure the file is in a supported encoding (UTF-8, Shift-JIS, EUC-JP, CP932, or ISO-2022-JP)
  3. Convert unsupported encodings to UTF-8 before indexing
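
If the file utility is not available, the optional chardet package offers a similar check from Python (a convenience aid, not part of Oboyu):

import chardet

with open("filename.txt", "rb") as f:
    result = chardet.detect(f.read())
print(result)  # e.g. {'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}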

Normalization Issues

Symptoms:

  • Search misses documents with different character widths
  • Inconsistent search results

Solutions:

  1. Ensure all documents are re-indexed after configuration changes
  2. Use consistent input methods for queries
  3. Check normalization in debug mode

Debug Mode

Enable debug logging for Japanese processing:

# Index with debug output
oboyu index ./docs --debug

Performance Considerations

Encoding Detection Overhead

  • Minimal impact: ~1-5ms per file
  • Cached results: Encoding detected once per file
  • Parallel processing: Multiple files processed concurrently

Normalization Performance

  • Fast processing: <1ms per document
  • In-memory operation: No disk I/O required
  • Single-pass: Normalization done during initial processing

Advanced Configuration

Tokenizer Selection

# ~/.config/oboyu/config.yaml
indexer:
  language:
    japanese_tokenizer: "mecab"    # or "sentencepiece"
    compound_splitting: true       # Split compound words
    reading_normalization: true    # Normalize readings

crawler:
  encoding_detection:
    enabled: true                  # Default: true
    confidence_threshold: 0.8
    fallback_encoding: "utf-8"
    # Custom encoding priority (optional)
    encoding_priority:
      - "utf-8"
      - "shift_jis"
      - "euc_jp"
      - "cp932"
      - "iso2022_jp"

Search Optimization

query_engine:
  japanese:
    # Boost exact matches in Japanese
    exact_match_boost: 2.0
    # Enable reading variation matching
    reading_variations: true
    # Compound word handling
    compound_word_search: true

  # Japanese-specific reranker
  reranker:
    model: "hotchpotch/japanese-reranker-cross-encoder-small-v1"
    enabled: true

Performance Tuning

indexer:
  processing:
    # Larger chunks for Japanese (due to character density)
    chunk_size: 800        # characters, not tokens
    chunk_overlap: 200

  # Parallel processing
  japanese_processing:
    max_workers: 4
    batch_size: 100

The encoding detection and normalization ensure that:

  • Vector search works with properly encoded text
  • BM25 search matches normalized tokens correctly
  • Hybrid search combines both effectively

All search modes benefit from consistent text processing, ensuring reliable results regardless of the original file encoding or character variations.

Common Issues and Solutions

Issue: Poor Search Results for Japanese Queries

Symptoms:

  • English results for Japanese queries
  • Missing obvious matches
  • Irrelevant results

Solutions:

  1. Check tokenization:

    # Verify MeCab is installed
    which mecab

    # Test tokenization
    echo "テスト文章" | mecab
  2. Use appropriate search mode:

    # For technical terms
    oboyu query --query "機械学習" --mode hybrid

    # For concepts
    oboyu query --query "人工知能の応用" --mode semantic
  3. Rebuild index:

    # Clear and rebuild with debug info
    oboyu index --clear
    oboyu index /path/to/docs --debug

Issue: Encoding Problems

Symptoms:

  • 文字化け (mojibake/garbled text)
  • Question marks or squares
  • Missing content

Solutions:

  1. Check file encoding:

    # Detect file encoding
    file -i problematic_file.txt

    # Or use nkf
    nkf -g problematic_file.txt
  2. Convert to UTF-8:

    # Using iconv
    iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt

    # Using nkf
    nkf -w --overwrite *.txt
  3. Batch conversion:

    # Convert all Shift-JIS files in directory
    find . -name "*.txt" -exec sh -c '
      encoding=$(nkf -g "$1")
      if [ "$encoding" = "Shift_JIS" ]; then
        nkf -w --overwrite "$1"
        echo "Converted: $1"
      fi
    ' _ {} \;

Additional Best Practices

1. Document Structure

  • Use clear headings in Japanese
  • Include ruby annotations for difficult kanji
  • Add English translations for key terms
  • Use consistent terminology

2. Query Formulation

  • Use full sentences for semantic search
  • Include context in queries
  • Try both kanji and hiragana versions
  • Use synonyms and related terms

3. Index Optimization

  • Separate Japanese and English content when possible
  • Use metadata for categorization
  • Regular index maintenance
  • Monitor search performance

Future Enhancements

Planned improvements for Japanese support:

  1. Advanced NLP Features

    • Named entity recognition (NER)
    • Dependency parsing
    • Sentiment analysis
  2. Dictionary Integration

    • Custom domain dictionaries
    • Synonym expansion
    • Technical term databases
  3. Enhanced Models

    • Larger Japanese embedding models
    • Domain-specific fine-tuning
    • Multilingual model options