Handling Different Document Formats

Oboyu supports a wide range of document formats. This guide shows how to index and search each type of file in your collection.

PDF Documents

Oboyu fully supports PDF document indexing and search. Text content is automatically extracted from PDF files, including multi-page documents, and metadata is preserved for enhanced search capabilities.

Supported PDF Features

Text extraction: Extracts plain text content from all pages
Metadata extraction: Preserves title, author, creation date, and modification date
Multi-page support: Handles documents with multiple pages seamlessly
Japanese text support: Full support for Japanese text in PDFs

Basic PDF Search

# Search all indexed PDF files
oboyu search "annual report"

# Find PDFs with specific content
oboyu search "financial statement 2024" --mode hybrid

# Search for content across multiple pages
oboyu search "conclusion recommendations"

PDF Indexing

# Index a directory containing PDFs
oboyu index ~/Documents/PDFs

# Index with mixed file types (PDFs included by default)
oboyu index ~/Documents --include "*.pdf,*.txt,*.md"

PDF-Specific Examples

# Find research papers
oboyu search "machine learning neural networks"

# Find forms and applications
oboyu search "application form"

# Find presentations converted to PDF
oboyu search "slide deck presentation"

# Search PDF metadata
oboyu search "author:Smith" --mode vector

Japanese PDF Support

# Search Japanese PDFs
oboyu search "機械学習"

# Mixed language search
oboyu search "machine learning 機械学習" --mode hybrid

Performance Tips for PDFs

Large PDFs are processed efficiently with automatic text chunking
Metadata extraction is fast and preserves document properties
Text-based PDFs perform better than image-only PDFs
Consider using semantic search (--mode vector) for concept-based queries
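
Because text-based PDFs index better than image-only ones, it can help to check for a text layer before indexing. The following is a minimal sketch, not part of Oboyu: `has_text_layer` is a hypothetical helper that reads already-extracted text on standard input and succeeds when it finds at least a minimum number of non-whitespace characters. The usage comment assumes `pdftotext` from poppler-utils is installed, and the default threshold of 20 is an arbitrary choice.

```shell
# has_text_layer [MIN]: hypothetical helper, not an Oboyu command.
# Reads extracted text on stdin and succeeds if it holds at least MIN
# non-whitespace characters (default 20).
has_text_layer() {
  min="${1:-20}"
  # Arithmetic expansion strips the padding some wc implementations add.
  count=$(( $(tr -d '[:space:]' | wc -c) ))
  [ "$count" -ge "$min" ]
}

# Usage (assumes pdftotext is available):
#   pdftotext scan.pdf - | has_text_layer || echo "scan.pdf may need OCR"
```

A PDF that fails this check is probably a scan and may need OCR before its content becomes searchable.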

Markdown Files

Markdown is perfect for notes, documentation, and technical writing.

Basic Markdown Search

# Search for TODO markers in your notes
oboyu search "TODO"

# Find markdown with specific headers
oboyu search "## Installation"

Markdown Structure Search

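# Find files with Python code blocks (single quotes stop the shell from treating backticks as command substitution)
oboyu search '```python'

# Find files with links
oboyu search "[link]("

# Find files with images (single quotes avoid history expansion of "!")
oboyu search '![image]'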

Common Markdown Workflows

# Find README files in a specific index
oboyu search "README" --db-path ~/indexes/example.db

# Find documentation files
oboyu search "documentation"

# Find blog posts
oboyu search "date:"

Office Documents

Microsoft Word (.docx)

# Search Word documents
oboyu search "contract"

# Find templates
oboyu search "template"

# Find track changes comments
oboyu search "comment:" --mode vector

Excel Files (.xlsx)

# Search spreadsheets
oboyu search "budget"

# Find files with specific data
oboyu search "Q4 revenue"

# Find formulas (if extracted)
oboyu search "SUM(A1:"

PowerPoint (.pptx)

# Search presentations
oboyu search "roadmap"

# Find slide titles
oboyu search "Agenda"

# Find speaker notes
oboyu search "note:" --mode vector

Plain Text Files

Plain text is simple but powerful for logs, notes, and data files.

Basic Text Search

# Search text files
oboyu search "error"

# Search log files
oboyu search "ERROR"

# Search configuration files
oboyu search "port"

Structured Text Files

# Search CSV files
oboyu search "customer_id"

# Search JSON files
oboyu search '"api_key"'

# Search XML files
oboyu search "<configuration>"

Code Files

Oboyu can search through source code effectively.

Language-Specific Search

# Python files
oboyu search "def process_data"

# JavaScript files
oboyu search "async function"

# Java files
oboyu search "public class"

Code Pattern Search

# Find imports
oboyu search "import pandas"

# Find function definitions
oboyu search "function.*export"

# Find TODO comments
oboyu search "TODO:|FIXME:"
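
The last two queries above look like regular expressions, and this guide does not say whether Oboyu interprets them that way. When an exact pattern match matters, a conventional `grep` over the same directory can complement the search. A minimal sketch, where `find_code_markers` is a hypothetical helper name and not an Oboyu command:

```shell
# find_code_markers DIR: hypothetical helper, not an Oboyu command.
# Prints file:line:text for every TODO: or FIXME: marker under DIR,
# using an extended regular expression.
find_code_markers() {
  grep -rnE 'TODO:|FIXME:' "$1"
}

# Usage:
#   find_code_markers ~/projects/myapp
```

This trades ranking and semantic matching for exactness, so it suits audits ("list every FIXME") rather than exploratory search.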

Email Files

If you export emails to files:

Email Search Patterns

# Search email files
oboyu search "meeting invitation"

# Find emails from specific sender
oboyu search "From: boss@company.com"

# Find emails with attachments
oboyu search "attachment" --mode vector

Web Documents

HTML Files

# Search HTML content
oboyu search "contact form"

# Find specific tags
oboyu search "<form"

# Find meta descriptions
oboyu search 'meta name="description"'

Mixed Format Workflows

Project Documentation

When projects have multiple file types:

# Search across all documentation
oboyu search "API endpoint"

# Find all files about a feature
oboyu search "user authentication" --mode vector

Research Collection

For mixed academic materials:

# Search papers and notes
oboyu search "hypothesis testing"

# Find citations
oboyu search "et al. 2024"

Format-Specific Tips

Large Files

# For large PDFs or documents
oboyu index ~/large-docs --chunk-size 1000

# Search the re-indexed collection
oboyu search "conclusion"
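
Before choosing a chunk size, it can help to see which files are actually large. The sketch below uses only `find`. `list_large_files` is a hypothetical helper name, and the byte threshold is whatever you consider large for your collection.

```shell
# list_large_files DIR MIN_BYTES: hypothetical helper, not an Oboyu command.
# Prints every regular file under DIR that is larger than MIN_BYTES bytes.
list_large_files() {
  find "$1" -type f -size +"$2"c
}

# Usage: list files larger than 10 MiB
#   list_large_files ~/large-docs $((10 * 1024 * 1024))
```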

Compressed Archives

# Index contents of archives
oboyu index ~/Documents --extract-archives

# Search within extracted content
oboyu search "readme"
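
If you would rather control extraction yourself, or inspect archive contents before they are indexed, one option is to unpack archives into a separate directory and index that. A minimal sketch for `.tar.gz` files only, where `extract_tarballs` is a hypothetical helper name and not an Oboyu command:

```shell
# extract_tarballs SRC DEST: hypothetical helper, not an Oboyu command.
# Unpacks every .tar.gz under SRC into its own folder under DEST.
extract_tarballs() {
  find "$1" -type f -name '*.tar.gz' | while IFS= read -r archive; do
    target="$2/$(basename "$archive" .tar.gz)"
    mkdir -p "$target"
    tar -xzf "$archive" -C "$target"
  done
}

# Usage:
#   extract_tarballs ~/Documents ~/extracted
#   oboyu index ~/extracted
```

Only unpack archives you trust, since extraction writes whatever paths the archive contains.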

Binary Files with Metadata

# Search image metadata
oboyu search "Canon EOS"

# Search audio file tags
oboyu search "Beatles"

Best Practices by Format

For PDFs

Keep OCR quality high for scanned documents
Use semantic search for concept-based queries
Index regularly as PDFs are often updated

For Markdown

Use consistent formatting for better search
Include front matter for metadata
Use headers for structure-based search
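
To apply the front matter advice consistently, a quick check can list Markdown files that lack it. The sketch below assumes the common convention that front matter starts with a line containing only `---`. Both function names are hypothetical, and files with Windows line endings would need extra handling.

```shell
# has_front_matter FILE: succeeds if the first line is exactly '---'.
has_front_matter() {
  [ "$(head -n 1 "$1")" = "---" ]
}

# list_missing_front_matter DIR: prints .md files under DIR whose first
# line is not '---'. Hypothetical helper, not an Oboyu command.
list_missing_front_matter() {
  find "$1" -type f -name '*.md' | while IFS= read -r f; do
    has_front_matter "$f" || printf '%s\n' "$f"
  done
}

# Usage:
#   list_missing_front_matter ~/notes
```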

For Office Documents

Use document properties and metadata
Keep formatting consistent
Extract tables and charts when possible

For Code Files

Include comments for context
Use consistent naming conventions
Index documentation alongside code

Format Conversion Tips

When to Convert

Convert proprietary formats to open formats
Convert old formats to supported formats
Extract text from complex formats
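
To decide what needs converting, it helps to know which formats a collection contains. The sketch below tallies file extensions using standard tools. `count_extensions` is a hypothetical helper name, and files without an extension are ignored.

```shell
# count_extensions DIR: hypothetical helper, not an Oboyu command.
# Prints "COUNT EXTENSION" lines, most common first, with extensions
# lowercased so that .PDF and .pdf are counted together.
count_extensions() {
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}

# Usage:
#   count_extensions ~/Documents
```

Extensions that show up here but are not handled by your indexer are the natural candidates for conversion.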

Conversion Examples

# Convert before indexing
pandoc input.docx -o output.md
oboyu index ~/converted-docs

# Batch conversion (pandoc reads .docx, not legacy .doc)
find . -name "*.docx" -exec sh -c 'pandoc "$1" -o "${1%.docx}.md"' _ {} \;

Troubleshooting Format Issues

Unsupported Format

# Check if format is supported
# (Note: formats list command not available)

# Use text extraction for unsupported formats
textract unsupported.xyz > supported.txt

Corrupted Files

# Skip corrupted files
oboyu index ~/Documents

# Find problematic files (2>&1 also captures errors written to stderr)
oboyu index ~/Documents 2>&1 | grep "ERROR"
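
Some damaged files can be spotted before indexing. As a rough heuristic, a PDF normally begins with the bytes `%PDF-`, so empty or truncated downloads and mislabeled files fail that check. The sketch below is an assumption-laden pre-check, not an Oboyu feature. Both function names are hypothetical, and a file that passes can still be corrupted further in.

```shell
# is_pdf FILE: succeeds if FILE starts with the PDF magic bytes "%PDF-".
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# find_bad_pdfs DIR: prints .pdf files under DIR that fail is_pdf.
# Hypothetical helper, not an Oboyu command.
find_bad_pdfs() {
  find "$1" -type f -name '*.pdf' | while IFS= read -r f; do
    is_pdf "$f" || printf '%s\n' "$f"
  done
}

# Usage:
#   find_bad_pdfs ~/Documents
```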

Encoding Issues

# Handle different encodings
oboyu index ~/Documents

# Force specific encoding
# (Note: encoding options not available in current implementation)
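
Since no encoding option is available, one workaround when a file's text comes out garbled is to convert it to UTF-8 yourself before indexing. A minimal sketch using `iconv`. `to_utf8` is a hypothetical helper name, and the source encoding is something you need to know or determine separately (`iconv -l` lists the names your system accepts).

```shell
# to_utf8 FROM_ENCODING INPUT OUTPUT: hypothetical helper, not an Oboyu command.
# Writes a UTF-8 copy of INPUT to OUTPUT, leaving the original untouched.
to_utf8() {
  iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Usage: convert a Shift_JIS text file, then index the converted copy
#   to_utf8 SHIFT_JIS legacy.txt converted/legacy.txt
#   oboyu index converted
```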

Next Steps

Explore Search Patterns for format-specific search techniques
Learn about Performance Tuning for large collections
See Real-world Scenarios for practical examples
