diff --git a/documentation/architecture/designs/index.rst b/documentation/architecture/designs/index.rst index 83430a2..fcb5dfa 100644 --- a/documentation/architecture/designs/index.rst +++ b/documentation/architecture/designs/index.rst @@ -28,8 +28,4 @@ Each design documents Python-specific architecture, interface contracts, module :maxdepth: 2 :glob: - processor-detection-system - inventory-processors - structure-processors - results-objects ../openspec/specs/*/design diff --git a/documentation/architecture/designs/processor-detection-system.rst b/documentation/architecture/designs/processor-detection-system.rst deleted file mode 100644 index 4491eed..0000000 --- a/documentation/architecture/designs/processor-detection-system.rst +++ /dev/null @@ -1,446 +0,0 @@ -.. vim: set fileencoding=utf-8: -.. -*- coding: utf-8 -*- -.. +--------------------------------------------------------------------------+ - | | - | Licensed under the Apache License, Version 2.0 (the "License"); | - | you may not use this file except in compliance with the License. | - | You may obtain a copy of the License at | - | | - | http://www.apache.org/licenses/LICENSE-2.0 | - | | - | Unless required by applicable law or agreed to in writing, software | - | distributed under the License is distributed on an "AS IS" BASIS, | - | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | - | See the License for the specific language governing permissions and | - | limitations under the License. 
| - | | - +--------------------------------------------------------------------------+ - - -******************************************************************************* -Processor Detection System Design -******************************************************************************* - -Overview -=============================================================================== - -The processor detection system provides automated selection of appropriate -inventory and structure processors for documentation sources. The design -implements confidence-based scoring with TTL-based caching to balance -performance with accuracy and data freshness. - -This document focuses on the orchestration layer that coordinates processor -selection across processor genera (inventory vs. structure processors), while -detailed processor-specific detection patterns are covered in the respective -processor architecture documents. - -Architecture -=============================================================================== - -Design Principles -------------------------------------------------------------------------------- - -**Genus-Based Separation** - Inventory processors and structure processors operate in separate detection - pipelines, allowing independent evolution and different selection criteria. - Each genus maintains its own cache and processor registry. - -**Confidence-Based Selection** - Processors return numerical confidence scores (0.0-1.0). Only processors - exceeding ``CONFIDENCE_THRESHOLD_MINIMUM`` (0.5) are considered, with highest - confidence and registration order as stable tiebreaker. - -**Immutable Data Structures** - All detection results use immutable containers (``__.immut.Dictionary``, - ``tuple``) following project practices for thread safety and predictable - behavior. - -**Wide Parameter, Narrow Return Pattern** - Public functions accept abstract base classes for parameters and return - specific concrete types, following established project practices. 
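
The confidence-based selection principle described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the ``Detection`` stand-in, the ``select_by_confidence`` name, and the use of ``None`` in place of ``__.absent`` are all assumptions. It also assumes that mapping iteration order reflects registration order, and uses the ``>=`` threshold comparison given in the selection contract.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Threshold value as stated in the design text.
CONFIDENCE_THRESHOLD_MINIMUM = 0.5


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for the project's detection result."""
    processor_name: str
    confidence: float


def select_by_confidence(
    detections: Mapping[str, Detection],
) -> Optional[Detection]:
    """Return the best qualifying detection, or None if none qualifies."""
    best: Optional[Detection] = None
    # Assumption: iteration order of the mapping is registration order.
    for detection in detections.values():
        if detection.confidence < CONFIDENCE_THRESHOLD_MINIMUM:
            continue  # below threshold: never considered
        # Strict '>' keeps the earlier-registered processor on ties.
        if best is None or detection.confidence > best.confidence:
            best = detection
    return best
```

Because the comparison is strict, two processors with equal confidence resolve to whichever was registered first, which keeps the tiebreak stable across runs.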
- -Component Structure -------------------------------------------------------------------------------- - -**Detection Orchestration** (``detection.py``) - Central coordination of processor selection across inventory and structure - genera. Provides both high-level convenience functions and low-level - extensible functions for custom processor mappings. - -**Cache Management** - TTL-based caching system with lazy expiration cleanup. Separate cache - instances per processor genus enable different configuration and evolution - patterns. - -**Processor Integration** - Abstract base classes in ``processors.py`` define detection contracts. - Format-specific implementations in ``inventories/`` and ``structures/`` - subpackages provide concrete detection logic. - -Processor Genera System -=============================================================================== - -ProcessorGenera Enumeration -------------------------------------------------------------------------------- - -The system defines distinct processor genera that operate independently: - -.. code-block:: python - - class ProcessorGenera( __.typx.Enum ): - ''' Enumeration of processor genera for detection orchestration. ''' - - Inventory = 'inventory' # Inventory object extraction processors - Structure = 'structure' # Content extraction processors - -**Inventory Processors**: Extract object inventories from documentation sources, -providing discovery and search capabilities across different documentation formats. -Detailed architecture covered in ``inventory-processors.rst``. - -**Structure Processors**: Extract content from documentation pages, transforming -HTML into structured documents for search and analysis. Detailed architecture -covered in ``structure-processors.rst``. 
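
To make the per-genus caching idea concrete, here is a simplified sketch of a TTL cache with lazy expiration and one instance per genus. The names ``CacheEntry``, ``DetectionsCacheSketch``, and ``CACHES`` are hypothetical, detections are reduced to plain confidence floats, and the explicit ``now`` parameter exists only so the behaviour can be exercised deterministically. The project's own cache classes may differ.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Mapping, Optional


class ProcessorGenera(Enum):
    Inventory = 'inventory'
    Structure = 'structure'


@dataclass(frozen=True)
class CacheEntry:
    detections: Mapping[str, float]  # processor name -> confidence (simplified)
    timestamp: float                 # creation time of the entry
    ttl: int                         # lifetime in seconds

    def invalid(self, current_time: float) -> bool:
        return current_time - self.timestamp > self.ttl


@dataclass
class DetectionsCacheSketch:
    ttl: int = 3600
    _entries: Dict[str, CacheEntry] = field(default_factory=dict)

    def add_entry(
        self, source: str, detections: Mapping[str, float],
        now: Optional[float] = None,
    ) -> 'DetectionsCacheSketch':
        stamp = time.time() if now is None else now
        self._entries[source] = CacheEntry(dict(detections), stamp, self.ttl)
        return self  # allows method chaining

    def access_detections(
        self, source: str, now: Optional[float] = None,
    ) -> Optional[Mapping[str, float]]:
        current = time.time() if now is None else now
        entry = self._entries.get(source)
        if entry is None:
            return None
        if entry.invalid(current):
            # Lazy cleanup: expired entries are dropped only when touched.
            del self._entries[source]
            return None
        return entry.detections


# One independent cache per genus, so each can be configured separately.
CACHES = {genus: DetectionsCacheSketch() for genus in ProcessorGenera}
```

A caller that receives ``None`` would run fresh detection and store the result with ``add_entry``. Entries for different sources expire independently because each carries its own timestamp.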
- -Genus-Specific Detection Pipelines -------------------------------------------------------------------------------- - -Each processor genus maintains independent detection infrastructure: - -**Separate Cache Instances**: Each genus has dedicated cache management with -genus-appropriate TTL values and eviction strategies. - -**Independent Processor Registries**: Processor registration and discovery -operates independently per genus, enabling different processor lifecycle management. - -**Genus-Specific Selection Logic**: Processor selection algorithms can differ -between genera based on their operational characteristics and requirements. - -**Separate Error Handling**: Each genus implements error handling appropriate -to its operational context and failure modes. - -Interface Specifications -=============================================================================== - -Primary Detection Functions -------------------------------------------------------------------------------- - -.. code-block:: python - - async def detect( - auxdata: _state.Globals, - source: str, /, - genus: _interfaces.ProcessorGenera, *, - processor_name: __.Absential[ str ] = __.absent, - ) -> _processors.Detection - - async def detect_inventory( - auxdata: _state.Globals, - source: str, /, *, - processor_name: __.Absential[ str ] = __.absent, - ) -> _processors.InventoryDetection - - async def detect_structure( - auxdata: _state.Globals, - source: str, /, *, - processor_name: __.Absential[ str ] = __.absent, - ) -> _processors.StructureDetection - -**Contract:** -- Returns highest-confidence processor detection above threshold -- Raises ``ProcessorInavailability`` if no suitable processor found -- Bypasses detection when specific ``processor_name`` provided -- Maintains detection results in genus-specific cache - -Cache Access Functions -------------------------------------------------------------------------------- - -.. 
code-block:: python - - async def access_detections( - auxdata: _state.Globals, - source: str, /, *, - genus: _interfaces.ProcessorGenera - ) -> tuple[ - _processors.DetectionsByProcessor, - __.Absential[ _processors.Detection ] - ] - - async def access_detections_ll( - auxdata: _state.Globals, - source: str, /, *, - cache: DetectionsCache, - processors: __.cabc.Mapping[ str, _processors.Processor ], - ) -> tuple[ - _processors.DetectionsByProcessor, - __.Absential[ _processors.Detection ] - ] - -**Contract:** -- Returns all processor detections plus optimal selection -- Executes fresh detection if cache miss or expiration -- Low-level variant accepts arbitrary processor mapping for extensibility -- Never raises exceptions; returns ``__.absent`` for missing optimal detection - -Data Structures -=============================================================================== - -Detection Cache Design -------------------------------------------------------------------------------- - -.. code-block:: python - - class DetectionsCacheEntry( __.immut.DataclassObject ): - detections: __.cabc.Mapping[ str, _processors.Detection ] - timestamp: float - ttl: int - - @property - def detection_optimal( self ) -> __.Absential[ _processors.Detection ] - - def invalid( self, current_time: float ) -> bool - - class DetectionsCache( __.immut.DataclassObject ): - ttl: int = 3600 - _entries: dict[ str, DetectionsCacheEntry ] = __.dcls.field( - default_factory = dict[ str, DetectionsCacheEntry ] ) - - def access_detections( - self, source: str - ) -> __.Absential[ _processors.DetectionsByProcessor ] - - def access_detection_optimal( - self, source: str - ) -> __.Absential[ _processors.Detection ] - - def add_entry( - self, source: str, detections: _processors.DetectionsByProcessor - ) -> __.typx.Self - -**Design Features:** -- TTL-based expiration with configurable timeouts per cache instance -- Lazy cleanup on access operations to minimize overhead -- Pre-computed optimal selection 
stored in cache entries -- Method chaining support through ``__.typx.Self`` returns - -Type Aliases -------------------------------------------------------------------------------- - -.. code-block:: python - - DetectionsByProcessor: __.typx.TypeAlias = __.cabc.Mapping[ - str, _processors.Detection ] - -**Purpose:** Provides semantic clarity for function signatures and return types -while maintaining wide parameter acceptance patterns. - -Behavioral Contracts -=============================================================================== - -Processor Selection Contract -------------------------------------------------------------------------------- - -**Selection Algorithm:** -1. Execute all processors in genus-specific registry on source -2. Filter results to confidence >= ``CONFIDENCE_THRESHOLD_MINIMUM`` (0.5) -3. Select highest confidence; use registration order for ties -4. Return ``__.absent`` if no processors meet confidence threshold - -**Error Handling:** -- Individual processor detection failures are logged but not propagated -- Failed processors are excluded from selection consideration -- Selection continues with remaining successful processors - -Cache Management Contract -------------------------------------------------------------------------------- - -**Cache Population:** -- Fresh detection triggered on cache miss or TTL expiration -- All genus processors executed in parallel (future enhancement) -- Results cached regardless of optimal selection success - -**Cache Access:** -- Thread-safe read operations using immutable data structures -- Expired entries removed lazily on access -- Missing or expired entries trigger fresh processor execution - -**TTL Management:** -- Configurable per-cache instance (default: 3600 seconds) -- Based on cache entry creation timestamp -- Independent expiration per source URL - -Extension Points -=============================================================================== - -Processor Genus Extension 
-------------------------------------------------------------------------------- - -**Adding New Processor Types:** -1. Extend ``ProcessorGenera`` enumeration in ``interfaces.py`` -2. Add genus-specific cache instance in ``detection.py`` -3. Update genus dispatch in ``access_detections`` function -4. Register processors in genus-specific registry - -**Processor Implementation Requirements:** -- Implement ``detect`` method returning confidence-scored ``Detection`` -- Handle detection failures gracefully (should not raise exceptions) -- Return confidence score in range 0.0-1.0 -- Provide processor capabilities metadata - -Cache Strategy Extension -------------------------------------------------------------------------------- - -**Custom Cache Implementations:** -- ``DetectionsCache`` interface supports alternative implementations -- Size-based eviction strategies can be added via subclassing -- Different TTL strategies per processor type or source pattern -- External cache stores (Redis, etc.) through interface compliance - -**Performance Optimization:** -- Parallel processor execution via async fanout (marked TODO) -- Processor-specific timeout configuration -- Cache warming strategies for frequently accessed sources - -Error Handling Design -=============================================================================== - -Structured Error Response System -------------------------------------------------------------------------------- - -The system implements a structured error response pattern where the functions layer -handles all processor detection exceptions and returns user-friendly structured -responses. This design eliminates error interpretation at interface layers while -providing consistent, actionable error messaging. - -**Response Structure:** - -.. 
code-block:: python - - ErrorResponse: __.typx.TypeAlias = __.immut.Dictionary[ str, __.typx.Any ] - - def _produce_inventory_error_response( - source: str, - attempted_patterns: __.Absential[ __.cabc.Sequence[ str ] ] = __.absent - ) -> ErrorResponse - - def _produce_structure_error_response( source: str ) -> ErrorResponse - - def _produce_generic_error_response( - source: str, genus: str - ) -> ErrorResponse - -**Error Response Content:** -- Structured responses include error type, user-friendly title, detailed message -- Actionable suggestions provided based on specific failure scenarios -- Clear distinction between inventory and structure detection failures -- Pre-formatted messages eliminate interface layer error interpretation - -Automatic URL Pattern Extension -------------------------------------------------------------------------------- - -The detection system implements universal URL pattern extension that applies to -all processor types. When detection fails at the original URL, the system -automatically probes common documentation site patterns before reporting failure. - -**Universal Pattern Extension:** -- Applies to both inventory and structure processors uniformly -- Documentation content location affects both inventory files and content uniformly -- Common patterns include ``/en/latest/``, ``/latest/``, ``/main/``, etc. -- Working URLs are cached in global redirects mapping for future operations - -**Redirects Cache Integration:** - -.. 
code-block:: python - - _url_redirects_cache: dict[ str, str ] # original_url → working_url - - def normalize_location( location: str ) -> str - -**Transparent URL Resolution:** -- All operations automatically use working URLs from redirects cache -- Users receive actual working URLs as canonical source in responses -- Cache updates ensure consistent URL usage across all subsequent operations - -Exception Hierarchy -------------------------------------------------------------------------------- - -**Core Exceptions:** - -.. code-block:: python - - class ProcessorInavailability( Omnierror, RuntimeError ): - ''' No processor found to handle source. ''' - - def __init__( - self, source: str, genus: str, - attempted_processors: __.cabc.Sequence[ str ] - ) - - class DetectionFailure( Omnierror, RuntimeError ): - ''' Processor detection operation failed. ''' - - def __init__( - self, source: str, genus: str, - processor_errors: __.cabc.Mapping[ str, Exception ] - ) - -**Error Propagation:** -- Individual processor failures are caught and logged, not propagated upward -- Functions layer catches all detection exceptions and produces structured responses -- Interface layers receive pre-formatted error information, never raw exceptions - -Multiple Inventory Handling Strategy -=============================================================================== - -Processor Precedence Design -------------------------------------------------------------------------------- - -When multiple inventory processors successfully detect inventory sources for the -same documentation site, the system applies a precedence-based selection strategy -to maintain consistency and user predictability. - -**Detection Precedence Order:** -1. **Sphinx Inventory Processor** (``objects.inv`` files) -2. **MkDocs Inventory Processor** (``search_index.json`` files) -3. **Future processors** in registration order - -**Precedence Selection Algorithm:** - -.. 
code-block:: python - - def select_optimal_detection( - detections: __.cabc.Mapping[ str, _processors.Detection ] - ) -> __.Absential[ _processors.Detection ]: - ''' Selects optimal detection using precedence and confidence. ''' - # 1. Filter detections meeting confidence threshold - # 2. Apply processor precedence order for qualified detections - # 3. Use highest confidence as tiebreaker within same precedence level - # 4. Return __.absent if no detections meet threshold - -**Design Rationale:** -- **Consistency**: Predictable processor selection across documentation sites -- **Granularity**: Sphinx inventories provide API-level symbol granularity -- **Completeness**: MkDocs search indices provide page-level content coverage -- **Extensibility**: Registration order precedence supports future processor types - -Inventory Content Coordination -------------------------------------------------------------------------------- - -For sites with multiple detected inventories, the system coordinates content -operations to leverage the selected inventory processor while maintaining -architectural separation between inventory and structure processing. 
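
The four commented steps in ``select_optimal_detection`` earlier in this section could be realised roughly as below. This is a sketch under stated assumptions: the processor names in ``PRECEDENCE`` are hypothetical, detections are reduced to a name-to-confidence mapping, ``None`` stands in for ``__.absent``, and processors missing from the precedence list are all treated as sharing the lowest precedence level.

```python
from typing import Mapping, Optional, Sequence

CONFIDENCE_THRESHOLD_MINIMUM = 0.5

# Hypothetical processor names, highest precedence first.
PRECEDENCE: Sequence[str] = ('sphinx', 'mkdocs')


def select_with_precedence(
    confidences: Mapping[str, float],
    precedence: Sequence[str] = PRECEDENCE,
) -> Optional[str]:
    """Return the name of the selected processor, or None."""
    # Step 1: keep only detections that meet the threshold.
    qualified = {
        name: confidence for name, confidence in confidences.items()
        if confidence >= CONFIDENCE_THRESHOLD_MINIMUM}
    # Step 4: nothing qualified.
    if not qualified:
        return None

    def rank(name: str) -> int:
        # Unlisted processors share the lowest precedence level (assumption).
        return precedence.index(name) if name in precedence else len(precedence)

    # Steps 2 and 3: lower rank wins; higher confidence breaks ties in a level.
    return min(qualified, key=lambda name: (rank(name), -qualified[name]))
```

Note that precedence is applied only after thresholding, so a low-confidence Sphinx detection does not displace a qualifying MkDocs detection.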
- -**Content Operation Coordination:** -- Selected inventory processor determines object enumeration and filtering -- Structure processors operate independently on content extraction -- Content queries use inventory-selected URIs to guide structure processor operations -- No cross-processor inventory merging to maintain architectural boundaries - -**Cache Strategy:** -- Detection cache stores all successful detections per processor type -- Optimal detection selection cached separately from individual processor results -- Cache entries track processor precedence decisions for consistency -- TTL expiration applies uniformly to all cached detection results - -This detection system design provides robust, extensible automated processor -selection while maintaining clean architectural boundaries between processor -genera and established project practices compliance. \ No newline at end of file diff --git a/documentation/architecture/designs/results-objects.rst b/documentation/architecture/designs/results-objects.rst deleted file mode 100644 index afaefb5..0000000 --- a/documentation/architecture/designs/results-objects.rst +++ /dev/null @@ -1,902 +0,0 @@ -.. vim: set fileencoding=utf-8: -.. -*- coding: utf-8 -*- -.. +--------------------------------------------------------------------------+ - | | - | Licensed under the Apache License, Version 2.0 (the "License"); | - | you may not use this file except in compliance with the License. | - | You may obtain a copy of the License at | - | | - | http://www.apache.org/licenses/LICENSE-2.0 | - | | - | Unless required by applicable law or agreed to in writing, software | - | distributed under the License is distributed on an "AS IS" BASIS, | - | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | - | See the License for the specific language governing permissions and | - | limitations under the License. 
| - | | - +--------------------------------------------------------------------------+ - - -******************************************************************************* -Results Module Design -******************************************************************************* - -Overview -=============================================================================== - -The results module provides a centralized collection of structured dataclass -objects representing search results, inventory objects, and content documents. -This module serves as the foundation for type-safe operations across all -interface layers while maintaining clean separation between data representation -and business logic. - -Design Principles -=============================================================================== - -Architectural Foundation -------------------------------------------------------------------------------- - -**Centralized Type Definitions** - All result-related dataclasses reside in a single module to ensure consistency - and prevent circular dependencies between processor, search, and function - modules. - -**Immutable Data Structures** - All result objects inherit from ``__.immut.DataclassObject`` following project - practices for thread safety and predictable behavior in concurrent operations. - -**Universal Object Interface** - All inventory processors return ``InventoryObject`` instances rather than - format-specific dictionaries, providing type safety and enabling consistent - search operations across different inventory formats. - -**Complete Source Attribution** - Every result object includes complete provenance information enabling - debugging, caching optimization, and future multi-source operations without - requiring separate tracking mechanisms. - -**Clean Separation of Concerns** - Inventory objects represent pure documentation metadata without search-specific - fields. Search results wrap inventory objects with relevance scoring. 
This - separation allows inventory objects to be reused across different search - contexts and enables search-independent operations. - -**Self-Rendering Object Architecture** - All result objects implement standardized rendering methods for different - output formats, encapsulating domain-specific formatting knowledge within - the objects themselves rather than external formatting functions. - -Core Object Definitions -=============================================================================== - -Universal Inventory Object -------------------------------------------------------------------------------- - -.. code-block:: python - - class InventoryObject( __.immut.DataclassObject ): - ''' Universal inventory object with complete source attribution. - - Represents a single documentation object from any inventory source - with standardized fields, format-specific metadata container, and - self-formatting capabilities where each processor creates objects - that know how to render their own specifics data. - ''' - - # Universal identification fields - name: __.typx.Annotated[ - str, __.ddoc.Doc( "Primary object identifier from inventory source." ) ] - uri: __.typx.Annotated[ - str, __.ddoc.Doc( "Relative URI to object documentation content." ) ] - inventory_type: __.typx.Annotated[ - str, __.ddoc.Doc( "Inventory format identifier (e.g., sphinx_objects_inv)." ) ] - location_url: __.typx.Annotated[ - str, __.ddoc.Doc( "Complete URL to inventory location for attribution." ) ] - - # Optional display enhancement - display_name: __.typx.Annotated[ - __.typx.Optional[ str ], - __.ddoc.Doc( "Human-readable name if different from name." ) ] = None - - # Format-specific metadata container - specifics: __.typx.Annotated[ - __.immut.Dictionary[ str, __.typx.Any ], - __.ddoc.Doc( "Format-specific metadata (domain, role, priority, etc.)." 
) - ] = __.dcls.field( default_factory = __.immut.Dictionary ) - - - @property - def effective_display_name( self ) -> str: - ''' Returns display_name if available, otherwise falls back to name. ''' - - # Self-formatting capabilities (processor-provided formatters) - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders complete object as JSON-compatible dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: __.typx.Annotated[ - bool, - __.ddoc.Doc( ''' - Controls whether implementation-specific details (internal field names, - version numbers, priority scores) are included. When False, only - user-facing information is shown. - ''' ) - ] = True, - ) -> tuple[ str, ... ]: - ''' Renders complete object as Markdown lines for display. '''' - - -**Universal Fields** -- ``name``: Primary object identifier from inventory location -- ``uri``: Relative URI to object documentation content -- ``inventory_type``: Format identifier (e.g., "sphinx_objects_inv", "mkdocs_search_index") -- ``location_url``: Complete URL to inventory location for debugging and caching - -**Format-Specific Metadata** -- ``specifics``: Immutable dictionary containing processor-specific fields -- Sphinx objects include: ``domain``, ``role``, ``priority``, ``inventory_project``, ``inventory_version`` -- MkDocs objects include: ``object_type`` (content previews handled by structure processors) - -Search Result Objects -------------------------------------------------------------------------------- - -.. code-block:: python - - class SearchResult( __.immut.DataclassObject ): - ''' Search result with inventory object and match metadata. ''' - - inventory_object: __.typx.Annotated[ - InventoryObject, __.ddoc.Doc( "Matched inventory object with metadata." ) ] - score: __.typx.Annotated[ - float, __.ddoc.Doc( "Search relevance score (0.0-1.0)." ) ] - match_reasons: __.typx.Annotated[ - tuple[ str, ... ], - __.ddoc.Doc( "Detailed reasons for search match." 
) ] - - @classmethod - def from_inventory_object( - cls, - inventory_object: InventoryObject, *, - score: float, - match_reasons: __.cabc.Sequence[ str ], - ) -> __.typx.Self: - ''' Creates search result from inventory object with scoring. ''' - -Content and Documentation Objects -------------------------------------------------------------------------------- - -.. code-block:: python - - class ContentDocument( __.immut.DataclassObject ): - ''' Documentation content with extracted metadata and content identification. ''' - - inventory_object: __.typx.Annotated[ - InventoryObject, __.ddoc.Doc( "Location inventory object for this content." ) ] - content_id: __.typx.Annotated[ - str, __.ddoc.Doc( "Deterministic identifier for content retrieval." ) ] - description: __.typx.Annotated[ - str, __.ddoc.Doc( "Extracted object description or summary." ) ] = '' - documentation_url: __.typx.Annotated[ - str, __.ddoc.Doc( "Complete URL to full documentation page." ) ] = '' - - # Structure processor metadata - extraction_metadata: __.typx.Annotated[ - __.immut.Dictionary[ str, __.typx.Any ], - __.ddoc.Doc( "Metadata from structure processor extraction." ) - ] = __.dcls.field( default_factory = __.immut.Dictionary ) - - @property - def has_meaningful_content( self ) -> bool: - ''' Returns True if document contains useful extracted content. ''' - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders complete document as JSON-compatible dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - ) -> tuple[ str, ... ]: - ''' Renders complete document as Markdown lines for display. '''' - -Content Identification System -------------------------------------------------------------------------------- - -The content ID system enables browse-then-extract workflows by providing stable identifiers for documentation content objects. 
Content IDs are deterministic identifiers that allow users to first query with truncated results for previews, then extract full content for specific objects. - -**Content ID Generation Strategy** - -Content IDs use deterministic object identification: ``base64(location + ":" + object_name)`` - -**Design Benefits:** - -- **Stateless Architecture**: Content IDs are self-contained, requiring no session storage -- **Stable Identification**: Same object always generates same ID regardless of query timing -- **Human-Debuggable**: IDs can be decoded to understand referenced objects -- **Performance**: No expensive computation or state tracking required - -**Usage Pattern:** - -.. code-block:: python - - # Stage 1: Browse with previews - generates content IDs for all results - preview_result = await query_content( - auxdata, location, term, lines_max = 5 ) - - # Stage 2: Extract full content using content ID from preview - full_result = await query_content( - auxdata, location, term, - content_id = preview_result.documents[0].content_id, - lines_max = 100 ) - -**Interface Integration:** - -The content_id parameter extends the existing query_content function: - -- **Without content_id**: Returns multiple ContentDocument objects with content IDs populated -- **With content_id**: Filters to single matching ContentDocument with full content -- **Error Handling**: Invalid content IDs raise ProcessorInavailability exceptions - -This design transforms query_content from a simple search function into a flexible content navigation tool while maintaining complete backward compatibility and stateless operation. - -Query Metadata Objects -=============================================================================== - -Search and Operation Metadata -------------------------------------------------------------------------------- - -.. code-block:: python - - class SearchMetadata( __.immut.DataclassObject ): - ''' Search operation metadata and performance statistics. 
''' - - results_count: __.typx.Annotated[ - int, __.ddoc.Doc( "Number of results returned to user." ) ] - results_max: __.typx.Annotated[ - int, __.ddoc.Doc( "Maximum results requested by user." ) ] - matches_total: __.typx.Annotated[ - __.typx.Optional[ int ], - __.ddoc.Doc( "Total matching objects before limit applied." ) ] = None - search_time_ms: __.typx.Annotated[ - __.typx.Optional[ int ], - __.ddoc.Doc( "Search execution time in milliseconds." ) ] = None - - @property - def results_truncated( self ) -> bool: - ''' Returns True if results were limited by results_max. ''' - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders search metadata as JSON-compatible dictionary. '''' - - class InventoryLocationInfo( __.immut.DataclassObject ): - ''' Information about detected inventory location and processor. ''' - - inventory_type: __.typx.Annotated[ - str, __.ddoc.Doc( "Inventory format type identifier." ) ] - location_url: __.typx.Annotated[ - str, __.ddoc.Doc( "Complete URL to inventory location." ) ] - processor_name: __.typx.Annotated[ - str, __.ddoc.Doc( "Name of processor handling this location." ) ] - confidence: __.typx.Annotated[ - float, __.ddoc.Doc( "Detection confidence score (0.0-1.0)." ) ] - object_count: __.typx.Annotated[ - int, __.ddoc.Doc( "Total objects available in this inventory." ) ] - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders location info as JSON-compatible dictionary. ''' - -Detection Result Objects -------------------------------------------------------------------------------- - -.. code-block:: python - - class Detection( __.immut.DataclassObject ): - ''' Processor detection information with confidence scoring. ''' - - processor_name: __.typx.Annotated[ - str, __.ddoc.Doc( "Name of the processor that can handle this location." ) ] - confidence: __.typx.Annotated[ - float, __.ddoc.Doc( "Detection confidence score (0.0-1.0)." 
) ] - processor_type: __.typx.Annotated[ - str, __.ddoc.Doc( "Type of processor (inventory, structure)." ) ] - detection_metadata: __.typx.Annotated[ - __.immut.Dictionary[ str, __.typx.Any ], - __.ddoc.Doc( "Processor-specific detection metadata." ) - ] = __.dcls.field( default_factory = __.immut.Dictionary ) - - class DetectionsResult( __.immut.DataclassObject ): - ''' Detection results with processor selection and timing metadata. ''' - - source: __.typx.Annotated[ - str, __.ddoc.Doc( "Primary location URL for detection operation." ) ] - detections: __.typx.Annotated[ - tuple[ Detection, ... ], - __.ddoc.Doc( "All processor detections found for location." ) ] - detection_optimal: __.typx.Annotated[ - __.typx.Optional[ Detection ], - __.ddoc.Doc( "Best detection result based on confidence scoring." ) ] - time_detection_ms: __.typx.Annotated[ - int, __.ddoc.Doc( "Detection operation time in milliseconds." ) ] - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders detection results as JSON-compatible dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - ) -> tuple[ str, ... ]: - ''' Renders detection results as Markdown lines for display. '''' - -Processor Survey Result Objects -------------------------------------------------------------------------------- - -.. code-block:: python - - class ProcessorInfo( __.immut.DataclassObject ): - ''' Information about a processor and its capabilities. ''' - - processor_name: __.typx.Annotated[ - str, __.ddoc.Doc( "Name of the processor for identification." ) ] - processor_type: __.typx.Annotated[ - str, __.ddoc.Doc( "Type of processor (inventory, structure)." ) ] - capabilities: __.typx.Annotated[ - __.interfaces.ProcessorCapabilities, - __.ddoc.Doc( "Complete capability description for processor." ) ] - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders processor info as JSON-compatible dictionary. 
''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - ) -> tuple[ str, ... ]: - ''' Renders processor info as Markdown lines for display. ''' - - class ProcessorsSurveyResult( __.immut.DataclassObject ): - ''' Survey results listing available processors and capabilities. ''' - - genus: __.typx.Annotated[ - __.interfaces.ProcessorGenera, - __.ddoc.Doc( "Processor genus that was surveyed (inventory or structure)." ) ] - filter_name: __.typx.Annotated[ - __.typx.Optional[ str ], - __.ddoc.Doc( "Optional processor name filter applied to survey." ) ] = None - processors: __.typx.Annotated[ - tuple[ ProcessorInfo, ... ], - __.ddoc.Doc( "Available processors matching survey criteria." ) ] - survey_time_ms: __.typx.Annotated[ - int, __.ddoc.Doc( "Survey operation time in milliseconds." ) ] - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders survey results as JSON-compatible dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - ) -> tuple[ str, ... ]: - ''' Renders survey results as Markdown lines for display. ''' - -Error Handling Objects -------------------------------------------------------------------------------- - -The error handling architecture supports both structured error responses for API boundaries and self-rendering exceptions for natural Python exception flow. This dual approach enables clean function signatures while maintaining structured error information across interface layers. - -**Self-Rendering Exception Base Classes** - -.. code-block:: python - - class Omniexception( __.immut.Object, BaseException ): - ''' Base for all exceptions raised by package API. ''' - - class Omnierror( Omniexception, Exception ): - ''' Base for error exceptions with self-rendering capability. ''' - - @__.abc.abstractmethod - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders exception as JSON-compatible dictionary.
''' - - @__.abc.abstractmethod - def render_as_markdown( self ) -> tuple[ str, ... ]: - ''' Renders exception as Markdown lines for display. ''' - -**Domain-Specific Self-Rendering Exceptions** - -.. code-block:: python - - class ProcessorInavailability( Omnierror, RuntimeError ): - ''' No processor found to handle source. ''' - - def __init__( - self, - source: __.typx.Annotated[ - str, __.ddoc.Doc( "Source URL that could not be processed." ) ], - genus: __.Absential[ str ] = __.absent, - query: __.Absential[ str ] = __.absent, - ): ... - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders processor unavailability as JSON-compatible dictionary. ''' - - class InventoryInaccessibility( Omnierror, RuntimeError ): - ''' Inventory location cannot be accessed. ''' - - def __init__( - self, - location: __.typx.Annotated[ - str, __.ddoc.Doc( "Inventory location URL." ) ], - cause: __.typx.Annotated[ - __.typx.Optional[ BaseException ], - __.ddoc.Doc( "Underlying exception that caused inaccessibility." ) - ] = None, - ): ... - - class InventoryInvalidity( Omnierror, ValueError ): - ''' Inventory data format is invalid or corrupted. ''' - - def __init__( - self, - location: __.typx.Annotated[ - str, __.ddoc.Doc( "Inventory location URL." ) ], - details: __.typx.Annotated[ - str, __.ddoc.Doc( "Description of invalidity." ) - ], - ): ... - - -Complete Query Results -------------------------------------------------------------------------------- - -.. code-block:: python - - class InventoryQueryResult( __.immut.DataclassObject ): - ''' Complete result structure for inventory queries. ''' - - location: __.typx.Annotated[ - str, __.ddoc.Doc( "Primary location URL for this query." ) ] - query: __.typx.Annotated[ - str, __.ddoc.Doc( "Search term or query string used." ) ] - objects: __.typx.Annotated[ - tuple[ InventoryObject, ... ], - __.ddoc.Doc( "Inventory objects matching search criteria." 
) ] - search_metadata: __.typx.Annotated[ - SearchMetadata, __.ddoc.Doc( "Search execution and result metadata." ) ] - inventory_locations: __.typx.Annotated[ - tuple[ InventoryLocationInfo, ... ], - __.ddoc.Doc( "Information about inventory locations used." ) ] - - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders inventory query result as JSON-compatible dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - ) -> tuple[ str, ... ]: - ''' Renders inventory query result as Markdown lines for display. ''' - - class ContentQueryResult( __.immut.DataclassObject ): - ''' Complete result structure for content queries. ''' - - location: __.typx.Annotated[ - str, __.ddoc.Doc( "Primary location URL for this query." ) ] - query: __.typx.Annotated[ - str, __.ddoc.Doc( "Search term or query string used." ) ] - documents: __.typx.Annotated[ - tuple[ ContentDocument, ... ], - __.ddoc.Doc( "Documentation content for matching objects." ) ] - search_metadata: __.typx.Annotated[ - SearchMetadata, __.ddoc.Doc( "Search execution and result metadata." ) ] - inventory_locations: __.typx.Annotated[ - tuple[ InventoryLocationInfo, ... ], - __.ddoc.Doc( "Information about inventory locations used." ) ] - - def render_as_json( - self, /, *, - lines_max: __.typx.Optional[ int ] = None, - ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders content query result as JSON-compatible dictionary with optional content truncation. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True, - lines_max: __.typx.Annotated[ - __.typx.Optional[ int ], - __.ddoc.Doc( "Maximum lines to display per content result." ) - ] = None, - ) -> tuple[ str, ... ]: - ''' Renders content query result as Markdown lines for display. 
''' - -Processor Integration Design -=============================================================================== - -Enhanced Base Classes -------------------------------------------------------------------------------- - -The processor layer integrates with structured objects through updated return types: - -.. code-block:: python - - # processors.py - Enhanced base class - class InventoryDetection( Detection ): - ''' Enhanced base class returning structured objects. ''' - - @__.abc.abstractmethod - async def filter_inventory( - self, - auxdata: __.ApplicationGlobals, - location: str, /, *, - filters: __.cabc.Mapping[ str, __.typx.Any ], - details: __.InventoryQueryDetails = ( - __.InventoryQueryDetails.Documentation ), - ) -> tuple[ InventoryObject, ... ]: - ''' Returns structured inventory objects instead of dictionaries. ''' - -Processor Object Formatting -------------------------------------------------------------------------------- - -Each processor provides consistent object formatting: - -.. code-block:: python - - # Sphinx processor formatting - def format_inventory_object( - sphinx_object: __.typx.Any, - inventory: __.typx.Any, - location_url: str, - ) -> InventoryObject: - ''' Formats Sphinx inventory object with complete attribution. ''' - - return InventoryObject( - name = sphinx_object.name, - uri = sphinx_object.uri, - inventory_type = 'sphinx_objects_inv', - location_url = location_url, - display_name = ( - sphinx_object.dispname - if sphinx_object.dispname != '-' - else None ), - specifics = __.immut.Dictionary( - domain = sphinx_object.domain, - role = sphinx_object.role, - priority = sphinx_object.priority, - inventory_project = inventory.project, - inventory_version = inventory.version ) ) - - # MkDocs processor formatting - def format_inventory_object( - mkdocs_document: __.cabc.Mapping[ str, __.typx.Any ], - location_url: str, - ) -> InventoryObject: - ''' Formats MkDocs search index document with attribution. 
''' - - typed_doc = dict( mkdocs_document ) - location = str( typed_doc.get( 'location', '' ) ) - title = str( typed_doc.get( 'title', '' ) ) - - return InventoryObject( - name = title, - uri = location, - inventory_type = 'mkdocs_search_index', - location_url = location_url, - specifics = __.immut.Dictionary( - domain = 'page', - role = 'doc', - priority = '1', - object_type = 'page' ) ) - -Functions Layer Integration -=============================================================================== - -Enhanced Business Logic Functions -------------------------------------------------------------------------------- - -The functions module provides clean business logic functions using natural exception flow with self-rendering exceptions: - -.. code-block:: python - - # functions.py - Clean signatures with exception-based error handling - async def query_inventory( - auxdata: __.ApplicationGlobals, - location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], - term: str, /, *, - processor_name: __.Absential[ str ] = __.absent, - search_behaviors: __.SearchBehaviors = _search_behaviors_default, - filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, - details: __.InventoryQueryDetails = ( - __.InventoryQueryDetails.Documentation ), - results_max: int = 5, - ) -> InventoryQueryResult: - ''' Returns structured inventory query results. Raises domain exceptions on error. ''' - - async def query_content( - auxdata: __.ApplicationGlobals, - location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], - term: str, /, *, - processor_name: __.Absential[ str ] = __.absent, - search_behaviors: __.SearchBehaviors = _search_behaviors_default, - filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, - content_id: __.Absential[ str ] = __.absent, - results_max: int = 10, - lines_max: __.typx.Optional[ int ] = None, - ) -> ContentQueryResult: - ''' Returns structured content query results. 
When content_id provided, returns single matching document. Raises domain exceptions on error. ''' - - async def detect( - auxdata: __.ApplicationGlobals, - location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], /, *, - processor_name: __.Absential[ str ] = __.absent, - processor_types: __.cabc.Sequence[ str ] = ( 'inventory', 'structure' ), - ) -> DetectionsResult: - ''' Returns structured detection results with processor selection and timing. ''' - - async def survey_processors( - auxdata: __.ApplicationGlobals, /, - genus: __.interfaces.ProcessorGenera, - name: __.typx.Optional[ str ] = None, - ) -> ProcessorsSurveyResult: - ''' Returns structured survey results listing available processors and capabilities. ''' - - -Error Handling Patterns -------------------------------------------------------------------------------- - -The system uses **self-rendering exceptions** for natural Python error flow with clean function signatures and consistent error presentation across interface layers. - -**Self-Rendering Exception Pattern** - -Functions use natural exception flow with domain-specific self-rendering exceptions: - -.. code-block:: python - - # Business logic functions with clean signatures - async def query_inventory( - auxdata: __.ApplicationGlobals, - location: str, - term: str, /, *, - search_behaviors: __.SearchBehaviors = _search_behaviors_default, - filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, - details: __.InventoryQueryDetails = __.InventoryQueryDetails.Documentation, - results_max: int = 5, - ) -> InventoryQueryResult: - ''' Returns structured inventory query results. Raises domain exceptions on error. ''' - - # Processor layer raises self-rendering exceptions - class SphinxInventoryProcessor: - async def query_inventory( - self, filters: __.cabc.Mapping[ str, __.typx.Any ], - details: __.InventoryQueryDetails - ) -> tuple[ __.InventoryObject, ... 
]: - try: - inventory = extract_inventory( base_url ) - return tuple( format_objects( inventory, filters ) ) - except ConnectionError as exc: - raise InventoryInaccessibility( location = url, cause = exc ) - except ParseError as exc: - raise InventoryInvalidity( location = url, details = str( exc ) ) - -**Interface Layer Exception Handling** - -Interface layers use Aspect-Oriented Programming (AOP) patterns with decorators: - -.. code-block:: python - - # MCP Server - Exception interception decorator signature - def intercept_errors( func ) -> __.cabc.Callable: - ''' Intercepts package exceptions and renders them as JSON for MCP. ''' - - @intercept_errors - async def query_inventory_mcp( location: str, term: str, ... ): - ''' Searches object inventory by name with fuzzy matching. ''' - - # CLI Layer - Parameterized exception handling decorator signature - def intercept_errors( - stream: __.typx.TextIO, - display_format: __.DisplayFormat - ) -> __.cabc.Callable: - ''' Creates decorator to intercept package exceptions and render for CLI. ''' - - -Search Engine Integration -=============================================================================== - -Enhanced Search Result Objects -------------------------------------------------------------------------------- - -.. code-block:: python - - # search.py - Enhanced to work with structured objects - def filter_by_name( - objects: __.cabc.Sequence[ InventoryObject ], - term: str, /, *, - match_mode: __.MatchMode = __.MatchMode.Fuzzy, - fuzzy_threshold: int = 50, - ) -> tuple[ SearchResult, ... ]: - ''' Enhanced search filtering returning structured results. ''' - -Self-Rendering Architecture -------------------------------------------------------------------------------- - -**Universal Rendering Interface** -All structured result objects implement standardized rendering methods: - -.. 
code-block:: python - - # Universal rendering interface for all result objects - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Renders object as JSON-compatible immutable dictionary. ''' - - def render_as_markdown( - self, /, *, - reveal_internals: bool = True - ) -> tuple[ str, ... ]: - ''' Renders object as Markdown lines for CLI display. ''' - -**Domain-Specific Rendering Implementation** -Each object encapsulates its own formatting logic: - -.. code-block:: python - - # InventoryObject rendering example - def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Returns JSON-compatible dictionary with domain knowledge. ''' - result = __.immut.Dictionary( - name = self.name, - uri = self.uri, - inventory_type = self.inventory_type, - location_url = self.location_url, - display_name = self.display_name, - effective_display_name = self.effective_display_name, - ) - # Merge with domain-specific formatting logic - return result.union( self.specifics ) - - def render_as_markdown( - self, /, *, reveal_internals: bool = True - ) -> tuple[ str, ... ]: - ''' Returns Markdown lines using processor-specific formatting. ''' - lines = [ f"### `{self.effective_display_name}`" ] - # Domain-specific formatting logic implemented by processors - return tuple( lines ) - -Validation and Type Safety -=============================================================================== - -Object Validation Strategy -------------------------------------------------------------------------------- - -Validation of result objects is implemented at object initialization -through ``__post_init__`` methods when validation is needed. This ensures -that invalid objects cannot be constructed and provides fail-fast behavior -with guaranteed valid state. - -Objects own their validity invariants through initialization-time validation -rather than relying on external validation functions. 
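The initialization-time validation described above can be sketched roughly as follows. This is an illustration only, not the project's code: it substitutes a frozen standard-library dataclass for `__.immut.DataclassObject`, and the `ConfidenceRecord` class and its specific checks are hypothetical.

```python
import dataclasses


@dataclasses.dataclass( frozen = True )
class ConfidenceRecord:
    ''' Hypothetical result object that validates itself on construction. '''

    processor_name: str
    confidence: float

    def __post_init__( self ) -> None:
        # Runs right after the generated __init__, so an invalid instance
        # is never handed back to the caller (fail-fast behavior).
        if not self.processor_name:
            raise ValueError( "processor_name must be non-empty." )
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(
                f"confidence must be within 0.0-1.0, got {self.confidence}." )


record = ConfidenceRecord( processor_name = 'sphinx', confidence = 0.9 )
```

Because the checks only read fields, they work unchanged on a frozen (immutable) dataclass; callers that receive a `ConfidenceRecord` can rely on its invariants without re-validating.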
- -Module Organization -=============================================================================== - -File Structure and Imports -------------------------------------------------------------------------------- - -.. code-block:: python - - # results.py - Core results module - from . import __ - - # Core result objects - class InventoryObject( __.immut.DataclassObject ): ... - class SearchResult( __.immut.DataclassObject ): ... - class ContentDocument( __.immut.DataclassObject ): ... - - # Metadata objects - class SearchMetadata( __.immut.DataclassObject ): ... - class InventoryLocationInfo( __.immut.DataclassObject ): ... - - # Complete query results - class InventoryQueryResult( __.immut.DataclassObject ): ... - class ContentQueryResult( __.immut.DataclassObject ): ... - class DetectionsResult( __.immut.DataclassObject ): ... - - # Survey results - class ProcessorInfo( __.immut.DataclassObject ): ... - class ProcessorsSurveyResult( __.immut.DataclassObject ): ... - - - # Serialization support - def serialize_for_json( ... ): ... - - # Type aliases (at end to avoid forward references) - InventoryObjects: __.typx.TypeAlias = __.cabc.Sequence[ InventoryObject ] - SearchResults: __.typx.TypeAlias = __.cabc.Sequence[ SearchResult ] - ContentDocuments: __.typx.TypeAlias = __.cabc.Sequence[ ContentDocument ] - - -.. code-block:: python - - # exceptions.py - Self-rendering exception hierarchy - from . import __ - - # Base exception hierarchy - class Omniexception( __.immut.Object, BaseException ): ... - class Omnierror( Omniexception, Exception ): ... - - # Domain-specific exceptions with self-rendering capabilities - class ProcessorInavailability( Omnierror, RuntimeError ): ... - class InventoryInaccessibility( Omnierror, RuntimeError ): ... - class InventoryInvalidity( Omnierror, ValueError ): ... - class ContentInaccessibility( Omnierror, RuntimeError ): ... - class ContentInvalidity( Omnierror, ValueError ): ... 
- -Presentation Layer Integration -=============================================================================== - -CLI and Renderers Integration -------------------------------------------------------------------------------- - -The self-rendering architecture enables clean separation between business logic -and presentation concerns: - -**Presentation vs Business Logic Separation** -- **Objects handle domain logic**: ``result.render_as_json()`` -- **CLI coordinators handle presentation**: truncation, formatting, display helpers -- **MCP server uses objects directly**: no CLI-specific presentation layer - -**Direct Self-Rendering Architecture** -Objects handle all presentation directly through self-rendering methods, eliminating -the need for external presentation coordination layers. - -Integration Benefits -=============================================================================== - -**Clean Function Signatures** -- Natural exception flow eliminates verbose union return types -- Business logic functions have clean success-case signatures -- Type annotations reflect actual success types without error boilerplate -- Function signatures become more readable and maintainable - -**Type Safety and IDE Support** -- Compile-time validation of object structure and field access -- Full IDE autocompletion and refactoring support -- Static analysis capabilities for detecting field usage -- Exception type hierarchy provides structured error catching patterns - -**Self-Rendering Architecture** -- Exceptions handle their own presentation logic through render methods -- Objects encapsulate format-specific knowledge within themselves -- Clean separation between business logic and presentation concerns -- Consistent error display across CLI and MCP interfaces without duplication - -**Aspect-Oriented Error Handling** -- Interface layers use decorators for cross-cutting error handling concerns -- Business logic remains pure with no error marshaling overhead -- Single point 
of error presentation control per interface layer -- Exception handling behavior easily modified without touching business functions - -**Domain-Specific Rendering** -- Processors provide domain expertise through object rendering methods -- Extensible rendering without modifying CLI or interface layers -- Complete error context preservation from point of failure to presentation -- Self-contained formatting logic reduces coupling between layers - -**Complete Source Attribution** -- Full provenance tracking for every inventory object -- Enhanced debugging capabilities with location-specific metadata -- Foundation for future multi-source aggregation capabilities -- Exception objects maintain complete failure context - -**Consistency and Maintainability** -- Unified interface across all inventory processor types -- Clear separation between universal and format-specific data -- Predictable object structure for interface layers -- Error handling complexity isolated to exception classes and decorators - -**Performance and Scalability** -- Immutable objects enable safe concurrent access -- Structural sharing reduces memory overhead -- Efficient serialization for network transmission -- Exception-based flow avoids creating error objects for success cases -- Domain-specific rendering optimizations contained within objects - -This results module design provides a robust foundation for type-safe operations -across all system components while maintaining clean architectural boundaries -and enabling future enhancements through structured object capabilities and -self-rendering architecture. 
\ No newline at end of file diff --git a/documentation/architecture/openspec/specs/caching/spec.md b/documentation/architecture/openspec/specs/caching/spec.md new file mode 100644 index 0000000..c3db4ae --- /dev/null +++ b/documentation/architecture/openspec/specs/caching/spec.md @@ -0,0 +1,36 @@ +# Caching + +## Purpose +The Caching capability ensures high performance and reduced network usage by storing retrieved inventories and content locally. + +## Requirements + +### Requirement: Caching System +The system SHALL implement intelligent caching for inventories and content. + +Priority: High + +#### Scenario: Cache Hit +- **WHEN** a user requests a resource that was recently accessed +- **THEN** the system returns the cached version +- **AND** no network request is made + +### Requirement: Cache Invalidation +The system SHALL enforce appropriate TTL and invalidation strategies. + +Priority: High + +#### Scenario: Cache Expiry +- **WHEN** a resource's TTL has expired +- **THEN** the system fetches a fresh copy from the network +- **AND** updates the cache + +### Requirement: Efficiency +The system SHALL use a memory-efficient caching strategy. + +Priority: High + +#### Scenario: Large Inventories +- **WHEN** large inventories are cached +- **THEN** the system manages memory usage effectively +- **AND** prevents excessive consumption diff --git a/documentation/architecture/openspec/specs/cli/spec.md b/documentation/architecture/openspec/specs/cli/spec.md new file mode 100644 index 0000000..14db317 --- /dev/null +++ b/documentation/architecture/openspec/specs/cli/spec.md @@ -0,0 +1,55 @@ +# CLI Interface + +## Purpose +The CLI (Command Line Interface) provides human developers with direct access to documentation search and extraction capabilities. It serves as a testing ground for the engine and a standalone tool for offline documentation access. 
+ +## Requirements + +### Requirement: CLI Implementation +The system SHALL implement a human-usable command-line interface. + +Priority: High + +#### Scenario: Basic Usage +- **WHEN** a user runs the `librovore` command +- **THEN** help text is displayed showing available commands + +### Requirement: Inventory Query Command +The CLI SHALL provide a command for searching documentation inventories. + +Priority: High + +#### Scenario: Searching from CLI +- **WHEN** a user runs `librovore search ` +- **THEN** the system searches configured inventories +- **AND** displays matching results in a human-readable table + +### Requirement: Content Extraction Command +The CLI SHALL provide a command for extracting full content. + +Priority: High + +#### Scenario: Extracting Content +- **WHEN** a user runs `librovore extract ` +- **THEN** the system downloads and processes the page +- **AND** outputs the clean Markdown to stdout or a file + +### Requirement: Output Formats +The CLI SHALL support multiple output formats (JSON, Markdown). + +Priority: High + +#### Scenario: JSON Output +- **WHEN** a user runs a command with `--format json` +- **THEN** the output is strictly valid JSON +- **AND** suitable for piping to tools like `jq` + +### Requirement: Configuration Support +The CLI SHALL support configuration files. + +Priority: High + +#### Scenario: Loading Config +- **WHEN** the CLI starts +- **THEN** it looks for a configuration file +- **AND** applies settings for inventories, cache, etc. 
diff --git a/documentation/architecture/designs/inventory-processors.rst b/documentation/architecture/openspec/specs/inventory-processing/design.md similarity index 54% rename from documentation/architecture/designs/inventory-processors.rst rename to documentation/architecture/openspec/specs/inventory-processing/design.md index c637503..8b1615f 100644 --- a/documentation/architecture/designs/inventory-processors.rst +++ b/documentation/architecture/openspec/specs/inventory-processing/design.md @@ -1,112 +1,87 @@ -.. vim: set fileencoding=utf-8: -.. -*- coding: utf-8 -*- -.. +--------------------------------------------------------------------------+ - | | - | Licensed under the Apache License, Version 2.0 (the "License"); | - | you may not use this file except in compliance with the License. | - | You may obtain a copy of the License at | - | | - | http://www.apache.org/licenses/LICENSE-2.0 | - | | - | Unless required by applicable law or agreed to in writing, software | - | distributed under the License is distributed on an "AS IS" BASIS, | - | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | - | See the License for the specific language governing permissions and | - | limitations under the License. | - | | - +--------------------------------------------------------------------------+ - - -******************************************************************************* -Inventory Processors Architecture -******************************************************************************* - -Overview -=============================================================================== - -Inventory processors extract and provide object inventories from documentation -sources, enabling discovery and search operations across different documentation -formats. These processors form the foundation of librovore's inventory-based -architecture, converting format-specific inventory data into universal -``InventoryObject`` instances. 
- -**Role in librovore architecture**: Inventory processors serve as the primary -interface between external documentation sources and librovore's search and -discovery operations. They enable format-agnostic inventory operations while +# Inventory Processors Architecture + +## Overview + +Inventory processors extract and provide object inventories from documentation +sources, enabling discovery and search operations across different documentation +formats. These processors form the foundation of librovore's inventory-based +architecture, converting format-specific inventory data into universal +`InventoryObject` instances. + +**Role in librovore architecture**: Inventory processors serve as the primary +interface between external documentation sources and librovore's search and +discovery operations. They enable format-agnostic inventory operations while maintaining complete source attribution and metadata preservation. -**Relationship to structure processors**: Inventory processors discover and -enumerate documentation objects, while structure processors extract content -from those objects. The two processor types work together through capability-based -filtering to ensure inventory objects are only sent to compatible structure +**Relationship to structure processors**: Inventory processors discover and +enumerate documentation objects, while structure processors extract content +from those objects. The two processor types work together through capability-based +filtering to ensure inventory objects are only sent to compatible structure processors. -**Universal object interface principles**: All inventory processors return -``InventoryObject`` instances regardless of source format, providing type safety, -consistent search operations, and multi-source aggregation capabilities. 
The -universal interface isolates format differences within processor implementations +**Universal object interface principles**: All inventory processors return +`InventoryObject` instances regardless of source format, providing type safety, +consistent search operations, and multi-source aggregation capabilities. The +universal interface isolates format differences within processor implementations while enabling uniform operations across all inventory types. -Architecture Patterns -=============================================================================== +## Architecture Patterns -Universal Inventory Object Interface -------------------------------------------------------------------------------- +### Universal Inventory Object Interface -**Decision**: All inventory processors return ``InventoryObject`` instances +**Decision**: All inventory processors return `InventoryObject` instances rather than format-specific dictionaries. -**Rationale**: Provides type safety, enables consistent search operations, -and supports multi-source aggregation capabilities. The universal interface +**Rationale**: Provides type safety, enables consistent search operations, +and supports multi-source aggregation capabilities. The universal interface isolates format differences within processor implementations. -**Impact**: Processors become responsible for complete source attribution -and metadata normalization, while search and ranking operations work +**Impact**: Processors become responsible for complete source attribution +and metadata normalization, while search and ranking operations work uniformly across all inventory types. -The universal interface follows a consistent dataflow pattern across all +The universal interface follows a consistent dataflow pattern across all processor types: -.. 
code-block:: text - - External Inventory Source - │ - ▼ - ┌─────────────────────┐ - │ Detection Phase │ ◄─── Confidence scoring - └─────────────────────┘ URL derivation - │ - ▼ - ┌─────────────────────┐ - │ Loading Phase │ ◄─── Raw data retrieval - └─────────────────────┘ Format validation - │ - ▼ - ┌─────────────────────┐ - │ Transformation │ ◄─── Format-specific parsing - │ Phase │ Universal object creation - └─────────────────────┘ - │ - ▼ - ┌─────────────────────┐ - │ Filtering Phase │ ◄─── Criteria application - └─────────────────────┘ Results ranking - │ - ▼ - Universal InventoryObject Collection - -Source Attribution Strategy -------------------------------------------------------------------------------- - -**Decision**: Every inventory object includes complete provenance information +```text +External Inventory Source + │ + ▼ +┌─────────────────────┐ +│ Detection Phase │ ◄─── Confidence scoring +└─────────────────────┘ URL derivation + │ + ▼ +┌─────────────────────┐ +│ Loading Phase │ ◄─── Raw data retrieval +└─────────────────────┘ Format validation + │ + ▼ +┌─────────────────────┐ +│ Transformation │ ◄─── Format-specific parsing +│ Phase │ Universal object creation +└─────────────────────┘ + │ + ▼ +┌─────────────────────┐ +│ Filtering Phase │ ◄─── Criteria application +└─────────────────────┘ Results ranking + │ + ▼ +Universal InventoryObject Collection +``` + +### Source Attribution Strategy + +**Decision**: Every inventory object includes complete provenance information including processor type, location URL, and format-specific metadata. -**Rationale**: Enables debugging, caching optimization, and future multi-source -operations. Complete attribution allows the system to understand object +**Rationale**: Enables debugging, caching optimization, and future multi-source +operations. Complete attribution allows the system to understand object origins without maintaining separate tracking mechanisms. 
-**Impact**: Processors must provide consistent metadata extraction and URL -normalization. Format-specific details are preserved in the ``specifics`` +**Impact**: Processors must provide consistent metadata extraction and URL +normalization. Format-specific details are preserved in the `specifics` container without affecting universal operations. Source attribution includes: @@ -115,18 +90,17 @@ Source attribution includes: - **Format metadata preservation**: Format-specific details maintained in structured containers - **Provenance tracking**: Full chain of custody from source to object creation -Confidence-Based Detection -------------------------------------------------------------------------------- +### Confidence-Based Detection -**Decision**: Processor detection uses numerical confidence scores rather than +**Decision**: Processor detection uses numerical confidence scores rather than boolean availability checks. -**Rationale**: Allows graceful handling of edge cases where multiple processors -might partially support a documentation source. Provides foundation for +**Rationale**: Allows graceful handling of edge cases where multiple processors +might partially support a documentation source. Provides foundation for processor precedence and quality assessment. -**Impact**: Detection algorithms must provide meaningful confidence -differentiation. The detection system can make informed choices when multiple +**Impact**: Detection algorithms must provide meaningful confidence +differentiation. The detection system can make informed choices when multiple processors are available for a source. 
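As a rough sketch of how confidence-based selection could work, the example below picks the best detection from a sequence. It is not the project's implementation: the simplified `Detection` dataclass and the `select_optimal` helper are hypothetical, and treating scores of exactly 0.5 as acceptable is an assumption about how `CONFIDENCE_THRESHOLD_MINIMUM` is applied.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Threshold value taken from the design text; inclusive comparison is assumed.
CONFIDENCE_THRESHOLD_MINIMUM = 0.5


@dataclass( frozen = True )
class Detection:
    ''' Simplified stand-in for the detection result object. '''

    processor_name: str
    confidence: float


def select_optimal(
    detections: Sequence[ Detection ]
) -> Optional[ Detection ]:
    ''' Returns highest-confidence detection at or above the threshold. '''
    eligible = [
        detection for detection in detections
        if detection.confidence >= CONFIDENCE_THRESHOLD_MINIMUM ]
    if not eligible:
        return None
    # max() returns the first maximal element, so when scores tie the
    # detection that appears earlier (registration order) is chosen.
    return max( eligible, key = lambda detection: detection.confidence )


detections = (
    Detection( 'sphinx', 0.9 ),
    Detection( 'mkdocs', 0.9 ),
    Detection( 'other', 0.3 ),
)
best = select_optimal( detections )
```

Keeping the input in registration order makes the tiebreak deterministic without an explicit secondary sort key.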
Confidence scoring methodology: @@ -135,53 +109,49 @@ Confidence scoring methodology: - **Low confidence (0.5+)**: Partial or potentially problematic inventories that may still be usable - **Below threshold**: Malformed, empty, or incompatible inventories rejected from consideration -Error Handling Patterns -------------------------------------------------------------------------------- +### Error Handling Patterns -**Consistent Error Categories**: All processors handle standard error types with +**Consistent Error Categories**: All processors handle standard error types with uniform reporting and graceful degradation: - **Accessibility Errors**: Network failures, missing resources, permission denials -- **Format Errors**: Invalid inventory structure, parsing failures, unsupported versions +- **Format Errors**: Invalid inventory structure, parsing failures, unsupported versions - **Configuration Errors**: Invalid filter parameters, unsupported operations - **System Errors**: Unexpected failures, resource exhaustion -**Quality Assurance Patterns**: Multi-stage validation from raw data through final -object creation ensures data integrity and provides detailed error context for +**Quality Assurance Patterns**: Multi-stage validation from raw data through final +object creation ensures data integrity and provides detailed error context for debugging inventory processing issues. -Performance Characteristics -------------------------------------------------------------------------------- +### Performance Characteristics -**Detection Caching**: Detection results are cached with appropriate TTL values -to avoid repeated expensive operations while maintaining data freshness for +**Detection Caching**: Detection results are cached with appropriate TTL values +to avoid repeated expensive operations while maintaining data freshness for dynamic documentation sources. 
-**Inventory Caching**: Raw inventory data caching at the processor level reduces -external service load while ensuring consistent object creation across multiple +**Inventory Caching**: Raw inventory data caching at the processor level reduces +external service load while ensuring consistent object creation across multiple filter operations. -**Object Caching**: Formatted inventory objects may be cached when processing +**Object Caching**: Formatted inventory objects may be cached when processing large inventories with repeated filter operations to improve response times. -**Scalability Considerations**: Processors implement streaming parsing for large -inventories, pagination support for query results, and memory-efficient object +**Scalability Considerations**: Processors implement streaming parsing for large +inventories, pagination support for query results, and memory-efficient object creation patterns to handle documentation sites of varying sizes. -Processor-Provided Formatters System -=============================================================================== +## Processor-Provided Formatters System -Self-Contained Object Approach -------------------------------------------------------------------------------- +### Self-Contained Object Approach -The processor-provided formatters design implements **self-contained inventory objects** -where each inventory processor creates objects that provide formatting intelligence -for their own ``specifics`` fields. This approach co-locates domain knowledge with +The processor-provided formatters design implements **self-contained inventory objects** +where each inventory processor creates objects that provide formatting intelligence +for their own `specifics` fields. This approach co-locates domain knowledge with the processors that create it, making the system truly extensible and maintainable. -**Core Principle**: Each processor knows best how to present its own data. 
Sphinx -processors understand ``domain``, ``role``, and ``priority`` semantics. MkDocs -processors understand ``content_preview`` and page-based organization. Other +**Core Principle**: Each processor knows best how to present its own data. Sphinx +processors understand `domain`, `role`, and `priority` semantics. MkDocs +processors understand `content_preview` and page-based organization. Other processors have their own field semantics that cannot be predicted centrally. **Architectural Foundation**: @@ -189,8 +159,7 @@ processors have their own field semantics that cannot be predicted centrally. - **Domain Knowledge Co-location**: Objects understand their own field semantics and presentation requirements - **Extensibility Without Core Changes**: New inventory processors create objects that inherently know how to render themselves -Domain Knowledge Co-location -------------------------------------------------------------------------------- +### Domain Knowledge Co-location Domain knowledge remains with the processors and objects that understand the data: @@ -199,115 +168,109 @@ Domain knowledge remains with the processors and objects that understand the dat - **Evolution Together**: Data structures and presentation logic evolve in tandem - **No External Dependencies**: Objects render themselves without requiring external formatting registries -Interface Specifications -------------------------------------------------------------------------------- +### Interface Specifications -The ``InventoryObject`` class provides self-formatting capabilities through methods -that each processor implements to render format-specific data. See the +The `InventoryObject` class provides self-formatting capabilities through methods +that each processor implements to render format-specific data. See the `results-module-design` document for complete interface specifications. -.. 
code-block:: python +```python +class InventoryObject( __.immut.DataclassObject ): + ''' Universal inventory object with self-formatting capabilities. ''' - class InventoryObject( __.immut.DataclassObject ): - ''' Universal inventory object with self-formatting capabilities. ''' - - def render_specifics_markdown( - self, /, *, - show_technical: __.typx.Annotated[ bool, __.ddoc.Doc( '...' ) ] = True - ) -> tuple[ str, ... ]: - ''' Renders specifics as Markdown lines for CLI display. ''' - - def render_specifics_json( self ) -> dict[ str, __.typx.Any ]: - ''' Renders specifics as JSON-serializable dictionary. ''' + def render_specifics_markdown( + self, /, *, + show_technical: __.typx.Annotated[ bool, __.ddoc.Doc( '...' ) ] = True + ) -> tuple[ str, ... ]: + ''' Renders specifics as Markdown lines for CLI display. ''' -CLI and JSON Integration Patterns -------------------------------------------------------------------------------- + def render_specifics_json( self ) -> dict[ str, __.typx.Any ]: + ''' Renders specifics as JSON-serializable dictionary. ''' +``` + +### CLI and JSON Integration Patterns The CLI layer integrates with self-formatting objects through standardized interfaces: -.. code-block:: python - - # CLI integration signatures - def _append_inventory_metadata( - lines: __.cabc.MutableSequence[ str ], - inventory_object: __.cabc.Mapping[ str, __.typx.Any ] - ) -> None: - ''' Appends inventory metadata using object self-formatting. ''' - - def _append_content_description( - lines: __.cabc.MutableSequence[ str ], - document: __.cabc.Mapping[ str, __.typx.Any ], - inventory_object: __.cabc.Mapping[ str, __.typx.Any ], - ) -> None: - ''' Appends content description with standard fallbacks. ''' +```python +# CLI integration signatures +def _append_inventory_metadata( + lines: __.cabc.MutableSequence[ str ], + inventory_object: __.cabc.Mapping[ str, __.typx.Any ] +) -> None: + ''' Appends inventory metadata using object self-formatting. 
''' + +def _append_content_description( + lines: __.cabc.MutableSequence[ str ], + document: __.cabc.Mapping[ str, __.typx.Any ], + inventory_object: __.cabc.Mapping[ str, __.typx.Any ], +) -> None: + ''' Appends content description with standard fallbacks. ''' +``` Serialization supports self-formatting objects: -.. code-block:: python +```python +# Serialization signatures +def serialize_for_json( obj: __.typx.Any ) -> __.typx.Any: + ''' Serialization supporting self-formatting objects. ''' - # Serialization signatures - def serialize_for_json( obj: __.typx.Any ) -> __.typx.Any: - ''' Serialization supporting self-formatting objects. ''' - - def _serialize_dataclass_for_json( obj: __.typx.Any ) -> dict[ str, __.typx.Any ]: - ''' Serializes dataclass objects using render_specifics_json when available. ''' +def _serialize_dataclass_for_json( obj: __.typx.Any ) -> dict[ str, __.typx.Any ]: + ''' Serializes dataclass objects using render_specifics_json when available. ''' +``` -Example Implementation Patterns -------------------------------------------------------------------------------- +### Example Implementation Patterns Each processor creates objects that understand format-specific rendering: -**Sphinx-specific rendering**: Sphinx inventory objects implement rendering that -shows role and domain information directly, uses Sphinx terminology that users -understand, and includes source attribution and priority when technical details +**Sphinx-specific rendering**: Sphinx inventory objects implement rendering that +shows role and domain information directly, uses Sphinx terminology that users +understand, and includes source attribution and priority when technical details are requested. 
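A minimal sketch of the self-formatting idea for the Sphinx case described above. It uses a plain frozen `dataclass` rather than the project's `__.immut.DataclassObject`, and the class name, the `name`/`uri` fields, and the exact Markdown wording are assumptions made for illustration, not the project's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class SphinxInventoryObjectSketch:
    """Hypothetical stand-in for a Sphinx-flavoured inventory object."""

    name: str
    uri: str
    # Format-specific details live in ``specifics``, as in the design text.
    specifics: Mapping[str, Any] = field(default_factory=dict)

    def render_specifics_markdown(
        self, *, show_technical: bool = True
    ) -> tuple[str, ...]:
        """Renders role and domain first; priority only on request."""
        lines = []
        role = self.specifics.get("role")
        domain = self.specifics.get("domain")
        if role:
            lines.append(f"- **Role:** {role}")
        if domain:
            lines.append(f"- **Domain:** {domain}")
        if show_technical and "priority" in self.specifics:
            lines.append(f"- **Priority:** {self.specifics['priority']}")
        return tuple(lines)

    def render_specifics_json(self) -> dict[str, Any]:
        """Returns a plain, JSON-serializable copy of the specifics."""
        return dict(self.specifics)


example = SphinxInventoryObjectSketch(
    name="pathlib.Path",
    uri="library/pathlib.html#pathlib.Path",
    specifics={"domain": "py", "role": "class", "priority": "1"},
)
```

Because the object carries its own rendering logic, a caller only needs to invoke `example.render_specifics_markdown()` and never has to know which keys a Sphinx inventory uses.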
-**MkDocs-specific rendering**: MkDocs inventory objects implement rendering that -emphasizes document/page nature, shows navigation context and page hierarchy +**MkDocs-specific rendering**: MkDocs inventory objects implement rendering that +emphasizes document/page nature, shows navigation context and page hierarchy when available, and consistently displays document type and page structure. -Detection and Discovery -=============================================================================== +## Detection and Discovery -Detection Interface Contracts -------------------------------------------------------------------------------- +### Detection Interface Contracts -All inventory processors implement standardized detection interfaces that provide +All inventory processors implement standardized detection interfaces that provide consistent behavior across different inventory formats: -.. code-block:: python - - class InventoryDetection( Detection ): - ''' Base class for inventory processor detection. ''' - - @__.typx.abc.abstractmethod - async def detect_async( - self, - location: str, /, *, - auxdata: __.state.Globals - ) -> DetectionResult: - ''' Detects inventory availability with confidence scoring. ''' - - @__.typx.abc.abstractmethod - def format_inventory_object( - self, - source_data: __.typx.Any, - location_url: str, /, *, - auxiliary_data: __.typx.Optional[ __.typx.Any ] = None, - ) -> InventoryObject: - ''' Formats source data into inventory object with self-formatting capabilities. ''' - -**Detection Contract**: Async detection returning confidence-scored results with +```python +class InventoryDetection( Detection ): + ''' Base class for inventory processor detection. ''' + + @__.typx.abc.abstractmethod + async def detect_async( + self, + location: str, /, *, + auxdata: __.state.Globals + ) -> DetectionResult: + ''' Detects inventory availability with confidence scoring. 
''' + + @__.typx.abc.abstractmethod + def format_inventory_object( + self, + source_data: __.typx.Any, + location_url: str, /, *, + auxiliary_data: __.typx.Optional[ __.typx.Any ] = None, + ) -> InventoryObject: + ''' Formats source data into inventory object with self-formatting capabilities. ''' +``` + +**Detection Contract**: Async detection returning confidence-scored results with optional caching of preliminary inventory data for performance optimization. -**Object Creation Contract**: Unified object creation interface that converts -format-specific source data into universal inventory objects with complete +**Object Creation Contract**: Unified object creation interface that converts +format-specific source data into universal inventory objects with complete attribution and self-formatting capabilities. -Confidence Scoring Methodology -------------------------------------------------------------------------------- +### Confidence Scoring Methodology -Confidence scoring provides consistent assessment of inventory source quality +Confidence scoring provides consistent assessment of inventory source quality and processor compatibility: **Scoring Factors**: @@ -316,48 +279,45 @@ and processor compatibility: - **Format Indicators**: Clear markers indicating the expected inventory format - **Accessibility**: Reliable access to inventory data without errors or restrictions -**Consistency Requirements**: All processors use equivalent confidence scales and -assessment criteria to ensure reliable processor selection across different +**Consistency Requirements**: All processors use equivalent confidence scales and +assessment criteria to ensure reliable processor selection across different inventory formats. -**Calibration Standards**: Regular validation against known good and problematic +**Calibration Standards**: Regular validation against known good and problematic inventory sources ensures confidence scores remain meaningful and comparable. 
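To illustrate how a shared confidence scale supports processor selection, the following sketch filters candidates by a minimum threshold and picks the highest score. The `0.5` cutoff mirrors the "Low confidence (0.5+)" band mentioned earlier; the `DetectionSketch` type, the function name, and the tie-breaking by input order are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed cutoff, taken from the "Low confidence (0.5+)" band.
CONFIDENCE_THRESHOLD = 0.5


@dataclass(frozen=True)
class DetectionSketch:
    """Hypothetical detection result: processor name plus confidence."""

    processor_name: str
    confidence: float


def select_detection(
    candidates: Sequence[DetectionSketch],
) -> Optional[DetectionSketch]:
    """Returns the best candidate at or above the threshold, if any."""
    eligible = [c for c in candidates if c.confidence >= CONFIDENCE_THRESHOLD]
    if not eligible:
        return None
    # max() returns the first maximal element, so earlier candidates
    # win ties and the outcome stays deterministic.
    return max(eligible, key=lambda c: c.confidence)
```

Selection like this is only meaningful if every processor scores on the same scale, which is the point of the consistency and calibration requirements.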
-Processor Selection Patterns -------------------------------------------------------------------------------- +### Processor Selection Patterns -The detection system provides optimal processor selection based on confidence +The detection system provides optimal processor selection based on confidence scores and capability matching: **Selection Algorithm**: 1. **Confidence Ranking**: Primary selection based on detection confidence scores -2. **Capability Matching**: Secondary filtering based on required operation capabilities +2. **Capability Matching**: Secondary filtering based on required operation capabilities 3. **Performance Characteristics**: Consideration of processor performance profiles 4. **Precedence Rules**: Explicit precedence handling for overlapping processor capabilities -**Multi-Processor Scenarios**: When multiple processors detect inventory sources, -the system applies consistent selection logic while maintaining user experience +**Multi-Processor Scenarios**: When multiple processors detect inventory sources, +the system applies consistent selection logic while maintaining user experience predictability. -Cache Integration Strategy -------------------------------------------------------------------------------- +### Cache Integration Strategy Caching strategy optimizes performance while maintaining data freshness: -**Detection Result Caching**: Confidence-scored detection results cached with +**Detection Result Caching**: Confidence-scored detection results cached with TTL management to avoid repeated expensive detection operations. -**Preliminary Data Caching**: Detection processes may cache preliminary inventory +**Preliminary Data Caching**: Detection processes may cache preliminary inventory data when it can be reused for subsequent processing operations. 
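As a rough sketch of TTL-managed detection result caching, the class below stores each result with an expiry timestamp and drops it on the first read after expiry. The class name, the keying by location string, and the injectable clock are assumptions for illustration; the project's actual cache may differ.

```python
from __future__ import annotations

import time
from typing import Any, Callable, Dict, Optional, Tuple


class TtlCacheSketch:
    """Hypothetical TTL cache keyed by documentation source location."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, location: str, result: Any) -> None:
        """Stores a result together with its expiry time."""
        self._entries[location] = (self._clock() + self._ttl, result)

    def get(self, location: str) -> Optional[Any]:
        """Returns the cached result, or None if missing or expired."""
        entry = self._entries.get(location)
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            # Expired entries are removed lazily on access.
            del self._entries[location]
            return None
        return result
```

Using a monotonic clock by default avoids spurious expiry or resurrection of entries when the wall clock is adjusted.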
-**Cache Invalidation**: TTL expiration and explicit invalidation triggers ensure +**Cache Invalidation**: TTL expiration and explicit invalidation triggers ensure cached data remains current with source changes. -**Memory Management**: Cache size limits and LRU eviction policies prevent +**Memory Management**: Cache size limits and LRU eviction policies prevent memory exhaustion during extended operation periods. -Error Handling for Detection Failures -------------------------------------------------------------------------------- +### Error Handling for Detection Failures Robust error handling ensures graceful degradation when detection fails: @@ -367,328 +327,308 @@ Robust error handling ensures graceful degradation when detection fails: - **Format Errors**: Unexpected inventory structure, parsing failures, version incompatibilities - **Resource Errors**: Memory exhaustion, disk space issues, system resource limitations -**Recovery Strategies**: Automatic retry with exponential backoff, graceful degradation +**Recovery Strategies**: Automatic retry with exponential backoff, graceful degradation to alternative processors, and comprehensive error logging for debugging support. -Base Interfaces and Protocols -=============================================================================== +## Base Interfaces and Protocols -InventoryDetection Abstract Base Class -------------------------------------------------------------------------------- +### InventoryDetection Abstract Base Class -The ``InventoryDetection`` abstract base class provides the foundation for all +The `InventoryDetection` abstract base class provides the foundation for all inventory processor implementations: -.. code-block:: python - - class InventoryDetection( Detection ): - ''' Base class providing unified inventory processor interface. 
''' - - @property - @__.typx.abc.abstractmethod - def processor_class( self ) -> type[ InventoryProcessor ]: - ''' Returns the processor class for this detection result. ''' - - @property - @__.typx.abc.abstractmethod - def capabilities( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: - ''' Returns processor capability information. ''' - -Universal Interface Contracts -------------------------------------------------------------------------------- - -All inventory processors implement identical interface contracts to ensure +```python +class InventoryDetection( Detection ): + ''' Base class providing unified inventory processor interface. ''' + + @property + @__.typx.abc.abstractmethod + def processor_class( self ) -> type[ InventoryProcessor ]: + ''' Returns the processor class for this detection result. ''' + + @property + @__.typx.abc.abstractmethod + def capabilities( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Returns processor capability information. ''' +``` + +### Universal Interface Contracts + +All inventory processors implement identical interface contracts to ensure consistent behavior and interoperability: -**Detection Interface**: Standardized async detection with confidence scoring, +**Detection Interface**: Standardized async detection with confidence scoring, capability advertisement, and optional preliminary data caching. -**Processing Interface**: Consistent inventory acquisition, query operations, +**Processing Interface**: Consistent inventory acquisition, query operations, and filtering capabilities across all processor implementations. -**Object Creation Interface**: Unified object formatting method signatures that +**Object Creation Interface**: Unified object formatting method signatures that create self-formatting inventory objects with complete source attribution. 
-Core Processing Methods -------------------------------------------------------------------------------- +### Core Processing Methods All inventory processors implement standardized processing methods: -.. code-block:: python - - class InventoryProcessor( __.abc.ABC ): - ''' Base class for inventory processors. ''' - - @__.typx.abc.abstractmethod - async def query_inventory( - self, - term: __.Absential[ str ] = __.absent, *, - filters: __.cabc.Mapping[ str, __.typx.Any ] = __.immut.Dictionary( ), - details: __.InventoryQueryDetails = __.InventoryQueryDetails.Documentation, - results_max: int = 1000, - ) -> tuple[ InventoryObject, ... ]: - ''' Returns inventory objects matching search and filter criteria. - - When term is absent and filters are empty or trivial, - returns complete inventory (equivalent to acquire_inventory). - When term is present or filters contain constraints, - returns filtered subset limited by results_max. - ''' +```python +class InventoryProcessor( __.abc.ABC ): + ''' Base class for inventory processors. ''' + + @__.typx.abc.abstractmethod + async def query_inventory( + self, + term: __.Absential[ str ] = __.absent, *, + filters: __.cabc.Mapping[ str, __.typx.Any ] = __.immut.Dictionary( ), + details: __.InventoryQueryDetails = __.InventoryQueryDetails.Documentation, + results_max: int = 1000, + ) -> tuple[ InventoryObject, ... ]: + ''' Returns inventory objects matching search and filter criteria. + + When term is absent and filters are empty or trivial, + returns complete inventory (equivalent to acquire_inventory). + When term is present or filters contain constraints, + returns filtered subset limited by results_max. 
+ ''' +``` **Contract Specifications**: -- ``query_inventory`` serves dual purpose: complete inventory retrieval and filtering -- Absent term with empty/trivial filters returns entire inventory +- `query_inventory` serves dual purpose: complete inventory retrieval and filtering +- Absent term with empty/trivial filters returns entire inventory - Present term or non-trivial filters return matching subset limited by results_max - Search and filtering occur at processor level using format-specific knowledge - Results include both structural filtering and name-based search capabilities -format_inventory_object Unified Signature -------------------------------------------------------------------------------- +### format_inventory_object Unified Signature -The unified ``format_inventory_object`` signature ensures consistent object creation +The unified `format_inventory_object` signature ensures consistent object creation across all processor implementations: -.. code-block:: python - - @__.typx.abc.abstractmethod - def format_inventory_object( - self, - source_data: __.typx.Any, - location_url: str, /, *, - auxiliary_data: __.typx.Optional[ __.typx.Any ] = None, - ) -> InventoryObject: - ''' Formats source data into inventory object with self-formatting capabilities. - - Args: - source_data: Format-specific source data (Sphinx object, MkDocs document, etc.) - location_url: Complete URL to inventory location for attribution - auxiliary_data: Additional context data (inventory metadata, etc.) - ''' - -**Parameter Standardization**: Consistent parameter names, types, and semantics +```python +@__.typx.abc.abstractmethod +def format_inventory_object( + self, + source_data: __.typx.Any, + location_url: str, /, *, + auxiliary_data: __.typx.Optional[ __.typx.Any ] = None, +) -> InventoryObject: + ''' Formats source data into inventory object with self-formatting capabilities. + + Args: + source_data: Format-specific source data (Sphinx object, MkDocs document, etc.) 
+ location_url: Complete URL to inventory location for attribution + auxiliary_data: Additional context data (inventory metadata, etc.) + ''' +``` + +**Parameter Standardization**: Consistent parameter names, types, and semantics across all processor implementations eliminate interface confusion. -**Type Safety**: Strong typing ensures compile-time validation of processor +**Type Safety**: Strong typing ensures compile-time validation of processor implementations and caller code. -**Extensibility**: Optional auxiliary data parameter provides extension point +**Extensibility**: Optional auxiliary data parameter provides extension point for processor-specific enhancements without breaking interface compatibility. -Capability Advertisement Patterns -------------------------------------------------------------------------------- +### Capability Advertisement Patterns Processors advertise their capabilities through standardized metadata: -.. code-block:: python +```python +class ProcessorCapabilities( __.immut.DataclassObject ): + ''' Processor capability advertisement. ''' - class ProcessorCapabilities( __.immut.DataclassObject ): - ''' Processor capability advertisement. ''' - - supported_inventory_types: frozenset[ str ] - supported_filters: frozenset[ str ] - performance_characteristics: __.immut.Dictionary[ str, __.typx.Any ] - operational_constraints: __.immut.Dictionary[ str, __.typx.Any ] + supported_inventory_types: frozenset[ str ] + supported_filters: frozenset[ str ] + performance_characteristics: __.immut.Dictionary[ str, __.typx.Any ] + operational_constraints: __.immut.Dictionary[ str, __.typx.Any ] +``` -**Capability Discovery**: Dynamic capability discovery enables system adaptation +**Capability Discovery**: Dynamic capability discovery enables system adaptation to available processors and their operational characteristics. 
-**Filter Advertisement**: Processors advertise supported filter types, enabling +**Filter Advertisement**: Processors advertise supported filter types, enabling validation of user requests before processing begins. -**Performance Profiles**: Capability information includes performance characteristics +**Performance Profiles**: Capability information includes performance characteristics for operation planning and resource allocation. -Validation and Type Safety -------------------------------------------------------------------------------- +### Validation and Type Safety Strong validation ensures system reliability and provides clear error feedback: -**Interface Validation**: Compile-time and runtime validation of processor +**Interface Validation**: Compile-time and runtime validation of processor implementations against abstract base class contracts. -**Data Validation**: Multi-stage validation from raw inventory data through +**Data Validation**: Multi-stage validation from raw inventory data through final object creation with detailed error context. -**Type Safety**: Comprehensive type annotations enable static analysis and +**Type Safety**: Comprehensive type annotations enable static analysis and provide clear interface contracts for processor implementers. -**Error Propagation**: Structured error handling with detailed context information +**Error Propagation**: Structured error handling with detailed context information supports debugging and system monitoring. 
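One way to picture filter validation against advertised capabilities, combined with structured error context, is sketched below. Only the `supported_filters` field name comes from the design text; `CapabilitiesSketch`, `UnsupportedFilterError`, and `validate_filters` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterable, Mapping


@dataclass(frozen=True)
class CapabilitiesSketch:
    """Hypothetical, reduced form of a capability advertisement."""

    supported_filters: frozenset


class UnsupportedFilterError(ValueError):
    """Carries both the offending and the supported filter names."""

    def __init__(
        self, unsupported: Iterable[str], supported: Iterable[str]
    ) -> None:
        self.unsupported = tuple(sorted(unsupported))
        self.supported = tuple(sorted(supported))
        super().__init__(
            "Unsupported filters: " + ", ".join(self.unsupported)
        )


def validate_filters(
    filters: Mapping[str, Any], capabilities: CapabilitiesSketch
) -> None:
    """Raises before any processing if a filter is not advertised."""
    unsupported = set(filters) - capabilities.supported_filters
    if unsupported:
        raise UnsupportedFilterError(
            unsupported, capabilities.supported_filters
        )
```

Keeping the rejected and accepted names on the exception lets a CLI or API layer produce a helpful message without re-querying the processor.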
-Implementation Outline -=============================================================================== +## Implementation Outline -Processor-Specific Data Source Handling Patterns -------------------------------------------------------------------------------- +### Processor-Specific Data Source Handling Patterns -Inventory processors handle diverse data source formats through specialized +Inventory processors handle diverse data source formats through specialized parsing and validation strategies: -**Data Source Diversity**: Processors accommodate various inventory formats including +**Data Source Diversity**: Processors accommodate various inventory formats including binary files, JSON documents, XML structures, and custom text formats. -**Parsing Strategies**: Format-appropriate parsing techniques including streaming -parsers for large files, validation schemas for structured data, and error +**Parsing Strategies**: Format-appropriate parsing techniques including streaming +parsers for large files, validation schemas for structured data, and error recovery mechanisms for malformed inputs. -**Performance Optimization**: Memory-efficient processing techniques including -lazy loading, incremental parsing, and selective data extraction based on +**Performance Optimization**: Memory-efficient processing techniques including +lazy loading, incremental parsing, and selective data extraction based on query requirements. 
-Format-Specific Object Creation Strategies -------------------------------------------------------------------------------- +### Format-Specific Object Creation Strategies -Object creation strategies vary by inventory format while maintaining universal +Object creation strategies vary by inventory format while maintaining universal output consistency: -**Metadata Normalization**: Translation of format-specific metadata into universal +**Metadata Normalization**: Translation of format-specific metadata into universal object fields while preserving format-specific details in structured containers. -**Attribution Strategies**: Consistent source attribution patterns that capture -complete provenance information including processor type, source location, and +**Attribution Strategies**: Consistent source attribution patterns that capture +complete provenance information including processor type, source location, and format-specific identifiers. -**Self-Formatting Integration**: Object creation includes formatting method -implementation that understands format-specific semantics and presentation +**Self-Formatting Integration**: Object creation includes formatting method +implementation that understands format-specific semantics and presentation requirements. -Detection Methodology and Validation Approaches -------------------------------------------------------------------------------- +### Detection Methodology and Validation Approaches -Detection implementations use format-appropriate validation and confidence +Detection implementations use format-appropriate validation and confidence assessment techniques: -**Probe Strategies**: Sequential or parallel probing of standard and alternative +**Probe Strategies**: Sequential or parallel probing of standard and alternative inventory locations using format-specific URL patterns. 
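A minimal sketch of parallel probing, assuming the existence check is supplied by the caller as an async callable. The function name, the `exists` parameter, and the candidate paths are illustrative assumptions; no real HTTP client is used here.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Optional, Sequence


async def probe_first_available(
    base_url: str,
    candidate_paths: Sequence[str],
    exists: Callable[[str], Awaitable[bool]],
) -> Optional[str]:
    """Probes all candidates concurrently; returns the first hit in order."""
    urls = [f"{base_url.rstrip('/')}/{path}" for path in candidate_paths]
    # gather() preserves input order, so candidate priority is kept
    # even though the probes run concurrently.
    results = await asyncio.gather(*(exists(url) for url in urls))
    for url, found in zip(urls, results):
        if found:
            return url
    return None


async def _fake_exists(url: str) -> bool:
    """Stand-in for a network check, used only for this demonstration."""
    return url.endswith("search/search_index.json")


found = asyncio.run(
    probe_first_available(
        "https://example.invalid/docs/",
        ("objects.inv", "search/search_index.json"),
        _fake_exists,
    )
)
```

Listing the standard location first and alternatives afterwards keeps the standard location preferred whenever several probes succeed.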
-**Validation Criteria**: Format-appropriate structural validation including +**Validation Criteria**: Format-appropriate structural validation including schema compliance, content quality assessment, and compatibility verification. -**Confidence Calibration**: Consistent confidence scoring based on validation +**Confidence Calibration**: Consistent confidence scoring based on validation results, content quality metrics, and format-specific quality indicators. -Content Integration and Search Patterns -------------------------------------------------------------------------------- +### Content Integration and Search Patterns Integration with search and content systems through standardized interfaces: -**Search Integration**: Universal object interfaces enable format-agnostic -search operations while preserving format-specific search capabilities through +**Search Integration**: Universal object interfaces enable format-agnostic +search operations while preserving format-specific search capabilities through metadata containers. -**Content Coordination**: Capability-based filtering ensures inventory objects +**Content Coordination**: Capability-based filtering ensures inventory objects are only processed by compatible structure processors for content extraction. -**Multi-Source Coordination**: Source attribution enables tracking and coordination +**Multi-Source Coordination**: Source attribution enables tracking and coordination across multiple inventory sources for comprehensive documentation coverage. -Performance Optimization Strategies -------------------------------------------------------------------------------- +### Performance Optimization Strategies Performance optimization approaches tailored to inventory processing characteristics: -**Caching Strategies**: Multi-level caching including detection results, raw +**Caching Strategies**: Multi-level caching including detection results, raw inventory data, and formatted objects with appropriate TTL management. 
-**Lazy Loading**: Deferred processing of inventory data until required by +**Lazy Loading**: Deferred processing of inventory data until required by specific operations to minimize initial load times. -**Batch Processing**: Efficient batch operations for large inventory processing +**Batch Processing**: Efficient batch operations for large inventory processing tasks with memory management and progress tracking. -Scalability and Extension Considerations -------------------------------------------------------------------------------- +### Scalability and Extension Considerations Design patterns support system scalability and future enhancement: -**Memory Management**: Bounded memory usage through streaming processing, +**Memory Management**: Bounded memory usage through streaming processing, pagination, and selective data loading based on operational requirements. -**Processor Extensibility**: Clear extension points for new inventory formats +**Processor Extensibility**: Clear extension points for new inventory formats through abstract base class implementation and capability advertisement. -**Configuration Management**: Flexible configuration systems supporting +**Configuration Management**: Flexible configuration systems supporting processor-specific parameters and operational tuning. 
-Example Implementation Skeletons -------------------------------------------------------------------------------- +### Example Implementation Skeletons **Sphinx Processor Outline**: -- ``objects.inv`` binary file handling with decompression and parsing +- `objects.inv` binary file handling with decompression and parsing - Domain/role semantic understanding for object categorization - Priority-based object ranking and presentation - Cross-reference resolution for documentation linking - Theme-independent inventory processing **MkDocs Processor Outline**: -- ``search_index.json`` file handling with page-level extraction +- `search_index.json` file handling with page-level extraction - Content preview generation from embedded text - Navigation context extraction from page hierarchy - Alternative format support for theme-specific variations - Hybrid content strategy coordination -Extension Points and Future Processors -=============================================================================== +## Extension Points and Future Processors -Plugin Architecture Patterns -------------------------------------------------------------------------------- +### Plugin Architecture Patterns -Consistent processor interfaces enable third-party inventory processors through +Consistent processor interfaces enable third-party inventory processors through well-defined extension patterns: -**Interface Compliance**: New processors implement standard abstract base classes +**Interface Compliance**: New processors implement standard abstract base classes with consistent method signatures and behavioral contracts. -**Capability Integration**: Processor capability advertisement enables system +**Capability Integration**: Processor capability advertisement enables system integration without core code modifications. 
-**Registration Mechanisms**: Dynamic processor discovery and registration through +**Registration Mechanisms**: Dynamic processor discovery and registration through plugin management systems or configuration-based registration. -Custom Processor Development -------------------------------------------------------------------------------- +### Custom Processor Development Clear development patterns support custom inventory processor creation: -**Development Guidelines**: Comprehensive documentation of interface requirements, +**Development Guidelines**: Comprehensive documentation of interface requirements, performance expectations, and integration patterns. -**Testing Frameworks**: Standardized testing patterns and validation suites +**Testing Frameworks**: Standardized testing patterns and validation suites for processor development and verification. -**Reference Implementations**: Well-documented reference processors demonstrate +**Reference Implementations**: Well-documented reference processors demonstrate implementation patterns and best practices. -Capability Evolution Support -------------------------------------------------------------------------------- +### Capability Evolution Support System design accommodates processor capability enhancement over time: -**Backward Compatibility**: Interface evolution strategies that maintain compatibility +**Backward Compatibility**: Interface evolution strategies that maintain compatibility with existing processors while enabling enhanced functionality. -**Capability Versioning**: Version management for processor capabilities enabling +**Capability Versioning**: Version management for processor capabilities enabling gradual system enhancement and feature adoption. -**Feature Negotiation**: Dynamic feature negotiation between system components +**Feature Negotiation**: Dynamic feature negotiation between system components based on advertised processor capabilities. 
-Performance Optimization Strategies -------------------------------------------------------------------------------- +### Performance Optimization Strategies Extension points support continued performance optimization: -**Custom Caching**: Processor-specific caching strategies optimized for particular +**Custom Caching**: Processor-specific caching strategies optimized for particular inventory formats and access patterns. -**Parallel Processing**: Opportunities for parallel inventory processing with +**Parallel Processing**: Opportunities for parallel inventory processing with appropriate synchronization and coordination mechanisms. -**Resource Management**: Adaptive resource allocation based on processor +**Resource Management**: Adaptive resource allocation based on processor characteristics and operational requirements. -This inventory processor architecture provides a comprehensive foundation for -format-agnostic inventory operations while maintaining clean separation between -universal interfaces and format-specific implementations. The design supports -extensibility, performance optimization, and consistent user experience across -diverse documentation source formats. \ No newline at end of file +This inventory processor architecture provides a comprehensive foundation for +format-agnostic inventory operations while maintaining clean separation between +universal interfaces and format-specific implementations. The design supports +extensibility, performance optimization, and consistent user experience across +diverse documentation source formats. 
diff --git a/documentation/architecture/openspec/specs/inventory-processing/spec.md b/documentation/architecture/openspec/specs/inventory-processing/spec.md new file mode 100644 index 0000000..3f14f96 --- /dev/null +++ b/documentation/architecture/openspec/specs/inventory-processing/spec.md @@ -0,0 +1,55 @@ +# Inventory Processing + +## Purpose +The Inventory Processing capability extracts and provides object inventories from documentation sources, enabling discovery and search operations across different documentation formats (Sphinx, MkDocs, Pydoctor, Rustdoc). + +## Requirements + +### Requirement: Sphinx Support +The system SHALL provide full support for Sphinx documentation sites. + +Priority: Critical + +#### Scenario: Processing Sphinx Site +- **WHEN** a Sphinx site URL is provided +- **THEN** the system parses the `objects.inv` file +- **AND** correctly identifies cross-references and content structure + +### Requirement: MkDocs Support +The system SHALL provide full support for MkDocs sites, specifically with `mkdocstrings`. + +Priority: Critical + +#### Scenario: Processing MkDocs Site +- **WHEN** an MkDocs site URL is provided +- **THEN** the system parses the inventory +- **AND** extracts content from Material for MkDocs theme structures + +### Requirement: Pydoctor Support +The system SHALL provide full support for Pydoctor documentation sites. + +Priority: Critical + +#### Scenario: Processing Pydoctor Site +- **WHEN** a Pydoctor site URL is provided +- **THEN** the system parses the inventory +- **AND** extracts content from Pydoctor-generated HTML + +### Requirement: Rustdoc Support +The system SHALL provide full support for Rustdoc documentation sites. 
+ +Priority: Critical + +#### Scenario: Processing Rustdoc Site +- **WHEN** a Rustdoc site URL is provided +- **THEN** the system parses the `search-index.js` +- **AND** extracts content from Rustdoc-generated HTML + +### Requirement: Extensibility +The system SHALL provide a plugin architecture for additional processors. + +Priority: Low + +#### Scenario: Adding a Plugin +- **WHEN** a developer implements the processor interface +- **THEN** the system discovers and uses the new processor diff --git a/documentation/architecture/openspec/specs/mcp-server/spec.md b/documentation/architecture/openspec/specs/mcp-server/spec.md new file mode 100644 index 0000000..6681e76 --- /dev/null +++ b/documentation/architecture/openspec/specs/mcp-server/spec.md @@ -0,0 +1,55 @@ +# MCP Server + +## Purpose +The MCP (Model Context Protocol) Server interface enables AI agents to programmatically discover, search, and extract technical documentation. It acts as the primary bridge between AI systems and the documentation engine. + +## Requirements + +### Requirement: MCP Server Implementation +The system SHALL implement a complete MCP server with FastMCP framework. + +Priority: Critical + +#### Scenario: AI Agent Connection +- **WHEN** an AI agent connects to the MCP server +- **THEN** the server accepts the connection +- **AND** exposes available tools (query_inventory, query_content, summarize_inventory) + +### Requirement: JSON Schema Generation +The server SHALL generate JSON schemas for all tool parameters. + +Priority: Critical + +#### Scenario: Tool Discovery +- **WHEN** an AI agent requests the list of available tools +- **THEN** the server returns the tool list +- **AND** includes valid JSON schemas for all parameters + +### Requirement: Query Inventory Tool +The server SHALL implement a `query_inventory` tool for searching documentation objects. 
+ +Priority: Critical + +#### Scenario: Searching Inventory +- **WHEN** the agent calls `query_inventory` with a search term +- **THEN** the server searches the loaded inventories +- **AND** returns a list of matching objects + +### Requirement: Query Content Tool +The server SHALL implement a `query_content` tool for retrieving full text. + +Priority: Critical + +#### Scenario: Fetching Content +- **WHEN** the agent calls `query_content` with a URL or object ID +- **THEN** the server retrieves the content +- **AND** returns it in clean Markdown format + +### Requirement: Summarize Inventory Tool +The server SHALL implement a `summarize_inventory` tool for high-level overview. + +Priority: Critical + +#### Scenario: Inventory Summary +- **WHEN** the agent calls `summarize_inventory` for a site +- **THEN** the server provides statistics and top-level structure of the documentation diff --git a/documentation/architecture/openspec/specs/processor-detection/design.md b/documentation/architecture/openspec/specs/processor-detection/design.md new file mode 100644 index 0000000..8ba8cce --- /dev/null +++ b/documentation/architecture/openspec/specs/processor-detection/design.md @@ -0,0 +1,399 @@ +# Processor Detection System Design + +## Overview + +The processor detection system provides automated selection of appropriate +inventory and structure processors for documentation sources. The design +implements confidence-based scoring with TTL-based caching to balance +performance with accuracy and data freshness. + +This document focuses on the orchestration layer that coordinates processor +selection across processor genera (inventory vs. structure processors), while +detailed processor-specific detection patterns are covered in the respective +processor architecture documents. 
+ +## Architecture + +### Design Principles + +**Genus-Based Separation** + Inventory processors and structure processors operate in separate detection + pipelines, allowing independent evolution and different selection criteria. + Each genus maintains its own cache and processor registry. + +**Confidence-Based Selection** + Processors return numerical confidence scores (0.0-1.0). Only processors + exceeding `CONFIDENCE_THRESHOLD_MINIMUM` (0.5) are considered, with highest + confidence and registration order as stable tiebreaker. + +**Immutable Data Structures** + All detection results use immutable containers (`__.immut.Dictionary`, + `tuple`) following project practices for thread safety and predictable + behavior. + +**Wide Parameter, Narrow Return Pattern** + Public functions accept abstract base classes for parameters and return + specific concrete types, following established project practices. + +### Component Structure + +**Detection Orchestration** (`detection.py`) + Central coordination of processor selection across inventory and structure + genera. Provides both high-level convenience functions and low-level + extensible functions for custom processor mappings. + +**Cache Management** + TTL-based caching system with lazy expiration cleanup. Separate cache + instances per processor genus enable different configuration and evolution + patterns. + +**Processor Integration** + Abstract base classes in `processors.py` define detection contracts. + Format-specific implementations in `inventories/` and `structures/` + subpackages provide concrete detection logic. + +## Processor Genera System + +### ProcessorGenera Enumeration + +The system defines distinct processor genera that operate independently: + +```python +class ProcessorGenera( __.typx.Enum ): + ''' Enumeration of processor genera for detection orchestration. 
''' + + Inventory = 'inventory' # Inventory object extraction processors + Structure = 'structure' # Content extraction processors +``` + +**Inventory Processors**: Extract object inventories from documentation sources, +providing discovery and search capabilities across different documentation formats. +Detailed architecture covered in `inventory-design.md`. + +**Structure Processors**: Extract content from documentation pages, transforming +HTML into structured documents for search and analysis. Detailed architecture +covered in `structure-design.md`. + +### Genus-Specific Detection Pipelines + +Each processor genus maintains independent detection infrastructure: + +**Separate Cache Instances**: Each genus has dedicated cache management with +genus-appropriate TTL values and eviction strategies. + +**Independent Processor Registries**: Processor registration and discovery +operates independently per genus, enabling different processor lifecycle management. + +**Genus-Specific Selection Logic**: Processor selection algorithms can differ +between genera based on their operational characteristics and requirements. + +**Separate Error Handling**: Each genus implements error handling appropriate +to its operational context and failure modes. 
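
The per-genus separation described above can be illustrated with a minimal sketch. It assumes each genus simply owns a registry and a cache of its own; `GenusPipeline`, `PIPELINES`, `register_processor`, the `'sphinx'` name, and the TTL values are hypothetical and are not taken from the project's `detection.py`.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProcessorGenera(Enum):
    ''' Mirrors the enumeration shown in this design (plain Enum here). '''
    Inventory = 'inventory'
    Structure = 'structure'


@dataclass
class GenusPipeline:
    ''' Hypothetical container pairing one genus with its own state. '''
    ttl: int
    processors: dict[str, object] = field(default_factory=dict)
    cache: dict[str, object] = field(default_factory=dict)


# One independent pipeline per genus; the TTL values are illustrative only.
PIPELINES = {
    ProcessorGenera.Inventory: GenusPipeline(ttl=3600),
    ProcessorGenera.Structure: GenusPipeline(ttl=1800),
}


def register_processor(genus: ProcessorGenera, name: str, processor: object) -> None:
    ''' Registers a processor in the registry of its own genus only. '''
    PIPELINES[genus].processors[name] = processor


# Registering an inventory processor leaves the structure registry untouched.
register_processor(ProcessorGenera.Inventory, 'sphinx', object())
```

Keeping the registry and cache together per genus means a change to one genus (for example a shorter TTL) cannot affect lookups in the other.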
+ +## Interface Specifications + +### Primary Detection Functions + +```python +async def detect( + auxdata: _state.Globals, + source: str, /, + genus: _interfaces.ProcessorGenera, *, + processor_name: __.Absential[ str ] = __.absent, +) -> _processors.Detection + +async def detect_inventory( + auxdata: _state.Globals, + source: str, /, *, + processor_name: __.Absential[ str ] = __.absent, +) -> _processors.InventoryDetection + +async def detect_structure( + auxdata: _state.Globals, + source: str, /, *, + processor_name: __.Absential[ str ] = __.absent, +) -> _processors.StructureDetection +``` + +**Contract:** +- Returns highest-confidence processor detection above threshold +- Raises `ProcessorInavailability` if no suitable processor found +- Bypasses detection when specific `processor_name` provided +- Maintains detection results in genus-specific cache + +### Cache Access Functions + +```python +async def access_detections( + auxdata: _state.Globals, + source: str, /, *, + genus: _interfaces.ProcessorGenera +) -> tuple[ + _processors.DetectionsByProcessor, + __.Absential[ _processors.Detection ] +] + +async def access_detections_ll( + auxdata: _state.Globals, + source: str, /, *, + cache: DetectionsCache, + processors: __.cabc.Mapping[ str, _processors.Processor ], +) -> tuple[ + _processors.DetectionsByProcessor, + __.Absential[ _processors.Detection ] +] +``` + +**Contract:** +- Returns all processor detections plus optimal selection +- Executes fresh detection if cache miss or expiration +- Low-level variant accepts arbitrary processor mapping for extensibility +- Never raises exceptions; returns `__.absent` for missing optimal detection + +## Data Structures + +### Detection Cache Design + +```python +class DetectionsCacheEntry( __.immut.DataclassObject ): + detections: __.cabc.Mapping[ str, _processors.Detection ] + timestamp: float + ttl: int + + @property + def detection_optimal( self ) -> __.Absential[ _processors.Detection ] + + def invalid( self, 
current_time: float ) -> bool + +class DetectionsCache( __.immut.DataclassObject ): + ttl: int = 3600 + _entries: dict[ str, DetectionsCacheEntry ] = __.dcls.field( + default_factory = dict[ str, DetectionsCacheEntry ] ) + + def access_detections( + self, source: str + ) -> __.Absential[ _processors.DetectionsByProcessor ] + + def access_detection_optimal( + self, source: str + ) -> __.Absential[ _processors.Detection ] + + def add_entry( + self, source: str, detections: _processors.DetectionsByProcessor + ) -> __.typx.Self +``` + +**Design Features:** +- TTL-based expiration with configurable timeouts per cache instance +- Lazy cleanup on access operations to minimize overhead +- Pre-computed optimal selection stored in cache entries +- Method chaining support through `__.typx.Self` returns + +### Type Aliases + +```python +DetectionsByProcessor: __.typx.TypeAlias = __.cabc.Mapping[ + str, _processors.Detection ] +``` + +**Purpose:** Provides semantic clarity for function signatures and return types +while maintaining wide parameter acceptance patterns. + +## Behavioral Contracts + +### Processor Selection Contract + +**Selection Algorithm:** +1. Execute all processors in genus-specific registry on source +2. Filter results to confidence >= `CONFIDENCE_THRESHOLD_MINIMUM` (0.5) +3. Select highest confidence; use registration order for ties +4. 
Return `__.absent` if no processors meet confidence threshold + +**Error Handling:** +- Individual processor detection failures are logged but not propagated +- Failed processors are excluded from selection consideration +- Selection continues with remaining successful processors + +### Cache Management Contract + +**Cache Population:** +- Fresh detection triggered on cache miss or TTL expiration +- All genus processors executed in parallel (future enhancement) +- Results cached regardless of optimal selection success + +**Cache Access:** +- Thread-safe read operations using immutable data structures +- Expired entries removed lazily on access +- Missing or expired entries trigger fresh processor execution + +**TTL Management:** +- Configurable per-cache instance (default: 3600 seconds) +- Based on cache entry creation timestamp +- Independent expiration per source URL + +## Extension Points + +### Processor Genus Extension + +**Adding New Processor Types:** +1. Extend `ProcessorGenera` enumeration in `interfaces.py` +2. Add genus-specific cache instance in `detection.py` +3. Update genus dispatch in `access_detections` function +4. Register processors in genus-specific registry + +**Processor Implementation Requirements:** +- Implement `detect` method returning confidence-scored `Detection` +- Handle detection failures gracefully (should not raise exceptions) +- Return confidence score in range 0.0-1.0 +- Provide processor capabilities metadata + +### Cache Strategy Extension + +**Custom Cache Implementations:** +- `DetectionsCache` interface supports alternative implementations +- Size-based eviction strategies can be added via subclassing +- Different TTL strategies per processor type or source pattern +- External cache stores (Redis, etc.) 
through interface compliance + +**Performance Optimization:** +- Parallel processor execution via async fanout (marked TODO) +- Processor-specific timeout configuration +- Cache warming strategies for frequently accessed sources + +## Error Handling Design + +### Structured Error Response System + +The system implements a structured error response pattern where the functions layer +handles all processor detection exceptions and returns user-friendly structured +responses. This design eliminates error interpretation at interface layers while +providing consistent, actionable error messaging. + +**Response Structure:** + +```python +ErrorResponse: __.typx.TypeAlias = __.immut.Dictionary[ str, __.typx.Any ] + +def _produce_inventory_error_response( + source: str, + attempted_patterns: __.Absential[ __.cabc.Sequence[ str ] ] = __.absent +) -> ErrorResponse + +def _produce_structure_error_response( source: str ) -> ErrorResponse + +def _produce_generic_error_response( + source: str, genus: str +) -> ErrorResponse +``` + +**Error Response Content:** +- Structured responses include error type, user-friendly title, detailed message +- Actionable suggestions provided based on specific failure scenarios +- Clear distinction between inventory and structure detection failures +- Pre-formatted messages eliminate interface layer error interpretation + +### Automatic URL Pattern Extension + +The detection system implements universal URL pattern extension that applies to +all processor types. When detection fails at the original URL, the system +automatically probes common documentation site patterns before reporting failure. + +**Universal Pattern Extension:** +- Applies to both inventory and structure processors uniformly +- Documentation content location affects both inventory files and content uniformly +- Common patterns include `/en/latest/`, `/latest/`, `/main/`, etc. 
+- Working URLs are cached in global redirects mapping for future operations + +**Redirects Cache Integration:** + +```python +_url_redirects_cache: dict[ str, str ] # original_url → working_url + +def normalize_location( location: str ) -> str +``` + +**Transparent URL Resolution:** +- All operations automatically use working URLs from redirects cache +- Users receive actual working URLs as canonical source in responses +- Cache updates ensure consistent URL usage across all subsequent operations + +### Exception Hierarchy + +**Core Exceptions:** + +```python +class ProcessorInavailability( Omnierror, RuntimeError ): + ''' No processor found to handle source. ''' + + def __init__( + self, source: str, genus: str, + attempted_processors: __.cabc.Sequence[ str ] + ) + +class DetectionFailure( Omnierror, RuntimeError ): + ''' Processor detection operation failed. ''' + + def __init__( + self, source: str, genus: str, + processor_errors: __.cabc.Mapping[ str, Exception ] + ) +``` + +**Error Propagation:** +- Individual processor failures are caught and logged, not propagated upward +- Functions layer catches all detection exceptions and produces structured responses +- Interface layers receive pre-formatted error information, never raw exceptions + +## Multiple Inventory Handling Strategy + +### Processor Precedence Design + +When multiple inventory processors successfully detect inventory sources for the +same documentation site, the system applies a precedence-based selection strategy +to maintain consistency and user predictability. + +**Detection Precedence Order:** +1. **Sphinx Inventory Processor** (`objects.inv` files) +2. **MkDocs Inventory Processor** (`search_index.json` files) +3. 
**Future processors** in registration order + +**Precedence Selection Algorithm:** + +```python +def select_optimal_detection( + detections: __.cabc.Mapping[ str, _processors.Detection ] +) -> __.Absential[ _processors.Detection ]: + ''' Selects optimal detection using precedence and confidence. ''' + # 1. Filter detections meeting confidence threshold + # 2. Apply processor precedence order for qualified detections + # 3. Use highest confidence as tiebreaker within same precedence level + # 4. Return __.absent if no detections meet threshold +``` + +**Design Rationale:** +- **Consistency**: Predictable processor selection across documentation sites +- **Granularity**: Sphinx inventories provide API-level symbol granularity +- **Completeness**: MkDocs search indices provide page-level content coverage +- **Extensibility**: Registration order precedence supports future processor types + +### Inventory Content Coordination + +For sites with multiple detected inventories, the system coordinates content +operations to leverage the selected inventory processor while maintaining +architectural separation between inventory and structure processing. 
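
The precedence selection steps outlined earlier in this section could be realized roughly as follows. This is a sketch under stated assumptions, not the project's implementation: it uses a plain frozen dataclass in place of `_processors.Detection`, returns `None` where the design returns `__.absent`, applies the `>=` form of the threshold, and treats the processor names in `PRECEDENCE` as hypothetical. Processors not listed in `PRECEDENCE` share a single lowest level, which simplifies the "registration order" rule.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

CONFIDENCE_THRESHOLD_MINIMUM = 0.5
# Hypothetical processor names, listed from highest to lowest precedence.
PRECEDENCE = ('sphinx', 'mkdocs')


@dataclass(frozen=True)
class Detection:
    ''' Simplified stand-in for the design's Detection object. '''
    processor_name: str
    confidence: float


def select_optimal_detection(
    detections: Mapping[str, Detection],
) -> Optional[Detection]:
    ''' Selects a detection by precedence first, then by confidence. '''
    # 1. Keep only detections that meet the confidence threshold.
    qualified = [
        detection for detection in detections.values()
        if detection.confidence >= CONFIDENCE_THRESHOLD_MINIMUM
    ]
    # 4. Nothing qualified: signal absence (None in this sketch).
    if not qualified:
        return None

    def rank(detection: Detection) -> tuple[int, float]:
        # 2. Lower precedence index wins; unlisted processors rank last.
        name = detection.processor_name
        level = PRECEDENCE.index(name) if name in PRECEDENCE else len(PRECEDENCE)
        # 3. Within one level, higher confidence wins.
        return (level, -detection.confidence)

    # min() returns the first minimal item, so remaining ties fall back to
    # the iteration order of the mapping.
    return min(qualified, key=rank)
```

With this ordering, a qualifying Sphinx detection is chosen over an MkDocs detection even when the latter reports higher confidence, which matches the consistency rationale given above.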
+ +**Content Operation Coordination:** +- Selected inventory processor determines object enumeration and filtering +- Structure processors operate independently on content extraction +- Content queries use inventory-selected URIs to guide structure processor operations +- No cross-processor inventory merging to maintain architectural boundaries + +**Cache Strategy:** +- Detection cache stores all successful detections per processor type +- Optimal detection selection cached separately from individual processor results +- Cache entries track processor precedence decisions for consistency +- TTL expiration applies uniformly to all cached detection results + +This detection system design provides robust, extensible automated processor +selection while maintaining clean architectural boundaries between processor +genera and established project practices compliance. diff --git a/documentation/architecture/openspec/specs/processor-detection/spec.md b/documentation/architecture/openspec/specs/processor-detection/spec.md new file mode 100644 index 0000000..a25a9be --- /dev/null +++ b/documentation/architecture/openspec/specs/processor-detection/spec.md @@ -0,0 +1,16 @@ +# Processor Detection + +## Purpose +The Processor Detection capability provides automated selection of appropriate inventory and structure processors for documentation sources. + +## Requirements + +### Requirement: Processor Detection +The system SHALL automatically detect the appropriate processor. 
+ +Priority: High + +#### Scenario: Auto-detection +- **WHEN** a user provides a URL without specifying the type +- **THEN** the system analyzes the site (robots.txt, files) +- **AND** selects the correct processor diff --git a/documentation/architecture/openspec/specs/search/design.md b/documentation/architecture/openspec/specs/search/design.md new file mode 100644 index 0000000..5a7bf37 --- /dev/null +++ b/documentation/architecture/openspec/specs/search/design.md @@ -0,0 +1,846 @@ +# Results Module Design + +## Overview + +The results module provides a centralized collection of structured dataclass +objects representing search results, inventory objects, and content documents. +This module serves as the foundation for type-safe operations across all +interface layers while maintaining clean separation between data representation +and business logic. + +## Design Principles + +### Architectural Foundation + +**Centralized Type Definitions** + All result-related dataclasses reside in a single module to ensure consistency + and prevent circular dependencies between processor, search, and function + modules. + +**Immutable Data Structures** + All result objects inherit from `__.immut.DataclassObject` following project + practices for thread safety and predictable behavior in concurrent operations. + +**Universal Object Interface** + All inventory processors return `InventoryObject` instances rather than + format-specific dictionaries, providing type safety and enabling consistent + search operations across different inventory formats. + +**Complete Source Attribution** + Every result object includes complete provenance information enabling + debugging, caching optimization, and future multi-source operations without + requiring separate tracking mechanisms. + +**Clean Separation of Concerns** + Inventory objects represent pure documentation metadata without search-specific + fields. Search results wrap inventory objects with relevance scoring. 
This + separation allows inventory objects to be reused across different search + contexts and enables search-independent operations. + +**Self-Rendering Object Architecture** + All result objects implement standardized rendering methods for different + output formats, encapsulating domain-specific formatting knowledge within + the objects themselves rather than external formatting functions. + +## Core Object Definitions + +### Universal Inventory Object + +```python +class InventoryObject( __.immut.DataclassObject ): + ''' Universal inventory object with complete source attribution. + + Represents a single documentation object from any inventory source + with standardized fields, format-specific metadata container, and + self-formatting capabilities where each processor creates objects + that know how to render their own specifics data. + ''' + + # Universal identification fields + name: __.typx.Annotated[ + str, __.ddoc.Doc( "Primary object identifier from inventory source." ) ] + uri: __.typx.Annotated[ + str, __.ddoc.Doc( "Relative URI to object documentation content." ) ] + inventory_type: __.typx.Annotated[ + str, __.ddoc.Doc( "Inventory format identifier (e.g., sphinx_objects_inv)." ) ] + location_url: __.typx.Annotated[ + str, __.ddoc.Doc( "Complete URL to inventory location for attribution." ) ] + + # Optional display enhancement + display_name: __.typx.Annotated[ + __.typx.Optional[ str ], + __.ddoc.Doc( "Human-readable name if different from name." ) ] = None + + # Format-specific metadata container + specifics: __.typx.Annotated[ + __.immut.Dictionary[ str, __.typx.Any ], + __.ddoc.Doc( "Format-specific metadata (domain, role, priority, etc.)." ) + ] = __.dcls.field( default_factory = __.immut.Dictionary ) + + + @property + def effective_display_name( self ) -> str: + ''' Returns display_name if available, otherwise falls back to name. 
''' + + # Self-formatting capabilities (processor-provided formatters) + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders complete object as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: __.typx.Annotated[ + bool, + __.ddoc.Doc( ''' + Controls whether implementation-specific details (internal field names, + version numbers, priority scores) are included. When False, only + user-facing information is shown. + ''' ) + ] = True, + ) -> tuple[ str, ... ]: + ''' Renders complete object as Markdown lines for display. ''' +``` + +**Universal Fields** +- `name`: Primary object identifier from inventory location +- `uri`: Relative URI to object documentation content +- `inventory_type`: Format identifier (e.g., "sphinx_objects_inv", "mkdocs_search_index") +- `location_url`: Complete URL to inventory location for debugging and caching + +**Format-Specific Metadata** +- `specifics`: Immutable dictionary containing processor-specific fields +- Sphinx objects include: `domain`, `role`, `priority`, `inventory_project`, `inventory_version` +- MkDocs objects include: `object_type` (content previews handled by structure processors) + +### Search Result Objects + +```python +class SearchResult( __.immut.DataclassObject ): + ''' Search result with inventory object and match metadata. ''' + + inventory_object: __.typx.Annotated[ + InventoryObject, __.ddoc.Doc( "Matched inventory object with metadata." ) ] + score: __.typx.Annotated[ + float, __.ddoc.Doc( "Search relevance score (0.0-1.0)." ) ] + match_reasons: __.typx.Annotated[ + tuple[ str, ... ], + __.ddoc.Doc( "Detailed reasons for search match." ) ] + + @classmethod + def from_inventory_object( + cls, + inventory_object: InventoryObject, *, + score: float, + match_reasons: __.cabc.Sequence[ str ], + ) -> __.typx.Self: + ''' Creates search result from inventory object with scoring. 
''' +``` + +### Content and Documentation Objects + +```python +class ContentDocument( __.immut.DataclassObject ): + ''' Documentation content with extracted metadata and content identification. ''' + + inventory_object: __.typx.Annotated[ + InventoryObject, __.ddoc.Doc( "Location inventory object for this content." ) ] + content_id: __.typx.Annotated[ + str, __.ddoc.Doc( "Deterministic identifier for content retrieval." ) ] + description: __.typx.Annotated[ + str, __.ddoc.Doc( "Extracted object description or summary." ) ] = '' + documentation_url: __.typx.Annotated[ + str, __.ddoc.Doc( "Complete URL to full documentation page." ) ] = '' + + # Structure processor metadata + extraction_metadata: __.typx.Annotated[ + __.immut.Dictionary[ str, __.typx.Any ], + __.ddoc.Doc( "Metadata from structure processor extraction." ) + ] = __.dcls.field( default_factory = __.immut.Dictionary ) + + @property + def has_meaningful_content( self ) -> bool: + ''' Returns True if document contains useful extracted content. ''' + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders complete document as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + ) -> tuple[ str, ... ]: + ''' Renders complete document as Markdown lines for display. ''' +``` + +### Content Identification System + +The content ID system enables browse-then-extract workflows by providing stable identifiers for documentation content objects. Content IDs are deterministic identifiers that allow users to first query with truncated results for previews, then extract full content for specific objects. 
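
A minimal sketch of how such a deterministic, decodable identifier might be produced, assuming the ID is the base64 encoding of the location and the object name joined by a colon (the scheme this section describes). The function names `produce_content_id` and `decode_content_id` and the example URL are hypothetical.

```python
import base64


def produce_content_id(location: str, object_name: str) -> str:
    ''' Builds a stateless content ID from location and object name. '''
    raw = f"{location}:{object_name}"
    return base64.b64encode(raw.encode('utf-8')).decode('ascii')


def decode_content_id(content_id: str) -> str:
    ''' Recovers the "location:object_name" string for debugging. '''
    return base64.b64decode(content_id.encode('ascii')).decode('utf-8')


content_id = produce_content_id('https://docs.example.org', 'pkg.func')
```

Because the ID depends only on its two inputs, repeated queries yield the same value without any session state. Splitting the decoded string back into its two parts is left out on purpose: URLs already contain colons, so a real implementation would need an unambiguous delimiter rule.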
+ +**Content ID Generation Strategy** + +Content IDs use deterministic object identification: `base64(location + ":" + object_name)` + +**Design Benefits:** + +- **Stateless Architecture**: Content IDs are self-contained, requiring no session storage +- **Stable Identification**: Same object always generates same ID regardless of query timing +- **Human-Debuggable**: IDs can be decoded to understand referenced objects +- **Performance**: No expensive computation or state tracking required + +**Usage Pattern:** + +```python +# Stage 1: Browse with previews - generates content IDs for all results +preview_result = await query_content( + auxdata, location, term, lines_max = 5 ) + +# Stage 2: Extract full content using content ID from preview +full_result = await query_content( + auxdata, location, term, + content_id = preview_result.documents[0].content_id, + lines_max = 100 ) +``` + +**Interface Integration:** + +The content_id parameter extends the existing query_content function: + +- **Without content_id**: Returns multiple ContentDocument objects with content IDs populated +- **With content_id**: Filters to single matching ContentDocument with full content +- **Error Handling**: Invalid content IDs raise ProcessorInavailability exceptions + +This design transforms query_content from a simple search function into a flexible content navigation tool while maintaining complete backward compatibility and stateless operation. + +## Query Metadata Objects + +### Search and Operation Metadata + +```python +class SearchMetadata( __.immut.DataclassObject ): + ''' Search operation metadata and performance statistics. ''' + + results_count: __.typx.Annotated[ + int, __.ddoc.Doc( "Number of results returned to user." ) ] + results_max: __.typx.Annotated[ + int, __.ddoc.Doc( "Maximum results requested by user." ) ] + matches_total: __.typx.Annotated[ + __.typx.Optional[ int ], + __.ddoc.Doc( "Total matching objects before limit applied." 
) ] = None + search_time_ms: __.typx.Annotated[ + __.typx.Optional[ int ], + __.ddoc.Doc( "Search execution time in milliseconds." ) ] = None + + @property + def results_truncated( self ) -> bool: + ''' Returns True if results were limited by results_max. ''' + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders search metadata as JSON-compatible dictionary. ''' + +class InventoryLocationInfo( __.immut.DataclassObject ): + ''' Information about detected inventory location and processor. ''' + + inventory_type: __.typx.Annotated[ + str, __.ddoc.Doc( "Inventory format type identifier." ) ] + location_url: __.typx.Annotated[ + str, __.ddoc.Doc( "Complete URL to inventory location." ) ] + processor_name: __.typx.Annotated[ + str, __.ddoc.Doc( "Name of processor handling this location." ) ] + confidence: __.typx.Annotated[ + float, __.ddoc.Doc( "Detection confidence score (0.0-1.0)." ) ] + object_count: __.typx.Annotated[ + int, __.ddoc.Doc( "Total objects available in this inventory." ) ] + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders location info as JSON-compatible dictionary. ''' +``` + +### Detection Result Objects + +```python +class Detection( __.immut.DataclassObject ): + ''' Processor detection information with confidence scoring. ''' + + processor_name: __.typx.Annotated[ + str, __.ddoc.Doc( "Name of the processor that can handle this location." ) ] + confidence: __.typx.Annotated[ + float, __.ddoc.Doc( "Detection confidence score (0.0-1.0)." ) ] + processor_type: __.typx.Annotated[ + str, __.ddoc.Doc( "Type of processor (inventory, structure)." ) ] + detection_metadata: __.typx.Annotated[ + __.immut.Dictionary[ str, __.typx.Any ], + __.ddoc.Doc( "Processor-specific detection metadata." ) + ] = __.dcls.field( default_factory = __.immut.Dictionary ) + +class DetectionsResult( __.immut.DataclassObject ): + ''' Detection results with processor selection and timing metadata. 
''' + + source: __.typx.Annotated[ + str, __.ddoc.Doc( "Primary location URL for detection operation." ) ] + detections: __.typx.Annotated[ + tuple[ Detection, ... ], + __.ddoc.Doc( "All processor detections found for location." ) ] + detection_optimal: __.typx.Annotated[ + __.typx.Optional[ Detection ], + __.ddoc.Doc( "Best detection result based on confidence scoring." ) ] + time_detection_ms: __.typx.Annotated[ + int, __.ddoc.Doc( "Detection operation time in milliseconds." ) ] + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders detection results as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + ) -> tuple[ str, ... ]: + ''' Renders detection results as Markdown lines for display. ''' +``` + +### Processor Survey Result Objects + +```python +class ProcessorInfo( __.immut.DataclassObject ): + ''' Information about a processor and its capabilities. ''' + + processor_name: __.typx.Annotated[ + str, __.ddoc.Doc( "Name of the processor for identification." ) ] + processor_type: __.typx.Annotated[ + str, __.ddoc.Doc( "Type of processor (inventory, structure)." ) ] + capabilities: __.typx.Annotated[ + __.interfaces.ProcessorCapabilities, + __.ddoc.Doc( "Complete capability description for processor." ) ] + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders processor info as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + ) -> tuple[ str, ... ]: + ''' Renders processor info as Markdown lines for display. ''' + +class ProcessorsSurveyResult( __.immut.DataclassObject ): + ''' Survey results listing available processors and capabilities. ''' + + genus: __.typx.Annotated[ + __.interfaces.ProcessorGenera, + __.ddoc.Doc( "Processor genus that was surveyed (inventory or structure)." 
) ] + filter_name: __.typx.Annotated[ + __.typx.Optional[ str ], + __.ddoc.Doc( "Optional processor name filter applied to survey." ) ] = None + processors: __.typx.Annotated[ + tuple[ ProcessorInfo, ... ], + __.ddoc.Doc( "Available processors matching survey criteria." ) ] + survey_time_ms: __.typx.Annotated[ + int, __.ddoc.Doc( "Survey operation time in milliseconds." ) ] + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders survey results as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + ) -> tuple[ str, ... ]: + ''' Renders survey results as Markdown lines for display. ''' +``` + +### Error Handling Objects + +The error handling architecture supports both structured error responses for API boundaries and self-rendering exceptions for natural Python exception flow. This dual approach enables clean function signatures while maintaining structured error information across interface layers. + +**Self-Rendering Exception Base Classes** + +```python +class Omniexception( __.immut.Object, BaseException ): + ''' Base for all exceptions raised by package API. ''' + +class Omnierror( Omniexception, Exception ): + ''' Base for error exceptions with self-rendering capability. ''' + + @__.abc.abstractmethod + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders exception as JSON-compatible dictionary. ''' + + @__.abc.abstractmethod + def render_as_markdown( self ) -> tuple[ str, ... ]: + ''' Renders exception as Markdown lines for display. ''' +``` + +**Domain-Specific Self-Rendering Exceptions** + +```python +class ProcessorInavailability( Omnierror, RuntimeError ): + ''' No processor found to handle source. ''' + + def __init__( + self, + source: __.typx.Annotated[ + str, __.ddoc.Doc( "Source URL that could not be processed." ) ], + genus: __.Absential[ str ] = __.absent, + query: __.Absential[ str ] = __.absent, + ): ... 
+ + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders processor unavailability as JSON-compatible dictionary. ''' + +class InventoryInaccessibility( Omnierror, RuntimeError ): + ''' Inventory location cannot be accessed. ''' + + def __init__( + self, + location: __.typx.Annotated[ + str, __.ddoc.Doc( "Inventory location URL." ) ], + cause: __.typx.Annotated[ + __.typx.Optional[ BaseException ], + __.ddoc.Doc( "Underlying exception that caused inaccessibility." ) + ] = None, + ): ... + +class InventoryInvalidity( Omnierror, ValueError ): + ''' Inventory data format is invalid or corrupted. ''' + + def __init__( + self, + location: __.typx.Annotated[ + str, __.ddoc.Doc( "Inventory location URL." ) ], + details: __.typx.Annotated[ + str, __.ddoc.Doc( "Description of invalidity." ) + ], + ): ... +``` + +### Complete Query Results + +```python +class InventoryQueryResult( __.immut.DataclassObject ): + ''' Complete result structure for inventory queries. ''' + + location: __.typx.Annotated[ + str, __.ddoc.Doc( "Primary location URL for this query." ) ] + query: __.typx.Annotated[ + str, __.ddoc.Doc( "Search term or query string used." ) ] + objects: __.typx.Annotated[ + tuple[ InventoryObject, ... ], + __.ddoc.Doc( "Inventory objects matching search criteria." ) ] + search_metadata: __.typx.Annotated[ + SearchMetadata, __.ddoc.Doc( "Search execution and result metadata." ) ] + inventory_locations: __.typx.Annotated[ + tuple[ InventoryLocationInfo, ... ], + __.ddoc.Doc( "Information about inventory locations used." ) ] + + def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders inventory query result as JSON-compatible dictionary. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + ) -> tuple[ str, ... ]: + ''' Renders inventory query result as Markdown lines for display. 
''' + +class ContentQueryResult( __.immut.DataclassObject ): + ''' Complete result structure for content queries. ''' + + location: __.typx.Annotated[ + str, __.ddoc.Doc( "Primary location URL for this query." ) ] + query: __.typx.Annotated[ + str, __.ddoc.Doc( "Search term or query string used." ) ] + documents: __.typx.Annotated[ + tuple[ ContentDocument, ... ], + __.ddoc.Doc( "Documentation content for matching objects." ) ] + search_metadata: __.typx.Annotated[ + SearchMetadata, __.ddoc.Doc( "Search execution and result metadata." ) ] + inventory_locations: __.typx.Annotated[ + tuple[ InventoryLocationInfo, ... ], + __.ddoc.Doc( "Information about inventory locations used." ) ] + + def render_as_json( + self, /, *, + lines_max: __.typx.Optional[ int ] = None, + ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders content query result as JSON-compatible dictionary with optional content truncation. ''' + + def render_as_markdown( + self, /, *, + reveal_internals: bool = True, + lines_max: __.typx.Annotated[ + __.typx.Optional[ int ], + __.ddoc.Doc( "Maximum lines to display per content result." ) + ] = None, + ) -> tuple[ str, ... ]: + ''' Renders content query result as Markdown lines for display. ''' +``` + +## Processor Integration Design + +### Enhanced Base Classes + +The processor layer integrates with structured objects through updated return types: + +```python +# processors.py - Enhanced base class +class InventoryDetection( Detection ): + ''' Enhanced base class returning structured objects. ''' + + @__.abc.abstractmethod + async def filter_inventory( + self, + auxdata: __.ApplicationGlobals, + location: str, /, *, + filters: __.cabc.Mapping[ str, __.typx.Any ], + details: __.InventoryQueryDetails = ( + __.InventoryQueryDetails.Documentation ), + ) -> tuple[ InventoryObject, ... ]: + ''' Returns structured inventory objects instead of dictionaries. 
''' +``` + +### Processor Object Formatting + +Each processor provides consistent object formatting: + +```python +# Sphinx processor formatting +def format_inventory_object( + sphinx_object: __.typx.Any, + inventory: __.typx.Any, + location_url: str, +) -> InventoryObject: + ''' Formats Sphinx inventory object with complete attribution. ''' + + return InventoryObject( + name = sphinx_object.name, + uri = sphinx_object.uri, + inventory_type = 'sphinx_objects_inv', + location_url = location_url, + display_name = ( + sphinx_object.dispname + if sphinx_object.dispname != '-' + else None ), + specifics = __.immut.Dictionary( + domain = sphinx_object.domain, + role = sphinx_object.role, + priority = sphinx_object.priority, + inventory_project = inventory.project, + inventory_version = inventory.version ) ) + +# MkDocs processor formatting +def format_inventory_object( + mkdocs_document: __.cabc.Mapping[ str, __.typx.Any ], + location_url: str, +) -> InventoryObject: + ''' Formats MkDocs search index document with attribution. 
''' + + typed_doc = dict( mkdocs_document ) + location = str( typed_doc.get( 'location', '' ) ) + title = str( typed_doc.get( 'title', '' ) ) + + return InventoryObject( + name = title, + uri = location, + inventory_type = 'mkdocs_search_index', + location_url = location_url, + specifics = __.immut.Dictionary( + domain = 'page', + role = 'doc', + priority = '1', + object_type = 'page' ) ) +``` + +## Functions Layer Integration + +### Enhanced Business Logic Functions + +The functions module provides clean business logic functions using natural exception flow with self-rendering exceptions: + +```python +# functions.py - Clean signatures with exception-based error handling +async def query_inventory( + auxdata: __.ApplicationGlobals, + location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], + term: str, /, *, + processor_name: __.Absential[ str ] = __.absent, + search_behaviors: __.SearchBehaviors = _search_behaviors_default, + filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, + details: __.InventoryQueryDetails = ( + __.InventoryQueryDetails.Documentation ), + results_max: int = 5, +) -> InventoryQueryResult: + ''' Returns structured inventory query results. Raises domain exceptions on error. ''' + +async def query_content( + auxdata: __.ApplicationGlobals, + location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], + term: str, /, *, + processor_name: __.Absential[ str ] = __.absent, + search_behaviors: __.SearchBehaviors = _search_behaviors_default, + filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, + content_id: __.Absential[ str ] = __.absent, + results_max: int = 10, + lines_max: __.typx.Optional[ int ] = None, +) -> ContentQueryResult: + ''' Returns structured content query results. When content_id provided, returns single matching document. Raises domain exceptions on error. 
''' + +async def detect( + auxdata: __.ApplicationGlobals, + location: __.typx.Annotated[ str, __.ddoc.Fname( 'location argument' ) ], /, *, + processor_name: __.Absential[ str ] = __.absent, + processor_types: __.cabc.Sequence[ str ] = ( 'inventory', 'structure' ), +) -> DetectionsResult: + ''' Returns structured detection results with processor selection and timing. ''' + +async def survey_processors( + auxdata: __.ApplicationGlobals, /, + genus: __.interfaces.ProcessorGenera, + name: __.typx.Optional[ str ] = None, +) -> ProcessorsSurveyResult: + ''' Returns structured survey results listing available processors and capabilities. ''' +``` + +### Error Handling Patterns + +The system uses **self-rendering exceptions** for natural Python error flow with clean function signatures and consistent error presentation across interface layers. + +**Self-Rendering Exception Pattern** + +Functions use natural exception flow with domain-specific self-rendering exceptions: + +```python +# Business logic functions with clean signatures +async def query_inventory( + auxdata: __.ApplicationGlobals, + location: str, + term: str, /, *, + search_behaviors: __.SearchBehaviors = _search_behaviors_default, + filters: __.cabc.Mapping[ str, __.typx.Any ] = _filters_default, + details: __.InventoryQueryDetails = __.InventoryQueryDetails.Documentation, + results_max: int = 5, +) -> InventoryQueryResult: + ''' Returns structured inventory query results. Raises domain exceptions on error. ''' + +# Processor layer raises self-rendering exceptions +class SphinxInventoryProcessor: + async def query_inventory( + self, filters: __.cabc.Mapping[ str, __.typx.Any ], + details: __.InventoryQueryDetails + ) -> tuple[ __.InventoryObject, ... 
]: + try: + inventory = extract_inventory( base_url ) + return tuple( format_objects( inventory, filters ) ) + except ConnectionError as exc: + raise InventoryInaccessibility( location = url, cause = exc ) + except ParseError as exc: + raise InventoryInvalidity( location = url, details = str( exc ) ) +``` + +**Interface Layer Exception Handling** + +Interface layers use Aspect-Oriented Programming (AOP) patterns with decorators: + +```python +# MCP Server - Exception interception decorator signature +def intercept_errors( func ) -> __.cabc.Callable: + ''' Intercepts package exceptions and renders them as JSON for MCP. ''' + +@intercept_errors +async def query_inventory_mcp( location: str, term: str, ... ): + ''' Searches object inventory by name with fuzzy matching. ''' + +# CLI Layer - Parameterized exception handling decorator signature +def intercept_errors( + stream: __.typx.TextIO, + display_format: __.DisplayFormat +) -> __.cabc.Callable: + ''' Creates decorator to intercept package exceptions and render for CLI. ''' +``` + +## Search Engine Integration + +### Enhanced Search Result Objects + +```python +# search.py - Enhanced to work with structured objects +def filter_by_name( + objects: __.cabc.Sequence[ InventoryObject ], + term: str, /, *, + match_mode: __.MatchMode = __.MatchMode.Fuzzy, + fuzzy_threshold: int = 50, +) -> tuple[ SearchResult, ... ]: + ''' Enhanced search filtering returning structured results. ''' +``` + +### Self-Rendering Architecture + +**Universal Rendering Interface** +All structured result objects implement standardized rendering methods: + +```python +# Universal rendering interface for all result objects +def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Renders object as JSON-compatible immutable dictionary. ''' + +def render_as_markdown( + self, /, *, + reveal_internals: bool = True +) -> tuple[ str, ... ]: + ''' Renders object as Markdown lines for CLI display. 
''' +``` + +**Domain-Specific Rendering Implementation** +Each object encapsulates its own formatting logic: + +```python +# InventoryObject rendering example +def render_as_json( self ) -> __.immut.Dictionary[ str, __.typx.Any ]: + ''' Returns JSON-compatible dictionary with domain knowledge. ''' + result = __.immut.Dictionary( + name = self.name, + uri = self.uri, + inventory_type = self.inventory_type, + location_url = self.location_url, + display_name = self.display_name, + effective_display_name = self.effective_display_name, + ) + # Merge with domain-specific formatting logic + return result.union( self.specifics ) + +def render_as_markdown( + self, /, *, reveal_internals: bool = True +) -> tuple[ str, ... ]: + ''' Returns Markdown lines using processor-specific formatting. ''' + lines = [ f"### `{self.effective_display_name}`" ] + # Domain-specific formatting logic implemented by processors + return tuple( lines ) +``` + +## Validation and Type Safety + +### Object Validation Strategy + +Validation of result objects is implemented at object initialization +through `__post_init__` methods when validation is needed. This ensures +that invalid objects cannot be constructed and provides fail-fast behavior +with guaranteed valid state. + +Objects own their validity invariants through initialization-time validation +rather than relying on external validation functions. + +## Module Organization + +### File Structure and Imports + +```python +# results.py - Core results module +from . import __ + +# Core result objects +class InventoryObject( __.immut.DataclassObject ): ... +class SearchResult( __.immut.DataclassObject ): ... +class ContentDocument( __.immut.DataclassObject ): ... + +# Metadata objects +class SearchMetadata( __.immut.DataclassObject ): ... +class InventoryLocationInfo( __.immut.DataclassObject ): ... + +# Complete query results +class InventoryQueryResult( __.immut.DataclassObject ): ... +class ContentQueryResult( __.immut.DataclassObject ): ... 
+class DetectionsResult( __.immut.DataclassObject ): ... + +# Survey results +class ProcessorInfo( __.immut.DataclassObject ): ... +class ProcessorsSurveyResult( __.immut.DataclassObject ): ... + + +# Serialization support +def serialize_for_json( ... ): ... + +# Type aliases (at end to avoid forward references) +InventoryObjects: __.typx.TypeAlias = __.cabc.Sequence[ InventoryObject ] +SearchResults: __.typx.TypeAlias = __.cabc.Sequence[ SearchResult ] +ContentDocuments: __.typx.TypeAlias = __.cabc.Sequence[ ContentDocument ] +``` + +```python +# exceptions.py - Self-rendering exception hierarchy +from . import __ + +# Base exception hierarchy +class Omniexception( __.immut.Object, BaseException ): ... +class Omnierror( Omniexception, Exception ): ... + +# Domain-specific exceptions with self-rendering capabilities +class ProcessorInavailability( Omnierror, RuntimeError ): ... +class InventoryInaccessibility( Omnierror, RuntimeError ): ... +class InventoryInvalidity( Omnierror, ValueError ): ... +class ContentInaccessibility( Omnierror, RuntimeError ): ... +class ContentInvalidity( Omnierror, ValueError ): ... +``` + +## Presentation Layer Integration + +### CLI and Renderers Integration + +The self-rendering architecture enables clean separation between business logic +and presentation concerns: + +**Presentation vs Business Logic Separation** +- **Objects handle domain logic**: `result.render_as_json()` +- **CLI coordinators handle presentation**: truncation, formatting, display helpers +- **MCP server uses objects directly**: no CLI-specific presentation layer + +**Direct Self-Rendering Architecture** +Objects handle all presentation directly through self-rendering methods, eliminating +the need for external presentation coordination layers. 
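The self-rendering idea can be sketched with standard-library types only. This is an illustration under stated assumptions, not the module's code: `LocationSummary` and `present` are hypothetical names, and `types.MappingProxyType` stands in for the project's `__.immut.Dictionary`.

```python
import dataclasses
import json
import types
import typing


@dataclasses.dataclass( frozen = True )
class LocationSummary:
    ''' Hypothetical stand-in for a self-rendering result object. '''

    location_url: str
    processor_name: str
    confidence: float

    def render_as_json( self ) -> typing.Mapping[ str, typing.Any ]:
        ''' Renders object as read-only, JSON-compatible mapping. '''
        return types.MappingProxyType( {
            'location_url': self.location_url,
            'processor_name': self.processor_name,
            'confidence': self.confidence } )

    def render_as_markdown(
        self, /, *, reveal_internals: bool = True
    ) -> tuple[ str, ... ]:
        ''' Renders object as Markdown lines. '''
        lines = [ f"### `{self.location_url}`" ]
        if reveal_internals:
            # Implementation details appear only when requested.
            lines.append( f"- **Processor:** {self.processor_name}" )
            lines.append( f"- **Confidence:** {self.confidence:.2f}" )
        return tuple( lines )


def present( result: LocationSummary, display_format: str ) -> str:
    ''' Interface layer chooses a format; the object does the rendering. '''
    if display_format == 'json':
        return json.dumps( dict( result.render_as_json( ) ), indent = 2 )
    return '\n'.join( result.render_as_markdown( ) )
```

The interface layer only selects a format and joins lines; everything specific to the object's fields stays inside the object.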
+ +## Integration Benefits + +**Clean Function Signatures** +- Natural exception flow eliminates verbose union return types +- Business logic functions have clean success-case signatures +- Type annotations reflect actual success types without error boilerplate +- Function signatures become more readable and maintainable + +**Type Safety and IDE Support** +- Static type checking of object structure and field access +- Full IDE autocompletion and refactoring support +- Static analysis capabilities for detecting field usage +- Exception type hierarchy provides structured error catching patterns + +**Self-Rendering Architecture** +- Exceptions handle their own presentation logic through render methods +- Objects encapsulate format-specific knowledge within themselves +- Clean separation between business logic and presentation concerns +- Consistent error display across CLI and MCP interfaces without duplication + +**Aspect-Oriented Error Handling** +- Interface layers use decorators for cross-cutting error handling concerns +- Business logic remains pure with no error marshaling overhead +- Single point of error presentation control per interface layer +- Exception handling behavior easily modified without touching business functions + +**Domain-Specific Rendering** +- Processors provide domain expertise through object rendering methods +- Extensible rendering without modifying CLI or interface layers +- Complete error context preservation from point of failure to presentation +- Self-contained formatting logic reduces coupling between layers + +**Complete Source Attribution** +- Full provenance tracking for every inventory object +- Enhanced debugging capabilities with location-specific metadata +- Foundation for future multi-source aggregation capabilities +- Exception objects maintain complete failure context + +**Consistency and Maintainability** +- Unified interface across all inventory processor types +- Clear separation between universal and format-specific 
data +- Predictable object structure for interface layers +- Error handling complexity isolated to exception classes and decorators + +**Performance and Scalability** +- Immutable objects enable safe concurrent access +- Structural sharing reduces memory overhead +- Efficient serialization for network transmission +- Exception-based flow avoids creating error objects for success cases +- Domain-specific rendering optimizations contained within objects + +This results module design provides a robust foundation for type-safe operations +across all system components while maintaining clean architectural boundaries +and enabling future enhancements through structured object capabilities and +self-rendering architecture. diff --git a/documentation/architecture/openspec/specs/search/spec.md b/documentation/architecture/openspec/specs/search/spec.md new file mode 100644 index 0000000..649a27b --- /dev/null +++ b/documentation/architecture/openspec/specs/search/spec.md @@ -0,0 +1,40 @@ +# Search + +## Purpose +The Search capability provides flexible and powerful mechanisms to query the ingested documentation, supporting various matching strategies to help users find relevant information quickly. + +## Requirements + +### Requirement: Search Modes +The system SHALL support multiple search modes (Fuzzy, Exact, Regex). + +Priority: Critical + +#### Scenario: Fuzzy Search +- **WHEN** a user searches with a potentially misspelled term +- **THEN** the system uses fuzzy matching +- **AND** returns relevant results based on similarity + +#### Scenario: Regex Search +- **WHEN** a user searches with a regular expression +- **THEN** the system matches patterns against the inventory +- **AND** returns matching objects + +### Requirement: Filtering +The system SHALL support filtering by domain, role, and custom properties. 
+ +Priority: Critical + +#### Scenario: Filter by Domain +- **WHEN** a user searches with a domain filter (e.g., `py:class`) +- **THEN** only objects matching that domain are returned + +### Requirement: Scope +The system SHALL search across inventory objects and full content. + +Priority: Critical + +#### Scenario: Full Content Search +- **WHEN** a user requests a full content search +- **THEN** the system searches the text body of the documentation +- **AND** returns pages containing the term diff --git a/documentation/architecture/designs/structure-processors.rst b/documentation/architecture/openspec/specs/structure-processing/design.md similarity index 63% rename from documentation/architecture/designs/structure-processors.rst rename to documentation/architecture/openspec/specs/structure-processing/design.md index c1712e3..8bdaad5 100644 --- a/documentation/architecture/designs/structure-processors.rst +++ b/documentation/architecture/openspec/specs/structure-processing/design.md @@ -1,553 +1,503 @@ -.. vim: set fileencoding=utf-8: -.. -*- coding: utf-8 -*- -.. +--------------------------------------------------------------------------+ - | | - | Licensed under the Apache License, Version 2.0 (the "License"); | - | you may not use this file except in compliance with the License. | - | You may obtain a copy of the License at | - | | - | http://www.apache.org/licenses/LICENSE-2.0 | - | | - | Unless required by applicable law or agreed to in writing, software | - | distributed under the License is distributed on an "AS IS" BASIS, | - | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | - | See the License for the specific language governing permissions and | - | limitations under the License. 
| - | | - +--------------------------------------------------------------------------+ - - -******************************************************************************* -Structure Processors Architecture -******************************************************************************* - -Overview -=============================================================================== - -Structure processors extract content from documentation pages and transform it -into structured documents suitable for search and analysis. These processors -form the content extraction layer of librovore's architecture, working with +# Structure Processors Architecture + +## Overview + +Structure processors extract content from documentation pages and transform it +into structured documents suitable for search and analysis. These processors +form the content extraction layer of librovore's architecture, working with inventory objects to provide comprehensive documentation access. -**Role in content extraction pipeline**: Structure processors serve as the -bridge between inventory object discovery and actual content access. They -convert HTML documentation pages into structured, searchable content while +**Role in content extraction pipeline**: Structure processors serve as the +bridge between inventory object discovery and actual content access. They +convert HTML documentation pages into structured, searchable content while preserving semantic information and cross-references. -**Relationship to inventory processors**: Structure processors consume inventory -objects created by inventory processors, using the object metadata and URIs -to guide content extraction. The relationship is mediated through capability-based -filtering to ensure inventory objects are only processed by compatible structure +**Relationship to inventory processors**: Structure processors consume inventory +objects created by inventory processors, using the object metadata and URIs +to guide content extraction. 
The relationship is mediated through capability-based +filtering to ensure inventory objects are only sent to compatible structure processors. -**Inventory-type awareness principles**: Structure processors advertise their -compatibility with specific inventory types and content formats, enabling -intelligent routing of extraction requests. This prevents processing failures -and optimizes extraction quality by matching processors to their optimal +**Inventory-type awareness principles**: Structure processors advertise their +compatibility with specific inventory types and content formats, enabling +intelligent routing of extraction requests. This prevents processing failures +and optimizes extraction quality by matching processors to their optimal content sources. -Architecture Patterns -=============================================================================== +## Architecture Patterns -Content Extraction Dataflow -------------------------------------------------------------------------------- +### Content Extraction Dataflow -Structure processors follow a consistent content extraction pipeline regardless +Structure processors follow a consistent content extraction pipeline regardless of the underlying documentation format: -.. 
code-block:: text - - InventoryObject Input - │ - ▼ - ┌─────────────────────┐ - │ Capability │ ◄─── Inventory type validation - │ Filtering │ Processor compatibility check - └─────────────────────┘ - │ - ▼ - ┌─────────────────────┐ - │ URL Construction │ ◄─── Context-aware URI building - └─────────────────────┘ Base URL resolution - │ - ▼ - ┌─────────────────────┐ - │ Content Retrieval │ ◄─── HTTP fetching - └─────────────────────┘ Caching integration - │ - ▼ - ┌─────────────────────┐ - │ HTML Processing │ ◄─── Theme-specific extraction - └─────────────────────┘ Content identification - │ - ▼ - ┌─────────────────────┐ - │ Document Creation │ ◄─── Markdown conversion - └─────────────────────┘ Metadata preservation - │ - ▼ - ContentDocument Output - -**Capability Filtering**: Initial filtering ensures that a structure processor -only receives compatible inventory objects based on inventory type and content +```text +InventoryObject Input + │ + ▼ +┌─────────────────────┐ +│ Capability │ ◄─── Inventory type validation +│ Filtering │ Processor compatibility check +└─────────────────────┘ + │ + ▼ +┌─────────────────────┐ +│ URL Construction │ ◄─── Context-aware URI building +└─────────────────────┘ Base URL resolution + │ + ▼ +┌─────────────────────┐ +│ Content Retrieval │ ◄─── HTTP fetching +└─────────────────────┘ Caching integration + │ + ▼ +┌─────────────────────┐ +│ HTML Processing │ ◄─── Theme-specific extraction +└─────────────────────┘ Content identification + │ + ▼ +┌─────────────────────┐ +│ Document Creation │ ◄─── Markdown conversion +└─────────────────────┘ Metadata preservation + │ + ▼ +ContentDocument Output +``` + +**Capability Filtering**: Initial filtering ensures that a structure processor +only receives compatible inventory objects based on inventory type and content format requirements. 
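A rough sketch of the capability filtering stage, assuming each structure processor advertises the inventory types it can handle. The names `StructureCapabilities`, `InventoryItem`, `supported_inventory_types`, and `filter_compatible` are hypothetical; only the inventory type strings come from the design documents.

```python
import dataclasses


@dataclasses.dataclass( frozen = True )
class StructureCapabilities:
    ''' Hypothetical capability advertisement for a structure processor. '''

    processor_name: str
    supported_inventory_types: frozenset[ str ]


@dataclasses.dataclass( frozen = True )
class InventoryItem:
    ''' Hypothetical, reduced stand-in for an inventory object. '''

    name: str
    uri: str
    inventory_type: str


def filter_compatible(
    capabilities: StructureCapabilities,
    objects: tuple[ InventoryItem, ... ],
) -> tuple[ InventoryItem, ... ]:
    ''' Keeps only objects whose inventory type the processor supports. '''
    return tuple(
        item for item in objects
        if item.inventory_type in capabilities.supported_inventory_types )
```

Filtering before URL construction means incompatible objects never trigger a network fetch.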
-**URL Construction**: Context-aware URL building that incorporates base documentation +**URL Construction**: Context-aware URL building that incorporates base documentation URLs, relative paths from inventory objects, and processor-specific URL patterns. -**Content Retrieval**: Standardized HTTP content fetching with caching integration, +**Content Retrieval**: Standardized HTTP content fetching with caching integration, error handling, and retry mechanisms for robust content access. -**HTML Processing**: Format-specific HTML content extraction that understands +**HTML Processing**: Format-specific HTML content extraction that understands documentation theme layouts, navigation structures, and content organization patterns. -**Document Creation**: Transformation of extracted content into structured documents +**Document Creation**: Transformation of extracted content into structured documents with preserved metadata, cross-references, and searchable text content. -URL Construction Patterns -------------------------------------------------------------------------------- +### URL Construction Patterns -Structure processors implement intelligent URL construction that accommodates +Structure processors implement intelligent URL construction that accommodates diverse documentation site organizations: -**Base URL Resolution**: Consistent handling of documentation base URLs with -support for subdirectory installations, CDN distributions, and alternative +**Base URL Resolution**: Consistent handling of documentation base URLs with +support for subdirectory installations, CDN distributions, and alternative hosting arrangements. -**Relative Path Handling**: Proper resolution of inventory object URIs relative -to documentation base URLs with consideration for URL encoding, fragment +**Relative Path Handling**: Proper resolution of inventory object URIs relative +to documentation base URLs with consideration for URL encoding, fragment identifiers, and query parameters. 
-**Context-Aware Construction**: URL building that considers documentation site -structure including version-specific paths, language variants, and theme-specific +**Context-Aware Construction**: URL building that considers documentation site +structure including version-specific paths, language variants, and theme-specific URL patterns. -**Fallback Strategies**: Alternative URL construction approaches when primary +**Fallback Strategies**: Alternative URL construction approaches when primary patterns fail, including probe-based discovery and heuristic URL derivation. -HTML Processing and Conversion -------------------------------------------------------------------------------- +### HTML Processing and Conversion -Content extraction accommodates the diversity of documentation site layouts +Content extraction accommodates the diversity of documentation site layouts and themes through adaptive processing strategies: -**Theme Recognition**: Detection of documentation themes and frameworks to +**Theme Recognition**: Detection of documentation themes and frameworks to optimize content extraction for specific layout patterns and markup conventions. -**Content Identification**: Intelligent identification of main content areas -within documentation pages, distinguishing content from navigation, advertising, +**Content Identification**: Intelligent identification of main content areas +within documentation pages, distinguishing content from navigation, advertising, and decorative elements. -**Semantic Preservation**: Extraction that preserves semantic markup including +**Semantic Preservation**: Extraction that preserves semantic markup including headings, code blocks, cross-references, and structured content elements. -**Cleanup and Normalization**: Content sanitization that removes theme-specific +**Cleanup and Normalization**: Content sanitization that removes theme-specific artifacts while preserving essential formatting and structural information. 
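The content identification step described above can be illustrated with a minimal sketch. It assumes the page wraps its primary content in a `<main>` element and uses only the standard library `html.parser`; the class name `MainContentExtractor` and the set of skipped tags are hypothetical and not part of the documented design.

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collects text inside <main>, skipping navigation-like elements.

    Hypothetical helper: real themes may need theme-specific selectors.
    """

    SKIPPED = frozenset({"nav", "aside", "footer", "script", "style"})

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._main_depth = 0  # > 0 while inside a <main> element
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self._main_depth += 1
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self._main_depth:
            self._main_depth -= 1
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside <main> and outside every skipped element.
        if self._main_depth and not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Returns the main-content text chunks joined by newlines."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A production extractor would likely rely on a tolerant parser and per-theme selectors; the sketch only shows the separation of content from navigation and decorative elements.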
-Error Handling and Fallback Strategies -------------------------------------------------------------------------------- +### Error Handling and Fallback Strategies Robust error handling ensures graceful degradation when content extraction fails: -**Network Error Handling**: Comprehensive handling of connection failures, +**Network Error Handling**: Comprehensive handling of connection failures, timeouts, HTTP errors, and DNS resolution problems with appropriate retry logic. -**Content Format Errors**: Graceful handling of unexpected HTML structure, +**Content Format Errors**: Graceful handling of unexpected HTML structure, malformed markup, JavaScript-dependent content, and theme-specific layout issues. -**Extraction Failures**: Fallback strategies when primary content extraction +**Extraction Failures**: Fallback strategies when primary content extraction fails, including alternative parsing approaches and degraded content extraction modes. -**Quality Assessment**: Content quality validation to detect extraction failures +**Quality Assessment**: Content quality validation to detect extraction failures and provide meaningful error feedback to users and calling systems. -Performance Optimization -------------------------------------------------------------------------------- +### Performance Optimization -Performance optimization strategies address the inherent latency of web content +Performance optimization strategies address the inherent latency of web content retrieval and processing: -**Caching Integration**: Multi-level caching including HTTP response caching, +**Caching Integration**: Multi-level caching including HTTP response caching, processed content caching, and metadata caching with appropriate TTL management. -**Batch Processing**: Efficient batch content extraction for multiple inventory +**Batch Processing**: Efficient batch content extraction for multiple inventory objects with connection reuse, parallel processing, and resource management. 
-**Selective Extraction**: Content extraction optimization based on query requirements, +**Selective Extraction**: Content extraction optimization based on query requirements, including partial content extraction and progressive loading strategies. -**Resource Management**: Memory and network resource management during extended +**Resource Management**: Memory and network resource management during extended content extraction operations with appropriate limits and throttling mechanisms. -Capability Advertisement System -=============================================================================== +## Capability Advertisement System -Processor Capability Specifications -------------------------------------------------------------------------------- +### Processor Capability Specifications -Structure processors advertise their capabilities through comprehensive metadata +Structure processors advertise their capabilities through comprehensive metadata that enables intelligent processor selection and inventory object routing: -.. code-block:: python +```python +class StructureProcessorCapabilities( __.immut.DataclassObject ): + ''' Comprehensive capability advertisement for structure processors. ''' - class StructureProcessorCapabilities( __.immut.DataclassObject ): - ''' Comprehensive capability advertisement for structure processors. 
''' - - supported_inventory_types: frozenset[ str ] - supported_content_formats: frozenset[ str ] - theme_compatibility: __.immut.Dictionary[ str, __.typx.Any ] - extraction_features: frozenset[ str ] - performance_characteristics: __.immut.Dictionary[ str, __.typx.Any ] - operational_constraints: __.immut.Dictionary[ str, __.typx.Any ] + supported_inventory_types: frozenset[ str ] + supported_content_formats: frozenset[ str ] + theme_compatibility: __.immut.Dictionary[ str, __.typx.Any ] + extraction_features: frozenset[ str ] + performance_characteristics: __.immut.Dictionary[ str, __.typx.Any ] + operational_constraints: __.immut.Dictionary[ str, __.typx.Any ] +``` -**Supported Inventory Types**: Clear declaration of compatible inventory object +**Supported Inventory Types**: Clear declaration of compatible inventory object types, enabling precise routing of extraction requests to appropriate processors. -**Content Format Support**: Advertisement of supported content formats including +**Content Format Support**: Advertisement of supported content formats including HTML variants, theme-specific markup, and special content handling capabilities. -**Theme Compatibility**: Detailed theme compatibility information including +**Theme Compatibility**: Detailed theme compatibility information including supported themes, version constraints, and theme-specific optimization features. -**Extraction Features**: Comprehensive feature advertisement including content +**Extraction Features**: Comprehensive feature advertisement including content types, metadata extraction, cross-reference handling, and special processing capabilities. 
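As a rough illustration of how declared capabilities could drive routing, the sketch below uses a frozen standard library dataclass as a reduced stand-in for `StructureProcessorCapabilities`. The class `CapabilitiesSketch`, the `route` helper, and every inventory type or format name are assumptions made for the example, not names taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilitiesSketch:
    """Hypothetical, reduced stand-in for StructureProcessorCapabilities."""

    supported_inventory_types: frozenset[str]
    supported_content_formats: frozenset[str] = frozenset()
    extraction_features: frozenset[str] = frozenset()

    def supports(self, inventory_type: str, content_format: str) -> bool:
        """True only when both the inventory type and the format are declared."""
        return (
            inventory_type in self.supported_inventory_types
            and content_format in self.supported_content_formats
        )


def route(inventory_type, content_format, processors):
    """Returns names of processors whose declared capabilities match."""
    return [
        name
        for name, capabilities in processors.items()
        if capabilities.supports(inventory_type, content_format)
    ]


# Illustrative values only; real type and format names may differ.
PROCESSORS = {
    "sphinx": CapabilitiesSketch(
        supported_inventory_types=frozenset({"sphinx_objects_inv"}),
        supported_content_formats=frozenset({"html"}),
    ),
    "mkdocs": CapabilitiesSketch(
        supported_inventory_types=frozenset(
            {"mkdocs_search_index", "sphinx_objects_inv"}
        ),
        supported_content_formats=frozenset({"html"}),
    ),
}
```

Keeping the capability record immutable means a routing decision cannot be invalidated by a later mutation, which matches the immutable containers used in the declared fields.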
-Inventory-Type Filtering -------------------------------------------------------------------------------- +### Inventory-Type Filtering -Capability-based filtering ensures inventory objects are only processed by +Capability-based filtering ensures inventory objects are only processed by compatible structure processors: -**Compatibility Validation**: Pre-processing validation that inventory objects +**Compatibility Validation**: Pre-processing validation that inventory objects match processor capability requirements before attempting content extraction. -**Graceful Rejection**: Clear error reporting when inventory objects are +**Graceful Rejection**: Clear error reporting when inventory objects are incompatible with processor capabilities, preventing processing failures. -**Multi-Processor Scenarios**: Intelligent processor selection when multiple +**Multi-Processor Scenarios**: Intelligent processor selection when multiple processors support the same inventory type but with different capability profiles. -**Capability Negotiation**: Dynamic capability matching that considers both +**Capability Negotiation**: Dynamic capability matching that considers both inventory object metadata and processor capabilities for optimal pairing. -Confidence Scoring for Content Extraction -------------------------------------------------------------------------------- +### Confidence Scoring for Content Extraction -Structure processors provide confidence scoring for content extraction operations +Structure processors provide confidence scoring for content extraction operations similar to inventory processor detection confidence: -**Extraction Confidence**: Assessment of likely extraction success based on +**Extraction Confidence**: Assessment of likely extraction success based on inventory object metadata, processor capabilities, and historical performance data. 
-**Content Quality Prediction**: Estimation of expected content quality based +**Content Quality Prediction**: Estimation of expected content quality based on documentation source characteristics and processor optimization profiles. -**Resource Requirements**: Confidence scoring includes resource requirement +**Resource Requirements**: Confidence scoring includes resource requirement estimates for capacity planning and operation prioritization. -**Success Probability**: Statistical confidence metrics based on processor +**Success Probability**: Statistical confidence metrics based on processor performance history and inventory object characteristics. -Dynamic Capability Discovery -------------------------------------------------------------------------------- +### Dynamic Capability Discovery The system supports dynamic capability discovery for adaptive processor selection: -**Runtime Capability Assessment**: Dynamic evaluation of processor capabilities +**Runtime Capability Assessment**: Dynamic evaluation of processor capabilities based on current system state, resource availability, and operational constraints. -**Capability Evolution**: Support for capability enhancement over time without +**Capability Evolution**: Support for capability enhancement over time without requiring system reconfiguration or static capability declarations. -**Feature Detection**: Automatic detection of processor features and capabilities +**Feature Detection**: Automatic detection of processor features and capabilities through interface introspection and runtime testing. -**Performance Profiling**: Dynamic performance characteristic assessment based +**Performance Profiling**: Dynamic performance characteristic assessment based on operational history and current system conditions. 
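Feature detection through interface introspection might look like the following sketch. The mapping `OPTIONAL_METHODS` and the method names in it are hypothetical; the design text does not name the optional methods a processor may implement.

```python
# Hypothetical mapping from optional method names to advertised feature names.
OPTIONAL_METHODS = {
    "extract_tables": "tables",
    "extract_code_blocks": "code-blocks",
    "resolve_cross_references": "cross-references",
}


def discover_features(processor: object) -> frozenset[str]:
    """Advertises a feature for every optional method that is callable."""
    return frozenset(
        feature
        for method, feature in OPTIONAL_METHODS.items()
        if callable(getattr(processor, method, None))
    )


class MinimalProcessor:
    """Example processor implementing only one optional method."""

    def extract_code_blocks(self, html: str) -> list[str]:
        return []
```

Introspection of this kind only shows that a method exists; the runtime testing mentioned above would still be needed to confirm that the feature works for a given source.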
-Integration with Detection System -------------------------------------------------------------------------------- +### Integration with Detection System -Structure processor capabilities integrate seamlessly with the broader detection +Structure processor capabilities integrate seamlessly with the broader detection and processor selection system: -**Unified Selection Logic**: Consistent processor selection algorithms that -consider both inventory and structure processor capabilities for end-to-end +**Unified Selection Logic**: Consistent processor selection algorithms that +consider both inventory and structure processor capabilities for end-to-end operation planning. -**Cache Integration**: Capability information caching with appropriate invalidation +**Cache Integration**: Capability information caching with appropriate invalidation strategies to optimize repeated processor selection operations. -**Error Propagation**: Structured error handling that provides clear feedback +**Error Propagation**: Structured error handling that provides clear feedback about capability mismatches and processor selection failures. -**Monitoring Integration**: Capability-based monitoring and alerting for processor +**Monitoring Integration**: Capability-based monitoring and alerting for processor availability, performance degradation, and operational issues. -Base Interfaces and Protocols -=============================================================================== +## Base Interfaces and Protocols -StructureDetection Abstract Base Class -------------------------------------------------------------------------------- +### StructureDetection Abstract Base Class -The ``StructureDetection`` abstract base class provides the foundation for all +The `StructureDetection` abstract base class provides the foundation for all structure processor implementations: -.. 
code-block:: python - - class StructureDetection( Detection ): - ''' Base class for structure processor detection and capability advertisement. ''' - - @property - @__.typx.abc.abstractmethod - def processor_class( self ) -> type[ StructureProcessor ]: - ''' Returns the structure processor class for this detection result. ''' - - @__.typx.abc.abstractmethod - async def get_capabilities( - self, - auxdata: __.state.Globals - ) -> StructureProcessorCapabilities: - ''' Returns comprehensive processor capability information. ''' - - @__.typx.abc.abstractmethod - async def extract_contents_typed( - self, - inventory_objects: __.cabc.Sequence[ InventoryObject ], - base_url: str, /, *, - auxdata: __.state.Globals, - filters: __.cabc.Mapping[ str, __.typx.Any ] = __.immut.Dictionary( ), - lines_max: __.typx.Optional[ int ] = None, - ) -> __.cabc.Sequence[ ContentDocument ]: - ''' Extracts content from inventory objects with full type safety. ''' - -extract_contents_typed Interface -------------------------------------------------------------------------------- - -The ``extract_contents_typed`` method provides the primary content extraction +```python +class StructureDetection( Detection ): + ''' Base class for structure processor detection and capability advertisement. ''' + + @property + @__.typx.abc.abstractmethod + def processor_class( self ) -> type[ StructureProcessor ]: + ''' Returns the structure processor class for this detection result. ''' + + @__.typx.abc.abstractmethod + async def get_capabilities( + self, + auxdata: __.state.Globals + ) -> StructureProcessorCapabilities: + ''' Returns comprehensive processor capability information. 
''' + + @__.typx.abc.abstractmethod + async def extract_contents_typed( + self, + inventory_objects: __.cabc.Sequence[ InventoryObject ], + base_url: str, /, *, + auxdata: __.state.Globals, + filters: __.cabc.Mapping[ str, __.typx.Any ] = __.immut.Dictionary( ), + lines_max: __.typx.Optional[ int ] = None, + ) -> __.cabc.Sequence[ ContentDocument ]: + ''' Extracts content from inventory objects with full type safety. ''' +``` + +### extract_contents_typed Interface + +The `extract_contents_typed` method provides the primary content extraction interface with comprehensive parameter support: **Parameter Specifications**: -- ``inventory_objects``: Sequence of inventory objects for content extraction -- ``base_url``: Documentation base URL for context-aware URL construction -- ``auxdata``: System global state for caching and configuration access -- ``filters``: Optional filtering parameters for selective content extraction -- ``lines_max``: Optional limit on content length for truncation control +- `inventory_objects`: Sequence of inventory objects for content extraction +- `base_url`: Documentation base URL for context-aware URL construction +- `auxdata`: System global state for caching and configuration access +- `filters`: Optional filtering parameters for selective content extraction +- `lines_max`: Optional limit on content length for truncation control -**Return Value**: Sequence of ``ContentDocument`` instances with extracted +**Return Value**: Sequence of `ContentDocument` instances with extracted content, metadata, and attribution information. -**Error Handling**: Method implementations handle extraction failures gracefully +**Error Handling**: Method implementations handle extraction failures gracefully with detailed error reporting and partial success capabilities. 
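The partial success behavior and the `lines_max` truncation can be sketched as follows. This is a simplified illustration under stated assumptions: `extract_with_partial_success` and its `fetch` callback are hypothetical, locations are plain strings rather than `InventoryObject` instances, and results are plain text rather than `ContentDocument` instances.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Optional


async def extract_with_partial_success(
    locations: Sequence[str],
    fetch: Callable[[str], Awaitable[str]],
    lines_max: Optional[int] = None,
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Fetches all locations concurrently, separating successes from failures."""
    results = await asyncio.gather(
        *(fetch(location) for location in locations),
        return_exceptions=True,  # one failure must not cancel the whole batch
    )
    documents: dict[str, str] = {}
    failures: dict[str, Exception] = {}
    for location, result in zip(locations, results):
        if isinstance(result, Exception):
            failures[location] = result
            continue
        lines = result.splitlines()
        if lines_max is not None:
            lines = lines[:lines_max]  # truncate to the requested line count
        documents[location] = "\n".join(lines)
    return documents, failures
```

Returning failures alongside documents lets a caller report which objects could not be extracted without discarding the content that was retrieved.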
-get_capabilities Method Specifications -------------------------------------------------------------------------------- - -The ``get_capabilities`` method provides dynamic capability advertisement: +### get_capabilities Method Specifications -.. code-block:: python +The `get_capabilities` method provides dynamic capability advertisement: - async def get_capabilities( - self, - auxdata: __.state.Globals - ) -> StructureProcessorCapabilities: - ''' Returns comprehensive processor capability information. ''' +```python +async def get_capabilities( + self, + auxdata: __.state.Globals +) -> StructureProcessorCapabilities: + ''' Returns comprehensive processor capability information. ''' +``` -**Dynamic Assessment**: Capability information may be assessed dynamically +**Dynamic Assessment**: Capability information may be assessed dynamically based on system state, configuration, and runtime conditions. -**Comprehensive Coverage**: Returned capabilities include all information +**Comprehensive Coverage**: Returned capabilities include all information necessary for processor selection, inventory object routing, and operation planning. -**Caching Considerations**: Implementations may cache capability information +**Caching Considerations**: Implementations may cache capability information when assessment is expensive, with appropriate invalidation strategies. -URL Construction Abstractions -------------------------------------------------------------------------------- +### URL Construction Abstractions -Structure processors implement URL construction through standardized abstractions +Structure processors implement URL construction through standardized abstractions that handle diverse documentation site organizations: -.. code-block:: python - - class URLConstructor: - ''' Abstract URL construction interface for structure processors. 
''' - - @__.typx.abc.abstractmethod - def construct_content_url( - self, - inventory_object: InventoryObject, - base_url: str, /, *, - context: __.typx.Optional[ __.cabc.Mapping[ str, __.typx.Any ] ] = None - ) -> str: - ''' Constructs content URL from inventory object and base URL. ''' - -**Context-Aware Construction**: URL construction considers documentation site +```python +class URLConstructor: + ''' Abstract URL construction interface for structure processors. ''' + + @__.typx.abc.abstractmethod + def construct_content_url( + self, + inventory_object: InventoryObject, + base_url: str, /, *, + context: __.typx.Optional[ __.cabc.Mapping[ str, __.typx.Any ] ] = None + ) -> str: + ''' Constructs content URL from inventory object and base URL. ''' +``` + +**Context-Aware Construction**: URL construction considers documentation site context including version information, language variants, and theme-specific patterns. -**Fallback Support**: URL construction abstractions support fallback strategies +**Fallback Support**: URL construction abstractions support fallback strategies when primary URL patterns fail or produce invalid results. -**Validation Integration**: URL construction includes validation to detect and +**Validation Integration**: URL construction includes validation to detect and report construction failures before attempting content retrieval. -Content Document Creation Patterns -------------------------------------------------------------------------------- +### Content Document Creation Patterns -Structure processors create ``ContentDocument`` instances through consistent +Structure processors create `ContentDocument` instances through consistent patterns that preserve content structure and metadata: -.. code-block:: python +```python +class ContentDocument( __.immut.DataclassObject ): + ''' Structured document from content extraction. ''' - class ContentDocument( __.immut.DataclassObject ): - ''' Structured document from content extraction. 
''' - - title: str - content: str - url: str - inventory_object: InventoryObject - metadata: __.immut.Dictionary[ str, __.typx.Any ] - content_id: str + title: str + content: str + url: str + inventory_object: InventoryObject + metadata: __.immut.Dictionary[ str, __.typx.Any ] + content_id: str +``` -**Metadata Preservation**: Content extraction preserves relevant metadata from +**Metadata Preservation**: Content extraction preserves relevant metadata from both the original HTML content and the associated inventory object. -**Attribution Tracking**: Created documents maintain complete attribution including +**Attribution Tracking**: Created documents maintain complete attribution including source URLs, inventory object references, and extraction processor information. -**Content Structuring**: Extracted content is structured to preserve semantic +**Content Structuring**: Extracted content is structured to preserve semantic information including headings, code blocks, and cross-references. -Detection and Processor Selection -=============================================================================== +## Detection and Processor Selection -Structure Processor Detection Patterns -------------------------------------------------------------------------------- +### Structure Processor Detection Patterns -Structure processor detection follows consistent patterns that enable reliable +Structure processor detection follows consistent patterns that enable reliable processor selection and capability assessment: -**Capability-Based Detection**: Detection primarily focuses on processor capabilities -rather than documentation source probing, since structure processors work with +**Capability-Based Detection**: Detection primarily focuses on processor capabilities +rather than documentation source probing, since structure processors work with inventory objects rather than direct source analysis. 
-**Inventory Type Matching**: Detection validates processor compatibility with +**Inventory Type Matching**: Detection validates processor compatibility with specific inventory object types before selection for content extraction operations. -**Performance Profiling**: Detection includes performance characteristic assessment +**Performance Profiling**: Detection includes performance characteristic assessment to enable optimal processor selection for specific operational requirements. -**Resource Requirements**: Detection evaluates processor resource requirements +**Resource Requirements**: Detection evaluates processor resource requirements including memory usage, network bandwidth, and processing time expectations. -Detection Confidence Methodology -------------------------------------------------------------------------------- +### Detection Confidence Methodology -Structure processor detection confidence reflects the likelihood of successful +Structure processor detection confidence reflects the likelihood of successful content extraction operations: -**Capability Match Confidence**: Primary confidence factor based on processor +**Capability Match Confidence**: Primary confidence factor based on processor capability alignment with inventory object characteristics and extraction requirements. -**Historical Performance**: Confidence assessment incorporates historical +**Historical Performance**: Confidence assessment incorporates historical performance data for similar inventory objects and extraction patterns. -**Resource Availability**: Confidence scoring considers current system resource +**Resource Availability**: Confidence scoring considers current system resource availability and processor resource requirements. -**Content Accessibility**: Confidence includes assessment of content accessibility +**Content Accessibility**: Confidence includes assessment of content accessibility based on base URL validation and network connectivity testing. 
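One possible way to combine the four confidence factors is a weighted mean, sketched below. The weights are purely hypothetical, since the design text does not say how the factors combine. The 0.5 threshold and the registration-order tiebreak are assumed to carry over from the detection system design.

```python
from typing import Optional

CONFIDENCE_THRESHOLD_MINIMUM = 0.5  # assumed to match the detection system

# Hypothetical weights; the source does not specify how factors combine.
WEIGHTS = {
    "capability_match": 0.4,
    "historical_performance": 0.3,
    "resource_availability": 0.2,
    "content_accessibility": 0.1,
}


def combine_confidence(factors: dict[str, float]) -> float:
    """Weighted mean of factor scores, each clamped to the range 0.0-1.0."""
    total = sum(
        weight * min(1.0, max(0.0, factors.get(name, 0.0)))
        for name, weight in WEIGHTS.items()
    )
    return total / sum(WEIGHTS.values())


def select_processor(candidates: list[tuple[str, float]]) -> Optional[str]:
    """Picks the highest score above the threshold.

    Candidates are given in registration order; max() returns the first
    maximal item, so earlier registration wins ties.
    """
    eligible = [item for item in candidates if item[1] > CONFIDENCE_THRESHOLD_MINIMUM]
    if not eligible:
        return None
    return max(eligible, key=lambda item: item[1])[0]
```

Missing factors default to 0.0 in this sketch, which makes an unassessed processor less likely to be selected; a different default would be an equally valid design choice.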
-Processor Selection Algorithms -------------------------------------------------------------------------------- +### Processor Selection Algorithms Processor selection algorithms optimize extraction quality and system performance: -**Multi-Criteria Selection**: Selection considers capability matching, performance +**Multi-Criteria Selection**: Selection considers capability matching, performance characteristics, resource requirements, and historical success rates. -**Load Balancing**: Selection algorithms may distribute load across multiple +**Load Balancing**: Selection algorithms may distribute load across multiple compatible processors for improved system performance and reliability. -**Fallback Chains**: Processor selection includes fallback processor identification +**Fallback Chains**: Processor selection includes fallback processor identification for graceful degradation when primary processors fail. -**Context-Aware Selection**: Selection considers extraction context including +**Context-Aware Selection**: Selection considers extraction context including batch size, urgency requirements, and quality expectations. -Cache Integration Strategy -------------------------------------------------------------------------------- +### Cache Integration Strategy Structure processors integrate with system caching for improved performance: -**Detection Result Caching**: Processor detection and capability information +**Detection Result Caching**: Processor detection and capability information cached to optimize repeated selection operations. -**Content Caching**: Extracted content cached at appropriate granularity levels +**Content Caching**: Extracted content cached at appropriate granularity levels including full documents with content identification, and processed metadata. 
-**URL Construction Caching**: URL construction results cached to avoid repeated +**URL Construction Caching**: URL construction results cached to avoid repeated computation for similar inventory objects. -**Performance Metric Caching**: Processor performance characteristics cached +**Performance Metric Caching**: Processor performance characteristics cached to support selection algorithms without expensive runtime assessment. -Error Propagation Patterns -------------------------------------------------------------------------------- +### Error Propagation Patterns Comprehensive error propagation ensures clear feedback about processing failures: -**Error Classification**: Structured error classification including network errors, +**Error Classification**: Structured error classification including network errors, content format errors, processor capability errors, and system resource errors. -**Context Preservation**: Error reporting preserves complete context including +**Context Preservation**: Error reporting preserves complete context including inventory object information, processor selection rationale, and operational parameters. -**Recovery Suggestions**: Error responses include actionable recovery suggestions +**Recovery Suggestions**: Error responses include actionable recovery suggestions including alternative processors, parameter adjustments, and retry strategies. -**Escalation Patterns**: Error escalation follows consistent patterns from +**Escalation Patterns**: Error escalation follows consistent patterns from processor-specific errors through system-level error handling to user notification. 
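The error classification, context preservation, and recovery suggestions described above could be modeled roughly as in this sketch. The names `ErrorCategory`, `ExtractionError`, and `classify` are hypothetical, as is the mapping from built-in exception types to categories.

```python
from enum import Enum


class ErrorCategory(Enum):
    NETWORK = "network"
    CONTENT_FORMAT = "content-format"
    PROCESSOR_CAPABILITY = "processor-capability"
    SYSTEM_RESOURCE = "system-resource"


class ExtractionError(Exception):
    """Hypothetical error type carrying category, context, and suggestions."""

    def __init__(self, message, category, context=None, suggestions=()):
        super().__init__(message)
        self.category = category
        self.context = dict(context or {})
        self.suggestions = tuple(suggestions)


# Assumed mapping from low-level exception types to categories and advice.
_RULES = (
    ((ConnectionError, TimeoutError), ErrorCategory.NETWORK, "retry with backoff"),
    ((ValueError,), ErrorCategory.CONTENT_FORMAT, "try a fallback parser"),
    ((LookupError,), ErrorCategory.PROCESSOR_CAPABILITY, "select an alternative processor"),
    ((MemoryError,), ErrorCategory.SYSTEM_RESOURCE, "reduce the batch size"),
)


def classify(error: Exception, location: str) -> ExtractionError:
    """Wraps a low-level exception; unrecognized errors are re-raised as-is."""
    for types, category, suggestion in _RULES:
        if isinstance(error, types):
            wrapped = ExtractionError(
                str(error), category, {"location": location}, (suggestion,)
            )
            wrapped.__cause__ = error  # keep the original error for escalation
            return wrapped
    raise error
```

Attaching the original exception as the cause preserves the processor-specific detail while the wrapper carries the category and suggestions upward to system-level handling.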
-Content Extraction Patterns -=============================================================================== +## Content Extraction Patterns -HTML Content Retrieval -------------------------------------------------------------------------------- +### HTML Content Retrieval Content retrieval implements robust patterns for accessing documentation content: -**HTTP Client Management**: Efficient HTTP client management with connection +**HTTP Client Management**: Efficient HTTP client management with connection pooling, timeout configuration, and retry logic for reliable content access. -**Authentication Support**: Support for documentation sites requiring authentication +**Authentication Support**: Support for documentation sites requiring authentication including basic authentication, token-based access, and session management. -**Content Negotiation**: HTTP content negotiation to request optimal content +**Content Negotiation**: HTTP content negotiation to request optimal content formats and encoding for efficient processing. -**Caching Integration**: Integration with HTTP caching mechanisms including +**Caching Integration**: Integration with HTTP caching mechanisms including ETag support, conditional requests, and cache validation. -Markdown Conversion Strategies -------------------------------------------------------------------------------- +### Markdown Conversion Strategies HTML-to-Markdown conversion preserves content structure while creating searchable text: -**Semantic Preservation**: Conversion strategies that preserve semantic HTML +**Semantic Preservation**: Conversion strategies that preserve semantic HTML elements including headings, lists, code blocks, and emphasis. -**Cross-Reference Handling**: Intelligent handling of internal links, cross-references, +**Cross-Reference Handling**: Intelligent handling of internal links, cross-references, and documentation navigation elements during conversion. 
-**Code Block Processing**: Specialized processing of code blocks including +**Code Block Processing**: Specialized processing of code blocks including syntax highlighting preservation and language identification. -**Table Conversion**: Robust table conversion that preserves tabular data structure +**Table Conversion**: Robust table conversion that preserves tabular data structure in Markdown format while handling complex table layouts. -Content Identification System -------------------------------------------------------------------------------- +### Content Identification System Content identification provides stable reference mechanisms for documentation content: @@ -563,153 +513,143 @@ with no requirement for session storage or server-side state management. **Base64 Encoding**: Content IDs use base64 encoding of location and object name combinations for human-debuggable yet compact identifier representation. -Cross-Reference Handling -------------------------------------------------------------------------------- +### Cross-Reference Handling Cross-reference processing preserves documentation navigation and linking: -**Link Resolution**: Resolution of relative links, fragment identifiers, and +**Link Resolution**: Resolution of relative links, fragment identifiers, and cross-documentation references to maintain content connectivity. -**Reference Validation**: Validation of cross-references to identify broken +**Reference Validation**: Validation of cross-references to identify broken links, missing content, and navigation issues. -**Context Preservation**: Preservation of link context including anchor text, +**Context Preservation**: Preservation of link context including anchor text, surrounding content, and semantic relationship information. -**Multi-Document Coordination**: Cross-reference handling across multiple +**Multi-Document Coordination**: Cross-reference handling across multiple documents and documentation sources for comprehensive link resolution. 
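Link resolution for relative links and fragment identifiers can be sketched with the standard library `urllib.parse` module. The helper `resolve_link` and the example URLs are hypothetical; treating an equal network location as "same documentation site" is a simplifying assumption.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_link(page_url: str, href: str) -> tuple[str, str, bool]:
    """Returns (absolute URL without fragment, fragment, same-site flag)."""
    # Resolve the href against the page URL, then split off the fragment.
    absolute, fragment = urldefrag(urljoin(page_url, href))
    # Assumption: links on the same host belong to the same documentation site.
    same_site = urlsplit(absolute).netloc == urlsplit(page_url).netloc
    return absolute, fragment, same_site
```

Separating the fragment from the URL lets a cache key on the page alone while the fragment still identifies the referenced section within it.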
-Theme-Specific Adaptations -------------------------------------------------------------------------------- +### Theme-Specific Adaptations Content extraction adapts to diverse documentation themes and layouts: -**Theme Recognition**: Automatic recognition of documentation themes including +**Theme Recognition**: Automatic recognition of documentation themes including Sphinx themes, MkDocs themes, and custom documentation layouts. -**Layout Adaptation**: Extraction strategies adapted to theme-specific layouts +**Layout Adaptation**: Extraction strategies adapted to theme-specific layouts including content area identification, navigation extraction, and sidebar handling. -**CSS-Based Extraction**: Theme-aware CSS selector strategies for precise +**CSS-Based Extraction**: Theme-aware CSS selector strategies for precise content identification and extraction optimization. -**Fallback Strategies**: Generic extraction strategies for unknown or unsupported +**Fallback Strategies**: Generic extraction strategies for unknown or unsupported themes with graceful degradation and quality assessment. -Implementation Outline -=============================================================================== +## Implementation Outline -HTML Processing and Content Extraction Patterns -------------------------------------------------------------------------------- +### HTML Processing and Content Extraction Patterns -Implementation patterns for robust HTML content processing across diverse +Implementation patterns for robust HTML content processing across diverse documentation sources: -**Parser Selection**: Choice of HTML parsing libraries and strategies based +**Parser Selection**: Choice of HTML parsing libraries and strategies based on performance requirements, error tolerance, and feature support needs. 
-**Content Area Identification**: Algorithms for identifying main content areas +**Content Area Identification**: Algorithms for identifying main content areas within documentation pages while excluding navigation, advertisements, and decorative elements. -**Markup Sanitization**: Content sanitization strategies that preserve essential +**Markup Sanitization**: Content sanitization strategies that preserve essential formatting while removing theme-specific artifacts and potentially problematic markup. -**Error Recovery**: Robust error recovery during HTML processing including +**Error Recovery**: Robust error recovery during HTML processing including malformed markup handling, encoding issues, and incomplete content scenarios. -Theme-Specific Adaptation Strategies -------------------------------------------------------------------------------- +### Theme-Specific Adaptation Strategies Adaptation approaches for optimizing content extraction across documentation themes: -**Theme Detection**: Strategies for automatically detecting documentation themes +**Theme Detection**: Strategies for automatically detecting documentation themes through CSS analysis, markup patterns, and meta information examination. -**Configuration Management**: Theme-specific configuration management including +**Configuration Management**: Theme-specific configuration management including CSS selectors, extraction rules, and processing parameters. -**Optimization Profiles**: Performance optimization profiles tailored to specific +**Optimization Profiles**: Performance optimization profiles tailored to specific themes including extraction shortcuts, caching strategies, and resource usage patterns. -**Extension Points**: Clear extension mechanisms for adding support for new +**Extension Points**: Clear extension mechanisms for adding support for new themes without modifying core extraction logic. 
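One way to picture theme detection, per-theme configuration, and the extension point together is a small registry of theme profiles. The sketch below is hypothetical throughout: `ThemeProfile`, `register_theme`, and `detect_theme` are invented names, and the marker strings and CSS selectors are illustrative placeholders that have not been verified against the real themes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThemeProfile:
    name: str
    markers: tuple[str, ...]  # substrings whose presence suggests the theme
    content_selector: str     # CSS selector for the main content area


# Generic fallback used when no registered theme matches.
GENERIC = ThemeProfile("generic", (), "main, article, body")

_PROFILES: list[ThemeProfile] = []


def register_theme(profile: ThemeProfile) -> None:
    """Extension point: new themes are added without touching detect_theme."""
    _PROFILES.append(profile)


def detect_theme(html: str) -> ThemeProfile:
    """Return the profile with the most matching markers, else GENERIC."""
    best, best_hits = GENERIC, 0
    for profile in _PROFILES:
        hits = sum(marker in html for marker in profile.markers)
        if hits > best_hits:
            best, best_hits = profile, hits
    return best


# Illustrative registrations; markers and selectors are assumptions.
register_theme(ThemeProfile("furo", ("furo.css",), "article[role=main]"))
register_theme(ThemeProfile("material", ("md-content",), "article.md-content__inner"))
```

Substring matching keeps the sketch dependency-free. A real detector would more likely inspect parsed markup and stylesheet links, as the text suggests.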
-URI Construction and Resolution Approaches -------------------------------------------------------------------------------- +### URI Construction and Resolution Approaches URI handling strategies that accommodate diverse documentation site organizations: -**Base URL Handling**: Robust base URL resolution including subdirectory installations, +**Base URL Handling**: Robust base URL resolution including subdirectory installations, CDN distributions, and proxy configurations. -**Path Resolution**: Intelligent path resolution that handles relative URLs, +**Path Resolution**: Intelligent path resolution that handles relative URLs, absolute URLs, fragment identifiers, and query parameters. -**Validation Strategies**: URI validation approaches including accessibility +**Validation Strategies**: URI validation approaches including accessibility testing, format validation, and broken link detection. -**Fallback Mechanisms**: Alternative URI construction strategies when primary +**Fallback Mechanisms**: Alternative URI construction strategies when primary approaches fail or produce invalid results. -Content Conversion and Formatting Methodologies -------------------------------------------------------------------------------- +### Content Conversion and Formatting Methodologies Content processing approaches that preserve structure while creating searchable text: -**Conversion Pipelines**: Multi-stage conversion pipelines from HTML to structured +**Conversion Pipelines**: Multi-stage conversion pipelines from HTML to structured content including cleaning, transformation, and validation stages. -**Format Preservation**: Strategies for preserving essential content formatting +**Format Preservation**: Strategies for preserving essential content formatting including code blocks, tables, lists, and emphasis while removing presentation artifacts. 
-**Metadata Extraction**: Systematic metadata extraction including page titles, +**Metadata Extraction**: Systematic metadata extraction including page titles, headings, cross-references, and semantic markup preservation. -**Quality Assessment**: Content quality assessment metrics including completeness, +**Quality Assessment**: Content quality assessment metrics including completeness, structure preservation, and extraction accuracy measurement. -Cross-Reference Handling Patterns -------------------------------------------------------------------------------- +### Cross-Reference Handling Patterns Cross-reference processing strategies that maintain documentation connectivity: -**Link Discovery**: Comprehensive link discovery including explicit links, +**Link Discovery**: Comprehensive link discovery including explicit links, implicit references, and theme-specific navigation patterns. -**Resolution Algorithms**: Link resolution algorithms that handle relative paths, +**Resolution Algorithms**: Link resolution algorithms that handle relative paths, base URL considerations, and multi-document reference scenarios. -**Validation Approaches**: Cross-reference validation including accessibility +**Validation Approaches**: Cross-reference validation including accessibility testing, target existence verification, and circular reference detection. -**Context Preservation**: Maintenance of link context including anchor text, +**Context Preservation**: Maintenance of link context including anchor text, surrounding content, and semantic relationship information. 
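The link resolution step (relative paths, base URL handling, fragment identifiers) can be illustrated with the standard library alone. `resolve_link` is a hypothetical helper name. The sketch covers only resolution, not validation or circular reference detection.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(base_url: str, href: str) -> tuple[str, str]:
    """Resolve href against base_url.

    Returns (absolute URL without fragment, fragment). Keeping the fragment
    separate lets the caller fetch the page once and then locate the anchor.
    """
    absolute = urljoin(base_url, href)
    url, fragment = urldefrag(absolute)
    return url, fragment
```

For instance, resolving `../guide.html#install` against `https://docs.example.org/en/latest/api/index.html` gives the page `https://docs.example.org/en/latest/guide.html` and the fragment `install`.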
-Performance Optimization Techniques -------------------------------------------------------------------------------- +### Performance Optimization Techniques Performance optimization strategies for efficient content extraction operations: -**Caching Hierarchies**: Multi-level caching including HTTP response caching, +**Caching Hierarchies**: Multi-level caching including HTTP response caching, processed content caching, and metadata caching with appropriate TTL management. -**Parallel Processing**: Efficient parallel processing patterns for batch +**Parallel Processing**: Efficient parallel processing patterns for batch content extraction with resource management and error handling. -**Resource Management**: Memory and network resource management during extended +**Resource Management**: Memory and network resource management during extended operations including connection pooling, memory limits, and garbage collection optimization. -**Selective Processing**: Content extraction optimization based on requirements +**Selective Processing**: Content extraction optimization based on requirements including partial extraction, progressive loading, and priority-based processing. 
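A single level of the caching hierarchy with TTL management might look like the sketch below. `TtlCache` and its `get(key, loader)` interface are assumptions made for illustration. The clock is injectable so that expiry can be tested without sleeping.

```python
import time
from typing import Callable, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class TtlCache(Generic[K, V]):
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: dict[K, tuple[float, V]] = {}

    def get(self, key: K, loader: Callable[[], V]) -> V:
        """Return the cached value, calling loader on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value
```

Separate instances with different TTLs could back the HTTP response, processed content, and metadata levels. Eviction by size is left out here.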
-Example Implementation Skeletons -------------------------------------------------------------------------------- +### Example Implementation Skeletons **Sphinx Processor Outline**: - HTML document processing with Sphinx theme recognition and layout adaptation - Theme support patterns including ReadTheDocs, Alabaster, and custom themes -- Cross-reference resolution for Sphinx domains and role references +- Cross-reference resolution for Sphinx domains and role references - Code block handling with syntax highlighting and language detection - Search result optimization for Sphinx documentation structure @@ -720,168 +660,157 @@ Example Implementation Skeletons - Content type detection for mixed documentation formats - Theme-specific optimization for popular MkDocs themes -Integration with Inventory Processors -=============================================================================== +## Integration with Inventory Processors -Inventory Object Filtering by Capabilities -------------------------------------------------------------------------------- +### Inventory Object Filtering by Capabilities -Structure processors implement capability-based filtering to ensure inventory +Structure processors implement capability-based filtering to ensure inventory object compatibility: -**Pre-Processing Validation**: Validation of inventory objects against processor +**Pre-Processing Validation**: Validation of inventory objects against processor capabilities before attempting content extraction operations. -**Compatibility Matrices**: Systematic compatibility assessment between inventory +**Compatibility Matrices**: Systematic compatibility assessment between inventory object types and processor capabilities for optimal pairing. -**Graceful Rejection**: Clear error reporting when inventory objects are +**Graceful Rejection**: Clear error reporting when inventory objects are incompatible with processor capabilities, preventing extraction failures. 
-**Multi-Processor Coordination**: Coordination strategies when multiple processors +**Multi-Processor Coordination**: Coordination strategies when multiple processors can handle the same inventory objects but with different capability profiles. -Content Extraction Coordination -------------------------------------------------------------------------------- +### Content Extraction Coordination Coordination patterns between inventory and structure processors for seamless operation: -**URL Construction Coordination**: Collaboration between inventory object metadata +**URL Construction Coordination**: Collaboration between inventory object metadata and structure processor URL construction logic for accurate content addressing. -**Metadata Integration**: Integration of inventory object metadata with extracted +**Metadata Integration**: Integration of inventory object metadata with extracted content metadata for comprehensive document attribution and context. -**Error Propagation**: Coordinated error handling that provides clear feedback +**Error Propagation**: Coordinated error handling that provides clear feedback about the relationship between inventory object issues and content extraction failures. -**Performance Optimization**: Joint optimization strategies that consider both +**Performance Optimization**: Joint optimization strategies that consider both inventory processing and content extraction for efficient end-to-end operations. -Error Handling When No Compatible Objects -------------------------------------------------------------------------------- +### Error Handling When No Compatible Objects Robust error handling for scenarios where inventory objects cannot be processed: -**Compatibility Assessment**: Clear assessment and reporting of compatibility +**Compatibility Assessment**: Clear assessment and reporting of compatibility issues between inventory objects and available structure processors. 
-**Alternative Strategies**: Identification of alternative processing approaches +**Alternative Strategies**: Identification of alternative processing approaches including processor capability enhancement or inventory object modification. -**User Feedback**: Comprehensive user feedback about compatibility issues including +**User Feedback**: Comprehensive user feedback about compatibility issues including actionable suggestions for resolution. -**System Monitoring**: Integration with system monitoring for tracking compatibility +**System Monitoring**: Integration with system monitoring for tracking compatibility issues and processor capability gaps. -Multi-Inventory Format Handling -------------------------------------------------------------------------------- +### Multi-Inventory Format Handling Support for processing inventory objects from diverse inventory formats: -**Format-Agnostic Processing**: Structure processor implementations that handle +**Format-Agnostic Processing**: Structure processor implementations that handle inventory objects uniformly regardless of their originating inventory format. -**Format-Specific Optimizations**: Optimization strategies that leverage format-specific +**Format-Specific Optimizations**: Optimization strategies that leverage format-specific metadata while maintaining universal inventory object interface compatibility. -**Cross-Format Coordination**: Coordination strategies when processing inventory +**Cross-Format Coordination**: Coordination strategies when processing inventory objects from multiple different inventory formats in batch operations. -**Quality Consistency**: Maintenance of consistent content extraction quality +**Quality Consistency**: Maintenance of consistent content extraction quality across different inventory formats through standardized processing approaches. 
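The capability-based filtering and graceful rejection described earlier in this section can be sketched as a simple partition. All type and field names here are hypothetical, and the design text does not say how capabilities are actually represented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class InventoryObject:
    name: str
    inventory_type: str  # hypothetical label for the originating format


@dataclass(frozen=True)
class ProcessorCapabilities:
    processor_name: str
    supported_inventory_types: frozenset[str]


def partition_by_capability(
    objects: Iterable[InventoryObject],
    capabilities: ProcessorCapabilities,
) -> tuple[list[InventoryObject], list[InventoryObject]]:
    """Split objects into (compatible, rejected) before any extraction runs."""
    compatible: list[InventoryObject] = []
    rejected: list[InventoryObject] = []
    for obj in objects:
        if obj.inventory_type in capabilities.supported_inventory_types:
            compatible.append(obj)
        else:
            rejected.append(obj)
    return compatible, rejected
```

Returning the rejected objects, rather than silently dropping them, gives the caller what it needs to report incompatibilities clearly.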
-Performance Considerations -------------------------------------------------------------------------------- +### Performance Considerations Performance optimization strategies for inventory and structure processor integration: -**Batch Processing Optimization**: Efficient batch processing that optimizes +**Batch Processing Optimization**: Efficient batch processing that optimizes both inventory object handling and content extraction operations. -**Resource Sharing**: Shared resource utilization including HTTP connections, +**Resource Sharing**: Shared resource utilization including HTTP connections, caching infrastructure, and processing threads. -**Load Distribution**: Load distribution strategies that balance processing +**Load Distribution**: Load distribution strategies that balance processing across available structure processors based on inventory object characteristics. -**Monitoring Integration**: Performance monitoring that tracks end-to-end +**Monitoring Integration**: Performance monitoring that tracks end-to-end operation performance including both inventory processing and content extraction phases. -Extension Points and Future Processors -=============================================================================== +## Extension Points and Future Processors -Custom Structure Processor Development -------------------------------------------------------------------------------- +### Custom Structure Processor Development Clear development patterns support custom structure processor creation: -**Development Guidelines**: Comprehensive documentation of interface requirements, +**Development Guidelines**: Comprehensive documentation of interface requirements, capability advertisement patterns, and integration expectations. -**Testing Frameworks**: Standardized testing patterns including capability +**Testing Frameworks**: Standardized testing patterns including capability validation, content extraction verification, and performance assessment. 
-**Reference Implementations**: Well-documented reference processors that demonstrate +**Reference Implementations**: Well-documented reference processors that demonstrate implementation patterns, error handling, and optimization strategies. -**Plugin Architecture**: Plugin management systems that support dynamic processor +**Plugin Architecture**: Plugin management systems that support dynamic processor registration, discovery, and lifecycle management. -Theme-Specific Optimizations -------------------------------------------------------------------------------- +### Theme-Specific Optimizations Extension points for theme-specific processor optimizations: -**Theme Detection APIs**: Standardized interfaces for theme detection and +**Theme Detection APIs**: Standardized interfaces for theme detection and capability assessment that support custom theme recognition logic. -**Configuration Extension**: Configuration management systems that support +**Configuration Extension**: Configuration management systems that support theme-specific parameters, extraction rules, and optimization profiles. -**CSS Selector Management**: Extensible CSS selector management for theme-specific +**CSS Selector Management**: Extensible CSS selector management for theme-specific content identification and extraction optimization. -**Performance Profiling**: Theme-specific performance profiling and optimization +**Performance Profiling**: Theme-specific performance profiling and optimization measurement for continuous improvement. -Content Extraction Feature Evolution -------------------------------------------------------------------------------- +### Content Extraction Feature Evolution System design accommodates processor feature enhancement over time: -**Capability Versioning**: Version management for processor capabilities enabling +**Capability Versioning**: Version management for processor capabilities enabling gradual system enhancement and feature adoption. 
-**Feature Negotiation**: Dynamic feature negotiation between system components +**Feature Negotiation**: Dynamic feature negotiation between system components based on advertised processor capabilities and requirements. -**Backward Compatibility**: Interface evolution strategies that maintain compatibility +**Backward Compatibility**: Interface evolution strategies that maintain compatibility with existing processors while enabling enhanced functionality. -**Extension APIs**: Clear extension APIs for adding new content extraction +**Extension APIs**: Clear extension APIs for adding new content extraction features without requiring core system modifications. -URL Construction Pattern Extension -------------------------------------------------------------------------------- +### URL Construction Pattern Extension Extension points for URL construction pattern enhancement: -**Pattern Registration**: Systems for registering new URL construction patterns +**Pattern Registration**: Systems for registering new URL construction patterns including documentation site types, hosting arrangements, and custom schemes. -**Validation Extension**: Extensible URL validation including custom validation +**Validation Extension**: Extensible URL validation including custom validation rules, accessibility testing, and format verification. -**Fallback Strategy Extension**: Extension points for alternative URL construction +**Fallback Strategy Extension**: Extension points for alternative URL construction strategies when primary patterns fail or produce invalid results. -**Performance Optimization**: URL construction optimization including caching +**Performance Optimization**: URL construction optimization including caching strategies, batch processing, and resource management. 
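Pattern registration with fallback could take the shape of an ordered list of URL builders tried in turn. The sketch below is an assumption-laden illustration: `register_pattern`, `construct_url`, and both example patterns are invented names, and the default validator checks only URL format, whereas the text also mentions accessibility testing.

```python
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# A pattern maps (base_url, object_path) to a candidate URL.
UrlPattern = Callable[[str, str], str]

_PATTERNS: list[UrlPattern] = []


def register_pattern(pattern: UrlPattern) -> UrlPattern:
    """Register a URL construction pattern; earlier registrations win."""
    _PATTERNS.append(pattern)
    return pattern


def is_valid(url: str) -> bool:
    """Format-only validation: require an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def construct_url(
    base_url: str,
    object_path: str,
    validate: Callable[[str], bool] = is_valid,
) -> Optional[str]:
    """Return the first candidate accepted by validate, or None."""
    for pattern in _PATTERNS:
        candidate = pattern(base_url, object_path)
        if validate(candidate):
            return candidate
    return None


@register_pattern
def direct(base_url: str, object_path: str) -> str:
    # Primary pattern: join the path onto the base as given.
    return urljoin(base_url.rstrip("/") + "/", object_path)


@register_pattern
def directory_style(base_url: str, object_path: str) -> str:
    # Fallback pattern: directory-style URL without the .html suffix.
    path = object_path.removesuffix(".html") + "/"
    return urljoin(base_url.rstrip("/") + "/", path)
```

Passing a custom `validate` callable is where an existence check or site-specific rule would plug in.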
-This structure processor architecture provides comprehensive content extraction
-capabilities while maintaining clean integration with inventory processors and
-supporting extensibility for diverse documentation formats and themes. The
-design emphasizes capability-based processing, robust error handling, and
-performance optimization for reliable content access across varied documentation sources.
\ No newline at end of file
+This structure processor architecture provides comprehensive content extraction
+capabilities while maintaining clean integration with inventory processors and
+supporting extensibility for diverse documentation formats and themes. The
+design emphasizes capability-based processing, robust error handling, and
+performance optimization for reliable content access across varied documentation sources.
diff --git a/documentation/architecture/openspec/specs/structure-processing/spec.md b/documentation/architecture/openspec/specs/structure-processing/spec.md
new file mode 100644
index 0000000..76e2cdd
--- /dev/null
+++ b/documentation/architecture/openspec/specs/structure-processing/spec.md
@@ -0,0 +1,16 @@
+# Structure Processing
+
+## Purpose
+The Structure Processing capability extracts content from documentation pages and transforms it into structured documents suitable for search and analysis.
+
+## Requirements
+
+### Requirement: Content Quality
+The system SHALL ensure high-quality content extraction and formatting.
+
+Priority: Medium
+
+#### Scenario: HTML to Markdown
+- **WHEN** content is extracted
+- **THEN** HTML artifacts and navigation are removed
+- **AND** code blocks and formatting are preserved
diff --git a/documentation/prd.rst b/documentation/prd.rst
deleted file mode 100644
index 9fcb9a6..0000000
--- a/documentation/prd.rst
+++ /dev/null
@@ -1,327 +0,0 @@
-.. vim: set fileencoding=utf-8:
-.. -*- coding: utf-8 -*-
-..
+--------------------------------------------------------------------------+ - | | - | Licensed under the Apache License, Version 2.0 (the "License"); | - | you may not use this file except in compliance with the License. | - | You may obtain a copy of the License at | - | | - | http://www.apache.org/licenses/LICENSE-2.0 | - | | - | Unless required by applicable law or agreed to in writing, software | - | distributed under the License is distributed on an "AS IS" BASIS, | - | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | - | See the License for the specific language governing permissions and | - | limitations under the License. | - | | - +--------------------------------------------------------------------------+ - - -******************************************************************************* -Product Requirements Document -******************************************************************************* - -Executive Summary -=============================================================================== - -The project is a dual-purpose tool that provides both an MCP (Model Context Protocol) server and CLI interface for searching and extracting content from static documentation sites. It enables AI agents and human users to efficiently discover, search, and extract relevant information from documentation inventories and full-text content. - -The product targets AI agents needing to access technical documentation during development workflows, as well as human developers seeking efficient documentation search capabilities outside of LLM environments. - -For PRD format and guidance, see the `requirements documentation guide -`_. - -Problem Statement -=============================================================================== - -**Who experiences the problem:** AI agents, LLM developers, and human developers working with complex software ecosystems that rely on external documentation. 
- -**When and where it occurs:** -- During development when agents need specific API documentation or usage examples -- When searching across multiple documentation sites for related concepts -- When working offline or with limited documentation access -- When existing documentation search mechanisms are inadequate for programmatic access - -**Impact and consequences:** -- AI agents cannot efficiently access up-to-date technical documentation -- Developers waste time manually searching through documentation sites -- Inconsistent access patterns across different documentation systems (Sphinx, MkDocs, etc.) -- Limited advanced search capabilities within documentation ecosystems - -**Current limitations:** -- Static MCP servers only provide file serving without semantic search -- Documentation sites have varying search capabilities and interfaces -- No unified interface for accessing multiple documentation formats -- Limited programmatic access to documentation inventory and cross-references - -Goals and Objectives -=============================================================================== - -**Primary Objectives (Critical):** -1. **Unified Documentation Access**: Provide consistent interface for Sphinx, MkDocs, Pydoctor, and Rustdoc documentation sites -2. **Advanced Search**: Enable fuzzy, exact, and regex-based search across documentation inventories and content -3. **MCP Integration**: Seamless integration with AI agents through Model Context Protocol -4. **Performance**: Fast response times with intelligent caching for frequently accessed documentation - -**Secondary Objectives (High Priority):** -1. **Extensibility**: Plugin architecture supporting additional documentation formats -2. **CLI Usability**: Human-usable command-line interface for testing and standalone use -3. **Content Quality**: High-quality HTML-to-Markdown conversion preserving code blocks and formatting -4. 
**Developer Experience**: Clear error messages, helpful diagnostics, and robust error handling - -**Success Metrics:** -- Sub-second response times for cached inventory queries -- Support for 90%+ of popular Sphinx, MkDocs, Pydoctor, and Rustdoc sites -- Clean markdown output with preserved code formatting -- Successful integration with major MCP clients -- 90%+ test coverage with comprehensive edge case handling - -Target Users -=============================================================================== - -**Primary Users - AI Agents/LLM Systems:** -- **Technical Context**: Programmatic access through MCP protocol -- **Needs**: Structured documentation access, search capabilities, content extraction -- **Usage Pattern**: Automated queries during development assistance -- **Environment**: Integration with Claude Code, other MCP-enabled systems - -**Secondary Users - Developer Tool Creators:** -- **Technical Context**: Python developers building documentation tools -- **Needs**: Extensible plugin system, clean APIs, reliable performance -- **Usage Pattern**: Integration into larger development workflows -- **Environment**: CI/CD systems, development toolchains - -**Tertiary Users - Human Developers:** -- **Technical Context**: Command-line proficient, working with multiple documentation sites -- **Needs**: Fast search across documentation, offline access capabilities -- **Usage Pattern**: Occasional direct CLI usage for testing or when LLM unavailable -- **Environment**: Local development environments, terminal-based workflows - -Functional Requirements -=============================================================================== - -**REQ-001: MCP Server Implementation (Critical)** -- **Priority**: Critical -- **Description**: Implement complete MCP server with FastMCP framework -- **User Story**: As an AI agent, I want to connect to the system via MCP so that I can programmatically access documentation -- **Acceptance Criteria**: - - - Server responds to MCP 
client connections - - Implements query_inventory tool - - Implements query_content tool - - Implements summarize_inventory tool - - Supports restart functionality for development - - JSON schema generation for all tool parameters - -**REQ-002: Sphinx Documentation Processing (Critical)** -- **Priority**: Critical -- **Description**: Full support for Sphinx documentation sites including inventory parsing and content extraction -- **User Story**: As a user, I want to search Sphinx documentation sites so that I can find API references and usage examples -- **Acceptance Criteria**: - - - Parse objects.inv files from Sphinx sites - - Extract HTML content and convert to clean Markdown - - Support major Sphinx themes (Furo, ReadTheDocs, pydoctheme) - - Handle cross-references and object relationships - - Preserve code block formatting and syntax highlighting hints - -**REQ-003: MkDocs Documentation Processing (Critical)** -- **Priority**: Critical -- **Description**: Full support for MkDocs sites with mkdocstrings integration -- **User Story**: As a user, I want to search MkDocs documentation so that I can access API documentation generated by mkdocstrings -- **Acceptance Criteria**: - - - Parse objects.inv files from mkdocstrings-enabled MkDocs sites - - Extract content from Material for MkDocs theme - - Convert HTML to Markdown with language-aware code blocks - - Handle mkdocstrings-specific content structure - - Filter out navigation and UI elements during extraction - -**REQ-004: Pydoctor Documentation Processing (Critical)** -- **Priority**: Critical -- **Description**: Full support for Pydoctor documentation sites -- **User Story**: As a user, I want to search Pydoctor documentation so that I can access API documentation for Twisted and other Zope-stack projects -- **Acceptance Criteria**: - - - Parse objects.inv files from Pydoctor sites - - Extract content from Pydoctor-generated HTML - - Convert HTML to Markdown with language-aware code blocks - - Handle 
Pydoctor-specific content structure
-  - Filter out navigation and UI elements during extraction
-
-**REQ-005: Rustdoc Documentation Processing (Critical)**
-- **Priority**: Critical
-- **Description**: Full support for Rustdoc documentation sites
-- **User Story**: As a user, I want to search Rustdoc documentation so that I can access API documentation for Rust crates
-- **Acceptance Criteria**:
-
-  - Parse search-index.js files from Rustdoc sites
-  - Extract content from Rustdoc-generated HTML
-  - Convert HTML to Markdown with language-aware code blocks
-  - Handle Rustdoc-specific content structure
-  - Filter out navigation and UI elements during extraction
-
-**REQ-006: Search Functionality (Critical)**
-- **Priority**: Critical
-- **Description**: Multiple search modes with configurable behavior
-- **User Story**: As a user, I want to search documentation using different matching strategies so that I can find relevant content efficiently
-- **Acceptance Criteria**:
-
-  - Fuzzy search with configurable threshold (default 50)
-  - Exact string matching
-  - Regular expression search
-  - Search across inventory objects and full content
-  - Filtering by domain, role, and custom processor filters
-  - Configurable result limits and detail levels
-
-**REQ-007: Caching System (High)**
-- **Priority**: High
-- **Description**: Intelligent caching to improve performance and reduce network requests
-- **User Story**: As a user, I want fast response times for repeated queries so that my workflow is not interrupted
-- **Acceptance Criteria**:
-
-  - Cache downloaded inventories with TTL
-  - Cache extracted content with appropriate invalidation
-  - Memory-efficient caching strategy
-  - Cache hit/miss metrics for optimization
-  - Configurable cache settings
-
-**REQ-008: CLI Interface (High)**
-- **Priority**: High
-- **Description**: Human-usable command-line interface for testing and standalone use
-- **User Story**: As a developer, I want to test librovore functionality from the command line so that I can validate behavior and debug issues
-- **Acceptance Criteria**:
-
-  - Commands for inventory querying, content search, and summarization
-  - JSON and Markdown output formats
-  - Comprehensive help text and error messages
-  - Support for all MCP server capabilities
-  - Configuration file support for frequent use cases
-
-**REQ-009: Processor Detection (High)**
-- **Priority**: High
-- **Description**: Automatic detection of appropriate processor for given documentation site
-- **User Story**: As a user, I want the system to automatically determine the correct processor so that I don't need to specify the documentation type
-- **Acceptance Criteria**:
-
-  - Detect Sphinx sites by robots.txt and objects.inv presence
-  - Detect MkDocs sites with mkdocstrings by objects.inv and site structure
-  - Detect Pydoctor sites by objects.inv and site structure
-  - Detect Rustdoc sites by search-index.js and site structure
-  - Graceful fallback when detection is ambiguous
-  - Clear error messages when no suitable processor is found
-  - Confidence scoring for processor selection
-
-**REQ-010: Content Quality (Medium)**
-- **Priority**: Medium
-- **Description**: High-quality content extraction and formatting
-- **User Story**: As a user, I want extracted content to be clean and well-formatted so that it's easily readable and usable
-- **Acceptance Criteria**:
-
-  - Remove HTML artifacts and navigation elements
-  - Preserve code block structure and language hints
-  - Maintain proper whitespace and formatting
-  - Convert HTML tables to Markdown tables
-  - Handle images and media references appropriately
-
-**REQ-011: Error Handling (Medium)**
-- **Priority**: Medium
-- **Description**: Robust error handling and user feedback
-- **User Story**: As a user, I want clear error messages when something goes wrong so that I can understand and resolve issues
-- **Acceptance Criteria**:
-
-  - Graceful handling of network failures
-  - Validation of input parameters with helpful messages
-  - Fallback strategies for partially available documentation
-  - Detailed logging for debugging purposes
-  - Recovery from temporary service unavailability
-
-**REQ-012: Plugin Architecture Foundation (Low)**
-- **Priority**: Low
-- **Description**: Extensible architecture for additional documentation processors
-- **User Story**: As a tool developer, I want to extend the system with custom processors so that I can support additional documentation formats
-- **Acceptance Criteria**:
-
-  - Abstract base classes for processors
-  - Plugin discovery mechanism
-  - Documentation for plugin development
-  - Example plugin implementation
-  - Backward compatibility guarantees
-
-Non-Functional Requirements
-===============================================================================
-
-**Scalability Requirements:**
-- Handle inventories with 10,000+ objects
-- Support documentation sites with 1,000+ pages
-- Efficient memory usage for large content extraction
-- Configurable resource limits to prevent abuse
-
-**Reliability Requirements:**
-- Graceful degradation when documentation sites are unavailable
-- Automatic retry with exponential backoff for network failures
-- Recovery from corrupted cache data
-- Consistent behavior across different operating systems
-
-**Security Requirements:**
-- No execution of untrusted code from documentation sites
-- Safe handling of potentially malicious HTML content
-- Input validation for all user-provided parameters
-- Protection against resource exhaustion attacks
-
-**Usability Requirements:**
-- Clear, actionable error messages
-- Comprehensive CLI help text
-- JSON output compatible with standard tools (jq, etc.)
-- Markdown output suitable for human reading
-- Minimal configuration required for basic operation
-
-**Compatibility Requirements:**
-- Python 3.10+ support
-- MCP protocol compliance
-- Support for major documentation hosting platforms (GitHub Pages, ReadTheDocs, etc.)
-- Cross-platform operation (Linux, macOS, Windows)
-
-Constraints and Assumptions
-===============================================================================
-
-**Technical Constraints:**
-- Must use Python for implementation (existing codebase)
-- Must comply with MCP protocol specifications
-- Cannot modify remote documentation sites or require site-specific changes
-- Limited to documentation formats that provide machine-readable inventories
-
-**Regulatory Constraints:**
-- Must respect robots.txt directives
-- Must not overwhelm documentation sites with excessive requests
-- Must handle rate limiting appropriately
-
-**Assumptions:**
-- Target documentation sites will continue supporting objects.inv format
-- Network connectivity available for accessing remote documentation
-- Documentation sites follow standard patterns for content organization
-- Users have appropriate permissions to access target documentation sites
-
-Out of Scope
-===============================================================================
-
-**Excluded Features:**
-- Real-time synchronization with documentation source repositories
-- Modification or annotation of documentation content
-- Full-text indexing of documentation sites without inventories
-- Support for documentation formats without machine-readable inventories
-- Authentication mechanisms for private documentation sites
-- Multi-user collaboration features
-- Web-based user interface
-- Integration with version control systems
-- Automated documentation generation
-- Support for multimedia content (videos, audio)
-- Advanced analytics or usage tracking
-- Integration with specific IDE plugins (beyond MCP)
-
-**Future Considerations:**
-- OpenAPI/Swagger processor support
-- GraphQL schema introspection
-- Enhanced relationship mapping between documentation objects
-- Interactive CLI browser mode
-- Multi-site search aggregation
\ No newline at end of file