System Architecture

Pipeline Overview

The system converts natural language queries into precise SPARQL queries grounded in OWL + SHACL domain ontologies. The pipeline is designed around one core principle: the LLM fills structured slots — it never writes SPARQL. Compiler metadata is derived from the schema graph via schema-queries.ts, which builds a CompilerVocab with property/domain mappings, shape groups, and Range2D properties.

A second pipeline runs alongside the compiler: at warmup, reference-index.ts discovers every cross-asset reference signature (sourceClass, predicatePath, targetClass) by BFS over typed instances; metadata-index.ts snapshots the per-asset shape-group facets and computes per-domain aggregates. Together they back the WP3 traceability layer — per-row predicate-chain breadcrumbs, the multi-hop /traceability lineage endpoint, and the /metadata/{asset,aggregate} facet endpoints.
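
The reference discovery described above can be sketched as a breadth-first walk over typed instances. This is a minimal illustration, not the code in reference-index.ts: the in-memory Triple type, the maxDepth limit, and the ex: names in the sample data are all assumptions made for the example.

```typescript
// Hypothetical in-memory triple; the real index would read from the SPARQL store.
type Triple = { s: string; p: string; o: string };
type ReferenceSignature = {
  sourceClass: string;
  predicatePath: string[];
  targetClass: string;
};

const RDF_TYPE = "rdf:type";

function discoverReferenceSignatures(
  triples: Triple[],
  assetClasses: Set<string>,
  maxDepth = 3, // assumed hop limit, not taken from the source
): ReferenceSignature[] {
  // Index outgoing edges and rdf:type assertions by subject.
  const outgoing = new Map<string, Triple[]>();
  const typesOf = new Map<string, string[]>();
  for (const t of triples) {
    if (t.p === RDF_TYPE) {
      typesOf.set(t.s, (typesOf.get(t.s) ?? []).concat(t.o));
    } else {
      outgoing.set(t.s, (outgoing.get(t.s) ?? []).concat(t));
    }
  }
  const assetTypes = (node: string): string[] =>
    (typesOf.get(node) ?? []).filter((c) => assetClasses.has(c));

  const found = new Map<string, ReferenceSignature>();
  typesOf.forEach((_classes, start) => {
    for (const sourceClass of assetTypes(start)) {
      // BFS from each asset instance; stop expanding once another asset is reached.
      const queue: { node: string; path: string[] }[] = [{ node: start, path: [] }];
      const visited = new Set<string>([start]);
      while (queue.length > 0) {
        const { node, path } = queue.shift()!;
        if (path.length >= maxDepth) continue;
        for (const edge of outgoing.get(node) ?? []) {
          const nextPath = [...path, edge.p];
          const targets = assetTypes(edge.o);
          if (targets.length > 0) {
            for (const targetClass of targets) {
              // Deduplicate: many instances share one signature.
              const key = [sourceClass, nextPath.join("/"), targetClass].join("|");
              found.set(key, { sourceClass, predicatePath: nextPath, targetClass });
            }
          } else if (!visited.has(edge.o)) {
            visited.add(edge.o);
            queue.push({ node: edge.o, path: nextPath });
          }
        }
      }
    }
  });
  return Array.from(found.values());
}

// Sample data with made-up IRIs: a scenario reaches an HD map via an environment node.
const sampleTriples: Triple[] = [
  { s: "ex:scenario1", p: RDF_TYPE, o: "ex:Scenario" },
  { s: "ex:map1", p: RDF_TYPE, o: "ex:HdMap" },
  { s: "ex:scenario1", p: "ex:hasEnvironment", o: "ex:env1" },
  { s: "ex:env1", p: "ex:usesMap", o: "ex:map1" },
];
const sampleSignatures = discoverReferenceSignatures(
  sampleTriples,
  new Set(["ex:Scenario", "ex:HdMap"]),
);
```

Keying the result on the full (sourceClass, predicatePath, targetClass) triple keeps the index small: however many instances exist, each distinct signature is stored once.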

Module Boundaries

The application is a pnpm monorepo with Turborepo orchestration. Each package has a single responsibility and clear dependency direction.

Package Responsibilities

| Package | Module | Role |
| --- | --- | --- |
| @ontology-search/core | config/ | Zod-validated env config |
| | logging/ | Structured JSON logger with correlation IDs |
| | errors/ | Shared error types and base classes |
| @ontology-search/sparql | oxigraph-store.ts | Oxigraph WASM wrapper, SPARQL execution |
| | remote-store.ts | HTTP client for any SPARQL 1.1 endpoint |
| | cached-store.ts | LRU query cache decorator (wraps either store) |
| | cache.ts | LRU cache implementation |
| | policy.ts | Query validation policies |
| @ontology-search/ontology | warmup.ts | Loads instance TTL data at startup |
| | paths.ts | Resolves project root and ontology file paths |
| | domain-registry.ts | Domain lookups and registration |
| | vocabulary-index.ts | Vocabulary indexing for property → domain mapping |
| @ontology-search/search | schema-loader.ts | Loads 45 OWL+SHACL files into <urn:graph:schema> |
| | schema-queries.ts | Graph-driven SPARQL helpers for domains, references, and shape groups |
| | property-paths.ts | Discovers predicate chains from SHACL (ontology-agnostic) |
| | vocabulary-extractor.ts | SPARQL-based extraction of sh:in enums + numeric props |
| | reference-index.ts | WP3: BFS-discovered (source, predicate, target) reference signatures |
| | metadata-index.ts | WP3: per-asset facet snapshot + per-domain aggregate distribution |
| | compiler.ts | compileSlots + compileSlotsWithTrace (the trace variant binds intermediate JOIN vars for per-row lineage) |
| | sparql-validator.ts | Post-compilation SPARQL syntax validation |
| | service.ts | Orchestrates init → interpret → compile → execute |
| | factory.ts | Service factory and dependency wiring |
| | slots.ts | SearchSlots type definitions (flattened: no location/license slots; new references slot) |
| | data-loader.ts | Loads sample TTL + JSON-LD files for dev/test |
| | init.ts | Initialization sequence |
| @ontology-search/llm | prompt-builder.ts | Auto-generates LLM system prompt from raw SHACL |
| | slot-validator.ts | Post-LLM validation: tokenised fuzzy match, multi-domain correction, gap enrichment |
| | agent/copilot-agent.ts | Copilot SDK agent path + investigation tools |
| | agent/index.ts | Vercel AI SDK agent path (5 providers; toolChoice forces submit_slots) |
| | agent/submission-router.ts | Dispatches the LLM submission to the appropriate post-processing pipeline |
| | agent/run-slot-pipeline.ts | Validates slots, enriches gaps, and emits honest dropped-reference gaps |
| | agent/tools.ts | submit_slots tool definition |
| | agent/investigation-tools.ts | 5 schema discovery tools (kept available; rarely used now) |
| @ontology-search/api | routes/search.ts | Hono SSE streaming endpoint (search + refine) |
| | routes/traceability.ts | WP3: GET /traceability?asset=<iri>&depth=N (multi-hop lineage walk) |
| | routes/metadata.ts | WP3: GET /metadata/asset (per-asset facets) and /metadata/aggregate (per-domain stats) |
| | routes/stats.ts | Statistics endpoint |
| | warmup.ts | Startup orchestration (load ontology, init store, warm reference + metadata indices) |
| @ontology-search/testing | helpers/ | Shared test utilities (mock logger, fixtures) |
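
As an illustration of the cache.ts and cached-store.ts rows in the table above, the following is a rough sketch of an LRU cache and a caching decorator around a query store. The QueryStore interface, the class names, and the use of the raw query string as the cache key are assumptions for the example; the actual implementations may differ.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

// Hypothetical store interface shared by the local and remote stores.
interface QueryStore {
  query(sparql: string): Promise<unknown[]>;
}

// Decorator: same interface as the wrapped store, with results cached per query string.
class CachedStore implements QueryStore {
  private readonly inner: QueryStore;
  private readonly cache: LruCache<unknown[]>;

  constructor(inner: QueryStore, capacity = 100) {
    this.inner = inner;
    this.cache = new LruCache<unknown[]>(capacity);
  }

  async query(sparql: string): Promise<unknown[]> {
    const hit = this.cache.get(sparql);
    if (hit !== undefined) return hit;
    const rows = await this.inner.query(sparql);
    this.cache.set(sparql, rows);
    return rows;
  }
}

// Usage of the cache on its own: "b" is evicted because "a" was touched more recently.
const demoCache = new LruCache<number>(2);
demoCache.set("a", 1);
demoCache.set("b", 2);
demoCache.get("a");
demoCache.set("c", 3);
```

Because the decorator implements the same interface as the store it wraps, callers do not need to know whether they are talking to the Oxigraph store, the remote store, or a cached version of either.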

Dependency Rules

Packages follow a strict layered dependency direction — no circular dependencies allowed:

  • core has zero workspace dependencies — it is the shared foundation
  • sparql and ontology depend only on core
  • search depends on core, sparql, and ontology
  • llm depends on core, ontology, and search
  • Apps (api, web) compose packages — packages never depend on apps
  • testing provides shared test utilities — not used in production code
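
The rules above can be expressed as data and checked mechanically. The sketch below is one hypothetical way to do that; the allowedDeps map is transcribed from the list, while findViolations and the sample input are invented for illustration and are not part of the repository.

```typescript
// Allowed workspace dependencies per package, transcribed from the rules above.
const allowedDeps: Record<string, string[]> = {
  core: [],
  sparql: ["core"],
  ontology: ["core"],
  search: ["core", "sparql", "ontology"],
  llm: ["core", "ontology", "search"],
};

// Returns "pkg -> dep" for every declared dependency that the rules do not allow.
// Packages without an entry in allowedDeps (apps, testing) are skipped in this sketch.
function findViolations(actual: Record<string, string[]>): string[] {
  const violations: string[] = [];
  for (const pkg of Object.keys(actual)) {
    const allowed = allowedDeps[pkg];
    if (allowed === undefined) continue;
    for (const dep of actual[pkg]) {
      if (allowed.indexOf(dep) === -1) violations.push(`${pkg} -> ${dep}`);
    }
  }
  return violations;
}

// Made-up input: core must not depend on sparql, and llm has no direct sparql edge.
const sampleViolations = findViolations({
  core: ["sparql"],
  search: ["core", "ontology"],
  llm: ["core", "sparql"],
});
```

Since every allowed edge points to a lower layer, a dependency graph that passes this check cannot contain a cycle among the listed packages.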

Ontology-Agnostic Design

The system is designed to work with any set of OWL + SHACL ontologies — no hardcoded domain names, predicates, or class IRIs exist in production code. All structure is discovered at runtime from the schema graph.

Graph-Driven Discovery

| Component | What It Discovers | Source |
| --- | --- | --- |
| Domain Registry | Asset types (hdmap, scenario, ...) | rdfs:subClassOf + sh:targetClass |
| Property Paths | Predicate chains (asset → leaf) | sh:property / sh:node traversal |
| Schema Queries | Shape groups, cross-domain refs | SPARQL against <urn:graph:schema> |
| Investigation Tools | Anything (the LLM explores at runtime) | Ad-hoc SPARQL SELECT |
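
To make the "Property Paths" row concrete, here is a small sketch of how predicate chains could be derived by following sh:property and sh:node links. It operates on a simplified in-memory shape map rather than an RDF graph; the ShapeGraph types, the discoverPaths function, and the ex: shapes are hypothetical and only illustrate the traversal idea.

```typescript
// Simplified stand-ins for SHACL shapes: path mirrors sh:path, node mirrors sh:node.
type PropertyShape = { path: string; node?: string };
type NodeShape = { targetClass?: string; properties: PropertyShape[] };
type ShapeGraph = Record<string, NodeShape>;

// Collects every predicate chain from the root shape down to a leaf property.
function discoverPaths(shapes: ShapeGraph, rootShape: string): string[][] {
  const results: string[][] = [];
  const walk = (shapeId: string, prefix: string[], visiting: Set<string>): void => {
    const shape = shapes[shapeId];
    // Unknown shapes and shapes already on the current branch are not expanded,
    // which keeps cyclic sh:node references from recursing forever.
    if (shape === undefined || visiting.has(shapeId)) return;
    visiting.add(shapeId);
    for (const prop of shape.properties) {
      const chain = [...prefix, prop.path];
      if (prop.node !== undefined && shapes[prop.node] !== undefined) {
        walk(prop.node, chain, visiting); // nested shape: keep descending
      } else {
        results.push(chain); // leaf property: record the full chain
      }
    }
    visiting.delete(shapeId);
  };
  walk(rootShape, [], new Set<string>());
  return results;
}

// Made-up shapes: a scenario with a nested environment shape.
const sampleShapes: ShapeGraph = {
  "ex:ScenarioShape": {
    targetClass: "ex:Scenario",
    properties: [
      { path: "ex:hasEnvironment", node: "ex:EnvironmentShape" },
      { path: "ex:name" },
    ],
  },
  "ex:EnvironmentShape": {
    properties: [{ path: "ex:weather" }, { path: "ex:timeOfDay" }],
  },
};
const samplePaths = discoverPaths(sampleShapes, "ex:ScenarioShape");
```

Nothing in the traversal names a specific domain or predicate, which is the property the ontology-agnostic design relies on: swapping in a different set of shapes yields a different set of chains without code changes.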

RDF Reasoning Capability

The LLM has 5 investigation tools that query the schema graph using SPARQL. This enables runtime ontology exploration beyond the static prompt — the LLM can verify concepts, discover relationships, and explore property hierarchies before filling slots.

See: Generic Design for the full architecture of the ontology-agnostic approach, property path discovery, and RDF reasoning capabilities.

Data Flow (Swim Lane)

Request Phase

Execution Phase

Security Model

The system is designed for defense in depth, so that no single-layer failure can produce arbitrary queries:

LLM Never Writes SPARQL

The agent fills structured slots via a single tool (submit_slots), and the compiler generates SPARQL deterministically from those slots. Because model output reaches the query only as slot values, a prompt injection cannot make the compiler emit arbitrary SPARQL.
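
The following sketch shows the general shape of such a slot compiler. It is deliberately much simpler than compiler.ts: the Slots and Vocab types, the escapeLiteral helper, and the example.org IRIs are assumptions for illustration, and real predicate chains, shape groups, and ranges are omitted.

```typescript
// Hypothetical, simplified slot and vocabulary types.
type Filter = { property: string; value: string };
type Slots = { domain: string; filters: Filter[] };
type Vocab = {
  classIri: Record<string, string>;
  propertyIri: Record<string, string>;
};

// Escapes a value for use inside a double-quoted SPARQL string literal.
function escapeLiteral(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
}

// Deterministic compilation: IRIs come only from the vocabulary, and
// slot values only ever appear as escaped string literals.
function compileSlots(slots: Slots, vocab: Vocab): string {
  const classIri = vocab.classIri[slots.domain];
  if (classIri === undefined) throw new Error(`Unknown domain: ${slots.domain}`);
  const patterns = [`?asset a <${classIri}> .`];
  for (const filter of slots.filters) {
    const propertyIri = vocab.propertyIri[filter.property];
    if (propertyIri === undefined) throw new Error(`Unknown property: ${filter.property}`);
    patterns.push(`?asset <${propertyIri}> "${escapeLiteral(filter.value)}" .`);
  }
  return `SELECT DISTINCT ?asset WHERE {\n  ${patterns.join("\n  ")}\n}`;
}

// A hostile slot value stays inside the string literal after escaping.
const exampleVocab: Vocab = {
  classIri: { scenario: "http://example.org/Scenario" },
  propertyIri: { weather: "http://example.org/weather" },
};
const exampleQuery = compileSlots(
  { domain: "scenario", filters: [{ property: "weather", value: 'rain" . ?s ?p ?o' }] },
  exampleVocab,
);
```

The key design point is that the query skeleton is fixed by the compiler and the vocabulary; a slot can select among known classes and properties or supply a literal, but it cannot contribute query syntax.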

Slot Validation

Every filter value is validated against sh:in vocabulary from the ontology. Unknown values are rejected or fuzzy-matched to the nearest valid term. Domain mismatches are corrected automatically.
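
A minimal sketch of this kind of check is shown below. It uses plain Levenshtein edit distance over the whole value as a stand-in for the fuzzy matching in slot-validator.ts; the validateValue function, the ValidationResult type, the distance threshold of 2, and the sample vocabulary are all assumptions for the example.

```typescript
// Levenshtein edit distance using a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row[j] = j;
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of d[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of d[i-1][j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

type ValidationResult =
  | { status: "ok"; value: string }
  | { status: "corrected"; value: string; from: string }
  | { status: "rejected"; from: string };

// Accepts exact matches, corrects near misses, and rejects everything else.
function validateValue(
  raw: string,
  allowed: string[],
  maxDistance = 2, // assumed threshold, chosen for illustration
): ValidationResult {
  const needle = raw.trim().toLowerCase();
  let best: string | undefined;
  let bestDistance = Infinity;
  for (const candidate of allowed) {
    const distance = editDistance(needle, candidate.toLowerCase());
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  if (best === undefined || bestDistance > maxDistance) {
    return { status: "rejected", from: raw };
  }
  return bestDistance === 0
    ? { status: "ok", value: best }
    : { status: "corrected", value: best, from: raw };
}

// Made-up sh:in vocabulary for a weather property.
const weatherTerms = ["rain", "snow", "fog"];
const exactResult = validateValue("snow", weatherTerms);
const nearResult = validateValue("Rainy", weatherTerms);
const farResult = validateValue("hurricane", weatherTerms);
```

Returning a tagged result rather than a bare string lets the caller report corrections and rejections to the user instead of silently rewriting the query.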

Zod Validation

All API inputs are validated with Zod schemas, and configuration is validated at startup, so malformed or untyped data is rejected at the system boundary instead of flowing into the pipeline.