Parser API reference

The parser package exposes one supported entry point — parse(...) — plus the Pydantic models that describe the parsed tree. Anything else listed here is a documented internal you may import if you need to build custom tooling on top.

Public entry point

parse

parse(
    source: str | Path,
    rules: str | Path | None = None,
    lang: str = "en",
    *,
    strip_prefix: str | None = None,
    nlp_cache_dir: Path = DEFAULT_NLP_CACHE_DIR,
    unmatched_namespace: str | None = None
) -> APIModel

Parse an OpenAPI 3.x document into an APIModel tree.

Parameters:

  • source (str | Path, required): A local filesystem path or http(s) URL pointing to a JSON or YAML OpenAPI document. Format is auto-detected by content.
  • rules (str | Path | None, default None): Optional local path to a JSON/YAML rules file that mirrors the OpenAPI extension shape (root x-okapipy-ns and per-operation x-okapipy-kind). URLs are not accepted.
  • lang (str, default 'en'): ISO language code controlling which spaCy model is loaded.
  • strip_prefix (str | None, default None): Optional path prefix to strip from every path before classification, e.g. /public/v1. When set, overrides the prefix inferred from servers[].url.
  • nlp_cache_dir (Path, default DEFAULT_NLP_CACHE_DIR): Directory under which spaCy models are stored and looked up. On a cache miss the model is downloaded into this directory.
  • unmatched_namespace (str | None, default None): When set, operations that would otherwise be dropped by the routing table are retained as synthetic actions under a top-level namespace of this name. Raises UnmatchedNamespaceCollisionError if the name collides with an existing top-level node.

Returns:

  • APIModel: The fully-built APIModel rooted at the namespaces it discovered.

The data model

The tree returned by parse(...) is a graph of these Pydantic v2 models. They're immutable in spirit (downstream code generators treat them as read-only) and round-trip cleanly through JSON / YAML.

APIModel

Bases: BaseModel

The root of the parsed structural tree.

The root holds top-level namespaces, collections, singletons, and actions. Real-world OpenAPI documents commonly expose all four directly under / — e.g. /orders (collection), /me (singleton), /login (action) — with no namespace prefix.

Namespace

Bases: BaseModel

A folder-like grouping of sub-namespaces, collections, singletons, and actions.

Namespace-level actions (e.g. /auth/login) and namespace-level singletons (e.g. /admin/health) are valid: real APIs commonly host verb endpoints and singleton resources directly under a folder prefix.

Collection

Bases: BaseModel

A plural endpoint that fetches a list, creates new items, and contains a Resource.

Collections may also host sub-singletons that represent collection-level aggregate views — /orders/stats, /datasets/summary — alongside the per-item Resource reached via {id}. Sub-collections are not allowed (collection-under-collection has no canonical meaning).

Resource

Bases: BaseModel

The single-item endpoint of a collection, reached through the {id} path parameter.

Singleton

Bases: BaseModel

A resourceful endpoint with no enclosing collection.

Examples: /me, /health, /version, or sub-singletons like /users/{id}/avatar. Carries the same CRUD slots as Resource and may host sub-collections, sub-singletons, and actions. Has no resource slot — a Singleton is the resource.

Action

Bases: BaseModel

A non-CRUD endpoint identified by a verb-phrase path segment.

Actions may attach at the root of the API, under a Namespace, under a Collection, under a Resource, or under a Singleton.

attr_override decouples the surface attribute name from the path's last segment. The generator uses it when set (otherwise it falls back to the path-derived snake_case). It exists to support --unmatched, where the attribute should reflect the operation's operationId rather than wherever in the URL it happens to land.

Operation

Bases: BaseModel

A single HTTP operation declared on a path.

response_model names the literal 2xx response body schema (the envelope when the response wraps a list, or the resource itself for single-item responses). item_model names the inner schema of a list envelope when one is detected (plain type: array, or an object with an items/data/results/records/entries array property). The generator uses it so paginated iteration can yield typed model instances instead of raw dicts; left as None when the response isn't list-shaped or the item schema is anonymous.

request_model_members is non-empty when the request body is an inline anyOf / oneOf union of $ref members (e.g. Login | RefreshAccessToken). The generator renders the body parameter as a Member1 | Member2 Python union type. When this list is empty and request_model is set, the body is typed as that single class.

pagination_supported defaults to True and is only meaningful on collection-fetch operations; the generator decides what to do with it on other operations. filter_supported and sort_supported default to False and will be flipped on by future x-okapipy-filter / x-okapipy-sort extensions; they drive whether the generator emits filter() / order_by() on the collection. response_headers lists the names of headers declared on the chosen 2xx response, useful to the generator for detecting Link, X-Total-Count, etc.
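
The list-envelope detection described for item_model could look roughly like the sketch below. It is an illustration only, not the package's implementation: the helper name find_item_schema and the decision to return the raw item schema are assumptions. Only the envelope property names come from the description above.

```python
from typing import Any, Optional

# Envelope property names listed in the Operation description.
ENVELOPE_KEYS = ("items", "data", "results", "records", "entries")


def find_item_schema(schema: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the per-item schema of a list-shaped response, or None.

    Hypothetical helper: handles a plain `type: array` schema and an object
    whose envelope property is itself an array.
    """
    if schema.get("type") == "array":
        return schema.get("items")
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        for key in ENVELOPE_KEYS:
            candidate = properties.get(key)
            if isinstance(candidate, dict) and candidate.get("type") == "array":
                return candidate.get("items")
    # Not list-shaped: the caller would leave item_model as None.
    return None
```

A response that is neither an array nor an envelope object yields None, which matches the "left as None" case described above.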

Loading specs and rules

loader

Load an OpenAPI 3.x document from a local path or an http(s) URL.

load_spec is this module's entry point. It fetches the document (off disk for paths, over HTTP for URLs), auto-detects JSON vs YAML from the content, and returns the parsed mapping. $ref pointers are deliberately left intact: downstream code recovers schema names from the original $ref strings, and full reference resolution would be both unnecessary and prohibitively expensive on real-world specs (deeply self-referential schemas, unreachable external files).

detect_base_path reads the path component of the first servers[].url, and strip_base_path removes that prefix from each path key so subsequent path-walking sees segments relative to the API's logical root.

load_spec

load_spec(source: str | Path) -> dict[str, Any]

Load an OpenAPI 3.x document, preserving $ref pointers as-is.

The source may be a local filesystem path or an http(s) URL. Format (JSON or YAML) is auto-detected from the file content.

Parameters:

  • source (str | Path, required): Path or URL pointing to the spec.

Returns:

  • dict[str, Any]: The parsed spec as a plain dict, with $refs left intact.

Raises:

  • SpecLoadError: When the document cannot be located, read, or parsed.

detect_base_path

detect_base_path(spec: dict[str, Any]) -> str

Return the path component of the spec's first servers[].url, or an empty string.

OpenAPI 3.x uses servers to advertise base URLs; the path portion of the first server URL is treated as the API's base path. If no servers are declared (or the URL has no path), the empty string is returned and no stripping is performed.
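
As a rough illustration of base-path detection and stripping, the sketch below uses only the standard library. The function names carry a _sketch suffix because they are stand-ins. The per-path signature of the stripping helper, the trailing-slash trimming, and the segment-boundary check are assumptions, not documented behavior.

```python
from typing import Any
from urllib.parse import urlsplit


def detect_base_path_sketch(spec: dict[str, Any]) -> str:
    """Path component of the first servers[].url, or "" when absent."""
    servers = spec.get("servers") or []
    if not servers:
        return ""
    path = urlsplit(servers[0].get("url", "")).path
    # Assumption: a trailing slash is not part of the base path.
    return path.rstrip("/")


def strip_base_path_sketch(path: str, base_path: str) -> str:
    """Remove base_path from the front of path when it matches whole segments."""
    if base_path and (path == base_path or path.startswith(base_path + "/")):
        return path[len(base_path):] or "/"
    return path
```

The boundary check keeps /public/v10/orders untouched when the base path is /public/v1.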

rules

External rules file: a project-local override layer for OpenAPI parsing.

A rules file lets a user supply (or override) x-okapipy-ns at the document root and x-okapipy-kind / x-okapipy-paginated / x-okapipy-exclude on path-items or operations without editing the OpenAPI document itself. Rules-file values take precedence over values declared inline in the spec.

The file must be local. URLs are not supported.
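
The precedence rule (rules-file values win over inline spec values) can be pictured with a small lookup helper. This is a sketch under assumptions: the function name and the plain-dict representation of an operation's extensions are hypothetical, and the real Rules models are Pydantic classes whose fields are not shown here.

```python
from typing import Any, Optional


def effective_extension(
    name: str,
    spec_operation: dict[str, Any],
    rules_operation: Optional[dict[str, Any]],
) -> Any:
    """Return the value of an x-okapipy-* extension, letting the rules file win."""
    # A value present in the rules entry overrides the inline spec value.
    if rules_operation is not None and name in rules_operation:
        return rules_operation[name]
    # Otherwise fall back to whatever the spec declares (possibly nothing).
    return spec_operation.get(name)
```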

Rules

Bases: BaseModel

The full rules document.

PathRules

Bases: BaseModel

Rules entry for a single OpenAPI path.

OperationRules

Bases: BaseModel

Per-method override entry inside a path's rules block.

load_rules

load_rules(source: str | Path | None) -> Rules

Load a rules file from a local path, returning empty rules when source is None.

The file may be JSON or YAML; the format is auto-detected by attempting JSON first and falling back to YAML. URLs are rejected because the rules file is project-local.

Raises:

  • RulesFormatError: When the file cannot be read or parsed, or when an x-okapipy-kind value is not one of the four legal kinds.

NLP

nlp

spaCy-backed POS and morphology lookup for path segments.

The classifier needs to know, for a single path segment, whether it looks like a plural noun (a collection: users, account-tokens), a verb or verb-phrase (an action: login, force-reimport), or neither. This module produces that summary by tagging the segment with a small spaCy model.

Three responsibilities live here:

  1. Map an ISO language code to its spaCy model name (en -> en_core_web_sm).
  2. Load the spaCy pipeline from a user-controlled cache directory, downloading the model on a cache miss via python -m spacy download --target <cache_dir>.
  3. Split a segment on -/_, tag each token, and reduce the result to three mutually exclusive flags: is it a verb-phrase, is it plural, or is it singular/unknown. The compound-word logic uses the head-noun rule (last token determines role) with a postmodifier-word exception for constructions like units-of-measure or terms-and-conditions.
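
The head-noun rule from item 3 can be sketched without spaCy. Everything here is illustrative: the postmodifier word set is inferred from the two examples above, the interpretation that the head is the token just before the first postmodifier word is an assumption, and looks_plural is a crude stand-in for the morphology lookup the module actually performs.

```python
import re

# Inferred from the examples units-of-measure and terms-and-conditions;
# the package's actual word list is not shown in this reference.
POSTMODIFIER_WORDS = {"of", "and"}


def head_token(segment: str) -> str:
    """Pick the token that decides the segment's role.

    Head-noun rule: the last token wins, unless a postmodifier word appears,
    in which case the token just before the first postmodifier is the head.
    """
    tokens = [t for t in re.split(r"[-_]", segment.lower()) if t]
    for index, token in enumerate(tokens):
        if token in POSTMODIFIER_WORDS and index > 0:
            return tokens[index - 1]
    return tokens[-1] if tokens else ""


def looks_plural(token: str) -> bool:
    """Crude stand-in for spaCy's Number=Plur morphology; illustration only."""
    return token.endswith("s") and not token.endswith("ss")
```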

Three non-obvious workarounds preserve correctness against the small spaCy models:

  • Bare path tokens (tokens, users) get mistagged as singular PROPN. To detect plurality reliably, the segment is re-analyzed inside a definite-article wrapper from PLURAL_CONTEXT (e.g. "the tokens"); the head noun then carries the right Number morphology.
  • Verbs (reset, submit) keep their VERB tag in isolation but lose it inside the article wrapper. Each token is analyzed both ways and the signals combined.
  • A small per-language VERB_ACTION_REGISTRY covers high-traffic API verb endpoints (login, refresh, ping, ...) that spaCy mistags even with the workarounds above.

DEFAULT_CACHE_DIR module-attribute

DEFAULT_CACHE_DIR = cwd() / '.spacy'

load_pipeline

load_pipeline(lang: str, cache_dir: Path = DEFAULT_CACHE_DIR) -> Language

Load the spaCy pipeline for lang, downloading it on a cache miss.

The pipeline is cached per-process keyed by (lang, cache_dir), so repeated calls are cheap. On a cache miss the model is downloaded into cache_dir using python -m spacy download <model> --target <cache_dir>. Subsequent calls reuse the on-disk copy without touching the network.
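
The per-process caching keyed by (lang, cache_dir) might be pictured as below. This is a sketch, not the module's code: the loader is injected as a plain callable so the example runs without spaCy, and resolving cache_dir before using it as a key is an assumption.

```python
from pathlib import Path
from typing import Callable

# Per-process cache keyed by (lang, cache_dir).
_PIPELINES: dict[tuple[str, Path], object] = {}


def cached_pipeline(
    lang: str,
    cache_dir: Path,
    loader: Callable[[str, Path], object],
) -> object:
    """Return the cached pipeline, calling loader only on the first request."""
    # Assumption: equivalent relative and absolute paths share one entry.
    key = (lang, cache_dir.resolve())
    if key not in _PIPELINES:
        _PIPELINES[key] = loader(lang, cache_dir)
    return _PIPELINES[key]
```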

Parameters:

  • lang (str, required): ISO language code; must exist in the language-to-model table.
  • cache_dir (Path, default DEFAULT_CACHE_DIR): Directory under which model packages live.

Returns:

  • Language: A loaded spaCy Language pipeline ready for tagging.

Raises:

  • NlpModelMissingError: When the language is unknown or the download fails.

fetch_model

fetch_model(lang: str, cache_dir: Path = DEFAULT_CACHE_DIR) -> Path

Download the spaCy model for lang into cache_dir and return its path.

Uses spaCy's own download command, passing --target so the package is laid out under cache_dir/<model_name>/... instead of being installed globally.

Raises:

  • NlpModelMissingError: When the download fails for any reason (network down, unknown model name, pip failure).

model_path

model_path(lang: str, cache_dir: Path) -> Path

Return the on-disk directory that holds the spaCy model for lang.

python -m spacy download --target installs the model as a Python package laid out as <cache_dir>/<package>/<package>-<version>/.... This helper resolves the versioned subdirectory when one exists, and otherwise returns the package root (which is what tests using a stub directory will see).
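
A minimal sketch of that resolution logic, using only pathlib, is shown below. The function name resolve_model_dir is hypothetical, the table contains only the en entry documented earlier, and picking the last directory in sorted order when several versions exist is an assumption.

```python
from pathlib import Path

# Only the `en` entry is documented above; other languages are omitted here.
LANG_TO_MODEL = {"en": "en_core_web_sm"}


def resolve_model_dir(lang: str, cache_dir: Path) -> Path:
    """Prefer <cache_dir>/<package>/<package>-<version>/, else the package root."""
    package = LANG_TO_MODEL[lang]
    root = cache_dir / package
    if root.is_dir():
        versioned = sorted(
            child
            for child in root.iterdir()
            if child.is_dir() and child.name.startswith(f"{package}-")
        )
        if versioned:
            # Assumption: with several versions present, take the last by name.
            return versioned[-1]
    # No versioned subdirectory (or nothing on disk yet): return the package root.
    return root
```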

Classifier and builder

The classifier and builder are the heart of the pipeline. You generally won't call them directly — use parse(...) — but their docstrings are the most precise statement of what each phase does.

classifier

Classify a single OpenAPI path segment into a SegmentKind.

The structural builder calls classify_segment once per segment as it walks each path. The result decides whether the segment becomes a Namespace, Collection, Resource (path parameter), Singleton, or Action node in the tree.

The classifier applies the following precedence chain, stopping at the first match:

  1. Path-parameter shape — a segment containing {...} is always a RESOURCE_ID.
  2. Explicit hint — an x-okapipy-kind value passed in via extension_hint. The caller is responsible for merging spec values with rules-file values (rules win) before passing the hint here.
  3. Namespace registry — if the cumulative path is declared as a namespace (via spec x-okapipy-ns or rules), the segment is a NAMESPACE.
  4. NLP signal — analyze_segment reports verb-phrase / plural / singular, producing ACTION, COLLECTION, or (depending on parent) NAMESPACE / COLLECTION.
  5. Fallback — emit a warning and treat the segment as a COLLECTION.

SINGLETON never falls out of NLP heuristics: real singletons (/me, /health) look identical to singular-noun namespaces, so the kind is reachable only through an explicit hint.
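
The precedence chain can be sketched as a single function with early returns. This is a simplified illustration, not the real classify_segment: the Kind enum values, the string encoding of the NLP signal, the treatment of cumulative_path as including the current segment, and the parent rule for singular nouns are all assumptions. The NLP step is injected as a callable so the example runs without spaCy.

```python
import warnings
from enum import Enum
from typing import Callable, Optional


class Kind(Enum):
    NAMESPACE = "namespace"
    COLLECTION = "collection"
    RESOURCE_ID = "resource_id"
    SINGLETON = "singleton"
    ACTION = "action"


def classify_sketch(
    segment: str,
    cumulative_path: str,
    parent_kind: Optional[Kind],
    ns_registry: set[str],
    extension_hint: Optional[str],
    nlp_signal: Callable[[str], Optional[str]],
) -> Kind:
    # 1. Path-parameter shape always wins.
    if "{" in segment and "}" in segment:
        return Kind.RESOURCE_ID
    # 2. Explicit, pre-merged x-okapipy-kind hint.
    if extension_hint is not None:
        return Kind(extension_hint)
    # 3. Declared namespace paths.
    if cumulative_path in ns_registry:
        return Kind.NAMESPACE
    # 4. NLP signal: "verb", "plural", "singular", or None (assumed encoding).
    signal = nlp_signal(segment)
    if signal == "verb":
        return Kind.ACTION
    if signal == "plural":
        return Kind.COLLECTION
    if signal == "singular":
        # Assumed parent rule; the text only says the result depends on the parent.
        if parent_kind in (None, Kind.NAMESPACE):
            return Kind.NAMESPACE
        return Kind.COLLECTION
    # 5. Fallback: warn and treat the segment as a collection.
    warnings.warn(f"could not classify segment {segment!r}; assuming collection")
    return Kind.COLLECTION
```

Note that SINGLETON is only reachable through step 2 in this sketch, which mirrors the statement above that the kind never falls out of the NLP heuristics.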

classify_segment

classify_segment(
    *,
    segment: str,
    cumulative_path: str,
    parent_kind: SegmentKind | None,
    nlp: Language,
    ns_registry: set[str],
    extension_hint: str | None
) -> SegmentKind

Classify a single path segment into one of five kinds.

Parameters:

  • segment (str, required): The raw segment as it appears between / characters.
  • cumulative_path (str, required): The path so far, joined from previous segments without a leading or trailing slash; used for the namespace-registry lookup.
  • parent_kind (SegmentKind | None, required): The kind of the previous segment, or None when at the root.
  • nlp (Language, required): A loaded spaCy pipeline used for POS and morphology.
  • ns_registry (set[str], required): The union of namespace paths declared by the spec and rules.
  • extension_hint (str | None, required): A pre-merged x-okapipy-kind hint with rules precedence; one of the five kind names, or None.

Returns:

  • SegmentKind: The classified SegmentKind.

builder

Walk an OpenAPI document and produce a populated APIModel tree.

build is the single public entry. It iterates paths, classifies each segment via classify_segment, attaches the corresponding node (Namespace, Collection, Resource, Singleton, or Action) under its parent, and routes the path-item's HTTP methods to operation slots on that node. The function mutates the APIModel and its children in place — there are no draft or wrapper types.

Three concerns live in this module:

  • Naming. contextual_name joins the full breadcrumb of singular collection names accumulated so far, so /organizations/{id}/datasources/{id}/force-reimport yields OrganizationDatasourceForceReimport. Resource names use the breadcrumb for the same reason. singularize reduces a plural collection segment via the spaCy-backed lemmatizer.
  • Node attachment. Each segment is mapped to a node kind by the classifier; _attach then either creates a new child or reuses an existing one with the same name. Namespace-level actions are valid (e.g. /auth/login); InvalidStructureError is raised only when an attachment is structurally impossible.
  • Operation routing. GET/POST on a Collection map to fetch/create; GET/PUT/PATCH/DELETE on a Resource or Singleton map to retrieve/update/partial_update/delete. Operations that don't fit (e.g. POST /users/{id} with no x-okapipy-kind: action hint, PUT on a bare collection) are dropped with a warning rather than coerced into a synthetic action; synthetic actions exist only for explicit x-okapipy-kind: action opt-ins.
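
The operation routing rules can be summarized as a lookup table. The sketch below is an illustration: the string keys for node kinds and the route helper are hypothetical, while the method-to-slot pairs are taken from the routing description above.

```python
from typing import Optional

# Slots shared by Resource and Singleton nodes.
_ITEM_SLOTS = {
    "GET": "retrieve",
    "PUT": "update",
    "PATCH": "partial_update",
    "DELETE": "delete",
}

# (node kind, HTTP method) -> operation slot.
ROUTING_TABLE: dict[tuple[str, str], str] = {
    ("collection", "GET"): "fetch",
    ("collection", "POST"): "create",
}
for _kind in ("resource", "singleton"):
    for _method, _slot in _ITEM_SLOTS.items():
        ROUTING_TABLE[(_kind, _method)] = _slot


def route(node_kind: str, method: str) -> Optional[str]:
    """Return the slot name, or None when the operation does not fit."""
    return ROUTING_TABLE.get((node_kind, method.upper()))
```

A None result corresponds to the "dropped with a warning" case, for example PUT on a bare collection or POST on a resource without an action hint.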

Schema names for request_model / response_model are recovered from the unresolved raw_spec by reading the trailing segment of the original $ref, falling back to the resolved schema's title when no ref is present. x-okapipy-exclude skips whole paths ("*") or specific methods (["DELETE", ...], case-insensitive); rules-file values override spec values on every conflict.
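
Recovering a schema name from an unresolved $ref might look like the following sketch. The helper name schema_name is hypothetical, and returning None when neither a $ref nor a title is present is an assumption about the anonymous-schema case.

```python
from typing import Any, Optional


def schema_name(schema: dict[str, Any]) -> Optional[str]:
    """Name a schema from its $ref tail, else its title, else None."""
    ref = schema.get("$ref")
    if isinstance(ref, str) and ref:
        # "#/components/schemas/Order" -> "Order"
        return ref.rsplit("/", 1)[-1]
    title = schema.get("title")
    return title if isinstance(title, str) and title else None
```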

build

build(
    spec: dict[str, Any],
    rules: Rules,
    nlp: Language,
    *,
    strip_prefix: str | None = None,
    unmatched_namespace: str | None = None
) -> APIModel

Construct an APIModel from an OpenAPI document.

$ref pointers in the spec are left intact: schema names for request_model and response_model are recovered from the trailing segment of each $ref, falling back to inline schema title when no ref is present.

Parameters:

  • spec (dict[str, Any], required): The OpenAPI document, with $refs preserved as in the source.
  • rules (Rules, required): A loaded Rules document (possibly empty).
  • nlp (Language, required): A loaded spaCy pipeline used by the classifier and naming engine.
  • strip_prefix (str | None, default None): Optional path prefix to strip from every path before classification, e.g. /public/v1. When set, this overrides the prefix inferred from servers[].url.
  • unmatched_namespace (str | None, default None): When set, operations that would otherwise be dropped by the routing table are retained as synthetic actions under a top-level namespace of this name. Raises UnmatchedNamespaceCollisionError when the name collides with an existing top-level node identifier.

Returns:

  • APIModel: A populated APIModel.

contextual_name

contextual_name(breadcrumb: list[str], current: str) -> str

Return a contextual PascalCase name built from the full breadcrumb chain.

Every singular collection name and singleton segment accumulated in breadcrumb is concatenated, then the PascalCase form of current is appended. With an empty breadcrumb, only PascalCase(current) is returned.

Namespaces never enter the breadcrumb — they're pure folders and carry no semantic ownership. Singletons do, because the elements they host belong to them (the orders under /me are Me's orders, not generic orders), which also prevents file-name collisions when a top-level collection and a singleton sub-collection share a segment (/orders vs /me/orders).

Examples:

contextual_name([], "orders") == "Orders"
contextual_name(["Order"], "lines") == "OrderLines"
contextual_name(["Me"], "orders") == "MeOrders"
contextual_name(["Organization", "Datasource"], "force-reimport") == "OrganizationDatasourceForceReimport"
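
The naming rule is small enough to reproduce as a standalone sketch. The functions below are stand-ins written to satisfy the documented examples; the actual implementation may differ, for instance in how it handles digits or mixed-case segments.

```python
import re


def pascal_case(segment: str) -> str:
    """Convert a path segment such as force-reimport to ForceReimport."""
    parts = (part for part in re.split(r"[-_]", segment) if part)
    return "".join(part[:1].upper() + part[1:] for part in parts)


def contextual_name_sketch(breadcrumb: list[str], current: str) -> str:
    """Concatenate the breadcrumb, then append PascalCase(current)."""
    return "".join(breadcrumb) + pascal_case(current)
```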

Dumping the tree

dump

Serialize an APIModel to JSON or YAML, inferring the format from a path.

write

write(api: APIModel, path: Path) -> None

Write the APIModel to path, choosing JSON or YAML by file extension.

Parameters:

  • api (APIModel, required): The model to serialize.
  • path (Path, required): Destination file. The extension must be one of .json, .yaml, .yml.

Raises:

  • ValueError: When the file extension is not recognized.
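
Choosing the output format from the file extension could be sketched as follows. The helper format_for is hypothetical and covers only the format decision, not the serialization itself; treating the extension case-insensitively is an assumption.

```python
from pathlib import Path

_FORMATS = {".json": "json", ".yaml": "yaml", ".yml": "yaml"}


def format_for(path: Path) -> str:
    """Return "json" or "yaml" for a destination path, else raise ValueError."""
    try:
        # Assumption: .JSON and .json are treated alike.
        return _FORMATS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unrecognized extension: {path.suffix!r}") from None
```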

to_json

to_json(api: APIModel) -> str

Return a pretty-printed JSON representation of the APIModel.

Errors

Error hierarchy raised by the okapipy structural parser.

ParserError

Bases: Exception

Base class for all errors raised by the structural parser.

SpecLoadError

Bases: ParserError

Raised when the OpenAPI document cannot be loaded, parsed, or validated.

RulesFormatError

Bases: ParserError

Raised when the rules file cannot be read or parsed, or when an x-okapipy-kind value is not one of the legal kinds.

NlpModelMissingError

NlpModelMissingError(lang: str, cache_dir: str)

Bases: ParserError

Raised when the requested spaCy model is unavailable and cannot be downloaded.

Attributes:

  • lang: The ISO language code that was requested.
  • cache_dir: The directory the loader looked in (and would have downloaded into).

lang instance-attribute

lang = lang

cache_dir instance-attribute

cache_dir = cache_dir

InvalidStructureError

Bases: ParserError

Raised when the parsed structure violates the okapipy hierarchy rules.

Namespace-level actions (e.g. /auth/login) are valid, so attaching an Action directly under a Namespace does not raise this error; it is raised only when an attachment is structurally impossible.

UnmatchedNamespaceCollisionError

UnmatchedNamespaceCollisionError(
    requested: str, conflict_kind: str, conflict_name: str
)

Bases: ParserError

Raised when --unmatched <name> collides with an existing top-level node.

The synthesized container for unmatched operations must not share a snake_case identifier with any top-level Namespace, Collection, Singleton, or Action: that would produce two attributes with the same name on the generated client class. The caller picks a different name.
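
The collision check can be illustrated with a small helper that compares snake_case identifiers. This is a sketch under assumptions: the snake_case normalization shown here is a guess at what the generator does, and find_collision with its name-to-kind mapping is a hypothetical interface, not part of the package.

```python
import re
from typing import Optional


def snake_case(name: str) -> str:
    """Assumed normalization: camelCase humps and separators become underscores."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return re.sub(r"[^0-9a-zA-Z]+", "_", spaced).strip("_").lower()


def find_collision(
    requested: str, top_level: dict[str, str]
) -> Optional[tuple[str, str]]:
    """Return (conflict_kind, conflict_name) for the first clash, or None.

    top_level maps each top-level node name to its kind.
    """
    wanted = snake_case(requested)
    for name, kind in top_level.items():
        if snake_case(name) == wanted:
            return kind, name
    return None
```

The returned pair lines up with the conflict_kind and conflict_name attributes listed below, with conflict_name kept in its original, pre-snake_case form.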

Attributes:

  • requested: The name passed via unmatched_namespace.
  • conflict_kind: The kind of the conflicting node ("namespace", "collection", "singleton", or "action").
  • conflict_name: The original (pre-snake_case) name of the conflicting top-level node.

requested instance-attribute

requested = requested

conflict_kind instance-attribute

conflict_kind = conflict_kind

conflict_name instance-attribute

conflict_name = conflict_name