Parser internals

The parser turns a flat OpenAPI 3.x document into a hierarchical APIModel tree. It runs as a linear pipeline of eight phases. Each phase lives in its own module, except phase 3, which is split across extension.py and rules.py. Reading the modules top to bottom in the order listed below is the fastest way to understand the whole thing.

spec source
   │
   ▼
loader.py  ────►  raw_spec + spec   (prance, refs inlined; raw kept for $ref names)
   │
   ▼
nlp.py     ────►  spaCy pipeline    (POS tagging + plural/verb detection)
   │
   ▼
extension.py ───► spec extensions   (x-okapipy-ns, x-okapipy-kind, x-okapipy-exclude, x-okapipy-paginated)
rules.py     ───► rules file        (project-local overrides; same shape, wins)
   │
   ▼
classifier.py ──► segment → kind    (resource | namespace | collection | action | singleton)
   │
   ▼
builder.py  ────► walks paths       (mutates APIModel in place; hooks operations into slots)
   │
   ▼
model.py    ────► APIModel          (Pydantic v2 tree)
   │
   ▼
dump.py     ────► JSON / YAML out   (parse-time visualization)
api.py      ────► public parse()    (the single supported entry point)

Phase 1 — `loader.py`

The module exposes load_spec(source) and load_raw_spec(source). Both accept a local path or an http(s) URL, in JSON or YAML; the format is auto-detected.

load_spec runs prance with the openapi-spec-validator backend and inlines internal and external $refs. Validation runs as part of the load.
load_raw_spec uses prance.BaseParser to keep refs intact. The builder needs the unresolved document to recover original schema names by reading the trailing segment of each $ref.

detect_base_path(spec) reads the path component of the first servers[].url. It does not auto-guess from path commonalities (that heuristic stripped meaningful segments and was removed). If you want a different prefix, pass --strip-prefix on the command line or strip_prefix= to parse().
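
A minimal sketch of the base-path detection described above, using only the standard library. The return conventions (empty string when no server is declared, trailing slash stripped) are assumptions made for the example, not confirmed behavior of the real detect_base_path.

```python
from urllib.parse import urlsplit


def detect_base_path(spec: dict) -> str:
    """Return the path component of the first servers[].url, or "" if absent."""
    servers = spec.get("servers") or []
    if not servers:
        return ""
    url = servers[0].get("url", "")
    # urlsplit also copes with relative server URLs such as "/api/v1".
    path = urlsplit(url).path
    # Assumption: a bare "/" or a trailing slash counts as "no extra prefix".
    return path.rstrip("/")
```

Only the first server entry is consulted; later entries and the paths themselves are never inspected, which matches the "no auto-guessing" rule.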

Phase 2 — `nlp.py`

Two non-obvious tricks worth preserving:

The "the X" wrapper for plural detection

Bare path segments like tokens or users get tagged PROPN with Number=Sing by en_core_web_sm in isolation. Wrapping the segment in a language-specific definite article (PLURAL_CONTEXT) coaxes the tagger into a noun analysis with correct plurality. lemma_in_context() uses the same trick for singularization (so users → user).

Bare for verb detection, wrapper for plural

A clear verb in isolation (reset, submit) keeps its VERB tag; the wrapper would force it to NOUN. So _analyze_token runs both analyses (bare and wrapped) and combines the signals: bare tells us "is this a verb?", wrapped tells us "if this is a noun, is it plural?".
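
The two-analysis idea can be sketched without spaCy by injecting the tagger. Everything here is hypothetical scaffolding: analyze_token stands in for _analyze_token, the Tagger callable stands in for the spaCy pipeline, and fake_tagger is a hard-coded table that mirrors the behavior described in the text rather than real model output.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A tagger maps a text to (coarse POS tag, Number feature) for its last token.
# The real module would back this with spaCy; it is injected here so the
# sketch runs without a model download.
Tagger = Callable[[str], Tuple[str, str]]


@dataclass(frozen=True)
class TokenSignals:
    is_verb: bool    # taken from the bare analysis
    is_plural: bool  # taken from the wrapped analysis


def analyze_token(segment: str, tag: Tagger, article: str = "the") -> TokenSignals:
    bare_pos, _ = tag(segment)                        # "is this a verb?"
    _, wrapped_number = tag(f"{article} {segment}")   # "if a noun, is it plural?"
    return TokenSignals(is_verb=bare_pos == "VERB", is_plural=wrapped_number == "Plur")


def fake_tagger(text: str) -> Tuple[str, str]:
    # Illustrative values only, chosen to match the prose above.
    table = {
        "users": ("PROPN", "Sing"),
        "the users": ("NOUN", "Plur"),
        "reset": ("VERB", ""),
        "the reset": ("NOUN", "Sing"),
    }
    return table[text]
```

With this stub, analyze_token("users", fake_tagger) reports a plural non-verb and analyze_token("reset", fake_tagger) reports a singular verb, which is the split of responsibilities the two analyses are meant to provide.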

Model cache

Models live at <cache_dir>/<package>/<package>-<version>/. model_path() resolves the inner versioned dir. Default cache_dir is Path.cwd() / ".spacy". A cache miss triggers python -m spacy download <model> --target <dir> automatically (no opt-in flag). The download is idempotent, so concurrent processes writing to the same cache directory are fine.
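
The cache layout can be illustrated with a few lines of pathlib. The signature below is a guess for illustration (the real model_path() may take different arguments), and the version string in the usage note is just an example value.

```python
from pathlib import Path
from typing import Optional


def model_path(package: str, version: str, cache_dir: Optional[Path] = None) -> Path:
    """Build <cache_dir>/<package>/<package>-<version> without touching disk."""
    root = cache_dir if cache_dir is not None else Path.cwd() / ".spacy"
    return root / package / f"{package}-{version}"


def is_cached(path: Path) -> bool:
    # A miss here is the condition that would trigger the automatic download.
    return path.is_dir()
```

For example, model_path("en_core_web_sm", "3.7.1", Path("/tmp/cache")) yields /tmp/cache/en_core_web_sm/en_core_web_sm-3.7.1.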

Phase 3 — `extension.py` + `rules.py`

These read the same kind of hint from two places:

extension.py reads x-okapipy-ns, x-okapipy-kind, x-okapipy-exclude, x-okapipy-paginated from the spec.
rules.py reads a project-local rules file (Rules / PathRules / OperationRules Pydantic models) that mirrors the same shape.

Rules-file values win over spec on every conflict. This is the right precedence: the rules file exists precisely for the case where you can't (or shouldn't) edit the upstream spec.

The rules file is local-only (no URL support). That's a deliberate constraint; project-local overrides should be in your repo, not on a remote server.
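
The "rules win" precedence amounts to an overlay merge. This is a sketch on plain dicts, assuming a rules value of None means "not set"; the real code works on the Rules / PathRules / OperationRules models, and merge_hints is a hypothetical name.

```python
from typing import Any, Dict


def merge_hints(spec_hints: Dict[str, Any], rule_hints: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay rules-file hints on spec hints; rules win on every conflict."""
    merged = dict(spec_hints)
    # Only keys the rules file actually sets override the spec.
    merged.update({key: value for key, value in rule_hints.items() if value is not None})
    return merged
```

Hints that appear only in the spec survive the merge, so a rules file needs to list just the overrides.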

Phase 4 — `classifier.py`

classify_segment(...) decides if a single segment is NAMESPACE | COLLECTION | RESOURCE_ID | ACTION | SINGLETON. Precedence:

Path parameter ({id}) → RESOURCE_ID.
Explicit x-okapipy-kind (rules first, then spec).
Namespace registry (x-okapipy-ns).
spaCy POS / morphology:
    Verb in isolation → ACTION.
    Plural noun → COLLECTION.
    Multi-word kebab segment whose head is not plural → verb-phrase action (e.g. force-reimport).
    Single-token API verb in the language registry (login, logout, refresh, revoke, verify, subscribe, unsubscribe, activate, deactivate, enable, disable, archive, publish, ping, …) → ACTION.
Fallback → COLLECTION + warn.

A multi-word segment whose head is not plural is treated as a verb phrase — that's how force-reimport becomes an action while password-recovery-requests stays a collection.
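
The precedence chain reads naturally as a sequence of early returns. The sketch below is a simplification under stated assumptions: the NLP signals arrive as precomputed booleans instead of coming from spaCy, the keyword arguments are invented for the example, and SINGLETON is only reachable through an explicit kind because the list above names no heuristic for it.

```python
from enum import Enum
from typing import Optional


class Kind(Enum):
    NAMESPACE = "namespace"
    COLLECTION = "collection"
    RESOURCE_ID = "resource_id"
    ACTION = "action"
    SINGLETON = "singleton"


def classify_segment(
    segment: str,
    *,
    explicit_kind: Optional[Kind] = None,     # x-okapipy-kind, rules already merged over spec
    namespaces: frozenset = frozenset(),      # segments registered via x-okapipy-ns
    is_verb: bool = False,                    # bare-token POS signal
    is_plural: bool = False,                  # wrapped-token plurality (head word for kebab segments)
    verb_registry: frozenset = frozenset(),   # known API verbs for the language
) -> Kind:
    if segment.startswith("{") and segment.endswith("}"):
        return Kind.RESOURCE_ID
    if explicit_kind is not None:
        return explicit_kind
    if segment in namespaces:
        return Kind.NAMESPACE
    if is_verb:
        return Kind.ACTION
    if is_plural:
        return Kind.COLLECTION
    if "-" in segment:
        return Kind.ACTION  # multi-word, head not plural: verb phrase
    if segment in verb_registry:
        return Kind.ACTION
    return Kind.COLLECTION  # fallback; the real classifier also logs a warning
```

Because the path-parameter check comes first, an explicit kind cannot reclassify {id}, and because the plural check precedes the kebab check, password-recovery-requests stays a collection while force-reimport becomes an action.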

Phase 5 — `builder.py`

Walks paths and mutates Pydantic models in place. There are no draft/wrapper types. Key invariants:

Naming

contextual_name(breadcrumb, current) joins the full breadcrumb (every singular collection or singleton name accumulated so far). So:

/organizations/{id}/datasources/{id}/force-reimport → OrganizationDatasourceForceReimport.
/me/orders/{id} → MeOrders collection and MeOrder resource — the singleton contributes Me to the breadcrumb so the sub-collection doesn't collide with a top-level /orders.
Resource names use "".join(breadcrumb) for the same reason.

Namespaces don't enter the breadcrumb — they're folders, not type names. /auth/login is Login; /users/{id}/avatar is UserAvatar.
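
A sketch of the naming rule, assuming the breadcrumb already holds singular names and that namespaces were never pushed onto it. The pascal helper is hypothetical; only contextual_name(breadcrumb, current) is named in the text, and its real implementation may differ.

```python
from typing import Sequence


def pascal(segment: str) -> str:
    """force-reimport -> ForceReimport; user -> User."""
    parts = segment.replace("_", "-").split("-")
    return "".join(part.capitalize() for part in parts if part)


def contextual_name(breadcrumb: Sequence[str], current: str) -> str:
    """Join every breadcrumb entry plus the current segment into one type name."""
    return "".join(pascal(part) for part in [*breadcrumb, current])
```

For /auth/login the breadcrumb is empty because auth is a namespace, so contextual_name([], "login") gives Login, while contextual_name(["me"], "orders") gives MeOrders.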

A literal . in a path segment (e.g. /.well-known/openid-configuration) is expanded to the word Dot so the resulting identifier is valid Python: DotWellKnown (PascalCase) and dot_well_known (snake_case). The raw segment is preserved on the parsed Namespace.name for HTTP routing; sanitization happens at render time in the generator.
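
The dot expansion can be sketched as a word-splitting step shared by both casings. The helper names are hypothetical, and the real generator may handle other punctuation differently.

```python
import re
from typing import List


def _words(segment: str) -> List[str]:
    # Expand a literal "." to the word "dot", then split on non-alphanumerics.
    expanded = segment.replace(".", " dot ")
    return [word.lower() for word in re.split(r"[^0-9A-Za-z]+", expanded) if word]


def pascal_identifier(segment: str) -> str:
    return "".join(word.capitalize() for word in _words(segment))


def snake_identifier(segment: str) -> str:
    return "_".join(_words(segment))
```

The raw segment (.well-known) is left untouched by these helpers, which is consistent with keeping it on Namespace.name for routing and sanitizing only at render time.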

Operation routing

Terminal kind	GET	POST	PUT	PATCH	DELETE
Collection	`fetch`	`create`	dropped	dropped	dropped
Resource	`retrieve`	dropped	`update`	`partial_update`	`delete`
Singleton	`retrieve`	dropped	`update`	`partial_update`	`delete`
Action	appended to `Action.operations` (one Action per path holds every method on it)

Anything that doesn't fit (e.g. POST /users/{id} with no x-okapipy-kind: action hint, PUT on a bare collection) is dropped with a warning, not coerced into a synthetic action. By default, synthetic actions only exist for explicit x-okapipy-kind: action opt-ins.

Pass --unmatched <name> to opt out of the drop and keep those operations as flat actions under a synthetic top-level namespace — useful when you don't own the spec and per-op annotation isn't practical. See Rules and extensions for the worked example.
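
The routing table above maps directly onto a nested lookup. This sketch uses lowercase strings for kinds and returns None for a dropped operation; both conventions are assumptions, and the --unmatched escape hatch is deliberately left out.

```python
from typing import Optional

_ITEM_SLOTS = {"GET": "retrieve", "PUT": "update", "PATCH": "partial_update", "DELETE": "delete"}

ROUTING = {
    "collection": {"GET": "fetch", "POST": "create"},
    "resource": _ITEM_SLOTS,
    "singleton": _ITEM_SLOTS,
}


def route(kind: str, method: str) -> Optional[str]:
    """Return the slot for (kind, method), or None when the operation is dropped."""
    if kind == "action":
        return "operations"  # every method is appended to Action.operations
    return ROUTING.get(kind, {}).get(method.upper())
```

A None result corresponds to the "dropped with a warning" case, for example route("resource", "POST").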

Allowed structural shapes

_attach enforces a deliberately strict parent table. Singletons and resources are interchangeable as parents — a singleton is "a resource without an {id}", so what hangs off a resource also hangs off a singleton:

Child kind	Allowed parents
`Namespace`	root, `Namespace`
`Collection`	root, `Namespace`, `Resource`, `Singleton`
`Resource` (the `{id}` slot)	`Collection` only
`Singleton`	root, `Namespace`, `Collection`, `Resource`, `Singleton`
`Action`	root, `Namespace`, `Collection`, `Resource`, `Singleton`

Two real-world shapes this unlocks:

Collection under singleton — /me/orders, /orgs/current/members, /workspaces/current/tag-keys. The singleton models a pseudo-resource (the current org, the current user); its sub-collections list the things that belong to it.
Singleton under collection — /orders/stats, /datasets/summary, /workspaces/current/secrets/encrypted. The singleton is a view derived from the collection, not one of its items.

What's not allowed: collection-under-collection and any sub-element under an Action (actions are leaves). Both raise InvalidStructureError, which the builder catches and logs as a warning. The workaround for genuine collection-under-collection cases is x-okapipy-exclude: '*' on the offending paths.
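
The parent table can be expressed as a dict of allowed parents per child kind. This is a sketch of the check only: the InvalidStructureError class here is a local stand-in, kinds are plain strings, and check_attach is a hypothetical name rather than the real _attach.

```python
class InvalidStructureError(Exception):
    """Local stand-in for the builder's structural error."""


_ANYWHERE = {"root", "Namespace", "Collection", "Resource", "Singleton"}

ALLOWED_PARENTS = {
    "Namespace": {"root", "Namespace"},
    "Collection": {"root", "Namespace", "Resource", "Singleton"},
    "Resource": {"Collection"},
    "Singleton": _ANYWHERE,
    "Action": _ANYWHERE,
}


def is_allowed(child: str, parent: str) -> bool:
    return parent in ALLOWED_PARENTS[child]


def check_attach(child: str, parent: str) -> None:
    if not is_allowed(child, parent):
        raise InvalidStructureError(f"{child} cannot be attached under {parent}")
```

Since "Action" never appears as a parent in any row, actions are leaves by construction, and collection-under-collection fails because "Collection" is absent from the Collection row.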

Forbidden combinations

A segment that is itself registered as a namespace cannot double as an action: classifying such a segment as an action raises InvalidStructureError. Actions attached to a namespace, such as /auth/refresh, are fine; what is forbidden is treating the namespace itself as an action.

Schema name recovery

request_model / response_model names are recovered from raw_spec by reading the original $ref's trailing segment. If a 2xx response has no $ref (inline schema), the resolved schema's title is used. If neither exists, the field is None — Operation.response_model is str | None precisely because some 2xx responses have no body.
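
The fallback order ($ref name, then title, then None) fits in one small function. This sketch takes the raw and resolved schema dicts for the same response as inputs; schema_name is a hypothetical helper, not the builder's actual function.

```python
from typing import Optional


def schema_name(raw_schema: Optional[dict], resolved_schema: Optional[dict]) -> Optional[str]:
    """Prefer the $ref's trailing segment, then the resolved title, else None."""
    if raw_schema and "$ref" in raw_schema:
        # "#/components/schemas/User" -> "User"
        return raw_schema["$ref"].rsplit("/", 1)[-1]
    if resolved_schema:
        return resolved_schema.get("title")
    return None
```

A 204 response has no schema at all, so both arguments are None and the result is None, which is why the field has to be optional.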

Excludes

x-okapipy-exclude: "*" skips a whole path.
x-okapipy-exclude: [DELETE, ...] skips just those methods (case-insensitive).
Rules-file values override spec values.
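
The three exclude rules can be sketched as two helpers. Treating a bare string other than "*" as a single method name is an assumption added for robustness, and both function names are hypothetical.

```python
from typing import Iterable, Optional, Union

Exclude = Optional[Union[str, Iterable[str]]]


def effective_exclude(spec_value: Exclude, rules_value: Exclude) -> Exclude:
    # The rules file overrides the spec whenever it sets a value.
    return rules_value if rules_value is not None else spec_value


def is_excluded(method: str, exclude: Exclude) -> bool:
    if exclude is None:
        return False
    if exclude == "*":
        return True  # the whole path is skipped
    if isinstance(exclude, str):
        exclude = [exclude]  # assumption: a lone string names one method
    return method.upper() in {name.upper() for name in exclude}
```

Note that an override can narrow as well as widen: a spec-level "*" replaced by a rules-level [DELETE] brings the other methods back.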

Phase 6 — `model.py`

Pydantic v2 models. Deliberate deviations from the spec in parser.md §6:

APIModel carries top-level collections: list[Collection]. Real APIs commonly expose /orders with no namespace prefix — forcing a fake namespace would distort the tree.
Operation.response_model is str | None — some 2xx responses have no body (DELETE returning 204, for example).
Collection.fetch and Collection.create are the slot names (renamed from the original list_operation / create_operation). Shorter, and list is a builtin you don't want to shadow.

Phase 7 — `dump.py`

write(api, path) infers JSON vs YAML from .json / .yaml / .yml. Anything else raises ValueError. There's also to_json(api) for the "just give me a string" case used by the CLI.
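
The suffix-based format inference can be sketched as a lookup that raises on anything unknown. infer_format is a hypothetical helper; whether the real write() matches suffixes case-insensitively is not stated, so this sketch compares them as written.

```python
from pathlib import Path
from typing import Union

_FORMATS = {".json": "json", ".yaml": "yaml", ".yml": "yaml"}


def infer_format(path: Union[str, Path]) -> str:
    """Map an output path to "json" or "yaml"; reject every other extension."""
    suffix = Path(path).suffix
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None
```

Failing loudly on an unknown suffix avoids silently writing JSON into a file the caller expected to be YAML.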

Phase 8 — `api.py`

The single public entry point:

def parse(
    source: str | Path,
    rules: str | Path | None = None,
    lang: str = "en",
    *,
    strip_prefix: str | None = None,
    nlp_cache_dir: Path = DEFAULT_NLP_CACHE_DIR,
) -> APIModel: ...

parse returns APIModel directly — no result wrapper, no error tuple. Non-fatal warnings go through logging; fatal problems raise a ParserError subclass (caught by the CLI and surfaced as a non-zero exit).
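
The error contract (return the model, log warnings, raise on fatal problems) leads to a small CLI wrapper like the sketch below. ParserError here is a local stand-in, run_cli is a hypothetical name, the parse callable is injected so the example runs without the package, and the exit code 1 is an assumption since the text only says "non-zero".

```python
import sys
from typing import Callable


class ParserError(Exception):
    """Local stand-in for the parser's fatal-error base class."""


def run_cli(parse: Callable[[str], object], source: str) -> int:
    """Call parse(source) and translate fatal errors into an exit code."""
    try:
        parse(source)
    except ParserError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # assumption: any non-zero value would satisfy the contract
    return 0
```

Only ParserError subclasses are caught; any other exception propagates, which keeps genuine bugs visible instead of folding them into a generic failure.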

The full API reference is auto-generated from the docstrings.

Adding a new heuristic

Most spec-shape oddities should be solvable with extensions or rules without touching parser code. Reach for a parser change only when:

The hint mechanism doesn't yet cover the case (rare — the four extensions cover essentially every classification override). In that case, add it in extension.py first, mirror it in rules.py, and thread it through classifier.py and builder.py.
spaCy is reliably wrong on a class of segments. Look at nlp.py's VERB_REGISTRY — that's the right place for "a known verb the small pipeline mistags as a noun." Anything fancier (custom matchers, domain-specific NER) is overkill for a CLI tool's startup budget.

When in doubt, add a fixture under tests/fixtures/ exhibiting the spec shape, write a test that asserts the correct result (it fails against the current parser), then turn the test green by changing the parser. The fixture stays in the suite forever as a regression guard.