Building a Markdown Compiler Agent for OpenAPI-to-Terraform Mappers
Written by
Nova Neural
I got obsessed with a very specific workflow bug: every time my team exported an OpenAPI spec into Terraform-friendly artifacts, the “glue” logic lived in hand-written Markdown docs and ad-hoc scripts. The docs were the source of truth—but the scripts were the ones people actually ran. That mismatch caused subtle drift: sometimes an endpoint got updated in OpenAPI, but the Markdown examples (and therefore the generated mappings) lagged behind.
So I built a domain-specific LLM pipeline that treats a particular kind of Markdown as an input language—specifically, YAML payloads under <!--openapi-to-terraform--> markers—and turns them into a deterministic Terraform mapping plan.
The main idea: don’t build a generic “chatbot.” Build a small “compiler” that understands a tiny, boring DSL embedded in Markdown, and only then let the LLM help with the messy bits (schema-to-type mapping, naming conventions, and example normalization).
What I built: a Markdown-to-Terraform compiler
My input Markdown looks like this:
- It contains sections like:

```markdown
<!--openapi-to-terraform-->
## endpoint: GET /v1/widgets/{widgetId}
resource: aws_lambda_function.widgets_get_widget
request:
  path:
    widgetId: string
response:
  status: 200
  body:
    type: Widget
    schemaRef: components.schemas.Widget
mappingRules:
  - name: widgetId
    from: path.widgetId
    as: var.widget_id
```

- The compiler:
  - Extracts those sections.
  - Parses the YAML payload under each `<!--openapi-to-terraform-->` block.
  - Uses an LLM to resolve `schemaRef` into Terraform type expressions.
  - Emits a Terraform mapping file with stable output formatting.
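To make the contract concrete, here's a minimal standalone sketch (not part of the compiler itself) of how one such block parses: pull the endpoint out of the `## endpoint:` header line, then feed the remainder to a YAML parser.

```python
import re
import yaml  # PyYAML

block = """\
## endpoint: GET /v1/widgets/{widgetId}
resource: aws_lambda_function.widgets_get_widget
request:
  path:
    widgetId: string
mappingRules:
  - name: widgetId
    from: path.widgetId
    as: var.widget_id
"""

# The header line carries the endpoint; everything after it is plain YAML.
header = re.search(r"^##\s*endpoint:\s*(.+?)\s*$", block, re.MULTILINE)
endpoint = header.group(1)
yaml_part = re.sub(r"^##\s*endpoint:.*$", "", block, flags=re.MULTILINE)
data = yaml.safe_load(yaml_part)

print(endpoint)                       # GET /v1/widgets/{widgetId}
print(data["resource"])               # aws_lambda_function.widgets_get_widget
print(data["mappingRules"][0]["as"])  # var.widget_id
```

This split—Markdown header for human readability, YAML body for machine parsing—is what makes the DSL feel at home in docs while staying mechanically extractable.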
The niche part is the Markdown compiler: most domain-specific LLM efforts stop at RAG (retrieval) or generic codegen. Mine treats Markdown as a formal-ish interface and compiles it like a program.
The “domain model” I taught the system
I needed an explicit contract so the LLM didn’t improvise.
Terms
- OpenAPI: an API specification format describing endpoints and JSON schemas.
- Terraform: an infrastructure-as-code tool; it wants typed expressions like `string`, `map(string)`, `object({ ... })`, etc.
- Domain DSL: the mini language I defined inside Markdown.
The strict input/output shape
The Markdown compiler always produces:
- `plan.tf.json` (Terraform JSON form, so it's easy to validate)
- a `diagnostics` list for any unresolved schemas
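One way to enforce that strict shape is to validate the plan before writing it. Here's a sketch using pydantic (which appears in requirements.txt but goes unused in main.py as shown); the model names `Diagnostic`, `MappingPlan`, and `Plan` are my own, and this assumes pydantic v2.

```python
from typing import Any, Dict, List
from pydantic import BaseModel

# Hypothetical models describing the compiler's output contract.
class Diagnostic(BaseModel):
    endpoint: str
    schemaRef: str
    resolution: str
    explanation: str

class MappingPlan(BaseModel):
    version: int
    mappings: List[Dict[str, Any]]

class Plan(BaseModel):
    terraform: Dict[str, MappingPlan]
    diagnostics: List[Diagnostic]

# Validation raises if the compiler ever emits a malformed plan.
plan = Plan.model_validate({
    "terraform": {"mappingPlan": {"version": 1, "mappings": []}},
    "diagnostics": [{
        "endpoint": "GET /v1/widgets/{widgetId}",
        "schemaRef": "components.schemas.Widget",
        "resolution": "object({id=string, name=string, tags=list(string)})",
        "explanation": "deterministic fallback",
    }],
})
print(plan.terraform["mappingPlan"].version)  # 1
```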
End-to-end code (working)
Below is a complete, runnable example in Python that:
- extracts the special Markdown sections
- parses the YAML inside them
- calls an LLM to resolve a schema reference into a Terraform type
- writes a Terraform plan JSON
This uses OpenAI’s API, but the structure works the same with any chat-completions provider.
1) Project layout
```
md-openapi-terraform-compiler/
  main.py
  sample.md
  schemas.json
  requirements.txt
```
requirements.txt:
```
openai
pyyaml
pydantic
```
Install:
```
pip install -r requirements.txt
```
2) Sample Markdown input
sample.md:
```markdown
# Release Notes

<!--openapi-to-terraform-->
## endpoint: GET /v1/widgets/{widgetId}
resource: aws_lambda_function.widgets_get_widget
request:
  path:
    widgetId: string
response:
  status: 200
  body:
    type: Widget
    schemaRef: components.schemas.Widget
mappingRules:
  - name: widgetId
    from: path.widgetId
    as: var.widget_id

<!--openapi-to-terraform-->
## endpoint: POST /v1/widgets
resource: aws_lambda_function.widgets_post_widget
request:
  body:
    schemaRef: components.schemas.NewWidget
response:
  status: 201
  body:
    schemaRef: components.schemas.WidgetCreated
mappingRules:
  - name: payload
    from: body
    as: var.new_widget
```
3) A tiny OpenAPI schemas “index”
I’m simulating the OpenAPI schema registry using a local JSON file that maps schema refs to a simplified JSON schema representation.
schemas.json:
```json
{
  "components.schemas.Widget": {
    "type": "object",
    "properties": {
      "id": { "type": "string" },
      "name": { "type": "string" },
      "tags": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["id", "name"]
  },
  "components.schemas.NewWidget": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "tags": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["name"]
  },
  "components.schemas.WidgetCreated": {
    "type": "object",
    "properties": {
      "widget": { "$ref": "components.schemas.Widget" },
      "createdAt": { "type": "string", "format": "date-time" }
    },
    "required": ["widget", "createdAt"]
  }
}
```
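Note that `WidgetCreated` nests a `$ref` to `Widget`. A minimal sketch of inlining refs against this index—my own helper for illustration, not part of main.py—looks like:

```python
schemas = {
    "components.schemas.Widget": {
        "type": "object",
        "properties": {"id": {"type": "string"}, "name": {"type": "string"}},
        "required": ["id", "name"],
    },
    "components.schemas.WidgetCreated": {
        "type": "object",
        "properties": {
            "widget": {"$ref": "components.schemas.Widget"},
            "createdAt": {"type": "string", "format": "date-time"},
        },
        "required": ["widget", "createdAt"],
    },
}

def inline_refs(schema, index, depth=5):
    """Recursively replace {"$ref": name} nodes with the referenced schema."""
    if depth <= 0 or not isinstance(schema, dict):
        return schema
    if "$ref" in schema:
        return inline_refs(index.get(schema["$ref"], {}), index, depth - 1)
    return {
        k: inline_refs(v, index, depth - 1) if isinstance(v, dict) else v
        for k, v in schema.items()
    }

resolved = inline_refs(schemas["components.schemas.WidgetCreated"], schemas)
print(resolved["properties"]["widget"]["type"])  # object
```

With refs inlined up front, even the deterministic fallback could type `WidgetCreated` fully; in the pipeline as built, that job is delegated to the LLM instead.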
4) The compiler (main.py)
```python
import json
import os
import re
from typing import Any, Dict, List

import yaml
from openai import OpenAI

OPENAPI_BLOCK_RE = re.compile(
    r"<!--openapi-to-terraform-->\s*(.*?)\s*(?=<!--openapi-to-terraform-->|$)",
    re.DOTALL,
)
ENDPOINT_HEADER_RE = re.compile(r"^##\s*endpoint:\s*(.+?)\s*$", re.MULTILINE)


def extract_blocks(markdown: str) -> List[str]:
    return [m.group(1).strip() for m in OPENAPI_BLOCK_RE.finditer(markdown)]


def parse_block(block: str) -> Dict[str, Any]:
    """
    Each block starts with:
        ## endpoint: GET /v1/widgets/{widgetId}
    then a YAML document with keys like resource, request, response, mappingRules.
    """
    header_match = ENDPOINT_HEADER_RE.search(block)
    if not header_match:
        raise ValueError("Missing '## endpoint: ...' header in openapi-to-terraform block")
    endpoint = header_match.group(1).strip()

    # Remove the markdown header line so the rest is pure YAML
    block_wo_header = re.sub(
        r"^##\s*endpoint:\s*.+?\s*$", "", block, flags=re.MULTILINE
    ).strip()
    data = yaml.safe_load(block_wo_header) or {}
    data["endpoint"] = endpoint
    return data


def terraform_type_from_json_schema(schema: Dict[str, Any], max_depth: int = 5) -> str:
    """
    Deterministic fallback mapper. The LLM handles tricky $ref cases,
    but we still want a local, testable mapper.
    """
    if max_depth <= 0:
        return "any"
    if "$ref" in schema:
        # Leave unresolved for the LLM.
        return "any"
    t = schema.get("type")
    if t == "string":
        # Terraform has no datetime type; date-time formats stay strings.
        return "string"
    if t in ("integer", "number"):
        return "number"
    if t == "boolean":
        return "bool"
    if t == "array":
        items = schema.get("items", {})
        return f"list({terraform_type_from_json_schema(items, max_depth=max_depth - 1)})"
    if t == "object":
        props = schema.get("properties", {})
        # Objects without declared properties become a permissive map(string),
        # which avoids brittle plans; objects with properties become a
        # strict object({ ... }).
        if not props:
            return "map(string)"
        fields = [
            f"{k}={terraform_type_from_json_schema(v, max_depth=max_depth - 1)}"
            for k, v in props.items()
        ]
        return "object({" + ", ".join(fields) + "})"
    return "any"


def resolve_schema_ref_with_llm(
    client: OpenAI, schemas_index: Dict[str, Any], schema_ref: str
) -> Dict[str, str]:
    """
    Ask the LLM to produce a Terraform type expression for a given OpenAPI
    schema ref. The LLM sees:
      - the schema_ref name
      - the referenced schema JSON
      - the deterministic fallback and the mapping rules
    Output is structured JSON with:
      - terraform_type
      - explanation (for diagnostics)
    """
    if schema_ref not in schemas_index:
        return {"terraform_type": "any", "explanation": "schemaRef not found in index"}

    schema = schemas_index[schema_ref]
    fallback = terraform_type_from_json_schema(schema)

    system = (
        "You are a domain-specific compiler for OpenAPI schemas into Terraform type expressions. "
        "Return only valid JSON that matches the output schema."
    )
    user = {
        "schemaRef": schema_ref,
        "openapiSchema": schema,
        "deterministicFallback": fallback,
        "rules": [
            "Map OpenAPI primitive types to Terraform: string->string, integer/number->number, boolean->bool.",
            "Map arrays to list(<itemType>).",
            "Map objects to object({key=type,...}). If schema has no properties, use map(string).",
            "If schema contains $ref, resolve it conceptually to the referenced schema type if possible; otherwise use any.",
            "Prefer correctness over strictness; using 'any' is acceptable when uncertain.",
        ],
        "outputFormat": {"terraform_type": "string", "explanation": "string"},
    }

    # NOTE: replace `gpt-4.1-mini` with your preferred model
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": json.dumps(user)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)


def compile_markdown_to_terraform_plan(
    markdown_path: str, schemas_path: str, out_path: str, api_key: str
) -> None:
    client = OpenAI(api_key=api_key)
    with open(markdown_path, "r", encoding="utf-8") as f:
        markdown = f.read()
    with open(schemas_path, "r", encoding="utf-8") as f:
        schemas_index = json.load(f)

    blocks = extract_blocks(markdown)
    if not blocks:
        raise ValueError("No <!--openapi-to-terraform--> blocks found")

    compiled: Dict[str, Any] = {"version": 1, "mappings": []}
    diagnostics: List[Dict[str, str]] = []

    for block in blocks:
        data = parse_block(block)
        endpoint = data["endpoint"]
        request = data.get("request") or {}
        response = data.get("response") or {}

        mapping_entry: Dict[str, Any] = {
            "endpoint": endpoint,
            "resource": data.get("resource"),
            "request": {},
            "response": {},
            "mappingRules": data.get("mappingRules", []),
        }

        # Resolve schemaRef for request body if present
        req_body = request.get("body") if isinstance(request, dict) else None
        if isinstance(req_body, dict) and "schemaRef" in req_body:
            schema_ref = req_body["schemaRef"]
            resolved = resolve_schema_ref_with_llm(client, schemas_index, schema_ref)
            mapping_entry["request"]["bodyTerraformType"] = resolved["terraform_type"]
            diagnostics.append({
                "endpoint": endpoint,
                "schemaRef": schema_ref,
                "resolution": resolved["terraform_type"],
                "explanation": resolved["explanation"],
            })

        # Path params are already simple primitives in this DSL
        if isinstance(request, dict) and "path" in request:
            mapping_entry["request"]["path"] = request["path"]

        # Resolve schemaRef for response body if present
        resp_body = response.get("body") if isinstance(response, dict) else None
        if isinstance(resp_body, dict) and "schemaRef" in resp_body:
            schema_ref = resp_body["schemaRef"]
            resolved = resolve_schema_ref_with_llm(client, schemas_index, schema_ref)
            mapping_entry["response"]["bodyTerraformType"] = resolved["terraform_type"]
            diagnostics.append({
                "endpoint": endpoint,
                "schemaRef": schema_ref,
                "resolution": resolved["terraform_type"],
                "explanation": resolved["explanation"],
            })

        compiled["mappings"].append(mapping_entry)

    plan = {
        "terraform": {
            # Lightweight representation. In real life you might generate
            # HCL or provider-specific resources.
            "mappingPlan": compiled
        },
        "diagnostics": diagnostics,
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(plan, f, indent=2)
    print(f"Wrote {out_path}")


if __name__ == "__main__":
    # Usage:
    #   export OPENAI_API_KEY="..."
    #   python main.py
    compile_markdown_to_terraform_plan(
        markdown_path="sample.md",
        schemas_path="schemas.json",
        out_path="plan.tf.json",
        api_key=os.environ["OPENAI_API_KEY"],
    )
```
What happens when I run it
```
export OPENAI_API_KEY="YOUR_KEY"
python main.py
```
It reads `sample.md`, finds both `<!--openapi-to-terraform-->` blocks, and parses each YAML payload. Then it calls the model for each `schemaRef` it sees:

- `components.schemas.Widget` → generates a Terraform type for an object with `id`, `name`, and `tags`
- `components.schemas.NewWidget` → generates a Terraform type for the request body
- `components.schemas.WidgetCreated` → generates a type for the response body, possibly resolving a `$ref` inside
Finally, it writes `plan.tf.json`.
A typical output (shape) looks like:
```json
{
  "terraform": {
    "mappingPlan": {
      "version": 1,
      "mappings": [
        {
          "endpoint": "GET /v1/widgets/{widgetId}",
          "resource": "aws_lambda_function.widgets_get_widget",
          "request": {
            "path": { "widgetId": "string" }
          },
          "response": {
            "bodyTerraformType": "object({id=string, name=string, tags=list(string)})"
          },
          "mappingRules": [
            { "name": "widgetId", "from": "path.widgetId", "as": "var.widget_id" }
          ]
        }
      ]
    }
  },
  "diagnostics": [
    {
      "endpoint": "GET /v1/widgets/{widgetId}",
      "schemaRef": "components.schemas.Widget",
      "resolution": "object({id=string, name=string, tags=list(string)})",
      "explanation": "Mapped OpenAPI object properties to Terraform object attributes; array tags to list(string)."
    }
  ]
}
```
Why this is “domain-specific” (and not generic codegen)
The LLM isn’t free-form. It’s constrained to answer one question:
“Given this specific `schemaRef`, what Terraform type expression should I use?”
Meanwhile, the compiler logic is deterministic:
- Markdown extraction
- YAML parsing
- plan structure
- stable JSON output
That combination matters. In my experiments, letting the model “write the plan” directly led to unstable key ordering, inconsistent schemas, and occasional hallucinated fields. Keeping the compiler strict made the system reliable even when the LLM was uncertain (it falls back to `any`).
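One concrete trick for stable output—my addition here, not in main.py as shown—is to serialize with sorted keys so the emitted plan is byte-identical across runs regardless of insertion order, which keeps diffs clean in code review:

```python
import json

# Same logical plan, two different key-insertion orders.
plan_a = {"diagnostics": [], "terraform": {"mappingPlan": {"version": 1, "mappings": []}}}
plan_b = {"terraform": {"mappingPlan": {"mappings": [], "version": 1}}, "diagnostics": []}

# sort_keys makes both serialize to the exact same bytes on disk.
out_a = json.dumps(plan_a, indent=2, sort_keys=True)
out_b = json.dumps(plan_b, indent=2, sort_keys=True)
print(out_a == out_b)  # True
```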
A small reliability upgrade I added: deterministic fallback
Notice the `terraform_type_from_json_schema` function. It generates a best-effort Terraform type without calling the LLM. I used it as:

- a baseline shown to the model (`deterministicFallback`)
- a guardrail for cases where the `schemaRef` is missing from the index
This pattern is a practical way to keep a domain-specific LLM grounded in local, testable logic.
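For instance, the fallback mapper can be pinned down with plain unit assertions—no API key, no network. This snippet repeats a trimmed copy of the function so it runs standalone:

```python
from typing import Any, Dict

def terraform_type(schema: Dict[str, Any], depth: int = 5) -> str:
    # Trimmed copy of the deterministic fallback, for standalone testing.
    if depth <= 0 or "$ref" in schema:
        return "any"
    t = schema.get("type")
    if t == "string":
        return "string"
    if t in ("integer", "number"):
        return "number"
    if t == "boolean":
        return "bool"
    if t == "array":
        return f"list({terraform_type(schema.get('items', {}), depth - 1)})"
    if t == "object":
        props = schema.get("properties", {})
        if not props:
            return "map(string)"
        fields = ", ".join(f"{k}={terraform_type(v, depth - 1)}" for k, v in props.items())
        return "object({" + fields + "})"
    return "any"

widget = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
print(terraform_type(widget))  # object({id=string, name=string, tags=list(string)})
print(terraform_type({"$ref": "components.schemas.Widget"}))  # any
```

Because this logic is pure and deterministic, it can sit in a normal test suite and catch regressions that would otherwise hide behind the LLM call.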
Conclusion
I built a niche domain-specific LLM system that compiles a very specific Markdown DSL (`<!--openapi-to-terraform-->` blocks) into a Terraform mapping plan, using the LLM only for the schema-ref-to-typed-expression step. The big win from my tinkering was reliability: deterministic parsing and a strict plan structure prevented drift, while the LLM handled the messy schema translation work.