ICAPS 2026 Demo Track

Language-to-Action Planning via Iterative Schema Synthesis

Transform natural language task descriptions into provably correct symbolic plans through LLM-guided PDDL synthesis with automatic refinement.

LAPIS — Demo Video

The LAPIS Pipeline

From natural language to executable plans through LLM-guided iterative synthesis

[Architecture diagram: Natural Language Task → Domain Generation → Problem Generation → Semantic Validity → Planner → Symbolic Validation → GT Simulator, with a Refinement Loop]

Figure: The LAPIS² architecture. Natural language is parsed into PDDL domain and problem files. Candidate plans are generated by a classical planner and evaluated by a symbolic validator. When validation fails, error feedback drives iterative domain/problem refinement. In Sim-LAPIS², a ground-truth simulator provides an additional grounding check.

Synthesis & Adequacy

LAPIS² prompts an LLM to synthesise a PDDL domain directly from a natural-language task description, without any human-authored PDDL. A Schema Check then verifies that every predicate required by the problem can actually be produced by the domain's action schemas — catching structural gaps before the planner is ever called. The domain schema is then injected into the LLM context for consistent problem generation.
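The Schema Check idea can be sketched as a simple producibility test over predicates. The dict/list encoding below is a hypothetical stand-in for a parsed PDDL domain, not the LAPIS² implementation:

```python
# Illustrative Schema Check: flag goal predicates that no action schema can
# produce and that are not already true in the initial state. The data
# structures are hypothetical stand-ins for parsed PDDL.
def schema_check(actions, init_predicates, goal_predicates):
    producible = set()
    for schema in actions.values():
        producible |= set(schema["add_effects"])
    # A goal predicate is adequate if some action adds it,
    # or if it already holds in the initial state.
    return [p for p in goal_predicates
            if p not in producible and p not in set(init_predicates)]

# Toy Blocksworld-style domain: no action ever adds "painted",
# so a goal that requires it is a structural gap.
actions = {
    "pick-up":  {"add_effects": ["holding"]},
    "put-down": {"add_effects": ["ontable", "clear", "handempty"]},
    "stack":    {"add_effects": ["on", "clear", "handempty"]},
}
gaps = schema_check(actions, init_predicates=["ontable", "clear"],
                    goal_predicates=["on", "painted"])
print(gaps)  # → ['painted']
```

Catching such gaps before planning avoids wasted planner calls: no search can ever satisfy a goal predicate the domain cannot produce.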

Refinement Loop

The generated PDDL is passed to a classical planner (FastDownward / SymK / pyperplan). If no plan is found, or if VAL symbolic validation fails, detailed error feedback is routed back to the LLM for iterative problem refinement — and, when semantic inconsistencies are detected, domain refinement as well. This loop repeats for up to k iterations until a valid plan is produced.
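The control flow above can be summarised as follows. `generate_pddl`, `run_planner`, and `validate` are hypothetical stand-ins for the LLM call, the classical planner, and VAL:

```python
# Sketch of the LAPIS² refinement loop; the three callables are hypothetical
# stand-ins for the LLM (re)synthesis step, the planner, and the validator.
def refinement_loop(generate_pddl, run_planner, validate, max_iters=3):
    feedback = None
    for _ in range(max_iters):
        domain, problem = generate_pddl(feedback)  # LLM uses error feedback
        plan = run_planner(domain, problem)        # e.g. FastDownward
        if plan is None:
            feedback = "planner found no plan"
            continue
        ok, errors = validate(domain, problem, plan)  # e.g. VAL
        if ok:
            return plan
        feedback = errors          # symbolic validation errors
    return None                    # iteration budget exhausted

# Stub components: the first candidate fails validation, the refined one passes.
attempts = iter([("d1", "p1"), ("d2", "p2")])
plan = refinement_loop(
    generate_pddl=lambda fb: next(attempts),
    run_planner=lambda d, p: ["(noop)"],
    validate=lambda d, p, pl: (d == "d2", "precondition unsatisfied"),
)
print(plan)  # → ['(noop)']
```

The key design point is that feedback is routed to whichever artefact caused the failure: problem refinement by default, domain refinement when the errors are semantic.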

Ground-Truth Grounding

Classical planners validate plans against the synthesised domain, which may use different action and object names than the real-world simulator. Sim-LAPIS² adds a final LLM-based grounding check that maps synthesised names back to ground-truth terms, enabling plans to be executed directly in the GT simulator and closing the "validation gap" that reduces all other methods to 0% real executability.
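Conceptually, the grounding step applies an LLM-produced name assignment to the synthesised plan. A minimal sketch, with a hand-written mapping standing in for the LLM's output:

```python
# Apply a name assignment (hand-written here; proposed by an LLM in
# Sim-LAPIS²) mapping synthesised action/object names to ground-truth terms.
def ground(plan, action_map, object_map):
    grounded = []
    for step in plan:
        action, *args = step.strip("()").split()
        grounded.append("(" + " ".join(
            [action_map.get(action, action)]
            + [object_map.get(a, a) for a in args]) + ")")
    return grounded

synth_plan = ["(grab robot1 blockA)", "(place robot1 blockA table1)"]
print(ground(
    synth_plan,
    action_map={"grab": "pick-up", "place": "put-down"},
    object_map={"robot1": "robot0", "blockA": "b0", "table1": "table"},
))  # → ['(pick-up robot0 b0)', '(put-down robot0 b0 table)']
```

Names absent from the mapping pass through unchanged, so a partial assignment degrades gracefully rather than breaking the whole plan.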

The pipeline in four steps:

1. Domain Synthesis: the LLM generates a PDDL domain from natural language, with an optional adequacy check to verify predicate coverage.
2. Problem Generation: the domain schema is injected into the LLM context, and a PDDL problem file encoding the initial state and goal is synthesised.
3. Planning & Validation: a classical planner generates a candidate plan, and VAL validates it symbolically; errors feed back for refinement.
4. GT Grounding (Sim-LAPIS²): the LLM maps synthesised names to ground-truth terms, enabling direct execution in the real-world simulator.

Key Features

Bridging LLM flexibility with symbolic planning rigor

Provable Correctness

Every generated plan is validated by VAL, ensuring logical soundness and executability before deployment.

Zero-Shot Domains

Handle new domains without manual PDDL modeling. LLMs provide the semantic understanding, planners provide the search.

Long-Horizon Planning

Successfully plans tasks with 80+ steps where pure autoregressive generation fails due to compounding errors.

Interactive Demo

Live web interface for real-time pipeline visualization, PDDL editing, and plan execution trace stepping.

Multiple Backends

Supports pyperplan, FastDownward, and SymK planners. Works with GPT-4o, Claude, and Gemini models.


Open Source (Soon!)

Fully open-source implementation with benchmark suite, evaluation scripts, and reproducible experiments.

Benchmark Results

Evaluated on IPC domains from the LLM+P benchmark suite

100% Planning Success Rate (VAL validation across all 7 IPC domains)
73% GT Execution Success Rate (ground-truth simulator, averaged over 7 domains)
8.9s Avg. Generation Time per task (36x faster than NL2Plan)

Success Rate on 20 Problems per Domain

VAL = the plan passes PDDL validation (self-consistent) · GT = the plan actually runs in the ground-truth simulator. All runs use Claude Sonnet 4.6 + FastDownward. Columns marked * use the human-authored ground-truth PDDL domain and are provided as upper-bound references only.

| Domain | LLM+P* Few | LLM+P* Zero | NL2Plan | GT-LAPIS²* | LAPIS² Zero | LAPIS² Dom. | LAPIS² Adq. | Sim-LAPIS² VAL / GT |
|---|---|---|---|---|---|---|---|---|
| Blocksworld | 100 | 80 | 95 | 100 | 100 | 100 | 100 | 100 / 90 |
| Floortile | 100 | 45 | — | 90 | 90 | 90 | 85 | 100 / 75 |
| Tyreworld | 95 | 90 | — | 95 | 75 | 75 | 90 | 100 / 45 |
| Storage | 100 | 90 | 25 | 100 | 75 | 100 | 100 | 100 / 35 |
| Barman | 95 | 0 | — | 80 | 0 | 50 | 100 | 100 / 90 |
| Grippers | 100 | 100 | — | 100 | 100 | 100 | 100 | 100 / 90 |
| Termes | 100 | 95 | — | 100 | 95 | 100 | 100 | 100 / 85 |
| Avg | 99 | 71 | — | 95 | 76 | 85 | 96 | 100 / 73 |

Column key: Few* = few-shot, Zero* = zero-shot; NL2Plan = 325s/task, 2 domains (— = no reported value); GT* = oracle domain; Zero = 0 refinement iterations; Dom. = 3-iter domain synthesis; Adq. = + adequacy check; VAL / GT = VAL score / real executability.

* These methods receive the human-authored PDDL domain as input and serve as upper-bound references. They are not directly comparable to synthesis methods that must build the domain from scratch.

Every method other than Sim-LAPIS² shows its VAL score only. Their GT executability is 0% across all domains: plans are validated against the synthesised PDDL schema, which uses different action and object names than the ground-truth simulator. Sim-LAPIS² is the only method that bridges this gap, using an LLM-based assignment mapping to translate synthesised plans into ground-truth terms.

Key result: LAPIS² with adequacy checking reaches 96% average VAL score from natural language input alone, with no human-authored PDDL. This is within 3 points of the few-shot oracle (99%) that is handed the correct domain. On harder domains like Barman and Storage, the baselines score between 0% and 25% while LAPIS² reaches 50-100%. Sim-LAPIS² then goes further, achieving 73% real executability in the ground-truth simulator — something no other method can do at all.