Transform natural language task descriptions into provably correct symbolic plans through LLM-guided PDDL synthesis with automatic refinement.
From natural language to executable plans through LLM-guided iterative synthesis
Figure: The LAPIS² architecture. Natural language is parsed into PDDL domain and problem files. Candidate plans are generated by a classical planner and evaluated by a symbolic validator. When validation fails, error feedback drives iterative domain/problem refinement. In Sim-LAPIS², a ground-truth simulator provides an additional grounding check.
LAPIS² prompts an LLM to synthesise a PDDL domain directly from a natural-language task description, without any human-authored PDDL. A Schema Check then verifies that every predicate required by the problem can actually be produced by the domain's action schemas — catching structural gaps before the planner is ever called. The domain schema is then injected into the LLM context for consistent problem generation.
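The Schema Check can be sketched as follows. This is a hypothetical, regex-based illustration (a real implementation would use a proper PDDL parser): it collects every predicate that appears in some action's `:effect` and flags goal predicates that no action can ever make true.

```python
import re

# Hypothetical sketch of the Schema Check: verify that every predicate the
# problem needs can be produced by some action effect (or already holds in
# the initial state). A real implementation would use a proper PDDL parser;
# this regex-based version only illustrates the idea.

def effect_predicates(domain_pddl):
    """Predicate names appearing in any :effect block of the domain."""
    preds = set()
    for chunk in domain_pddl.split(":effect")[1:]:
        chunk = chunk.split(":action")[0]          # stop at the next action
        preds |= set(re.findall(r"\(\s*([a-z][\w-]*)", chunk))
    return preds - {"and", "not", "when", "forall", "increase"}

def schema_check(domain_pddl, goal_predicates, init_predicates):
    """Return goal predicates that no action can ever make true."""
    producible = effect_predicates(domain_pddl) | set(init_predicates)
    return [p for p in goal_predicates if p not in producible]

DOMAIN = """
(define (domain blocks)
  (:predicates (on ?x ?y) (clear ?x) (holding ?x))
  (:action stack
    :parameters (?x ?y)
    :precondition (and (holding ?x) (clear ?y))
    :effect (and (on ?x ?y) (clear ?x) (not (holding ?x)))))
"""

# 'painted' is never produced by any effect -> structural gap caught early
print(schema_check(DOMAIN, ["on", "painted"], ["clear"]))  # ['painted']
```

Catching such gaps before calling the planner avoids burning search time on a domain that can never reach the goal.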
The generated PDDL is passed to a classical planner (FastDownward / SymK / pyperplan). If no plan is found, or if VAL symbolic validation fails, detailed error feedback is routed back to the LLM for iterative problem refinement — and, when semantic inconsistencies are detected, domain refinement as well. This loop repeats for up to k iterations until a valid plan is produced.
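The plan-validate-refine loop can be sketched as below. All function names are illustrative stand-ins, not the project's actual API: `plan` would wrap a classical planner, `validate` would wrap VAL, and `refine` would wrap an LLM call that rewrites the PDDL from error feedback.

```python
# Minimal sketch of the plan-validate-refine loop (illustrative names only).

def refine_loop(problem, plan, validate, refine, k=5):
    """Iterate up to k times until the planner + validator accept a plan."""
    for _ in range(k):
        candidate = plan(problem)              # classical planner call
        if candidate is not None:
            ok, feedback = validate(candidate)  # VAL symbolic validation
            if ok:
                return candidate
        else:
            feedback = "planner found no plan"
        problem = refine(problem, feedback)     # LLM-driven PDDL rewrite
    return None                                 # gave up after k iterations

# Toy demonstration with stubs: validation succeeds on the second attempt.
attempts = {"n": 0}
def fake_plan(p): return "plan-for-" + p
def fake_validate(c):
    attempts["n"] += 1
    return attempts["n"] >= 2, "unsatisfied precondition"
def fake_refine(p, fb): return p + "+fix"

print(refine_loop("p0", fake_plan, fake_validate, fake_refine))
# -> plan-for-p0+fix
```

The cap of k iterations bounds LLM cost; in practice feedback either converges quickly or not at all.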
Classical planners validate plans against the synthesised domain, which may use different action and object names than the real-world simulator. Sim-LAPIS² adds a final LLM-based grounding check that maps synthesised names back to ground-truth terms, enabling plans to be executed directly in the GT simulator — closing the "validation gap" that reduces all other methods to 0% real executability.
LLM generates a PDDL domain from natural language; an optional adequacy check (the Schema Check) verifies predicate coverage.
Domain schema injected into LLM context; PDDL problem file encoding initial state & goal is synthesised.
Classical planner generates a candidate plan; VAL validates it symbolically. Errors feed back for refinement.
LLM maps synthesised names to ground-truth terms, enabling direct execution in the real-world simulator.
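Once the name maps exist, the rewrite itself is mechanical. The sketch below is hypothetical: in Sim-LAPIS² an LLM proposes the action and object maps, while here they are hard-coded to illustrate the grounding step alone.

```python
# Hypothetical sketch of the grounding step. In Sim-LAPIS² an LLM proposes
# the name maps; here they are hard-coded to illustrate the rewrite itself.

def ground_plan(plan, action_map, object_map):
    """Rewrite a synthesised plan into the simulator's ground-truth vocabulary."""
    grounded = []
    for action, *args in plan:
        grounded.append((action_map.get(action, action),
                         *(object_map.get(a, a) for a in args)))
    return grounded

synthesised = [("pick-up", "blockA"), ("stack", "blockA", "blockB")]
action_map = {"pick-up": "PickUp", "stack": "Stack"}     # assumed LLM output
object_map = {"blockA": "block_a", "blockB": "block_b"}  # assumed LLM output

print(ground_plan(synthesised, action_map, object_map))
# -> [('PickUp', 'block_a'), ('Stack', 'block_a', 'block_b')]
```

Names with no mapping pass through unchanged, so a partial map degrades gracefully rather than crashing execution.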
Bridging LLM flexibility with symbolic planning rigor
Every generated plan is validated by VAL, ensuring logical soundness and executability before deployment.
Handle new domains without manual PDDL modeling. LLMs provide the semantic understanding, planners provide the search.
Successfully plans tasks with 80+ steps where pure autoregressive generation fails due to compounding errors.
Live web interface for real-time pipeline visualization, PDDL editing, and plan execution trace stepping.
Supports pyperplan, FastDownward, and SymK planners. Works with GPT-4o, Claude, and Gemini models.
Fully open-source implementation with benchmark suite, evaluation scripts, and reproducible experiments.
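As a concrete illustration of the validation step, a minimal wrapper might shell out to VAL's `Validate` binary (binary name and `-v` verbose flag as in the standard VAL distribution; treat this as a sketch, not the project's actual code).

```python
import subprocess

def val_command(domain, problem, plan_file, val_bin="Validate"):
    """Build the command line for VAL's Validate tool (-v = verbose report)."""
    return [val_bin, "-v", domain, problem, plan_file]

def run_val(domain, problem, plan_file, val_bin="Validate"):
    """Return (ok, report); ok is True iff the plan validates."""
    proc = subprocess.run(val_command(domain, problem, plan_file, val_bin),
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr
```

The verbose report is exactly the error feedback the refinement loop feeds back to the LLM.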
Evaluated on IPC domains from the LLM+P benchmark suite
VAL = the plan passes VAL symbolic validation (self-consistent) · GT = the plan actually runs in the ground-truth simulator · All results use Claude Sonnet 4.6 + FastDownward. Columns marked * use the human-authored ground-truth PDDL domain and are provided as upper-bound references only.
| Domain | LLM+P* (few-shot) | LLM+P* (zero-shot) | NL2Plan (325 s/task; 2 domains) | GT-LAPIS²* (oracle domain) | LAPIS² (0 iterations) | LAPIS² (3-iter dom. synth.) | LAPIS² (+ adequacy check) | Sim-LAPIS² (VAL / GT) |
|---|---|---|---|---|---|---|---|---|
| Blocksworld | 100 | 80 | 95 | 100 | 100 | 100 | 100 | 100 / 90 |
| Floortile | 100 | 45 | — | 90 | 90 | 90 | 85 | 100 / 75 |
| Tyreworld | 95 | 90 | — | 95 | 75 | 75 | 90 | 100 / 45 |
| Storage | 100 | 90 | 25 | 100 | 75 | 100 | 100 | 100 / 35 |
| Barman | 95 | 0 | — | 80 | 0 | 50 | 100 | 100 / 90 |
| Grippers | 100 | 100 | — | 100 | 100 | 100 | 100 | 100 / 90 |
| Termes | 100 | 95 | — | 100 | 95 | 100 | 100 | 100 / 85 |
| Avg | 99 | 71 | — | 95 | 76 | 85 | 96 | 100 / 73 |
* These methods receive the human-authored PDDL domain as input and serve as upper-bound references. They are not directly comparable to synthesis methods that must build the domain from scratch.
Every non-Sim-LAPIS² method shows VAL score only. Their GT executability is 0% across all domains: plans are validated against the synthesized PDDL schema, which uses different action and object names than the ground-truth simulator. Sim-LAPIS² is the only method that bridges this gap, using an LLM-based assignment mapping to translate synthesized plans into ground-truth terms.