Accelerating LLM Code Generation Through Mask Store Streamlining
Structured text generation techniques enforcing context-free grammar (CFG) constraints (Willard & Louf, 2023; Gerganov et al., 2024; Lundberg & Ribeiro, 2023; Geng et al., 2024; Beurer-Kellner et al., 2024; Ugare et al., 2024; Dong et al., 2024) are particularly useful for generating syntactically correct computer code in the context of LLM-based coding tools. These techniques guarantee full compliance with the grammar but introduce computational overhead at inference time.
Minimizing inference overhead is essential for a productive developer experience, especially when suggesting generated code in real time. This is challenging because the CFGs of programming languages are complex and CFG constraints are harder to enforce than regex constraints.
In this blog post, I propose to accelerate some CFG-constrained decoding techniques. More precisely, I will:
- Briefly describe the principle of such techniques;
- Provide examples of patterns of the input CFG that can be used to reduce the inference overhead;
- Describe algorithms to automatically detect such patterns;
- Present first experimental results.
The code to reproduce the experimental results and the proofs of the algorithms are provided in a Python notebook and a technical appendix.
CFG-Constrained LLM Decoding
This blog post focuses on CFG-constrained decoding techniques (Beurer-Kellner et al., 2024; Willard & Louf, 2023) that leverage:
- an automaton-based lexer to ensure that the generated string can be converted into a sequence of terminals;
- an incremental parser to guarantee that the resulting sequence of terminals complies with the grammar.
In this section, I will illustrate how these two components can be used, using the Python grammar as an example. I will only give a high-level overview, without covering secondary technical details (ignored terminals, terminal priorities, indentations). I invite you to read the papers referenced above, as well as the associated Python notebook, for a more comprehensive introduction.
The Python grammar as specified in the lark package includes:
- 100 terminals (e.g. DEF, NAME, LPAR, RPAR), each of which is described by a regular expression (e.g. def for DEF or [^\W\d]\w* for NAME);
- 176 nonterminals (e.g. funcdef, parameters, test);
- 536 rules such as funcdef: "def" name "(" parameters ")" "->" test ":" suite or name: NAME.
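To get a concrete sense of these numbers, the grammar can be loaded directly from the lark package. The sketch below follows lark's python_parser example (the PythonIndenter class is copied from it); the printed counts may differ slightly across lark versions, so treat this as an illustration rather than a definitive inventory.

from lark import Lark
from lark.indenter import Indenter

class PythonIndenter(Indenter):
    # Postlexer handling Python's significant indentation (taken from lark's examples)
    NL_type = "_NEWLINE"
    OPEN_PAREN_types = ["LPAR", "LSQB", "LBRACE"]
    CLOSE_PAREN_types = ["RPAR", "RSQB", "RBRACE"]
    INDENT_type = "_INDENT"
    DEDENT_type = "_DEDENT"
    tab_len = 8

parser = Lark.open_from_package(
    "lark", "python.lark", ["grammars"],
    parser="lalr", postlex=PythonIndenter(), start="file_input",
)
print(len(parser.terminals), "terminals")
print(len({rule.origin for rule in parser.rules}), "nonterminals")
print(len(parser.rules), "rules")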
Producing a string that can be converted into a sequence of terminals
When leveraging an incremental parser, a first step to ensure syntactically correct Python code generation is constructing a non-deterministic finite automaton (NFA) that recognizes strings convertible into terminal sequences. For this, we can simply connect the deterministic finite automata (DFA) corresponding to the regular expression of each terminal with transitions and two additional nodes, as shown in Figure 1 for a few terminals.
Since LLMs manipulate tokens rather than characters, we need to convert this character-based NFA into the equivalent token-based NFA. This is done by simply seeing each token as a sequence of characters and applying the corresponding character transitions (Beurer-Kellner et al., 2024; Willard & Louf, 2023). For example, if () is a token, the token-based NFA will include a () transition from q4 to q10.
While following the character transitions, we keep track of the DFAs traversed so as to annotate the transitions of the token-based NFA with the corresponding sequences of new terminals. For example, the () transition would be annotated with [LPAR, RPAR]. In other words, we create nfa_transition, an annotated transition function of the token-based NFA from Q × V to the subsets of Q × T*, where Q is the set of states of the NFA, V is the set of tokens and T is the set of terminals:
...
nfa_transition[q4][1270] = [(q4, ("DEF",)), (q6, ("NAME",))] # 1270 is the token id for `def`
...
nfa_transition[q4][470] = [(q10, ("LPAR", "RPAR"))] # 470 is the token id for `()`
...
# For the Python grammar, `nfa_transition` is defined with 2,837,801 rows.
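For illustration, here is a minimal sketch of how such a token-based transition function could be built from a character-level annotated NFA. The char_transition representation assumed here (a nested dictionary mapping a state and a character to a list of (next state, new terminals) pairs) is hypothetical and only meant to convey the idea.

def build_token_nfa_transition(char_transition, vocabulary):
    # Lift the character-level annotated NFA to the token level: walk each token's
    # characters and accumulate the terminals completed along the way.
    nfa_transition = {}
    for state in char_transition:
        nfa_transition[state] = {}
        for token_id, token_string in enumerate(vocabulary):
            frontier = [(state, ())]  # reachable (state, new terminals) pairs so far
            for char in token_string:
                frontier = [
                    (next_state, terminals + new_terminals)
                    for current_state, terminals in frontier
                    for next_state, new_terminals in char_transition.get(current_state, {}).get(char, [])
                ]
            if frontier:  # keep only tokens that are lexically possible from this state
                nfa_transition[state][token_id] = frontier
    return nfa_transition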
Checking that the generated sequences of terminals comply with the grammar
At each step of the decoding process, nfa_transition shows which tokens can eventually lead to a sequence of terminals. Of course, not all sequences of terminals are syntactically valid. We therefore need an incremental parser to keep only the sequences of terminals that are compatible with the CFG. In the context of this blog post, we can see such an incremental parser as an object incremental_parser with an internal state incremental_parser.state and two methods:
- incremental_parser.accepts(new_terminals) -> bool, which takes a sequence of terminals as an input and returns True or False depending on whether the terminals consumed so far followed by new_terminals form the prefix of a valid sequence of terminals;
- incremental_parser.consumes(new_terminals) -> None, which updates the internal state with new_terminals.
A minimal lark-based sketch of such an object is given below.
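This is only one possible implementation, built on top of lark's interactive LALR parser and assuming the parser was created with parser="lalr"; feeding tokens with empty values is enough here because only the token types matter for checking syntactic acceptance.

from lark import Token

class IncrementalParser:
    def __init__(self, lark_parser):
        # parse_interactive returns a parser state that can be advanced token by token
        self.state = lark_parser.parse_interactive("")

    def accepts(self, new_terminals):
        # Try the new terminals on a copy of the current state
        trial = self.state.copy()
        try:
            for terminal in new_terminals:
                trial.feed_token(Token(terminal, ""))
            return True
        except Exception:  # lark raises UnexpectedToken on an invalid continuation
            return False

    def consumes(self, new_terminals):
        for terminal in new_terminals:
            self.state.feed_token(Token(terminal, ""))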
The incremental parser is used to check whether the terminals that would be added with a certain new token would be acceptable. Instead of assessing this compliance token by token, we can group together the tokens that lead to the same additional terminals, check this sequence of terminals only once and mark all these tokens as acceptable if the result is positive. Following (Ugare et al., 2024) and (Beurer-Kellner et al., 2024), we can build a mask_store function that maps Q × T* to binary masks over the vocabulary (vectors of size |V|):
...
mask_store[q4][("NAME",)] = [0, ... 1, ..., 1, ..., 0]
# ↑ ↑ ↑ ↑
# 0 1,270 18,641 |V|-1
...
mask_store[q4][("LPAR", "RPAR")] = [0, ... 1, ..., 0]
# ↑ ↑ ↑
# 0 470 |V|-1
...
# For the Python grammar, `mask_store` is defined with 112,283 rows.
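For reference, such a mask store can be derived from nfa_transition by grouping, for each NFA state, the tokens that would add the same sequence of terminals. Below is a minimal sketch assuming the nfa_transition representation shown above.

from collections import defaultdict

def build_mask_store(nfa_transition, vocabulary_size):
    # Group the tokens that emit the same sequence of new terminals from the same
    # NFA state, so that the incremental parser is called once per group
    mask_store = defaultdict(dict)
    for state, transitions in nfa_transition.items():
        for token_id, outcomes in transitions.items():
            for _next_state, new_terminals in outcomes:
                mask = mask_store[state].setdefault(new_terminals, [0] * vocabulary_size)
                mask[token_id] = 1
    return mask_store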
The overall constrained decoding algorithm can then be expressed in pseudo Python code as follows.
def generate(llm, prompt, grammar):
    token_ids = tokenize(prompt)
    new_token_id = None
    # Initialize the states of the lexer and the incremental parser
    lexer_states = [(0, ())]  # list of (nfa_state, terminals) pairs
    incremental_parser = IncrementalParser(grammar)
    # Generate one token for each iteration of the while loop
    while new_token_id != eos_token_id:
        # Initialize an empty mask
        mask = [0] * len(tokenizer_vocabulary)
        # Filter the potential next tokens with the incremental parser
        for nfa_state, terminals in lexer_states:
            for new_terminals in mask_store[nfa_state]:
                if incremental_parser.accepts(terminals + new_terminals):
                    mask = element_wise_binary_or(mask, mask_store[nfa_state][new_terminals])
        # Sample a new token with the LLM and the mask
        # Cf. algorithm 2 of arxiv.org/abs/2307.09702
        new_token_id = llm.sample(token_ids, mask)
        token_ids.append(new_token_id)
        # Update the states of the lexer and the incremental parser
        new_states = [
            (new_nfa_state, terminals + new_terminals)
            for nfa_state, terminals in lexer_states
            for new_nfa_state, new_terminals in nfa_transition[nfa_state][new_token_id]
        ]
        common_prefix = get_common_prefix(
            [terminals for new_nfa_state, terminals in new_states]
        )
        incremental_parser.consumes(common_prefix)
        lexer_states = [
            (new_nfa_state, terminals[len(common_prefix):])
            for new_nfa_state, terminals in new_states
        ]
    return token_ids
Patterns of Interest to Accelerate Decoding
The reason why constrained decoding is much slower for CFG constraints than for regex constraints is that, at each decoding step, the incremental parser may be called once for each mask of the current NFA states, as shown in the double for
loop above. In the following, we will see how to significantly reduce the number of calls to the incremental parser.
Consider state q4 from Figure 1, which corresponds to the DEF terminal. The mask store contains 1,036 sequences of terminals for this state. However, DEF appears only once in the EBNF formalization of the Python grammar – funcdef: "def" name "(" [parameters] ")" ["->" test] ":" suite – and name is defined as name: NAME | "match" | "case". Consequently, only the NAME, "match" and "case" terminals can follow DEF. As a result, DEF followed by NAME (or "match" or "case") is always accepted whenever DEF is accepted, while DEF followed by ASYNC, CLASS or any other terminal is never accepted, whatever the current parser state is. This means that we can streamline the mask store to avoid superfluous calls to the incremental parser.
...
mask_store[q4][()] = m1
mask_store[q4][("NAME",)] = m2
mask_store[q4][("MATCH",)] = m3
mask_store[q4][("CASE",)] = m4
mask_store[q4][("ASYNC",)] = m5
mask_store[q4][("CLASS",)] = m6
...
... should be simplified as:
...
mask_store[q4][()] = element_wise_binary_or(m1, m2, m3, m4)
...
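Here, element_wise_binary_or simply denotes an element-wise logical OR over binary masks of equal length; a minimal implementation could be:

def element_wise_binary_or(*masks):
    # Combine any number of binary masks of equal length with a logical OR
    return [int(any(bits)) for bits in zip(*masks)]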
Let us now consider another interesting pattern of the Python grammar. The MINUS and PLUS terminals only appear in two rules:
!_unary_op: "+"|"-"|"~"
!_add_op: "+"|"-"
Given that MINUS and PLUS are interchangeable in these rules, replacing one with the other does not affect the syntactic correctness of a string. This is another opportunity to streamline the mask store by merging the entries with PLUS and those with MINUS. For example:
...
mask_store[q6][("MINUS",)] = m1
mask_store[q6][("MINUS", "MINUS")] = m2
mask_store[q6][("PLUS",)] = m3
mask_store[q6][("PLUS", "PLUS")] = m4
...
... should be simplified as:
...
mask_store[q6][("MINUS",)] = element_wise_binary_or(m1, m3)
mask_store[q6][("MINUS", "MINUS")] = element_wise_binary_or(m2, m4)
...
We could systematically identify similar patterns if we had access to the following functions:
- is_never_legal(t, S), which returns True if and only if no element of Pref(G) contains the terminal t immediately followed by the sequence of terminals S;
- is_always_legal(t, S), which returns True if and only if, for every element of Pref(G) ending with t, appending S yields another element of Pref(G);
- are_jointly_legal(t, S1, S2), which returns True if and only if, for every element of Pref(G) ending with t, appending S1 yields an element of Pref(G) exactly when appending S2 does;
where L(G), Pref(G) and T denote the context-free language corresponding to the grammar, the set of prefixes of the words of L(G) and the set of terminals of the grammar, with t taken in T and S, S1, S2 sequences over T.
The mask store can be streamlined on the basis of these three functions (a sketch applying the first two rules follows this list):
- If is_always_legal(t, S) is True, we can replace mask_store(q, ()) with element_wise_binary_or(mask_store(q, ()), mask_store(q, S)) and remove mask_store(q, S) for every NFA state q corresponding to the terminal t;
- If is_never_legal(t, S) is True, we can remove mask_store(q, S) for every NFA state q corresponding to the terminal t;
- If are_jointly_legal(t, S1, S2) is True, we can replace mask_store(q, S1) with element_wise_binary_or(mask_store(q, S1), mask_store(q, S2)) and remove mask_store(q, S2) for every NFA state q corresponding to the terminal t.
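The first two rules can be applied systematically with a helper like the sketch below. The terminal_of_state mapping (from an NFA state to the terminal it belongs to) and the two predicates passed as arguments are assumptions of this sketch; the third rule (merging jointly legal entries) would follow the same pattern.

def streamline_mask_store(mask_store, is_never_legal, is_always_legal, terminal_of_state):
    for state, entries in mask_store.items():
        terminal = terminal_of_state[state]
        for new_terminals in list(entries):
            if new_terminals == ():
                continue  # the empty continuation is the merge target, keep it
            if is_never_legal(terminal, new_terminals):
                # Rule 2: this continuation can never be accepted, drop the entry
                del entries[new_terminals]
            elif is_always_legal(terminal, new_terminals):
                # Rule 1: this continuation is always accepted, fold it into the
                # empty continuation so that no parser call is needed for it
                base = entries.setdefault((), [0] * len(entries[new_terminals]))
                entries[()] = element_wise_binary_or(base, entries[new_terminals])
                del entries[new_terminals]
    return mask_store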
We now examine whether it is actually possible to compute is_never_legal, is_always_legal and are_jointly_legal.
Always Illegal Continuations
The definition of is_never_legal entails that is_never_legal(t, S) is True if and only if the intersection of L(G) with T* t S T* (i.e. the set of terminal sequences containing t immediately followed by S) is empty.
Given that T* t S T* is a regular language over T, computing is_never_legal is equivalent to determining the intersection of a context-free language and a regular language and then testing whether this new context-free language is empty. As there are standard algorithms for these two operations, this suggests a straightforward way to obtain is_never_legal:
def is_never_legal(current_terminal, new_terminals, grammar):
    """
    Return True if `new_terminals` can never follow `current_terminal` given the grammar
    """
    # We create a regular expression defined over the set of terminals that recognizes
    # any sequence of terminals containing `current_terminal` immediately followed by
    # `new_terminals`.
    regex = f".*{current_terminal}{''.join(new_terminals)}.*"
    # The intersection of a context-free language and a regular language is a
    # context-free language, whose grammar can be efficiently computed with the
    # Bar-Hillel construction. We can then test its emptiness with a standard CFG algorithm.
    return is_empty(intersection(grammar, regex))
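For completeness, the emptiness test itself can be a standard fixpoint computation of the productive nonterminals. The grammar representation assumed below (a dictionary mapping each nonterminal to its list of right-hand sides, with a start symbol named "start") is hypothetical and only meant to illustrate the idea.

def is_empty(grammar, start="start"):
    # A nonterminal is productive if one of its right-hand sides only contains
    # terminals (symbols that are not keys of `grammar`) and productive nonterminals.
    # The grammar generates no word exactly when the start symbol is unproductive.
    productive = set()
    changed = True
    while changed:
        changed = False
        for nonterminal, alternatives in grammar.items():
            if nonterminal in productive:
                continue
            for rhs in alternatives:
                if all(symbol in productive or symbol not in grammar for symbol in rhs):
                    productive.add(nonterminal)
                    changed = True
                    break
    return start not in productive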
Always Legal Continuations
In contrast, is_always_legal is not computable, as proven in the technical appendix of this blog post. A general algorithm is thus out of reach, but I outline below a method to obtain some values of this function. For the sake of brevity, I restrict myself to a very simple example and focus on conveying the main intuitions, without formalizing the method or proving its correctness. I invite you to read the technical appendix if you are interested in such details.
Let us take the example of a simple context-free grammar defined by a single rule, together with its associated context-free language. If we want to determine some values of is_always_legal, it is useful to understand which symbols can follow one another. We can describe these relationships with a directed graph, as illustrated in Figure 2.
However, such a directed graph is generally not rich enough to draw conclusions about the values of is_always_legal. To go further, we extend this representation so that it not only tracks the successor relationships among symbols but also the rules from which these relationships arise. More precisely, we use the pushdown automaton shown in Figure 3, with a designated initial state, initial stack symbol and single accepting state. In this pushdown automaton, the states represent terminal or nonterminal symbols that are being generated, while the stack symbols represent specific locations in the rules of G and serve as "return addresses" to follow once a symbol has been generated.
This pushdown automaton has interesting properties for our purpose:
- The words accepted by the pushdown automaton are exactly the words generated by the grammar;
- A valid path for an accepted word corresponds to the depth-first traversal of the syntax tree of this word (cf. Figure 4);
- A sequence of valid steps can always be completed so that the resulting path corresponds to an accepted word.
If we want to determine the value of is_always_legal(t, S), the first step is to start at the state corresponding to t with an empty stack and list all the paths reading S through the pushdown automaton. For this, we follow the step relation of the pushdown automaton, except that, if the stack is empty and we need to pop a stack symbol, we still proceed with the transition and take note of the missing stack symbol. It is as if we had an unlimited line of credit to borrow stack symbols when the stack is empty. We also impose a maximum stack size during the search so that it is guaranteed to terminate.
The second step is to represent all these paths as a non-deterministic finite automaton (NFA), as shown in Figure 5: the nodes correspond to the combination of a state of the pushdown automaton and the associated stack, the start node combines the state corresponding to t with the empty stack, the accepting nodes are those reached after reading the whole sequence S (with any stack) and the transitions are labeled either with ε or with a missing stack symbol identified along a path.
The third and last step is to identify the ε-coaccessible nodes of the NFA, i.e. the nodes that satisfy one of the following properties:
- the node is a final node but not an initial node;
- there is an ε-transition from this node to an ε-coaccessible node;
- for each stack symbol that can be popped from the pushdown automaton state of the node, there is a transition labeled with this stack symbol from this node to an ε-coaccessible node.
A sufficient condition for is_always_legal(t, S) to be True is then that the start node is ε-coaccessible. Figure 5 shows the NFAs obtained for four example continuations. In the first case, the only stack symbol that can be popped from the corresponding pushdown automaton state labels a transition to an ε-coaccessible node, so we can conclude that the continuation is always legal. This is not the case for the other examples, and it is indeed easy to show that these continuations are not always legal.
The intuition behind this approach is that the NFA captures the stack symbols needed to move from the state corresponding to t to the states reached after reading S. Along such a trajectory, the conditions defined above ensure that we can always move one step closer to an accepting node, either because we can move forward with an ε-transition or because there is a relevant transition for each possible stack symbol on top of the stack.
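To make the fixpoint nature of this definition concrete, here is a minimal sketch of the computation of the ε-coaccessible nodes. The representations of transitions (a dictionary mapping a node and a label, either "eps" or a stack symbol, to the set of successor nodes) and poppable (the stack symbols that can be popped from the pushdown automaton state of each node) are assumptions of this sketch.

def epsilon_coaccessible_nodes(nodes, transitions, initial_node, final_nodes, poppable):
    # Start from the final (non-initial) nodes and grow the set until a fixpoint is reached
    coaccessible = {node for node in final_nodes if node != initial_node}
    changed = True
    while changed:
        changed = False
        for node in nodes:
            if node in coaccessible:
                continue
            # Condition 2: an epsilon transition leads to an already coaccessible node
            via_epsilon = any(
                successor in coaccessible
                for successor in transitions.get(node, {}).get("eps", ())
            )
            # Condition 3: every poppable stack symbol has a transition to a coaccessible node
            via_stack = bool(poppable.get(node)) and all(
                any(
                    successor in coaccessible
                    for successor in transitions.get(node, {}).get(symbol, ())
                )
                for symbol in poppable[node]
            )
            if via_epsilon or via_stack:
                coaccessible.add(node)
                changed = True
    return coaccessible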
Jointly Legal Continuations
Like is_always_legal, are_jointly_legal is not computable, because the values of is_always_legal can directly be derived from those of are_jointly_legal. Indeed, appending the empty sequence of terminals is always legal, so for all t and S:
is_always_legal(t, S) = are_jointly_legal(t, S, ())
There are however simple ways to get some values of are_jointly_legal. First, we can use known values of is_always_legal: if appending S1 after t is always legal and appending S2 after t is always legal, then S1 and S2 are jointly legal after t. For example, using the method described in the previous section, we can show for the Python grammar that certain continuations are always legal and conclude that the corresponding mask store entries can be merged.
Furthermore, we can take advantage of the fact that some terminals are interchangeable in the rules of the grammar. For example, the only rules of the Python grammar with the PLUS and MINUS terminals are:
!_unary_op: "+"|"-"|"~"
!_add_op: "+"|"-"
This means that it is always possible to replace a PLUS terminal with a MINUS terminal (and vice versa) without affecting the syntactical correctness of a sequence of terminals. More generally, if we can determine that two terminals t1 and t2 are interchangeable, we can conclude that are_jointly_legal(t, S1, S2) is True whenever S2 is obtained from S1 by swapping the occurrences of t1 and t2.
With the Python grammar, there are seven sets of mutually interchangeable terminals, containing 35 terminals in total:
1: "/=", "%=", "^=", "**=", "*=", ">>=", "|=", "-=", "+=", "//=", "<<=", "@=", "&="
2: FALSE, NONE, TRUE
3: PLUS, MINUS
4: "<<", ">>"
5: "//", PERCENT
6: "!=", LESSTHAN, ">=", "==", "<=", MORETHAN, "<>"
7: HEX_NUMBER, FLOAT_NUMBER, BIN_NUMBER, OCT_NUMBER, DEC_NUMBER, IMAG_NUMBER
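To detect such sets automatically, a simple sufficient check is to verify that swapping two terminals in every rule leaves the set of rules unchanged. Below is a minimal sketch assuming the rules are represented as (left-hand side, right-hand side) pairs with right-hand sides given as tuples of symbols.

def are_interchangeable(rules, terminal_1, terminal_2):
    # Swapping terminal_1 and terminal_2 everywhere must map the rule set onto itself
    def swap(symbol):
        if symbol == terminal_1:
            return terminal_2
        if symbol == terminal_2:
            return terminal_1
        return symbol
    swapped = {(lhs, tuple(swap(symbol) for symbol in rhs)) for lhs, rhs in rules}
    return swapped == set(rules)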
Experimental Results
The accompanying Python notebook evaluates the effectiveness of mask store streamlining. Since some of the operations are time-consuming, in particular the computation of is_never_legal (because it requires taking into account the whole grammar, whereas is_always_legal only leverages local relationships in the pushdown automaton), I proceeded in the following way:
- Step 1: identify sets of mutually interchangeable terminals to merge the mask store entries corresponding to these terminals;
- Step 2: use the directed graph mentioned above (cf. Figure 2) to identify which terminal can possibly follow another terminal and remove the mask store entries that do not comply with these constraints;
- Step 3: identify terminals whose continuations are always legal or never legal and merge or remove the corresponding mask store entries;
- Step 4: compute the values of is_always_legal (whenever possible) and is_never_legal for the remaining entries of the mask store and merge or remove these entries on this basis.
Figure 6 shows that these steps combined result in a ten-fold reduction of the size of the mask store.
Furthermore, I performed three experiments to confirm that streamlining the mask store yields valid results:
- Experiment 1: I tokenized a series of Python files from four GitHub repositories (more than one million tokens in total) and, for each token, I computed the mask using the streamlined mask store and checked that the next token was indeed allowed by the mask;
- Experiment 2: I checked that 400 strings (more than one million characters in total) generated using the streamlined mask store were syntactically correct Python code;
- Experiment 3: For more than 100,000 tokens extracted from the same Python files as before, I computed the masks using both the original mask store and the streamlined mask store and I checked that the resulting masks were systematically identical.
These three experiments were successful, which suggests, as expected, that the masks computed are adequate (i.e. neither excessively restrictive nor excessively permissive) and that streamlining the mask store does not alter the masks computed at decoding time.
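For illustration, Experiment 1 essentially boils down to a check of the following form, where compute_mask is a hypothetical helper returning the mask derived from the streamlined mask store for the tokens seen so far.

def check_masks_on_corpus(token_ids, compute_mask):
    # The actually observed next token must always be allowed by the computed mask
    for position in range(1, len(token_ids)):
        mask = compute_mask(token_ids[:position])
        assert mask[token_ids[position]] == 1, f"Token rejected at position {position}"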
Conclusion and Potential Future Work
This post introduces a novel method for streamlining mask stores in CFG-constrained decoding techniques that leverage an automaton-based lexer and an incremental parser. This method significantly reduces the inference overhead at the cost of modest additional computations performed once for a given grammar and a given tokenizer.
Potential areas for improvement of this method are as follows:
- We can modify the definitions of is_never_legal, is_always_legal and are_jointly_legal so that their first argument is a sequence of terminals rather than a single terminal. This adjustment could increase the frequency with which these functions return True, creating more opportunities to streamline the mask store. It would also likely increase the computation time of the pre-processing steps, but this could be controlled by limiting the size of the input sequence of terminals. In the case of the Python grammar, just switching the first argument from a terminal to a pair of terminals would probably be very helpful. For example, a NAME terminal can be used in various contexts, but if we know it is preceded by a DEF terminal or an IMPORT terminal, this considerably narrows down the options for the next terminal;
- The method to obtain is_always_legal amounts to checking whether it is possible to go from one state to another through the pushdown automaton, whatever the initial stack is. We could use a similar approach to tentatively compute are_jointly_legal: we would need to determine whether the NFAs created from the paths reading S1 or S2 from the state corresponding to t are equivalent;
- We can attempt to enforce both CFG constraints and proper tokenization, as in the previous blog post for regex constraints. This could yield the same benefits as with regex constraints: faster decoding and fewer risks of distortions of the language model distribution;
- In practical cases, we can consider altering the definition of the input grammar to enforce semantic or stylistic constraints on top of the syntactic ones. This could both accelerate decoding and result in code more suited to our needs. For example, with the Python grammar:
- we can replace the NAME terminal with several terminals (FUNCTION_NAME, CLASS_NAME, VARIABLE_NAME...) so that it is possible to apply naming conventions (e.g. lowercase for function names, CamelCase for class names...). Additionally, the lesser ambiguity would lead to a simpler mask store;
- we can specify an allowlist of modules that can be imported, which is useful to control the dependencies of the code. This can also increase the likelihood that only one token is allowed, in which case there is no need to call the LLM;
- we can impose a docstring for each function (or class or method). This would help developers better understand the generated code, but it could also accelerate decoding: if we also exclude comments at the end of the first line of a function, we know that a closing parenthesis should be followed by a colon, a line break, an indentation and a long string. We can then add the corresponding tokens without calling the LLM;
- We can try to dynamically adjust the constraints in light of the code already generated. For example, after from collections import, we may allow only the classes contained in the collections module, or after object., we may allow only the name of an attribute or method of object. This is a significant departure from the situation discussed in this blog post as it becomes crucial to control the pre-processing time;
- In the context of this blog post, I computed the number of entries of the mask store before and after its streamlining and checked that the computed masks were correct, but I did not measure the actual inference overhead. A follow-up work would be to rigorously compare this overhead with that of existing packages.
Thanks to the authors of the papers mentioned below and to the contributors of the various packages used in this project. I am also particularly grateful to Martin Berglund, who immediately suggested a potential undecidability problem when I told him about my investigations.
If you want to cite this blog post, you are welcome to use the following BibTeX entry:
@misc{Tran-Thien_2025,
title={Accelerating LLM Code Generation Through Mask Store Streamlining},
url={https://vivien000.github.io/blog/journal/grammar-llm-decoding.html},
journal={Unsupervised Thoughts (blog)},
author={Tran-Thien, Vivien},
year={2025}
}
References
- Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. https://arxiv.org/abs/2307.09702
- Gerganov, G., et al. (2024). llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++. https://github.com/ggerganov/llama.cpp
- Lundberg, S., Ribeiro, M. T. C., et al. (2023). guidance-ai/guidance: A guidance language for controlling large language models. https://github.com/guidance-ai/guidance
- Geng, S., Josifoski, M., Peyrard, M., & West, R. (2024). Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. https://arxiv.org/abs/2305.13971
- Beurer-Kellner, L., Fischer, M., & Vechev, M. (2024). Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation. https://arxiv.org/abs/2403.06988
- Ugare, S., Suresh, T., Kang, H., Misailovic, S., & Singh, G. (2024). SynCode: LLM Generation with Grammar Augmentation. https://arxiv.org/abs/2403.01632
- Dong, Y., Ruan, C. F., Cai, Y., Lai, R., Xu, Z., Zhao, Y., & Chen, T. (2024). XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. https://arxiv.org/abs/2411.15100
This blog post was initially published on my personal blog.