If you’re like me, and you like to cause yourself pain, you may be writing a programming language and want to support Python-like f-strings. Well, f-strings are quite a funny little thing to parse, because a typical lexer doesn’t have the theoretical power to handle them (lexers only support regular languages). The long and short of it is that you can have a string like f"I am a string that wants to have a value like {({5})} in it!".
This string must “count” opening braces to match them with closing braces, and as we’ve learned in our favorite theory of computation or compilers class, you need to move up to context-free grammars to support this.
Long story short, the best strategy to handle this IMHO is to literally nest the f-string parsing work in a special, separate lexer/parser implementation and simply capture the full f"..." in the parent parser as a single token. Then, when constructing the AST, you can invoke the f-string parser to extract the expressions from the f-string terminal token and replace that token with a fleshed-out strings-plus-expressions subtree. Anyway, in the spirit of not reinventing the wheel, I did my typical Googling and ChatGPTing to get a quick implementation of an f-string parser. I was expecting to spend 5 minutes on this, but, to my surprise, it’s unexpectedly hard to find solid examples on the internet of a lexer and parser for Python’s f-string syntax.
Perhaps it shouldn’t be that surprising since f-strings were introduced relatively recently in Python 3.6, and represent a new style of formatting strings in Python.
Anyway, although f-strings are loved for their simplicity and efficiency, and despite their popularity in the Python community, full code examples illustrating how to parse them are non-existent as far as I can tell, save for the Python code base of course (and yes, I did read some of that code too). Given this gap, I decided to roll up my sleeves and just write one using sly. Sly is a library that provides classes for constructing lexers and parsers (it’s basically lex and yacc for Python). I figured I’d share it as a blog post so it may save someone some time in the future. In the following sections, I’ll walk you through the main components of the basic solution.
Lexer
Our lexer’s job is to break the input into tokens, which are the smallest units of meaning that the parser can understand. The lexer I’ve written is named FStringLexer. The code is as follows:
class FStringLexer(Lexer):
    """Python Like F-String Lexer."""

    tokens = {
        "STRING_START",
        "STRING_END",
        "EXPR_START",
        "EXPR_END",
        "PIECE",
    }
    ignore = " \t"

    # Tokens
    STRING_START = r"f\""
    STRING_END = r"\""
    EXPR_START = r"(?<!\{)\{(?!\{)"
    EXPR_END = r"(?<!\})\}(?!\})"
    PIECE = r"[^\{\}\"']+|\{\{|\}\}"
The Lexer class uses regular expressions to define its tokens. We’ve defined five tokens: STRING_START, STRING_END, EXPR_START, EXPR_END, and PIECE. We’ve also set the ignore attribute so that stray whitespace characters between tokens are skipped.
The regular expressions define what constitutes each token. The EXPR_START and EXPR_END tokens have lookbehind and lookahead assertions to prevent matching double braces ({{ or }}), which are the escaped versions of braces in Python’s f-strings.
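To see what the lexer produces, here is a small sanity check (my own addition, not part of the original code; it just assumes FStringLexer is defined as above):

# Feed a sample f-string through the lexer and dump the token stream.
lexer = FStringLexer()  # the lexer class defined above
for tok in lexer.tokenize('f"hello {name} and {{braces}}"'):
    print(tok.type, repr(tok.value))

For that input you should see a STRING_START token, then alternating PIECE and EXPR_START/EXPR_END tokens, with the escaped {{ and }} coming through as ordinary PIECE tokens, and finally a STRING_END.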
Parser
Next up is the parser. The parser uses the tokens provided by the lexer to create a meaningful representation of the f-string. The parser I’ve written is called FStringParser. Its code is as follows:
class FStringParser(Parser):
    """Python Like F-String Parser."""

    tokens = FStringLexer.tokens

    def __init__(self: "FStringParser") -> None:
        """Initialize the parser."""
        self.names = {}

    @_("STRING_START parts STRING_END")
    def fstring(self: "FStringParser", p: YaccProduction) -> YaccProduction:
        """Start rule for fstring."""
        return p

    @_(
        "parts expr",
        "parts PIECE",
        "PIECE",
        "expr",
    )
    def parts(self: "FStringParser", p: YaccProduction) -> YaccProduction:
        """Parts of the string both in string and in expression."""
        return p

    @_("EXPR_START parts EXPR_END")
    def expr(self: "FStringParser", p: YaccProduction) -> YaccProduction:
        """Expressions rule."""
        return p
In this class, we first define the set of tokens we’re going to use in the parser, which are the same tokens that our lexer uses. The parser’s rules are then defined in its methods, which handle matching sequences of tokens.
We have three rules: fstring, parts, and expr. The fstring rule matches the start of the f-string, its parts, and its end. The parts rule handles a sequence of literal pieces and expressions, and it is reused both at the top level of the string and inside expressions; the expr rule handles a braced expression within the f-string.
This lexer-parser duo, when combined, is capable of parsing Python’s f-string representation (not a fully PEP-compliant implementation, but good enough for now).
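The rules above just return the raw YaccProduction objects, which is enough to check that a string parses. If you want the parser to hand back the extracted pieces and expressions (say, to build that strings-plus-expressions subtree mentioned earlier), a minimal sketch of alternative rule actions could look like the following. This is my own rough variation, not part of the original parser, and the tuple shapes are arbitrary:

from sly.yacc import Parser

_ = None  # keeps flake8 happy; sly injects the real _ inside the class body


class FStringCollector(Parser):
    """Sketch: same grammar as FStringParser, but the actions build data."""

    tokens = FStringLexer.tokens  # reuses the lexer's token set from above

    @_("STRING_START parts STRING_END")
    def fstring(self, p):
        return p.parts  # a list of ("piece", text) and ("expr", [...]) tuples

    @_("parts PIECE")
    def parts(self, p):
        return p.parts + [("piece", p.PIECE)]

    @_("parts expr")
    def parts(self, p):
        return p.parts + [p.expr]

    @_("PIECE")
    def parts(self, p):
        return [("piece", p.PIECE)]

    @_("expr")
    def parts(self, p):
        return [p.expr]

    @_("EXPR_START parts EXPR_END")
    def expr(self, p):
        return ("expr", p.parts)

Parsing f"hi {x}" with this variant would hand back something like [('piece', 'hi '), ('expr', [('piece', 'x')])].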
The full code, which you can run and play with, is below:
"""Python Like F-String Parser.""" # from jaclang.utils.sly.lex import Lexer # from jaclang.utils.sly.yacc import Parser, YaccProduction from sly.lex import Lexer from sly.yacc import Parser, YaccProduction _ = None # For flake8 linting and sly compatibility class FStringLexer(Lexer): """Python Like F-String Lexer.""" tokens = { "STRING_START", "STRING_END", "EXPR_START", "EXPR_END", "PIECE", } ignore = " \t" # Tokens STRING_START = r"f\"" STRING_END = r"\"" EXPR_START = r"(?<!\{)\{(?!\{)" EXPR_END = r"(?<!\})\}(?!\})" PIECE = r"[^\{\}\"']+|\{\{|\}\}" class FStringParser(Parser): """Python Like F-String Parser.""" tokens = FStringLexer.tokens def __init__(self: "FStringParser") -> None: """Initialize the parser.""" self.names = {} @_("STRING_START parts STRING_END") def fstring(self: "FStringParser", p: YaccProduction) -> YaccProduction: """Start rule for fstring.""" return p @_( "parts expr", "parts PIECE", "PIECE", ) def parts(self: "FStringParser", p: YaccProduction) -> YaccProduction: """Parts of the string both in string and in expression.""" return p @_("EXPR_START parts EXPR_END") def expr(self: "FStringParser", p: YaccProduction) -> YaccProduction: """Expressions rule.""" return p if __name__ == "__main__": """Run the parser for live testing.""" lexer = FStringLexer() parser = FStringParser() while True: try: text = input("fstring > ") except EOFError: break if text: tokens = lexer.tokenize(text) result = parser.parse(tokens) print(result)