8 Ways to Improve Your Regex Performance, Based on Data Gathered From Real Pull Requests
Long before machine learning (ML) took over natural language processing (NLP), regular expressions dominated the field. Even today, regular expressions appear in almost every piece of production code.
I’ve written some huge regexes over the years in various search-relevance work, and in retrospect many of them weren’t essential, but that’s learning.
Let’s Set the Context
- Your code will run millions of times.
- This does not cover how to write regular expressions or their various semantics; I plan to cover the semantics and internals of regular expressions in a future post.
- Focus on the ideas, which are language agnostic, rather than on the specific code shared here.
Regex’s Love for Backtracking
It’s a good idea to keep your regex from going into excessive backtracking.
Now, why does writing regex correctly matter? Because of excessive CPU time: a poorly written regex can become a significant bottleneck for a product, both in latency and in processor cost. More on this later.
We’ll look at a few techniques for avoiding excessive regex backtracking and limiting the number of regex runs.
Never Write Uncompiled Regexes in the Hot Path
Constructing a regex can take up to O(2^n) time in the worst case, where n is the length of the pattern.
Before a Match operation can run, an uncompiled regex must first be compiled, so keeping an uncompiled regex in the hot path adds that construction cost to every single run.
Prefer this
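(a minimal sketch in C#; the pattern and type names are just illustrative)

```csharp
using System.Text.RegularExpressions;

static class TokenValidator
{
    // Constructed (and compiled to IL) exactly once, then reused for every call.
    private static readonly Regex OrderIdRegex =
        new Regex(@"^ORD-\d{6}$", RegexOptions.Compiled);

    public static bool IsOrderId(string token) => OrderIdRegex.IsMatch(token);
}
```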
over this
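(the same sketch, but constructing the regex on every call)

```csharp
using System.Text.RegularExpressions;

static class TokenValidator
{
    public static bool IsOrderId(string token)
    {
        // A new Regex is constructed on every call, so the construction
        // cost is paid again and again in the hot path.
        return new Regex(@"^ORD-\d{6}$").IsMatch(token);
    }
}
```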
(Chart: runtime of uncompiled vs. compiled regex Match())
Why Is It Critical to Keep the Number of Regex States to a Minimum?
The number of comparisons required by the regular expression is roughly proportional to the length of the string multiplied by the number of possible intermediate matches.
As previously stated, regex construction can take O(2^(length of regex)) in the worst case, so it is advisable to keep the total number of regex states to a minimum and avoid excessive backtracking.
Writing specific regexes rather than generic regexes is one way to accomplish this.
Fail Fast
Being specific helps the regex fail quickly. On partial or near-perfect matches, a regex takes far longer to fail than it does to succeed on a clean match. A longer, more explicit regex is more useful than a short, generic one that has to do a lot of backtracking to rule out partial matches.
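As an illustration, compare a specific version-string pattern with a generic one on an input that almost matches (the patterns here are just illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

// Specific: each piece may only contain digits, so a bad character
// makes the match fail almost immediately.
var specific = new Regex(@"^\d+\.\d+\.\d+$", RegexOptions.Compiled);

// Generic: ".*" can match anything, so the engine tries several different
// ways of splitting the input around the dots before it gives up.
var generic = new Regex(@"^.*\..*\..*$", RegexOptions.Compiled);

string nearMiss = "1.2";                       // looks close to a version, but is not one
Console.WriteLine(specific.IsMatch(nearMiss)); // False, rejected quickly
Console.WriteLine(generic.IsMatch(nearMiss));  // False, but only after more backtracking
```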
Set a Time Limit for the Regex Execution
If you don’t want your regex to run forever and consume all of your CPU resources, set a time limit for regex execution.
When run against moderately sized strings, some poorly written regexes can take hours to complete, and it is even possible to write a regular expression (through catastrophic backtracking) that will not finish within the lifetime of the universe.
If the regex cannot find an answer within this time limit, it aborts the execution instead of running indefinitely.
If you’re familiar with the drawbacks of backtracking algorithms, you’ll understand that regex engines work on the principle of backtracking, so we must ensure they never go into a perpetual backtracking mode.
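In .NET, for example, the Regex constructor accepts a match timeout. A minimal sketch, with an illustrative pattern and budget:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "some user-supplied text that may be adversarial";

// The third constructor argument is the match timeout; 100 ms here is just
// an illustrative budget, not a recommended value.
var regex = new Regex(@"^(\w+\s?)*$", RegexOptions.Compiled,
                      TimeSpan.FromMilliseconds(100));

try
{
    bool ok = regex.IsMatch(input);
    Console.WriteLine(ok);
}
catch (RegexMatchTimeoutException)
{
    // The engine gave up instead of backtracking forever; treat the input
    // as a non-match (or log it) rather than burning CPU indefinitely.
    Console.WriteLine("Regex timed out");
}
```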
Use Domain Knowledge to Keep Regex Runs to a Minimum
It is important to note that while a regex does not take long to find a correct match, it can take a very long time to report failure on partial or near-perfect matches.
Apply domain knowledge when creating the regex, and reduce the number of regex runs by not running it at all for obviously non-matching cases.
Let’s take an example: suppose you are looking for a specific URL pattern and need to extract various segments from the URL.
Here is the specification for a matching URL:
- It ends with a number followed by .html, and
- One of the URL segments includes the term “feedback”.
- It must be from a specific domain, such as linkedin.com, microsoft.com, github.com, youtube.com, or anyother.com.
e.g. https://linkedin.com/users/dsats/feedbacks/july/2022/ft=123.html
A simple way to write this is:
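A rough sketch of such a catch-all pattern might look like this (the exact pattern is illustrative and will vary with your needs):

```csharp
using System.Text.RegularExpressions;

// A single catch-all pattern: allowed domain, a "feedback" segment somewhere,
// and a trailing numeric id followed by ".html".
var urlRegex = new Regex(
    @"^https://(linkedin\.com|microsoft\.com|github\.com|youtube\.com|anyother\.com)" +
    @"/.*feedback.*/.*=(\d+)\.html$",
    RegexOptions.Compiled);
```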
However, a better approach is to check all the obvious conditions before running the regex, as sketched in the code after this list.
- Do not run the regex if:
  - The URL is not well formed as per the RFC specification.
  - The URL does not end with .html, or there are no digits after the last index of “=”.
  - The URL scheme is not https, to protect against any other schemes we don’t care about, such as ftp or ssh.
  - The URL host does not belong to any of the required domains.
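A sketch of what those guards might look like in C# (the method and variable names are illustrative, and the regex is a simplified variant of the catch-all pattern above, since the host check now covers the domain):

```csharp
using System;
using System.Text.RegularExpressions;

static class FeedbackUrlMatcher
{
    private static readonly string[] AllowedHosts =
    {
        "linkedin.com", "microsoft.com", "github.com", "youtube.com", "anyother.com"
    };

    private static readonly Regex UrlRegex = new Regex(
        @"^https://.+/.*feedback.*/.*=(\d+)\.html$",
        RegexOptions.Compiled, TimeSpan.FromMilliseconds(100));

    public static Match TryMatch(string url)
    {
        // 1. Reject URLs that are not well formed per the RFC.
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return null;

        // 2. https only; ignore ftp, ssh and any other scheme we don't care about.
        if (uri.Scheme != Uri.UriSchemeHttps) return null;

        // 3. Must end with ".html" and have a digit right after the last '='.
        if (!url.EndsWith(".html", StringComparison.OrdinalIgnoreCase)) return null;
        int eq = url.LastIndexOf('=');
        if (eq < 0 || eq + 1 >= url.Length || !char.IsDigit(url[eq + 1])) return null;

        // 4. The host must be one of the required domains.
        if (Array.IndexOf(AllowedHosts, uri.Host.ToLowerInvariant()) < 0) return null;

        // Only now pay for the regex.
        Match match = UrlRegex.Match(url);
        return match.Success ? match : null;
    }
}
```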
Adding these basic checks can save a significant amount of CPU, and if the regex is expected to return only a few matches and a large number of failures, the savings can be even greater.
Keep in mind that the code executes over millions of URLs, so even a 1% optimization would add up to a significant savings in computation costs.
If you can apply domain knowledge to the problem space, you can improve performance without sacrificing the correctness of the regex.
Use the Repetition Operator With Care
Incorrect use of repetition operators such as + and * is a major cause of regexes going into excessive backtracking. On partial or near-perfect matches this becomes a massive bottleneck, because the match fails only very late: hoping that some partial match will turn into a full match, the engine tries every possible partial-match branch.
Careless use of repetition operators pushes the regex engine into a “try every possible combination” problem, which has exponential runtime.
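The classic illustration is a nested quantifier such as (a+)+. A sketch, guarded with a timeout so it fails safely on a backtracking engine:

```csharp
using System;
using System.Text.RegularExpressions;

// "(a+)+b" nests one repetition operator inside another: for an input of
// n 'a' characters with no trailing 'b', a backtracking engine tries on the
// order of 2^n ways to split the 'a's between the inner and outer '+'
// before it can report failure.
var nested = new Regex(@"^(a+)+b$", RegexOptions.Compiled,
                       TimeSpan.FromMilliseconds(200));

string almostMatch = new string('a', 40); // a near-perfect match, but no 'b'

try
{
    Console.WriteLine(nested.IsMatch(almostMatch));
}
catch (RegexMatchTimeoutException)
{
    Console.WriteLine("Gave up: catastrophic backtracking");
}
```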
Understand the Purpose of Your Regex
If the goal is simply to validate whether or not a token matches a regex, then using methods that also capture groups will slow things down.
Use regexInstance.IsMatch(token) to determine whether the token matches the pattern, and use regexInstance.Match(token) only when you also need the captured groups. Calling the group-capturing Match() for a plain yes/no check spends resources on results that are never used.
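For example (the pattern and names are just illustrative):

```csharp
using System.Text.RegularExpressions;

var emailRegex = new Regex(@"^(\w+)@(\w+)\.com$", RegexOptions.Compiled);

// Validation only: IsMatch answers yes/no and never builds a Match object
// with captured groups.
bool isValid = emailRegex.IsMatch("dev@example.com");

// Extraction: call Match only when the captured groups are actually needed.
Match m = emailRegex.Match("dev@example.com");
if (m.Success)
{
    string user = m.Groups[1].Value;   // "dev"
    string host = m.Groups[2].Value;   // "example"
}
```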
Understand Your Language’s Regex Library and the Possibilities It Opens Up for Modifying Its Runtime
When it comes to modifying how a regex runs, not every language exposes the same level of control; however, each language’s regex library has its own optimizations for speeding up regexes.
For example, in C#
It is preferable to reuse a static, compiled Regex instance for matching rather than calling the static methods on the Regex class directly.
Prefer this:
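(a sketch with illustrative names: one static, compiled instance reused for every call)

```csharp
using System.Text.RegularExpressions;

static class Tokenizer
{
    // One static, compiled instance, reused for every call.
    private static readonly Regex WordRegex = new Regex(@"\w+", RegexOptions.Compiled);

    public static MatchCollection Words(string text) => WordRegex.Matches(text);
}
```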
Over this:
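(the same sketch routed through the static helper methods instead)

```csharp
using System.Text.RegularExpressions;

static class Tokenizer
{
    // Every call goes through the static Regex method, which has to look the
    // pattern up in the internal cache and never benefits from RegexOptions.Compiled.
    public static MatchCollection Words(string text) => Regex.Matches(text, @"\w+");
}
```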
Nothing Can Save a Poorly Written Regex Pattern
None of these practices removes the possibility of writing a bad regex pattern. A time limit does cap the resources a poorly written regex can consume, but you don’t want a regex that constantly runs up against that limit.
- Understand the purpose of the regex. A lot of unnecessary regexes could be avoided with simple string operations.
- Do not try to imitate a minimalist design. Avoid small regexes packed with repetition operators; instead, write elaborate regexes that are very specific to the use case, leveraging all available domain knowledge so the regex runs only for the scenarios that require it.
One such regex, even after being made static and compiled, consumed 30% of the CPU under heavy load.
Consider the following scenario: we are attempting to capture a human RNA sequence using regex.
pattern = g.*c.*a.*t
If we have to run this on 20 million sequences of 500 characters each, the excessive backtracking may cause the system to crash. More on the “Why” later.
ccccggggacaccaggaccaaaaccagccccccaggagagcccgcgcaggggacgcacaaaaaaggggaccggccccgaagacaacaagcaccggacaacaagcgacacacaaccagcaccgcacaacacacgcccccgacccgcgagaacaacgacaaccggcgcaccggccgcaaaaacaagaggagagaacagagcgacaacgacgaccgagaacgccgagggacaacgcaacgaaacaccgaccccgcgaaaccgcaccgaagccgagaaggagggcacaggagcacgaacggagaggccgaccccgggccaagacaccaagcggaaggacgccagcggccacgacagcgaagaccgacaagaaaaagcgggaacccacggccggaaccggacacagcggcacgaaaaaagacggagcgcggcgcccaacgagcaggagcgcgcgcgcaggagcgaaggaacaagccaagcaacaagacgggaacgccagccaacacaggcgagcggcgcaaaaaagacgcacaccccgggaaaggacaagggagcaacggaacccaccgccgcaagagcgcgcccaggaacgagcgacgcggggcgggaagacccaccccgacggcagaggacccgagaacgcacagaggcgcccggagccgccacgcaacgcagcagacgcacacggcccggggcagcgaggcaggagcgcagcaagacagaggggacgggagcagcgaggggacggacagaggcacacgaaccaaacacaccggcaacggacaaccgacgccgggaaaacccgcgggggaccacccggcaaggagacagcaccggcacagaccgcgccaaacccacagaagacccccccgggaccacaccaggcgagggcaaccacgacaccaggggcaaagccaaggcggaccgacagccggcacggggggcagccagcaaaaacacagcgaacaagccggggcggccccaagcagcaccggagaacggagg
If you look closely, you will notice that the pattern never matches a sequence like this one (it contains no ‘t’ at all), yet every attempt is a near-perfect match that fails only after the engine has tried all of the potential partial-match branches.
Notice that, before running the regex, we could apply “Use Domain Knowledge to Keep Regex Runs to a Minimum” to cut out this excessive backtracking on partial matches.
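For instance, since g.*c.*a.*t can only match when all four bases are present, a cheap containment test lets us skip the regex entirely for sequences like the one above. A sketch:

```csharp
using System;
using System.Text.RegularExpressions;

var rnaRegex = new Regex("g.*c.*a.*t", RegexOptions.Compiled,
                         TimeSpan.FromMilliseconds(100));

static bool HasAllBases(string seq) =>
    seq.Contains('g') && seq.Contains('c') && seq.Contains('a') && seq.Contains('t');

bool IsHit(string sequence)
{
    // Domain knowledge: "g.*c.*a.*t" can only match if all four bases appear
    // at least once, so skip the regex (and all of its backtracking)
    // whenever the cheap containment check already rules the sequence out.
    if (!HasAllBases(sequence)) return false;
    return rnaRegex.IsMatch(sequence);
}

Console.WriteLine(IsHit("ccccggggacaccagg")); // False, without ever running the regex
```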
When Should Your Regex Be Modified?
Remember that different regex engines work in different ways and use different optimizations and algorithms. I urge you to benchmark your regular expressions against a set of expected inputs, to avoid needlessly obscuring them with performance tweaks that make no real difference.
In the end, the best strategy is to avoid regex until no other option exists.