Hey! I recorded two video courses!
If you like this article, you will also like the courses! Check them out here!
Hey folks, before we begin, a public service announcement:
Respect people who wear glasses.
They paid money to see you.
Okay, that was enough fun for today. Let’s dive into today’s topic: How can you search through millions of usernames using Ecto and Postgres? Let’s find out!
Quick note: This article does not cover full-text search (FTS) with Postgres. It focuses only on searching for short strings like names, movie titles, article names, etc. I will cover FTS in another article in the future.
Imagine that you have a database with a million users. You need to allow admins to search through the users by name. You need to get this out the door quickly, so instead of building a cool, custom search engine, you want to use what’s already at hand. Luckily, Postgres got your back.
In short, there are two good candidates of query methods you can use: (i)like
and similarity
. (i)like
is better if you want to have exact matches with your query. For example, the search term Ber
would return Bert
or Berry
, but not Bart
. similarity
is a more fuzzy
approach and would also return Bart
, since it’s not the same, but very similar to Ber
.
We will test our name search against an example set of 18.000 first names. The non-trivial size of the set allows us to test the speedup of using indexes. Let’s have a look at how to implement each of them.
🔗 (i)LIKE
Probably the most common search method is the LIKE operator. It checks whether a string contains a search term. So, Bert
matches Ber
, but Bart
does not. You can specify where in the string the search term may exist using the wildcard %
sign. For example, a search term of Ber%
means that a string must begin with Ber
whereas %Ber
means the string must end with Ber
. If your search term may exist anywhere in the string, you can use two %
signs like this: %Ber%
. This would match Bert
, but also DagoBert
.
In our queries, we usually don’t use LIKE
, but its sibling ILIKE
, which ignores cases in our strings and search terms. So, whereas Ber% LIKE 'bert'
doesn’t match, Ber% ILIKE 'bert'
does. Now, let’s have a look at how to implement an efficient ILIKE
search in Ecto.
🔗 The Migration
First, we build a migration that creates a simple persons
table with only one field: a name
. We also create an index for that field, which we’ll discuss later. This is how it looks:
defmodule Demo.Repo.Migrations.CreatePersons do
use Ecto.Migration
def up do
create table(:persons) do
add :name, :string
end
execute "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
execute """
CREATE INDEX persons_name_gin_trgm_idx
ON persons
USING gin (name gin_trgm_ops);
"""
end
def down do
drop table(:persons)
end
end
First off, since we use the execute/1
function with custom SQL code, we need to split the migration into up/0
and down/0
since otherwise, Ecto doesn’t know how to roll back our migration.
The two execute/1
statements are the reason our ILIKE
search is efficient. The first statement activates the pg_trgm extension, which allows us to search through a GIN
index of our names instead of having to scan through every row 1-by-1. The second execute/1
statement creates the GIN
index for the name
column using the gin_trgm_ops
.
Watch out: Don’t create the index with the Ecto variant of it, which is:
# Don't use this
create(index(:persons, [:name], using: "GIN"))
# Because it translates to this:
CREATE INDEX persons_name_index ON persons USING GIN (name varchar_ops);
The Ecto variant will create a GIN
index, but by using the varchar_ops
and not the gin_trgm_ops
. This won’t speed up our text search. So, better use the custom SQL code in the second execute/1
statement to create the GIN
index.
Let’s have a look at how to use the index next.
🔗 The Query
Thankfully, Ecto already offers the ilike function, which we can use for our query. This is what it looks like:
defp search(search_term) do
search_term = "%#{search_term}%"
query =
from(
p in Person,
where: ilike(p.name, ^search_term)
)
Repo.all(query)
end
This will return all names that contain the search term anywhere in their string. If you analyze this query in a database explorer, you will see that it uses our GIN index. We ran this query without the index and it was around 30 times slower (~700ms vs 21s). So, before you push this query to production, please check in your database explorer whether your index is used. You can do that by running the following command:
EXPLAIN ANALYZE SELECT name FROM persons WHERE name ILIKE '%bert%';
It should return an analysis that starts with Bitmap Heap Scan
. That means that your index is used. If it returns an analysis beginning with Seq Scan
, it doesn’t use your index and scans every row 1-by-1, which you don’t want.
Now, the ILIKE
query returns all names that match a certain search term, but it won’t return names that don’t match the search term but are very similar to it. For example, if you search for Ana
you might also want to see Anna
and Hannah
in your results. So, names that don’t match exactly, but are very similar. We can use Postgres’ SIMILARITY operator for that. Let’s check out how.
🔗 Similarity
similarity
is one of these functions that appear to be very easy to use, but it’s very easy to get it wrong. So, before you deploy your query to production, please analyze the query using EXPLAIN ANALYZE
first. We will get into the details of what can go wrong later on. Let’s first have a look at how to use this operator.
You can re-use the migration for ILIKE
also for similiarity
. It uses the same GIN
index. Run your your migration and then head over to your context to add the following query:
def similarity(name) do
query =
from(
p in Person,
where: fragment("? % ?", ^name, p.name),
order_by: {:desc, fragment("? % ?", ^name, p.name)}
)
Repo.all(query)
end
This query uses the short-hand version of similarity
, the %
operator, to compare the search term with the names in the database. It orders the results by their similarity to the search term so that the most similar results are on top. There are a few small, but important details about this query:
-
You must use the short-hand
%
operator because the query won’t use theGIN
index otherwise. Don’t usesimilarity(name, ?)
in your query, but the short-hand%
operator. -
If you use the
<%
or<<%
operators, you must put the search term first. So,WHERE bert <% name
. If you turn them around, Postgres won’t use your GIN index. -
When you use the short-hand
%
operator, you can’t set the threshold for how similar names must be to be included in the results. If you use thesimilarity
function, you can set it with:similarity(name, ?) > 0.3
. If you use the short-hand operator%
, the globalpg_trgm.similarity_threshold (defaults to 0.3)
is used instead. You can configure the global threshold for one session or your entire database using:
SET pg_trgm.similarity_threshold = 0.3;
You can configure the similarity threshold of your Ecto.Repo
connection using a Connection prefix (Thanks to Michał for the tip ❤️). You can set it in your configuration like this:
# config.exs
query_args = ["SET pg_trgm.similarity_threshold = 0.4", []]
config :my_app, MyApp.Repo,
after_connect: {Postgrex, :query!, query_args}
This configuration will set your custom similarity_threshold
after connecting to the database.
If the global threshold of 0.3
is a problem for you and you can sacrifice the GIN index, the following variation will help. Be warned that it doesn’t use the GIN index and scans your entire table. Sadly, there’s no free lunch.
def similarity(name, limit \\ 25) do
query =
from(
p in Person,
order_by: {:desc, fragment("? % ?", ^name, p.name)},
limit: ^limit
)
Repo.all(query)
end
This query has no hard threshold for similarity, but orders the names by their similarity and only returns the top - in this case - 25 names. It isn’t perfect, but it works without a hard threshold.
🔗 Variations of similarity
If you implement a name search, it is important to know the two siblings of the similarity
operator, the word_similarity
and strict_word_similarity
.
One problem with the similarity
operator is that it compares a search term against the entire string of a name, ignoring any spaces and therefore, word boundaries. For example, let’s say you want to see the profile of a user with a typical Portuguese name, like João Miguel Mello Fernandes Pereira dos Reis
. You probably don’t remember the full name of the person, but only their first name João
.
Now, if you search for João
using the similarity
operator, it will rank the user lower since João
shares very few letters with João Miguel Mello Fernandes Pereira dos Reis
. A user called João Sausa
will have a much higher ranking simply because his name is shorter. The problem is that similarity
depends on the length of a name. The Postgres team recognized this problem and created the word_similarity
and strict_word_similarity
alternatives.
Both alternatives put more importance on whether a search term is similar to any word in the name. Since the search term, João
, matches perfectly with the first word in the long name João ... Reis
, it will have a similarity of 1
. The same holds for the shorter name João Sausa
. So, the similarity doesn’t depend on the length of a name anymore.
The difference between word_similarity
and strict_word_similarity
is subtle but important. word_similarity
compares your search term with sub-parts of a word whereas strict_word_similiarity
only compares against whole words. So, use word_similarity
if you want to return names even when your search term only matches a part of the name, e.g. Bert
matches Dagobert
. If you want to prioritize full matches, use strict_word_similiarity
, e.g. only Dagobert
matches Dagobert
, but Bert
doesn’t.
In order to make use of your GIN index, you must also use the short-hand operators <%
for word_similarity
and <<%
for strict_word_similarity
when writing your query.
🔗 Combine ILIKE and Similarity
Now that you understand both methods, ILIKE
and similarity
, there’s one combination that you might like best:
def ilike_with_similarity(search_term) do
ilike_search_term = "%#{search_term}%"
query =
from(
p in Person,
where: ilike(p.name, ^ilike_search_term),
order_by: {:desc, fragment("? % ?", ^search_term, p.name)}
)
Repo.all(query)
end
This query uses ILIKE
to filter out any non-matching names and then orders the remaining names by their similarity to the search term. It uses the GIN index for all operations and returns the most relevant results on top. Keep in mind that it has the same limitation as ILIKE
which is that it doesn’t return non-matching, but very similar search results like Ana
for the search term Anna
. It is independent of the global similarity threshold since it only sorts the results of the ILIKE
query. If you use the ILIKE
method, you should seriously consider this combination for production use.
🔗 Intermediate Wrap-up
Now you know how to use the ILIKE
and similarity
operators efficiently to search for names and other short texts in your database. If that is all you want to know, feel free to stop reading at this point. Just make sure to EXPLAIN ANALYZE
your queries and check that they use the index before going to production.
The remainder of this blog post takes a deep dive into the nitty-gritty details of how Postgres calculates the similarity
score. During our research, we were unable to find a comprehensive explanation of the inner workings of the similarity
, word_similarity
, and strict_word_similarity
operators. So, here it is 😊
🔗 A deep-dive into Similarity
To understand the similarity
operator, we first need to understand what Tigrams
are. In short: A trigram is a group of three consecutive characters taken from a string.
So, the Trigrams of the word Bert
would be:
> SELECT show_trgm('Bert');
{" b"," be","ber","ert","rt "}
Trigrams break down each string into a Set
of character groups with 3 letters each. A character group may be padded with spaces if they are at the beginning or end of the string (like " b"
above). The similarity
operator uses these sets to compare two strings and returns how much they overlap on a scale from 0 to 1. The two sets are more similar if they have many Trigrams in common and less similar if they share fewer Trigrams.
🔗 How Postgres calculates similarity
Let’s break down the three names Bert, Bart, and Berry
into their Trigrams and calculate the similarity of Bert and Bart
and Bert and Berry
. Note that Trigrams ignore the case of letters. Everything is lowercase.
The following chart shows which Trigrams occur in both names (X
) and which don’t (-
). For example in Bert and Bart
, the first Trigram " b"
occurs in both names, whereas the second Trigram " ba"
doesn’t.
Bert: {" b"," be","ber","ert","rt "}
Bart: {" b"," ba","art","bar","rt "}
Match: X - - - X
Bert: {" b"," be","ber","ert","rt "}
Berry: {" b"," be","ber","err","rry","ry "}
Match: X X X - - -
Postgres calculates the similarity of these sets using the following calculation:
similarity = number of shared trigrams / number of distinct trigrams
In our comparison of Bert vs Bart
, both Bert
and Bart
have 5 Trigrams and 2 Trigrams in common. We can calculate their similarity with:
similarity = 2 / (5 + 5 - 2) = 0.25
In our comparison of Bert vs Berry
, Bert
has 5 Trigrams whereas Berry
has 6. They have 3 Trigrams in common, so their similarity is:
similarity = 3 / (5 + 6 - 3) = 0.375
Based on this calculation, Bert
is more similar to Berry
than it is to Bart
. Let’s quickly double-check these results in Postgres:
> SELECT similarity('Bert', 'Bart'), similarity('Bert', 'Berry');
similarity | similarity
---------- * ----------
0.25 | 0.375
Cool! Our calculations were correct! Now you know exactly how the similarity
operator works (Yey? 😄).
🔗 How Postgres calculates word similarities
The word_similarity
and strict_word_similarity
scores are calculated in a different way than similarity
. Postgres uses only parts of the names to calculate the similarity with the search term. So, if we search for João
and our user’s name is João Miguel Mello Fernandes Pereira dos Reis
, Postgres will only use the first name João
to calculate the similarity to the search term since it is the most similar word in the name.
Let’s have a look the similarity scores when calculating the word
and strict_word
similarity of Bert and Dagobert Duck
:
SELECT strict_word_similarity('Bert', 'Dagobert Duck'),
word_similarity('Bert', 'Dagobert Duck');
strict_word_similarity | word_similarity
---------------------- * ----------
0.28 (rounded) | 0.6
The similarities are quite different! The high-level explanation is that Postgres compares the Trigram sets of Bert
and the sub-part bert
of Dagobert
for word_similarity
whereas it compares the sets of Bert
and Dagobert
for strict_word_similarity
. Let’s have a look at the Trigram sets and how they overlap:
Bert: {" b"," be","ber","ert","rt "}
bert: {"ber","ert","rt "}
Match: - - X X X
Bert: {" b"," be","ber","ert","rt "}
Dabogert: {" d"," da","dag","ago","gob","obe","ber","ert","rt "}
Match: - - - - - - X X X
So, if we calculate the similarities of these two comparisons, we get the same result:
word_similarity = 3 / (5 + 3 - 3) = 0.6
strict_word_similarity = 3 / (5 + 9 - 3) ~ 0.28
These calculations match Postgres’s results exactly! I hope this little deep-dive clears up how Postgres calculates the different similarity values.
🔗 Conclusion
And that’s it! I hope you enjoyed this article! If you have questions or comments, let’s discuss them on Twitter. Follow me on Twitter or subscribe to my newsletter below if you want to get notified when I publish the next blog post. If you found this topic interesting, here are some references for further reading:
- PostgreSQL LIKE query performance variations
- Best index for similarity function
- Waiting for 9.1 – Faster LIKE/ILIKE
- Fast Search Using PostgreSQL Trigram Text Indexes
Until next time. Thanks for reading!