Hey! I recorded two video courses!
If you like this article, you will also like the courses! Check them out here!
Hey folks, welcome back to another blog post!
Did you know that:
When you go into the bathroom, you're American.
When you come out of the bathroom, you're American.
But what are you IN the bathroom?
European!
Anyways, today’s blog post touches on a topic that I’ve written about plenty already. It’s about efficient name search in Postgres, but with a twist!
🔗 The Problem
In previous articles, I wrote about how to do an efficient name search in Postgres using GIN
indexes. These work great, but once your user base becomes international, you have to deal with all the weird things other cultures add to their words. Like adding little ticks, arrows, dots, and waves above them, like José
, être
, Blömschen
, and São
.
The problem with these special symbols is that they break your text search. For example, if I want to search for José
in my application, I might only type Jose
(without the accent). But Postgres treats é
and e
differently and won’t return a user called José
. For Postgres, Jose
and José
are not the same.
This is bad UX and not what we want. We want to return José
even though the search term was Jose
, so what can we do?
🔗 The Solution
Luckily, Postgres offers a function called unaccent for exactly this situation. It allows Postgres to treat words with and without accent equally (e.g. é
and e
). Let’s see how we can use it in Ecto.
🔗 The Setup
This is our search function without unaccent
:
def search(name) do
query =
from(
p in Person,
where: ilike(p.name, ^search_term)
)
Repo.all(query)
end
This function won’t return a person with the name José
if we search for Jose
. Likewise, it won’t return a person with the name Jose
(no accent) if we search for José
(with an accent). Let’s fix the function.
🔗 The Migration
First, we need to add a new extension to Postgres. Add the following to a migration of your choosing:
def change do
execute(
"CREATE EXTENSION IF NOT EXISTS unaccent;",
"DROP EXTENSION IF EXISTS unaccent;"
)
end
If you run the migration, it will add the unaccent
extension to your Postgres instance. From now on, we can use the function unaccent
in our query.
🔗 The slow Solution
Now, let’s update our query function from above. We simply have to wrap the person’s name and the search term with unaccent
and Postgres will take care of the rest. This is how our search function looks now:
def search(name) do
query =
from(
p in Person,
where: fragment(
"unaccent(?) ILIKE unaccent(?)",
p.name,
^search_term
)
)
Repo.all(query)
end
From now on, our search/1
function will also return a person called José
when we search for Jose
, jose
, José
, or josé
. It will also return a person called Jose
(unaccented) if we search for Josè
(accented). Great success!
🔗 The Fast Solution
Sadly, there is one problem with our solution above: It’s slooooooooow 🐌.
If you have read my post about efficient name search, you know that we can speed up our query significantly by adding a GIN
index to it. Let’s add a GIN index but for the unaccented version of the names in our database.
First, we need to create an immutable version of Postgres’ unaccent
function. If you use a function in a Postgres index, it must be IMMUTABLE. That means that the function cannot alter the database and will always return the same value given the same arguments.
The unaccent
function is not IMMUTABLE
, but SOLID
. That means that it can’t change the database and will always return the same result given the same arguments, but its result might change across SQL statements. Imagine that I want to fetch the current time. That function would rely on the timezone
of my database. If I fetch the time once, change my database timezone, and fetch the time again, the results will change. The same goes for the unaccent
function because it depends on the dictionary that I use in my database. If I change my dictionary, the result of unaccent
might change, and therefore the function is only SOLID
and not IMMUTABLE
.
That means that we can’t use unaccent
directly when creating the index. However, we can create an IMMUTABLE
version of the function if we’re certain to never change the dictionary of our database (which you probably won’t anyways). Let’s do that now.
In our migration after your CREATE EXTENSION ...
call, add this:
execute(
"""
CREATE OR REPLACE FUNCTION f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT public.unaccent('public.unaccent', $1)
$func$;
""",
"DROP FUNCTION IF EXISTS f_unaccent(text);"
)
We have now an immutable function called f_unaccent
which we can use to generate a GIN
index of our person.name
field. Add this after the snippet above:
execute(
"CREATE EXTENSION IF NOT EXISTS pg_trgm;",
"DROP EXTENSION IF EXISTS pg_trgm;"
)
execute(
"""
CREATE INDEX persons_name_unaccented_gin_trgm_idx
ON persons
USING gin (f_unaccent(name) gin_trgm_ops)
""",
"DROP INDEX IF EXISTS persons_name_unaccented_gin_trgm_idx;"
)
The first statement will add the Postgres Trigram extension to your Postgres database. We need this extension for creating the GIN
index in the second statement.
The second statement creates a GIN
index for the unaccented value in our person.name
field. That means that our search/1
query above will now use a much faster GIN
index for searching for the correct person.
As a last step, we need to update our search/1
function to use the new f_unaccent(text)
function instead of the built-in unaccent
function. This is the updated version of our function:
def search(name) do
query =
from(
p in Person,
where: fragment(
"f_unaccent(?) ILIKE f_unaccent(?)",
p.name,
^search_term
)
)
Repo.all(query)
end
If you want to double-check that your function uses the new index, add this code to the search/1
function:
def search(name) do
query = # build your query
# Add this 👇
Ecto.Adapters.SQL.explain(Repo, :all, query) |> IO.inspect()
Repo.all(query)
end
When you run the search/1
function, you will see something like this in your terminal:
Bitmap Heap Scan on persons p0
Recheck Cond: (f_unaccent((name)::text) ~~* '%jose%'::text)
-> Bitmap Index Scan on persons_name_unaccented_gin_trgm_idx
Index Cond: (f_unaccent((name)::text) ~~* '%jose%'::text)
If you see this, congratulations! Your search function now uses the GIN
index for searching for the unaccented version of the name.
🔗 The Caveats
As always, there are caveats to this approach. If you’re interested into the nitty-gritty details of how this works and what other solutions exist, I can recommend the following resources:
- Does PostgreSQL support “accent insensitive” collations?
- Using ILIKE with unaccent and with only right end wildcard
- Putting unaccent in existing query
🔗 Conclusion
And that’s it! I hope you enjoyed this article! If you want to support me, you can buy my book or video course. Follow me on Twitter or subscribe to my newsletter below if you want to get notified when I publish the next blog post. Until the next time! Cheerio 👋