Analyzing large amounts of data has become an everyday task. Oftentimes, the datasets are only available as CSV files, which raises the question: how do you import them into your Postgres database? The short answer: by using Postgres’ `COPY` command. Here’s the long answer:
Let’s imagine you have an `Ecto.Schema` called `Location` with the following definition:
schema "locations" do
  field :name, :string
  field :latitude, :float
  field :longitude, :float
end
The `locations` table stores location data as latitude and longitude coordinates together with a `name`. For example, this could be a street address with a house number and its geocoded lat+long position.
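For reference, the matching table could have been created with an Ecto migration roughly like the sketch below; the module name is an assumption, not part of the original setup. Whether or not you call `timestamps/0` in it matters for the first limitation discussed further down:

defmodule App.Repo.Migrations.CreateLocations do
  use Ecto.Migration

  def change do
    create table(:locations) do
      add :name, :string
      add :latitude, :float
      add :longitude, :float

      # Optional: adds NOT NULL inserted_at/updated_at columns.
      # If you uncomment this, your CSV needs timestamp values too (see below).
      # timestamps()
    end
  end
end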
Now, let’s imagine you have a CSV file called `locations.csv` with 100,000 rows of location data and you want to import that data into your `locations` table. In the following, we will use Postgres’ `COPY` command for that. First, we will call it directly from a `psql` session and then we will wrap it in a simple `Mix.Task`. Let’s go!
🔗 Most efficient: Postgres’ `COPY`
The fastest way to import your data is Postgres’ `COPY` command. You can use it in two ways: if your CSV file exists on the same file system as your Postgres database, you can ask Postgres to move the data into the database directly. However, this won’t work when you want to load the data into a hosted database, since the file only exists on your local file system. In that case, you can either use `COPY`’s little brother, the `\copy` command, or leverage a `Postgrex` connection to stream the file through your Postgres session. We’ll see how to do that with Elixir later on.
In all cases, Postgres’ `COPY` is mind-bogglingly fast, but it has some limitations:

`inserted_at` and `updated_at` timestamps won’t be auto-generated. They need to exist in your CSV file. Importing a file without these timestamps will fail because of the not-null constraints on the timestamp fields. If you need timestamps but can’t add them to the file, consider setting the timestamp fields after you imported the data, for example with:
now = NaiveDateTime.utc_now()
Repo.update_all(Location, set: [inserted_at: now, updated_at: now])
Importing complex associations is tricky. If your data has associations with other tables and you need to validate that e.g. an association on the other table exists, you have to do so manually after the data is imported. Postgres’ `COPY` command works best for “dumb” data that doesn’t depend on other tables.
Now, with these limitations in mind, let’s have a look at how you can use Postgres’ `COPY` command.
🔗 `COPY`-ing without Elixir
First, double-check that your CSV file has the same column order as your database. Basically, your CSV should order the fields the same way your `Ecto.Schema` defines them. In our case, the CSV needs to order the data as `name,latitude,longitude`.
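For illustration, the first few lines of such a file could look like this (the names and coordinates are made up):

name,latitude,longitude
Main Street 1,51.234567,6.789012
Station Square 5,51.245678,6.790123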
Next, open a `psql` session to your database. If you have Postgres running locally, use this command:
psql -h localhost -p 5432 -U postgres -d locations_dev
Make sure to replace the parameters above with your own. Pay special attention to the database name behind the `-d` flag. It should equal the name of the database that you want to import the CSV into.
Next, let’s import the CSV file with the following command:
locations_dev=# COPY locations(name,latitude,longitude)
FROM './relative-or-absolute-path-to-your-file/data.csv'
WITH CSV HEADER DELIMITER ',';
COPY 69638
This command took only a second to run and imported almost 70k rows of location data! Amazing!
When using the `COPY` command, Postgres moves the data through your file system into the database. This is basic file copying and is therefore very fast. If you are connected to a hosted database, this might take a bit longer because you first have to upload the file to the database server.
We used a few options in our command:

- The `CSV` option tells Postgres that we’re importing a CSV file. You could also import `text` or `binary` data.
- The `HEADER` option tells Postgres that our file has a header row which it should skip. If your file doesn’t have a header row, remove this option.
- The `DELIMITER` option informs Postgres about which character separates the values in our CSV file. Usually, this is a `,`, but sometimes it might be a `;` or something even fancier. If your file is comma-separated (`,`), you can remove this option.
Here are a few troubleshooting tips if this command didn’t work for you:

- The file can’t be found. You probably run Postgres in a Docker container or are connected to a hosted Postgres instance. In that case, your file doesn’t exist on the same file system as your database and you first need to get the file to the Postgres instance. Postgres will do this for you if you replace `COPY` with `\copy` (see the example right after this list).
- My file doesn’t have headers. Easy. Simply remove the `HEADER` word from the command.
- My file uses ‘;’ instead of ‘,’. Also easy. Simply replace `','` with `';'` behind `DELIMITER` in the command.
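As a sketch of that first tip: the `\copy` meta-command runs the same statement, but `psql` reads the file on the client side and streams it to the server, so it also works against a hosted database. The options below simply mirror the `COPY` command from above, and the path is just a placeholder:

locations_dev=# \copy locations(name,latitude,longitude) FROM './relative-or-absolute-path-to-your-file/data.csv' WITH CSV HEADER DELIMITER ','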
🔗 `COPY`-ing with Elixir
As you saw above, you can execute the `COPY` command directly in a `psql` session against your database. However, let’s see how we can build an Elixir wrapper around this command instead. That way, you can easily reuse the database configuration of your Phoenix application’s `Ecto.Repo`.
Let’s create a file at `app/lib/mix/tasks/import_data.ex` and fill it with a `Mix.Task` like this:
defmodule Mix.Tasks.ImportData do
  use Mix.Task

  # Mix passes the command line arguments as a list of strings,
  # so we pattern match the single path argument out of it.
  def run([path]) do
    # Start the application so that the Repo configuration is available.
    Mix.Task.run("app.start", [])

    # Reuse the Repo's database configuration for a dedicated Postgrex connection.
    opts = App.Repo.config()
    {:ok, pid} = Postgrex.start_link(opts)

    Postgrex.transaction(
      pid,
      fn conn ->
        # COPY ... FROM STDIN gives us a collectable stream
        # that we can pipe the CSV file into.
        stream =
          Postgrex.stream(
            conn,
            "COPY locations(name,latitude,longitude) FROM STDIN CSV HEADER DELIMITER ','",
            []
          )

        # Stream the file into the COPY stream.
        # :trim_bom removes a leading byte order mark, if present.
        Enum.into(File.stream!(path, [:trim_bom]), stream)
      end,
      # Don't let long imports hit the default 15 second timeout.
      timeout: :infinity
    )
  end
end
The `Mix.Task` receives the path to the CSV file and uses a `Postgrex` connection to execute the `COPY` command. It creates a new connection to Postgres using `Postgrex.start_link/1`, reusing your application’s repo configuration, and streams the CSV file through that connection using `File.stream!/2`. We set the `timeout` to `:infinity` because otherwise the import transaction might time out if it takes longer than 15 seconds, which is the default timeout. When we create the `File.stream!/2`, we use the `:trim_bom` option. It instructs the file reader to remove a leading byte order mark. This magic number carries metadata about the file’s encoding and byte order, which we don’t want to import into our database. That’s why we remove it while streaming the data.
Now, we can almost import the data. As a last step, make sure that your `Ecto.Repo` connection is properly configured in your application. This can look like the following:
# config/dev.exs
config :app, App.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "locations_dev",
  pool_size: 10
Or, if you want to connect using a URL instead:
# config/runtime.exs
config :app, App.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 10
Once you’ve configured your `Ecto.Repo` connection, let’s import the data with:
mix import_data ./relative-or-absolute-path-to-your-file/data.csv
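If you want a quick sanity check that the rows actually arrived, you could count them in an `iex -S mix` session; this assumes your schema module is called `Location` as above, and the result should match the row count that `COPY` reported:

App.Repo.aggregate(Location, :count)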
And that’s it! Now you can import large CSVs directly into your database using a `Mix.Task` in just a matter of seconds! Amazing! :)