[postgresql] What's the fastest way to do a bulk insert into Postgres?

I need to programmatically insert tens of millions of records into a Postgres database. Presently I am executing thousands of insert statements in a single "query".

Is there a better way to do this, some bulk insert statement I don't know about?


The answer is


It mostly depends on the (other) activity in the database. Operations like this can effectively freeze the database for other sessions while they run. Another consideration is the data model and the presence of constraints, triggers, etc.

My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table. Then I check what can be checked: duplicates, keys that already exist in the target, etc.

Then I just run "insert into target select * from tmp" or similar.

If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc.).
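The staging-table workflow described above could be scripted roughly as follows. This is only a sketch: the function name load_via_staging, the key_column parameter and the duplicate check are illustrative assumptions, and cursor is assumed to be a psycopg2-style cursor offering execute() and copy_expert().

```python
import csv
import io


def load_via_staging(cursor, target, rows, key_column):
    """Load rows into `target` through an empty temp copy of it.

    `target` and `key_column` are interpolated into SQL, so they must be
    trusted identifiers (an assumption of this sketch).
    """
    # Empty temp table with the same columns as the target.
    cursor.execute(f"CREATE TEMP TABLE tmp AS SELECT * FROM {target} WHERE 1=0")

    # Serialize the rows as CSV in memory and stream them in with COPY.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)
    cursor.copy_expert("COPY tmp FROM STDIN WITH (FORMAT csv)", buf)

    # Move across only the rows whose key is not already in the target.
    cursor.execute(
        f"INSERT INTO {target} SELECT * FROM tmp t "
        f"WHERE NOT EXISTS (SELECT 1 FROM {target} x "
        f"WHERE x.{key_column} = t.{key_column})"
    )
    return cursor.rowcount
```

Any further checks (duplicates inside the file, constraint violations) would go between the COPY and the final INSERT.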


I implemented a very fast PostgreSQL data loader with native libpq methods. Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/


I just encountered this issue and would recommend csvsql for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.

$ createdb test 
$ csvsql --db postgresql:///test --insert examples/*.csv

One way to speed things up is to explicitly perform multiple inserts or copies within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom of that guide, which suggests that increasing wal_buffers to 16 MB may also help.
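As a rough illustration of batching commits, the sketch below commits once per batch instead of once per row. It assumes a DB-API connection with autocommit off (the psycopg2 default); the function name insert_in_batches and its parameters are hypothetical.

```python
from itertools import islice


def insert_in_batches(conn, sql, rows, batch_size=1000):
    """Run `sql` once per row, committing once per batch of rows."""
    it = iter(rows)
    total = 0
    cur = conn.cursor()
    while True:
        # Pull the next batch_size rows without materializing the whole input.
        batch = list(islice(it, batch_size))
        if not batch:
            break
        cur.executemany(sql, batch)
        conn.commit()  # one commit per batch instead of one per statement
        total += len(batch)
    cur.close()
    return total
```

A call might look like insert_in_batches(conn, "INSERT INTO tablename (x, y) VALUES (%s, %s)", rows), where tablename and its columns are placeholders.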


An external file is the best and most typical source of bulk data

The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.

Bulk insert with some transformation

In ETL applications and ingestion processes, we need to change the data before inserting it. A temporary table consumes (a lot of) disk space, and it is not the fastest way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.

CSV example. Suppose a table tablename (x, y, z) in SQL and a CSV file like

fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...

You can use the classic SQL COPY to load the original data as is into tmp_tablename, then insert the filtered data into tablename. But, to avoid disk consumption, it is best to ingest directly with

INSERT INTO tablename (x, y, z)
  SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms 
  FROM tmp_tablename_fdw
  -- WHERE conditions
;

You need to prepare the database for FDW, and instead of a static tmp_tablename_fdw you can use a function that generates it:

CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
  ...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv', header 'true');  -- header: skip the column-name line

JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:

INSERT INTO tablename (fname, metadata, content)
 SELECT fname, fmeta, j  -- do any data transformation here
 FROM jsonb_read_files('myRawData%.json')
 -- WHERE any_condition_here
;

where the function jsonb_read_files() reads all files of a folder, defined by a mask:

CREATE or replace FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int,  fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
     SELECT (row_number() OVER ())::int id, 
           f as fname,
           p_fpath ||'/'|| f as f
     FROM pg_ls_dir(p_fpath) t(f)
     WHERE    f like p_flike
  ) SELECT id,  fname,
         to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath',p_fpath),
         pg_read_file(f)::jsonb
    FROM t
$f$  LANGUAGE SQL VOLATILE;  -- reads the filesystem, so it must not be declared IMMUTABLE

Lack of gzip streaming

The most frequent method for "file ingestion" (mainly in Big Data) is to keep the original file in gzip format and transfer it with a streaming algorithm, anything that runs fast and without disk consumption in Unix pipes:

 gunzip -c remote_or_local_file.csv.gz | convert_to_sql | psql 

So the ideal (future) feature would be a server option for the .csv.gz format.
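Until then, one client-side way to get a similar streaming effect is to decompress on the fly and feed the stream to COPY ... FROM STDIN, so that no uncompressed copy is written to disk. The sketch below assumes a psycopg2-style cursor with copy_expert(), a CSV file with a header line, and a trusted table name; copy_gzipped_csv is a hypothetical helper.

```python
import gzip


def copy_gzipped_csv(cursor, table, path):
    """Stream a gzip-compressed CSV file into `table` via COPY FROM STDIN."""
    # `table` is interpolated into SQL, so it must be a trusted identifier.
    sql = f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"
    # gzip.open decompresses lazily as the driver reads chunks from the file.
    with gzip.open(path, "rb") as f:
        cursor.copy_expert(sql, f)
```

Unlike the pipe above, this needs no convert_to_sql step, because COPY parses the CSV itself.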


There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of bind parameters for a prepared statement (the wire protocol's 16-bit parameter count caps this at 65535, and some drivers set a lower limit). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
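When the rows come from a program rather than being typed by hand, the multirow statement can be generated. The following is a minimal sketch, assuming psycopg-style %s placeholders and trusted table and column names; build_multirow_insert is a hypothetical helper, not part of any library.

```python
def build_multirow_insert(table, columns, rows):
    """Return (sql, params) for a single multirow INSERT ... VALUES."""
    rows = [tuple(r) for r in rows]
    if not rows:
        raise ValueError("need at least one row")
    if any(len(r) != len(columns) for r in rows):
        raise ValueError("every row must have one value per column")
    # One "(%s, %s, ...)" group per row, all in the same statement.
    row_tpl = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_tpl] * len(rows))
    )
    # Flatten the rows into one parameter list, in placeholder order.
    params = [value for row in rows for value in row]
    return sql, params
```

The result would be passed to something like cursor.execute(sql, params), in batches small enough to stay under whatever parameter limit applies to the driver in use.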


The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY):

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    UNNEST(ARRAY[1, 2, 3]), 
    UNNEST(ARRAY[100, 200, 300]), 
    UNNEST(ARRAY['a', 'b', 'c'])
);

Without VALUES, using a subselect with an additional existence check:

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
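    -- aliases are required so the outer query can reference temptable.fieldname1
    SELECT UNNEST(ARRAY[1, 2, 3]) AS fieldname1, 
           UNNEST(ARRAY[100, 200, 300]) AS fieldname2, 
           UNNEST(ARRAY['a', 'b', 'c']) AS fieldname3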
) AS temptable
WHERE NOT EXISTS (
    SELECT 1 FROM tablename tt
    WHERE tt.fieldname1=temptable.fieldname1
);

The same syntax works for bulk updates:

UPDATE tablename
SET fieldname1=temptable.data
FROM (
    SELECT UNNEST(ARRAY[1,2]) AS id,
           UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
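From Python, the arrays can be passed as query parameters instead of being written into the SQL. The sketch below assumes a psycopg2-style cursor (which adapts Python lists to Postgres arrays) and a table whose three columns are integer, integer and text; the function name and the column types are illustrative assumptions. It uses the multi-argument form of UNNEST in FROM, which needs PostgreSQL 9.4 or later.

```python
def insert_columns_via_unnest(cursor, rows):
    """Insert (int, int, str) rows by sending one array per column."""
    rows = list(rows)
    if not rows:
        return 0
    # Transpose the row tuples into one list per column.
    col1, col2, col3 = (list(col) for col in zip(*rows))
    cursor.execute(
        "INSERT INTO tablename (fieldname1, fieldname2, fieldname3) "
        "SELECT * FROM UNNEST(%s::int[], %s::int[], %s::text[])",
        (col1, col2, col3),
    )
    return len(rows)
```

The explicit casts tell the server the element types, which would otherwise be ambiguous for empty or all-NULL lists.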

You can use COPY table FROM ... WITH BINARY, which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.

Here is an example recipe in Python, using psycopg2 with binary input.
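As a rough idea of what binary input involves (this is not the recipe referred to above), the sketch below encodes rows in the PostgreSQL binary COPY file format: a fixed signature, two 32-bit header fields, then each tuple as a 16-bit field count followed by length-prefixed values, and a 16-bit -1 trailer. It assumes a two-column table of (integer, text) and a UTF8 database; encode_binary_copy is a hypothetical helper.

```python
import struct

# Fixed 11-byte signature that starts every binary COPY stream.
SIGNATURE = b"PGCOPY\n\xff\r\n\x00"


def encode_binary_copy(rows):
    """Encode (int, str-or-None) rows for COPY ... FROM STDIN (FORMAT binary)."""
    # Header: signature, 32-bit flags (0), 32-bit header extension length (0).
    out = [SIGNATURE, struct.pack("!ii", 0, 0)]
    for number, text in rows:
        out.append(struct.pack("!h", 2))           # two fields in this tuple
        out.append(struct.pack("!ii", 4, number))  # int4: byte length 4, value
        if text is None:
            out.append(struct.pack("!i", -1))      # length -1 marks NULL
        else:
            data = text.encode("utf-8")
            out.append(struct.pack("!i", len(data)) + data)
    out.append(struct.pack("!h", -1))              # file trailer
    return b"".join(out)
```

With psycopg2 the payload could then be sent with cursor.copy_expert("COPY tablename (id, label) FROM STDIN WITH (FORMAT binary)", io.BytesIO(payload)), where tablename, id and label are placeholders. The binary format is strict about types, so the encoded fields must match the column types exactly.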