I am testing Postgres insertion performance. I have a table with one column whose data type is number, and there is an index on it as well. I filled the table using this query:
insert into aNumber (id) values (564),(43536),(34560) ...
I inserted 4 million rows very quickly, 10,000 at a time, with the query above. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Is there any trick to increase insertion performance? I need optimal insertion performance on this project.
Using Windows 7 Pro on a machine with 5 GB RAM.
This question is related to: sql, postgresql, bulkinsert, sql-insert.
In addition to Craig Ringer's excellent post and depesz's blog post, if you would like to speed up your inserts through the ODBC (psqlodbc) interface by using prepared-statement inserts inside a transaction, there are a few extra things you need to do to make it work fast:
1. Set Protocol=-1 in the connection string. By default psqlodbc uses "Statement" level, which creates a SAVEPOINT for each statement rather than for an entire transaction, making inserts slower.
2. Set UseServerSidePrepare=1 in the connection string. Without this option the client sends the entire insert statement along with each row being inserted.
3. Disable auto-commit on each statement using SQLSetConnectAttr(conn, SQL_ATTR_AUTOCOMMIT, reinterpret_cast<SQLPOINTER>(SQL_AUTOCOMMIT_OFF), 0);
4. Once all rows have been inserted, commit the transaction using SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT);. There is no need to explicitly open a transaction.
Unfortunately, psqlodbc "implements" SQLBulkOperations by issuing a series of unprepared insert statements, so to achieve the fastest insert one needs to code up the above steps manually.
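As a rough illustration only (the answer itself uses the raw ODBC C API; pyodbc, the DSN name "pg", and the sample rows below are my assumptions), the same recipe might look like this from Python:
# Sketch: prepared-statement inserts inside a single transaction through psqlodbc.
import pyodbc

rows = [(i,) for i in range(10000)]  # sample data; replace with real values

conn = pyodbc.connect(
    "DSN=pg;Protocol=-1;UseServerSidePrepare=1",  # the two connection-string options above
    autocommit=False,                             # same effect as SQL_AUTOCOMMIT_OFF
)
cur = conn.cursor()
# One prepared INSERT is reused for every parameter tuple inside one transaction.
cur.executemany("INSERT INTO aNumber (id) VALUES (?)", rows)
conn.commit()  # same effect as SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT)
conn.close()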
If you happen to insert columns with UUIDs (which is not exactly your case) and to add to @Dennis' answer (I can't comment yet), be advised that using gen_random_uuid() (requires PG 9.4 and the pgcrypto module) is (a lot) faster than uuid_generate_v4():
=# explain analyze select uuid_generate_v4(),* from generate_series(1,10000);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) (actual time=11.674..10304.959 rows=10000 loops=1)
Planning time: 0.157 ms
Execution time: 13353.098 ms
(3 rows)
vs
=# explain analyze select gen_random_uuid(),* from generate_series(1,10000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) (actual time=252.274..418.137 rows=10000 loops=1)
Planning time: 0.064 ms
Execution time: 503.818 ms
(3 rows)
Also, this is the officially suggested way to do it:
Note
If you only need randomly-generated (version 4) UUIDs, consider using the gen_random_uuid() function from the pgcrypto module instead.
This dropped insert time from ~2 hours to ~10 minutes for 3.7M rows.
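As a small illustration of that note (a sketch only; the table name, column names, and connection string below are made up), the server can generate the UUIDs itself so the client never has to ship them:
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto")  # provides gen_random_uuid() (PG 9.4+)
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        payload text
    )
""")
# Only the payload is sent; the UUIDs are generated server-side by the faster function.
cur.executemany("INSERT INTO items (payload) VALUES (%s)", [("a",), ("b",), ("c",)])
conn.commit()
conn.close()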
I spent around 6 hours on the same issue today. Inserts went at a 'regular' speed (less than 3 sec per 100K) up until 5 million (out of 30 million total) rows, and then performance sank drastically (all the way down to 1 min per 100K).
I will not list all of the things that did not work; I'll cut straight to the meat.
I dropped the primary key on the target table (which was a GUID), and my 30 million rows happily flowed to their destination at a constant speed of less than 3 sec per 100K.
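A rough sketch of that pattern with psycopg2 (the table, constraint, and column names are placeholders, not the ones from my job):
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()
# Drop the (GUID) primary key so rows are appended without index maintenance.
cur.execute("ALTER TABLE target DROP CONSTRAINT IF EXISTS target_pkey")
# ... bulk-load the rows here (COPY or multi-row INSERTs) ...
# Re-add the primary key once, after the load, so the index is built in a single pass.
cur.execute("ALTER TABLE target ADD PRIMARY KEY (id)")
conn.commit()
conn.close()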
Use COPY table FROM ... WITH BINARY
which, according to the documentation, is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
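For instance, a minimal loading-side sketch with psycopg2 (the file path and connection string are placeholders, and it assumes the file was produced earlier with COPY aNumber TO ... WITH BINARY):
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()
# Stream a binary-format dump straight into the table.
with open("/tmp/anumber.bin", "rb") as f:
    cur.copy_expert("COPY aNumber FROM STDIN WITH (FORMAT binary)", f)
conn.commit()
conn.close()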
For optimal insertion performance, disable the index if that's an option for you. Other than that, better hardware (disk, memory) is also helpful.
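For example (a sketch only; the index and table names are placeholders): drop the index before the load and recreate it afterwards, so the index is built once rather than maintained row by row:
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()
cur.execute("DROP INDEX IF EXISTS anumber_id_idx")          # no index maintenance during the load
# ... run the bulk INSERT / COPY here ...
cur.execute("CREATE INDEX anumber_id_idx ON aNumber (id)")   # rebuild it in one pass at the end
conn.commit()
conn.close()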
I encountered this insertion performance problem as well. My solution was to spawn some goroutines to do the insertion work. In the meantime, SetMaxOpenConns
should be given a proper number, otherwise a "too many open connections" error would be raised.
// Requires database/sql, fmt, sync, and a Postgres driver such as github.com/lib/pq.
db, err := sql.Open("postgres", dataSourceName)
if err != nil {
    fmt.Println(err)
    return
}
// Cap the number of open connections so the server does not refuse new ones.
db.SetMaxOpenConns(someConfiguredLimit)

var wg sync.WaitGroup
for _, query := range queries {
    wg.Add(1)
    go func(msg string) {
        defer wg.Done()
        if _, err := db.Exec(msg); err != nil {
            fmt.Println(err)
        }
    }(query)
}
wg.Wait()
The loading speed was much faster for my project. This code snippet just gives an idea of how it works; readers should be able to modify it easily.
Source: Stackoverflow.com