[python] mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don't want to have duplicate documents.
  • I don't want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for document in update:
    existing_document = collection.find_one(document)
    if not existing_document:
        document['insertion_date'] = now
    else:
        document = existing_document
    document['last_update_date'] = now
    collection.save(document)

My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update). I am pretty sure there is something built-in for doing this, but the documentation for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )

Can someone advise how to do it faster?

Tags: python, mongodb, bulkinsert, mongodb-query

Answers


1. Use Update.

Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.

NOTE: This method overwrites the entire document when a match is found (from the docs).

var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 }},
    options = { multi: true };

Model.update(conditions, update, options, callback);

function callback (err, numAffected) {
  // numAffected is the number of updated documents
}

1.a. Use $set

If you want to update only part of the document rather than replacing the whole thing, you can use the $set operator with update (again, from the docs). So, if you want to set the name, instead of sending...

var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)

send it as...

Model.update(query, { $set: { name: 'jason borne' }}, options, callback)

This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.


In general, using update with upsert is better in MongoDB, as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your Python driver.
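
For what it's worth, here is a minimal pymongo sketch of that idea (a sketch only: it assumes pymongo 3+, a hypothetical test.people collection, and a 'name' field to match on; the legacy pre-3.0 driver spells the call collection.update(query, doc, upsert=True) instead):

from pymongo import MongoClient

collection = MongoClient().test.people  # hypothetical database and collection names

# upsert=True inserts a new document when nothing matches the filter;
# otherwise only the fields listed under $set are changed on the existing one
collection.update_one(
    {"name": "borne"},
    {"$set": {"name": "jason borne"}},
    upsert=True,
)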

Second, if you only need to know whether or not the document exists, count(), which returns only a number, will be a better option than find_one(), which transfers the whole document from MongoDB and causes unnecessary traffic.
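
For example, reusing the collection handle from the sketch above (count_documents() is the pymongo 3.7+ spelling; older drivers used collection.find(query).count()):

# cheap existence check: the server returns a number capped at 1
exists = collection.count_documents({"name": "borne"}, limit=1) > 0

# versus transferring the whole document just to test for None
exists = collection.find_one({"name": "borne"}) is not None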


Summary

  • You have an existing collection of records.
  • You have a set of records that contain updates to the existing records.
  • Some of the updates don't really update anything, they duplicate what you have already.
  • All updates contain the same fields that are there already, just possibly different values.
  • You want to track when a record was last changed, that is, when a value actually changed.

Note: I'm presuming PyMongo; change to suit your language of choice.

Instructions:

  1. Create the collection with an index with unique=true so you don't get duplicate records.

  2. Iterate over your input records, creating batches of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag set to true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock; I've gotten 15k/second performance levels. For further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/ (and see the pymongo sketch after this list).

    Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.

  3. Iterate over all your input records again, creating batches of 15K or so. Extract the keys (best if there's one key, but it can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field: { $in: [1, 2, 3, ...] } }) query. For each of these records, determine whether there's an update, and if so, issue the update, including updating the 'updated' timestamp.

    Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
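
A rough pymongo sketch of the whole flow above (a sketch only: it assumes pymongo 3+, where ordered=False plays the role of ContinueOnError and create_index() replaces the older ensure_index(); the field names 'key', 'created' and 'updated' are placeholders):

from datetime import datetime

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

collection = MongoClient().mydb.stock          # hypothetical database/collection names
collection.create_index("key", unique=True)    # step 1: unique index, no duplicates

update_records = [{"key": 1, "price": 10.0}]   # stand-in for the day's input documents

def chunks(records, size=15000):
    """Yield successive batches of `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

now = datetime.utcnow()

# Step 2: blind batch inserts; duplicates are rejected by the unique index,
# and ordered=False makes the server keep inserting the rest of the batch.
for batch in chunks(update_records):
    docs = [dict(record, created=now, updated=now) for record in batch]
    try:
        collection.insert_many(docs, ordered=False)
    except BulkWriteError:
        pass  # duplicate-key errors are expected for records that already exist

# Step 3: batch retrieval with $in, then individual updates only where a value changed.
for batch in chunks(update_records):
    by_key = {record["key"]: record for record in batch}
    for existing in collection.find({"key": {"$in": list(by_key)}}):
        incoming = by_key[existing["key"]]
        changed = {k: v for k, v in incoming.items() if existing.get(k) != v}
        if changed:
            changed["updated"] = now
            collection.update_one({"key": existing["key"]}, {"$set": changed})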

Key Optimization Points:

  • Bulk inserts will vastly speed up your operations.
  • Retrieving records en masse will speed things up, too.
  • Individual updates are the only possible route for now, but 10gen is working on it. Presumably this will land in 2.6, though I'm not sure it will be finished by then; there's a lot of stuff to do (I've been following their Jira system).

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:

update( { "name": "abc" }, 
        { $set: { "created": "2010-07-14 11:11:11", 
                  "updated": "2010-07-14 11:11:11" }},
        true, true ) 

Scenario #1 - document with 'name' of 'abc' does not exist: New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.

Scenario #2 - document with 'name' of 'abc' already exists with the following: 'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10. After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set if inserting, and which fields should be left alone if updating.

My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.
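
In pymongo terms, that workaround might look something like this (a sketch; it assumes pymongo 3+ and hypothetical database/collection names, with a unique index on 'name'):

from datetime import datetime

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient().mydb.things       # hypothetical database/collection names
collection.create_index("name", unique=True)

now = datetime.utcnow()

try:
    # only succeeds the first time; afterwards the unique index rejects it,
    # so 'created' is never touched again
    collection.insert_one({"name": "abc", "created": now})
except DuplicateKeyError:
    pass

# always bump 'updated', whether or not the insert above happened
collection.update_one({"name": "abc"}, {"$set": {"updated": now}})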


You may use upsert with the $setOnInsert operator.

db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:

> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13})      // This works
> db.getCollection("test").insert({a:1, b:12, c:13})      // This fails
E11000 duplicate key error index: foo.test.$a_1  dup key: { : 1.0 }

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set 'insertion_date' using $setOnInsert and 'last_update_date' using $set in your upsert command.

To turn your pseudocode into a working example:

from datetime import datetime

now = datetime.utcnow()
for document in update:
    # everything except _id goes into $setOnInsert, so the full document (plus
    # insertion_date) is written only on first insert and left alone afterwards
    fields = {k: v for k, v in document.items() if k != "_id"}
    collection.update_one(
        {"_id": document["_id"]},
        {
            "$setOnInsert": dict(fields, insertion_date=now),
            "$set": {"last_update_date": now},
        },
        upsert=True,
    )
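
If your driver and server are recent enough, those per-document round trips can also be batched with bulk_write (a sketch reusing update and collection from the example above, and assuming pymongo 3+; this was not available in the older drivers discussed earlier):

from datetime import datetime

from pymongo import UpdateOne

now = datetime.utcnow()
ops = []
for document in update:
    fields = {k: v for k, v in document.items() if k != "_id"}
    ops.append(
        UpdateOne(
            {"_id": document["_id"]},
            {
                "$setOnInsert": dict(fields, insertion_date=now),
                "$set": {"last_update_date": now},
            },
            upsert=True,
        )
    )

# one round trip per batch instead of one per document;
# ordered=False lets the server continue past individual failures
if ops:
    collection.bulk_write(ops, ordered=False)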
