[python] mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don't want to have duplicate documents.
  • I don't want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for document in update:
    existing_document = collection.find_one(document)
    if not existing_document:
        document['insertion_date'] = now
    else:
        document = existing_document
    document['last_update_date'] = now
    collection.save(document)

My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update). I am pretty sure there is something built-in for doing this, but the documentation for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )

Can someone advise how to do it faster?

Tags: python, mongodb, bulkinsert, mongodb-query

Answers


1. Use Update.

Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.

NOTE: This method overwrites the entire document when a match is found (from the docs).

var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 }},
    options = { multi: true };

Model.update(conditions, update, options, callback);

function callback (err, numAffected) {
  // numAffected is the number of updated documents
}

1.a. Use $set

If you want to update only part of the document rather than replacing the whole thing, you can use the $set operator with update (again, from the docs). So, if you want to set the name, instead of sending...

var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)

send it as...

Model.update(query, { $set: { name: 'jason borne' }}, options, callback)

This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.


In general, using update with upsert is better in MongoDB, as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your Python driver.
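
For what it's worth, here is a minimal pymongo sketch of that idea (a sketch only: it assumes pymongo 3+, a hypothetical test.people collection, and a 'name' field to match on; the legacy pre-3.0 driver spells the call collection.update(query, doc, upsert=True) instead):

from pymongo import MongoClient

collection = MongoClient().test.people  # hypothetical database and collection names

# upsert=True inserts a new document when nothing matches the filter;
# otherwise only the fields listed under $set are changed on the existing one
collection.update_one(
    {"name": "borne"},
    {"$set": {"name": "jason borne"}},
    upsert=True,
)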

Second, if you only need to know whether or not the document exists, count(), which returns only a number, will be a better option than find_one(), which transfers the whole document from MongoDB and causes unnecessary traffic.
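
For example, reusing the collection handle from the sketch above (count_documents() is the pymongo 3.7+ spelling; older drivers used collection.find(query).count()):

# cheap existence check: the server returns a number capped at 1
exists = collection.count_documents({"name": "borne"}, limit=1) > 0

# versus transferring the whole document just to test for None
exists = collection.find_one({"name": "borne"}) is not None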


Summary

  • You have an existing collection of records.
  • You have a set of records that contain updates to the existing records.
  • Some of the updates don't really update anything, they duplicate what you have already.
  • All updates contain the same fields that are there already, just possibly different values.
  • You want to track when a record was last changed, that is, when a value actually changed.

Note: I'm presuming PyMongo; change to suit your language of choice.

Instructions:

  1. Create the collection with an index with unique=true so you don't get duplicate records.

  2. Iterate over your input records, creating batches of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag set to true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock; I've gotten 15k/second performance levels. For further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/ (and see the pymongo sketch after this list).

    Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.

  3. Iterate over all your input records again, creating batches of 15K or so. Extract the keys (best if there's one key, but it can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field: { $in: [1, 2, 3, ...] } }) query. For each of these records, determine whether there's an update, and if so, issue the update, including updating the 'updated' timestamp.

    Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
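
A rough pymongo sketch of the whole flow above (a sketch only: it assumes pymongo 3+, where ordered=False plays the role of ContinueOnError and create_index() replaces the older ensure_index(); the field names 'key', 'created' and 'updated' are placeholders):

from datetime import datetime

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

collection = MongoClient().mydb.stock          # hypothetical database/collection names
collection.create_index("key", unique=True)    # step 1: unique index, no duplicates

update_records = [{"key": 1, "price": 10.0}]   # stand-in for the day's input documents

def chunks(records, size=15000):
    """Yield successive batches of `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

now = datetime.utcnow()

# Step 2: blind batch inserts; duplicates are rejected by the unique index,
# and ordered=False makes the server keep inserting the rest of the batch.
for batch in chunks(update_records):
    docs = [dict(record, created=now, updated=now) for record in batch]
    try:
        collection.insert_many(docs, ordered=False)
    except BulkWriteError:
        pass  # duplicate-key errors are expected for records that already exist

# Step 3: batch retrieval with $in, then individual updates only where a value changed.
for batch in chunks(update_records):
    by_key = {record["key"]: record for record in batch}
    for existing in collection.find({"key": {"$in": list(by_key)}}):
        incoming = by_key[existing["key"]]
        changed = {k: v for k, v in incoming.items() if existing.get(k) != v}
        if changed:
            changed["updated"] = now
            collection.update_one({"key": existing["key"]}, {"$set": changed})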

Key Optimization Points:

  • Bulk inserts will vastly speed up your operations.
  • Retrieving records en masse will speed things up, too.
  • Individual updates are the only possible route for now, but 10gen is working on it. Presumably this will land in 2.6, though I'm not sure it will be finished by then; there's a lot of stuff to do (I've been following their Jira system).

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:

update( { "name": "abc" }, 
        { $set: { "created": "2010-07-14 11:11:11", 
                  "updated": "2010-07-14 11:11:11" }},
        true, true ) 

Scenario #1 - document with 'name' of 'abc' does not exist: New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.

Scenario #2 - document with 'name' of 'abc' already exists with the following: 'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10. After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set if inserting, and which fields should be left alone if updating.

My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.
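
In pymongo terms, that workaround might look something like this (a sketch; it assumes pymongo 3+ and hypothetical database/collection names, with a unique index on 'name'):

from datetime import datetime

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient().mydb.things       # hypothetical database/collection names
collection.create_index("name", unique=True)

now = datetime.utcnow()

try:
    # only succeeds the first time; afterwards the unique index rejects it,
    # so 'created' is never touched again
    collection.insert_one({"name": "abc", "created": now})
except DuplicateKeyError:
    pass

# always bump 'updated', whether or not the insert above happened
collection.update_one({"name": "abc"}, {"$set": {"updated": now}})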


You may use upsert with the $setOnInsert operator.

db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:

> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13})      // This works
> db.getCollection("test").insert({a:1, b:12, c:13})      // This fails
E11000 duplicate key error index: foo.test.$a_1  dup key: { : 1.0 }

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set 'insertion_date' using $setOnInsert and 'last_update_date' using $set in your upsert command.

To turn your pseudocode into a working example:

from datetime import datetime

now = datetime.utcnow()
for document in update:
    # everything except _id goes into $setOnInsert, so the full document (plus
    # insertion_date) is written only on first insert and left alone afterwards
    fields = {k: v for k, v in document.items() if k != "_id"}
    collection.update_one(
        {"_id": document["_id"]},
        {
            "$setOnInsert": dict(fields, insertion_date=now),
            "$set": {"last_update_date": now},
        },
        upsert=True,
    )
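
If your driver and server are recent enough, those per-document round trips can also be batched with bulk_write (a sketch reusing update and collection from the example above, and assuming pymongo 3+; this was not available in the older drivers discussed earlier):

from datetime import datetime

from pymongo import UpdateOne

now = datetime.utcnow()
ops = []
for document in update:
    fields = {k: v for k, v in document.items() if k != "_id"}
    ops.append(
        UpdateOne(
            {"_id": document["_id"]},
            {
                "$setOnInsert": dict(fields, insertion_date=now),
                "$set": {"last_update_date": now},
            },
            upsert=True,
        )
    )

# one round trip per batch instead of one per document;
# ordered=False lets the server continue past individual failures
if ops:
    collection.bulk_write(ops, ordered=False)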
