[amazon-web-services] Querying DynamoDB by date

I'm coming from a relational database background and trying to work with Amazon's DynamoDB.

I have a table with a hash key "DataID" and a range "CreatedAt" and a bunch of items in it.

I'm trying to get all the items that were created after a specific date, sorted by date, which is pretty straightforward in a relational database.

In DynamoDB the closest thing I could find is a query using the range key with a greater-than condition. The only issue is that to perform a query I need a hash key, which defeats the purpose.

So what am I doing wrong? Is my table schema wrong? Shouldn't the hash key be unique? Or is there another way to query?

This question is related to amazon-web-services nosql amazon-dynamodb

The answer is

You can have multiple identical hash keys; but only if you have a range key that varies. Think of it like file formats; you can have 2 files with the same name in the same folder as long as their format is different. If their format is the same, their name must be different. The same concept applies to DynamoDB's hash/range keys; just think of the hash as the name and the range as the format.
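To make the name/format analogy concrete, here is a tiny in-memory sketch. It is not DynamoDB itself, just a Map keyed on the (hash, range) pair, and the item attributes other than DataID and CreatedAt are made up for illustration.

```javascript
// In-memory stand-in for a table with a hash key and a range key.
// Illustration only; this is not how DynamoDB stores data.
const table = new Map();

function putItem(item) {
  // The primary key is the (DataID, CreatedAt) pair, not DataID alone.
  const key = JSON.stringify([item.DataID, item.CreatedAt]);
  // Writing the same pair again replaces the item, as PutItem does by default.
  table.set(key, item);
}

putItem({ DataID: "a", CreatedAt: 1000, note: "first" });
putItem({ DataID: "a", CreatedAt: 2000, note: "same hash, new range" });
putItem({ DataID: "a", CreatedAt: 2000, note: "same pair, replaces" });

const itemCount = table.size; // two distinct (hash, range) pairs remain
```

The second write is kept because its range value differs; the third replaces the second because both parts of the key match.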

Also, I don't recall if they had these at the time of the OP (I don't believe they did), but they now offer Local Secondary Indexes.

My understanding of these is that it should now allow you to perform the desired queries without having to do a full scan. The downside is that these indexes have to be specified at table creation, and also (I believe) cannot be blank when creating an item. In addition, they require additional throughput (though typically not as much as a scan) and storage, so it's not a perfect solution, but a viable alternative, for some.

I do still recommend Mike Brant's answer as the preferred method of using DynamoDB, though; and use that method myself. In my case, I just have a central table with only a hash key as my ID, then secondary tables that have a hash and range that can be queried, then the item points the code to the central table's "item of interest", directly.

Additional data regarding the secondary indexes can be found in Amazon's DynamoDB documentation here for those interested.

Anyway, hopefully this will help anyone else that happens upon this thread.

Updated Answer: There is no convenient way to do this using DynamoDB queries with predictable throughput. One (suboptimal) option is to use a GSI with an artificial HashKey and CreatedAt as the range key. Then query by HashKey alone and set ScanIndexForward to order the results. If you can come up with a natural HashKey (say, the category of the item), then this method is a winner. On the other hand, if you keep the same HashKey for all items, it will affect throughput, mostly when your data set grows beyond 10 GB (one partition).

Original Answer: You can do this now in DynamoDB by using a GSI. Define a GSI on the "CreatedAt" field and issue queries like (GT some_date). Store the date as a number (msecs since epoch) for this kind of query.
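A rough sketch of what the artificial-HashKey query could look like follows. The table name "Items", the index name "Bucket-CreatedAt-index", the attribute "Bucket" and its constant value "ALL" are all assumptions for illustration, not names from the question. The function only builds the request parameters; sending them is left to whatever client you use.

```javascript
// Hypothetical names: table "Items", index "Bucket-CreatedAt-index",
// and an attribute "Bucket" holding the artificial hash key value "ALL".
function buildRecentItemsQuery(sinceMs, newestFirst) {
  return {
    TableName: "Items",
    IndexName: "Bucket-CreatedAt-index",
    // Equality on the hash key, range condition on CreatedAt.
    KeyConditionExpression: "#b = :bucket AND CreatedAt > :since",
    // Alias the attribute name in case it collides with a reserved word.
    ExpressionAttributeNames: { "#b": "Bucket" },
    ExpressionAttributeValues: {
      ":bucket": { S: "ALL" },
      // The low-level API sends numbers as strings.
      ":since": { N: String(sinceMs) },
    },
    // true = ascending by range key, false = descending.
    ScanIndexForward: !newestFirst,
  };
}

const recentParams = buildRecentItemsQuery(1500000000000, true);
```

The resulting object has the shape expected by the low-level `query` call, so the ordering comes from the index rather than from sorting in application code.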

Details are available here: Global Secondary Indexes - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Using

This is a very powerful feature. Be aware that the query is limited to (EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN) Condition - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Condition.html

You could make the Hash key something along the lines of a 'product category' id, then the range key as a combination of a timestamp with a unique id appended on the end. That way you know the hash key and can still query the date with greater than.
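One possible way to build such a range key is sketched below. The layout (13-digit zero-padded epoch milliseconds, then "#", then the id) and the names "Products", "CategoryId" and "SortKey" are assumptions chosen for this example. Zero-padding matters because string range keys compare character by character, not numerically.

```javascript
// Hypothetical sort-key layout: zero-padded epoch millis + "#" + unique id.
const WIDTH = 13; // 13 digits is enough for current millisecond timestamps

function makeSortKey(createdAtMs, uniqueId) {
  return String(createdAtMs).padStart(WIDTH, "0") + "#" + uniqueId;
}

// Lower bound for "created strictly after ms": every key with a later
// timestamp sorts above this string, every key at or before ms sorts below.
function lowerBoundAfter(ms) {
  return String(ms + 1).padStart(WIDTH, "0");
}

// Builds parameters for a query on a hypothetical "Products" table.
function buildCategoryQuery(categoryId, afterMs) {
  return {
    TableName: "Products",
    KeyConditionExpression: "CategoryId = :c AND SortKey > :lower",
    ExpressionAttributeValues: {
      ":c": { S: categoryId },
      ":lower": { S: lowerBoundAfter(afterMs) },
    },
  };
}
```

Because the timestamp comes first, items within one category sort by creation time, and the appended id only breaks ties between items created in the same millisecond.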

Given your current table structure this is not currently possible in DynamoDB. The huge challenge is to understand that the Hash key of the table (partition) should be treated as creating separate tables. In some ways this is really powerful (think of partition keys as creating a new table for each user or customer, etc...).

Queries can only be done in a single partition. That's really the end of the story. This means if you want to query by date (you'll want to use msec since epoch), then all the items you want to retrieve in a single query must have the same Hash (partition key).

I should qualify this. You absolutely can scan by the criterion you are looking for, that's no problem, but that means you will be looking at every single row in your table, and then checking if that row has a date that matches your parameters. This is really expensive, especially if you are in the business of storing events by date in the first place (i.e. you have a lot of rows.)
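For comparison, a scan-based version might look roughly like the sketch below. `scanPage` is a hypothetical async function that you would supply (for example a thin wrapper around your client's scan call) which takes scan parameters and resolves to an object with `Items` and an optional `LastEvaluatedKey`. The attribute name CreatedAt comes from the question; everything else is an assumption.

```javascript
// scanPage is a hypothetical injected function: (params) => Promise<{ Items, LastEvaluatedKey }>.
async function scanCreatedAfter(scanPage, tableName, sinceMs) {
  const items = [];
  let startKey;
  do {
    const params = {
      TableName: tableName,
      // The filter is applied after each row is read, so every row is still paid for.
      FilterExpression: "CreatedAt > :since",
      ExpressionAttributeValues: { ":since": { N: String(sinceMs) } },
    };
    if (startKey) params.ExclusiveStartKey = startKey; // resume where the last page ended
    const page = await scanPage(params);
    items.push(...page.Items);
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  // A scan does not return items in date order, so sort on the client.
  items.sort((a, b) => Number(a.CreatedAt.N) - Number(b.CreatedAt.N));
  return items;
}
```

The loop and the client-side sort are exactly the work that a well-chosen partition key lets a query avoid.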

You may be tempted to put all the data in a single partition to solve the problem, and you absolutely can. However, your throughput will be painfully low, given that each partition only receives a fraction of the table's total provisioned throughput.

The best thing to do is determine more useful partitions to create to save the data:

  • Do you really need to look at all the rows, or is it only the rows by a specific user?

  • Would it be okay to first narrow down the list by Month, and do multiple queries (one for each month)? Or by Year?

  • If you are doing time series analysis there are a couple of options: change the partition key to something computed on PUT to make the query easier, or use another AWS product like Kinesis, which lends itself to append-only logging.

Your Hash key (the primary key, of sorts) has to be unique (unless you also have a range key, as stated by others).
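The month-by-month idea from the list above could be sketched as follows. The partition key format "YYYY-MM" (in UTC), the table name "Events" and the attribute name "Month" are assumptions for this example. The code only builds one set of query parameters per month; running them and concatenating the results is left out.

```javascript
// Lists the "YYYY-MM" partition keys (UTC) touched by a time range.
function monthKeysBetween(startMs, endMs) {
  const keys = [];
  const cursor = new Date(startMs);
  cursor.setUTCDate(1); // snap to the first day of the starting month
  cursor.setUTCHours(0, 0, 0, 0);
  while (cursor.getTime() <= endMs) {
    keys.push(cursor.toISOString().slice(0, 7));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return keys;
}

// One query per month against a hypothetical "Events" table.
function buildMonthQueries(startMs, endMs) {
  return monthKeysBetween(startMs, endMs).map((month) => ({
    TableName: "Events",
    KeyConditionExpression: "#m = :month AND CreatedAt BETWEEN :from AND :to",
    ExpressionAttributeNames: { "#m": "Month" }, // MONTH is a reserved word
    ExpressionAttributeValues: {
      ":month": { S: month },
      ":from": { N: String(startMs) },
      ":to": { N: String(endMs) },
    },
  }));
}

const monthQueries = buildMonthQueries(Date.UTC(2023, 10, 15), Date.UTC(2024, 0, 10));
```

Running the queries in key order and appending the results keeps the overall output sorted by date, since each month's items already come back in CreatedAt order.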

In your case, to query your table you should have a secondary index.

|  ID  | DataID | Created | Data |
| hash | xxxxx  | 1234567 | blah |

Your Hash Key is ID.
Your secondary index is defined as: DataID-Created-index (that's the name that DynamoDB will use)

Then, you can make a query like this:

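// Assumes the AWS SDK for JavaScript (v2) is installed and configured.
var AWS = require("aws-sdk");
var ddb = new AWS.DynamoDB();

var params = {
    TableName: "Table",
    IndexName: "DataID-Created-index",
    KeyConditionExpression: "DataID = :v_ID AND Created > :v_created",
    // Replace "timestamp" with the msecs-since-epoch value, passed as a string
    ExpressionAttributeValues: {":v_ID": {S: "some_id"},
                                ":v_created": {N: "timestamp"}},
    // "Data" is a DynamoDB reserved word, so it is aliased as #d
    ExpressionAttributeNames: {"#d": "Data"},
    ProjectionExpression: "ID, DataID, Created, #d"
};

ddb.query(params, function(err, data) {
    if (err) {
        console.error(err);
    } else {
        // Query already returns items ordered by the index range key;
        // this sort is kept only as a safeguard.
        data.Items.sort(function(a, b) {
            return parseFloat(a.Created.N) - parseFloat(b.Created.N);
        });
        // More code here
    }
});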

Essentially your query looks like:

SELECT * FROM TABLE WHERE DataID = "some_id" AND Created > timestamp;

The secondary index will increase the read/write capacity units required, so you need to consider that. It is still a lot better than doing a scan, which is costly in reads and in time (and each Scan call returns at most 1 MB of data, so large tables need several paginated calls).

This may not be the best way of doing it, but for someone used to relational databases (I'm also used to SQL) it's the fastest way to get productive. Since there are no constraints in regard to schema, you can whip up something that works, and once you have the bandwidth to work on the most efficient way, you can change things around.

The approach I followed to solve this problem was to create a Global Secondary Index as below. I'm not sure if this is the best approach, but hopefully it is useful to someone.

Hash Key                 | Range Key
Date value of CreatedAt  | CreatedAt

To keep queries bounded, the HTTP API requires the caller to specify the number of days of data to retrieve, defaulting to 24 hours.

This way, I can always specify the HashKey as Current date's day and RangeKey can use > and < operators while retrieving. This way the data is also spread across multiple shards.
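A sketch of how the day-bucket queries might be built is below. The attribute name "CreatedDay", the index name "CreatedDay-CreatedAt-index", the table name "Table" and the UTC "YYYY-MM-DD" day format are assumptions for illustration. Note that a 24-hour window usually spans two calendar days, so it needs two queries.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hash key value for a timestamp: its UTC calendar day, e.g. "2024-03-10".
function dayKey(ms) {
  return new Date(ms).toISOString().slice(0, 10);
}

// One query per UTC day touched by the window [nowMs - days, nowMs].
// The window defaults to 1 day (24 hours).
function buildDayQueries(nowMs, days = 1) {
  const fromMs = nowMs - days * DAY_MS;
  const queries = [];
  // Start at UTC midnight of the first day and step one day at a time.
  for (let t = Math.floor(fromMs / DAY_MS) * DAY_MS; t <= nowMs; t += DAY_MS) {
    queries.push({
      TableName: "Table",
      IndexName: "CreatedDay-CreatedAt-index", // hypothetical index name
      KeyConditionExpression: "CreatedDay = :day AND CreatedAt BETWEEN :from AND :to",
      ExpressionAttributeValues: {
        ":day": { S: dayKey(t) },
        ":from": { N: String(fromMs) },
        ":to": { N: String(nowMs) },
      },
    });
  }
  return queries;
}

const dayQueries = buildDayQueries(Date.UTC(2024, 2, 10, 12, 0, 0));
```

Each day gets its own hash key value, which is what spreads the writes across partitions instead of piling them onto one.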
