Retrieving subfolders names in S3 bucket from boto3

Question

Using boto3, I can access my AWS S3 bucket:

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

Now, the bucket contains the folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534. I need to know the names of these sub-folders for another job I'm doing, and I wonder whether I could have boto3 retrieve those for me.

So I tried:

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

which gives a dictionary whose key 'Contents' gives me all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing things like:

{u'ETag': 'etag', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
 u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
 u'Size': size, u'StorageClass': 'storageclass'}

You can see that the specific files, in this case part-00014, are retrieved, while I'd like to get the name of the directory alone. In principle I could strip out the directory name from all the paths, but it's ugly and expensive to retrieve everything at the third level to get the second level.

I also tried something reported here:

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

User · Answer

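I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters. Note that StartAfter only sets where the listing begins, so keys that sort after the folder are returned as well; Prefix is the parameter that restricts a listing to one folder.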

import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

# A single call returns at most 1000 keys, all sorting after startAfter
theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects.get('Contents', []):
    print(obj['Key'])

Assuming the bucket holds no other keys that sort after this folder, the code above prints:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation
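
To make the difference between StartAfter and Prefix concrete, here is a small offline sketch. It does not call S3; simulate_listing and the key names are made up for illustration, and it only models the documented selection rules (lexicographic order, Prefix as a filter, StartAfter as a starting point) for plain ASCII keys.

```python
def simulate_listing(keys, prefix="", start_after=""):
    """Mimic how list_objects_v2 selects keys (no network access involved).

    Keys come back in lexicographic order. Prefix keeps only keys that begin
    with the given string; StartAfter only skips keys up to and including it.
    """
    return [
        key for key in sorted(keys)
        if key.startswith(prefix) and key > start_after
    ]


# Hypothetical bucket contents, for illustration only.
keys = [
    "firstlevelFolder/secondLevelFolder/item1",
    "firstlevelFolder/secondLevelFolder/item2",
    "firstlevelFolder/thirdFolder/item3",
    "zzz/unrelated",
]

after_only = simulate_listing(keys, start_after="firstlevelFolder/secondLevelFolder")
prefix_only = simulate_listing(keys, prefix="firstlevelFolder/secondLevelFolder/")
print(after_only)   # all four keys: everything that sorts after the marker
print(prefix_only)  # only item1 and item2
```

So if the goal is "everything inside one folder", passing Prefix (with a trailing slash) is probably the safer choice.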

In order to strip out only the directory name for secondLevelFolder I just used Python's split() method:

import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects.get('Contents', []):
    # e.g. ['firstlevelFolder', 'secondLevelFolder', 'item1']
    directoryName = obj['Key'].split('/')
    print(directoryName[1])

The code above prints one line per object:

secondLevelFolder
secondLevelFolder

Python split() Documentation

If you'd like to get the directory name AND the item name, replace the print line with the following:

print("{}/{}".format(directoryName[1], directoryName[2]))

And the following will be output:

secondLevelFolder/item1
secondLevelFolder/item2

Hope this helps

User · Answer

The piece of code below returns ONLY the 'subfolders' in a 'folder' of an S3 bucket.

import boto3

bucket = 'my-bucket'
# Make sure you provide / at the end
prefix = 'prefix-name-with-slash/'

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
# CommonPrefixes is missing from the response when there are no subfolders
for o in result.get('CommonPrefixes', []):
    print('sub folder : ', o.get('Prefix'))

For more details, you can refer to https://github.com/boto/boto3/issues/134
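
A single list_objects call returns at most 1000 entries, so a folder with many subfolders needs pagination. The following is a minimal sketch of the same Prefix/Delimiter idea using the list_objects_v2 paginator; iter_subfolders is a name made up for this example, and the client is passed in so the function can be tried without AWS access.

```python
def iter_subfolders(client, bucket, prefix=""):
    """Yield each common prefix ("subfolder") directly under prefix.

    client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # A non-empty prefix should end with the delimiter, otherwise S3 groups
    # on the folder itself instead of on its children.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # CommonPrefixes is absent from a page that has no subfolders.
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


# Usage (needs AWS credentials; the bucket name is a placeholder):
#   import boto3
#   for folder in iter_subfolders(boto3.client("s3"), "my-bucket-name", "first-level"):
#       print(folder)
```

Because it is a generator, the subfolder names are streamed page by page rather than collected in memory first.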

User · Answer

Why not use the s3path package, which makes it as convenient as working with pathlib? If you must use boto3, however:

Using boto3.resource

This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.

import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)
S3_RESOURCE = boto3.resource("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))

if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using boto3.client

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.

import logging
from typing import cast, List

import boto3

log = logging.getLogger(__name__)
S3_CLIENT = boto3.client("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    contents: List[dict] = []
    continuation_token = None
    if limit <= 0:
        return contents
    while True:
        # S3 returns at most 1000 keys per request
        max_keys = min(1000, limit - len(contents))
        request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
        if continuation_token:
            log.info(  # type: ignore
                "Listing %s objects in s3://%s/%s using continuation token ending with %s with %s objects listed thus far.",
                max_keys, bucket_name, prefix, continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
            response = S3_CLIENT.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
        else:
            log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.", max_keys, bucket_name, prefix, len(contents))
            response = S3_CLIENT.list_objects_v2(**request_kwargs)
        assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
        # "Contents" is absent when nothing matches the prefix
        contents.extend(response.get("Contents", []))
        is_truncated = response["IsTruncated"]
        if (not is_truncated) or (len(contents) >= limit):
            break
        continuation_token = response["NextContinuationToken"]
    assert len(contents) <= limit
    log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
    return contents

if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

User · Answer

First of all, there is no real folder concept in S3. You definitely can have a file at '/folder/subfolder/myfile.txt' and no folder nor subfolder.

To "simulate" a folder in S3, you must create an empty file with a '/' at the end of its name (see Amazon S3 boto - how to create a folder?).

For your problem, you should probably use the method get_all_keys with the two parameters prefix and delimiter (this is the older boto library, not boto3):

https://github.com/boto/boto/blob/develop/boto/s3/bucket.py#L427

for key in bucket.get_all_keys(prefix='first-level/', delimiter='/'):
    print(key.name)
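
To illustrate what prefix and delimiter do to a flat key space, here is a simplified offline model. group_keys and the sample keys are invented for this sketch; it only imitates the grouping rule and makes no claim about how S3 implements it.

```python
def group_keys(keys, prefix="", delimiter="/"):
    """Split flat keys into (objects, common_prefixes) the way a delimiter
    listing does: anything containing the delimiter after the prefix is
    rolled up into a single common prefix."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common_prefixes)


# Hypothetical keys modelled on the layout in the question.
keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753911111/part-00000",
    "first-level/README",
]
objects, folders = group_keys(keys, prefix="first-level/")
print(objects)  # ['first-level/README']
print(folders)  # ['first-level/1456753904534/', 'first-level/1456753911111/']
```

The "folders" exist only as a by-product of the grouping; no folder object has to be stored for them to show up.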

User · Answer

Here is a possible solution:

import boto3

def download_list_s3_folder(my_bucket, my_folder):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=my_bucket,
        Prefix=my_folder,
        MaxKeys=1000)
    # 'Contents' is absent when no key matches the prefix
    return [item["Key"] for item in response.get('Contents', [])]

User · Answer

It took me a lot of time to figure out, but finally here is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

import boto3

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
for obj in bucket.objects.filter(Prefix=prefix):
    print('{0}:{1}'.format(bucket.name, obj.key))
    FilesNotFound = False
if FilesNotFound:
    print("ALERT", "No file in {0}/{1}".format(bucket.name, prefix))

User · Answer

S3 is an object store; it doesn't have a real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that they can maintain/prune/add a tree in the application. For S3, you treat such a structure as a sort of index or search tag.

To manipulate objects in S3, you need boto3.client or boto3.resource. For example, to list all objects:

import boto3
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket='bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

If the S3 object names are stored using the '/' separator, the more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.

To limit the items to those under certain sub-folders:

import boto3
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)

Documentation

Another option is to use Python's os.path functions to extract the folder prefix. The problem is that this requires listing objects from undesired directories.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)
foldername = os.path.dirname(s3_key)

# if you are not using a conventional delimiter, e.g. '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.
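
One caveat with os.path is that it follows the host operating system's rules (os.path.join, for instance, inserts backslashes on Windows), while S3 keys conventionally use '/'. A possible alternative, sketched here with the key from the question, is pathlib.PurePosixPath, which always treats '/' as the separator.

```python
from pathlib import PurePosixPath

s3_key = "first-level/1456753904534/part-00014"  # key from the question
path = PurePosixPath(s3_key)

filename = path.name            # last component of the key
foldername = str(path.parent)   # everything before the last '/'
second_level = path.parts[1]    # the timestamp "folder"
print(filename, foldername, second_level)
```

This still requires the full key listing, so it has the same cost problem as any other string-splitting approach.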

User · Answer

The big realisation with S3 is that there are no folders/directories, just keys. The apparent folder structure is just prepended to the filename to become the 'Key', so to list the contents of myBucket's some/path/to/the/file/ you can try:

s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
    print(obj['Key'])

which would give you something like:

some/path/to/the/file/yo.jpg
some/path/to/the/file/meAndYou.gif
...

User · Answer

The following works for me. S3 objects:

s3://bucket/
    form1/
       section11/
         file111
         file112
       section12/
         file121
    form2/
       section21/
         file211
         file112
       section22/
         file221
         file222
         ...
      ...
   ...

Using:

from boto3.session import Session
session = Session()
s3client = session.client('s3')
resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']]

we get:

form1/
form2/
...

With:

resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']]

we get:

form1/section11/
form1/section12/
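
The common prefixes come back as full paths with a trailing slash. If only the bare subfolder names are wanted, as in the question, they can be trimmed afterwards. A minimal sketch, where subfolder_names is a helper name made up here and the response dict is cut down to the single field it reads:

```python
def subfolder_names(response, prefix=""):
    """Return bare subfolder names from a delimiter listing response."""
    names = []
    for entry in response.get("CommonPrefixes", []):
        full = entry["Prefix"]  # e.g. 'form1/section11/'
        names.append(full[len(prefix):].rstrip("/"))
    return names


# Hand-written response containing only the field used; values follow the
# form1 example in this answer.
resp = {"CommonPrefixes": [{"Prefix": "form1/section11/"}, {"Prefix": "form1/section12/"}]}
print(subfolder_names(resp, prefix="form1/"))  # ['section11', 'section12']
```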

User · Answer

I know that boto3 is the topic being discussed here, but I find that it is usually quicker and more intuitive to simply use awscli for something like this - awscli retains more capabilities than boto3, for what it is worth.

For example, if I have objects saved in "subfolders" associated with a given bucket, I can list them all out with something such as this:

1) 'mydata' = bucket name
2) 'f1/f2/f3' = "path" leading to "files" or objects
3) 'foo2.csv, barfar.segy, gar.tar' = all objects "inside" f3

So we can think of the "absolute path" leading to these objects as 'mydata/f1/f2/f3/foo2.csv', and so on.

Using awscli commands, we can easily list all objects inside a given "subfolder" via:

aws s3 ls s3://mydata/f1/f2/f3/ --recursive

User · Answer

The AWS cli does this (presumably without fetching and iterating through all keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way using boto3.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

It looks like they indeed use Prefix and Delimiter - I was able to write a function that would get me all directories at the root level of a bucket by modifying that code a bit:

import boto3

def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders

User · Answer

The following piece of code handles pagination, in case you are trying to fetch a large number of S3 bucket objects:

import boto3

def get_matching_s3_objects(bucket, prefix="", suffix=""):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    kwargs = {'Bucket': bucket}

    # We can pass the prefix directly to the S3 API. If the user has passed
    # a tuple or list of prefixes, we go through them one by one.
    if isinstance(prefix, str):
        prefixes = (prefix, )
    else:
        prefixes = prefix

    for key_prefix in prefixes:
        kwargs["Prefix"] = key_prefix

        for page in paginator.paginate(**kwargs):
            try:
                contents = page["Contents"]
            except KeyError:
                # Nothing under this prefix; move on to the next one
                break

            for obj in contents:
                key = obj["Key"]
                if key.endswith(suffix):
                    yield obj

User · Answer

As of Boto 1.13.3, it turns out to be as simple as this (if you skip all pagination considerations, which were covered in other answers):

import boto3

def get_sub_paths(bucket, prefix):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/',  # without a delimiter, no CommonPrefixes are returned
        MaxKeys=1000)
    return [item["Prefix"] for item in response.get('CommonPrefixes', [])]

User · Answer

Short answer:

Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So, imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].

Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing all the listing in memory. Instead, consider your "lister" as an iterator, and handle the stream it produces.

Use boto3.client, not boto3.resource. The resource version doesn't seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client.

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

import os
import boto3
from collections import namedtuple
from operator import attrgetter


S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then objects are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        ... bucket = s3.Bucket('bucket-name')

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

Test:

The following is helpful to test the behavior of the paginator and list_objects. It creates a number of dirs and files. Since the pages are up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus one object under dir, of course; S3 stores only objects).

import concurrent.futures

def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

With a little bit of doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts:

The Marker is really exclusive. Passing Marker=topdir + 'mixed/0500_foo_a' will make the listing start after that key (as per the Amazon S3 API), i.e., with .../mixed/0500_foo_b. That's the reason for __prev_str().

Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It's pretty good at not building enormous responses.

By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).

Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.
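
The 666/334 split can be checked offline with a bit of arithmetic. The sketch below is only a model, not a call to S3: it assumes each common prefix counts as one entry toward the 1000-entry page, that entries come back in lexicographic order, and the helper names (first_page_counts, mixed_entries, dirs_entries) are made up for the illustration.

```python
def first_page_counts(entries, page_size=1000):
    """Count keys and common prefixes among the first page_size listing entries.

    entries: iterable of (name, is_prefix) pairs; each common prefix counts as
    one entry toward the page size, just like a key does.
    """
    page = sorted(entries)[:page_size]
    n_prefixes = sum(1 for _, is_prefix in page if is_prefix)
    return len(page) - n_prefixes, n_prefixes


def mixed_entries(n=2000):
    """Depth-0 view of mixed/: one 'dir' prefix and two objects per index."""
    for k in range(n):
        yield (f"{k:04d}_dir/", True)
        yield (f"{k:04d}_foo_a", False)
        yield (f"{k:04d}_foo_b", False)


def dirs_entries(n=2000):
    """Depth-0 view of dirs/: only 'dir' prefixes."""
    for k in range(n):
        yield (f"{k:04d}_dir/", True)


print(first_page_counts(mixed_entries()))  # (666, 334)
print(first_page_counts(dirs_entries()))   # (0, 1000)
```

In the sorted order every third entry is a prefix, starting with the very first one, so the first 1000 entries hold 334 prefixes and 666 keys.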
