Downloading and unzipping a zip file without writing to disk

Question

I have managed to get my first Python script to work. It downloads a list of .ZIP files from a URL, then extracts the ZIP files and writes them to disk.

I am now at a loss to achieve the next step.

My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.

Here is my current script (Python 2), which works but unfortunately has to write the files to disk:

    import urllib, urllister
    import zipfile
    import urllib2
    import os
    import time
    import pickle

    # check for extraction directories existence
    if not os.path.isdir('downloaded'):
        os.makedirs('downloaded')

    if not os.path.isdir('extracted'):
        os.makedirs('extracted')

    # open logfile for downloaded data and save to local variable
    if os.path.isfile('downloaded.pickle'):
        downloadedLog = pickle.load(open('downloaded.pickle'))
    else:
        downloadedLog = {'key': 'value'}

    # remove entries older than 5 days (to maintain speed)

    # path of zip files
    zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

    # retrieve list of URLs from the webservers
    usock = urllib.urlopen(zipFileURL)
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    # only parse urls
    for url in parser.urls:
        if "PUBLIC_P5MIN" in url:

            # download the file
            downloadURL = zipFileURL + url
            outputFilename = "downloaded/" + url

            # check if file already exists on disk
            if url in downloadedLog or os.path.isfile(outputFilename):
                print "Skipping " + downloadURL
                continue

            print "Downloading ", downloadURL
            response = urllib2.urlopen(downloadURL)
            zippedData = response.read()

            # save data to disk
            print "Saving to ", outputFilename
            output = open(outputFilename, 'wb')
            output.write(zippedData)
            output.close()

            # extract the data
            zfobj = zipfile.ZipFile(outputFilename)
            for name in zfobj.namelist():
                uncompressed = zfobj.read(name)

                # save uncompressed data to disk
                outputFilename = "extracted/" + name
                print "Saving extracted file to ", outputFilename
                output = open(outputFilename, 'wb')
                output.write(uncompressed)
                output.close()

                # send data via tcp stream

                # file successfully downloaded and extracted store into local log and filesystem log
                downloadedLog[url] = time.time()
                pickle.dump(downloadedLog, open('downloaded.pickle', "wb"))

User · Answer

I'd like to add my Python 3 answer for completeness:

    from io import BytesIO
    from zipfile import ZipFile
    import requests

    def get_zip(file_url):
        url = requests.get(file_url)
        zipfile = ZipFile(BytesIO(url.content))
        zip_names = zipfile.namelist()
        if len(zip_names) == 1:
            file_name = zip_names.pop()
            extracted_file = zipfile.open(file_name)
            return extracted_file
        return [zipfile.open(file_name) for file_name in zip_names]

User · Answer

All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:

    import requests, zipfile, io

    r = requests.get(zip_file_url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall("/path/to/directory")
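Note that extractall writes the extracted files to disk, which the question wanted to avoid. A minimal sketch of an all-in-memory alternative follows. The function name read_zip_to_dict and the sample archive are illustrative assumptions, not part of the answer above; the sample bytes stand in for r.content.

```python
import io
import zipfile


def read_zip_to_dict(zip_bytes):
    """Return {member name: bytes} for every file in the archive, all in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        }


# Build a small archive in memory as a stand-in for the downloaded bytes.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")
    archive.writestr("b.txt", "hello")

files = read_zip_to_dict(sample.getvalue())
print(sorted(files))
```

With a real download, the call would presumably be read_zip_to_dict(r.content); nothing touches the filesystem.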

User · Answer

It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer (Python 2) to work without modification for most needs:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    def unzip_string(zipped_string):
        unzipped_string = ''
        zipfile = ZipFile(StringIO(zipped_string))
        for name in zipfile.namelist():
            unzipped_string += zipfile.open(name).read()
        return unzipped_string

User · Answer

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:

    # get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'

    import zipfile
    from StringIO import StringIO

    zipdata = StringIO()
    zipdata.write(get_zip_data())
    myzipfile = zipfile.ZipFile(zipdata)
    foofile = myzipfile.open('foo.txt')
    print foofile.read()

    # output: "hey, foo"

Or more simply (apologies to Vishal):

    myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
    for name in myzipfile.namelist():
        ...

In Python 3 use BytesIO instead of StringIO:

    import zipfile
    from io import BytesIO

    filebytes = BytesIO(get_zip_data())
    myzipfile = zipfile.ZipFile(filebytes)
    for name in myzipfile.namelist():
        ...

User · Answer

Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.

Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen
    import pandas

    url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
    zf = ZipFile(StringIO(url.read()))
    for item in zf.namelist():
        print("File in zip: " + item)
    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

(Note, I use Python 2.7.13)

This is the exact solution that worked for me. I just tweaked it a little for Python 3 by removing StringIO and using the io library.

Python 3 version:

    from io import BytesIO
    from zipfile import ZipFile
    import pandas
    import requests

    url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
    content = requests.get(url)
    zf = ZipFile(BytesIO(content.content))

    for item in zf.namelist():
        print("File in zip: " + item)

    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be
    # ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
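If pandas is not available, roughly the same thing can be sketched with the standard library alone. This is only an illustration: read_first_csv and its skip_rows parameter are hypothetical names, and UTF-8 encoding is an assumption about the data.

```python
import csv
import io
import zipfile


def read_first_csv(zip_bytes, skip_rows=0):
    """Parse the first member whose name ends in .csv into a list of rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        matches = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not matches:
            raise ValueError("no CSV file found in archive")
        with archive.open(matches[0]) as raw:
            # ZipFile.open yields bytes; wrap it so csv.reader receives text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.reader(text))
    # Drop leading rows, similar in spirit to pandas' skiprows.
    return rows[skip_rows:]
```

Passing content.content from requests (or url.read() from urlopen) as zip_bytes should work the same way, since both are plain bytes.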

User · Answer

Write to a temporary file which resides in RAM. It turns out the tempfile module (http://docs.python.org/library/tempfile.html) has just the thing:

    tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

    This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

    The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

    The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

    New in version 2.6.

Or, if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming.

User · Answer

Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:

    from urllib.request import urlopen

    from io import BytesIO
    from zipfile import ZipFile

    zip_url = 'http://example.com/my_file.zip'

    with urlopen(zip_url) as f:
        with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
            foofile = myzipfile.open('foo.txt')
            print(foofile.read())

If you already have the file downloaded locally, you don't need BytesIO; just open it in binary mode and pass it to ZipFile directly:

    from zipfile import ZipFile

    zip_filename = 'my_file.zip'

    with open(zip_filename, 'rb') as f:
        with ZipFile(f) as myzipfile:
            foofile = myzipfile.open('foo.txt')
            print(foofile.read().decode('utf-8'))

Again, note that you have to open the file in binary ('rb') mode, not as text, or you'll get a "zipfile.BadZipFile: File is not a zip file" error.

It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.
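The seeking point can be illustrated without a network connection. In the sketch below, NonSeekableStream is a hypothetical stand-in for an HTTP response (readable, forward-only), and zip_from_stream is an illustrative helper; neither comes from the answer above.

```python
import io
import zipfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP response: readable but not seekable."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Copy the next chunk of the payload into the caller's buffer.
        chunk = self._inner.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def zip_from_stream(stream):
    """Buffer a forward-only stream in memory so ZipFile can seek in it."""
    return zipfile.ZipFile(io.BytesIO(stream.read()))
```

ZipFile has to jump to the end of the archive to find its central directory, which is why the whole response is read into a BytesIO first.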

User · Answer

Adding on to the other answers using requests:

    # download from web
    import requests
    url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
    content = requests.get(url)

    # unzip the content
    from io import BytesIO
    from zipfile import ZipFile
    f = ZipFile(BytesIO(content.content))
    print(f.namelist())

    # outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Use help(f) to get more details on the available functions, e.g. extractall(), which extracts the contents of the zip file to disk so they can later be opened with "with open".

User · Answer

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations and changes, which may have been already mentioned.

    from io import BytesIO
    from zipfile import ZipFile
    import urllib.request

    url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

    with ZipFile(BytesIO(url.read())) as my_zip_file:
        for contained_file in my_zip_file.namelist():
            # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_zip_file.open(contained_file).readlines():
                print(line)
                # output.write(line)

Necessary changes:

- There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO, because we will be handling a bytestream (see the docs, and also this thread).
- urlopen: "The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen." (Docs and this thread.)

Note:

- In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings. Remember, we're reading a bytestream. Have a look at Dan04's excellent answer.

A few minor changes I made:

- I use with ... as instead of zipfile = ... according to the docs.
- The script now uses .namelist() to cycle through all the files in the zip and print their contents.
- I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
- I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
- Need to be careful here: because we have a byte string, we use binary mode, so "wb". I have a feeling that writing binary opens a can of worms anyway.
- I am using an example file, the UN/LOCODE text archive.

What I didn't do:

- NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it. Downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.

Here's a way:

    import urllib.request
    import shutil

    # the response is a byte stream, so the output file is opened in binary mode
    with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

User · Answer

Below is a code snippet I used to fetch a zipped csv file; please have a look.

Python 2:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(StringIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print line

Python 3:

    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    # or: requests.get(url).content

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print(line.decode('utf-8'))

Here file is a string holding the name of a member inside the archive. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance:

    resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.namelist()
    # ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
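To tie this back to the question's goal of passing the extracted CSV data over a TCP stream, here is a rough sketch under stated assumptions. The helper names (iter_zip_members, send_zip_members, make_sample_zip) are illustrative, the socket is assumed to be already connected, and no framing protocol is defined: the raw member bytes are simply written one after another.

```python
import io
import socket
import zipfile


def iter_zip_members(zip_bytes):
    """Yield (name, data) for every file in an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                yield info.filename, archive.read(info)


def send_zip_members(zip_bytes, sock):
    """Send the raw contents of each member over a connected socket.

    Returns the total number of payload bytes sent.
    """
    sent = 0
    for _name, data in iter_zip_members(zip_bytes):
        sock.sendall(data)
        sent += len(data)
    return sent


def make_sample_zip():
    """Build a tiny archive in memory; a stand-in for the downloaded bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("prices.csv", "time,price\n00:05,41.7\n")
    return buffer.getvalue()
```

In real use the socket would presumably come from socket.create_connection((host, port)), with host and port supplied by the caller, and zip_bytes would be resp.read() from the snippets above. The receiving side needs some agreed way to tell where one file ends and the next begins, which this sketch leaves open.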
