[python] Save Dataframe to csv directly to s3 Python

I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don't want to save the file locally before transferring it to s3. Is there any method like to_csv for writing the dataframe to s3 directly? I am using boto3.
Here is what I have so far:

import boto3
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
read_file = s3.get_object(Bucket, Key)
df = pd.read_csv(read_file['Body'])

# Make alterations to DataFrame

# Then export DataFrame to CSV through direct transfer to s3

This question is related to python csv amazon-s3 dataframe boto3

The answer is


This is a more up to date answer:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# Use 'w' for py3, 'wb' for py2
with s3.open('<bucket-name>/<filename>.csv','w') as f:
    df.to_csv(f)

The problem with StringIO is that it will eat away at your memory. With this method, you are streaming the file to s3, rather than converting it to string, then writing it into s3. Holding the pandas dataframe and its string copy in memory seems very inefficient.

If you are working in an ec2 instant, you can give it an IAM role to enable writing it to s3, thus you dont need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() function. See documention:https://s3fs.readthedocs.io/en/latest/


I read a csv with two columns from bucket s3, and the content of the file csv i put in pandas dataframe.

Example:

config.json

{
  "credential": {
    "access_key":"xxxxxx",
    "secret_key":"xxxxxx"
}
,
"s3":{
       "bucket":"mybucket",
       "key":"csv/user.csv"
   }
}

cls_config.json

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

class cls_config(object):

    def __init__(self,filename):

        self.filename = filename


    def getConfig(self):

        fileName = os.path.join(os.path.dirname(__file__), self.filename)
        with open(fileName) as f:
        config = json.load(f)
        return config

cls_pandas.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import io

class cls_pandas(object):

    def __init__(self):
        pass

    def read(self,stream):

        df = pd.read_csv(io.StringIO(stream), sep = ",")
        return df

cls_s3.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import boto3
import json

class cls_s3(object):

    def  __init__(self,access_key,secret_key):

        self.s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    def getObject(self,bucket,key):

        read_file = self.s3.get_object(Bucket=bucket, Key=key)
        body = read_file['Body'].read().decode('utf-8')
        return body

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from cls_config import *
from cls_s3 import *
from cls_pandas import *

class test(object):

    def __init__(self):
        self.conf = cls_config('config.json')

    def process(self):

        conf = self.conf.getConfig()

        bucket = conf['s3']['bucket']
        key = conf['s3']['key']

        access_key = conf['credential']['access_key']
        secret_key = conf['credential']['secret_key']

        s3 = cls_s3(access_key,secret_key)
        ob = s3.getObject(bucket,key)

        pa = cls_pandas()
        df = pa.read(ob)

        print df

if __name__ == '__main__':
    test = test()
    test.process()

If you pass None as the first argument to to_csv() the data will be returned as a string. From there it's an easy step to upload that to S3 in one go.

It should also be possible to pass a StringIO object to to_csv(), but using a string will be easier.


I found a very simple solution that seems to be working :

s3 = boto3.client("s3")

s3.put_object(
    Body=open("filename.csv").read(),
    Bucket="your-bucket",
    Key="your-key"
)

Hope that helps !


You can directly use the S3 path. I am using Pandas 0.24.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])

In [3]: df
Out[3]:
   a  b  c
0  1  1  1
1  2  2  2

In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)

In [5]: pd.__version__
Out[5]: '0.24.1'

In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')

In [7]: new_df
Out[7]:
   a  b  c
0  1  1  1
1  2  2  2

Release Note:

S3 File Handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.


since you are using boto3.client(), try:

import boto3
from io import StringIO #python3 
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
def copy_to_s3(client, df, bucket, filepath):
    csv_buf = StringIO()
    df.to_csv(csv_buf, header=True, index=False)
    csv_buf.seek(0)
    client.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key=filepath)
    print(f'Copy {df.shape[0]} rows to S3 Bucket {bucket} at {filepath}, Done!')

copy_to_s3(client=s3, df=df_to_upload, bucket='abc', filepath='def/test.csv')

I like s3fs which lets you use s3 (almost) like a local filesystem.

You can do this:

import s3fs

bytes_to_write = df.to_csv(None).encode()
fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
    f.write(bytes_to_write)

s3fs supports only rb and wb modes of opening the file, that's why I did this bytes_to_write stuff.


I use AWS Data Wrangler. For example:

import awswrangler as wr
import pandas as pd

# read a local dataframe
df = pd.read_parquet('my_local_file.gz')

# upload to S3 bucket
wr.s3.to_parquet(df=df, path='s3://mys3bucket/file_name.gz')

The same applies to csv files. Instead of read_parquet and to_parquet, use read_csv and to_csv with the proper file extension.


I found this can be done using client also and not just resource.

from io import StringIO
import boto3
s3 = boto3.client("s3",\
                  region_name=region_name,\
                  aws_access_key_id=aws_access_key_id,\
                  aws_secret_access_key=aws_secret_access_key)
csv_buf = StringIO()
df.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key='path/test.csv')

You can also use the AWS Data Wrangler:

import awswrangler as wr
    
wr.s3.to_csv(
    df=df,
    path="s3://...",
)

Note that it will handle multipart upload for you to make the upload faster.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to csv

Pandas: ValueError: cannot convert float NaN to integer Export result set on Dbeaver to CSV Convert txt to csv python script How to import an Excel file into SQL Server? "CSV file does not exist" for a filename with embedded quotes Save Dataframe to csv directly to s3 Python Data-frame Object has no Attribute (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape How to write to a CSV line by line? How to check encoding of a CSV file

Examples related to amazon-s3

How to specify credentials when connecting to boto3 S3? AWS S3 CLI - Could not connect to the endpoint URL How to write a file or data to an S3 object using boto3 The AWS Access Key Id does not exist in our records AccessDenied for ListObjects for S3 bucket when permissions are s3:* Save Dataframe to csv directly to s3 Python Listing files in a specific "folder" of a AWS S3 bucket How to get response from S3 getObject in Node.js? Getting Access Denied when calling the PutObject operation with bucket-level permission Read file content from S3 bucket with boto3

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to boto3

How to specify credentials when connecting to boto3 S3? Difference in boto3 between resource, client, and session? boto3 client NoRegionError: You must specify a region error only sometimes How to write a file or data to an S3 object using boto3 Save Dataframe to csv directly to s3 Python Read file content from S3 bucket with boto3 Retrieving subfolders names in S3 bucket from boto3 check if a key exists in a bucket in s3 using boto3 Unable to install boto3 How to choose an AWS profile when using boto3 to connect to CloudFront