Use the DataFrame returned by:
yourDF.orderBy(account)
There is no explicit way to use partitionBy
on a DataFrame, only on a PairRDD, but when you sort a DataFrame, it will use that in it's LogicalPlan and that will help when you need to make calculations on each Account.
I just stumbled upon the same exact issue, with a dataframe that I want to partition by account.
I assume that when you say "want to have the data partitioned so that all of the transactions for an account are in the same Spark partition", you want it for scale and performance, but your code doesn't depend on it (like using mapPartitions()
etc), right?