Supercharging Pandas: 3600x Faster with Vectorization and CuDF

Praneet Bomma · Analytics Vidhya · Sep 17, 2023


Working with tabular data is a breeze thanks to pandas and all the ease of implementation it provides. But as powerful as pandas is, it can't give you raw speed out of the box.

You can't expect a Fiat to zip like a Ferrari.

It just isn't built to do that. Today, I will show you exactly how I made that leap using a beautiful tool called CuDF and an even more powerful technique called vectorization.

Some time back, I encountered a significant challenge. There was a script that needed a speed boost — it was taking a whopping 2 hours and 30 minutes to finish its job. Naturally, my aim was to drastically cut down this runtime, ideally to something between 30 to 45 minutes. Without revealing any proprietary details, I simulated a similar scenario using dummy data to illustrate how I achieved this remarkable speed enhancement. Let’s jump right in!

Problem Statement

We have the positions of 24 objects, recorded every second, with each object's data stored in a separate file. The task is to compute, for every second, how far each object has moved relative to its position 15 seconds earlier (the distance delta).

We will synthesize this data ourselves and add some placeholder operations in between, purely to show the speed differences for the purposes of this blog. Here's how we synthesize the data.

import numpy as np
import pandas as pd
from tqdm import tqdm

for i in tqdm(range(24)):
    items = 500000
    data = pd.DataFrame({'x_pos': np.random.rand(items),
                         'y_pos': np.random.rand(items),
                         'numbers': list(range(items)),
                         'id': str(i + 1)})
    data.to_csv(f'file_{i+1}.csv', index=False)
    del data

We have created random positions for each object over 500,000 seconds and assigned a unique ID to every object. We need data at this scale to see any meaningful time differences.

Solution

A small disclaimer:
Please don’t bother too much about the solution or how we are going about solving the problem. This is a made-up problem to demonstrate the speed differences achieved using different techniques.

We will be writing a class with an __init__ and three operation functions; a minimal sketch of this structure follows the list.

  • __init__ will be used to read all 24 object CSVs.
  • operation_1 will perform placeholder calculations.
  • operation_2 will iterate through each object group.
  • operation_3 will calculate the delta distance with a step size of 15 seconds. This function is the most important one as it takes the most time, so we will concentrate on it from here on.
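The original code embed is not reproduced here, so the following is only a minimal sketch of how such a class might be wired together; the class name and the internals of each method are my assumptions, not the original code.

import glob
import pandas as pd

class DistanceDelta:
    def __init__(self):
        # Read all 24 per-object CSVs into one dataframe
        files = sorted(glob.glob('file_*.csv'))
        self.df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

    def operation_1(self, df):
        # Placeholder calculations (stand-in work for the blog)
        return df

    def operation_2(self):
        # Iterate through each object group and hand it to operation_3
        results = []
        for _, group in self.df.groupby('id'):
            results.append(self.operation_3(self.operation_1(group), step_size=15))
        return pd.concat(results, ignore_index=True)

    def operation_3(self, df, step_size=15):
        # Delta distance vs. the position step_size seconds ago
        # (the different implementations are discussed below)
        raise NotImplementedError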

I am not going to explain the code in detail unless it gets too complex, as I assume the reader is familiar with pandas and its basic functioning. If you aren't comfortable with the technical details, don't worry: you don't really need to understand the code to get the gist of the blog, as I will explain the approach in detail.

The first approach will be to iterate through each row and calculate the distance between the current position and the position 15 seconds ago. Here’s the code to do this.
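Since the original embed is not reproduced here, the snippet below is only a minimal sketch of what this row-by-row version of operation_3 might look like; the exact loop structure is my assumption.

def operation_3(self, df, step_size=15):
    # Row-by-row (SISD) version: for every row, look up the position
    # step_size seconds earlier and compute the Euclidean distance
    df = df.sort_values('numbers').reset_index(drop=True)
    dists = []
    for i in range(len(df)):
        if i < step_size:
            dists.append(-1)  # no position step_size seconds ago yet
            continue
        dx = df.loc[i, 'x_pos'] - df.loc[i - step_size, 'x_pos']
        dy = df.loc[i, 'y_pos'] - df.loc[i - step_size, 'y_pos']
        dists.append((dx ** 2 + dy ** 2) ** 0.5)
    df['dist'] = dists
    return df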

Please refrain from running this if you value your time because this will take hours to complete. This is the sequential (also the slowest) implementation of the solution. The estimated runtime of this implementation was around 5 to 6 hours. This is simply not efficient by any stretch of the imagination.

Note: All the runtimes were calculated on Google Colab with Tesla T4 GPU.

Let’s try to make this faster by a technique called vectorization. You might ask, “What is vectorization and how does it help make things faster?”. Well, let me dumb it down a little for you.

Essentially, what we are trying to do is carry out the same instruction on multiple data points in parallel. This is also called SIMD (Single Instruction, Multiple Data). The previous implementation was SISD (Single Instruction, Single Data), which carried out the given instruction on a single data point at a time. If we make use of the dedicated vector units on the CPU that support SIMD, this operation can be parallelized and sped up dramatically.
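As a toy illustration (not from the original code), here is the difference between looping over elements one at a time and letting NumPy apply the same instruction to the whole array at once:

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# SISD-style: one element at a time in a Python loop
out_loop = np.empty_like(a)
for i in range(len(a)):
    out_loop[i] = a[i] + b[i]

# SIMD-style: the same addition expressed as a single vectorized operation
out_vec = a + b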

The implementation of the vectorized version of the function operation_3 looks like this.

def operation_3(self, df, step_size=10):
    # Sort by time so that row i - step_size holds the position from
    # step_size seconds earlier
    df = df.sort_values('numbers')

    # Shift the frame up by step_size rows: row i of shifted_df holds
    # the position from step_size seconds later
    shifted_df = df.shift(-step_size)

    dists_shifted = ((shifted_df[['x_pos', 'y_pos']].values -
                      df[['x_pos', 'y_pos']].values) ** 2).sum(1) ** 0.5

    # Re-align so that dists[i] is the distance moved since step_size seconds
    # ago; the first step_size rows have no earlier position, so mark them -1
    dists = np.empty_like(dists_shifted)
    dists[:step_size] = -1
    dists[step_size:] = dists_shifted[:-step_size]

    df['dist'] = dists
    return df

As you can see, there are no loops anymore. We got rid of the loop that iterated through the millions of data points and carried out the operation one row at a time. Instead, we now carry out the same operation on all the data points at once. Sounds cool, doesn't it?

We make use of the shift operation available in pandas, which shifts/moves the rows by the given step size. The shifted dataframe is saved to a new variable and the distance calculation is carried out between the two dataframes.
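To make the shift behaviour concrete, here is a tiny made-up example (not part of the original code):

import pandas as pd

df = pd.DataFrame({'x_pos': [0.0, 1.0, 2.0, 3.0]})

# shift(-2) moves each row up by two positions and fills the end with NaN,
# so row i of the shifted frame holds the position from two steps later
print(df.shift(-2))
#    x_pos
# 0    2.0
# 1    3.0
# 2    NaN
# 3    NaN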

Now, can you guess how long it takes for this implementation to complete? If you guessed anywhere near what it actually took, you had very high expectations! It takes 60 seconds for the whole implementation to complete. That is nearly 300 times faster than the previous implementation! Do you think we can go faster than this? Of course, we can!

Now, we reimplement the same operations and functions using CuDF, a GPU-based implementation of the pandas API developed by RAPIDS (rapids.ai).

We were using the CPU for all the operations used above. Using GPU for the same operations should be much faster as the GPU is foundationally built for parallel processing.

Except for using cudf instead of pandas to load the data, everything remains the same. Some operations, like grouping, still have bugs in the cudf implementation which should probably be fixed in future releases; if you know an alternative way of doing it, please let me know in the comments below. For these operations, we need to convert the cudf dataframe to pandas. The implementation can be seen below.
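Since the embed is not reproduced here, the sketch below only illustrates the kind of change involved; the exact wiring is my assumption, and only the cudf-to-pandas round-trip for the buggy operations comes from the description above.

import cudf

step_size = 15

# Load the data directly onto the GPU; everything downstream stays the same
df = cudf.read_csv('file_1.csv')

# cuDF mirrors most of the pandas API, so the vectorized steps work unchanged
df = df.sort_values('numbers')
shifted_df = df.shift(-step_size)

# Operations that are still buggy in cuDF (e.g. the grouping step) can be
# round-tripped through pandas and brought back onto the GPU afterwards
pdf = df.to_pandas()
grouped = pdf.groupby('id')
df = cudf.from_pandas(pdf)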

Did the GPU-based implementation help make things faster? I'd say another 12x speedup counts as a success. Yes, you read that right: the new GPU-based implementation with CuDF runs in 5 seconds. We literally managed to bring the runtime down from 5 hours to 5 seconds, which is a whopping 3600 times faster than the original implementation.

In conclusion, we managed to speed up the delta distance calculation for 12 million data points from 5 hours to 5 seconds! Even though pandas does the job for most problems at hand, it is not always the fastest solution out there. Simple tricks like vectorization and the right tools like CuDF can turn your Fiat into a Ferrari! No offense meant to Fiat. 😅 Please comment below if I need to elaborate on any points or if there are any corrections needed. The code for this blog is available to use on Google Colab. You can access the code here.
