December 2021 – Librehat的部落格

Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with .apply() to something more performant.

You might say, why not using .map()? Because the look-up function is not y = f(x), no, it is more like y = f(x, a) or even y = f(x, a, b), depending on the level of nestedness.

As mentioned earlier, this can be implemented with .apply() by supplying a Python function that does the look-up. However, .apply() is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create a DataFrame with MultiIndex. With MultiIndex, we can easily apply .merge to join the DataFrame objects.

Hopefully the code snippet is more understandable.

import pandas as pd

nested_dict = {
    "A": {
        "Apple": "Red",
        "Banana": "Green",
    },
    "B": {"Apple": "Green", "Banana": "Yellow"},
}
df = pd.DataFrame.from_dict(
    {
        "Fruit": {0: "Apple", 1: "Banana", 2: "Banana"},
        "Price": {0: 0.911, 1: 1.734, 2: 1.844},
        "Bucket": {0: "A", 1: "B", 2: "A"},
    }
)

# Method 1: .apply()
# Apply Python function element-wise, as slow as a regular for loop
df1 = df.copy()
df1["Color"] = df1.apply(
    lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1
)
print(df1)

# Method 2: .merge()
# Vectorized, much faster (even though the big O is the same)
df2 = df.copy()
# The only overhead is to construct a dataframe with 'MultiIndex'
nested_df = pd.DataFrame.from_dict(
    {
        (inner_key, outer_key): value
        for outer_key, outer_value in nested_dict.items()
        for inner_key, value in outer_value.items()
    },
    orient="index",
)
nested_df.index = pd.MultiIndex.from_tuples(nested_df.index)
nested_df.rename(columns={0: "Color"}, inplace=True)
df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True)
print(df2)

Month: December 2021

2021年度回顾

Merge Pandas DataFrame with Nested Dictionary