Merge Pandas DataFrame with Nested Dictionary

Written by

Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with .apply() to something more performant.

You might say, why not using .map()? Because the look-up function is not y = f(x), no, it is more like y = f(x, a) or even y = f(x, a, b), depending on the level of nestedness.

As mentioned earlier, this can be implemented with .apply() by supplying a Python function that does the look-up. However, .apply() is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create a DataFrame with MultiIndex. With MultiIndex, we can easily apply .merge to join the DataFrame objects.

Hopefully the code snippet is more understandable.

import pandas as pd

nested_dict = {
    "A": {
        "Apple": "Red",
        "Banana": "Green",
    },
    "B": {"Apple": "Green", "Banana": "Yellow"},
}
df = pd.DataFrame.from_dict(
    {
        "Fruit": {0: "Apple", 1: "Banana", 2: "Banana"},
        "Price": {0: 0.911, 1: 1.734, 2: 1.844},
        "Bucket": {0: "A", 1: "B", 2: "A"},
    }
)

# Method 1: .apply()
# Apply Python function element-wise, as slow as a regular for loop
df1 = df.copy()
df1["Color"] = df1.apply(
    lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1
)
print(df1)

# Method 2: .merge()
# Vectorized, much faster (even though the big O is the same)
df2 = df.copy()
# The only overhead is to construct a dataframe with 'MultiIndex'
nested_df = pd.DataFrame.from_dict(
    {
        (inner_key, outer_key): value
        for outer_key, outer_value in nested_dict.items()
        for inner_key, value in outer_value.items()
    },
    orient="index",
)
nested_df.index = pd.MultiIndex.from_tuples(nested_df.index)
nested_df.rename(columns={0: "Color"}, inplace=True)
df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True)
print(df2)

English Python

Merge Pandas DataFrame with Nested Dictionary

More posts

Install Citrix Receiver for Linux in Toolbox

Soft AP on Linux with WPS

My Attempts to Integrate Home Assistant with ADT UK

Citrix Receiver UK Keyboard Layout Issue on Linux and Chrome OS