Month: December 2021

  • 2021年度回顾

    很少写这类年度回顾/年末总结的文章,最近很多想法,就在这里乱写一通来告别2021吧。

    (more…)
  • Merge Pandas DataFrame with Nested Dictionary

    Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with .apply() to something more performant.

    You might say, why not using .map()? Because the look-up function is not y = f(x), no, it is more like y = f(x, a) or even y = f(x, a, b), depending on the level of nestedness.

    As mentioned earlier, this can be implemented with .apply() by supplying a Python function that does the look-up. However, .apply() is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create a DataFrame with MultiIndex. With MultiIndex, we can easily apply .merge to join the DataFrame objects.

    Hopefully the code snippet is more understandable.

    import pandas as pd
    
    nested_dict = {
        "A": {
            "Apple": "Red",
            "Banana": "Green",
        },
        "B": {"Apple": "Green", "Banana": "Yellow"},
    }
    df = pd.DataFrame.from_dict(
        {
            "Fruit": {0: "Apple", 1: "Banana", 2: "Banana"},
            "Price": {0: 0.911, 1: 1.734, 2: 1.844},
            "Bucket": {0: "A", 1: "B", 2: "A"},
        }
    )
    
    # Method 1: .apply()
    # Apply Python function element-wise, as slow as a regular for loop
    df1 = df.copy()
    df1["Color"] = df1.apply(
        lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1
    )
    print(df1)
    
    # Method 2: .merge()
    # Vectorized, much faster (even though the big O is the same)
    df2 = df.copy()
    # The only overhead is to construct a dataframe with 'MultiIndex'
    nested_df = pd.DataFrame.from_dict(
        {
            (inner_key, outer_key): value
            for outer_key, outer_value in nested_dict.items()
            for inner_key, value in outer_value.items()
        },
        orient="index",
    )
    nested_df.index = pd.MultiIndex.from_tuples(nested_df.index)
    nested_df.rename(columns={0: "Color"}, inplace=True)
    df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True)
    print(df2)