很少写这类年度回顾/年末总结的文章,最近很多想法,就在这里乱写一通来告别2021吧。
(more…)Month: December 2021
-
Merge Pandas DataFrame with Nested Dictionary
Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with
.apply()to something more performant.You might say, why not using
.map()? Because the look-up function is noty = f(x), no, it is more likey = f(x, a)or eveny = f(x, a, b), depending on the level of nestedness.As mentioned earlier, this can be implemented with
.apply()by supplying a Python function that does the look-up. However,.apply()is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create aDataFramewithMultiIndex. WithMultiIndex, we can easily apply.mergeto join theDataFrameobjects.Hopefully the code snippet is more understandable.
import pandas as pd nested_dict = { "A": { "Apple": "Red", "Banana": "Green", }, "B": {"Apple": "Green", "Banana": "Yellow"}, } df = pd.DataFrame.from_dict( { "Fruit": {0: "Apple", 1: "Banana", 2: "Banana"}, "Price": {0: 0.911, 1: 1.734, 2: 1.844}, "Bucket": {0: "A", 1: "B", 2: "A"}, } ) # Method 1: .apply() # Apply Python function element-wise, as slow as a regular for loop df1 = df.copy() df1["Color"] = df1.apply( lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1 ) print(df1) # Method 2: .merge() # Vectorized, much faster (even though the big O is the same) df2 = df.copy() # The only overhead is to construct a dataframe with 'MultiIndex' nested_df = pd.DataFrame.from_dict( { (inner_key, outer_key): value for outer_key, outer_value in nested_dict.items() for inner_key, value in outer_value.items() }, orient="index", ) nested_df.index = pd.MultiIndex.from_tuples(nested_df.index) nested_df.rename(columns={0: "Color"}, inplace=True) df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True) print(df2)