很少写这类年度回顾/年末总结的文章,最近很多想法,就在这里乱写一通来告别2021吧。
(more…)Month: December 2021
-
Merge Pandas DataFrame with Nested Dictionary
Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with
.apply()
to something more performant.You might say, why not using
.map()
? Because the look-up function is noty = f(x)
, no, it is more likey = f(x, a)
or eveny = f(x, a, b)
, depending on the level of nestedness.As mentioned earlier, this can be implemented with
.apply()
by supplying a Python function that does the look-up. However,.apply()
is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create aDataFrame
withMultiIndex
. WithMultiIndex
, we can easily apply.merge
to join theDataFrame
objects.Hopefully the code snippet is more understandable.
import pandas as pd nested_dict = { "A": { "Apple": "Red", "Banana": "Green", }, "B": {"Apple": "Green", "Banana": "Yellow"}, } df = pd.DataFrame.from_dict( { "Fruit": {0: "Apple", 1: "Banana", 2: "Banana"}, "Price": {0: 0.911, 1: 1.734, 2: 1.844}, "Bucket": {0: "A", 1: "B", 2: "A"}, } ) # Method 1: .apply() # Apply Python function element-wise, as slow as a regular for loop df1 = df.copy() df1["Color"] = df1.apply( lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1 ) print(df1) # Method 2: .merge() # Vectorized, much faster (even though the big O is the same) df2 = df.copy() # The only overhead is to construct a dataframe with 'MultiIndex' nested_df = pd.DataFrame.from_dict( { (inner_key, outer_key): value for outer_key, outer_value in nested_dict.items() for inner_key, value in outer_value.items() }, orient="index", ) nested_df.index = pd.MultiIndex.from_tuples(nested_df.index) nested_df.rename(columns={0: "Color"}, inplace=True) df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True) print(df2)