Pandas Immutability! (the ultimate superpower)

Lipsa Panda
Apr 3, 2021

In the beginning: there was a mutation.

Today we talk about mutation in Pandas, why it can be extremely dangerous and how one can get around it. For those of you who are just getting started in programming: here's a quick definition of mutability to save you the googling:

"In object-oriented and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created. [Wiki]"

As you can imagine: a mutable object can be changed after it has been created. Think of a database as an example of a mutable object: after it has been created, people and programs will continue to add data to it over time and the object itself will change.

In Python, certain types of objects are mutable and others are immutable. Lists, for example, can change after you instantiate them, whereas tuples cannot:

In [694]: some_list
Out[694]: [1, 2, 3, 4, 5]

In [695]: some_tuple
Out[695]: (1, 2, 3, 4, 5)

In [696]: dupe_tuple = some_tuple

In [697]: dupe_list = some_list

In [698]: dupe_tuple = dupe_tuple + (6,)

In [705]: dupe_list.append(6)

In [716]: some_list
Out[716]: [1, 2, 3, 4, 5, 6]

In [717]: some_tuple
Out[717]: (1, 2, 3, 4, 5)

In [721]: dupe_list
Out[721]: [1, 2, 3, 4, 5, 6]

In [723]: dupe_tuple
Out[723]: (1, 2, 3, 4, 5, 6)

What just happened?

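Why did "some_list" change too? Assignment in Python never makes a copy: after "dupe_list = some_list", both names point at the same list object in memory, so append() changes that one object in place and both names see the change. A tuple cannot be changed in place, so "dupe_tuple + (6,)" literally creates a new object in memory and points dupe_tuple at it, leaving some_tuple untouched. In the code below, see how the new tuple has a different object id than the old one whereas the lists have the same id.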

In [724]: id(dupe_tuple), id(some_tuple) # different ids
Out[724]: (4603452200, 4601105312)

In [736]: id(dupe_list), id(some_list) # same id
Out[736]: (4605245192, 4605245192)
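
The same behavior can be reproduced as a short script. This is only a minimal sketch: the variable names mirror the session above, and independent_list is a name introduced here for illustration. It also shows that an explicit list.copy() (a shallow copy) is what actually produces a second list object.

```python
some_list = [1, 2, 3, 4, 5]
some_tuple = (1, 2, 3, 4, 5)

# Plain assignment binds a second name to the same object; nothing is copied.
dupe_list = some_list
dupe_tuple = some_tuple
assert dupe_list is some_list
assert dupe_tuple is some_tuple

# append() changes the one shared list in place, so both names see it.
dupe_list.append(6)
assert some_list == [1, 2, 3, 4, 5, 6]

# + builds a new tuple and rebinds dupe_tuple; some_tuple is untouched.
dupe_tuple = dupe_tuple + (6,)
assert dupe_tuple is not some_tuple
assert some_tuple == (1, 2, 3, 4, 5)

# An explicit copy creates a separate list object (illustrative name).
independent_list = some_list.copy()
independent_list.append(7)
assert some_list == [1, 2, 3, 4, 5, 6]
```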

Mutability in action

Now the above can make sense if you're used to any type of imperative/procedural programming framework. Code in that framework looks like: read a CSV into a variable called data, do things to that data object line by line, then spit out the data object. The entire code reads linearly from top to bottom. Steps written at the top may need to be written out again if they are needed a second time. The data object maintains a central state and that state is changed by linear statements.

"Imperative Programming: Programs as statements that directly change computed state (datafields)."

In this framework, why would you want to unnecessarily make copies of the original object in memory? The answer is: you probably don't need to. BUT most programs now are much more complicated. In many cases, several parts of a program will want to make changes to a single object at the same time. Think about a database for a retail store that has 100k shoppers making purchases online at the same time. For these scenarios, we might think of different programming paradigms (like my personal favorite: functional programming). But the reality is: mutability can be dangerous.

Why could mutability be a problem?

Mutability is mainly a problem if you're not expecting the original data to change. If you want to inspect an older state of that object, you'll suddenly find that you can't because the object no longer has that state.

This is when you want to use functions -- to separate out business logic from action in a data script. Functions don't solve everything in Python, and you'll see why later, but functions do decrease the likelihood of something like this happening and allow you to write code to test your code.

Immutability in action

Functional programming allows you to treat computation as mathematical functions that take in an object, do things to that object, and result in a new object. This is particularly useful for multiple reasons:

Reuse of code: a function used to describe a business computation can be re-used on multiple data objects of the same shape, e.g. taking a data map for a product, computing the discounted sales price and adding it. If there are 3000 products, you don't need to re-write this line 3000 times.
Testable business logic: This method allows you to decompose large business operations into smaller actions and develop logic to unit test those actions. For instance: if I create a model predicting tomorrow's weather for cities based on historical data and I have a function that updates the current temperature column for each city with tomorrow's prediction and returns the result as a new object, I can rest assured the original dataset won't change. I can also write a test that checks that no matter how many changes people make to this code, the function always performs the same way. When this test fails, I know that the latest change to the code may have caused the problem.
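
Both points can be sketched with the discounted sales price example. Everything here is an assumption for illustration: the function name add_discounted_price, the product fields ("price", "discount") and the sample values are made up, and a plain dict stands in for the "data map".

```python
def add_discounted_price(product):
    """Return a new product dict with a discounted_price field added."""
    discounted = round(product["price"] * (1 - product["discount"]), 2)
    # {**product, ...} builds a new dict, so the input dict is left alone.
    return {**product, "discounted_price": discounted}


def test_add_discounted_price():
    original = {"name": "socks", "price": 10.0, "discount": 0.25}
    snapshot = dict(original)

    result = add_discounted_price(original)

    assert result["discounted_price"] == 7.5
    assert original == snapshot  # the input was not modified


# The same function is reused for every product in a (hypothetical) catalog.
catalog = [
    {"name": "socks", "price": 10.0, "discount": 0.25},
    {"name": "hat", "price": 20.0, "discount": 0.5},
]
priced_catalog = [add_discounted_price(p) for p in catalog]
test_add_discounted_price()
```

If someone later edits add_discounted_price so that it writes into the input dict, the snapshot comparison in the test fails and points at that change.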

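There are also identity functions, which take an object and return the exact same object with no alterations (not to be confused with idempotent functions, which give the same result whether you apply them once or many times), but we won't talk about those.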

If you're not using functions right now or not sure if you're using them right, stick with this blog and I'll show you things that will make you believe that this is a superpower.

However: Python functions don't automatically return a new object.

Even when you pass an object to a function, Python doesn't automatically make a new object: the function receives a reference to the very same object. I can chain functions together and assign the result to a new name, but unless a function explicitly builds a new object, the object it returns has the same object id as the one passed in. If that object is mutable, any change the function makes to it is a change to the original. In the code below, see how returning the existing object from a function doesn't change the object id in memory.


In [737]: def do_something(t):
     ...:    print('a')
     ...:    return t
     ...: 

In [738]: do_something(dupe_tuple)
a
Out[738]: (1, 2, 3, 4, 5, 6)

In [745]: id(do_something(dupe_tuple)), id(dupe_tuple) # same id
a
Out[745]: (4603452200, 4603452200)
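
A tuple can't be changed, so sharing it like this is harmless. With a mutable argument, the same behavior means the function can change the caller's data. A minimal sketch, with add_total and add_total_safe as hypothetical function names:

```python
def add_total(values):
    # values IS the caller's list, so append() changes the caller's data.
    values.append(sum(values))
    return values


def add_total_safe(values):
    # values + [...] builds a new list; the caller's list is left alone.
    return values + [sum(values)]


numbers = [1, 2, 3]
result = add_total(numbers)
assert result is numbers          # same object came back
assert numbers == [1, 2, 3, 6]    # and the original was changed

fresh = [1, 2, 3]
safe_result = add_total_safe(fresh)
assert safe_result is not fresh   # a new object came back
assert fresh == [1, 2, 3]         # the original is intact
```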

Is this a problem for Pandas too?

Yes, yes it is. This is a major problem. And this can be particularly scary if you're not used to programming paradigms and scripting languages. If this is a problem for regular Python objects, you can imagine that it can be a problem for complex classes like Pandas dataframes.

Analysts might want to change the datatypes of a dataset or remove nulls, or make a copy of a column in order to format the copy differently for analysis, and then find out the original data has been changed. Analysts commonly use imperative approaches to modifying data because historically data-focused languages like SAS and SPSS tend to be procedural rather than functional. See the code below for an example of how changing a column's data type inside a function ended up changing the data type in the original dataframe as well as in the returned one.

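import pandas as pd

# make a dataframe of integers
df = pd.DataFrame({'column_a': [1, 2, 3, 4]})
assert df.dtypes['column_a'] == 'int64'

def make_str(df):
    # column assignment modifies the dataframe that was passed in
    df['column_a'] = df['column_a'].astype(str)
    return df

# even though we passed through a function
# original df was mutated
# (string columns report as 'object' here; newer pandas versions
# may report a dedicated string dtype instead)
df2 = make_str(df)
assert df2 is df # same object
assert df2.dtypes['column_a'] == 'object' # passes
assert df.dtypes['column_a'] == 'object' # passes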

Here's another example: You start with a dataset at the top of the script, make what you think is a copy of it by assigning it to a new variable, and fill nulls on a column to include it in a group by. Then you take the original dataset and try to count how many nulls are in there using isnull() and suddenly there are no more nulls. The pandas methods are working exactly as intended; the surprise is that your change to the "copy" changed the original dataset, because both variables point at the same dataframe.
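
Here is a minimal sketch of that scenario. The column names ("group", "spend"), the fill value "unknown" and the variable names are all made up for illustration.

```python
import pandas as pd

original = pd.DataFrame({"group": ["a", None, "b"], "spend": [10.0, 5.0, 7.0]})
assert original["group"].isnull().sum() == 1

# This is not a copy: it is a second name for the same dataframe.
not_a_copy = original
not_a_copy["group"] = not_a_copy["group"].fillna("unknown")

# The null has disappeared from the "original" too.
assert original["group"].isnull().sum() == 0

# With a real copy, the source dataframe keeps its nulls.
source = pd.DataFrame({"group": ["a", None, "b"], "spend": [10.0, 5.0, 7.0]})
real_copy = source.copy()
real_copy["group"] = real_copy["group"].fillna("unknown")
spend_by_group = real_copy.groupby("group")["spend"].sum()

assert source["group"].isnull().sum() == 1
assert real_copy["group"].isnull().sum() == 0
```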

Some pandas methods do not mutate the dataframe they are called on: they return a new dataframe instead. If you use methods like ".assign" to create or modify columns you should be protected from state change like this. But if you prefer to write your own custom functions and have code that assigns columns in the way you see above, you may want to consider the solutions below.
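
As a small sketch of the ".assign" route, here is the integer-to-string conversion from earlier written so that the original dataframe is left alone. The name df_str is made up for illustration, and the checks use pandas' dtype helpers rather than dtype names so they don't depend on how a given pandas version labels string columns.

```python
import pandas as pd

df = pd.DataFrame({"column_a": [1, 2, 3, 4]})

# .assign returns a new dataframe in which column_a has been replaced...
df_str = df.assign(column_a=lambda d: d["column_a"].astype(str))

# ...and leaves the dataframe it was called on untouched.
assert df_str is not df
assert pd.api.types.is_integer_dtype(df["column_a"])
assert not pd.api.types.is_integer_dtype(df_str["column_a"])
```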

How can I fix this?

My solution has really been a combination of using functions to separate business logic from actual state changes as well as using a cute little function decorator to prevent state change. Here's an example of how I use the pandas "pipe" method to chain functions together to take a claims dataframe, merge in patient information, calculate age, subset to relevant age groups and aggregate with a group by. I show it here done two ways: through functions (which I find more readable, testable and reusable) and through native pandas methods.

Each function in the first block of code takes in a dataframe as its first argument and returns one and only one dataframe as its output.

# 1) chained custom functions
spend_by_age_gt_60 = claims_df\
    .pipe(merge_patient_df, patient_df=patient_df)\
    .pipe(calculate_age)\
    .pipe(subset_to_relevant_age_group)\
    .pipe(group_by_age_group_sum_spend)

# 2) the same steps with native pandas methods
spend_by_age_gt_60 = claims_df\
    .merge(patient_df, how='left', on='patient_id')\
    .assign(age=lambda x: (x['date'] - x['dob']).dt.days / 365)\
    .query("age > 60")\
    .groupby("age")['claim_cost'].sum()
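
To make the first chain concrete, here is a sketch of what the four functions might look like. The function bodies are guesses that mirror the native pandas version (they are not the original implementations), the min_age parameter is an addition, and the sample claims and patient data are invented.

```python
import pandas as pd


def merge_patient_df(claims_df, patient_df):
    # merge returns a new dataframe
    return claims_df.merge(patient_df, how="left", on="patient_id")


def calculate_age(df):
    # assign returns a new dataframe with the extra column
    return df.assign(age=lambda x: (x["date"] - x["dob"]).dt.days / 365)


def subset_to_relevant_age_group(df, min_age=60):
    # boolean indexing returns a new dataframe
    return df[df["age"] > min_age]


def group_by_age_group_sum_spend(df):
    return df.groupby("age")["claim_cost"].sum()


# Invented sample data
patient_df = pd.DataFrame({
    "patient_id": [1, 2],
    "dob": pd.to_datetime(["1950-01-01", "1990-06-15"]),
})
claims_df = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-02-01"]),
    "claim_cost": [100.0, 50.0, 75.0],
})

via_functions = (
    claims_df
    .pipe(merge_patient_df, patient_df=patient_df)
    .pipe(calculate_age)
    .pipe(subset_to_relevant_age_group)
    .pipe(group_by_age_group_sum_spend)
)

via_methods = (
    claims_df
    .merge(patient_df, how="left", on="patient_id")
    .assign(age=lambda x: (x["date"] - x["dob"]).dt.days / 365)
    .query("age > 60")
    .groupby("age")["claim_cost"].sum()
)

# Both routes give the same result, and claims_df keeps its original columns.
pd.testing.assert_series_equal(via_functions, via_methods)
assert list(claims_df.columns) == ["patient_id", "date", "claim_cost"]
```

Because every function takes a dataframe first and returns exactly one dataframe, each one can also be unit tested on its own with a tiny input like the ones above.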

Nevertheless, each function in the first example still needs to be made non-mutating if we want to be sure the original claims dataframe doesn't accidentally get changed.

For this I bring in: the immutability decorator. This decorator will allow you to make changes to a pandas dataframe in a function without accidentally modifying the original dataframe. A decorator is just a function that acts on another function -- in this case by deep-copying the arguments before they are passed into the second function, so that the function works on new objects in memory rather than the originals.

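from copy import deepcopy
from functools import wraps

def immutable():
    """ensures args to a function don't change"""
    def decorated(func):
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            # copy positional and keyword arguments so func never sees the originals
            new_args = deepcopy(args)
            new_kwargs = deepcopy(kwargs)
            res = func(*new_args, **new_kwargs)
            return res
        return wrapper
    return decorated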

In order to use this decorator, we just put it on the line before the definition of a Python function. Easy as pie.

# add immutability
@immutable()
def make_int(df, test_unused_arg):
    df['column_a'] = df['column_a'].astype(int)
    return df

# now original df is not mutated
df3 = make_int(df2, 'another arg')
assert df2.dtypes['column_a'] == 'object' # passes
assert df3.dtypes['column_a'] == 'int64' # passes
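
Because a decorated function still takes a dataframe and returns a dataframe, it slots into a ".pipe" chain unchanged. The sketch below makes a few assumptions: add_high_cost_flag, the "high_cost" column and the sample data are invented, and a compact version of the decorator is repeated so that the block runs on its own.

```python
from copy import deepcopy

import pandas as pd


def immutable():
    """Compact version of the decorator: deep-copies positional args."""
    def decorated(func):
        def wrapper(*args, **kwargs):
            new_args = deepcopy(args)
            return func(*new_args, **kwargs)
        return wrapper
    return decorated


@immutable()
def add_high_cost_flag(df, threshold):
    # In-place style column assignment; safe here because df is a copy.
    df["high_cost"] = df["claim_cost"] > threshold
    return df


claims_df = pd.DataFrame({"claim_cost": [100.0, 50.0, 75.0]})

# pipe calls add_high_cost_flag(claims_df, 60) through the wrapper.
flagged = claims_df.pipe(add_high_cost_flag, 60)

assert "high_cost" not in claims_df.columns  # original untouched
assert flagged["high_cost"].tolist() == [True, False, True]
```

One trade-off to keep in mind: the decorator copies the whole dataframe on every call, so a long chain of decorated functions over a large dataset pays for one full copy per step.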

How should I do this if I'm not using functions?

If you're not using functions -- that's ok but you've definitely broken my heart a little. No matter, pandas does give you a dataframe method that allows you to protect your code. When you make copies of dataframes and/or columns, use the .copy() method when assigning the data to a new variable. Used properly, this method ensures that any actions you take on the copy do not change the original dataframe.

data = pd.DataFrame({'col_1': [1, 2, 3, 4]})
data_2 = data.copy(deep=True)
data['col_2'] = data['col_1'].copy(deep=True)

In [749]: data
Out[749]: 
   col_1  col_2
0      1      1
1      2      2
2      3      3
3      4      4

In [750]: data_2
Out[750]: 
   col_1
0      1
1      2
2      3
3      4

This is what it looks like if you don't use .copy(deep=True) on the original dataframe and make a change.

In [751]: data_3 = data

In [762]: data['col_3'] = data['col_2']

In [766]: data_3
Out[766]: 
   col_1  col_2  col_3
0      1      1      1
1      2      2      2
2      3      3      3
3      4      4      4

There you have it.

That's the cleanest way to make sure that your pandas code doesn't change your dataframe on you when you least expect it. Here's the whole script in case you're interested in how it all stitches together. You can put immutable() in a utilities script and import it into your main one.

Who knew something so small could be so powerful?

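import pandas as pd
from copy import deepcopy
from functools import wraps

def immutable():
    """ensures args to a function don't change"""
    def decorated(func):
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            # copy positional and keyword arguments so func never sees the originals
            new_args = deepcopy(args)
            new_kwargs = deepcopy(kwargs)
            res = func(*new_args, **new_kwargs)
            return res
        return wrapper
    return decorated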

# make a dataframe of integers
df = pd.DataFrame({'column_a': [1, 2, 3, 4]})
assert df.dtypes['column_a'] == 'int64'

def make_str(df):
    df['column_a'] = df['column_a'].astype(str)
    return df

# even though we passed through a function
# original df was mutated
df2 = make_str(df)
assert df2.dtypes['column_a'] == 'object'
assert df.dtypes['column_a'] == 'object'

# add immutability
@immutable()
def make_int(df, test_unused_arg):
    df['column_a'] = df['column_a'].astype(int)
    return df

# now original df is not mutated
df3 = make_int(df2, 'another arg')
assert df2.dtypes['column_a'] == 'object'
assert df3.dtypes['column_a'] == 'int64'

Thanks for reading! Let me know if there's anything I can change or update or add. Look forward to hearing your suggestions.