Faking data for reporting (Part 2 of 2)

Lipsa Panda
Mar 16, 2021
6 min read

In my previous post we talked about synthetic 📈 data (specifically what it is and why it's useful). In this one, we're going to actually put our fingers where our brains are and ... welp that was a weird analogy. We are going to make some data! For this exercise I'm going to use Python 🐍 and I'm going to start with the most important thing you can do: define a schema 🏗.

Data Schema

Definitions for columns, data types, valid values, formatting and nullability.

Structure and Column Notes

We are creating a dataset of synthetic facility claims data. In this tabular datasource, we will be colocating header and line level information about claims. Each observation represents a claim line of a particular type. Claim lines represent things that happen to a particular member on a particular day by a particular provider. Fro instance, member "John Doe" was given an insulin shot by doctor "James Smith" with id "12443" at a "Skilled Nursing Facility (SNF)" on 4/1/2020. That encounter cost $1000.

claim_id: Each claim should have a unique number except when the claim has more than one line. These numbers should be in the format "claim_id_X" where X represents the number.

line_number: Certain types of claims will have more than one line (hha specifically). In this case, the "claim_id" field will be duplicated for each line as well as the "member_name", "provider_type" and "provider_id". The "line_number" variable will rise incrementally for each row for up to 10 rows.

member_name: This field will take the format LastName, FirstName and represents the name of the member getting the healthcare service.

provider_id: All healthcare providers have a unique identifier in this dataset that represents a 6 digit string where all the characters are digits between 0-9. These are meaningfully numbered where the first digit indicates which type of provider the provider id is for -- i.e. (ip = 1, snf = 2, hha = 3, and op = 4). Thus, an outpatient provider might have a provider id that looks liks 634282.

provider_type: Healthcare facility providers will typically fall into these four categories (for simplicity, I have consolidated some and omitted others) - Inpatient, Skilled Nursing, Home Health and Outpatient. This field should represent the type of provider the member went to see. The shorthand for these locations is (ip, snf, hha, op).

service_date: For any given claim this represents the date the service took place. For home health claims with multiple lines, line 1 should have the earliest date and all subsequent lines should be 1 day more than the previous. For example: a HHA claim might have 2014-01-01 as line 1's service date and then 2014-01-02 as line 2's service date and then 2014-01-03 and so on until the end of the lines for that claim.

claim_cost: Payments in healthcare are typically a function of two factors: "provider_type" and "procedure_code". As a general rule, snf > ip > op >> hha in terms of average costs. Similarly, cat therapy > mri > xray >> insulin. If you were to get "cat therapy" at a "snf" it would be more expensive than "cat therapy" in an "hha". "hha" claim lines typically are paid as components to the full cost.

procedure_code: This represents the type of procedure that was done for the patient. In the real world, this might be something like a laparoscopic surgery -- in here, cat therapy cures all.

diagnosis: This represents the disorder/diagnosis the patient is coming in with -- typically cat therapy is intended to cure loneliness.

is_lonely: This is a flag if the diagnosis was loneliness.

given_cat_therapy: This is a flag if the procedure_code indicates that cat therapy was given to the patient. This should only be given to people who are lonely.

high_cost: This is a flag if the claim_cost on any given line is greater than or equal to the 90th percentile of cost within a provider_type and procedure_code group.

outlier_cost_inpatient: This is a flag if the "claim_cost" on any given inpatient line is greater than or equal to the 99th percentile of cost for all inpatient claims.

Let's make some data

First I start by importing all the code I'm probably going to use:

import pandas as pd
import numpy as np
import datetime as dt
import random
import re
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(2019)

Then I set some global variables to define the valid values or pairs of values.

CLAIM_CT = 5000 # how many claims do we want to generate
LINE_MAX = 10  # how many lines can some claims have?

Each of the following functions generates a single value that matches the schema above.

def gen_claim_id(i):
    return 'claim_id_' + str(i)

def gen_line_number(i):
    return i + 1

def gen_member_name(i=0):
    firstnames = ['Alicia','John','Maria']
    lastnames = ['Quinta', 'Drummond', 'DogLover']
    return random.choice(lastnames) + ', ' + random.choice(firstnames)

def gen_provider_id(provider_type):
    provider_id_len = 6
    first_digit_map = {
        'ip': '1',
        'snf': '2',
        'hha': '3',
        'op': '4'
    }
    digits = [str(i) for i in range(10)]
    gen_digit = lambda : random.choice(digits)
    rest_length = range(provider_id_len - 1)
    rest = ''.join([gen_digit() for i in rest_length])
    return first_digit_map[provider_type] + rest

def gen_provider_type(i=0):
    valid_values = ['ip', 'snf', 'hha', 'op']
    return random.choice(valid_values)

def gen_svc_date(i=0):
    begin = {'yr': 2010, 'mo': 1, 'day': 1}
    end = {'yr': 2020, 'mo': 12, 'day': 28}
    return dt.datetime(
        random.choice(range(begin['yr'], end['yr'])), 
        random.choice(range(begin['mo'], end['mo'])),
        random.choice(range(begin['day'], end['day'])))

def gen_svc(i=0):
    proc_diag_pairs = [
        ('cat therapy', 'loneliness'),
        ('insulin', 'diabetes'),
        ('mri', 'broken collarbone'),
        ('xray', 'broken collarbone')
    ]
    return random.choice(proc_diag_pairs)

What if I want to generate large amounts of data?

def generate(idx, func):
    if isinstance(idx, int):
        # for any function if i'm passing a desired len, 
        # just create an array of that length
        return [func(i) for i in range(1, idx+1)]
    elif isinstance(idx, pd.Series):
        # otherwise pass the function on the series
        return idx.apply(func)

For claim cost, we will want to use the numpy statistical distribution simulators. Each setting (ip, snf etc) will have a different theoretical mean and std dev. Each procedure will have a payment adjustment factor to keep things consistent with the above specifications.

def generate_claim_cost(data):
    """for any given provider_type or procedure, 
    calculate a simulated distribution of 
    spend and attach it back to the original data."""
    
    # get variables
    provider_type = data['provider_type'].iloc[0]
    procedure = data['procedure_code'].iloc[0]
    
    # statistical distribution of cost
    by_provider = {
        'ip': {'mu': 8000., 'sigma': 400., 'dist': 'lognormal'},
        'snf': {'mu': 15000., 'sigma': 600., 'dist': 'lognormal'},
        'op': {'mu': 1200., 'sigma': 400., 'dist': 'lognormal'},
        'hha': {'low': 70, 'high': 150, 'dist': 'uniform'},
    }
    by_procedure = {
        'cat therapy': 1.2,
        'insulin': 0.1,
        'mri': 0.8,
        'xray': 0.7
    }
    args = by_provider[provider_type]
    args['size'] =  data.shape[0]
            
    get_std = lambda mu, sig: np.sqrt(np.log(1 + (sig/mu)**2))
    get_mean = lambda mu, std: np.log(mu) - std**2 / 2

    arr = []
    if args['dist'] == 'lognormal':
        args['mu'] *= by_procedure[procedure]
        std = get_std(args['mu'], args['sigma'])
        mean = get_mean(args['mu'], std)
        arr = np.random.lognormal(mean, std, args['size'])
    elif args['dist'] == 'uniform':
        args['low'] *= by_procedure[procedure]
        args['high'] *= by_procedure[procedure]
        arr = np.random.uniform(args['low'], args['high'], args['size'])
       
    return data.assign(claim_cost = arr)

You'll have to make the service dates chronological for claims with more than one line.

def coerce_svc_dates(data):
    """make sure that svc date is stepwise for all 
    claims with more than one line"""
    data['min_date'] = data.groupby('claim_id')['service_date']\
                           .transform('min')
    data['step'] = data.groupby('claim_id')\
                       .cumcount()\
                       .apply(lambda x: pd.Timedelta(x, 'd'))
    data['service_date'] = data['min_date'] + data['step']
    fmt = '%Y-%m-%d'
    data['service_date'] = data['service_date'].dt.strftime(fmt)
    return data.drop(['min_date', 'step'], axis=1)

And then you stitch it all together.

def generate_synthetic_claims():
    """generate a dataset with schema 1"""
    # create variables at the claim level
    data = pd.DataFrame({'provider_type': generate(CLAIM_CT, gen_provider_type)})
    row_ct = data.shape[0]
    data['claim_id'] = generate(row_ct, gen_claim_id)
    data['member_name'] = generate(row_ct, gen_member_name)
    data['provider_id'] = generate(data['provider_type'], gen_provider_id)
    
    # for hha, home_health make more than one line using cartesian product merge
    subs = (data.loc[data['provider_type'] == 'hha']).copy(deep=True)
    subs = subs[['claim_id', 'provider_type']].assign(key = lambda x: 1)
    lines = pd.DataFrame({'key': [1 for i in range(LINE_MAX)], 'line_number': range(LINE_MAX)})
    subs = subs.merge(lines, how='outer', on='key', validate='m:m').drop('key', axis=1)
    subs = subs.sample(frac=.75, replace=False) # vary how many lines exist for each hha claim
    data = data.merge(subs, how='outer', on=['claim_id', 'provider_type'], validate='1:m')
    
    # make line number
    data['line_number'] = data.groupby('claim_id').cumcount().apply(gen_line_number)
    
    # get number of rows
    row_ct = data.shape[0]
    
    # generate the rest of the columns
    data[['procedure_code', 'diagnosis']] = pd.DataFrame.from_records(generate(row_ct, gen_svc))
    data['service_date'] = generate(row_ct, gen_svc_date)
                     
    # generate claim cost which happens by group
    data = data.groupby(['provider_type', 'procedure_code'], as_index=False)\
               .apply(generate_claim_cost)\
               .reset_index(drop=True)
    
    # coercion for claims with multiple lines
    data['claim_id_num'] = data['claim_id'].apply(lambda x: x.split('_')[-1]).astype(int)
    data = data.sort_values(['claim_id_num', 'line_number']).reset_index(drop=True)
    data = coerce_svc_dates(data)
    data = data.drop('claim_id_num', axis=1)
    
    return data

Last we just add the flags.

def is_lonely(data):
    return data.assign(is_lonely = lambda x: x.diagnosis == 'loneliness')

def given_cat_therapy(data):
    return  data.assign(given_cat_therapy = lambda x: x.procedure_code == 'cat therapy')

def high_cost(data):
    data['high_cost'] = data.groupby(['provider_type', 'procedure_code'])['claim_cost']\
                            .transform(lambda x: np.percentile(x, q=90))
    data['high_cost'] = data['claim_cost'] >= data['high_cost']
    return data

def outlier_cost_inpatient(data):
    data['outlier_cost_inpatient'] = data.groupby(['provider_type'])['claim_cost']\
                                         .transform(lambda x: np.percentile(x, q=99))
    data['outlier_cost_inpatient'] = data['claim_cost'] >= data['outlier_cost_inpatient']
    data['outlier_cost_inpatient'] = np.where(data['provider_type'] == 'ip', 
                                              data['outlier_cost_inpatient'],
                                              None)
    return data

Creating a sample dataset - Voila!

generate_synthetic_claims()\
    .pipe(is_lonely)\
    .pipe(given_cat_therapy)\
    .pipe(high_cost)\
    .pipe(outlier_cost_inpatient)\
    .head(10)

That's it!

And just like that, you have a clean dataset to put through reporting platforms as test data or confuse your co-worker Rob. As you can see from this exercise, we had to deeply understand the underlying structure of the claims data in order to make the synthetic dataset look close to normal. There is still enough variation (in the cost for instance) that it looks like a real dataset but that variation has been parameterized.

You can use the individual generators to generate test data on the fly, either a single value or an array. But you could also use the generate_synthetic_claims() function to make a combined dataset. The flags in this dataset solely exist to reflect back to us certain things we hope to see in the data.

Please feel free to leave your thoughts, comments and suggestions below! How would you improve this code?