Faking data for reporting (Part 1 of 2)

Lipsa Panda
Mar 16, 2021
3 min read

In healthcare, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) reigns supreme. If you don't work in this industry: this law protects healthcare patients from having sensitive medical records stolen by limiting access to that data. If you work in analytics for a healthcare organization, you will typically only have access to Protected Health Information or Personally Identifiable Information (PHI/PII) if it is critical to your job function.

Consequently, most healthcare developers have little access to real PHI/PII. But how do I, as a healthcare developer, test my reporting application? 🤔 Typically, healthcare organizations do one of two things to get usable test data for applications:

1. Anonymize/de-identify real data by modifying the PHI/PII in it
2. Make shit up (generate a synthetic dataset)

As a general rule of thumb, #1 is preferred. Real data comes with tons of real-world edge cases that would be difficult to simulate because, by definition, they are rarely observed. Also: why make fake data when you don't have to? And yet, sometimes what must be done must be done.

So in this series, I'm going to chat with you all about synthetic datasets (what are they, why are they useful, how to make one, how to sound cool when talking about them).

What is synthetic data?

Synthetic data is any data generated by a computer algorithm that simulates the real world. Source: Wikipedia, you fools.

In order to create it, you need a crystal clear 🔮 understanding of the original dataset. You typically will need to define:

the schema (column/datatypes/format/order/valid values)
the data structure (relationships between data attributes, statistical distributions of variables)

For example, if you were to create a test dataset of diabetes patients, you may want both diabetes diagnosis codes and procedure codes indicating that the patient was being treated for diabetes, and the two would typically appear together in the same observation.
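To make the diabetes example concrete, here is a minimal sketch of one way to encode that kind of relationship with NumPy and pandas. Everything in it is an assumption for illustration: the probabilities are made up, the column names are hypothetical, and the two real-looking codes (ICD-10-CM E11.9 and CPT 83036) are just examples that you should check against your own code sets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# All of these numbers are made-up assumptions, not real prevalence figures.
N_CLAIMS = 1_000
P_DIABETES = 0.15          # share of claims carrying a diabetes diagnosis
P_PROC_GIVEN_DX = 0.80     # chance of a diabetes procedure when diagnosed
P_PROC_GIVEN_NO_DX = 0.02  # chance of that procedure otherwise

DIABETES_DX = "E11.9"      # ICD-10-CM: type 2 diabetes without complications
OTHER_DX = "OTHER_DX"      # placeholder for every other diagnosis
A1C_PROC = "83036"         # CPT: hemoglobin A1c test
OTHER_PROC = "OTHER_PROC"  # placeholder for every other procedure

# Draw the diagnosis first, then make the procedure depend on it.
has_dx = rng.random(N_CLAIMS) < P_DIABETES
p_proc = np.where(has_dx, P_PROC_GIVEN_DX, P_PROC_GIVEN_NO_DX)
has_proc = rng.random(N_CLAIMS) < p_proc

claims = pd.DataFrame({
    "claim_id": np.arange(1, N_CLAIMS + 1),
    "diagnosis_code": np.where(has_dx, DIABETES_DX, OTHER_DX),
    "procedure_code": np.where(has_proc, A1C_PROC, OTHER_PROC),
})

# Quick sanity check: the two codes should mostly show up together.
co_occurrence = pd.crosstab(claims["diagnosis_code"], claims["procedure_code"])
print(co_occurrence)
```

The point is the conditional draw: the procedure code is sampled with a probability that depends on the diagnosis, so the two columns end up correlated instead of independently random.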

That all being said, sometimes you don't have the luxury of defining all these things (time/patience/chutzpah 🤦🏾‍♀️). In that case, you only need to simulate the things that really matter to the application you're working on. Let's say I'm building a chart of inpatient claims spend -- the only thing I care about on a claim is its cost, and I only need to simulate a believable distribution of cost over time.
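For the inpatient spend chart, a rough sketch might look like the following. It assumes claim costs are right-skewed and reaches for a lognormal distribution, which is a common choice but not something the real data is guaranteed to follow. The median, spread, trend, and column names are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Every parameter here is an assumption to tune, not a real benchmark.
CLAIMS_PER_MONTH = 200
MEDIAN_COST = 12_000.0  # starting median cost per inpatient claim
SIGMA = 0.9             # spread of the lognormal (bigger = longer tail)
MONTHLY_TREND = 0.01    # 1% growth in the median each month

months = pd.date_range("2020-01-01", periods=12, freq="MS")

rows = []
for i, month in enumerate(months):
    # For a lognormal, exp(mean) is the median, so pass log(median) as the mean.
    median = MEDIAN_COST * (1 + MONTHLY_TREND) ** i
    costs = rng.lognormal(mean=np.log(median), sigma=SIGMA, size=CLAIMS_PER_MONTH)
    rows.append(pd.DataFrame({"service_month": month, "cost": costs.round(2)}))

claims = pd.concat(rows, ignore_index=True)

# This is the series the chart would actually plot.
monthly_spend = claims.groupby("service_month")["cost"].sum()
print(monthly_spend)
```

Only two columns get simulated because those are the only two the chart reads. Anything else about the claim can stay out of the fake dataset.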

Once you have an understanding of what the data should look like, generating a synthetic dataset is .... still not easy🍋🥧. You'll end up writing a program to fake the dataset of interest (probably making use of a parameterized statistical distribution generator). It'll probably take you a bunch of tries to get it right. You will probably never get it completely right because there are things about the data that are too rare to simulate.
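The "bunch of tries" part can be sketched as a small loop: wrap the distribution in a function with parameters, then compare a few percentiles against whatever summary statistics someone with access to the real data can share. The function names, the lognormal choice, and the candidate values below are all hypothetical.

```python
import numpy as np


def fake_costs(n, median, sigma, seed=None):
    """Draw n right-skewed costs from a lognormal with the given median."""
    rng = np.random.default_rng(seed)
    return rng.lognormal(mean=np.log(median), sigma=sigma, size=n)


def summarize(costs):
    """Return a few percentiles to compare against the real data's summary."""
    p50, p90, p99 = np.percentile(costs, [50, 90, 99])
    return {"p50": p50, "p90": p90, "p99": p99}


# Try a few candidate spreads and see which one looks most believable.
for sigma in (0.5, 0.9, 1.3):
    stats = summarize(fake_costs(10_000, median=12_000, sigma=sigma, seed=0))
    print(sigma, {name: round(value) for name, value in stats.items()})
```

The tails are where this gets hard. The median is easy to match, but the rare, very expensive claims are exactly the part a simple parameterized distribution tends to get wrong.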

So why do it at all?

Well, there are very clear use cases for synthetic data in healthcare. Researchers and developers need access to this type of data (often to test hypotheses and assumptions) before they get access to the real thing📈📊.

In fact, there's a government initiative (beginning 2019, ending 2022) supporting open-sourcing and collaborating on synthetic data models for Patient Centered Outcomes Research in collaboration with Synthea™. "Synthea uses publicly available data to generate synthetic health records and can export information in multiple standardized formats."

Not to mention, companies that generate this type of data (like Datagen) have raised $18.5 million. Make fake data to make fake money, that's what I say. Datagen improves the speed of machine learning by creating fake, labeled data for others to use in ML work. And while this doesn't solve inherent data biases, they hope that their product will serve as an industry enabler by parameterizing the process of creating fake data.

Other uses include:

Testing reporting applications
Demo-ing ideas to clients
Code challenges for prospective candidates
Machine Learning
Messing with your co-workers (Damn Rob this looks like the real thing but I can't put my finger on why it's weirding me out 🤔)

In a world where privacy is increasingly becoming a concern for the ordinary citizen, synthetic data also allows people to work on real world problems by reducing the risk of private information being inadvertently stolen or exposed 🙅🏽‍♀️🚫. Although granted, at least one person needs to have access to the real data to confirm if you've made a good fake. That person should be Matt Bomer (obviously).

So hopefully I've convinced you that it's useful?

Stick around for the next article (lots of Pandas/Numpy) to see how to make a quick synthetic dataset.
