Releasing Anonymized ETL flow datasets for FSM
Intro
As researchers working in industry, you rarely have the privilege of sharing the data you work on with the public. But this time, we do. Together with our IEEE BigData paper, Refactoring ETL Flows in The Wild, we at IBM Research Israel are releasing an anonymized ETL flows dataset for FSM research.
What is ETL, what is FSM, and what is this dataset? If you are curious about these questions, keep reading. We will briefly cover these topics and provide the info you need to get started.
What is ETL?
ETL stands for Extract, Transform, and Load, a three-phase data manipulation process. For many reasons, there is an enormous number of data tools and services, and from time to time, you need to combine data from a few of them and move data from one to another. That is why ETL tools exist. The Extract phase connects to (possibly) heterogeneous sources and loads their data into the system, the Transform phase performs various manipulations of the data (join, filter, clean, etc.), and the Load phase stores the manipulated data in the target.
A typical usage would be to collect data from databases that interact with users of an app, join, filter, and clean them, and load the data into a data warehouse for analytical purposes. That process could be repeated once in a while as necessary.
One of the most longstanding and trusted ETL tools is IBM’s DataStage™. In DataStage, an ETL process is called a flow. A flow is a graph where each node (aka stage) represents an action, and the edges represent the data that moves between nodes. Let’s take, for example, this flow:
It takes data from two DB2 tables, joins and filters them, joins them again with another table from PostgreSQL, and stores the result in a sequential file. Many clients of IBM are using DataStage, and they have thousands of flows crafted over the years for their ETL needs.
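To make the graph model concrete, here is a rough sketch of how such a flow could be held in memory. The stage labels and IDs below are invented for illustration and are not DataStage's actual schema:

```python
# A toy in-memory model of the example flow: each stage (node) carries a
# label, and directed edges describe how data moves between stages.
stages = {
    1: "db2_connector",        # first DB2 table
    2: "db2_connector",        # second DB2 table
    3: "join",                 # join the two DB2 tables
    4: "filter",               # filter the joined rows
    5: "postgresql_connector", # the PostgreSQL table
    6: "join",                 # join with the PostgreSQL table
    7: "sequential_file",      # store the result
}
edges = [(1, 3), (2, 3), (3, 4), (4, 6), (5, 6), (6, 7)]

# Source stages (no incoming edge) belong to the Extract phase; the single
# sink (no outgoing edge) is the Load phase.
sources = [n for n in stages if all(d != n for _, d in edges)]
sinks = [n for n in stages if all(s != n for s, _ in edges)]
print(sources, sinks)  # [1, 2, 5] [7]
```

Everything between the sources and the sink is the Transform phase.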
What is FSM?
Frequent Subgraph Mining is the task of finding repeating patterns in a very large graph or across a collection of graphs. In other words, we want to find subgraphs that appear many times in the explored graph or set of graphs. In each graph, we have nodes, edges, and labels that represent the content of the nodes and edges. A common subgraph is a recurring structure with identical labels. Here are two simple graphs with their highlighted common subgraph:
Discovering such patterns is desirable in various fields, including biology and chemical molecule discovery, social network analysis, and the automotive industry. In our context of ETL, and DataStage in particular, finding common subgraphs (aka subflows) can provide value from a few angles:
- Analyzing user behavior - Recognizing common subgraphs can help us understand the users, see what typical patterns they employ, and how they use the system. It can inform feature development, optimization, and so on.
- Flow authoring - Knowing frequent patterns can allow us to suggest the next step to users while they create flows, similar to code completion and suggestion tools. Building a library of common subflows can be a good base for that.
- Refactoring - When finding a subflow that appears in many flows, we can suggest the user extract it to a separate subflow and use it in their flows as a shared function. That makes the flows easier to understand, maintain, and update. This was the primary motivation for our work and the focus of our paper.
This problem is computationally hard (the number of candidate subgraphs grows exponentially), and it has been studied by the research community over the last few decades. Multiple heuristics and algorithms have been suggested to tackle it, including FSG, GASTON, gSpan, and more.
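To get a feel for what these algorithms compute, here is a toy sketch (graphs and labels invented for illustration) of the simplest possible case: counting the per-graph support of single labeled edges, the one-edge "subgraphs" that mining algorithms grow into larger patterns:

```python
from collections import Counter

# Each graph is a pair: nodes map id -> label, edges are (src, dst) pairs.
graphs = [
    ({1: "A", 2: "B", 3: "A"}, [(1, 2), (2, 3)]),
    ({1: "A", 2: "B"}, [(1, 2)]),
]

# Count labeled edge patterns (src_label, dst_label) across all graphs,
# at most once per graph -- FSM support is usually counted per graph.
support = Counter()
for nodes, edges in graphs:
    seen = {(nodes[s], nodes[d]) for s, d in edges}
    support.update(seen)

print(support)  # ('A', 'B') is supported by both graphs
```

Real FSM algorithms extend frequent one-edge patterns into larger subgraphs while pruning infrequent candidates along the way.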
The Dataset
While working on this problem as part of a research effort, we were able to access some customer flows, and we obtained approval to share an anonymized version of them with the public.
In DataStage, each flow is represented by a long and comprehensive JSON file containing a lot of details. For example, the JSON of the flow above has more than 4,000 lines. It contains many attributes for each stage - such as type, parameters, and UI elements (labels, coordinates, image) - alongside information about connections between stages, runtime variables, and more.
However, for the purposes of FSM, we are not interested in all of those details but only in the question of whether two nodes are identical, i.e., whether they have the same label. In order to construct the label and define what we consider identical, we introduce the concept of the lifter.
The Lifter and the Anonymization
Let’s look at a super-simplified example of a join stage in a flow:
{
"op-type" : "join",
"description" : "joining employees and departements by id",
"inputs" : [
"employees",
"departments"
],
"condition" : "employees.id == departments.id",
"image" : "icons/join.png",
"..." : "..."
}
A lifter takes the relevant parameters and constructs a unique label from them. The underlying question is — what is important to us when we consider whether that node is equivalent to another node in another flow? Clearly, we don't care about the description and the image, but most certainly, we care about the operation type. For other parameters, it depends on what kind of patterns we are looking for. If we are looking for general user behavior, we might want to consider only the operation type and ignore all the rest. Conversely, if we want to find common subflows to refactor, we need all the relevant parameters (e.g., the join key) so we can extract the subflow and reuse it in all the flows it appears in. Other cases could need different elements.
The content of the label is not important when we use the FSM algorithm, only that the same label will be attached to equivalent nodes. For that reason, we could easily hash the created label and get an anonymized version of it, keeping the structure of the flows and the ability to find common subgraphs using FSM algorithms, while omitting all the information that can expose our clients. That allowed us to publish these datasets.
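As an illustration of the idea (field names follow the simplified JSON above, and the 12-character digest truncation is a choice made for this sketch, not necessarily what the published datasets use), a lifter plus hashing could look like:

```python
import hashlib

# The simplified join stage from the example above.
stage = {
    "op-type": "join",
    "description": "joining employees and departments by id",
    "inputs": ["employees", "departments"],
    "condition": "employees.id == departments.id",
    "image": "icons/join.png",
}

def simple_lifter(stage):
    # Only the operation type matters.
    return stage["op-type"]

def detailed_lifter(stage):
    # Keep the parameters needed to extract and reuse the subflow;
    # ignore cosmetic fields like description and image.
    return f'{stage["op-type"]}|{stage["condition"]}'

def anonymize(label):
    # Equal labels hash to equal digests, so FSM still works on the result.
    return hashlib.sha256(label.encode()).hexdigest()[:12]

print(anonymize(simple_lifter(stage)))
print(anonymize(detailed_lifter(stage)))
```

Two stages get the same anonymized label exactly when the lifter extracts the same values from both, so FSM results are unaffected by the hashing.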
We published two versions of the datasets using two different lifters:
- Simple - lifting only the operation type of the node. In DataStage, there are more than two hundred types of stages, and mining them could be used to create tools to help flow authors.
- Detailed - lifting parameters of the node, aiming to take those needed for the refactoring tool we were building.
Notice that as you lift more parameters, there are more unique labels and fewer labels considered equivalent, so naturally, you will find fewer common subflows.
Format
The format of the dataset is simple and quite easy to understand:
- A line that starts with a `t` states the start of a flow (aka transaction), followed by `#` and the number of the flow.
- A line that starts with a `v` states a node (aka vertex), followed by an ID (unique within the flow) and a label (that could be shared among different flows).
- A line that starts with an `e` states an edge between the two following node IDs, with the following label (in our case, edge labels are irrelevant, hence they are always `1`).
Overall, a simple flow like this:
t # 1
v 1 A
v 2 B
v 3 A
e 1 2 1
e 2 3 1
e 3 1 1
would represent the graph:
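A minimal parser for this format could look like the sketch below (it assumes well-formed input and ignores edge labels, which are always 1 in our datasets):

```python
def parse_flows(text):
    """Parse the t/v/e format into a list of (vertices, edges) flows."""
    flows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0] == 't':    # t # <flow-number>: start a new flow
            flows.append(({}, []))
        elif parts[0] == 'v':  # v <id> <label>
            flows[-1][0][int(parts[1])] = parts[2]
        elif parts[0] == 'e':  # e <src> <dst> <label>
            flows[-1][1].append((int(parts[1]), int(parts[2])))
    return flows

example = """t # 1
v 1 A
v 2 B
v 3 A
e 1 2 1
e 2 3 1
e 3 1 1"""

vertices, edges = parse_flows(example)[0]
print(vertices)  # {1: 'A', 2: 'B', 3: 'A'}
print(edges)     # [(1, 2), (2, 3), (3, 1)]
```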
Statistics
Now, let’s write some code and take a glimpse of the datasets while extracting a few statistics.
Going over the flows, we can count the number of vertices in each flow and print out the minimal, maximal, average, and median size of flows for each dataset:
from py_markdown_table.markdown_table import markdown_table
from statistics import median, mean

table = []
lifter = 'detailed'
for i in range(1, 6 + 1):
    sizes = []
    with open(f'{lifter}-lifter/dataset_{i}.txt', 'r') as f:
        for line in f:
            if line.startswith('t'):
                sizes.append(0)  # new flow
            if line.startswith('v'):
                sizes[-1] += 1   # new vertex
    table.append({'Dataset': i,
                  'Number of Flows': len(sizes),
                  'Min Stages': min(sizes),
                  'Max Stages': max(sizes),
                  'Avg Stages': f'{mean(sizes):.2f}',
                  'Median Stages': int(median(sizes))})
print(markdown_table(table).get_markdown())
produces:
+----------------------------------------------------------------------+
|Dataset|Number of Flows|Min Stages|Max Stages|Avg Stages|Median Stages|
+-------+---------------+----------+----------+----------+-------------+
|   1   |      734      |    2     |    50    |   7.88   |      6      |
+-------+---------------+----------+----------+----------+-------------+
|   2   |      688      |    2     |    85    |   8.57   |      5      |
+-------+---------------+----------+----------+----------+-------------+
|   3   |     10829     |    1     |    86    |   7.16   |      5      |
+-------+---------------+----------+----------+----------+-------------+
|   4   |     10571     |    1     |   192    |   7.92   |      5      |
+-------+---------------+----------+----------+----------+-------------+
|   5   |     4031      |    2     |    88    |   7.14   |      4      |
+-------+---------------+----------+----------+----------+-------------+
|   6   |     3775      |    2     |    83    |   5.24   |      4      |
+----------------------------------------------------------------------+
Changing the code slightly, we can draw the size distribution of all the datasets together (omitting the extremely rare flows larger than 60 nodes):
As expected, most of the flows are small, but there is a considerable number of large ones.
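For reference, the size-bucketing part of that change can be sketched as follows (the dataset paths and the plotting call are omitted, and `size_distribution` is a helper name we made up for the sketch):

```python
from collections import Counter

def size_distribution(lines, cutoff=60):
    """Count how many flows have each number of vertices, up to a cutoff."""
    sizes = []
    for line in lines:
        if line.startswith('t'):
            sizes.append(0)   # a new flow begins
        elif line.startswith('v'):
            sizes[-1] += 1    # one more vertex in the current flow
    return Counter(s for s in sizes if s <= cutoff)

sample = ['t # 1', 'v 1 A', 'v 2 B', 't # 2', 'v 1 A']
print(size_distribution(sample))  # Counter({2: 1, 1: 1})
```

The resulting counts can then be fed, for example, to matplotlib's plt.bar to draw the distribution.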
We could also check the difference between the two lifters by counting the number of unique labels:
from py_markdown_table.markdown_table import markdown_table

table = []
for i in range(1, 6 + 1):
    table.append({'Dataset': i})
    for lifter in ['simple', 'detailed']:
        labels = set()
        with open(f'{lifter}-lifter/dataset_{i}.txt', 'r') as f:
            for line in f:
                if line.startswith('v'):
                    labels.add(line.strip().split(' ')[2])
        table[-1][lifter.capitalize() + ' Lifter'] = f'{len(labels):,}'
print(markdown_table(table).get_markdown())
and we got:
+-------------------------------------+
|Dataset|Simple Lifter|Detailed Lifter|
+-------+-------------+---------------+
|   1   |     114     |     4,330     |
+-------+-------------+---------------+
|   2   |     89      |     3,013     |
+-------+-------------+---------------+
|   3   |     225     |    28,414     |
+-------+-------------+---------------+
|   4   |     233     |    49,975     |
+-------+-------------+---------------+
|   5   |     131     |    16,098     |
+-------+-------------+---------------+
|   6   |     80      |     8,765     |
+-------------------------------------+
Wrapping up
The dataset files, together with the code we used here and some technical details, are available in our public GitHub repository.
If you want to learn more about the work we have done, you can take a look at the paper we have published and the companion blog post that came with it.
Although it is not considered a huge dataset in this era of Big Data and ML, it is a unique one, and we hope the community will use it to develop and benchmark FSM algorithms and other related tools. Let us know if you use it, and don’t forget to put a star on GitHub.