Making the data pipeline easier: My hackathon onboarding project

In their first week with Commit, each new Engineering Partner takes on a hackathon onboarding project. They build a project, present it to the Commit community, and write a short summary of their experience. There are no restrictions, no limits, no jokes.

We have lots of tools, like Tableau or Microsoft Power BI, that help to visualize data easily. However, those tools are only useful if the data is structured in the right way. Because data isn’t always in a nice format, or even collected in the same location, data scientists spend a lot of time writing code to find and shape the data the way they need.

For my hackathon onboarding project, I wanted to help data scientists spend more time working with the data and less time writing code. Many websites these days are built without a programmer: tools like WordPress or Wix let people build a website without coding. I wanted the same experience for data scientists.

How does it work?

The overall concept of the application is simple. There will be different modules or plug-ins that connect to one another to create a data pipeline.

What do you mean by module?

Data pipelines usually consist of three processes: extract, transform, and load (together they’re known as the ETL process). It’s a procedure where you read (extract) the data from different sources, convert (transform) the data to your needs, and put or insert (load) the data where you desire.

The ETL process

Each process has multiple modules, and each module does one specific job. For example, the extract process could have modules for reading data from a database, a Google Sheet, or an API endpoint. Every one of those modules performs extraction, but each does it from a different kind of source.
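To make the module idea concrete, here is a minimal TypeScript sketch. It is not the project's actual code: the `PipelineModule` shape, the module names, and the in-memory data are all assumptions chosen for illustration, and a real extract module would read from a database, sheet, or API instead of an array.

```typescript
// A row is a flat record, e.g. one line of a sheet or one API result.
type Row = Record<string, string | number>;

// The three processes named in the text.
type Stage = "extract" | "transform" | "load";

// Hypothetical module shape: each module belongs to one stage and does one job.
interface PipelineModule {
  name: string;
  stage: Stage;
  run(input: Row[]): Row[];
}

// Extract: stands in for reading a database, a Google Sheet, or an API endpoint.
const readInMemory = (rows: Row[]): PipelineModule => ({
  name: "read-in-memory",
  stage: "extract",
  run: () => rows,
});

// Transform: keep only rows whose quantity is at least `min`.
const minQuantity = (min: number): PipelineModule => ({
  name: "min-quantity",
  stage: "transform",
  run: (input) => input.filter((row) => Number(row["quantity"]) >= min),
});

// Load: append the rows to a caller-supplied array (a stand-in for a real destination).
const collectInto = (sink: Row[]): PipelineModule => ({
  name: "collect",
  stage: "load",
  run: (input) => {
    sink.push(...input);
    return input;
  },
});

// Run the modules in order, feeding each module's output to the next one.
function runPipeline(modules: PipelineModule[]): Row[] {
  return modules.reduce<Row[]>((rows, mod) => mod.run(rows), []);
}

// Made-up sample rows.
const sink: Row[] = [];
const result = runPipeline([
  readInMemory([
    { email: "ana@example.com", drink: "latte", quantity: 2 },
    { email: "bo@example.com", drink: "mocha", quantity: 1 },
  ]),
  minQuantity(2),
  collectInto(sink),
]);
```

Because every module has the same `run` signature, swapping one extract module for another does not change the rest of the pipeline.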

I also needed one more process that sits outside of ETL: an input module. The input module is where a user enters the values the pipeline needs, so they can configure those values up front and change them later without touching the rest of the pipeline.

What does it look like?

Before starting the project, I had to think about what was needed. I remembered using a tool called Draw.io to easily draw flow diagrams, and there were a few features I liked and wanted to copy:

Drag and drop modules and move them freely around the space
Show the flow of the diagram with arrows between items

Beyond those features, I also needed a way to run the data pipeline from the application. Without that, it would be very difficult to test whether anything worked.

Drag & Drop Feature

I needed a drag and drop feature so I could drag a module around and drop it wherever I wanted on screen. I found a great library called react-draggable (https://www.npmjs.com/package/react-draggable) that lets you drag an element anywhere you like. The implementation is simple enough that you can get it working right away.

Implementation of a draggable library

Connecting modules with arrows

Because the data pipeline has a flow, each module should connect to the other. The easiest way to represent the flow is by adding an arrow. I found another great library called react-xarrows (https://www.npmjs.com/package/react-xarrows) that allows me to easily represent arrows from one module to another. You just need to tell it which element is connected to which.
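The same "which element is connected to which" information can also drive execution order. The sketch below is an assumption about how that might be done, not the project's code: it stores each arrow as a `start`/`end` pair of module ids (mirroring the `start` and `end` props that react-xarrows' `Xarrow` component takes) and uses a topological sort (Kahn's algorithm) so that every module runs after the modules feeding it. The module ids are made up.

```typescript
// One arrow on the canvas: from the module with id `start` to the module with id `end`.
interface Connection {
  start: string;
  end: string;
}

// Kahn's algorithm: order the module ids so each one comes after all of its upstream modules.
function runOrder(moduleIds: string[], connections: Connection[]): string[] {
  const incoming = new Map<string, number>();
  const downstream = new Map<string, string[]>();
  for (const id of moduleIds) {
    incoming.set(id, 0);
    downstream.set(id, []);
  }
  for (const { start, end } of connections) {
    if (!incoming.has(start) || !incoming.has(end)) {
      throw new Error(`Unknown module in connection ${start} -> ${end}`);
    }
    downstream.get(start)!.push(end);
    incoming.set(end, incoming.get(end)! + 1);
  }

  // Modules with no incoming arrows can run first.
  const ready = moduleIds.filter((id) => incoming.get(id) === 0);
  const ordered: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    ordered.push(id);
    for (const next of downstream.get(id)!) {
      const remaining = incoming.get(next)! - 1;
      incoming.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }

  // If some modules were never released, the arrows form a loop.
  if (ordered.length !== moduleIds.length) {
    throw new Error("Connections contain a cycle");
  }
  return ordered;
}

// Hypothetical pipeline: two sources feed a join, which feeds an output module.
const pipelineOrder = runOrder(
  ["join", "sheet", "api", "output"],
  [
    { start: "sheet", end: "join" },
    { start: "api", end: "join" },
    { start: "join", end: "output" },
  ],
);
```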

Implementation of arrow feature

Testing the data pipeline

This was one of the most difficult parts of this project, where I would test whether my method of building a pipeline without code works. This is also the part where you have to connect the front-end and back-end systems. The major difficulty comes from not only being able to run the data pipeline, but showing the results back to the user.

Two things needed to be implemented. First, the ability to show the status of each module. A module either runs successfully or fails, and the easiest way to indicate this is with colour: if the module runs successfully, it turns green, and if it fails, it turns red.

Second, the application should show the input and output of each module for debugging purposes. I added a small icon to the module; when you click it, it shows what the module's inputs and outputs were.
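As a rough sketch of what the back end could hand to the front end for these two features, the TypeScript below records a status plus the input and output for every module. Everything here is an assumption for illustration: the `ModuleResult` shape, the choice to stop at the first failure, the grey "idle" state, and the specific Tailwind shades. Only the green/red convention comes from the description above.

```typescript
type Row = Record<string, string | number>;
type RunStatus = "idle" | "success" | "failed";

// What the UI needs per module: a status for the colour, plus input/output for the debug icon.
interface ModuleResult {
  moduleName: string;
  runStatus: RunStatus;
  input: Row[];
  output: Row[];
  error?: string;
}

// Tailwind utility classes: green for success, red for failure.
// The exact shades and the grey idle state are assumptions.
const statusClass: Record<RunStatus, string> = {
  idle: "bg-gray-200",
  success: "bg-green-500",
  failed: "bg-red-500",
};

interface Step {
  name: string;
  run(input: Row[]): Row[];
}

// Run the steps in order and record each step's input and output.
// After the first failure, the remaining steps are not run and stay idle.
function runWithReport(steps: Step[]): ModuleResult[] {
  const report: ModuleResult[] = [];
  let current: Row[] = [];
  let failed = false;
  for (const step of steps) {
    if (failed) {
      report.push({ moduleName: step.name, runStatus: "idle", input: [], output: [] });
      continue;
    }
    const input = current;
    try {
      current = step.run(input);
      report.push({ moduleName: step.name, runStatus: "success", input, output: current });
    } catch (err) {
      failed = true;
      report.push({
        moduleName: step.name,
        runStatus: "failed",
        input,
        output: [],
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return report;
}

// Hypothetical three-step run where the second step fails.
const runReport = runWithReport([
  { name: "read", run: () => [{ email: "ana@example.com", quantity: 2 }] },
  { name: "broken", run: () => { throw new Error("boom"); } },
  { name: "output", run: (rows) => rows },
]);
const moduleColours = runReport.map((r) => statusClass[r.runStatus]);
```

Keeping the failed module's input in the report is what makes the debug icon useful: the user can see exactly what the module received when it broke.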

Implementation of testing pipeline

Testing with Business Intelligence (BI) Tool

The main objective for the demo is to grab two datasets from different data sources (one from Google Sheets and one from an API), join the datasets on a shared field, and present the data in the BI tool.

Datasets

My dataset, which I made up in Google Sheets, includes email address, type of drink, and quantity, the kind of data a coffee shop might have. The API data, which I grabbed from https://reqres.in/api/users, contains information about each user: email, first name, and last name.

Sample dataset that’s being used

Pipeline configuration

Sample pipeline configuration

The data pipeline configuration is simple too. It loads data from Google Sheets and from the API, then joins the two datasets on email address.
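The join step can be sketched in a few lines of TypeScript. This is an illustrative inner join, not the project's join module: the function name `joinOn`, the field names, and the sample rows are assumptions, and it assumes each email appears at most once on the user side.

```typescript
type Row = Record<string, string | number>;

// Inner join: keep each left row that has a match in `right` on `key`,
// merged with that match. Assumes `key` values are unique within `right`.
function joinOn(left: Row[], right: Row[], key: string): Row[] {
  const index = new Map<string | number, Row>();
  for (const row of right) {
    const value = row[key];
    if (value !== undefined) index.set(value, row);
  }

  const joined: Row[] = [];
  for (const row of left) {
    const value = row[key];
    if (value === undefined) continue;
    const match = index.get(value);
    if (match !== undefined) joined.push({ ...match, ...row });
  }
  return joined;
}

// Made-up rows in the shape described above: coffee orders and user details.
const orders: Row[] = [
  { email: "ana@example.com", drink: "latte", quantity: 2 },
  { email: "bo@example.com", drink: "mocha", quantity: 1 },
  { email: "nobody@example.com", drink: "tea", quantity: 3 },
];
const users: Row[] = [
  { email: "ana@example.com", first_name: "Ana", last_name: "Lee" },
  { email: "bo@example.com", first_name: "Bo", last_name: "Kim" },
];

const joinedRows = joinOn(orders, users, "email");
```

The order with no matching user is dropped, which is the inner-join behaviour; a left join would keep it with the name fields missing.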

Testing the pipeline with the BI tool

Redash is an open-source BI tool that, like Tableau, lets you visualize data. Redash has specific requirements for how the JSON API output should look (the docs are at https://redash.io/help/data-sources/querying/json-api). Since I made an output module specifically for Redash, the pipeline can be used directly in Redash. All you need to do is pass it the URL of the data pipeline you configured.
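As an illustration of what such an output module might do, the sketch below wraps pipeline rows in an object with `columns` and `rows`, which is the shape Redash has used for URL-based results. Treat that shape, the type names, and the type inference rule as assumptions; the linked Redash docs are the authority on the exact requirements.

```typescript
type Row = Record<string, string | number>;

// Assumed column description: a name, a type, and a display name.
interface RedashColumn {
  name: string;
  type: "string" | "integer" | "float";
  friendly_name: string;
}

interface RedashResult {
  columns: RedashColumn[];
  rows: Row[];
}

// Build the column list from the rows. Each column's type is inferred from
// the first row in which that column appears.
function toRedashResult(rows: Row[]): RedashResult {
  const columns: RedashColumn[] = [];
  const seen = new Set<string>();
  for (const row of rows) {
    for (const name of Object.keys(row)) {
      if (seen.has(name)) continue;
      seen.add(name);
      const value = row[name];
      const type: RedashColumn["type"] =
        typeof value === "string" ? "string" : Number.isInteger(value) ? "integer" : "float";
      columns.push({ name, type, friendly_name: name });
    }
  }
  return { columns, rows };
}

// Made-up joined row, serialized the way an HTTP endpoint would return it.
const redashResult = toRedashResult([
  { email: "ana@example.com", first_name: "Ana", drink: "latte", quantity: 2 },
]);
const responseBody = JSON.stringify(redashResult);
```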

Running the data pipeline in Redash BI

Another interesting feature is that I can change an input value without modifying the existing data pipeline. For example, I created a bigger dataset in the same Google Sheets document, in a different tab called “Sheet2.” I can load this data quickly by simply changing the input parameter.
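One way this kind of parameter swap could work is with placeholders in a module's settings that are filled in from the input module at run time. The sketch below is a guess at the idea, not the project's mechanism: the `{{key}}` placeholder syntax, the `sheetTab` key, the "Sheet1" tab name, and the `A:C` range are all hypothetical.

```typescript
// Values supplied by the input module; the keys are hypothetical.
type Inputs = Record<string, string>;

// Replace each {{key}} placeholder in a module setting with the matching input value.
function resolveSetting(template: string, inputs: Inputs): string {
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder: string, key: string) => {
    const value = inputs[key];
    if (value === undefined) {
      throw new Error(`Missing input value for ${placeholder}`);
    }
    return value;
  });
}

// The extract module's range setting stays the same; only the input value changes.
const rangeSetting = "{{sheetTab}}!A:C";
const smallRange = resolveSetting(rangeSetting, { sheetTab: "Sheet1" });
const bigRange = resolveSetting(rangeSetting, { sheetTab: "Sheet2" });
```

The pipeline definition itself never changes between the two runs, which is what lets a BI tool request a different dataset just by passing a different parameter.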

Running the data pipeline while changing input parameters

What did I learn?

1. Tailwind CSS is awesome!

I had never used Tailwind CSS before, and I like to challenge myself by using something I’ve never worked with. Honestly, my experience using Tailwind CSS was much better than I expected. I was able to create components quickly and make them look the way I wanted. Typically, I use Material-UI or Bootstrap, libraries with pre-built components that tend to make every website look the same. Tailwind CSS lets you build each component the way you want.

One drawback is that you need to have a good understanding of CSS. It’s not beginner-friendly. If you do not have experience with CSS, you’ll have lots of trouble working with it.

2. Too many things to do but not enough time to implement

This is probably the same as any other hackathon project. You have a constraint on the timeline to complete the project and you probably need to sacrifice some things as you go. There were times when I wanted to add more features or refactor duplicate code. But, if I did, I probably wouldn’t have finished the project. Similarly, in startups, it’s very important to think about the goal of the project and the trade-offs that you are willing to make.

3. Coffee is my friend!

Yes! Caffeine surely helps, but that is not the point I’m trying to make. What I want to emphasize is the importance of taking a break when you need it. There was a time when I was stuck trying to figure out an algorithm. It was like doing LeetCode. I stared at the code for a while and just couldn’t figure it out. Finally, I said, “Okay, I need a coffee.” I started to make coffee, which took a good 10 minutes. After that, I came back to the computer and, weirdly enough, the algorithm popped into my head. I’m not sure if it was the caffeine or the break that helped, but I learned to go make coffee if I’m stuck on a problem. :D

Changjoo Jeon is a full-stack developer who is passionate about solving complex problems using modern technology. When he is not at the computer, he enjoys spending time with his family and taking long walks.