The first portion of Data Engineering Camp is complete, including a submitted and graded first project. The goal of the project was to demonstrate an understanding of fundamental ETL concepts, as well as the specific tech tools we reviewed during the first few weeks. I was partnered with another cohort teammate, and we divvied up the work based on our existing strengths, collaborating on planning, the demo outline, and the AWS services.

For the project, we decided to use data from the Senator Stock Watcher API as well as Yahoo Finance’s stock data to track the most popular stocks traded by U.S. Senators in 2023 and gather historical performance data for those stocks. Though the focus of the project was mainly on creating data pipelines rather than data analytics (this is a data engineering bootcamp, after all), we wanted to compile data that we’d enjoy analyzing.

The first step in our project was creating a Dockerfile we could use as the baseline image to later upload to AWS's ECR. We used docker compose to build and run a corresponding container locally. We loaded environment variables through the docker compose file, which made it simple to set a few flags for truncated data ingestion when testing locally. Once we had the Docker setup working, we began fleshing out the core business logic for our pipeline. We used a series of SQL scripts to create the appropriate database schema at the start of the program run, creating the necessary tables and columns for the data we envisioned.
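
For illustration, here's a minimal sketch of how the program's entry point could read those flags and apply the schema scripts at startup; the environment variable names, paths, and helper names are placeholders rather than our exact code:

```python
import os
from pathlib import Path

from sqlalchemy import create_engine, text

# Flags set via the docker compose file (names are placeholders).
TRUNCATE_INGESTION = os.getenv("TRUNCATE_INGESTION", "false").lower() == "true"
ROW_LIMIT = int(os.getenv("ROW_LIMIT", "500")) if TRUNCATE_INGESTION else None

# Connection string also supplied through the environment.
engine = create_engine(os.environ["DATABASE_URL"])


def create_schema(sql_dir: str = "sql/schema") -> None:
    """Run each schema-creation script (CREATE TABLE ...) at the start of the run."""
    with engine.begin() as conn:
        for script in sorted(Path(sql_dir).glob("*.sql")):
            conn.execute(text(script.read_text()))
```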

Our data ingestion sources, as mentioned, were the Senator Stock Watcher API and the yfinance library. We started with the Stock Watcher data, compiling senator trades for 2023 and extracting the ticker symbols for those trades. This laid the foundation for the yfinance data we ingested, using the ticker symbol as the join column for the additional data extracted from yfinance. We used pandas to load the data, remove n/a values, and rename and filter columns; we also concatenated a couple of separate data tables from yfinance for one specific table output. From there we loaded the dataframes into our Postgres database using pandas' to_sql.
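
Roughly, the extraction and transformation step looked like the sketch below; the URL, column names, and table names are illustrative assumptions rather than our exact schema:

```python
import pandas as pd
import requests
import yfinance as yf

# Placeholder URL standing in for the Senator Stock Watcher transactions feed.
SSW_URL = "https://example.com/senate-stock-watcher/all_transactions.json"


def extract_trades(year: int = 2023, limit: int | None = None) -> pd.DataFrame:
    """Pull senator trade records, drop rows without tickers, and keep one year."""
    trades = pd.DataFrame(requests.get(SSW_URL, timeout=30).json())
    trades = trades.dropna(subset=["ticker"])
    trades = trades[trades["transaction_date"].str.startswith(str(year))]
    trades = trades.rename(columns={"senator": "senator_name"})
    return trades.head(limit) if limit else trades


def extract_prices(tickers: list[str]) -> pd.DataFrame:
    """Fetch historical prices per ticker and stack them into one dataframe."""
    frames = []
    for symbol in tickers:
        history = yf.Ticker(symbol).history(period="1y").reset_index()
        history["ticker"] = symbol  # join column back to the trades table
        frames.append(history)
    return pd.concat(frames, ignore_index=True)


# Loading the cleaned frames into Postgres (engine as defined earlier):
# trades_df.to_sql("senator_trades", engine, if_exists="append", index=False)
# prices_df.to_sql("stock_prices", engine, if_exists="append", index=False)
```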

The last step in the pipeline logic was to create SQL views using the data loaded into the db. Though we weren’t focusing too heavily on data analytics, we still wanted to create a few views that would function as deliverables to an analytics end-user from our data pipeline. Similar to our schema creation process, we used SQL scripts invoked from our Python code to generate the necessary views.
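
A simplified sketch of that view-creation step; the view definition here is just an example, not one of our actual deliverables:

```python
from sqlalchemy import text

# Example of a view an analytics end-user might query, executed from Python
# the same way as the schema scripts.
MOST_TRADED_VIEW = """
CREATE OR REPLACE VIEW most_traded_tickers AS
SELECT ticker, COUNT(*) AS trade_count
FROM senator_trades
GROUP BY ticker
ORDER BY trade_count DESC;
"""


def create_views(engine) -> None:
    """Create (or replace) the analytics views after the data is loaded."""
    with engine.begin() as conn:
        conn.execute(text(MOST_TRADED_VIEW))
```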

With the logic wrapped up, we moved on to adding documentation and tests to the existing functions. We didn't have enough time to set up integration tests, and we would've liked to add more data quality tests. That said, we were able to get pytest set up with fixtures for mocking the STW API response, which let us test the data loading and transformation steps for that specific data source. At the very least, it created a working testing environment that could be expanded in the future.
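
The fixture setup looked something like this sketch, using pytest's monkeypatch to stub out the HTTP call; the module, function, and field names are placeholders:

```python
import pytest

import pipeline  # placeholder for the module containing extract_trades

SAMPLE_RESPONSE = [
    {"senator": "Jane Doe", "ticker": "ABC", "transaction_date": "2023-01-15"},
    {"senator": "John Roe", "ticker": None, "transaction_date": "2023-02-01"},
]


class FakeResponse:
    def json(self):
        return SAMPLE_RESPONSE


@pytest.fixture
def mock_stw_api(monkeypatch):
    """Replace the real HTTP call with canned Senator Stock Watcher data."""
    monkeypatch.setattr(pipeline.requests, "get", lambda *args, **kwargs: FakeResponse())


def test_extract_trades_drops_missing_tickers(mock_stw_api):
    trades = pipeline.extract_trades(year=2023)
    assert list(trades["ticker"]) == ["ABC"]  # the row with a null ticker is dropped
```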

The last (but not insignificant) step was actually deploying our work to AWS. We used four services as part of our deployment: ECR for storing our program's Docker image, S3 for storing our .env file, RDS for our hosted Postgres database, and ECS for actually triggering and running our code. The only step that required a decent amount of troubleshooting and tinkering was the ECS work; there were a lot of configuration options to sort through when defining our Tasks, which took some experimentation and careful review of the documentation, but we ultimately got it working, with logging from our code giving us good visibility into the Task's processing as the pipeline did its thing.
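
For context, here's one way a container can pull that .env file from S3 before the pipeline starts; the bucket and key names are placeholders, the sketch assumes the python-dotenv package, and it isn't necessarily the exact mechanism we used:

```python
import boto3
from dotenv import load_dotenv


def load_remote_env(bucket: str = "my-pipeline-config", key: str = ".env") -> None:
    """Download the .env file from S3 and load it into the process environment."""
    boto3.client("s3").download_file(bucket, key, "/tmp/.env")
    load_dotenv("/tmp/.env")
```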

We demo’d our work to the cohort as part of the project presentation, and added a few tech diagrams to the README to help provide an overview of how our code came together. Overall, this was a great first project. It took the concepts we’d been learning about and made them concrete, and gave us very practical hands-on experience with production-level deployment tools. Though I was already fairly proficient with Python, SQL, and Docker heading into this first project, putting them all together from scratch was a good test. Additionally, I learned a lot about specific AWS services in a very short amount of time. Actually deploying local code to a hosted environment that we could make accessible to others was a great learning experience, and it’s inspired me to pursue my first AWS certification this summer.

Some things that could be improved for a more robust data pipeline:

  • more test coverage, particularly around data quality. This article was a great read on best practices for that kind of testing, and something we’d apply in future projects
  • better use of batch loading when extracting our data. Currently, our pipeline loads all the data available from yfinance based on the given parameters, which was quite a bit of data. This wasn’t a problem when running the program locally, but it gave us trouble with the initial ECS task size, which we had to upgrade. Backfilling data would likely help here as well
  • on a related note, incrementally loading data would help round out the available data. We only focused on the year 2023, but restructuring the code to incrementally load both historical data from before that year and new data on a daily or weekly basis would be a great way to gain even more insight (see the sketch after this list)
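
As a rough illustration of that incremental idea, here's a sketch using a date watermark; the table, column, and function names are hypothetical:

```python
from datetime import date, timedelta

import pandas as pd
import yfinance as yf
from sqlalchemy import text


def get_watermark(engine, table: str = "stock_prices") -> date:
    """Read the latest loaded date so subsequent runs only fetch newer rows."""
    with engine.connect() as conn:
        latest = conn.execute(text(f"SELECT MAX(price_date) FROM {table}")).scalar()
    return latest or date(2023, 1, 1)  # fall back to the start of the backfill window


def incremental_prices(engine, tickers: list[str]) -> pd.DataFrame:
    """Fetch only the days after the watermark, one small batch per ticker."""
    start = get_watermark(engine) + timedelta(days=1)
    frames = []
    for symbol in tickers:
        history = yf.download(symbol, start=start.isoformat(), progress=False).reset_index()
        history["ticker"] = symbol
        frames.append(history)
    return pd.concat(frames, ignore_index=True)
```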

Our project can be found on GitHub. There’s more detailed information about the technicalities of the project in the README.

Written by

Leo Rubiano

Reader, programmer, traveler. Experienced back-end dev proficient with Python, Go, Elixir, Ecto, and Postgres.