A few months back, I decided to dive a little more seriously into the world of data engineering, and enrolled in a data engineering camp to get some guided, hands-on experience. The first week is over, and though it’s mostly been a review of foundational material I’m pretty comfortable with, it was useful to see common software development practices within the context of data engineering.
We started by reviewing the typical ETL lifecycle, with a focus on breaking the process down into assets, connectors, and pipelines. The extraction step was modeled using pandas to read CSVs, as well as the Alpaca and OpenWeather APIs to work with external data sources. It’d been a while since I’d used pandas or Jupyter notebooks; not only was it a good refresher, but it was also my first time running an .ipynb file directly within VS Code - super convenient. For the transformation step, we kept it pretty simple - merging dataframes, renaming columns, creating calculated columns, and filtering down to the columns we needed before writing to a parquet file to create snapshots of our data.
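To make that concrete, here’s a minimal sketch of that extract-and-transform flow. The file names and columns are made up for illustration, not the exact datasets from the course:

```python
import pandas as pd

# Extraction: read two related CSVs (hypothetical files standing in for the
# Alpaca and OpenWeather data)
prices = pd.read_csv("stock_prices.csv")    # columns: symbol, date, close
weather = pd.read_csv("daily_weather.csv")  # columns: date, temperature_c

# Transformation: merge, rename, add a calculated column, keep only what we need
df = prices.merge(weather, on="date", how="inner")
df = df.rename(columns={"close": "closing_price"})
df["temperature_f"] = df["temperature_c"] * 9 / 5 + 32
df = df[["symbol", "date", "closing_price", "temperature_f"]]

# Snapshot the transformed data to parquet
df.to_parquet("snapshot.parquet", index=False)
```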
For the loading step, we used SQLAlchemy to connect to a local Postgres database; we discussed the benefits of composite keys and when to use them, as well as common methods like on_conflict_do_update
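Here’s roughly what that load looks like with SQLAlchemy’s Postgres dialect, assuming a toy stock_prices table with a composite primary key on (symbol, date). The connection string and schema are placeholders, not the course’s exact setup:

```python
from sqlalchemy import create_engine, MetaData, Table, Column, String, Date, Numeric
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/etl_demo")
metadata = MetaData()

# Composite primary key: one row per (symbol, date)
stock_prices = Table(
    "stock_prices", metadata,
    Column("symbol", String, primary_key=True),
    Column("date", Date, primary_key=True),
    Column("closing_price", Numeric),
)
metadata.create_all(engine)

# Rows come from the dataframe produced in the transform step;
# only columns that exist in the table are passed along
rows = df[["symbol", "date", "closing_price"]].to_dict(orient="records")

# Upsert: insert new rows, update existing ones when the composite key conflicts
stmt = insert(stock_prices).values(rows)
stmt = stmt.on_conflict_do_update(
    index_elements=["symbol", "date"],
    set_={"closing_price": stmt.excluded.closing_price},
)
with engine.begin() as conn:
    conn.execute(stmt)
```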
to write robust upsert queries. One useful takeaway: thinking about pandas dataframes as intermediary data stores. The database table(s) are the eventual source of truth, and the dataframes are there to facilitate the loading post-extraction/transformation. At one point, I got an “Unconsumed column names” error from SQLAlchemy due to a mismatch between a dataframe column name and a Postgres table column name. Renaming the column in place fixed the issue, and it reinforced the aforementioned concept.
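In code, the fix was a one-liner against the dataframe (column names here are the hypothetical ones from the earlier sketch):

```python
# SQLAlchemy raises an "Unconsumed column names" error when the dataframe has a
# column the target table doesn't recognize (e.g. "close" vs "closing_price");
# renaming it in place so it matches the table resolves the error
df.rename(columns={"close": "closing_price"}, inplace=True)
```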
One thing I’ve appreciated about the class is how concretely we apply the concepts we’re learning (or reviewing, at least so far). For example, environment variables were introduced pretty early on as a way to store the Alpaca API’s key and secret. Additionally, we used Python classes to model our own API client (as opposed to using some contrived “real-world” object as our class example). Lastly, unit tests and integration tests were introduced specifically to help with data quality and data modeling. I’m looking forward to learning more about how testing patterns differ in the world of data engineering compared to software development.
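As a rough sketch of how those first two points fit together, here’s the shape of an API client class that reads its credentials from environment variables. The env var names and the wrapper itself are my own illustration rather than the exact class we built in the course, with the endpoint and header names following Alpaca’s public docs as I recall them:

```python
import os
import requests


class AlpacaClient:
    """Minimal wrapper around Alpaca's market data API."""

    BASE_URL = "https://data.alpaca.markets/v2"

    def __init__(self, api_key=None, api_secret=None):
        # Credentials come from environment variables instead of being hard-coded
        self.api_key = api_key or os.environ["ALPACA_API_KEY"]
        self.api_secret = api_secret or os.environ["ALPACA_API_SECRET"]

    def _headers(self):
        return {
            "APCA-API-KEY-ID": self.api_key,
            "APCA-API-SECRET-KEY": self.api_secret,
        }

    def get_bars(self, symbol, timeframe="1Day"):
        # Fetch bars for a symbol; raises for non-2xx responses
        response = requests.get(
            f"{self.BASE_URL}/stocks/{symbol}/bars",
            headers=self._headers(),
            params={"timeframe": timeframe},
        )
        response.raise_for_status()
        return response.json()
```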