Data Exploration and

Working with Databases

Section Agenda

  • Ad-hoc data exploration
  • Writing production data to a database
  • Automating the pipeline

Project overview

What is this Data?

Our first introduction to pointblank


pointblank provides data quality assessment and metadata reporting for data frames and database tables. https://github.com/rstudio/pointblank

🧰 The pointblank::scan_data() function provides a HTML report of the input data to help you understand your data.

Sample data scan

pointblank::scan_data(palmerpenguins::penguins)



Activity Time!

Activity

👉 Open the file materials/activities/activity-01_raw_data_exploration.qmd

Activity objective: explore our Chicago Food Inspections data to get familiar with our data

Where should I put the data?

📣 Production data belongs in a database.

Database Connection Essentials

  • Use DBI to make a connection with DBI::dbConnect()
  • The first argument to DBI is your DBI-compliant R package
    1. Best case: one of the DBI-compliant db-specific packages
    2. Alternative: odbc::odbc() + an ODBC driver installed on your system
  • 📣 When possible, use a database-specific R package (e.g., RPostgres, RMariaDB, RSQLite, bigrquery, etc.) instead of odbc + a driver. In many cases, they are more performant (especially in writing data) and may have more translations available for query types.

Interacting with Databases

  • What tables are in a database? DBI::dbListTables(con)

  • Use dplyr to interact with the database table in the same manner you would a local data frame

    df <- dplyr::tbl(con, "my_table")
    df |> filter(...) |> mutate(...) |> group_by(...) |> summarise(...)
  • 📣 Do as much work as possible in the database to save time and resources before bringing the table into local memory.

  • Use dplyr::collect() to bring the table into memory. Try to use collect as late as possible in your queries / transformations

    df |> ... |> collect()

How do I automate this process?

🧰 Deploy and schedule your ETL and reports on Posit Connect


Activity Time!

Activity

👉 Open the file materials/activities/activity-02_publish_and_schedule_data_pull.Rmd

Activity objective: Write production data to database, then deploy and schedule this work on Posit Connect so it runs automatically.