Creating a Fact Table that Intersects with Multiple Dimensions Using R and/or SQL


Introduction

This article shows how to build a fact table that references several dimensions, first in SQL and then in R. The fact rows are derived from two source files, Audiences and Spectators.

Dimensions and Files

To understand the problem better, let’s first describe the dimensions and files:

4 Dimensions

*   **Dimension Spectators**: Contains information about spectators, including ID, Spectator Code, Region, Genre, and Age Class.
    ```markdown
    ID | Spectator Code | Region | Genre      | Age Class
    ---|----------------|--------|------------|----------
    1  | S001           | North  | Football   | U18
    2  | S002           | South  | Basketball | U20
    3  | S003           | East   | Hockey     | U19
    ...
    2000 entries
    ```
*   **Dimension Hour**: Contains information about hours, including ID, Hour, Minute, Complete Hour (HH:MM:SS), and Period of the day.
    ```markdown
    ID | Hour | Minute | Complete Hour | Period of the day
    ---|------|--------|---------------|------------------
    1  | 8    | 30     | 08:30:00      | Morning
    2  | 12   | 45     | 12:45:00      | Lunch
    3  | 17   | 15     | 17:15:00      | Afternoon
    ...
    1440 entries
    ```
*   **Dimension Date**: Contains information about dates, including ID, Year, Month, Day, Complete Date (YYYY-MM-DD), and Day of the week.
    ```markdown
    ID | Year | Month | Day | Complete Date | Day of the week
    ---|------|-------|-----|---------------|----------------
    1  | 2022 | 01    | 01  | 2022-01-01    | Saturday
    2  | 2022 | 02    | 15  | 2022-02-15    | Tuesday
    3  | 2022 | 03    | 30  | 2022-03-30    | Wednesday
    ...
    365 entries
    ```
*   **Dimension Programs**: Contains information about programs, including ID, Station, Name of Program, Start hour of the program (HH:MM:SS), Duration (seconds), and Complete Date (YYYY-MM-DD).
    ```markdown
    ID | Station  | Name of Program | Start hour | Duration | Complete Date
    ---|----------|-----------------|------------|----------|--------------
    1  | Channel1 | Football Match  | 14:00:00   | 90       | 2022-01-01
    2  | Channel2 | Basketball Game | 18:45:00   | 120      | 2022-01-02
    3  | Channel3 | Hockey Match    | 20:15:00   | 90       | 2022-01-03
    ...
    60000 entries
    ```
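The Hour and Date dimensions are fully determined by the calendar, so they can be generated rather than typed in. The following is a minimal sketch in base R, not a definitive implementation. The data frame names `dim_hour` and `dim_date`, the column names, the year 2022, and the period boundaries are all assumptions; the boundaries were chosen only so that they agree with the three sample rows above. IDs are simply assigned in chronological order.

```r
# Hour dimension: one row per minute of the day (24 * 60 = 1440 rows).
minute_of_day <- 0:1439
dim_hour <- data.frame(
  ID           = minute_of_day + 1L,
  Hour         = minute_of_day %/% 60L,
  Minute       = minute_of_day %% 60L,
  CompleteHour = sprintf("%02d:%02d:00", minute_of_day %/% 60L, minute_of_day %% 60L),
  stringsAsFactors = FALSE
)

# Period of the day. The cut points are an assumption:
# 00-05 Night, 06-11 Morning, 12-13 Lunch, 14-17 Afternoon, 18-23 Evening.
dim_hour$Period <- as.character(cut(
  dim_hour$Hour,
  breaks = c(-1, 5, 11, 13, 17, 23),
  labels = c("Night", "Morning", "Lunch", "Afternoon", "Evening")
))

# Date dimension: one row per calendar day of 2022 (365 rows).
days <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
dim_date <- data.frame(
  ID           = seq_along(days),
  Year         = as.integer(format(days, "%Y")),
  Month        = as.integer(format(days, "%m")),
  Day          = as.integer(format(days, "%d")),
  CompleteDate = format(days, "%Y-%m-%d"),
  DayOfWeek    = weekdays(days),  # day names follow the session locale
  stringsAsFactors = FALSE
)
```

Because the rows are generated in order, the key for a given minute can also be computed directly as `Hour * 60 + Minute + 1`, which avoids a lookup when the dimension is built this way.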

2 Files

*   **Audiences**: Contains information about audiences, including ID, Complete Date (YYYY-MM-DD), Station, Duration of visualization (minutes), Start of visualization (HH:MM:SS), and End of visualization (HH:MM:SS).
    ```markdown
    ID | Complete Date | Station  | Duration of visualization | Start of visualization | End of visualization
    ---|---------------|----------|---------------------------|------------------------|---------------------
    1  | 2022-01-01    | Channel1 | 60                        | 14:00:00               | 15:00:00
    2  | 2022-01-02    | Channel2 | 120                       | 18:45:00               | 20:45:00
    3  | 2022-01-03    | Channel3 | 90                        | 20:15:00               | 21:45:00
    ...
    ```
*   **Spectators**: Contains information about spectators, including ID, Spectator Code, and Age Class.
    ```markdown
    ID | Spectator Code | Age Class
    ---|----------------|----------
    1  | S001           | U18
    2  | S002           | U20
    3  | S003           | U19
    ...
    ```

Creating the Fact Table

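To create a fact table that intersects with all dimensions, we can use SQL. Each fact row describes one viewing from the Audiences file and carries a key into each of the four dimensions, with the viewing duration as the measure.

We will assume that the Audiences file and the four dimensions are already loaded as tables in a relational database. The statements below use MySQL syntax. They also assume that the tables are named Audiences, Spectators, Dates, Hours, and Programs, that their columns are named after the fields listed above (for example CompleteDate and StartOfVisualization), and that Audiences.ID holds the spectator's ID, which is what the join to Spectators relies on.

```sql
CREATE TABLE FactTable (
  ID INT AUTO_INCREMENT PRIMARY KEY,
  SpectatorID INT NOT NULL,
  DateID INT NOT NULL,
  HourID INT NOT NULL,
  ProgramID INT,            -- NULL when no program matches the viewing
  Duration INT NOT NULL,    -- duration of visualization, in minutes
  FOREIGN KEY (SpectatorID) REFERENCES Spectators(ID),
  FOREIGN KEY (DateID) REFERENCES Dates(ID),
  FOREIGN KEY (HourID) REFERENCES Hours(ID),
  FOREIGN KEY (ProgramID) REFERENCES Programs(ID)
);
```

To insert data into the fact table, we join Audiences to each dimension and select the matching keys. A viewing is attributed to a program when the station and date agree and the viewing starts inside the program's broadcast window. The LEFT JOIN keeps viewings for which no program is found.

```sql
INSERT INTO FactTable (SpectatorID, DateID, HourID, ProgramID, Duration)
SELECT 
  s.ID AS SpectatorID,
  d.ID AS DateID,
  h.ID AS HourID,
  p.ID AS ProgramID,
  a.DurationOfVisualization
FROM 
  Audiences a
JOIN 
  Spectators s ON s.ID = a.ID
JOIN 
  Dates d ON d.CompleteDate = a.CompleteDate
JOIN 
  Hours h ON h.Hour = HOUR(a.StartOfVisualization)
         AND h.Minute = MINUTE(a.StartOfVisualization)
LEFT JOIN 
  Programs p ON p.Station = a.Station
            AND p.CompleteDate = a.CompleteDate
            AND a.StartOfVisualization >= p.StartHour
            -- Programs.Duration is in seconds
            AND a.StartOfVisualization < ADDTIME(p.StartHour, SEC_TO_TIME(p.Duration))
WHERE 
  -- Optional filter on the viewing start; drop the WHERE clause to load every viewing
  a.StartOfVisualization BETWEEN '14:00:00' AND '20:15:00';
```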

Using R to Create the Fact Table

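If we don’t want to use SQL, we can create the fact table in R using the dplyr library. The code below assumes that the CSV columns are named as in the SQL example, and that the dimension tables are available as CSV files; the file names Dates.csv, Hours.csv, and Programs.csv are placeholders.

```r
library(dplyr)  # join_by() with inequality conditions needs dplyr >= 1.1.0

# Load the two source files and the dimension tables into R data frames
audiences  <- read.csv("Audiences.csv")
spectators <- read.csv("Spectators.csv")
dates      <- read.csv("Dates.csv")
hours      <- read.csv("Hours.csv")
programs   <- read.csv("Programs.csv")

# Convert zero-padded "HH:MM:SS" strings to seconds since midnight
to_seconds <- function(x) {
  as.integer(substr(x, 1, 2)) * 3600L +
    as.integer(substr(x, 4, 5)) * 60L +
    as.integer(substr(x, 7, 8))
}

# Broadcast window of each program, in seconds since midnight
program_windows <- programs %>%
  mutate(ProgStart = to_seconds(StartHour), ProgEnd = ProgStart + Duration) %>%
  select(ProgramID = ID, Station, CompleteDate, ProgStart, ProgEnd)

# Look up one key per dimension for every viewing
FactTable <- audiences %>%
  semi_join(spectators, by = "ID") %>%   # keep viewings of known spectators
  mutate(StartSec = to_seconds(StartOfVisualization),
         Hour = StartSec %/% 3600L,
         Minute = (StartSec %% 3600L) %/% 60L) %>%
  inner_join(select(dates, DateID = ID, CompleteDate), by = "CompleteDate") %>%
  inner_join(select(hours, HourID = ID, Hour, Minute), by = c("Hour", "Minute")) %>%
  left_join(program_windows,
            by = join_by(Station, CompleteDate, StartSec >= ProgStart, StartSec < ProgEnd)) %>%
  select(SpectatorID = ID, DateID, HourID, ProgramID, Duration = DurationOfVisualization)

# Print FactTable
print(FactTable)
```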
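The program lookup is the least obvious step when building the fact table: a viewing belongs to a program when station and date agree and the viewing starts inside the program's broadcast window. The sketch below works through that rule in base R on the sample rows from this article, so it runs without any CSV files. The column names and the helper `to_seconds` are assumptions, and the viewing with ID 4 is a hypothetical row added only to show a viewing that matches no program.

```r
# Sample Audiences rows, plus one hypothetical row (ID 4) added for illustration
audiences <- data.frame(
  ID = 1:4,
  CompleteDate = c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-01"),
  Station = c("Channel1", "Channel2", "Channel3", "Channel1"),
  StartOfVisualization = c("14:00:00", "18:45:00", "20:15:00", "16:00:00"),
  stringsAsFactors = FALSE
)

# Sample Programs rows; Duration is in seconds, as stated in the dimension description
programs <- data.frame(
  ProgramID = 1:3,
  Station = c("Channel1", "Channel2", "Channel3"),
  StartHour = c("14:00:00", "18:45:00", "20:15:00"),
  Duration = c(90, 120, 90),
  CompleteDate = c("2022-01-01", "2022-01-02", "2022-01-03"),
  stringsAsFactors = FALSE
)

# Convert zero-padded "HH:MM:SS" strings to seconds since midnight
to_seconds <- function(x) {
  as.integer(substr(x, 1, 2)) * 3600L +
    as.integer(substr(x, 4, 5)) * 60L +
    as.integer(substr(x, 7, 8))
}

# Pair each viewing with every program on the same station and date
candidates <- merge(audiences, programs, by = c("Station", "CompleteDate"))

# Keep the pairs where the viewing starts inside the broadcast window
view_start <- to_seconds(candidates$StartOfVisualization)
prog_start <- to_seconds(candidates$StartHour)
in_window  <- view_start >= prog_start & view_start < prog_start + candidates$Duration
matched    <- candidates[in_window, c("ID", "ProgramID")]

# Join back so that viewings without a program keep ProgramID = NA
result <- merge(audiences, matched, by = "ID", all.x = TRUE)
print(result[, c("ID", "Station", "StartOfVisualization", "ProgramID")])
```

Viewings 1 to 3 start exactly when their programs start, so they receive ProgramID 1 to 3. Viewing 4 starts at 16:00:00, well after the 90-second window of program 1 has closed, so its ProgramID stays NA.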

Using R to Insert Data into FactTable

To insert data into the fact table using R, we can use the DBI library together with a backend driver. The example assumes a MySQL server, as the port 3306 suggests, and uses the RMariaDB package as the driver; other databases need their own DBI backend.

```r
library(DBI)

# `FactTable` is the data frame built in the previous section
fact_table_data <- FactTable

# Create connection to the database (connection details are placeholders)
conn <- dbConnect(
  RMariaDB::MariaDB(),
  host = "localhost",
  port = 3306,
  dbname = "mydatabase",
  username = "myuser",
  password = "mypassword"
)

# Append the rows to the existing FactTable.
# The data frame's column names must match columns of the database table.
dbWriteTable(conn, "FactTable", fact_table_data, append = TRUE)

# Close the connection to the database
dbDisconnect(conn)
```
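After a load it is worth checking that the rows arrived and that no fact row points at a missing dimension row. The sketch below illustrates such a check against an in-memory SQLite database, so it can be run without a server. It assumes the RSQLite package is installed, and it uses a deliberately reduced schema (only the Programs key and the duration) with made-up table contents, so it is an illustration of the check rather than the full fact table.

```r
library(DBI)

# In-memory SQLite database, used only so the example is self-contained
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Reduced, hypothetical schema: one dimension and a fact table that references it
dbExecute(conn, "CREATE TABLE Programs (ID INTEGER PRIMARY KEY, Station TEXT)")
dbExecute(conn, "CREATE TABLE FactTable (
  ID INTEGER PRIMARY KEY,
  ProgramID INTEGER REFERENCES Programs(ID),
  Duration INTEGER NOT NULL
)")

# Load the dimension first, then the fact rows
programs <- data.frame(
  ID = 1:3,
  Station = c("Channel1", "Channel2", "Channel3"),
  stringsAsFactors = FALSE
)
dbWriteTable(conn, "Programs", programs, append = TRUE)

fact_rows <- data.frame(
  ID = 1:3,
  ProgramID = 1:3,
  Duration = c(60L, 120L, 90L)
)
dbWriteTable(conn, "FactTable", fact_rows, append = TRUE)

# Check 1: number of rows written
n_rows <- dbGetQuery(conn, "SELECT COUNT(*) AS n FROM FactTable")$n

# Check 2: fact rows whose ProgramID has no match in Programs
orphans <- dbGetQuery(conn, "
  SELECT f.ID
  FROM FactTable f
  LEFT JOIN Programs p ON p.ID = f.ProgramID
  WHERE f.ProgramID IS NOT NULL AND p.ID IS NULL
")

dbDisconnect(conn)
```

The same two queries can be run over a connection to the real database, repeating the orphan check once per dimension key.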

Conclusion

In this article, we have explored how to create a fact table that intersects with multiple dimensions using both R and SQL. We used SQL to create the fact table and populate it from the Audiences and Spectators data, dplyr to build the same table from data frames in R, and DBI to write the result to the database.


Last modified on 2024-12-19