New York City drinking water is world-renowned for its quality. Each day, more than 1 billion gallons of fresh clean water is delivered from large upstate reservoirs. [1] But how safe is New York City’s drinking water? I wanted to investigate more on this matter because in recent years, there was the water crisis in Flint, Michigan, where lead leached from the lead water pipes into the drinking water. Since New York City has a population of 8.6 million people, compared to Flint, Michigan’s 100,000 people [2], a significant number of people would be affected if New York City had unsafe levels of lead and copper in the drinking water.
To determine how safe the water is, I’ve found data from NYC Open Data on data testing lead and copper from tap samples by the Department of Environmental Protection (DEP). This data is updated annually and this dataset has been recently updated on February 4, 2019. The Lead and Copper Rule defines the acceptable limits (action levels) for lead and copper in drinking water. The action level for copper is 1.3 milligrams per liter (mg/L). The action level for lead is 0.015 mg/L. [3]
If there are some homes in New York City that exceeds these acceptable limits, would the resident be able to replace it? I want to know if the socioeconomics of these locations have an influence on the lead levels because some homes might have older piping than other homes but people living in these locations cannot afford to replace it and put their own health at risk. To determine if median household income have an influence on the lead levels, I’ve gathered data from the US Census Bureau of the zip codes in New York City and their median income.
library(tidyverse)
library(leaflet)
library(plotly)
nyc_data <- read_csv("./Data/Free_Residential_at-the-tap_Lead_and_Copper_Data.csv") #lead and copper data
econ_data <- read_csv("./Data/NY_Census_Estimates.csv", skip = 1) #economic data of each zipcode
zipcode <- read_csv("./Data/Zip_Centroid.csv") #centroid of zipcode
econ_clean <- econ_data %>%
select(`GEO.display-label`, HC01_VC114) %>%
separate('GEO.display-label', into = c("zipletters", "zipcode"), sep = " ")
#separate the letters and zipcode to just grab the zipcode with the median income
econ_clean$HC01_VC114 <- (gsub(",","",econ_clean$HC01_VC114)) #remove the comma from median income
econ_clean <- econ_clean %>%
mutate(HC01_VC114 = if_else(HC01_VC114 == "250000+", "250000", HC01_VC114)) %>%
na_if("-")
#create a new row which removes the + sign from 250000+
options(scipen=999) #removes scientific numbering in R
nyc_mean <- nyc_data %>%
group_by(Zipcode) %>%
summarize(Lead_First_Draw= mean(`Lead First_Draw (mg/L)`, na.rm = TRUE),
Copper_First_Draw= mean(`Copper First Draw (mg/L)`, na.rm = TRUE),
Lead_2Min = mean(`Lead 1-2 Minute Flush (mg/L)`, na.rm = TRUE),
Copper_2Min = mean(`Copper 1-2 Minute Flush (mg/L)`, na.rm = TRUE),
Lead_5Min = mean(`Lead 5 Minute Flush (mg/L)`, na.rm = TRUE),
Copper_5Min = mean(`Copper 5 minute Flush (mg/L)`, na.rm = TRUE))
#create summary table of copper and lead
combined_data <- merge.data.frame(nyc_mean, econ_clean, by.x = "Zipcode", by.y = "zipcode", all.x = TRUE)
#merging data of average lead/copper and the median income
combined_data <- combined_data %>%
select(-("zipletters")) %>%
rename(Median_Income = HC01_VC114)
#rename the column to be called Median_Income
nyc_data$Zipcode <- as.character(nyc_data$Zipcode) #change the zipcode to a character
combined_data <- merge.data.frame(zipcode, combined_data, by.x = "ZIP", by.y = "Zipcode")
combined_data$Median_Income <- as.numeric(combined_data$Median_Income)
#merging data to get long and lat for each zipcode
ggplot(combined_data, aes(x = Lead_First_Draw, y = Copper_First_Draw)) +
geom_point(alpha = 0.4) +
geom_vline(xintercept = .015, color = "red") + #graphs vertical line of the EPA rating for lead
geom_hline(yintercept = 1.3, color = "red") + #graphs horizontal line of the EPA rating for copper
labs(title = "Mean of Copper vs Lead First Draw Flush",
x = "First Draw Lead (mg/L)",
y = "First Draw Copper (mg/L)",
caption = "Data source: NYC Open Data") +
theme_classic()
CopperLead_One <- ggplot(nyc_data) +
geom_point((aes(x = `Lead First_Draw (mg/L)`, y = `Copper First Draw (mg/L)`)), alpha = 0.5, size = 2.5) +
geom_vline(xintercept = .015, color = "red") +
geom_hline(yintercept = 1.3, color = "red") +
labs(title = "Copper vs Lead First Draw",
x = "Lead (mg/L)",
y = "Copper (mg/L)",
caption = "Data source: NYC Open Data") +
scale_y_log10() +
scale_x_log10() +
theme_classic()
CopperLead_One
LeadOne_Two <- ggplot(nyc_data) +
geom_point((aes(x = `Lead 1-2 Minute Flush (mg/L)`,
y = `Lead First_Draw (mg/L)`)),
alpha = 0.5, size = 2.5) +
geom_vline(xintercept = .015, color = "red") +
geom_hline(yintercept = .015, color = "red") +
labs(title = "Lead First Draw vs 1-2 Minute Flush",
x = "1-2 Minute Lead (mg/L)",
y = "First Draw Lead (mg/L)",
caption = "Data source: NYC Open Data") +
xlim(0,20) + #the limit for x axis is 20
ylim(0,20) + #the limit for y axis is 20
geom_abline(slope = 1, intercept = 0) + #create a line with slope = 1 and intercept = 0
scale_x_log10() +
scale_y_log10() +
coord_equal() +
theme_classic()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
LeadOne_Two
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
lead_tidy<- gather(nyc_data, key = Draw_Type, value = `Amount of Lead (mg/L)`,
`Lead First_Draw (mg/L)`,
`Lead 1-2 Minute Flush (mg/L)`,
`Lead 5 Minute Flush (mg/L)`)
#gather Lead and called it lead tidy
lead_tidy <- lead_tidy %>%
select(`Kit ID`, Zipcode, Draw_Type, `Amount of Lead (mg/L)`)
lead_tidy$Draw_Type <- factor(lead_tidy$Draw_Type, c("Lead First_Draw (mg/L)", "Lead 1-2 Minute Flush (mg/L)","Lead 5 Minute Flush (mg/L)"))
copper_tidy<- gather(nyc_data, key = Draw_Type, value = `Amount of Copper (mg/L)`,
`Copper First Draw (mg/L)`,
`Copper 1-2 Minute Flush (mg/L)`,
`Copper 5 minute Flush (mg/L)`)
copper_tidy <- copper_tidy %>%
select(`Kit ID` ,Zipcode, Draw_Type, `Amount of Copper (mg/L)`)
copper_tidy$Draw_Type <- factor(copper_tidy$Draw_Type, c("Copper First Draw (mg/L)", "Copper 1-2 Minute Flush (mg/L)","Copper 5 minute Flush (mg/L)"))
lead_boxplot <- ggplot(lead_tidy) +
geom_boxplot((aes(Draw_Type,`Amount of Lead (mg/L)`))) +
scale_y_log10() +
labs(title = "Box Plot of Lead",
caption = "Data source: NYC Open Data")+
theme_classic()
lead_boxplot
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 37121 rows containing non-finite values (stat_boxplot).
lead_tidy %>% filter(Draw_Type != "Lead 5 Minute Flush (mg/L)") %>%
ggplot() +
geom_line(aes(x = Draw_Type, y = `Amount of Lead (mg/L)`, group = `Kit ID`), alpha = 0.1) +
scale_y_log10() +
geom_hline(yintercept = .015, color = "red") +
labs(title = "Amount of Lead Over Time",
x = "Draw Type",
y = "Amount of Lead (mg/L)",
caption = "Data source: NYC Open Data") +
theme_classic()
LeadRisk1 <- nyc_data %>%
drop_na(lead_cat1) %>%
summarize(num_samples = n(),
Percent_Risk = mean(lead_cat1 =="risk", na.rm = TRUE) *100, Percent_Safe= mean(lead_cat1 =="safe", na.rm = TRUE) * 100) #breaks down the number of Risk and Safe
LeadRisk1
(sum(nyc_data$lead_cat1 == "risk" & nyc_data$copper_cat1 == "risk", na.rm = T)/sum(nyc_data$lead_cat1 == "risk", na.rm = T)) * 100
## [1] 7.488987
pal <- colorFactor(c("red","green"), domain = combined_data$lead_cat1) #Color Factor, red = risk, green = safe
map <- leaflet(combined_data) %>%
setView(lng = -73.976611, lat = 40.70844487, zoom = 10.1)
map %>% addProviderTiles(providers$CartoDB.Positron) %>%
addCircles(lng = ~LNG, lat = ~LAT, radius = ~sqrt(Median_Income), color = ~pal(combined_data$lead_cat1), label = ~ZIP)
Median Income is Proportional to the Median Color. Red indicates “risk” and Green indicates “safe”
fig_lead_income<- ggplot(combined_data2) +
geom_point((aes(y = Lead_First_Draw , x = Median_Income, color = Borough))) +
scale_y_log10() +
geom_hline(yintercept = .015, color = "red") +
geom_vline(xintercept = 68720, color = "blue") +
labs(title = "Average Lead 1st Draw vs Median Income by Borough ",
x = "Median Income ($)",
y = "Amount of Lead (mg/L)") +
theme(legend.position = "NA")
ggplotly(fig_lead_income)
From the investigation of the dataset, I was able to clean up the dataset and gather the information I needed to perform an analysis. By looking at a scatter plot of the average copper and lead concentration at different time intervals it was able to help me determine which of the elements should be closely observed. In this case, it was lead because there were many more points that fell outside of the EPA guidelines even after it was flushed for 5 minutes.
When looking at all the samples and comparing copper and lead at different intervals, many points fell outside the EPA guidelines as I expected, but many of the points were also inside the limits. As the water is flushed, there was a decrease of samples that were outside the range for both the lead and copper. Diving deeper into the analysis, I wanted to compare the First Draw vs. 1-2 Minutes and 5 Minutes for both copper and copper to see if there were any new information that I can learn. By using the geom_abline() function, I was able to see that for copper 1-2 minutes/5 minutes and lead five minutes, the concentration levels were lower than at first draw.
Since I was not able to tell whether or not lead first draw vs. lead 1-2 minutes made a difference, I created the box plots for lead and copper. The 50 percentile of the lead first draw is lower than lead 1-2 minute and 5 minutes flush because there were many outliers for the lead first draw. However, there was a decrease in the average copper concentration between each flush. Looking at the line graphs and the amount of lead and copper over time between the first draw and lead 1-2 minute flush, it was interesting to see whether or not the slope went up or down. There was a plenty amount of lead still outside the limit, but there were many negative slopes, which meant that many first draws were reaching to a lower concentration level during the 1-2 minute flush even though it didn’t fall below the EPA limits.
Summary Tables help showed the number of samples and the percentage of homes was at risk for lead and copper for different intervals. Copper was only a risk, exceed EPA limit, at first draw for about 0.38% of all homes, and after the water was flushed, 100% of the households were safe. However, lead had about 3% of all the samples that were at risk. If this data was taken from every home in NYC and similar results were to appear, this could potentially affect more than a quarter million people. When the water is flushed, the percentage of the sample that exceeds the EPA limit went down. Of all the homes that are at risk of lead, I wanted to determine how many of those samples also had a risk of copper and the percentage was 7.4% or about 34 homes of all samples.
The map shows the different centroid of each zip code in New York City, where the size of the circle represents the median household income of that zip code and the red is the average concentration of lead in that zip code that fails to meet the EPA limits. Most of Manhattan was considered safe except for the zip code 10003, where the median income is $183,657. There are also a few neighborhoods in the Brooklyn and Queens area where the concentration of lead is above EPA limits.
Two zip codes which were closely next to each other, 11385 and 11379, both exceeded these limits. I know these two areas are mostly single and multi-family residential homes. With further research, I was able to find out that homes in zip codes 11385 and 11379, had median year built of 1939 (or earlier) and 1950, respectively [4]. Interesting enough, for zip code 10003 the median year built for homes built was 1939 (or earlier). Lead pipes were used in the early 1900s, having a life expectancy of 100 years [5]. Since the homes in these zip codes are much older and probably used lead pipes, the average lead first draw concentration level is higher. These zip code could potentially experience a higher level of first draws when it reaches life expectancy.
Looking at the lead at first draw to median income by borough, many of points are below the EPA limit, which can be seen in the interactive map. However, nine points are outside of that limit. I used a blue line to represent what was considered “low-income” for a family of four in New York City and it is families who make below $68,720 a year [6]. Even if there was a lead problem, there are four zip codes that are considered “low-income” and they may not be able to afford changing their old lead pipes and would put their health at risk.
New York City drinking water is still very high-quality drinking water. Most of these places are not at risk for lead contamination. In most places in NYC, you can drink the water when you first turn on the tap. However, for areas that have older homes, it is best to flush the water for 1-2 minutes before having that refreshing cold water that runs out of the tap.
[1] https://www1.nyc.gov/html/dep/html/drinking_water/index.shtml
[2] https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork/PST045217
[3] https://www.des.nh.gov/organization/divisions/water/dwgb/lead-copper/index.htm
[4] https://www.point2homes.com/US/Neighborhood/NY/New-York-City/11379-Demographics.html
[5] https://www.houselogic.com/organize-maintain/home-maintenance-tips/types-plumbing-pipes-and-their-lifespans/
[6] http://www.nyc.gov/html/mancb10/downloads/pdf/testimonials/cb10-testimony-on-area-median-income-calculation.pdf