The web is an ocean where data scientists can gather lots of useful and interesting data. However, this vastness usually means that data comes in a rather messy format, and requires significant cleaning and wrangling before it can be used in an inferential study.

In this tutorial, I will walk the reader through the steps of obtaining, cleaning and visualizing data scraped from the web using R. As an example, I will consider online food blogs and illustrate how one can get insights into recipes, ingredients and visitors' food preferences. The tutorial will also illustrate data wrangling with the tidyverse and natural language processing with the tidytext package in R. These packages offer an excellent set of tools, which have contributed to making R one of the go-to languages for data scientists.

Web Scraping


First, we need to obtain the data from the blog posts. For this tutorial, I have chosen to scrape data from two sites:

1. Pinch of Yum
2. The Full Helping

These are excellent food blogs with lots of great recipes and nice photographs. Let us consider the first blog (Pinch of Yum) as an example, since it has more recipe entries. There are 51 pages (at the time when this tutorial was written) of recipes, each containing 15 recipe links. The first task is to collect all the links to these recipes, which we can do with the following code snippet:

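library(rvest)    # read_html, html_nodes
library(stringr)  # str_detect
library(purrr)    # map_chr

get_recipe_links <- function(page_number){
    # The base URL of the recipe index goes in the empty string (not given here)
    page <- read_html(paste0("", page_number))
    # Serialize the <a> nodes to text for the string operations below
    links <- as.character(html_nodes(page, "a"))

    # Get locations of recipe links
    loc <- which(str_detect(links, "<a class"))
    links <- links[loc]

    # Trim the text to get proper links
    all_recipe_links <- map_chr(links, trim_)

    # Return
    all_recipe_links
}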

Given a page number (1 to 51), the function get_recipe_links first reads the given page and then stores the links to all the recipes on it. On every page, the links to recipes are found within <a class="block-link" href="/ ... ">, so we first get the nodes associated with "a" using the html_nodes function of the rvest package. Then, using str_detect, we find the positions of the recipe links among those nodes. The map_chr function applies trim_ to each of these links and returns clean links, free of unwanted characters such as quotation marks and angle brackets. trim_ is a user-defined function that looks like this:

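trim_ <- function(link){
    # Take the third space-separated token, which holds the href attribute
    temp1 <- str_split(link, " ")[[1]][3] %>%
        str_replace_all("\"", "") %>% # Remove double quotes
        str_replace("href=", "") %>%
        str_replace(">", " ")

    # Return everything before the first space
    str_split(temp1, " ")[[1]][1]
}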
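To see what trim_ does step by step, here is a minimal sketch that runs it on a single made-up anchor tag. The tag in sample_link is hypothetical and only mimics the block-link pattern described above; the real markup may contain more attributes.

```r
library(stringr)

# Same logic as trim_ above, repeated so the example runs on its own
trim_ <- function(link){
    temp1 <- str_split(link, " ")[[1]][3] %>%
        str_replace_all("\"", "") %>%
        str_replace("href=", "") %>%
        str_replace(">", " ")
    str_split(temp1, " ")[[1]][1]
}

# Hypothetical anchor tag, shaped like the block-link pattern
sample_link <- "<a class=\"block-link\" href=\"/example-recipe\">Example</a>"

trimmed <- trim_(sample_link)
print(trimmed)
```

Note that this approach relies on the href attribute being the third space-separated token of the tag, so it would need adjusting if the site changed the order of its attributes.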

To see how I came up with these, the reader should open one of the pages and look at the html source (which can be done with any browser's source view tool). The source code can be messy, but locating relevant pieces of information through repeating patterns becomes straightforward after looking through a few of the pages that are being scraped. The code for scraping all the recipe links from The Full Helping is almost identical, with a few small twists.
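Once the repeating pattern is known, an alternative to string trimming is to let rvest pull the attribute out directly. The sketch below is not the approach used in this tutorial; it swaps in a CSS class selector and html_attr, and it runs on a hypothetical inline HTML fragment rather than on the real site.

```r
library(rvest)

# Hypothetical page fragment; the real markup may differ
html <- '<html><body>
  <a class="block-link" href="/example-recipe-one">One</a>
  <a class="block-link" href="/example-recipe-two">Two</a>
  <a class="nav" href="/about">About</a>
</body></html>'

page <- read_html(html)

# The CSS selector "a.block-link" keeps only anchors with that class,
# and html_attr extracts the value of href from each of them
recipe_links <- page %>%
    html_nodes("a.block-link") %>%
    html_attr("href")

print(recipe_links)
```

Selecting on the class and reading the attribute avoids any dependence on the order of attributes inside the tag.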

Now that we have all the links, the next step is to connect to each link one by one and gather the data from each recipe. This step is more tedious than the previous one, since every site stores its recipe data in a different format. For example, Pinch of Yum uses JSON format for each recipe, which is excellent, since the data is pretty much standard across all the pages. In contrast, The Full Helping has the recipe information in html, so it requires a bit more work to collect.
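The link-by-link loop can be sketched with purrr and dplyr. In the sketch below, fetch_recipe is a hypothetical stand-in for a real per-recipe scraper: it does not touch the network and simply returns a one-row data frame, or NULL when a link holds no recipe. The links are made up as well.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the real per-recipe scraper
fetch_recipe <- function(link){
    if (!grepl("recipe", link)) {
        return(NULL)  # Not a recipe page
    }
    data.frame(link = link, stringsAsFactors = FALSE)
}

# Made-up links; the second one is not a recipe
links <- c("/example-recipe-one", "/about", "/example-recipe-two")

all_recipes <- links %>%
    map(fetch_recipe) %>%  # One result per link
    compact() %>%          # Drop the NULLs
    bind_rows()            # Stack the one-row data frames

print(all_recipes)
```

When running this against a live site, it is considerate to pause between requests, for example with Sys.sleep inside the scraper function.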

Let's look at how we collect data from each recipe in Pinch of Yum. The code snippet below illustrates the main pieces of this process.

# Get the recipe page
page <- read_html(link_to_recipe)

# Get recipe publish date/time
meta <- as.character(html_nodes(page, "meta"))
dt <- meta[map_lgl(meta, str_detect, "article:published_time")] %>%
    str_replace_all("\"|/|<|>", " ") %>%
    str_replace_all("meta|property=|article:published_time|content=", "") %>%
    str_trim() %>%
    str_split("T")  # Assumed separator: the "T" of an ISO 8601 timestamp

date <- dt[[1]][1]
time <- dt[[1]][2]

# JSON data
script_ <- html_nodes(page, "script")
loc_json <- which(str_detect(as.character(script_), "application/ld"))
if (length(loc_json) == 0){
    return(NULL) # If the link does not contain a recipe, just return NULL
}

# Parse the JSON data
recipe_data <- fromJSON(html_text(script_[loc_json][[1]]))

The above code snippet first reads the page from a given link_to_recipe, then collects the date and time when the recipe was published, and finally reads the recipe data, which is in JSON format. The date/time information is stored in a "meta" node, and after we get it, we simply clean it with stringr operations like str_replace_all, str_trim and str_split. I recommend that the reader open the source of one of the recipes in a browser, locate the "meta" node with the date/time information and compare it with the code snippet to see how it all works.
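The date/time cleaning can be traced on a single tag. The sketch below applies the same stringr operations to a hypothetical meta tag; the timestamp value is made up, and splitting on "T" assumes the ISO 8601 layout, which is an assumption rather than something confirmed for every page.

```r
library(stringr)

# Hypothetical meta tag with a made-up ISO 8601 timestamp
meta_tag <- "<meta property=\"article:published_time\" content=\"2017-03-05T10:30:00+00:00\">"

dt <- meta_tag %>%
    str_replace_all("\"|/|<|>", " ") %>%   # Quotes and brackets become spaces
    str_replace_all("meta|property=|article:published_time|content=", "") %>%
    str_trim() %>%                         # Leaves only the timestamp
    str_split("T")                         # Assumed date/time separator

date <- dt[[1]][1]
time <- dt[[1]][2]

print(date)
print(time)
```

Each step removes one layer of markup, so only the timestamp is left by the time str_split separates the date from the time.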

Obtaining the JSON data is rather straightforward with the fromJSON function of the jsonlite package. While the JSON data comes rather clean after this step, there are a few nooks and crannies one has to deal with to put the data in a useful format. I will not discuss these steps in this post, but they are included in the repository. In the end, the user function get_recipe_data, which contains the above snippet, returns a data frame containing the recipe information from a given link_to_recipe.
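As a rough illustration of the JSON step, the sketch below parses a hypothetical, heavily shortened payload with fromJSON and collects two fields into a data frame. The field names follow the schema.org Recipe vocabulary; the values are made up, and the columns chosen here are only an example, not the layout returned by get_recipe_data.

```r
library(jsonlite)

# Hypothetical, shortened JSON-LD payload; real recipes carry many more fields
json_text <- '{
  "@type": "Recipe",
  "name": "Example Soup",
  "recipeIngredient": ["1 onion", "2 carrots"]
}'

recipe_data <- fromJSON(json_text)

# Collect a couple of fields into a one-row data frame
recipe_df <- data.frame(
    name = recipe_data$name,
    n_ingredients = length(recipe_data$recipeIngredient),
    stringsAsFactors = FALSE
)

print(recipe_df)
```

fromJSON turns the JSON array of ingredients into a character vector, which is why length gives the number of ingredients directly.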
