How to Read a .csv.gz File in R
Raw data is typically not available as R data files. If your data has been extracted from a database, you will most likely receive it as flat text files, e.g., CSV files. Data recording web activity is frequently stored in a format called JSON. In this section we will look at how to easily read data files into R's memory and how to store them as R databases.
You can download the code and data for this module as an RStudio project here (download this zipped file and extract it - you can then open the RStudio project from RStudio).
Reading Raw Data into R
One of the most popular file formats for exchanging and storing data is the comma-separated values (CSV) file. This is a left-over from the days of spreadsheets and is not a particularly efficient storage format for data, but it is still widely used in businesses and other organizations. In a CSV file the first row contains the variable names. The next row contains the observations for the first record, separated by commas, then the next row holds the second record, and so on.
CSV files are straightforward to read into R. A fast method for doing so is the read_csv function contained in the tidyverse set of libraries, so let's load this library:
#install.packages("tidyverse") ## only if you have not already installed this package
library(tidyverse)
Case Study: Airbnb Data
Let's look at some data from the peer-to-peer house rental service Airbnb. There are different ways to obtain data on Airbnb listings. The most direct way is, of course, by working for Airbnb so you can access their servers directly. Alternatively, you can scrape data off the Airbnb website (as long as it doesn't violate the terms of service of the site) or obtain the data through third parties. Here we will follow the last approach. The website Inside Airbnb provides Airbnb data from most of the major markets where Airbnb operates. Let's focus on the San Diego listings. The data for each market is provided on the Get the Data page of the website. Go there and scroll down to San Diego. There are several files available - let's focus on the "detailed listings" data in the file listings.csv.gz. This is a compressed CSV file. Download this file and unzip it - this will generate a file called listings.csv (note: this has already been done in the associated project and the files can be found in the data folder).
Now that you have downloaded the required file you can read the data into R's memory by using the read_csv command:
SDlistings <- read_csv(file = 'data/listings.csv')
This will create a data frame in R's memory called SDlistings that will contain the data in the CSV file. This code assumes that the downloaded (and uncompressed) CSV file is located in the "data" sub-folder of the current working directory. If the file is located somewhere else that is not a sub-folder of your working directory, you need to provide the full path to the file so R can locate it, for example:
SDlistings <- read_csv(file = 'C:/Mydata/airbnb/san_diego/listings.csv')
Our workflow above was to first download the compressed file, then extract it and then read in the uncompressed file. You actually don't need to uncompress the file first - you can also just read in the compressed file directly:
SDlistings <- read_csv(file = 'M:/data/airbnb/listings.csv.gz')
This is a huge advantage since you can then store the raw data in compressed format.
If you are online you don't even need to download the file first - R can read it off the source website directly:
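To see the compressed-file behaviour without downloading anything, here is a minimal sketch. The toy data frame, its column names and the temporary file name are all made up for illustration; the file is written with base R's gzfile connection and read back with read_csv (from the readr package, which is part of the tidyverse).

```r
library(readr)

# Hypothetical toy data standing in for the Airbnb listings
toy <- data.frame(id = 1:3, price = c(120, 85, 240))

# Write a gzip-compressed CSV through a base R gzfile connection
gz_path <- file.path(tempdir(), "toy_listings.csv.gz")
con <- gzfile(gz_path, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read_csv decompresses files ending in .gz on the fly
toy_back <- read_csv(gz_path)
print(toy_back)
```

The same call works unchanged whether the path points to a plain .csv or a .csv.gz file, so the raw data can stay compressed on disk.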
file.link <- "http://data.insideairbnb.com/united-states/ca/san-diego/2016-07-07/data/listings.csv.gz"
SDlistings <- read_csv(file = file.link)
Once you have read in raw data to R, it is highly recommended to store the original data in R's own compressed format called rds. R can read in these files extremely fast. This way you only need to read in the original data file once. So the recommended workflow would be
## only run this section once! ------------------------------------------------------------------------
file.link <- "http://data.insideairbnb.com/united-states/ca/san-diego/2016-07-07/data/listings.csv.gz"
SDlistings <- read_csv(file = file.link) # read in raw data
saveRDS(SDlistings, file = 'data/listingsSanDiego.rds') # save as rds file
## start here after you have executed the code above once ---------------------------------------------
AirBnbListingsSD <- read_rds('data/listingsSanDiego.rds') # read in rds file
What fields are available in this data? You can see this as
names(AirBnbListingsSD)
## [1] "id" "listing_url"
## [3] "scrape_id" "last_scraped"
## [5] "name" "summary"
## [7] "space" "description"
## [9] "experiences_offered" "neighborhood_overview"
## [11] "notes" "transit"
## [13] "access" "interaction"
## [15] "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url"
## [19] "xl_picture_url" "host_id"
## [21] "host_url" "host_name"
## [23] "host_since" "host_location"
## [25] "host_about" "host_response_time"
## [27] "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url"
## [31] "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count"
## [35] "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street"
## [39] "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city"
## [43] "state" "zipcode"
## [45] "market" "smart_location"
## [47] "country_code" "country"
## [49] "latitude" "longitude"
## [51] "is_location_exact" "property_type"
## [53] "room_type" "accommodates"
## [55] "bathrooms" "bedrooms"
## [57] "beds" "bed_type"
## [59] "amenities" "square_feet"
## [61] "price" "weekly_price"
## [63] "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included"
## [67] "extra_people" "minimum_nights"
## [69] "maximum_nights" "calendar_updated"
## [71] "has_availability" "availability_30"
## [73] "availability_60" "availability_90"
## [75] "availability_365" "calendar_last_scraped"
## [77] "number_of_reviews" "first_review"
## [79] "last_review" "review_scores_rating"
## [81] "review_scores_accuracy" "review_scores_cleanliness"
## [83] "review_scores_checkin" "review_scores_communication"
## [85] "review_scores_location" "review_scores_value"
## [87] "requires_license" "license"
## [89] "jurisdiction_names" "instant_bookable"
## [91] "cancellation_policy" "require_guest_profile_picture"
## [93] "require_guest_phone_verification" "calculated_host_listings_count"
## [95] "reviews_per_month"
Each record is an Airbnb listing in San Diego and the fields tell us what we know about each listing.
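The read-once-then-cache workflow recommended above can also be wrapped in a small helper so the "only run this once" step takes care of itself. This is only a sketch: read_cached, toy_reader and the file name are hypothetical names introduced here, and base R's readRDS is used in place of read_rds (read_rds is a thin wrapper around it).

```r
# Hypothetical helper: parse the raw data only if no cached rds copy exists yet
read_cached <- function(raw_reader, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: load the cached copy
  }
  dat <- raw_reader()           # slow path: read the raw data once
  saveRDS(dat, file = rds_path) # store it for all later sessions
  dat
}

# Toy stand-in for an expensive read such as read_csv(file = file.link)
toy_reader <- function() data.frame(id = 1:3, price = c(120, 85, 240))

rds_path <- file.path(tempdir(), "toy_listings.rds")
if (file.exists(rds_path)) file.remove(rds_path)  # start from a clean slate

first  <- read_cached(toy_reader, rds_path)  # reads the "raw" data and caches it
second <- read_cached(toy_reader, rds_path)  # served from the rds file
```

In the Airbnb example the reader function would be something like function() read_csv(file = file.link), assuming the same file.link as above.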
Reading Excel Files
Excel files can be read using the read_excel command. You can read separate sheets from the Excel file by providing an optional sheet number. The Excel file used below contains data from a 2015 travel decision survey conducted in San Francisco (the file was downloaded from Data.Gov from this link). Let's read in this data and plot the distribution of the total number of monthly trips taken by the survey respondents:
library(readxl) # library for reading excel files
file.name <- "data/TDS_202015_20Data-WEBPAGE.xlsx"
survey.info <- read_excel(file.name, sheet = 1)
survey.data <- read_excel(file.name, sheet = 2)
survey.dict <- read_excel(file.name, sheet = 3)
ggplot(data = survey.data, aes(x = Trips)) +
  geom_histogram() +
  labs(title = 'Distribution of Monthly Number of Trips',
       subtitle = paste0('N=', nrow(survey.data), ' Respondents'))
The first sheet contains some basic information about the survey, the second contains the actual survey response data, while the third sheet contains the data dictionary. The trip distribution is quite skewed with most respondents taking very few trips (note: the ggplot command is the main plot command in the ggplot2 library - an extremely powerful visualization package. You will learn all the details about visualization using ggplot in a later module).
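If you do not know in advance how many sheets a workbook has, it can help to list them first. The sketch below uses the example workbook that ships with readxl (not the survey file from this section) and shows that a sheet can be selected by name as well as by position.

```r
library(readxl)

# Example workbook bundled with the readxl package (not the survey data)
path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to read
sheets <- excel_sheets(path)
print(sheets)

# A sheet can be selected by name instead of by number
mtcars_sheet <- read_excel(path, sheet = "mtcars")
print(dim(mtcars_sheet))
```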
Reading JSON files
JSON is short for JavaScript Object Notation and was originally developed as a format for formatting and storing data generated online. It is especially useful when handling irregular data where the number of fields varies by record. There are different methods to read these files - here we will use the jsonlite package.
Case Study: Food and Drug Administration Data
The Food and Drug Administration (FDA) is a government agency with a number of different responsibilities. Here we will focus on the "F" part, i.e., food. The FDA monitors and records data on food safety including production, retail and consumption. Since the FDA is a government agency - in this case dealing with issues that are not related to national security - it means that YOU as a citizen can access the data the agency collects.
You can find a number of different FDA datasets to download on https://open.fda.gov/downloads/. These are provided as JSON files. Here we focus on the Food files. There are two: one for food events and one for food enforcement. The food event file contains records of individuals who have gotten ill after consuming food. The food enforcement file contains records of specific types of food that have been recalled. Let's get these data into R:
library(jsonlite)
enforceFDA <- fromJSON("data/food-enforcement-0001-of-0001.json", flatten = TRUE)
## or
enforceFDA <- fromJSON(unzip("data/food-enforcement-0001-of-0001.json.zip"), flatten = TRUE)
eventFDA <- fromJSON(unzip("data/food-event-0001-of-0001.json.zip"), flatten = TRUE)
The fromJSON function returns an R list with two elements: one called meta which contains some information about the data and another called results which is an R data frame with the actual data.
Let's take a look at the enforcement events. We can see the first records in the data by
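To make the meta/results structure and the effect of flatten = TRUE concrete, here is a small sketch on a hand-written JSON string. The field names and values are invented for illustration and are not taken from the FDA files; note that the second record has a field the first one lacks, which is the kind of irregularity JSON handles well.

```r
library(jsonlite)

# Hypothetical JSON text mimicking the meta/results layout (values are made up)
txt <- '{
  "meta": {"last_updated": "2016-04-27"},
  "results": [
    {"recall_number": "F-0001", "firm": {"name": "Acme Foods", "state": "CA"}},
    {"recall_number": "F-0002", "firm": {"name": "Bolt Dairy"}, "note": "expanded recall"}
  ]
}'

parsed <- fromJSON(txt, flatten = TRUE)

names(parsed)              # the two top-level elements: meta and results
results <- parsed$results  # a data frame with one row per record
str(results)               # nested "firm" fields become firm.name, firm.state
```

Fields that are missing in a record (here note in the first record and the firm's state in the second) are filled with NA, so every row ends up with the same columns.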
glimpse(enforceFDA$results)
## Observations: 13,991
## Variables: 24
## $ classification <chr> "Class II", "Class II", "Class II",...
## $ center_classification_date <chr> "20160415", "20160415", "20160420",...
## $ report_date <chr> "20160427", "20160427", "20160427",...
## $ postal_code <chr> "78218-5415", "19044-3424", "10010-...
## $ recall_initiation_date <chr> "20160316", "20150826", "20160304",...
## $ recall_number <chr> "F-1083-2016", "F-1088-2016", "F-11...
## $ city <chr> "San Antonio", "Horsham", "New York...
## $ event_id <chr> "73576", "72085", "73471", "72916",...
## $ distribution_pattern <chr> "Texas", "AL, FL, GA, KY, NC, OH, S...
## $ recalling_firm <chr> "HEB Retail Support Center", "BIMBO...
## $ voluntary_mandated <chr> "Voluntary: Firm Initiated", "Volun...
## $ state <chr> "TX", "PA", "NY", "WA", "NJ", "", "...
## $ reason_for_recall <chr> "The product may have been under pr...
## $ initial_firm_notification <chr> "Two or more of the following: Emai...
## $ status <chr> "Ongoing", "Ongoing", "Ongoing", "O...
## $ product_type <chr> "Food", "Food", "Food", "Food", "Fo...
## $ country <chr> "United States", "United States", "...
## $ product_description <chr> "Hill Country Fare Chunk Light Tuna...
## $ code_info <chr> " UPC 041220653355 lot code 6O9FZ S...
## $ address_1 <chr> "4710 N Pan Am Expy", "255 Business...
## $ address_2 <chr> "", "", "Ste 1604", "", "", "", "",...
## $ product_quantity <chr> "5,376 units", "17,974 units", "4,7...
## $ termination_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ more_code_info <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
So there is a total of 13,991 records in this data with 24 fields. Let's see what the first recalled product was:
enforceFDA$results$product_description[1]
## [1] "Hill Country Fare Chunk Light Tuna in pure vegetable oil NET WT. 5 OZ (142g) packaged in a metal can."
enforceFDA$results$reason_for_recall[1]
## [1] "The product may have been under processed."
enforceFDA$results$recalling_firm[1]
## [1] "HEB Retail Support Center"
So the first record was a recall by HEB (a grocery retail chain in Texas) of a canned tuna product which may have been under-processed.
Which company had the most product recalls enforced by the FDA? We can easily get this by counting up the recalling_firm field:
RecallFirmCounts <- table(enforceFDA$results$recalling_firm)
sort(RecallFirmCounts, decreasing = TRUE)[1:10]
##
##       Garden-Fresh Foods, Inc.               Good Herbs, Inc.
##                            633                            353
##     Blue Bell Creameries, L.P.          Sunland, Incorporated
##                            291                            219
##       Reser's Fine Foods, Inc.         High Liner Foods Inc.
##                            215                            187
##   Sunset Natural Products Inc.             Whole Foods Market
##                            173                            171
## Health One Pharmaceuticals Inc        Spartan Central Kitchen
##                            147                            140
Ok - so Garden-Fresh Foods had 633 recall events. This was a simple counting exercise. Later on we will look at much more sophisticated methods for summarizing a database.
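The count-and-sort pattern used here is easy to try on a toy vector. The firm names below are made up; the sketch only illustrates how table and sort combine to give a "top k" list.

```r
# Hypothetical firm names, one entry per recall record
firms <- c("Acme", "Bolt", "Acme", "Crest", "Acme", "Bolt")

counts <- table(firms)                       # number of records per firm
top2 <- sort(counts, decreasing = TRUE)[1:2] # two most frequent firms
print(top2)
```

Here Acme appears three times and Bolt twice, so these two make up the top of the sorted table.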
Copyright © 2017 Karsten T. Hansen. All rights reserved.
Source: http://lab.rady.ucsd.edu/sawtooth/business_analytics_in_r/DataManagement.html