Tuesday, October 23, 2018

R - Data Types, Data Frames and Vectors

Posted by fullstacktips on October 23, 2018

Variables in R can be of different types: we need to distinguish numbers from character strings, and tables from simple lists of numbers. Character strings are always written in quotes (""); numbers are not. The function class() tells us the type of an object.

> a <- 2

> class(a)
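As a small sketch of the point about quotes, the lines below compare the same digit stored as a number and as a character string. The variable name s is only an illustration and is not used elsewhere in this post.

```r
a <- 2        # no quotes: R treats this as a number
s <- "2"      # quotes: R treats this as a character string

class(a)      # "numeric"
class(s)      # "character"

# The quotes change the type, even though both print as 2.
same_type <- class(a) == class(s)
print(same_type)   # FALSE
```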

Storing Data in R – Data Frames

So far we have defined single variables, but the most common way of storing data in R is in a data frame. We can think of a data frame as a table: rows represent observations and columns represent variables. R reports the class of such objects as data.frame, and to find out more about one we can use the function str(), which stands for structure. In our first post we saw how to load datasets. The example here uses R's built-in CO2 dataset, which records carbon dioxide uptake in grass plants. Calling str(CO2) prints the structure of this dataset.

The output of str() shows the number of observations (rows), the number of variables (columns), and each variable's name and type. This is what will help us answer data analysis questions about this data. We can show the first six rows of the data frame using the function head().
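A minimal sketch of these inspection steps, assuming the CO2 dataset that ships with R in the datasets package (available by default, so nothing needs to be loaded). The names n_rows and n_cols are just illustrative.

```r
# Print the structure: number of observations, variables, names and types.
str(CO2)

# The same counts can be read off directly.
n_rows <- nrow(CO2)   # number of observations (rows)
n_cols <- ncol(CO2)   # number of variables (columns)
print(c(n_rows, n_cols))

# Variable (column) names.
print(names(CO2))

# First six rows by default; a second argument changes how many are shown.
print(head(CO2))
print(head(CO2, 3))
```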

Data Accessor - '$'

For our analysis, we will need to access the individual variables in this data. We use the accessor symbol ‘$’ to do so. To access the variable Plant in this dataset, we type the dataset name (CO2), followed by the accessor ($) and the variable name (Plant), as shown below:

> CO2$Plant

We can also use double square brackets with the variable name in quotes to access a variable:

> b <- CO2[["Plant"]]
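The two forms can be checked against each other. The sketch below assumes the built-in CO2 dataset; col_name and the other variable names are hypothetical and only there to show that the double-bracket form also accepts a column name stored in a variable.

```r
plant_dollar  <- CO2$Plant        # accessor form
plant_bracket <- CO2[["Plant"]]   # double-bracket form

# Both return the same object.
same <- identical(plant_dollar, plant_bracket)
print(same)   # TRUE

# Double brackets take a string, so the column name can come from a variable.
col_name <- "uptake"
uptake <- CO2[[col_name]]
print(class(uptake))   # "numeric"
```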

Vectors

The output shown above is what is called a vector: not a single value, but a sequence of entries. A vector may have several entries, and the function length() tells you how many it has. The vector above has length 84.

> length(CO2$Plant)
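For a smaller illustration, a vector can also be built by hand with c(). The vector x below is made up for this sketch; the rest assumes the built-in CO2 dataset.

```r
# A hand-made numeric vector with three entries.
x <- c(10, 20, 30)
print(length(x))   # 3
print(x[2])        # 20 (R counts entries starting from 1)

# Every column of a data frame has one entry per row.
n_plant <- length(CO2$Plant)
print(n_plant)     # 84
print(n_plant == nrow(CO2))   # TRUE

# Square brackets pick out several entries at once.
first_three <- CO2$uptake[1:3]
print(first_three)
```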

Other data types

Logicals

Besides numeric and character vectors, we also have logical vectors, which store the values TRUE or FALSE. We will see
examples of these in later posts.
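As a brief preview, a comparison is one common way a logical vector arises. This is only a sketch on the built-in CO2 dataset; the threshold of 500 and the variable names are arbitrary choices for illustration.

```r
# Comparing a numeric column with a number gives one TRUE/FALSE per row.
high_conc <- CO2$conc > 500
print(class(high_conc))   # "logical"
print(head(high_conc))

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE entries.
n_high <- sum(high_conc)
print(n_high)

# A logical vector can also select the matching rows of the data frame.
high_rows <- CO2[high_conc, ]
print(nrow(high_rows))
```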

Factors

There is one more important data type, called a factor. In the CO2 dataset we have the columns Type and Treatment.
Looking at the data, we might expect the class of these columns to be character, but it is actually factor. This data type
appears frequently in R and in data science. Factors are useful for storing categorical data; each of the variables Type and
Treatment has only two categories. Storing categorical data this way can be more memory efficient because, in the background,
R stores each value as an integer code that refers to one of the levels, and integers generally take less memory than character strings. To see the different categories, we use:

> levels(CO2$Type)

> levels(CO2$Treatment)
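The sketch below looks at what sits behind a factor, again assuming the built-in CO2 dataset. The sizes vector at the end is a made-up example showing how factor() builds a factor from character strings.

```r
print(class(CO2$Type))         # "factor"
print(levels(CO2$Type))        # "Quebec" "Mississippi"
print(levels(CO2$Treatment))   # "nonchilled" "chilled"
print(nlevels(CO2$Type))       # 2

# Behind the scenes each entry is an integer code pointing at a level.
codes <- as.integer(CO2$Type)
print(unique(codes))

# table() counts how many observations fall into each category.
print(table(CO2$Type))

# Building a factor by hand: levels are sorted alphabetically by default.
sizes <- factor(c("small", "large", "small"))
print(levels(sizes))           # "large" "small"
print(as.integer(sizes))       # 2 1 2
```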