Jake Feldman
-Data comes in many different forms
-Some of the common types include:
Numeric: 5, 4.2, 5.432
Character: “hello”, “5”
Logical or Boolean: TRUE or FALSE
Factor: categories like letter grades
-Store multiple pieces of info in a data structure
Vectors: single column, all data must be the same type
Data Frames: two dimensions, data can be multiple types
-Variables are used to store or keep track of information
-When you read data into R, you’ll need to use a variable to store this info
#We use the = operator to set a variable equal to something
#(R also accepts <- for the same purpose).
#Now the variable a, for example, will be equal to 5 as long as
#I don't change it
a=5
b=4.2
c=5.432
#Running the name of the variable will show you what it is equal to
a
[1] 5
#The class command shows the type of the variable
class(c)
[1] "numeric"
-In order to create variables you need to run the line of code where they are defined or manipulated
-I will show you how to run code after this introduction (this is one advantage of RStudio)
#Notice the use of quotes (Can be single or double) to
#create a string variable
e="TRUE"
#You should think of this variable as 4 letters
e
[1] "TRUE"
#Get the type
class(e)
[1] "character"
#Logical values need to be all caps and have no quotes: TRUE or FALSE.
v=TRUE
v
[1] TRUE
class(v)
[1] "logical"
-Make variable names meaningful - makes code more readable.
-Variables are case sensitive.
-No spaces in names, but you can use numbers, ‘.’, and ‘_’ (a name cannot start with a number or ‘_’)
-The # starts a comment. Comments tell you and others what you are trying to accomplish. All of your code should be commented.
-Commands in the editor are separated by a semicolon or a new line. We will generally use the latter.
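-A minimal sketch of the naming rules above. The variable names here (exam_score, exam.curve, score2, total, Total, x, y) are made up for illustration.

```r
# Meaningful names: letters, numbers, '.' and '_' are allowed,
# but a name cannot start with a number or an underscore
exam_score = 91
exam.curve = 4
score2 = exam_score + exam.curve

# Case matters: total and Total are two different variables
total = 10
Total = 20

# Two commands on one line, separated by a semicolon
x = 1; y = 2
```

-Running total and then Total shows two different values, which is why consistent capitalization matters.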
#Vectors store data that is all of the same type in a single dimension;
#think a single row or column of data. The c() means combine and is how
#we create a vector
colors= c("yellow", "green", "red")
colors
[1] "yellow" "green" "red"
class(colors)
[1] "character"
#Trying to create a vector with different data types:
#R converts everything to one type (here the 5 becomes the string "5")
weirdColors = c(5, "yellow")
weirdColors
[1] "5" "yellow"
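-A short sketch of what this conversion means in practice. The variable names (mixed, five, flags) are made up for illustration.

```r
# Mixing types in c() forces every element to one common type
mixed = c(5, "yellow")
class(mixed)             # "character"

# A number stored as text must be converted before doing math with it
five = as.numeric("5")
five + 1                 # 6

# Logical values mixed with numbers become 1 (TRUE) and 0 (FALSE)
flags = c(TRUE, FALSE, 3)
flags                    # 1 0 3
class(flags)             # "numeric"
```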
#Create a factor. Notice the input is a vector
crazyColors = factor(c("yellow", "red", "yellow", "green", "red", "yellow"))
#Printing a factor gives you the various categories
crazyColors
[1] yellow red yellow green red yellow
Levels: green red yellow
#The levels command gives us the labels in a vector
levels(crazyColors)
[1] "green" "red" "yellow"
#Factors store a vector as well as the distinct elements of the vector
#as categories or labels
crazyColors = factor(c("yellow", "red", "yellow", "green", "red", "yellow"))
#Get the number of different categories using the nlevels command
nlevels(crazyColors)
[1] 3
#Get the counts under each label
table(crazyColors)
crazyColors
 green    red yellow
     1      2      3
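-The same commands applied to letter grades, the example of a factor given at the start. This is only a sketch; the grades vector is made-up data.

```r
# Letter grades as a factor (made-up data)
grades = factor(c("B", "A", "C", "A", "B", "A"))

# The distinct categories, in alphabetical order
levels(grades)       # "A" "B" "C"

# How many categories there are
nlevels(grades)      # 3

# How many times each grade appears
table(grades)
```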
#Data frames let us store data in 2D (rows and columns) like an Excel
#spreadsheet. Here I show how to create one from scratch.
QBA = data.frame(names=c("Jake", "Jonny", "Jill"),
height = c(152, 171.5,165), fromCali = c(TRUE,TRUE, FALSE) )
QBA
  names height fromCali
1  Jake  152.0     TRUE
2 Jonny  171.5     TRUE
3  Jill  165.0    FALSE
#Use str() to see the structure of the data frame
str(QBA)
'data.frame': 3 obs. of 3 variables:
$ names : Factor w/ 3 levels "Jake","Jill",..: 1 3 2
$ height : num 152 172 165
$ fromCali: logi TRUE TRUE FALSE
-When working with a data frame, the first command you should run is str()
You can’t do arithmetic operations on a piece of data unless it’s stored as a numeric data type
Text such as names can be stored as strings or as factors
This distinction will be very important when we start plotting
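-A minimal sketch of why checking types matters. The names heightText, heightNum, people, and peopleF are made up for illustration. One fact to keep in mind: in R 4.0 and later, data.frame() keeps text as character by default, so the Factor column in the str() output above appears only when stringsAsFactors = TRUE (the default in older versions of R).

```r
# A number stored as text cannot be used in arithmetic
heightText = "152"
class(heightText)                  # "character"

# heightText + 1 would stop with an error, so convert first
heightNum = as.numeric(heightText)
heightNum + 1                      # 153

# stringsAsFactors controls whether text columns become factors
people = data.frame(names = c("Jake", "Jonny", "Jill"),
                    stringsAsFactors = FALSE)
str(people)                        # names is chr

peopleF = data.frame(names = c("Jake", "Jonny", "Jill"),
                     stringsAsFactors = TRUE)
str(peopleF)                       # names is a Factor with 3 levels
```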
# Addition
5+3
[1] 8
c(3,3) + c(1,1)
[1] 4 4
a=5
b=3
c=a+b
c
[1] 8
#Subtraction
5-3
[1] 2
c(3,3) - c(1,1)
[1] 2 2
#Multiplication
5*3
[1] 15
c(3,3) * c(1,1)
[1] 3 3
#Division
5/3
[1] 1.666667
c(3,3) / c(1,1)
[1] 3 3
#Raising to a power
5^3
[1] 125
c(3,3) ^ c(1,1)
[1] 3 3
#Getting remainder
5%%3
[1] 2
c(3,3) %% c(1,1)
[1] 0 0
#Getting quotient
5%/%3
[1] 1
c(3,3) %/% c(1,1)
[1] 3 3
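-A small sketch tying the quotient and remainder together. The names items, perBox, fullBoxes, and leftOver are made up for illustration.

```r
# Packing 17 items into boxes that hold 5 each
items = 17
perBox = 5

fullBoxes = items %/% perBox    # quotient: 3 full boxes
leftOver = items %% perBox      # remainder: 2 items left over

# Quotient times divisor plus remainder gives back the original number
fullBoxes * perBox + leftOver   # 17

# With vectors, each operation works element by element
c(10, 20, 30) / c(2, 4, 5)      # 5 5 6
```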
-This syntax will be most useful when we start writing SQL queries
#Greater than
5>3
[1] TRUE
c(3,3) > c(1,1)
[1] TRUE TRUE
#Greater than or equal
5>=3
[1] TRUE
c(3,3) >= c(1,1)
[1] TRUE TRUE
#Less than
5<3
[1] FALSE
c(3,3) < c(1,1)
[1] FALSE FALSE
#Less than or equal
5<=3
[1] FALSE
c(3,3) <= c(1,1)
[1] FALSE FALSE
#Equal
5==3
[1] FALSE
c(3,3) == c(1,1)
[1] FALSE FALSE
#Not equal
5!=3
[1] TRUE
c(3,3) != c(1,1)
[1] TRUE TRUE
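-A short sketch showing that a comparison produces logical values that can be stored in a variable like any other data. The heights are the ones from the QBA data frame; the names heights and tall are made up for illustration.

```r
heights = c(152, 171.5, 165)

# Each height is compared to 160, one element at a time
tall = heights > 160
tall               # FALSE TRUE TRUE
class(tall)        # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(tall)          # 2

# Use == to test equality; a single = would assign instead
heights == 165     # FALSE FALSE TRUE
```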
-Pieces of data/info come in different formats, which we need to be conscious of.
-Variables help us both store and keep track of data.
-We can use variables to store
A single piece of data (one number)
A row of data (vector)
2 dimensions of data (data frame)
-We can do basic arithmetic on variables.
-We are building up the necessary tools to create something meaningful
-Creating and manipulating variables will be at the core of all the code we write
We’ll have to make sure we create the correct variables and put them in the correct order
For all of the assignments, there are many ways to get to the correct answer
Some answers are better than others: more concise, more readable, more efficient
-Slicing and indexing data frames
-How do I pick out specific parts (columns or rows)?
-How do I change specific parts?
-How do I access/change column names?
-Reading in our first data set.