R Basics

(Updated 2-19-2025)

This article lists the basics of the R language. Please let me know if there is anything else I should include!

General Maintenance
- Updating
Data Types
- Convert character variable to numeric
- Converting Character Variable to Factor
- Converting character variable into class date
Simple Plotting
- Boxplot
Data Frame Manipulation

General Maintenance

If there is a new package that I don’t yet have installed on my computer, I can do:

install.packages("plotly")

To update all installed packages without being asked to confirm each one, I do:

update.packages(ask=FALSE)

To load a library, I do:

library(lubridate)
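A variation I sometimes find handy is to check which packages are missing before installing anything. This is only a sketch: the helper name missing_packages is hypothetical, and the package names are placeholders.

```r
# Return the subset of `pkgs` that cannot be found in the current library paths.
# `missing_packages` is a hypothetical helper name used only for illustration.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

wanted <- c("stats", "utils")   # placeholders; replace with the packages you need
to_install <- missing_packages(wanted)

# Install only what is missing; this branch is skipped when nothing is missing.
if (length(to_install) > 0) {
  install.packages(to_install)
}
```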

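The pacman package installs (when needed) and loads packages in a single call, which is much easier than the standard R routine. pacman itself has to be installed once with install.packages("pacman").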

pacman::p_load(pacman, dplyr, GGally, ggplot2, ggthemes, ggvis, 
    httr, lubridate, plotly, rio, rmarkdown, shiny, stringr, tidyr)

To unload all add-on packages, type:

p_unload(all)

To check the current working directory:

getwd()

To set the working directory:

setwd("/path/to/my/directory")

Updating

Updating R

Data Types

Check data type

str(my.data)

Convert character variable to numeric

dataset$prop_camp <- as.numeric(dataset$prop_camp)
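Two things are worth checking after this conversion. The sketch below uses made-up values (prop_raw and f are hypothetical) to show that unparseable strings become NA, and that a factor should be converted to character first.

```r
# Hypothetical example values; "n/a" cannot be parsed as a number.
prop_raw <- c("0.25", "1.5", "n/a")

# as.numeric() returns NA (with a warning) for strings it cannot parse.
prop_num <- suppressWarnings(as.numeric(prop_raw))
print(prop_num)

# Count entries that were present as text but failed to convert.
n_failed <- sum(is.na(prop_num) & !is.na(prop_raw))
print(n_failed)

# For a factor, convert to character first;
# as.numeric() alone returns the internal level codes.
f <- factor(c("10", "20", "10"))
print(as.numeric(f))                # level codes: 1 2 1
print(as.numeric(as.character(f)))  # original values: 10 20 10
```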

Converting Character Variable to Factor

dataset$cookstove_assigned2 <- factor(dataset$cookstove_assigned2, 
                                      levels = c("Already users", 
                                                 "Intervention group",
                                                 "Waitlisted controls"),
                                      ordered=TRUE)
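To see what the conversion did, I can inspect the levels and the counts per level. The example below is a small sketch with made-up values (group_chr is hypothetical). Note that any value not listed in levels silently becomes NA.

```r
# Hypothetical values; "Unknown" is not one of the declared levels.
group_chr <- c("Intervention group", "Already users",
               "Waitlisted controls", "Intervention group", "Unknown")

group_fct <- factor(group_chr,
                    levels = c("Already users",
                               "Intervention group",
                               "Waitlisted controls"),
                    ordered = TRUE)

# Levels are kept in the order given, not alphabetical order.
print(levels(group_fct))

# Count observations per level, including the NA created by "Unknown".
print(table(group_fct, useNA = "ifany"))

# Ordered factors support comparisons against a level.
print(group_fct < "Waitlisted controls")
```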

Converting character variable into class date

If my date variable is stored as a string such as “2023-06-13”, I can convert it with the lubridate package:

library(lubridate)
df$Date <- ymd(df$Date)
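If lubridate is not available, base R's as.Date() does the same job. This is a sketch with made-up dates (date_chr and us_style are hypothetical names). For strings that are not in year-month-day order, the format argument describes the layout.

```r
# Hypothetical date strings in ISO (year-month-day) format.
date_chr <- c("2023-06-13", "2023-12-01")

# ISO strings are parsed without specifying a format.
date_iso <- as.Date(date_chr)
print(class(date_iso))

# Other layouts need an explicit format string.
us_style <- as.Date("06/13/2023", format = "%m/%d/%Y")
print(us_style)

# Once the column is of class Date, date arithmetic works.
print(date_iso[1] + 30)                       # 30 days later
print(as.numeric(date_iso[2] - date_iso[1]))  # days between the two dates
print(format(date_iso, "%Y"))                 # extract the year as text
```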

Simple Plotting

Boxplot

plot(iris$Species, iris$Petal.Width)

The resulting figure is a boxplot of petal width for each of the three iris species.
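The same plot can be drawn with boxplot() and a formula, which makes it easier to add a title and axis labels. The labels below are my own choice, not part of the original example.

```r
# Draw petal width by species; the title and axis labels are illustrative.
bp <- boxplot(Petal.Width ~ Species, data = iris,
              main = "Petal width by species",
              xlab = "Species",
              ylab = "Petal width (cm)")

# boxplot() invisibly returns the statistics it plotted.
# Row 3 of `stats` holds the median of each group.
print(bp$names)
print(bp$stats[3, ])
```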

Data Frame Manipulation

Read in data

Read Excel files into R

Recode empty cells as missing value during data import

df_midline <- rio::import("midline_04032021.csv", na.strings="")
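Base R's read.csv() accepts the same na.strings argument. The sketch below writes a small made-up file to a temporary location (the column names are hypothetical) so that the effect is visible.

```r
# Write a small hypothetical CSV with empty cells to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,village,score",
             "1,A,10",
             "2,,7",
             "3,C,"), tmp)

# Treat empty cells as missing values while reading.
df_raw <- read.csv(tmp, na.strings = "")

print(is.na(df_raw$village))  # the empty text cell is now NA
print(is.na(df_raw$score))    # the empty numeric cell is NA as well
```

With na.strings = "" the literal text NA is no longer treated as missing, so I would use na.strings = c("", "NA") when a file contains both.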

Rows

Remove rows with missing values (NA or NaN) in a given column

df_midline <- df_midline[!is.na(df_midline$prop_camp), ]
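To drop rows that have a missing value in any column rather than in one specific column, complete.cases() is an option. The data frame below is made up for illustration.

```r
# Hypothetical data with missing values in two different columns.
df <- data.frame(id = 1:4,
                 prop_camp = c(0.2, NA, 0.5, NaN),
                 other = c("a", "b", NA, "d"))

# is.na() is TRUE for both NA and NaN, so rows 2 and 4 are dropped.
by_column <- df[!is.na(df$prop_camp), ]
print(by_column)

# complete.cases() keeps only rows with no missing value in any column.
all_columns <- df[complete.cases(df), ]
print(all_columns)
```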

Check number of rows in data frame

nrow(dataset)

Select rows with certain conditions

df[df$unique_id=="157-00045552",]
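One thing to watch when filtering on a column that contains missing values: a logical index with NA in it returns a row of NAs. The sketch below uses made-up data (only the first ID comes from the example above) and shows two ways around this.

```r
# Hypothetical data; the age of the second household is missing.
df <- data.frame(
  unique_id = c("157-00045552", "157-00045553", "157-00045554", "157-00045555"),
  age = c(25, NA, 40, 31)
)

# The NA in the condition produces an extra row filled with NA.
with_na_row <- df[df$age > 30, ]
print(nrow(with_na_row))

# which() keeps only the positions where the condition is TRUE.
clean <- df[which(df$age > 30), ]
print(clean)

# subset() also drops rows where the condition is NA.
same_result <- subset(df, age > 30)
print(same_result)
```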

Reorder rows by Sepal.Length in ascending order and Petal.Length in descending order

my_data %>% arrange(Sepal.Length, desc(Petal.Length))

Reorder rows by Sepal.Length in descending order. Use the function desc():

my_data %>% arrange(desc(Sepal.Length))
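arrange() returns a new data frame, so the result has to be assigned to keep it. For reference, here is a base R sketch of the same two sorts using order(), with a small made-up data frame.

```r
# Hypothetical data with a tie in Sepal.Length.
my_data <- data.frame(Sepal.Length = c(5.1, 4.9, 5.1, 4.7),
                      Petal.Length = c(1.4, 1.4, 1.5, 1.3))

# Ascending Sepal.Length, then descending Petal.Length within ties.
# Negating a numeric column reverses its sort order.
sorted_mixed <- my_data[order(my_data$Sepal.Length, -my_data$Petal.Length), ]
print(sorted_mixed)

# Descending Sepal.Length.
sorted_desc <- my_data[order(my_data$Sepal.Length, decreasing = TRUE), ]
print(sorted_desc)
```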

Find unique values:

unique(df$col)

Columns

Multiply two columns

df$c <- df$a * df$b

Change column type from character to numeric

df_midline <- transform(df_midline, 
                        employment_woman = as.numeric(employment_woman))

Change column type from integer to categorical

mydata$COR <- as.factor(mydata$COR)

Rename columns

my_data %>% 
  rename(
    sepal_length = Sepal.Length,
    sepal_width = Sepal.Width
    )
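rename() does not change my_data in place, so the result needs to be assigned back. Without dplyr, the same rename can be sketched in base R by editing names() directly. The data here is just the first rows of iris.

```r
# Work on a small copy of iris.
my_data <- head(iris, 3)

# Replace each old name with the new one by matching on the current names.
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
names(my_data)[names(my_data) == "Sepal.Width"]  <- "sepal_width"

print(names(my_data))
```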

Reorder columns by listing their positions in the desired order (here the first two columns are swapped)

df[,c(2,1,3,4)]

Drop columns

df <- subset(mydata, select = -c(x, z))
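When the column names to drop are stored in a character vector, subset() is awkward. A base R sketch with a made-up data frame (drop_cols is a hypothetical name):

```r
# Hypothetical data frame with three columns.
mydata <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Names of the columns to remove.
drop_cols <- c("x", "z")

# Keep every column whose name is not in drop_cols.
# drop = FALSE keeps the result a data frame even if one column remains.
df <- mydata[, !(names(mydata) %in% drop_cols), drop = FALSE]
print(names(df))
```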

Select unique values from column

unique(df$column)

Data frames

Examine a Data Frame in R with 7 Basic Functions:

dim(): shows the dimensions of the data frame by row and column
str(): shows the structure of the data frame
summary(): provides summary statistics on the columns of the data frame
colnames(): shows the name of each column in the data frame
head(): shows the first 6 rows of the data frame
tail(): shows the last 6 rows of the data frame
View(): shows a spreadsheet-like display of the entire data frame
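As a quick illustration, here are the non-interactive functions from the list applied to the built-in iris data. View() is left out because it opens a viewer window.

```r
# Dimensions: number of rows, then number of columns.
print(dim(iris))

# Column names and structure.
print(colnames(iris))
str(iris)

# Summary statistics for every column.
print(summary(iris))

# head() and tail() show 6 rows by default; n changes that.
print(head(iris, n = 3))
print(tail(iris, n = 3))
```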

Check for missing values in a dataframe:

sapply(airquality, function(x) sum(is.na(x)))
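An equivalent that I find slightly shorter is colSums() on the logical matrix returned by is.na(). This sketch uses the same built-in airquality data.

```r
# Count missing values per column.
na_counts <- colSums(is.na(airquality))

# Show only the columns that have at least one missing value.
print(na_counts[na_counts > 0])

# Share of missing values in one column.
print(mean(is.na(airquality$Ozone)))

# Quick yes/no check for the whole data frame.
print(anyNA(airquality))
```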

Combine two cross sectional data sets into a panel data set

library(dplyr)
data1 %>%
  bind_rows(data2) %>%
  arrange(ID, Yr)
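Here is a small sketch of what this does, with two made-up waves (the income and employed columns are hypothetical). bind_rows() matches columns by name and fills columns that exist in only one data set with NA.

```r
library(dplyr)

# Hypothetical wave 1 and wave 2; wave 2 has one extra column.
data1 <- data.frame(ID = c(1, 2), Yr = 2020, income = c(100, 200))
data2 <- data.frame(ID = c(1, 2), Yr = 2021, income = c(110, 210),
                    employed = c(TRUE, FALSE))

# Stack the two waves and sort by unit, then by year.
panel <- data1 %>%
  bind_rows(data2) %>%
  arrange(ID, Yr)

print(panel)
```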

Copy the value of a time-fixed variable from one wave to the other within each unit (for panels with two observations per unit)

df <- df %>%
  group_by(unique_id) %>%
  mutate(cookstove_assigned2 
         = ifelse(n()==2, cookstove_assigned2[!is.na(cookstove_assigned2)],
                  cookstove_assigned2)) %>%
  ungroup
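A variation that does not depend on each unit having exactly two rows is to take the first non-missing value within each group. This is a sketch with made-up data, not necessarily better for every data set. It also avoids ifelse(), which as far as I know drops factor attributes and returns integer codes when the column is a factor.

```r
library(dplyr)

# Hypothetical panel: unit "c" has a single row and no known assignment.
df <- data.frame(
  unique_id = c("a", "a", "b", "b", "c"),
  cookstove_assigned2 = c("Intervention group", NA,
                          NA, "Waitlisted controls",
                          NA)
)

# Within each unit, replace every value with the first non-missing one.
# If a unit has no non-missing value, the result stays NA.
df_filled <- df %>%
  group_by(unique_id) %>%
  mutate(cookstove_assigned2 =
           cookstove_assigned2[!is.na(cookstove_assigned2)][1]) %>%
  ungroup()

print(df_filled)
```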

Model-building

# Predictors: every column except the 12th, as a matrix
x <- as.matrix(data[-12])
# Response: the 12th column
y <- data[, 12]
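Selecting the response by name instead of by position is a bit more robust when columns get reordered. A sketch using the built-in mtcars data, where treating mpg as the response is just an assumption for illustration:

```r
# Name of the response column (an illustrative choice).
response <- "mpg"

# Predictor matrix: all columns except the response.
x <- as.matrix(mtcars[, setdiff(names(mtcars), response)])

# Response vector.
y <- mtcars[[response]]

print(dim(x))
print(length(y))
```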

Logit panel regression

See here for a reference.

library(survival)  # provides clogit() and strata()
result_1 <- clogit(resp_stovetype_n ~ indep_var + strata(unique_id), data = df)

If you think some of the variation is due to overall time trends or other time-series patterns (reference here), then you should add time dummies to the model. Just be aware that the log-likelihood may not converge if time dummies are added.

result_1 <- clogit(resp_stovetype_n ~ indep_var + factor(time_period) 
    + strata(unique_id), data = df)
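To try these models without the real data, here is a sketch on simulated data. Everything in it is an assumption for illustration: the data are made up, resp stands in for the response variable, and entering the period as factor(time_period) is one common way to add time dummies, not necessarily the right choice for every panel.

```r
library(survival)

# Simulate a two-period panel with a unit-specific effect (all values made up).
set.seed(1)
n_id <- 200
sim <- data.frame(
  unique_id   = rep(seq_len(n_id), each = 2),
  time_period = rep(c(1, 2), times = n_id),
  indep_var   = rnorm(2 * n_id)
)
unit_effect <- rep(rnorm(n_id), each = 2)
sim$resp <- rbinom(2 * n_id, size = 1,
                   prob = plogis(unit_effect + 0.8 * sim$indep_var))

# Conditional (fixed-effects) logit: one stratum per unit.
fit_fe <- clogit(resp ~ indep_var + strata(unique_id), data = sim)

# Same model with a period dummy entered as a regular covariate.
fit_fe_time <- clogit(resp ~ indep_var + factor(time_period)
                      + strata(unique_id), data = sim)

print(summary(fit_fe_time))
```

Only units whose response changes between periods contribute to the estimates, so a panel with few such units can give unstable results.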

This version and this version can handle factor variables.

Save data

write.csv(df, 'df.csv', row.names = FALSE)
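A quick way to confirm the file was written as expected is to read it back. The sketch below uses a made-up data frame and a temporary directory so that nothing in the working directory is overwritten.

```r
# Hypothetical data with one missing value.
df <- data.frame(id = 1:3, name = c("a", "b", NA))

# Write to a temporary location instead of the working directory.
out_file <- file.path(tempdir(), "df.csv")
write.csv(df, out_file, row.names = FALSE)

# Read the file back and compare it with the original.
df_back <- read.csv(out_file)
print(all.equal(df, df_back))
```

By default missing values are written as the text NA. Passing na = "" to write.csv() writes them as empty cells instead.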
