玩转R语言08-数据框（最常用的数据结构！）

今天介绍R语言里面的数据框，

Note

这是对一个初学者来说，R语言中最常用，重要的内容，没有之一。

数据框

数据框，data.frame，可能是大家最常用的数据结构了。我们的excel数据读进来一般默认都是数据框结构。

数据框由不同的行和列构成，不同的列可以是不同类型（数值型、字符型、逻辑型等）的数据，比如可以其中一列是数值型，另一列是逻辑型，另一列是字符型，等。但是同一列中必须是相同的类型。

创建数据框

数据框可通过函数data.frame()创建，使用方式如下：

mydata <- data.frame(col1, col2, col3,...)

其中的列向量col1、col2、col3等可为任何类型（如字符型、数值型或逻辑型）。

以下代码会创建一个数据框，这个数据框有4列，第一列的名字是patientID，是数值型；第二列的名字是age，也是数值型；第三列的名字是diabetes，是字符型；第4列的名字是status，也是字符型：

# 创建4个向量
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")

# 把4个向量放到一个数据框中
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor

我们首先建立了4个向量，然后使用函数data.frame()将4个向量组合在一起，就变成了一个数据框data.frame，所以你也可以把数据框看成是多个向量的组合。

当你对这个操作足够熟悉后，你也可以直接这样写：

patientdata <- data.frame(
  patientID = c(1, 2, 3, 4), 
  age = c(25, 34, 28, 52), 
  diabetes = c("Type1", "Type2", "Type1", "Type1"), 
  status = c("Poor", "Improved", "Excellent", "Poor")
  )

patientdata
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor
class(patientdata)
## [1] "data.frame"

上一章中从外部读取的数据，默认就是数据框结构：

library(readxl)
brca_expr <- read_xlsx("datasets/brca_expr.xlsx", col_names = T)

class(brca_expr)
## [1] "tbl_df"     "tbl"        "data.frame"

Note

readxl是tidyverse系列R包的一员，如果你使用了这个系列的R包，那么默认的数据框就不再单单是data.frame类型，还会增加tbl_df和tbl类型，这会在某些操作中报错，千万要注意。

如何变成单纯的data.frame类型？只需要使用as.data.frame()函数即可：

# 转换类型
brca_expr <- as.data.frame(brca_expr)
class(brca_expr) # 查看类型
## [1] "data.frame"

探索数据框

查看数据框的基本信息，比如，有几行几列？每一列是什么类型？列名是什么？

# 像Excel一样查看数据框，注意，数据太大会直接卡死
View(patientdata) # 或者在环境面板中直接点击数据框名字

# 查看类型
class(patientdata) # 数据框
## [1] "data.frame"

# 查看数据框的维度，dim是dimension的缩写
dim(patientdata)
## [1] 4 4

# 查看几行几列
ncol(patientdata)
## [1] 4
nrow(patientdata)
## [1] 4

# 查看数据框的结构：str是structure的缩写
str(patientdata)
## 'data.frame':    4 obs. of  4 variables:
##  $ patientID: num  1 2 3 4
##  $ age      : num  25 34 28 52
##  $ diabetes : chr  "Type1" "Type2" "Type1" "Type1"
##  $ status   : chr  "Poor" "Improved" "Excellent" "Poor"

# 查看前6行和最后6行
head(patientdata)
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor
tail(patientdata)
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor

# 默认是6行，可以更改，比如改成2行
head(patientdata, 2)
##   patientID age diabetes   status
## 1         1  25    Type1     Poor
## 2         2  34    Type2 Improved

# 查看数据框的列名
names(patientdata)
## [1] "patientID" "age"       "diabetes"  "status"

# 或者使用以下方法查看列名
colnames(patientdata)
## [1] "patientID" "age"       "diabetes"  "status"

# 查看行名
rownames(patientdata)  
## [1] "1" "2" "3" "4"

# 总结数据信息，摘要
summary(patientdata)
##    patientID         age          diabetes            status         
##  Min.   :1.00   Min.   :25.00   Length:4           Length:4          
##  1st Qu.:1.75   1st Qu.:27.25   Class :character   Class :character  
##  Median :2.50   Median :31.00   Mode  :character   Mode  :character  
##  Mean   :2.50   Mean   :34.75                                        
##  3rd Qu.:3.25   3rd Qu.:38.50                                        
##  Max.   :4.00   Max.   :52.00

行列选择

如果我们要选择其中的某些行或者某些列，或者某个元素（比如，第2行第3列的值），有多种不同的方法实现。

可以通过方括号实现，就像访问向量的元素一样。

数据框和向量不一样，向量是一维的，数据框既有行也有列，数据框是二维的，所以在使用方括号时，我们也要指定行和列，行和列之间用,隔开，,前面表示行，后面表示列。

以下是常见方法，必须要记住：

patientdata[1, 2] # 取第1行第2列的值 
## [1] 25

patientdata[1:2,] # 取第1行到第2行，以及所有列，省略数字就是取所有
##   patientID age diabetes   status
## 1         1  25    Type1     Poor
## 2         2  34    Type2 Improved

patientdata[, 1:3] # 取所有行，以及第1列到第3列
##   patientID age diabetes
## 1         1  25    Type1
## 2         2  34    Type2
## 3         3  28    Type1
## 4         4  52    Type1

patientdata[c(1,4), c(1,3)] # 取第1行和第4行，以及第1列和第3列
##   patientID diabetes
## 1         1    Type1
## 4         4    Type1

如果你在方括号中不写,，那么默认是选取其中的列和所有行：

patientdata[c(1,3)] # 取第1列和第3列，所有的行
##   patientID diabetes
## 1         1    Type1
## 2         2    Type2
## 3         3    Type1
## 4         4    Type1

除了使用数字序号这种，也可以直接使用列名进行选取（是不是和向量操作非常像？）：

patientdata[, "patientID"] # 取patientID这一列和所有的行 ,可省略，下面的也是
## [1] 1 2 3 4
patientdata[, c("patientID", "diabetes")] # 取patientID和diabetes两列及所有行
##   patientID diabetes
## 1         1    Type1
## 2         2    Type2
## 3         3    Type1
## 4         4    Type1

除了使用方括号，还可以使用美元符号$选取列：

# 选取patientID这一列
patientdata$patientID
## [1] 1 2 3 4

如果要同时选择部分行和列，还有一个专门的函数subset()：

subset(patientdata, # 数据
       age > 30, # 选择行，age>30的行
       select = c("patientID","diabetes","age") # 选择列
       )
##   patientID diabetes age
## 2         2    Type2  34
## 4         4    Type1  52

tidyverse中的filter和select也能完成此操作

patientdata这个数据集有4列，每一列都有一个列名，我们可以通过列名很轻松的选取其中的列，但是这个数据集没有行名，我们可以给它添加行名，这个数据共有4行，所以我们要准备4个名字，然后使用rownames()添加行名：

# 准备4个名字
rws <- c("第一行", "第二行", "第三行", "第四行")

# 添加行名：
rownames(patientdata) <- rws
patientdata
##        patientID age diabetes    status
## 第一行         1  25    Type1      Poor
## 第二行         2  34    Type2  Improved
## 第三行         3  28    Type1 Excellent
## 第四行         4  52    Type1      Poor

这样就可以通过行名选择你想要的行了，比如选择第1行和第3行，所有的列：

patientdata[c("第一行", "第三行"), ]
##        patientID age diabetes    status
## 第一行         1  25    Type1      Poor
## 第三行         3  28    Type1 Excellent

选择年龄大于30岁的行以及第2列和第3列：

patientdata[patientdata$age > 30, c(2,3)]
##        age diabetes
## 第二行  34    Type2
## 第四行  52    Type1

此时其实是通过TRUE/FALSE进行选择，首先看patientdata$age > 30：

# 第2个和第4个是TRUE，所以就是选择第2行和第4行
patientdata$age > 30
## [1] FALSE  TRUE FALSE  TRUE

这种方法非常有用，大家一定要记住。

增加/删除行列

增加行列，删除行列：

# 增加1列weight
patientdata$weight <- c(20,30,40,50)
patientdata
##        patientID age diabetes    status weight
## 第一行         1  25    Type1      Poor     20
## 第二行         2  34    Type2  Improved     30
## 第三行         3  28    Type1 Excellent     40
## 第四行         4  52    Type1      Poor     50

# 删除第2列，
# 注意此时不重新赋值的话其实patientdata是没有变化的，和向量一样
patientdata[,-2]
##        patientID diabetes    status weight
## 第一行         1    Type1      Poor     20
## 第二行         2    Type2  Improved     30
## 第三行         3    Type1 Excellent     40
## 第四行         4    Type1      Poor     50

# 或者使用以下方法删除列，这种方法不用重新赋值
patientdata$weight <- NULL
patientdata
##        patientID age diabetes    status
## 第一行         1  25    Type1      Poor
## 第二行         2  34    Type2  Improved
## 第三行         3  28    Type1 Excellent
## 第四行         4  52    Type1      Poor

#patientdata[,- "age"] # 这种写法是错误的

修改和重编码

patientdata <- data.frame(
  patientID = c(1, 2, 3, 4), 
  age = c(25, 34, 28, 52), 
  diabetes = c("Type1", "Type2", "Type1", "Type1"), 
  status = c("Poor", "Improved", "Excellent", "Poor")
  )

patientdata
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor

比如，把diabetes这一列中的Type1变成类型1：

# 选择要修改的列，选择要修改的值，重新赋值即可
patientdata$diabetes[patientdata$diabetes == "Type1"] <- "类型1"
patientdata
##   patientID age diabetes    status
## 1         1  25    类型1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    类型1 Excellent
## 4         4  52    类型1      Poor

把age这一列中大于30岁的变成中年人，小于等于30岁的变成青年人：

patientdata$age[patientdata$age > 30] <- "中年人"
patientdata$age[patientdata$age <= 30] <- "青年人"
patientdata
##   patientID    age diabetes    status
## 1         1 青年人    类型1      Poor
## 2         2 中年人    Type2  Improved
## 3         3 青年人    类型1 Excellent
## 4         4 中年人    类型1      Poor

行列转置

行变成列，列变成行，也就是旋转90°：

patientdata
##   patientID    age diabetes    status
## 1         1 青年人    类型1      Poor
## 2         2 中年人    Type2  Improved
## 3         3 青年人    类型1 Excellent
## 4         4 中年人    类型1      Poor

pdt <- t(patientdata) # transpose缩写
pdt
##           [,1]     [,2]       [,3]        [,4]    
## patientID "1"      "2"        "3"         "4"     
## age       "青年人" "中年人"   "青年人"    "中年人"
## diabetes  "类型1"  "Type2"    "类型1"     "类型1" 
## status    "Poor"   "Improved" "Excellent" "Poor"
class(pdt)
## [1] "matrix" "array"

数据框转置后会变成矩阵，如果要变成数据框，使用as.data.frame()即可：

pdt <- as.data.frame(t(patientdata))
pdt
##               V1       V2        V3     V4
## patientID      1        2         3      4
## age       青年人   中年人    青年人 中年人
## diabetes   类型1    Type2     类型1  类型1
## status      Poor Improved Excellent   Poor
class(pdt)
## [1] "data.frame"

行列拼接

可以分为:

拼接列：把列拼起来，也就是对多个数据框水平堆叠，也就是在一个数据框的右侧添加另一个数据框，要求行数相同
拼接行：把行拼起来，也就是对多个数据框垂直堆叠，也就是在一个数据框的下方添加另一个数据框，要求列数相同

比如现在如下两个数据框df1和df2：

df1 # 3行4列
## # A tibble: 3 × 4
##   barcode                      patient      sample           sample_type        
##   <chr>                        <chr>        <chr>            <chr>              
## 1 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-BH-A1FC TCGA-BH-A1FC-11A Solid Tissue Normal
## 2 TCGA-AC-A2FM-11B-32R-A19W-07 TCGA-AC-A2FM TCGA-AC-A2FM-11B Solid Tissue Normal
## 3 TCGA-BH-A0DO-11A-22R-A12D-07 TCGA-BH-A0DO TCGA-BH-A0DO-11A Solid Tissue Normal
df2 # 3行5列
## # A tibble: 3 × 5
##   initial_weight ajcc_pathologic_stage days_to_last_follow_up gender
##            <dbl> <chr>                 <chr>                  <chr> 
## 1            260 Stage IIA             NA                     female
## 2            220 Stage IIB             NA                     female
## 3            130 Stage I               1644                   female
## # ℹ 1 more variable: age_at_index <dbl>

现在我们把两个数据框按列拼接：

cbind(df1, df2) # column bind，变成3行9列
##                        barcode      patient           sample
## 1 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-BH-A1FC TCGA-BH-A1FC-11A
## 2 TCGA-AC-A2FM-11B-32R-A19W-07 TCGA-AC-A2FM TCGA-AC-A2FM-11B
## 3 TCGA-BH-A0DO-11A-22R-A12D-07 TCGA-BH-A0DO TCGA-BH-A0DO-11A
##           sample_type initial_weight ajcc_pathologic_stage
## 1 Solid Tissue Normal            260             Stage IIA
## 2 Solid Tissue Normal            220             Stage IIB
## 3 Solid Tissue Normal            130               Stage I
##   days_to_last_follow_up gender age_at_index
## 1                     NA female           78
## 2                     NA female           87
## 3                   1644 female           78

假如还有两个数据框df3和df4：

df3 # 3行2列
## # A tibble: 3 × 2
##   barcode                      patient     
##   <chr>                        <chr>       
## 1 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-BH-A1FC
## 2 TCGA-AC-A2FM-11B-32R-A19W-07 TCGA-AC-A2FM
## 3 TCGA-BH-A0DO-11A-22R-A12D-07 TCGA-BH-A0DO
df4 # 2行2列
## # A tibble: 2 × 2
##   barcode                      patient     
##   <chr>                        <chr>       
## 1 TCGA-E2-A1BC-11A-32R-A12P-07 TCGA-E2-A1BC
## 2 TCGA-BH-A0BJ-11A-23R-A089-07 TCGA-BH-A0BJ

进行按行拼接：

rbind(df3,df4) # row bind，变成5行2列
## # A tibble: 5 × 2
##   barcode                      patient     
##   <chr>                        <chr>       
## 1 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-BH-A1FC
## 2 TCGA-AC-A2FM-11B-32R-A19W-07 TCGA-AC-A2FM
## 3 TCGA-BH-A0DO-11A-22R-A12D-07 TCGA-BH-A0DO
## 4 TCGA-E2-A1BC-11A-32R-A12P-07 TCGA-E2-A1BC
## 5 TCGA-BH-A0BJ-11A-23R-A089-07 TCGA-BH-A0BJ

数据框合并

具有共同信息的两个数据框可以合并到一个数据框中。比如表格1中包含甲乙丙丁4个病人的年龄和性别信息，表格2中包含甲乙丙丁4个病人的生化指标，那么这样的两个表格就可以合并到一个表格中。

df1 <- data.frame(
  patientID = c("甲","乙","丙","丁"),
  age = c(23,43,45,34),
  gender = c("男","女","女","男")
)
df2 <- data.frame(
  patientID = c("甲","乙","丙","丁"),
  hb = c(110,124,138,142),
  wbc = c(3.7,4.6,6.4,4.2)
)

df1
##   patientID age gender
## 1        甲  23     男
## 2        乙  43     女
## 3        丙  45     女
## 4        丁  34     男
df2
##   patientID  hb wbc
## 1        甲 110 3.7
## 2        乙 124 4.6
## 3        丙 138 6.4
## 4        丁 142 4.2

这两个数据框储存着同一批病人（都是甲乙丙丁4个人）的不同信息，直接拼接起来不是我们想要的结果，因为patientID会出现两次：

cbind(df1,df2)
##   patientID age gender patientID  hb wbc
## 1        甲  23     男        甲 110 3.7
## 2        乙  43     女        乙 124 4.6
## 3        丙  45     女        丙 138 6.4
## 4        丁  34     男        丁 142 4.2

正确的方法是根据两个数据框共有的信息把它们合并起来。这两个数据框共有的信息是patientID这一列，表示的都是甲乙丙丁4个病人，这种情况的正确做法是使用merge函数。

# 根据共有信息合并2个数据框
df3 <- merge(df1, df2, by = "patientID")
df3
##   patientID age gender  hb wbc
## 1        丙  45     女 138 6.4
## 2        丁  34     男 142 4.2
## 3        甲  23     男 110 3.7
## 4        乙  43     女 124 4.6

顺序不一样也没有影响，会自动对应好。比如下面这个数据框，它的顺序不是甲乙丙丁，而是甲丁乙丙：

df4 <- df2[order(df2$wbc),]
df4
##   patientID  hb wbc
## 1        甲 110 3.7
## 4        丁 142 4.2
## 2        乙 124 4.6
## 3        丙 138 6.4

这个数据框和df1也可以直接合并，不会出错：

merge(df1, df4)
##   patientID age gender  hb wbc
## 1        丙  45     女 138 6.4
## 2        丁  34     男 142 4.2
## 3        甲  23     男 110 3.7
## 4        乙  43     女 124 4.6

两个数据框共有信息的名字不一样也可以！

# 这个数据框表示病人ID的列名是id，不是patientID
names(df4)[1] <- "id"
df4
##   id  hb wbc
## 1 甲 110 3.7
## 4 丁 142 4.2
## 2 乙 124 4.6
## 3 丙 138 6.4
df1
##   patientID age gender
## 1        甲  23     男
## 2        乙  43     女
## 3        丙  45     女
## 4        丁  34     男

也可以直接合并，只要指定各自的列名即可：

merge(df1,df4, by.x = "patientID", by.y = "id")
##   patientID age gender  hb wbc
## 1        丙  45     女 138 6.4
## 2        丁  34     男 142 4.2
## 3        甲  23     男 110 3.7
## 4        乙  43     女 124 4.6

这就是表格的合并，也就是把存储相同信息的多个表格合并到一个表格中，注意共同的列(上面的例子是patientID，可以列名不同，但是必须是相同的一类信息或者标识符)不能有重复的观测（比如不能有多个“甲”，只能有一个）。

根据具体情况不同，可以分为以下几类：

内连接：保留两个表格中共有的观测。
左连接：保留表1中的所有观测。
右连接：保留表2中的所有观测。
全连接：保留表1和表2中的所有观测。

假设两个数据框存储了病人的不同信息，如下所示：

df1 <- data.frame(
  patientID = c("甲","乙","丙","丁"),
  age = c(23,43,45,34),
  gender = c("男","女","女","男")
)
df2 <- data.frame(
  patientID = c("甲","乙","戊","几","庚"),
  hb = c(110,124,138,142,108),
  wbc = c(3.7,4.6,6.4,4.2,5.6)
)

df1
##   patientID age gender
## 1        甲  23     男
## 2        乙  43     女
## 3        丙  45     女
## 4        丁  34     男
df2
##   patientID  hb wbc
## 1        甲 110 3.7
## 2        乙 124 4.6
## 3        戊 138 6.4
## 4        几 142 4.2
## 5        庚 108 5.6

内连接，即保留两个表格中共有的观测：

merge(df1, df2, by = "patientID")
##   patientID age gender  hb wbc
## 1        甲  23     男 110 3.7
## 2        乙  43     女 124 4.6

左连接，即保留表1中的所有观测，表2中没有的信息自动填充NA：

merge(df1, df2, all.x = T)
##   patientID age gender  hb wbc
## 1        丙  45     女  NA  NA
## 2        丁  34     男  NA  NA
## 3        甲  23     男 110 3.7
## 4        乙  43     女 124 4.6

右连接，和左连接刚好相反，保留表2中的所有观测：

merge(df1, df2, all.y = T)
##   patientID age gender  hb wbc
## 1        庚  NA   <NA> 108 5.6
## 2        几  NA   <NA> 142 4.2
## 3        甲  23     男 110 3.7
## 4        戊  NA   <NA> 138 6.4
## 5        乙  43     女 124 4.6

#等价于
merge(df2, df1, all.x = T)
##   patientID  hb wbc age gender
## 1        庚 108 5.6  NA   <NA>
## 2        几 142 4.2  NA   <NA>
## 3        甲 110 3.7  23     男
## 4        戊 138 6.4  NA   <NA>
## 5        乙 124 4.6  43     女

全连接，即保留表1和表2中的所有观测：

merge(df1, df2, all = T)
##   patientID age gender  hb wbc
## 1        丙  45     女  NA  NA
## 2        丁  34     男  NA  NA
## 3        庚  NA   <NA> 108 5.6
## 4        几  NA   <NA> 142 4.2
## 5        甲  23     男 110 3.7
## 6        戊  NA   <NA> 138 6.4
## 7        乙  43     女 124 4.6

上面4种连接是最常见的，其实还有半连接和反连接，暂不介绍，感兴趣的可以自己学习一下。

排序

patientdata <- data.frame(
  patientID = c(1, 2, 3, 4), 
  age = c(25, 34, 28, 52), 
  diabetes = c("Type1", "Type2", "Type1", "Type1"), 
  status = c("Poor", "Improved", "Excellent", "Poor")
  )

patientdata
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor

根据某一列进行排序，比如根据age从小到大对数据框重新排序：

patientdata[order(patientdata$age),]
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 3         3  28    Type1 Excellent
## 2         2  34    Type2  Improved
## 4         4  52    Type1      Poor

根据age从大到小对数据框重新排序：

patientdata[order(patientdata$age, decreasing = T),]
##   patientID age diabetes    status
## 4         4  52    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 1         1  25    Type1      Poor

# 或者
patientdata[order(- patientdata$age),]
##   patientID age diabetes    status
## 4         4  52    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 1         1  25    Type1      Poor

还可以根据多个变量进行排序，比如，按照年龄从小到大，并且diabetes从1型到2型的顺序排序：

patientdata[order(patientdata$age, patientdata$diabetes),]
##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 3         3  28    Type1 Excellent
## 2         2  34    Type2  Improved
## 4         4  52    Type1      Poor

计数

如果要查看某一列有几个类别及数量：

table(patientdata$status)
## 
## Excellent  Improved      Poor 
##         1         1         2

如果你想生成diabetes和status的列联表，可以使用table()函数：

table(patientdata$diabetes, patientdata$status)
##        
##         Excellent Improved Poor
##   Type1         1        0    2
##   Type2         0        1    0

医学生应该都能看懂这个结果什么意思吧？

在每个变量名前都键入一次patientdata$的写法是不符合编程思想的，是很繁琐的，所以给大家介绍一个with()函数，在with()函数内部，你可以不用写数据集的名字：

with(patientdata,
     table(diabetes, status)
     )
##         status
## diabetes Excellent Improved Poor
##    Type1         1        0    2
##    Type2         0        1    0

数据框实战

这个数据是我从TCGA官网（官网，不是XENA）下载的乳腺癌患者的临床信息，包含：患者ID、样本ID、样本类型（normal还是tumor？）、年龄、性别等。

# 读取文件
library(readxl)
brca_clin <- read_xlsx("F:/R_books/R_beginners/brca_clin.xlsx", col_names = T)

# 查看数据基本情况
dim(brca_clin)
## [1] 20  9
head(brca_clin)
## # A tibble: 6 × 9
##   barcode        patient sample sample_type initial_weight ajcc_pathologic_stage
##   <chr>          <chr>   <chr>  <chr>                <dbl> <chr>                
## 1 TCGA-BH-A1FC-… TCGA-B… TCGA-… Solid Tiss…            260 Stage IIA            
## 2 TCGA-AC-A2FM-… TCGA-A… TCGA-… Solid Tiss…            220 Stage IIB            
## 3 TCGA-BH-A0DO-… TCGA-B… TCGA-… Solid Tiss…            130 Stage I              
## 4 TCGA-E2-A1BC-… TCGA-E… TCGA-… Solid Tiss…            260 Stage IA             
## 5 TCGA-BH-A0BJ-… TCGA-B… TCGA-… Solid Tiss…            200 Stage IIB            
## 6 TCGA-E2-A1LH-… TCGA-E… TCGA-… Solid Tiss…             60 Stage I              
## # ℹ 3 more variables: days_to_last_follow_up <chr>, gender <chr>,
## #   age_at_index <dbl>
colnames(brca_clin)
## [1] "barcode"                "patient"                "sample"                
## [4] "sample_type"            "initial_weight"         "ajcc_pathologic_stage" 
## [7] "days_to_last_follow_up" "gender"                 "age_at_index"

变量名修改：

colnames(brca_clin)[c(5,6,7,9)] <- c("weight","stage","days","age")
colnames(brca_clin)
## [1] "barcode"     "patient"     "sample"      "sample_type" "weight"     
## [6] "stage"       "days"        "gender"      "age"

选择第5~9列：

brca_clin <- brca_clin[,c(5:9)]
brca_clin
## # A tibble: 20 × 5
##    weight stage      days  gender   age
##     <dbl> <chr>      <chr> <chr>  <dbl>
##  1    260 Stage IIA  NA    female    78
##  2    220 Stage IIB  NA    female    87
##  3    130 Stage I    1644  female    78
##  4    260 Stage IA   501   female    63
##  5    200 Stage IIB  660   female    41
##  6     60 Stage I    3247  female    59
##  7    320 Stage IIB  NA    female    60
##  8    310 Stage IIIA NA    female    39
##  9    100 Stage IIB  1876  female    54
## 10    250 Stage IIB  707   female    51
## 11    130 Stage IIA  5749  female    51
## 12    110 Stage IA   NA    female    44
## 13    470 Stage IIA  1972  female    64
## 14     90 Stage I    1321  female    56
## 15    200 Stage IIA  385   female    71
## 16     70 Stage IIA  1800  female    71
## 17    130 Stage IIB  214   female    63
## 18    770 Stage IIIA 1206  female    47
## 19    200 Stage IA   2442  female    54
## 20    250 Stage IIIC NA    female    36

按照age从小到大排序数据框：

brca_clin[order(brca_clin$age),]
## # A tibble: 20 × 5
##    weight stage      days  gender   age
##     <dbl> <chr>      <chr> <chr>  <dbl>
##  1    250 Stage IIIC NA    female    36
##  2    310 Stage IIIA NA    female    39
##  3    200 Stage IIB  660   female    41
##  4    110 Stage IA   NA    female    44
##  5    770 Stage IIIA 1206  female    47
##  6    250 Stage IIB  707   female    51
##  7    130 Stage IIA  5749  female    51
##  8    100 Stage IIB  1876  female    54
##  9    200 Stage IA   2442  female    54
## 10     90 Stage I    1321  female    56
## 11     60 Stage I    3247  female    59
## 12    320 Stage IIB  NA    female    60
## 13    260 Stage IA   501   female    63
## 14    130 Stage IIB  214   female    63
## 15    470 Stage IIA  1972  female    64
## 16    200 Stage IIA  385   female    71
## 17     70 Stage IIA  1800  female    71
## 18    260 Stage IIA  NA    female    78
## 19    130 Stage I    1644  female    78
## 20    220 Stage IIB  NA    female    87

按照从大到小的顺序排列：

brca_clin[order(brca_clin$age, decreasing = T),]
## # A tibble: 20 × 5
##    weight stage      days  gender   age
##     <dbl> <chr>      <chr> <chr>  <dbl>
##  1    220 Stage IIB  NA    female    87
##  2    260 Stage IIA  NA    female    78
##  3    130 Stage I    1644  female    78
##  4    200 Stage IIA  385   female    71
##  5     70 Stage IIA  1800  female    71
##  6    470 Stage IIA  1972  female    64
##  7    260 Stage IA   501   female    63
##  8    130 Stage IIB  214   female    63
##  9    320 Stage IIB  NA    female    60
## 10     60 Stage I    3247  female    59
## 11     90 Stage I    1321  female    56
## 12    100 Stage IIB  1876  female    54
## 13    200 Stage IA   2442  female    54
## 14    250 Stage IIB  707   female    51
## 15    130 Stage IIA  5749  female    51
## 16    770 Stage IIIA 1206  female    47
## 17    110 Stage IA   NA    female    44
## 18    200 Stage IIB  660   female    41
## 19    310 Stage IIIA NA    female    39
## 20    250 Stage IIIC NA    female    36

查看有几个类别，以及每个类别的数量：

table(brca_clin$stage)
## 
##    Stage I   Stage IA  Stage IIA  Stage IIB Stage IIIA Stage IIIC 
##          3          3          5          6          2          1

变量重编码和修改：

brca_clin$stage[brca_clin$stage == "Stage IIB"] <- "Stage_2"
brca_clin
## # A tibble: 20 × 5
##    weight stage      days  gender   age
##     <dbl> <chr>      <chr> <chr>  <dbl>
##  1    260 Stage IIA  NA    female    78
##  2    220 Stage_2    NA    female    87
##  3    130 Stage I    1644  female    78
##  4    260 Stage IA   501   female    63
##  5    200 Stage_2    660   female    41
##  6     60 Stage I    3247  female    59
##  7    320 Stage_2    NA    female    60
##  8    310 Stage IIIA NA    female    39
##  9    100 Stage_2    1876  female    54
## 10    250 Stage_2    707   female    51
## 11    130 Stage IIA  5749  female    51
## 12    110 Stage IA   NA    female    44
## 13    470 Stage IIA  1972  female    64
## 14     90 Stage I    1321  female    56
## 15    200 Stage IIA  385   female    71
## 16     70 Stage IIA  1800  female    71
## 17    130 Stage_2    214   female    63
## 18    770 Stage IIIA 1206  female    47
## 19    200 Stage IA   2442  female    54
## 20    250 Stage IIIC NA    female    36

全部修改：

brca_clin$stage[brca_clin$stage == "Stage IIA"] <- "Stage_2"
brca_clin$stage[brca_clin$stage == "Stage IA"] <- "Stage_1"
brca_clin$stage[brca_clin$stage == "Stage I"] <- "Stage_1"
brca_clin$stage[brca_clin$stage == "Stage IIIA"] <- "Stage_3"
brca_clin$stage[brca_clin$stage == "Stage IIIC"] <- "Stage_3"

brca_clin
## # A tibble: 20 × 5
##    weight stage   days  gender   age
##     <dbl> <chr>   <chr> <chr>  <dbl>
##  1    260 Stage_2 NA    female    78
##  2    220 Stage_2 NA    female    87
##  3    130 Stage_1 1644  female    78
##  4    260 Stage_1 501   female    63
##  5    200 Stage_2 660   female    41
##  6     60 Stage_1 3247  female    59
##  7    320 Stage_2 NA    female    60
##  8    310 Stage_3 NA    female    39
##  9    100 Stage_2 1876  female    54
## 10    250 Stage_2 707   female    51
## 11    130 Stage_2 5749  female    51
## 12    110 Stage_1 NA    female    44
## 13    470 Stage_2 1972  female    64
## 14     90 Stage_1 1321  female    56
## 15    200 Stage_2 385   female    71
## 16     70 Stage_2 1800  female    71
## 17    130 Stage_2 214   female    63
## 18    770 Stage_3 1206  female    47
## 19    200 Stage_1 2442  female    54
## 20    250 Stage_3 NA    female    36

查看修改后的stage:

table(brca_clin$stage)
## 
## Stage_1 Stage_2 Stage_3 
##       6      11       3

根据年龄进行分组，大于60岁是old，小于等于60岁是young:

# 先看下年龄是不是数值型
str(brca_clin)
## tibble [20 × 5] (S3: tbl_df/tbl/data.frame)
##  $ weight: num [1:20] 260 220 130 260 200 60 320 310 100 250 ...
##  $ stage : chr [1:20] "Stage_2" "Stage_2" "Stage_1" "Stage_1" ...
##  $ days  : chr [1:20] "NA" "NA" "1644" "501" ...
##  $ gender: chr [1:20] "female" "female" "female" "female" ...
##  $ age   : num [1:20] 78 87 78 63 41 59 60 39 54 51 ...
is.numeric(brca_clin$age)
## [1] TRUE

开始修改：

brca_clin$age[brca_clin$age > 60] <- "old"
brca_clin$age[brca_clin$age <= 60] <- "young"
brca_clin
## # A tibble: 20 × 5
##    weight stage   days  gender age  
##     <dbl> <chr>   <chr> <chr>  <chr>
##  1    260 Stage_2 NA    female old  
##  2    220 Stage_2 NA    female old  
##  3    130 Stage_1 1644  female old  
##  4    260 Stage_1 501   female old  
##  5    200 Stage_2 660   female young
##  6     60 Stage_1 3247  female young
##  7    320 Stage_2 NA    female young
##  8    310 Stage_3 NA    female young
##  9    100 Stage_2 1876  female young
## 10    250 Stage_2 707   female young
## 11    130 Stage_2 5749  female young
## 12    110 Stage_1 NA    female young
## 13    470 Stage_2 1972  female old  
## 14     90 Stage_1 1321  female young
## 15    200 Stage_2 385   female old  
## 16     70 Stage_2 1800  female old  
## 17    130 Stage_2 214   female old  
## 18    770 Stage_3 1206  female young
## 19    200 Stage_1 2442  female young
## 20    250 Stage_3 NA    female young

NA修改(这个数据框里的NA是字符型，并不是真正的NA)：

brca_clin$days_1 <- ifelse(brca_clin$days == "NA", NA, brca_clin$days)
brca_clin
## # A tibble: 20 × 6
##    weight stage   days  gender age   days_1
##     <dbl> <chr>   <chr> <chr>  <chr> <chr> 
##  1    260 Stage_2 NA    female old   <NA>  
##  2    220 Stage_2 NA    female old   <NA>  
##  3    130 Stage_1 1644  female old   1644  
##  4    260 Stage_1 501   female old   501   
##  5    200 Stage_2 660   female young 660   
##  6     60 Stage_1 3247  female young 3247  
##  7    320 Stage_2 NA    female young <NA>  
##  8    310 Stage_3 NA    female young <NA>  
##  9    100 Stage_2 1876  female young 1876  
## 10    250 Stage_2 707   female young 707   
## 11    130 Stage_2 5749  female young 5749  
## 12    110 Stage_1 NA    female young <NA>  
## 13    470 Stage_2 1972  female old   1972  
## 14     90 Stage_1 1321  female young 1321  
## 15    200 Stage_2 385   female old   385   
## 16     70 Stage_2 1800  female old   1800  
## 17    130 Stage_2 214   female old   214   
## 18    770 Stage_3 1206  female young 1206  
## 19    200 Stage_1 2442  female young 2442  
## 20    250 Stage_3 NA    female young <NA>