R tips and tricks
Statistics for Biologists - Nature Collection
Quick-R
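Changing the style of RStudio's help documentation
Locate the RStudio installation directory (usually C:\Program Files\RStudio) and open the resources folder. Back up the file R.css as R.css.bak, then, with administrator privileges, replace its content with the following (you may need to create the new file elsewhere, write the content there, and then copy it into this directory to overwrite the original):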
Controlling R settings: options()
- Example
options()
- Example
Changing R's temporary file directory
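Add the following line to ~/.Renviron: TMP = /home/lyj/Data/tmp/Rtmpdir
The setting takes effect the next time R starts. R checks the environment variables TMPDIR, TMP, and TEMP in that order, so TMP is only used if TMPDIR is not set.
Alternatively, run the following in R, which has a similar effect: write("TMP = '<your-desired-tempdir>'", file=file.path(Sys.getenv('R_USER'), '.Renviron'))
Note that write() overwrites the target file unless append = TRUE is given.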
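The one-liner above can be sketched as a small helper that appends instead of overwriting. This is only an illustration: `set_tmp_entry` is a hypothetical name, the path is a placeholder, and the demonstration writes to a scratch file rather than to the real ~/.Renviron.

```r
# Append a TMP entry to an .Renviron-style file without touching existing lines.
# `set_tmp_entry` is a hypothetical helper, not part of base R.
set_tmp_entry <- function(tmp_path,
                          renviron = file.path(Sys.getenv("HOME"), ".Renviron")) {
  entry <- sprintf("TMP = '%s'", tmp_path)
  write(entry, file = renviron, append = TRUE)  # append = TRUE keeps old entries
  invisible(entry)
}

# Demonstration against a scratch file, so the real ~/.Renviron is untouched.
scratch <- tempfile(fileext = ".Renviron")
writeLines("R_LIBS_USER = '~/R/library'", scratch)  # stand-in for existing content
set_tmp_entry("/path/to/your/tmpdir", renviron = scratch)
renviron_lines <- readLines(scratch)
print(renviron_lines)

# With the real file edited and R restarted, tempdir() shows the directory in use.
```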
Compressing and extracting .tar.gz, .zip, .gz, and .bz2 files in R
.zip
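Compress: zip()
Extract: unzip()
To compress files, pass the name of the archive to create as the first argument of zip() and the names of the files to compress as the second argument. To extract, pass the name of the archive to unzip().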
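A minimal sketch of that round trip, using made-up file names in a scratch directory. zip() relies on an external zip program, so the sketch only runs the archive step when one is found on the system.

```r
# Create two small files in a scratch directory.
work_dir <- file.path(tempdir(), "zip_demo")
dir.create(work_dir, showWarnings = FALSE)
old_wd <- setwd(work_dir)
writeLines("a,b\n1,2", "first.csv")
writeLines("hello", "second.txt")

extracted_files <- character(0)
zip_cmd <- Sys.getenv("R_ZIPCMD", "zip")  # the program utils::zip() calls by default
if (nzchar(Sys.which(zip_cmd))) {
  # First argument: archive to create; second argument: files to add.
  zip("demo.zip", files = c("first.csv", "second.txt"))
  print(unzip("demo.zip", list = TRUE))   # list the contents without extracting
  unzip("demo.zip", exdir = "extracted")  # extract into a subdirectory
  extracted_files <- list.files("extracted")
}
setwd(old_wd)
```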
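.tar.gz
Compress: tar()
Extract: untar()
Usage is analogous to that for .zip archives. To obtain a gzip-compressed archive, call tar() with compression = "gzip"; the default is no compression.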
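A sketch of the equivalent .tar.gz round trip, again with made-up file names in a scratch directory. Passing tar = "internal" is a choice made here so that the example does not depend on an external tar program.

```r
work_dir <- file.path(tempdir(), "tar_demo")
dir.create(work_dir, showWarnings = FALSE)
old_wd <- setwd(work_dir)
writeLines("a,b\n1,2", "first.csv")
writeLines("hello", "second.txt")

# compression = "gzip" produces a real .tar.gz; the default is "none".
tar("demo.tar.gz", files = c("first.csv", "second.txt"),
    compression = "gzip", tar = "internal")

print(untar("demo.tar.gz", list = TRUE))   # list the contents without extracting
untar("demo.tar.gz", exdir = "extracted")  # extract into a subdirectory
extracted_files <- list.files("extracted")
setwd(old_wd)
```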
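.gz and .bz2
These two formats differ most from the previous ones. A .gz or .bz2 file is a compressed file, but because it wraps a single data file rather than an archive of several files, it can also be treated as a data file and read directly into a data frame.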
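(1) Decompress directly
Base R has no ready-made function for decompressing these files to disk, so a package is needed: R.utils. Its gunzip() function decompresses a .gz file.
Note that gunzip() has a remove argument. With remove = TRUE, only the decompressed file is kept and the original compressed file is deleted. The default is TRUE.
After decompressing, the file can be read directly with read.table().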
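A hedged sketch of the decompress-then-read route. The sample file is generated from the built-in iris data purely for illustration, and the R.utils call only runs if the package is installed.

```r
# Write a small gzip-compressed table to a scratch location (base R only).
gz_path <- file.path(tempdir(), "gunzip_demo.txt.gz")
con <- gzfile(gz_path, "w")
write.table(head(iris), con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

if (requireNamespace("R.utils", quietly = TRUE)) {
  # remove = FALSE keeps the .gz file; the default (TRUE) deletes it afterwards.
  R.utils::gunzip(gz_path, remove = FALSE, overwrite = TRUE)
  txt_path <- sub("\\.gz$", "", gz_path)  # gunzip() drops the .gz suffix by default
  dat <- read.table(txt_path, header = TRUE, sep = "\t")
  print(dim(dat))
}
```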
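(2) Read directly
If the goal is only to read the data rather than to decompress the file, two base functions can be combined to read the data directly:
For R versions after 2.10 there is an even more convenient way: pass the compressed file directly to read.table().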
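Both direct-read variants can be sketched as follows. The text does not name the two base functions it combines, so pairing gzfile() with read.table() is an assumption; the sample file is generated from iris for illustration.

```r
gz_path <- file.path(tempdir(), "direct_read_demo.txt.gz")
con <- gzfile(gz_path, "w")
write.table(head(iris), con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Variant 1: wrap the path in a gzfile() connection (use bzfile() for .bz2).
via_connection <- read.table(gzfile(gz_path), header = TRUE, sep = "\t")

# Variant 2: pass the compressed file's path straight to read.table().
via_path <- read.table(gz_path, header = TRUE, sep = "\t")

print(identical(via_connection, via_path))
```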
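Joining two Excel worksheets the way dplyr::left_join does
While processing data recently, I needed to join the data of two Excel worksheets, using a specified column as the matching condition. Excel's built-in VLOOKUP function handles this conveniently.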
Example
There are two worksheets:
Sheet1:
1001    12
1002    15
Sheet2:
1    1001    test1
2    1002    test2
The desired Sheet1 after joining (the first column is the Excel row number):
1    userid    level    username
2    1001      12       test1
3    1002      15       test2
Method
In cell C2, insert the function =VLOOKUP(A2,Sheet2!$B:$C,2,FALSE)
Press Enter, then autofill down the column to populate the remaining rows.
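VLOOKUP arguments
- The first argument, A2, is the lookup value: the content of cell A2 in Sheet1.
- The second argument, Sheet2!$B:$C, is the range in worksheet Sheet2 to search.
- The third argument is the column number, within the lookup range, of the value to return. The value needed is username in column C, and the lookup range spans columns B to C, so it is column 2.
- The fourth argument switches approximate matching: FALSE means exact match, TRUE means approximate match.
The data in Sheet2 does not need to correspond exactly to Sheet1. It may contain more or fewer rows, and the row order may differ. Rows with no match show "#N/A".
The column to match against in Sheet2 must be the first column of the range given in the second argument, here column B.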
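For comparison, the same lookup can be sketched in R with base merge(). The column name `id` for Sheet2's first column and the extra userid 1003 are hypothetical, added only to show what an unmatched row looks like (NA instead of "#N/A").

```r
# Sheet1 and Sheet2 from the example; the 1003 row and the `id` name are hypothetical.
sheet1 <- data.frame(userid = c(1001, 1002, 1003), level = c(12, 15, 20))
sheet2 <- data.frame(id = c(1, 2),
                     userid = c(1001, 1002),
                     username = c("test1", "test2"))

# all.x = TRUE keeps every row of sheet1, like VLOOKUP filled down column C.
merged <- merge(sheet1, sheet2[, c("userid", "username")],
                by = "userid", all.x = TRUE)
print(merged)

# With dplyr installed, the equivalent call would be:
# dplyr::left_join(sheet1, sheet2[, c("userid", "username")], by = "userid")
```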
Chi-square test of independence in R
https://statsandr.com/blog/chi-square-test-of-independence-in-r/
Introduction
This article explains how to perform the Chi-square test of independence in R and how to interpret its results. To learn more about how the test works and how to do it by hand, I invite you to read the article “Chi-square test of independence by hand”.
To briefly recap what has been said in that article, the Chi-square test of independence tests whether there is a relationship between two categorical variables. The null and alternative hypotheses are:
H0: the variables are independent; there is no relationship between the two categorical variables. Knowing the value of one variable does not help to predict the value of the other variable.
H1: the variables are dependent; there is a relationship between the two categorical variables. Knowing the value of one variable helps to predict the value of the other variable.
The Chi-square test of independence works by comparing the observed frequencies (the frequencies observed in your sample) to the expected frequencies if there were no relationship between the two categorical variables (the expected frequencies if the null hypothesis were true).
Data
For our example, let’s reuse the dataset introduced in the article “Descriptive statistics in R”. This dataset is the well-known iris dataset, slightly enhanced. Since there is only one categorical variable and the Chi-square test of independence requires two categorical variables, we add the variable size, which corresponds to small if the length of the sepal is smaller than the median of all flowers, and big otherwise:
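A sketch of how the size variable can be built. Using Sepal.Length is an assumption, chosen because it reproduces the counts quoted in the text (1 big and 49 small setosa flowers); the name `dat` is likewise assumed.

```r
dat <- iris

# Flowers whose sepal is shorter than the overall median are "small", the rest "big".
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

head(dat)
```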
We now create a contingency table of the two variables Species and size with the table() function:
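The contingency table can be sketched as follows, assuming size is derived from Sepal.Length (the choice that matches the setosa counts quoted in the text).

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

# Rows are species, columns are size classes.
tab <- table(dat$Species, dat$size)
print(tab)
```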
The contingency table gives the observed number of cases in each subgroup. For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset.
It is also a good practice to draw a barplot to visually represent the data:
If you prefer to visualize it in terms of proportions (so that bars all have a height of 1, or 100%):
This second barplot is particularly useful if there is a different number of observations in each level of the variable drawn on the x-axis, because it allows comparing the two variables on the same ground.
If you prefer to have the bars next to each other:
See the article “Graphics in R with ggplot2” to learn how to create this kind of barplot in {ggplot2}.
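The three barplot variants described above (counts, proportions, side by side) can be sketched with {ggplot2}. The exact styling of the original figures is unknown, and the code only runs if the package is installed.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  base_plot <- ggplot(dat) + aes(x = Species, fill = size)

  p_counts <- base_plot + geom_bar()                    # stacked counts
  p_props  <- base_plot + geom_bar(position = "fill")   # each bar scaled to height 1
  p_dodge  <- base_plot + geom_bar(position = "dodge")  # bars next to each other

  print(p_counts)
  print(p_props)
  print(p_dodge)
}
```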
Chi-square test of independence in R
For this example, we are going to test in R if there is a relationship between the variables Species and size. For this, the chisq.test() function is used:
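A minimal sketch of the call, assuming size is derived from Sepal.Length and the result is stored in an object named test (the name used later in the text).

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

# Pearson's Chi-squared test on the Species x size contingency table.
test <- chisq.test(table(dat$Species, dat$size))
print(test)
```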
Everything you need appears in this output:
the title of the test,
which variables have been used,
the test statistic,
the degrees of freedom and
the p-value of the test.
You can also retrieve the χ2 test statistic and the p-value with:
If you need to find the expected frequencies, use test$expected.
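The accessors mentioned above can be sketched as follows, again assuming size is built from Sepal.Length.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")
test <- chisq.test(table(dat$Species, dat$size))

test$statistic  # the Chi-square test statistic
test$p.value    # the p-value
test$expected   # expected frequencies under independence
```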
If a warning such as “Chi-squared approximation may be incorrect” appears, it means that the smallest expected frequency is lower than 5. To avoid this issue, you can either:
gather some levels (especially those with a small number of observations) to increase the number of observations in the subgroups, or
use the Fisher’s exact test
Fisher’s exact test does not require the assumption of a minimum of 5 expected counts in the contingency table. It can be applied in R thanks to the function fisher.test(). This test is similar to the Chi-square test in terms of hypothesis and interpretation of the results. Learn more about this test in this article dedicated to this type of test.
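As a sketch, Fisher's exact test can be run on the same contingency table (size again assumed to come from Sepal.Length).

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

# fisher.test() also accepts tables larger than 2 x 2.
fisher_res <- fisher.test(table(dat$Species, dat$size))
print(fisher_res$p.value)
```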
Talking about assumptions, the Chi-square test of independence requires that the observations are independent. This is usually not tested formally, but rather verified based on the design of the experiment and on the good control of experimental conditions. If you are not sure, ask yourself if one observation is related to another (if one observation has an impact on another). If not, it is most likely that you have independent observations.
If you have dependent observations (paired samples), McNemar’s or Cochran’s Q test should be used instead. McNemar’s test is used when we want to know if there is a significant change in two paired samples (typically in a study with a measure before and after on the same subject) when the variables have only two categories. Cochran’s Q test is an extension of McNemar’s test when we have more than two related measures.
For your information, there are three other methods to perform the Chi-square test of independence in R:
with the summary() function
with the assocstats() function from the {vcd} package
with the ctable() function from the {summarytools} package
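Two of these alternatives can be sketched as follows. The {vcd} call only runs if that package is installed, and ctable() is left out of the sketch; size is again assumed to come from Sepal.Length.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")
tab <- table(dat$Species, dat$size)

# summary() on a table object reports a Chi-squared test of independence.
smry <- summary(tab)
print(smry)

# assocstats() reports Pearson and likelihood-ratio statistics.
if (requireNamespace("vcd", quietly = TRUE)) {
  print(vcd::assocstats(tab))
}

# Compare with chisq.test() on the same table.
ref <- chisq.test(tab)
print(all.equal(as.numeric(smry$statistic), as.numeric(ref$statistic)))
```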
As you can see all four methods give the same results.
If you do not have the same p-values with your data across the different methods, make sure to add the correct = FALSE argument in the chisq.test() function to prevent Yates’ continuity correction from being applied. The correction is applied by default in this method (for 2 × 2 tables).
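To illustrate the effect of correct = FALSE, here is a sketch on a 2 x 2 table. The built-in mtcars data is not part of the original example and is used only because the continuity correction applies to 2 x 2 tables; on the 3 x 2 iris table the argument changes nothing.

```r
# 2 x 2 table from mtcars (an assumption, used only for illustration).
tab_2x2 <- table(am = mtcars$am, vs = mtcars$vs)
p_corrected   <- chisq.test(tab_2x2)$p.value                   # Yates' correction on
p_uncorrected <- chisq.test(tab_2x2, correct = FALSE)$p.value  # correction off
print(c(corrected = p_corrected, uncorrected = p_uncorrected))

# 3 x 2 table from the iris example: the argument has no effect here.
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")
tab_3x2 <- table(dat$Species, dat$size)
p_default <- chisq.test(tab_3x2)$p.value
p_nocorr  <- chisq.test(tab_3x2, correct = FALSE)$p.value
print(all.equal(p_default, p_nocorr))
```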
Conclusion and interpretation
From the output and from test$p.value we see that the p-value is less than the significance level of 5%. Like any other statistical test, if the p-value is less than the significance level, we can reject the null hypothesis. If you are not familiar with p-values, I invite you to read this section.
In our context, rejecting the null hypothesis for the Chi-square test of independence means that there is a significant relationship between the species and the size. Therefore, knowing the value of one variable helps to predict the value of the other variable.
Combination of plot and statistical test
I recently discovered the mosaic() function from the {vcd} package. This function has the advantage that it combines a mosaic plot (to visualize a contingency table) and the result of the Chi-square test of independence:
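A sketch of such a call. The argument values are assumptions about how the original figure was produced, and the code only runs if {vcd} is installed.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

if (requireNamespace("vcd", quietly = TRUE)) {
  library(vcd)
  # shade = TRUE colors the cells by their residuals and adds the test's p-value.
  mosaic(~ Species + size, data = dat, direction = c("v", "h"), shade = TRUE)
}
```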
As you can see, the mosaic plot is similar to the barplot presented above, but the p-value of the Chi-square test is also displayed at the bottom right.
Moreover, this mosaic plot with colored cells shows where the observed frequencies deviate from the expected frequencies if the variables were independent. The red cells mean that the observed frequencies are smaller than the expected frequencies, whereas the blue cells mean that the observed frequencies are larger than the expected frequencies.
An alternative is the ggbarstats() function from the {ggstatsplot} package:
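A sketch of the call, assuming size is mapped to the bar segments and Species to the x-axis; the code only runs if {ggstatsplot} is installed.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

if (requireNamespace("ggstatsplot", quietly = TRUE)) {
  # x is the variable shown within each bar, y the variable on the x-axis.
  p <- ggstatsplot::ggbarstats(data = dat, x = size, y = Species)
  print(p)
}
```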
From the plot, it seems that big flowers are more likely to belong to the virginica species, while small flowers tend to belong to the setosa species. Species and size are thus expected to be dependent.
This is confirmed by the statistical results displayed in the subtitle of the plot. There are several results, but in this case we can focus on the p-value, which is displayed after p = at the top (in the subtitle of the plot).
As with the previous tests, we reject the null hypothesis and we conclude that species and size are dependent (p-value < 0.001).
Thanks for reading. I hope the article helped you to perform the Chi-square test of independence in R and interpret its results. If you would like to learn how to do this test by hand and how it works, read the article “Chi-square test of independence by hand”.
As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.