# Biostatistics with R for Medical Sciences
Initially written in 1991 by two statistics professors, Ross Ihaka and Robert Gentleman (the two Rs behind the name R) of the University of Auckland, New Zealand, the programming language R was developed to make it easier for scientists to organize, analyze, and distribute data. R can be used for many of the same functions as commercial software, such as SPSS or SAS, but it has the added advantages of being free and having an incredibly active community of users who are constantly coming up with new features and “packages.” Anyone can write up a new function and save it as an R package in the Comprehensive R Archive Network (CRAN) repository. With more than 1 million users worldwide and more than 10,000 packages, R is emerging as a popular tool for scientists of all stripes — including social and behavioral scientists.
R is a programming language for statistical computing and graphics that you can use to clean, analyze, and graph your data. It is widely used by researchers from diverse disciplines to estimate and display results and by teachers of statistics and research methods. It’s free, making it an attractive option, but does rely on programming code — instead of drop down menus or buttons — to get the job done. Programming languages can be intimidating. Maybe you like the comfort and familiarity of whatever statistics program you’ve been working with. Maybe you don’t have the time to learn a new skill. Maybe you just don’t know where to start. These are all valid reasons for putting off using R.
Now the question remains: What should you use R for? Everything. No seriously, everything. Toss out SPSS, SAS, and Stata, because R can do all the descriptive analyses, regression equations, (M)AN(C)OVA, and hierarchical linear modeling you want. No need to buy Mplus, because R has structural equation modeling covered. Don’t bother opening Excel, because merging data sets, cleaning data, identifying important rows or columns, and even updating your gradebook can be done in R. Save money on colored pencils, because R will create whatever plot or graphic you can imagine, even if it’s 3D or interactive or both. R can be used with text processors like LaTeX, so you can integrate your results right into the manuscript itself. Stuck using Microsoft Word because your collaborators like track changes? R will create APA-formatted tables, complete with significance stars and horizontal lines, and export them as .doc files for your convenience. R can do both frequentist and Bayesian statistics. R can make use of your multi-core processor and run analyses in parallel. Search for a “bit of fun with R” and learn how to make a winking elephant. R can bootstrap, simulate, randomize, resample, multiply, impute, and park your car. Well, R can’t park your car (yet).
## Advantages and limitations
### Advantages of R
- open source
- minimal hardware requirements
- reproducible code: scripts
- extensions: packages
- community and support
- online courses
### Limitations of R
- learning curve
- the command line can be intimidating at first
- messy code is slow
### RStudio
### Other interfaces
## Getting started with R and RStudio
### Installing R and RStudio
### The RStudio interface
### Loading data
## Exploratory data analysis
### Exploring the dataset
### Missing data and outliers
See https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
#### 2. Missing Value Treatment
##### Why is missing value treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
Figure: Data Exploration, Missing Values
Notice the missing values in the image shown above. In the left scenario, we have not treated the missing values. The inference from this data set is that males are more likely to play cricket than females. On the other hand, the second table, which shows the data after treatment of missing values (based on gender), indicates that females are more likely to play cricket than males.
##### Why does my data have missing values?
We looked at the importance of treatment of missing values in a dataset. Now, let’s identify the reasons for occurrence of these missing values. They may occur at two stages:
- Data extraction: There may be problems with the extraction process. In such cases, we should double-check the data with the data guardians. Hashing procedures can also be used to verify that the extraction is correct. Errors at the data extraction stage are typically easy to find and to correct.
- Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
  - Missing completely at random: The probability of a value being missing is the same for all observations. For example, respondents decide whether to declare their earnings by tossing a fair coin: if it lands heads, the respondent declares his or her earnings; if it lands tails, the respondent does not. Each observation then has an equal chance of having a missing value.
  - Missing at random: The variable is missing at random, but the missing ratio varies across the values or levels of other input variables. For example, we are collecting data on age, and females have a higher proportion of missing values than males.
  - Missing that depends on unobserved predictors: The missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic procedure causes discomfort, there is a higher chance of dropping out of the study. These missing values are not at random unless “discomfort” has been included as an input variable for all patients.
  - Missing that depends on the missing value itself: The probability of a missing value is directly correlated with the missing value itself. For example, people with very high or very low incomes are more likely not to report their earnings.
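The difference between the first two mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, sample size, and missingness probabilities (0.5, 0.4, 0.1) are assumptions chosen for illustration, not values from the text.

```r
set.seed(42)  # fixed seed so the simulation is reproducible

n <- 1000
gender <- rep(c("Female", "Male"), each = n / 2)
income <- round(rnorm(n, mean = 3000, sd = 500))  # hypothetical earnings

# Missing completely at random: a fair coin decides, regardless of any variable
income_mcar <- ifelse(rbinom(n, size = 1, prob = 0.5) == 1, income, NA)

# Missing at random: the chance of a missing value depends on an observed
# variable (gender), not on the income value itself
p_missing <- ifelse(gender == "Female", 0.4, 0.1)
income_mar <- ifelse(runif(n) < p_missing, NA, income)

# Proportion of missing values by gender under each mechanism
miss_mcar <- tapply(is.na(income_mcar), gender, mean)
miss_mar  <- tapply(is.na(income_mar), gender, mean)
rbind(MCAR = miss_mcar, MAR = miss_mar)
```

Under the coin-toss mechanism the two proportions should be close to each other, while under the gender-dependent mechanism the proportion for females should be clearly higher.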
##### What are the methods to treat missing values?
Deletion: It is of two types: list-wise deletion and pair-wise deletion. In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size. In pair-wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses a different sample size for different variables.
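The two deletion strategies can be sketched in base R as follows. The data frame is hypothetical (the variable names `age`, `weight` and `sbp` are assumptions made for the example); `na.omit()` performs list-wise deletion, and `cor(..., use = "pairwise.complete.obs")` is one base R function that applies pair-wise deletion.

```r
# Hypothetical data with one missing value in each variable
df <- data.frame(
  age    = c(25, 32, NA, 41, 38),
  weight = c(70, NA, 65, 80, 75),
  sbp    = c(120, 130, 125, NA, 135)
)

# List-wise deletion: drop every row that has at least one missing value
listwise <- na.omit(df)
nrow(listwise)

# Pair-wise deletion: each correlation uses all rows where both
# variables of that pair are observed
cor(df, use = "pairwise.complete.obs")

# Number of rows actually used for each pair of variables
observed <- ifelse(is.na(as.matrix(df)), 0, 1)
n_pairs <- crossprod(observed)
n_pairs
```

In this toy example list-wise deletion keeps only the fully observed rows, whereas each pair-wise correlation is based on more rows, at the cost of a different sample per pair.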
Figure: Data Exploration, Missing Values, Deletion Methods

Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.

Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:
- Generalized imputation: We calculate the mean or median of all non-missing values of the variable and replace each missing value with it. In the table above, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing values with it.
- Similar case imputation: We calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender. For “Male”, we replace missing values of manpower with 29.75, and for “Female” with 25.

Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. We divide the data set into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training data set of the model, the second set (with missing values) is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression, and various other modeling techniques to do this. There are two drawbacks to this approach:
- The model-estimated values are usually more well-behaved than the true values.
- If there are no relationships between the other attributes in the data set and the attribute with missing values, the model will not estimate the missing values precisely.

KNN Imputation: In this method, the missing values of an observation are imputed using a given number (k) of observations that are most similar to the observation whose values are missing. The similarity of two observations is determined using a distance function. It has certain advantages and disadvantages.

Advantages:
- k-nearest neighbour can predict both qualitative and quantitative attributes
- Creation of a predictive model for each attribute with missing data is not required
- Attributes with multiple missing values can be easily treated
- The correlation structure of the data is taken into consideration

Disadvantages:
- The KNN algorithm is very time-consuming on large databases, because it searches through the whole data set looking for the most similar instances.
- The choice of k is critical. A higher value of k would include neighbours that are significantly different from what we need, whereas a lower value of k risks missing out on significant neighbours.
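The two mean-imputation variants and the prediction-model approach described above can be sketched in base R. The data frame below is hypothetical and is not the table from the original figure, so the imputed values differ from the 28.33, 29.75 and 25 quoted in the text. The column names are assumptions made for the example.

```r
# Hypothetical data: manpower is missing for one male and one female
df <- data.frame(
  gender   = c("Male", "Male", "Male", "Female", "Female", "Female"),
  manpower = c(30, NA, 28, 24, 26, NA)
)
is_missing <- is.na(df$manpower)

# Generalized imputation: replace missing values with the overall mean
overall_mean <- mean(df$manpower, na.rm = TRUE)
df$manpower_general <- ifelse(is_missing, overall_mean, df$manpower)

# Similar case imputation: replace missing values with the mean of the
# same gender group
group_mean <- ave(df$manpower, df$gender,
                  FUN = function(x) mean(x, na.rm = TRUE))
df$manpower_similar <- ifelse(is_missing, group_mean, df$manpower)

# Prediction model: fit on the rows without missing values (training set)
# and predict the rows with missing values (test set)
train <- df[!is_missing, ]
test  <- df[is_missing, ]
fit <- lm(manpower ~ gender, data = train)
df$manpower_model <- df$manpower
df$manpower_model[is_missing] <- predict(fit, newdata = test)

df
```

With gender as the only predictor, the linear model simply reproduces the group means, so the last two columns coincide here. A model with additional predictors would generally give different values.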
After dealing with missing values, the next task is to deal with outliers. We often neglect outliers while building models, a practice that should be discouraged. Outliers tend to skew your data and reduce accuracy. Let’s learn more about outlier treatment.
#### 3. Techniques of Outlier Detection and Treatment
##### What is an Outlier?
“Outlier” is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimates. Simply put, an outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.
Let’s take an example: we do customer profiling and find that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. The annual income of these two customers is much higher than that of the rest of the population. These two observations will be seen as outliers.
###### What are the types of Outliers?
Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier. These outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space. To find them, you have to look at distributions in multiple dimensions.
Let us understand this with an example. Say we are studying the relationship between height and weight. Below, we have the univariate and bivariate distributions for height and weight. Take a look at the box plot: we do not have any outliers (values more than 1.5 × IQR above the third quartile or below the first quartile, the most common method). Now look at the scatter plot: here, we have two values below and one above the average in a specific segment of weight and height.
Figure: Outlier, Multivariate Outlier
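The height and weight situation can be imitated with a small constructed data set. This is a sketch on hypothetical values, not the data behind the original figure: 21 points follow a linear height-weight pattern and one extra point has an ordinary height and an ordinary weight, but an unusual combination of the two. The Mahalanobis distance and the chi-squared cutoff at the 0.975 quantile are one common choice, assumed here for illustration.

```r
# 21 points on a linear pattern with a small deterministic wiggle,
# plus one point (the last) that breaks the joint pattern
base_height <- seq(150, 190, by = 2)
height <- c(base_height, 160)
weight <- c(0.9 * base_height - 85 + rep(c(-2, 0, 2), times = 7), 85)
hw <- cbind(height = height, weight = weight)

# Univariate check: the box plot rule (1.5 * IQR) finds nothing
uni_out <- c(boxplot.stats(height)$out, boxplot.stats(weight)$out)
length(uni_out)

# Bivariate check: squared Mahalanobis distance from the centre of the data
d2 <- mahalanobis(hw, center = colMeans(hw), cov = cov(hw))
cutoff <- qchisq(0.975, df = ncol(hw))
which(d2 > cutoff)  # includes observation 22, the off-pattern point

# Scatter plot with the most distant point highlighted
plot(height, weight, pch = 19,
     col = ifelse(seq_along(d2) == which.max(d2), "red", "black"))
```

The point is invisible in either box plot because each of its coordinates is unremarkable on its own; only the joint distance exposes it.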
###### What causes Outliers?
Whenever we come across outliers, the ideal way to tackle them is to find out why they occurred. The method for dealing with them then depends on the reason for their occurrence. Causes of outliers can be classified into two broad categories:
- Artificial (error) / non-natural
- Natural

Let’s look at the various types of outliers in more detail:
- Data entry errors: Human errors during data collection, recording, or entry can cause outliers. For example, the annual income of a customer is $100,000. Accidentally, the data entry operator adds an extra zero to the figure. The income now becomes $1,000,000, which is 10 times higher. Evidently, this will be an outlier when compared with the rest of the population.
- Measurement error: This is the most common source of outliers. It occurs when the measurement instrument turns out to be faulty. For example, there are 10 weighing machines: 9 of them are correct and 1 is faulty. The weights of people measured on the faulty machine will be higher or lower than those of the rest of the group, and can lead to outliers.
- Experimental error: Another cause of outliers is experimental error. For example, in a 100 m sprint with 7 runners, one runner missed the “Go” call and started late. This made his run time longer than that of the other runners, so his total run time can be an outlier.
- Intentional outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume. Only a fraction of them report the actual value. Here, the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
- Data processing error: Whenever we perform data mining, we extract data from multiple sources. Some manipulation or extraction errors may lead to outliers in the dataset.
- Sampling error: For instance, we have to measure the height of athletes and, by mistake, include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
- Natural outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance, in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.
###### What is the impact of outliers on a dataset?
Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:
- They increase the error variance and reduce the power of statistical tests
- If the outliers are non-randomly distributed, they can decrease normality
- They can bias or influence estimates that may be of substantive interest
- They can also violate the basic assumptions of regression, ANOVA, and other statistical models

To understand the impact more deeply, let’s take an example and check what happens to a data set with and without outliers.
Figure: Outlier, Mean, Median, Mode
As you can see, the data set with outliers has a significantly different mean and standard deviation. In the first scenario, we would say that the average is 5.45. But with the outlier, the average soars to 30. This would change the estimate completely.
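The effect is easy to reproduce in R. The values below are hypothetical: they were chosen so that the two means match the 5.45 and 30 quoted above, and are not necessarily the data shown in the original figure.

```r
# Hypothetical sample without and with a single extreme value
x     <- c(4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7)
x_out <- c(x, 300)

# Mean, median and standard deviation of a numeric vector
summary_stats <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
}

comparison <- rbind(without_outlier = summary_stats(x),
                    with_outlier    = summary_stats(x_out))
round(comparison, 2)
```

The mean and standard deviation change dramatically, while the median moves only from 5 to 5.5, which is why the median is often preferred as a summary when outliers are suspected.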
###### How to detect outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, such as box plots, histograms, and scatter plots (above, we used a box plot and a scatter plot). Some analysts also use various rules of thumb to detect outliers. Some of them are:
- Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles
- Capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
- Data points three or more standard deviations away from the mean are considered outliers
- Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding
- Bivariate and multivariate outliers are typically measured using an index of influence or leverage, or a distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
- In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures such as STUDENT, COOKD, RSTUDENT and others.
###### How to remove outliers?
Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very few in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The decision tree algorithm deals well with outliers because it bins the variables. We can also assign weights to different observations.
Figure: Variable Transformation, LOG
Imputing: As with missing values, we can also impute outliers, using mean, median, or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead and impute. We can also use a statistical model to predict the values of the outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different groups, build an individual model for each, and then combine the output.
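Two of the detection rules listed earlier and several of the treatments above can be sketched in base R. The vector is hypothetical, and the IQR rule is taken here as flagging values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Capping uses the IQR fences instead of the 5th and 95th percentiles, an assumption made because with only 12 values the 95th percentile is itself pulled up by the extreme value.

```r
# Hypothetical sample with one extreme value
x <- c(4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300)

# --- Detection ---
# IQR rule: fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_out_iqr <- x < lower | x > upper

# Standard deviation rule: more than 3 SD away from the mean
is_out_sd <- abs(x - mean(x)) / sd(x) > 3

which(is_out_iqr)
which(is_out_sd)

# --- Treatment ---
# Deleting observations flagged by the IQR rule
x_deleted <- x[!is_out_iqr]

# Capping: pull values outside the fences back to the nearest fence
x_capped <- pmin(pmax(x, lower), upper)

# Transforming: the natural log compresses extreme values
x_log <- log(x)

# Imputing: replace flagged values with the median of the remaining values
x_imputed <- replace(x, is_out_iqr, median(x[!is_out_iqr]))

range(x_capped)
range(x_log)
```

Which treatment is appropriate still depends on whether the outlier is artificial or natural, as discussed above; the code only shows the mechanics.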
So far, we have covered the steps of data exploration, missing value treatment, and techniques of outlier detection and treatment. These three stages will improve your raw data in terms of information availability and accuracy. Let’s now proceed to the final stage of data exploration: feature engineering.
### Numerical summaries
#### Tables
One variable
More than two variables
Measures of central tendency
Measures of central tendency for grouped data
Measures of dispersion
### Visual summaries
#### Components of a ggplot2 plot
Plots convey information through various aspects of their aesthetics. Some aesthetics that plots use are:
- x position
- y position
- size of elements
- shape of elements
- color of elements
The elements in a plot are geometric shapes, like
- line segments
Some of these geometries have their own particular aesthetics. For instance:
- point shape
- point size
- line type
- line weight
- y minimum
- y maximum
- fill color
- outline color
- label value
The “gg” in ggplot2 stands for “Grammar of Graphics”, which is based on the principle that a plot can be split into the following basic parts:
Plot = data + Aesthetics + Geometry
- Data refers to a data frame (dataset).
- Aesthetics indicates x and y variables. It is also used to tell R how data are displayed in a plot, e.g. color, size and shape of points etc.
- Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot, density plot, dot plot etc.)
Also, other components of a plot are
- Faceting implies the same type of graph can be applied to each subset of the data. For example, for variable gender, creating 2 graphs for male and female.
- Annotation lets you add text to the plot.
- Summary Statistics allows you to add descriptive statistics on a plot.
- Scales are used to control x and y axis limits
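The components listed above map directly onto ggplot2 code. The following is a minimal sketch using the built-in `mtcars` data set; the choice of variables, limits and label text is an assumption made for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                               # data
            aes(x = wt, y = mpg, color = factor(cyl))) + # aesthetics
  geom_point(size = 2) +                                 # geometry
  facet_wrap(~ am, labeller = label_both) +              # faceting: one panel per value of am
  annotate("text", x = 4.5, y = 33, label = "mtcars") +  # annotation (repeated in each panel)
  scale_y_continuous(limits = c(10, 35)) +               # scales: y axis limits
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

p
```

Each component is added with `+`, so swapping `geom_point()` for another geometry or removing the `facet_wrap()` call changes one aspect of the plot without touching the rest.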
#### Important plots
|Plot|Functions|
|---|---|
|Bar Chart|geom_bar(), geom_errorbar()|
|Histogram|geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()|
|Box Plot|geom_boxplot(), stat_boxplot(), stat_summary()|
|Line Plot|geom_line(), geom_step(), geom_path(), geom_errorbar()|
## Basic statistical tests
### Which treatment is better? Comparing groups
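    library(tidyr)    # gather() and the %>% pipe
    library(ggplot2)  # ggplot(), geom_boxplot()

    # Simulate two groups of 100 normally distributed values
    a <- rnorm(100) * 5 + 12  # mean 12, sd 5
    b <- rnorm(100) * 2 + 13  # mean 13, sd 2
    ab <- data.frame(a, b)
    head(ab)

    # Reshape from wide to long: one row per value, labelled by group
    ab <- ab %>% gather(key = "group", value = "value")
    head(ab)

    # Compare the two groups visually
    ab %>% ggplot(aes(x = group, y = value)) + geom_boxplot()

    # One-sample t-test: does the mean of a differ from 15?
    t.test(a, mu = 15)
    # Two-sample t-test, with vectors and with the formula interface
    t.test(a, b)
    t.test(value ~ group, data = ab)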
#### Which treatment has more successes? Comparing proportions
##### Chi-squared and Fisher's exact test
#### Which treatment reduces pain more? Comparing means
##### t-test and ANOVA
### Is x associated with y? Exploring associations
#### Logistic regression
### Which factors are associated with the risk of disease?
### Which treatment has a better prognosis? Survival analysis