R project

Introduction: This project contains an exploratory data analysis using R and GGobi software. R is a powerful tool for data analysis. With it we can carry out all kinds of descriptive statistics and statistical inference. In addition, we can fit models such as regression.

In this project we used employee data to analyze the salary of associates. The salary of an associate depends on many factors. The variables included in this data are education (in years), beginning salary, previous experience (in years), job time, and minority status.

The statistical tools used are:

1. Mean, median, mode

2. Cross-tabulation analysis

3. Scatter plots

4. Correlation analysis

5. Regression analysis

The list below shows the R functions we used.

R function      Use
read.table()    Reads the employee text file
attach()        Makes the variables of the data frame available by name in the R session
colnames()      Lists the column names of the employee data
summary()       Gives descriptive statistics: minimum, 1st quartile, median, mean, 3rd quartile, maximum
plot()          Draws a scatter plot
pairs()         Draws scatter plots of all pairs of variables together
table()         Performs the cross-tabulation analysis
cor()           Computes the correlation between variables
lm()            Fits the linear regression

 

 

 

#To read the text file into R we used the read.table() function.

Emp <- read.table("D:/CARS/Project-cars/Ranadeeb/emp.txt", header = TRUE)
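The same read.table() call can be tried on a file that is created on the fly. This is only a minimal sketch: the file contents and the name EmpDemo are made up for illustration and are not the real employee data. Only the column names follow the ones used in this project.

```r
# Hypothetical stand-in for emp.txt: whitespace-separated, header in the first row.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "educ jobcat salary salbegin jobtime prevexp minority",
  "15 3 57000 27000 98 144 0",
  "16 1 40200 18750 98 36 0",
  "12 1 21450 12000 98 381 0",
  "8 1 21900 13200 98 190 1"
), tmp)

# header = TRUE tells read.table() that the first line holds the column names.
EmpDemo <- read.table(tmp, header = TRUE)

str(EmpDemo)    # column types
head(EmpDemo)   # first rows
```

Checking the result with str() right after the import helps to catch a wrong separator or a missing header early.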

 

#To refer to the columns of the imported table directly, we use the attach() function.

#Once the table is attached, we can use the individual variable names for analysis

#and for plotting graphs.

 

attach(Emp)
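attach() is convenient, but an attached column can be hidden by another object of the same name. As a hedged alternative, the sketch below reaches the columns through `$` and with(), which need no attach(). The data frame EmpDemo and its values are hypothetical.

```r
# Hypothetical miniature of the employee data.
EmpDemo <- data.frame(
  educ   = c(15, 16, 12, 8),
  salary = c(57000, 40200, 21450, 21900)
)

# Access a column explicitly through the data frame.
mean_salary <- mean(EmpDemo$salary)

# Evaluate an expression with the columns visible only inside with().
mean_educ <- with(EmpDemo, mean(educ))

mean_salary
mean_educ
```

If attach() is used anyway, a matching detach(Emp) at the end of the session removes the data frame from the search path again.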

 

#To find the column names we use the colnames() function. This displays

#the column names of the data frame.

 

colnames(Emp)

#The variables present in the emp data are

#1. educ (the number of years of education)

#2. jobcat (the category of the job. 1 represents

#trainee, 2 represents team member, 3 represents manager level)

#3. salary (the current salary)

#4. salbegin (the salary when the job was first started)

#5. jobtime

#6. prevexp (the previous experience of the employee)

#7. minority (0 means no and 1 means yes)

 

#summary() gives the overall descriptive statistics for all quantitative variables.

#The descriptive statistics are listed below.

#1. Min: the minimum value of the given data.

#2. 1st Qu: the first quartile, that is, the 25th percentile of the data.

#3. Median: the middle value of the data set when the data are arranged

#in ascending or descending order.

#4. Mean: the sum of the observations divided by the total number of observations.

#5. 3rd Qu: the third quartile, that is, the 75th percentile; 75% of the data lie at or below it.

#6. Max: the maximum value of the data set.

 

summary(Emp)

summary(salary)
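The six values reported by summary() can also be computed one by one, which shows what each of them means. The sketch below does this for a made-up salary vector (salary_demo is hypothetical, not the employee data). It also computes the mode, for which base R has no dedicated function.

```r
# Hypothetical salaries, already small enough to check by hand.
salary_demo <- c(21450, 21900, 21900, 40200, 57000)

stats <- c(
  Min    = min(salary_demo),
  Q1     = unname(quantile(salary_demo, 0.25)),  # 25th percentile
  Median = median(salary_demo),
  Mean   = mean(salary_demo),
  Q3     = unname(quantile(salary_demo, 0.75)),  # 75th percentile
  Max    = max(salary_demo)
)
stats

# Mode: tabulate the values and take the most frequent one.
# If several values tie, which.max() returns the first of them.
freq <- table(salary_demo)
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_value
```

The quartiles here use the default method of quantile(); other software may interpolate slightly differently on small samples.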

 

 

 

# To get a cross tabulation we can use the table() function. This gives the cross tabulation of job category versus minority.

table(jobcat,minority)
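A cross tabulation is easier to read with labelled dimensions, totals and row proportions. The following is a small sketch on made-up vectors (jobcat_demo and minority_demo are hypothetical) and uses only base R functions.

```r
# Hypothetical job categories (1, 2, 3) and minority flags (0, 1).
jobcat_demo   <- c(1, 1, 2, 3, 1, 2)
minority_demo <- c(0, 1, 0, 0, 1, 0)

# Naming the arguments labels the dimensions of the result.
tab <- table(jobcat = jobcat_demo, minority = minority_demo)
tab

# Add row and column totals.
addmargins(tab)

# Proportion of each minority group within a job category (rows sum to 1).
row_share <- prop.table(tab, margin = 1)
row_share
```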

 

# To draw a scatter plot we can use the command plot(x1,x2), where x1 goes on the horizontal axis and x2 on the vertical axis.

plot(salary,salbegin)

# There is a positive correlation between salary and salbegin.

# When the beginning salary is high, the current salary also tends to be high.
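Axis labels and a fitted line make such a scatter plot easier to interpret. The sketch below is one possible way to draw it, with made-up values (salbegin_demo and salary_demo are hypothetical). It places the beginning salary on the horizontal axis, which is a choice of this example rather than of the original plot.

```r
# Hypothetical beginning and current salaries.
salbegin_demo <- c(12000, 13200, 18750, 27000)
salary_demo   <- c(21450, 21900, 40200, 57000)

# Scatter plot with explicit axis labels and a title.
plot(salbegin_demo, salary_demo,
     xlab = "Beginning salary", ylab = "Current salary",
     main = "Current vs. beginning salary")

# Add the least-squares line to make the trend easier to see.
fit <- lm(salary_demo ~ salbegin_demo)
abline(fit)
```

An upward-sloping line corresponds to the positive correlation described above.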

 

plot(educ,salary)

#When we plot educ against salary, there is a positive correlation

#between these two variables.

 

plot(educ,salbegin)

# There is a positive correlation between education and beginning salary.

#To see all the correlations together in one set of scatter plots, we can use the pairs() function.

pairs(Emp)

 

#To calculate the correlation between all variables we can use the cor() function.

#The correlation is the degree of association between two variables.

#The correlation is classified into three types.

#1. Positive correlation (where the correlation is close to +1)

#2. Negative correlation (where the correlation is close to -1)

#3. Zero correlation (where the correlation is close to zero)

 

cor(Emp)
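cor() accepts either a whole data frame of numeric columns or two single vectors. The sketch below shows both forms on a made-up data frame (EmpDemo and its values are hypothetical). It relies on the default Pearson method of cor().

```r
# Hypothetical miniature of the employee data (numeric columns only).
EmpDemo <- data.frame(
  educ     = c(15, 16, 12, 8),
  salbegin = c(27000, 18750, 12000, 13200),
  salary   = c(57000, 40200, 21450, 21900)
)

# Correlation matrix of all pairs of columns.
r_matrix <- cor(EmpDemo)
round(r_matrix, 2)

# Correlation between two chosen variables only.
r_pair <- cor(EmpDemo$salary, EmpDemo$salbegin)
r_pair
```

The matrix is symmetric with ones on the diagonal, because every variable is perfectly correlated with itself.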

#As the name suggests, in simple terms correlation is the relationship between 2 variables.

#Correlation is defined as the “degree of strength of association between 2 variables”.

# Here 2 variables does not mean 2 values of the same variable, but two different variables.

# Correlation measures how strongly the 2 variables are related to each other, but it does not explain cause and effect.

 

#Usually, correlation is computed between a dependent variable and an independent variable,

#to measure how strongly the independent variable is associated with the dependent variable.

#The researcher looks at things that already exist and determines if and in what way those things are related to each other.

#The purpose of doing correlations is to allow us to make a prediction about one variable based on what we know about another variable.

 

#The correlation coefficient is denoted by a lowercase r. The value of r always lies between -1 and +1.

# To examine the correlation, we usually draw a scatter plot of the x values (independent variable) against the y values (dependent variable).
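To see where r comes from, the Pearson coefficient can be computed by hand and compared with cor(). The sketch below assumes the usual Pearson definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The vectors x and y are made-up values.

```r
# Hypothetical independent (x) and dependent (y) values.
x <- c(12000, 13200, 18750, 27000)
y <- c(21450, 21900, 40200, 57000)

# Deviations from the means.
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation coefficient from its definition.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual

# The built-in function should agree up to floating-point error.
all.equal(r_manual, cor(x, y))
```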

 

#A correlation tells us that the two variables are related, but we cannot say anything about whether one caused the other.

#This method does not allow us to come to any conclusions about cause and effect. Correlation is not Causation!

 

 

#Regression analysis

#Regression is often described as an extension of correlation.

#Regression is the “functional relationship between dependent and independent variables”.

#Regression describes how the dependent variable changes with the independent variables, which correlation leaves open. Like correlation, it does not by itself prove cause and effect.

#We predict future values of a variable from the actual prior values.

#Regression forms a basis for estimation and forecasting that takes the fitted relationship between the variables into account.

 

#Definition: Regression analysis is the process of constructing a mathematical/statistical model or function that can be used to predict or determine one variable from other variables.

 

#Regression involves two or more variables. The variable that is predicted is called the dependent variable and is designated as Y.

#The variables that help the prediction are called independent variables or explanatory variables and are designated as x1, x2, x3, ..., xn.

#The values of the x variables are known or have actually happened, and we predict the Y variable based on them.

#Here, only a straight-line (linear) relationship between the dependent and independent variables is assumed.

 

#Regression analysis was carried out for the employee data.

#The dependent variable is salary and the independent variables are beginning salary, education and previous experience.

 

lm(salary~educ+salbegin+prevexp)
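Storing the fitted model in an object gives access to the coefficients, the summary and predictions. The sketch below fits the same formula to a made-up data frame (EmpDemo, new_emp and all values are hypothetical), so its coefficients will not match the ones obtained from the real employee data.

```r
# Hypothetical miniature of the employee data.
EmpDemo <- data.frame(
  educ     = c(15, 16, 12, 8, 15, 15, 15, 12),
  salbegin = c(27000, 18750, 12000, 13200, 21000, 13500, 18750, 9750),
  prevexp  = c(144, 36, 381, 190, 138, 67, 114, 0),
  salary   = c(57000, 40200, 21450, 21900, 45000, 32100, 36000, 21900)
)

# The data argument avoids the need for attach().
fit <- lm(salary ~ educ + salbegin + prevexp, data = EmpDemo)

coef(fit)      # intercept and one slope per independent variable
summary(fit)   # standard errors, t tests, R-squared

# Predict the salary of a hypothetical new employee.
new_emp <- data.frame(educ = 16, salbegin = 20000, prevexp = 24)
pred <- predict(fit, newdata = new_emp)
pred
```

summary(fit) is usually the first thing to inspect, since it shows whether each coefficient is distinguishable from zero and how much of the variation in salary the model explains.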

 

#The model above runs a linear regression with salary as the dependent variable and education, beginning salary and previous experience as the independent variables. The fitted model is:

#salary = -3661.51 + 735.956*educ + 1.1749*salbegin - 16.730*prevexp
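The fitted equation can be used directly to estimate a salary. The sketch below plugs the reported coefficients into a small helper function. Two things are assumptions of this example: the intercept is read as -3661.51, and the employee values (15 years of education, beginning salary 15000, previous experience 5) are invented. The function name predict_salary is hypothetical as well.

```r
# Coefficients as reported in the text; the intercept is read as -3661.51 (assumption).
b <- c(intercept = -3661.51, educ = 735.956, salbegin = 1.1749, prevexp = -16.730)

# Evaluate the linear equation for given values of the independent variables.
predict_salary <- function(educ, salbegin, prevexp) {
  b[["intercept"]] + b[["educ"]] * educ +
    b[["salbegin"]] * salbegin + b[["prevexp"]] * prevexp
}

# Hypothetical employee.
estimate <- predict_salary(educ = 15, salbegin = 15000, prevexp = 5)
estimate
```

Each coefficient is the expected change in salary for a one-unit increase in that variable while the other two are held fixed. For example, one more year of education is associated with about 735.956 more in salary.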