Nature and Sources of Statistical Data
Statistical investigations are of different types. The most important of these include economic and social inquiries such as human population censuses, family budget inquiries, housing surveys, public opinion polls, surveys of industrial production, rural economic surveys, health surveys and a host of others. The conduct of these inquiries demands, among other things, the collection of relevant data.
The data we collect for any statistical study may be qualitative or quantitative. Qualitative data are those used for describing characteristics which cannot be defined in numerical terms; for example, colour of hair, colour of eyes, items described as defective or non-defective, performances graded as excellent, good, average, poor. These characteristics are called attributes. Quantitative data, on the other hand, are data that are capable of numerical description. Examples include data on the wages of workers in naira, weights of students in kilogrammes, heights of students in meters, scores of students in per cent, etc.
In carrying out a statistical inquiry, the investigator could collect the necessary data himself or decide to extract data from already published sources if available. Hence, with regard to the source of data, we can distinguish two categories of statistical data; namely, primary data and secondary data.
Primary data are data collected by the investigator himself for the purpose of statistical analysis. For example, every year the Ministry of Finance and Economic Development, Enugu, collects data on the various aspects of the rural economy of Anambra State: the rural population, land area, yield of crops, livestock, etc. Such data are published by the Ministry in its Annual Rural Economic Survey Report. To this particular Ministry the data are primary. An outsider who is interested in any aspect of the rural economy of Anambra State for any particular year may decide to consult the Ministry’s Rural Economic Survey Report for that year for the necessary data. To this outsider such data, or official statistics as they are sometimes called, are secondary.
The Federal Office of Statistics (F.O.S.) is the main publisher of official statistics in Nigeria. Its quarterly Digest of Statistics, Annual Abstract of Statistics and monthly Economic Indicators contain statistical data on the economic and social aspects of the country: its population, population density, age distribution, data on manpower resources, industrial and agricultural production, price movements, external trade, migration statistics, etc. Any investigator interested in any of these aspects of the country’s economy would find it more convenient and less expensive to extract data from these published sources of the F.O.S. Such data would be secondary to this investigator because he did not collect them himself.
METHODS OF DATA COLLECTION
Data collection is a very important stage in any statistical investigation. In fact, the soundness of the method employed in the collection of statistical data determines, to a large extent, the success of the inquiry.
The most commonly used methods of data collection are:
Interview Method
By this method data are collected from the informants by trained agents called enumerators. These agents visit the informants in their houses or offices, in the markets or on the street, as the case may be, ask the necessary questions and enter the replies in special blanks called schedules. This method is mostly used for such surveys as human population census, rural and urban household inquiries, market prices survey, family expenditure survey and other statistical inquiries where personal contact of the interviewer with the respondent is necessary.
Mailing of Questionnaires
The method of mailing of questionnaires is usually employed by Social Survey Units of research organisations and public opinion institutions to gather information regarding the views of respondents on some important socio-economic and political problems of the country. Market research units of modern business concerns employ this method to assess people’s attitude towards their products or services.
By the mailing of questionnaires method, a list of questions, usually called a questionnaire, is mailed to the informant, who is expected to answer them and return the completed forms to the office of origin. The success of this method depends very much on the efficiency of the postal and telecommunication facilities available in the country. There is no doubt, therefore, that this method is more efficient in the developed countries than in the under-developed ones.
Method of Registration
This is a method whereby data are collected by keeping records of events immediately they occur, or soon after their occurrence. By this method information is collected through registration of births, deaths, marriages, divorces, immigration and emigration, motor accidents, industrial accidents, etc. The registration method is more efficient in developed countries than in the developing ones. Mass illiteracy in the latter leads to incomplete record keeping. For this method to be effective, it would be necessary that the governments of developing countries make registration of vital events legally compulsory.
FREQUENCY DISTRIBUTION
By a frequency distribution we mean a table showing the values of a variable together with the number of times these values occur (their frequencies).
A variable or a variate refers to any entity that varies. Measures like weight, height, income, are variables because they vary from one person to another. Yield is a variable associated with crops. We can think of many other examples of a variable. A variable can assume different values. We use the term variate-values to refer to the values taken by a variable.
Consider the variable “monthly income of a group of workers”. Let us denote this variable by X. Suppose there are five workers in this group with monthly incomes given as follows:
N60, N75, N83, N78, N86. Then we say that the variable X (monthly income of a sample of five workers) assumes values as follows: x1 = N60, x2 = N75, x3 = N83, x4 = N78, x5 = N86.
These values are the variate-values of X.
In a similar manner, the weights of four students would be the variate-values of the variable X, “weight of students”. These values could be, say, as follows:
All the values which the variable X can take on are called the possible values of X. Those possible values of X that have actually been observed in the process of data collection are referred to as actually observed values of the variable X.
The Summation Sign
In the analysis of statistical data, we very often have to work with the sums of variate-values. For example, we may wish to find the sum of the monthly incomes of the sample of five workers mentioned earlier.
The Greek letter Σ (sigma) is the notation for summation, and it is read as “summation of”. We can therefore write the sum of the monthly incomes of the sample of five workers in the last section as:
Σx = N60 + N75 + N83 + N78 + N86 = N382
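As a minimal sketch, the summation of the five monthly incomes can be carried out in Python. The list name `incomes` is our own label for the variate-values of X and does not come from the text.

```python
# Monthly incomes (in naira) of the five workers in the example.
incomes = [60, 75, 83, 78, 86]

# sum() adds the variate-values one after another,
# which is exactly what the sigma notation abbreviates.
total = sum(incomes)
print(total)
```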
Discrete and Continuous Variables
A variable could be discrete or continuous.
By a discrete variable we mean a variable that can assume only discrete (integral) values like 0,1,2,3,…, etc. For example, let the variable X be the number of children in a group of families in a certain city. The possible values of X are 0,1,2,3,…, 10, etc. There is a jump or break between say 1 and 2, 2 and 3, and so on. The variable X (number of children) does not assume any fractional values such as 2.5, 3.5, etc. Hence, the discrete variable is represented by count data presented as whole numbers.
A continuous variable is one that assumes a continuum of values within the range of its observed lowest and highest values. This means that a continuous variable can take on both integral and fractional values. It can assume values with fractions such as: 62.5 kg, 74.6 kg, 84.3 kg, etc. The variable X, “height of workers”, is a continuous variable. In statistics a continuous variable is represented by measurement data.
Distribution of the Monthly Income of a Group of Workers
This table is called a frequency distribution, or simply, a frequency table. It shows the values of the variable X (income of workers) and the number of times these values occur (frequency).
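A frequency table of this kind can be tallied with a short Python sketch. The income figures below are hypothetical and chosen only for illustration; they are not the data of the table in the text.

```python
from collections import Counter

# Hypothetical monthly incomes (in naira); illustrative figures only.
incomes = [60, 75, 60, 83, 75, 60, 86, 75, 83, 60]

# Count how many times each variate-value occurs (its frequency).
frequency = Counter(incomes)

# Print the frequency table in ascending order of the variate-values.
print("Income (N)  Frequency")
for value in sorted(frequency):
    print(f"{value:>10}  {frequency[value]:>9}")
```

The frequencies always add up to the total number of observations, which gives a quick check on the tally.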
CONSTRUCTION OF A GROUPED FREQUENCY DISTRIBUTION
When we study large sets of data as in the table below, our main objective is, first and foremost, to present the information in a more precise and usable form. This is done by grouping our data into a number of classes, resulting in a grouped frequency distribution as we have in the table below.
The format for constructing a grouped frequency table could be summarized as follows:
We shall illustrate the construction of a grouped frequency table by considering the following example.
Suppose we have data on the scores of a group of 100 students in a statistics examination as in the table above. In order to prepare a grouped frequency distribution for the data on scores, we shall first decide on the number of classes to create.
A formula exists for determining approximately the number of classes to form in constructing a grouped frequency table. It is known as Sturges’ rule, and is given as:
No. of groups = 1 + 3.322 log n
where n = the number of observed values and the logarithm is taken to base 10. In our own case, n = 100. Hence, by Sturges’ rule, the appropriate number of classes to create is approximately equal to
1 + 3.322 log 100 = 1 + 3.322 × 2
= 1 + 6.644
= 7.644
≈ 8 groups
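The calculation can be sketched in Python as follows. The function name `sturges_classes` is hypothetical, and rounding the result up to the next whole number is our assumption; the text only says the figure is approximate.

```python
import math

def sturges_classes(n):
    """Approximate number of classes for n observations: 1 + 3.322 log10(n).

    The result is rounded up to a whole number (an assumption of this sketch).
    """
    return math.ceil(1 + 3.322 * math.log10(n))

# For the 100 examination scores: 1 + 3.322 * 2 = 7.644, rounded up.
print(sturges_classes(100))
```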
This rule only gives a guide and does not provide a substitute for the sound judgement of the investigator in deciding on the number of classes. In fact, Sturges’ rule gives unreasonable results when n is very small or very large. A decision as to the appropriate number of classes to create is more or less a commonsense one and depends, among other things, on the nature of the data, the range of values covered (that is to say, the dispersion of the values), and of course on the total number of items being studied. For samples of about 30 to 50 observations, five to seven classes will be sufficient. For large samples of about 100 observations, we can create between 10 and 12 classes. For samples of 150 or more observations, we can have up to 15 classes. For very large samples, fewer than ten classes will generally result in much loss of information. For example, grouping the data in Table 5.4.1 on the scores of 100 students into three groups as follows: 0-40, 40-80, and over 80, would have been very unsatisfactory. Obviously, much information is lost.
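Once the number of classes has been settled, the tallying itself can be sketched in Python. The scores, the class width of 10 and the lower limit of 30 below are all assumptions made for illustration; they are not the data of Table 5.4.1.

```python
# Hypothetical examination scores (per cent); illustrative only.
scores = [35, 42, 47, 51, 53, 54, 58, 60, 61, 66, 68, 72, 75, 81, 90]

class_width = 10  # assumed width of each class
lowest = 30       # assumed lower limit of the first class

# Tally each score into the class whose lower limit is the largest
# multiple of the class width (counted from `lowest`) not exceeding it.
table = {}
for score in scores:
    lower = lowest + ((score - lowest) // class_width) * class_width
    table[lower] = table.get(lower, 0) + 1

# Print the grouped frequency table; the labels assume whole-number scores.
for lower in sorted(table):
    print(f"{lower}-{lower + class_width - 1}: {table[lower]}")
```

Using non-overlapping class limits such as 30-39 and 40-49 avoids any doubt about the class to which a boundary score belongs.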
THE MODE
The mode of a set of discrete variate-values is the value that occurs most frequently; that is, the value that has the highest frequency. Suppose we have values as follows: 50, 61, 42, 75, 75, 64. The mode here is 75. The value 75 occurs twice while the other values occur only once. For the above set of data we have one mode. When a distribution has one mode, it is said to be a unimodal distribution. Diagrammatically, a unimodal distribution is one with one peak, as in the diagram below:
Consider the following values: 50, 75, 65, 75, 50, 81. Here, there are two modes: 50 and 75. These values occur twice each while the other values occur only once. When a distribution has two modes, it is said to be a bimodal distribution. Fig. 6.2.2 is the curve of a distribution with two modes.
A Bimodal Distribution
A distribution that has three modes is called a trimodal distribution. A distribution may have no mode at all. This occurs when all the variate-values have the same frequency, i.e., none of the values occurs more frequently than the others. A distribution that has no mode is called a rectangular or uniform distribution.
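The search for the mode or modes of a set of discrete values can be sketched in Python. The function name `modes` is hypothetical; it returns an empty list when every value has the same frequency, following the convention in this section that such a distribution has no mode.

```python
from collections import Counter

def modes(values):
    """Return the modes in ascending order, or [] when no value stands out."""
    counts = Counter(values)
    highest = max(counts.values())
    # No mode when several distinct values all share the same frequency.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([50, 61, 42, 75, 75, 64]))  # the unimodal example
print(modes([50, 75, 65, 75, 50, 81]))  # the bimodal example
print(modes([50, 61, 42]))              # every value occurs once
```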
THE MEDIAN
The median of a series is the value that divides the series into two equal halves. In order to find the median of a group of values, the values have to be arranged in an ascending or descending order of magnitude. If we have an odd number of items, the median will simply be the middle value. For example, suppose we are given the following data on the daily earnings of five daily paid workers: N2, N5, N3, N8, N6. We first rearrange the values in an ascending order of magnitude as follows: N2, N3, N5, N6, N8; and we readily notice that the median is the value of the 3rd item, that is, N5. The 3rd item is indeed the one that divides the series into two equal parts.
When we have an even number of items, the median will be half of the sum of the two middle values. For example, consider the daily earnings of six daily paid workers. Arranged in an ascending order of magnitude, the values are: N2, N3, N5, N6, N8, N10. The median will thus be the average of the 3rd and 4th values, i.e. (N5 + N6)/2 = N5.50.
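We can therefore formulate a general rule for locating the median value of a discrete series as follows: when we have n items in the series, the median value is the value of the [(n + 1)/2]th item. Hence, in the example of a series of the daily earnings of five workers (n = 5), the median was found to be the value of the (5 + 1)/2 = 3rd item. When n is even, (n + 1)/2 is not a whole number and the median lies midway between the two adjacent items; with n = 6, it is the 3.5th item, that is, the average of the 3rd and 4th values.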
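The rule for odd and even numbers of items can be sketched in Python. The function name `median` is our own; the two calls use the daily earnings (in naira) from the examples in this section.

```python
def median(values):
    """Median of a discrete series: middle value, or mean of the two middle values."""
    ordered = sorted(values)       # arrange in ascending order of magnitude
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the [(n + 1)/2]th item, which is index n // 2 counting from zero.
        return ordered[middle]
    # Even n: average of the two middle items.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 5, 3, 8, 6]))       # five workers
print(median([2, 3, 5, 6, 8, 10]))   # six workers
```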
The Median for Grouped Data
In practice, we are often required to find the median for a grouped frequency distribution. Like the mode, the median for grouped data could be found either:
Estimating the Median from the Ogive
We shall illustrate the graphic estimation of the median by using the ogive. The “less than” graph is reproduced thus:
The curve AB is the ogive of the distribution. The estimated value of the median is found by drawing a horizontal line from the 50% level (i.e. halfway up the Y-axis) to cut the curve AB at a point C. From the point C, we drop a perpendicular to cut the X-axis at the point M. The point M is the median we seek. The estimated value of the median is 54%.
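The graphic reading can be imitated numerically. The sketch below is an approximation under stated assumptions: the upper class boundaries and cumulative percentages are hypothetical (they are not the data behind the ogive in the text), and the ogive is treated as straight-line segments between plotted points, so the 50% level is found by linear interpolation.

```python
def median_from_ogive(upper_bounds, cumulative_percent):
    """Estimate the value at which a 'less than' ogive crosses the 50% level."""
    points = list(zip(upper_bounds, cumulative_percent))
    # Walk along consecutive plotted points until the 50% level is bracketed.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= 50 <= y1 and y1 > y0:
            # Point C lies on this segment; M is its projection on the X-axis.
            return x0 + (50 - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("the 50% level is not covered by the ogive")

# Hypothetical ogive: scores "less than" each bound, as cumulative percentages.
bounds = [40, 50, 60, 70, 80]
cumulative = [10, 38, 70, 90, 100]
print(median_from_ogive(bounds, cumulative))
```

A hand-drawn smooth curve may give a slightly different reading from the straight-line estimate.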
THE ARITHMETIC MEAN
The arithmetic mean (or simply the mean) of a group of values is the sum of these values divided by their number.
Suppose we have a variable X (say, monthly earnings of a group of workers). Let X assume the values x1, x2, x3, …, xn. Then, the arithmetic mean of these values, denoted as x̄ (read as “x bar” or “bar x”), is given as
x̄ = (x1 + x2 + x3 + … + xn)/n = Σx/n
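As a small illustration, the definition of the mean as the sum of the values divided by their number can be sketched in Python. The monthly incomes of the five workers used earlier in the chapter serve as the data; the variable names are our own.

```python
# Monthly incomes (in naira) of the five workers from the earlier example.
incomes = [60, 75, 83, 78, 86]

# Arithmetic mean: the sum of the values divided by their number.
mean = sum(incomes) / len(incomes)
print(mean)
```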