Frequently Asked Questions
Data Variables and Codes
What do the missing value codes .A, .B, .C, and .D mean?
SAS provides 27 missing values (.A, .B, .C, ..., .Z, and .) that are all treated as missing by statistical procedures. The RLMS data take advantage of this feature to code missing values as follows:
.A = not applicable
.B = does not know
.C = refuses to answer
.D = does not answer
. = legitimate missing (due to skip instruction)
Those who are converting the data to a software product that does not provide multiple missing value codes, and who want to preserve these distinctions, may use the following code in SAS to convert the missing values to numeric values. Before making these changes, confirm that the variables you are converting do not contain any legitimate negative values in the range -6 to -9, and edit the code as needed.
libname in xport 'c:\rlms\rminanth.906.xpt';
Since "not applicable" and "legitimate missing" are equivalent, some files use the "." missing value for both meanings (that is, you will not find ".A" in those files).
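For readers working outside SAS, the same recoding idea can be sketched in Python. This is only an illustration under stated assumptions: the text gives the target range (-6 to -9) but not which code receives which value, so the mapping below is hypothetical, as is the representation of the codes as the strings ".A", ".B", and so on. The plain "." is left as a true missing value.

```python
# Hypothetical mapping: the source gives only the range -6 to -9,
# not which missing code maps to which numeric value.
MISSING_CODES = {".A": -6, ".B": -7, ".C": -8, ".D": -9}


def has_conflict(values):
    """True if a legitimate numeric value already falls in the reserved range."""
    reserved = set(MISSING_CODES.values())
    return any(v in reserved for v in values if isinstance(v, (int, float)))


def recode_missing(value):
    """Replace a lettered missing code with its numeric stand-in."""
    if value == ".":
        return None  # plain "." stays a true missing value
    return MISSING_CODES.get(value, value)


column = [3, ".B", 12, ".", ".D"]
conflict = has_conflict(column)           # check before recoding
recoded = [recode_missing(v) for v in column]
```

Running the conflict check first mirrors the advice above: if it reports a conflict, choose a different set of numeric stand-ins before recoding.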
Why are some questions in the questionnaire but not in the data file?
These questions were added at the request of a funding agency in Russia and are not for public distribution.
What are the "Text" data?
Many responses to questions were recorded verbatim and were not coded into categories. These questions can be identified in the questionnaire by the note "(char)" under the variable name to the left of the question. Since these responses are unique to the respondent, the likelihood of disclosing the respondent's identity is high. Therefore, the text data are not distributed with the remaining variables. Moreover, these responses are in Russian. Access to the text data requires stricter IRB review than does access to most of the data.
Sample Design and Methods
Does the sampling method use stratification?
Yes. A multistage probability sample was employed to draw the sample of dwelling units. First, a list of 2,029 consolidated raions (similar to counties) was created from which to draw primary sample units (PSUs). These were allocated into 38 strata, based largely on geographical factors and level of urbanization, but were also based on ethnicity where there was salient variability. As in many national surveys involving face-to-face interviews, some remote areas were eliminated to contain costs; also, Chechnya was eliminated because of armed conflict. From among the remaining raions (containing more than 95.5 percent of the population), three very large population units were selected with certainty: Moscow city, Moscow Oblast, and St. Petersburg city each constituted a self-representing (SR) stratum. The remaining non-self-representing raions (NSRs) were allocated to 35 strata of roughly equal size. One raion was then selected from each NSR stratum using the method "probability proportional to size" (PPS). That is, the probability that a raion in a given NSR stratum was selected was directly proportional to its measure of population size.
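The PPS step can be illustrated with a minimal sketch. It is not the procedure used to draw the RLMS sample; the raion names and population sizes are made up, and the standard-library function random.choices is used only to show what "probability proportional to size" means for a single draw from one stratum.

```python
import random


def select_pps(units, rng):
    """Pick one unit with probability proportional to its size measure.

    `units` is a list of (name, size) pairs; `rng` is a random.Random instance.
    """
    names = [name for name, _ in units]
    sizes = [size for _, size in units]
    return rng.choices(names, weights=sizes, k=1)[0]


# Hypothetical stratum: names and population sizes are illustrative only.
stratum = [("raion_a", 50_000), ("raion_b", 150_000), ("raion_c", 300_000)]

rng = random.Random(1994)
draws = [select_pps(stratum, rng) for _ in range(10_000)]

# raion_c holds 300,000 of 500,000 people, so it should be chosen
# in roughly 60 percent of repeated draws.
share_c = draws.count("raion_c") / len(draws)
```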
The target sample size was set at 4,000 dwelling units. They were distributed as follows: a total of 584 units was allocated to the three SR strata, which contained 14.6 percent of the Russian population. In accordance with the principles of PPS, the remaining 3,416 dwelling units were allocated fairly equally across the 35 NSR primary sampling units, since they were drawn from fairly equal-sized strata using PPS. However, to allow for a non-response rate of approximately 15 percent, in actuality we drew a sample of 4,718 dwelling units, with 940 allocated to the three SR strata. Oversampling was concentrated in large urban areas, where the highest non-response rate was expected.
Since there was no consolidated list of households or dwellings in any of the 38 selected PSUs, an intermediate stage of selection was then introduced, as usual. The selection of second-stage units (SSUs) differed depending on whether the population was urban (located in cities and "villages of the city type," known as "PGTs") or rural (located in villages). That is, within each selected PSU the population was stratified into urban and rural substrata, and the target sample size was allocated proportionately to the two substrata. For example, if 40 percent of the population in a given region was rural, 40 of the 100 dwelling units allotted to the stratum were drawn from villages.
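The proportional allocation in the example above amounts to simple arithmetic, sketched here. The function name is hypothetical, and rounding to the nearest whole dwelling is an assumption; the text does not say how fractional allocations were handled.

```python
def allocate_proportionally(total_units, rural_share):
    """Split a PSU's dwelling units between rural and urban substrata."""
    rural = round(total_units * rural_share)  # rounding rule is an assumption
    return {"rural": rural, "urban": total_units - rural}


# The example from the text: 40 percent rural, 100 dwelling units allotted.
allocation = allocate_proportionally(100, 0.40)
```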
In rural substrata, villages served as the SSUs. In urban substrata, SSUs were defined by the boundaries of 1989 census enumeration districts, if possible. If the necessary information was not available, the boundaries of 1994 microcensus enumeration districts, voting districts, or residential postal zones were employed--in decreasing order of preference. Approximately one SSU was selected for each 10 dwellings in the sample, using PPS where the SSUs differed appreciably in size. After SSUs were selected, an enumeration of dwelling units was made by visual inspection and recourse to official documents. Finally, the required number of dwellings was selected systematically starting with a random address in the list.
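The final step, systematic selection from a random starting address, might look like the sketch below. The wrap-around at the end of the list and the way the interval is computed are assumptions made for illustration; the text states only that selection was systematic and began at a random address.

```python
import random


def systematic_sample(addresses, n, rng):
    """Select n addresses at a fixed interval, starting at a random position."""
    interval = len(addresses) / n
    start = rng.randrange(len(addresses))
    # Wrap around the end of the list (an assumption, not stated in the text).
    return [
        addresses[int(start + i * interval) % len(addresses)]
        for i in range(n)
    ]


# Hypothetical enumeration list of 50 dwellings in one second-stage unit.
dwellings = [f"dwelling_{i:03d}" for i in range(1, 51)]
sample = systematic_sample(dwellings, 10, random.Random(7))
```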
What census did the post-stratification weights adjustment use?
The post-stratification adjustment first used the 1989 census (1994 microcensus). Starting with Round 13, we used the 2002 census results for calculating the post-stratification weights.
How was the sample adjusted when dwellings were demolished?
In districts where old houses (where our earlier respondents lived) were demolished, we replaced the demolished buildings with the new ones built on the same site, and the occupants of the new units were included in the sample. We did not create a special variable for these new households. The new households were sampled by the same procedure: from a list of all dwellings on the survey site, we drew a systematic sample at a fixed interval. The occupants of the old households (movers from the demolished buildings) were followed to their new addresses where possible.
Is there a variable that identifies whether a household was part of the original sample or a follow-up?
Yes, there are special variables in the RLMS data sets that identify original sample respondents (individuals and households). These variables have the words "inmover" or "hhmover" in their names (for individual data use inmover*, and for household data use hhmover*). Movers are respondents who are not from the original sample. Cases with *mover*=0 are therefore suitable for cross-sectional analysis, while the others (where *mover*=1) are for follow-up purposes only. You can also use the post-stratification weight variables (such as hhwgt_15) to identify movers, since their post-stratification weight is equal to zero.
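As a minimal sketch of this filtering, assume the household records have been loaded as dictionaries. The field names "hhmover" and "hhwgt" are placeholders for the round-specific variable names, and the values are invented.

```python
# Hypothetical household records; real variable names carry round-specific
# prefixes or suffixes (for example, hhwgt_15).
households = [
    {"id": 1, "hhmover": 0, "hhwgt": 1.12},
    {"id": 2, "hhmover": 1, "hhwgt": 0.0},
    {"id": 3, "hhmover": 0, "hhwgt": 0.87},
]

# Original-sample households, suitable for cross-sectional analysis.
cross_section = [h for h in households if h["hhmover"] == 0]

# The same split, recovered from the zero post-stratification weight.
movers_by_weight = [h for h in households if h["hhwgt"] == 0]
```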
How do I identify an individual who changed households?
If, for example, person number 1 in a household roster in round 6 left the household before round 7, then in the household roster for round 7 person number 1 will be coded as absent (h7inhh01=2) in that household. The reason for their absence in round 7 will be coded in h7whyn01 (1= they moved, 2= the household split, 3= they died).
If, for example, a household splits between rounds, then in the round 7 roster all members of the round 6 household are duplicated and will keep their round 6 roster number. They will be coded in the round 7 roster as absent in their old household and present with all personal characteristics in their new household. This approach simplifies keeping track of individuals between rounds.
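The roster coding described above can be read with a few lines of code. The sketch assumes that the variables for other roster positions follow the same naming pattern as h7inhh01 and h7whyn01 (for example, h7inhh02), which the text does not state explicitly, and that each household is one wide record.

```python
# Reason codes as given in the text for the h7whyn* variables.
WHY_ABSENT = {1: "moved", 2: "household split", 3: "died"}


def absent_members(roster_row, prefix="h7", max_persons=2):
    """Return {person number: reason} for roster members coded as absent (=2)."""
    result = {}
    for person in range(1, max_persons + 1):
        suffix = f"{person:02d}"
        if roster_row.get(f"{prefix}inhh{suffix}") == 2:
            reason = roster_row.get(f"{prefix}whyn{suffix}")
            result[person] = WHY_ABSENT.get(reason, "unknown")
    return result


# Hypothetical round 7 roster: person 1 has moved out, person 2 is not absent.
row = {"h7inhh01": 2, "h7whyn01": 1, "h7inhh02": 1, "h7whyn02": None}
absent = absent_members(row)
```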
What are the effects of loss to follow up?
The main effects are in the Moscow/St. Petersburg (SPB) sample. Because of high attrition, the Moscow/SPB sample was replaced with a new sample in round 10. Starting with 2001, the Moscow/SPB observations from the 1994 sample are no longer part of the cross-sectional RLMS sample. Most of these people did not actually move from their original addresses, but in RLMS terms we still mark them as movers (that is, "movers from the cross-sectional sample"). We can no longer calculate post-stratification weights for them, because the weights adjust the cross-sectional sample (which, as a whole, represents the all-Russia population) to the census data. For the non-cross-sectional part of the sample, there are no data to adjust to.
What is the main reason for losing respondents?
Refusals are much more common than the inability to find movers. Along with the refusals, another important reason is "no contact, nobody home during at least 3 visits." We can only count refusals or no-contacts at the address of a previously interviewed household, but we can almost never say whether the same people were not home or refused, or whether the dwelling now has a new household. So it is difficult to differentiate between refusals and no-contacts.
What are the response rates by round?
The response rate in the survey of households in the sample of dwelling units was 87.6 percent in Round 5, 82.1 percent in Round 6, 79.4 percent in Round 7, 77.7 percent in Round 8, 75.3 percent in Round 9, 57.9 percent in Round 10, 57.3 percent in Round 11, 54.8 percent in Round 12, 54.3 percent in Round 13, and 50.8 percent in Round 14. The response rate for individuals within interviewed households exceeded 97 percent in each round; thus the response rate for all individuals within sampled dwelling units was most likely just slightly lower than the corresponding figure for dwelling units.
The response rate for Rounds 10 through 14 cannot be directly compared with the response rate of the previous rounds. Because of the high attrition in the cross-sectional sample in Moscow and St. Petersburg during rounds 5 through 9, in Round 10 the cross-sectional sample in Moscow and St. Petersburg was replaced by a 100 percent new sample (using the same sample design). The comparison of the response rate can be made only for all other areas except Moscow and St. Petersburg cities. The response rate in the survey of households in the sample of dwelling units in all other areas except Moscow and St. Petersburg cities was 91.8 percent in Round 5, 87.3 percent in Round 6, 84.9 percent in Round 7, 83.4 percent in Round 8, 82.0 percent in Round 9, 80.3 percent in Round 10, 78.8 percent in Round 11, 76.8 percent in Round 12, 76.1 percent in Round 13, and 72.2 percent in Round 14.
Because of the decline in the response rate in big cities, the share of big cities in the sample fell below the required level and continued to decrease each round, so in Round 15 a further sample repair was done. We added new households to restore the share of each region in the sample (to make it equal to that of the 1994 sample). We used the same procedure for drawing the new addresses as in 1994. Unsurprisingly, the response rate at the new addresses was lower than at our old addresses. As a result, the response rate for the whole sample in Round 15 decreased: 44.9 percent for the whole RLMS sample and 55.9 percent for all areas other than Moscow and St. Petersburg cities. For the comparable part of Round 15 (the part that can be compared with previous rounds, excluding the new addresses added in Round 15), the response rate was 50.6 percent for the whole RLMS sample and 69.9 percent for all areas other than Moscow and St. Petersburg cities.
How do I link household-level data with data from individual household members?
For rounds 5 through 17, the combination of three variables (site, censusid, and family) uniquely identifies each household within a round. For rounds 18 and 19, the combination of two variables (region and family) uniquely identifies each household within a round. These variables are in files at both the household and individual levels.
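A minimal sketch of this linkage for rounds 5 through 17 follows. The records and the extra fields ("hh_size", "person") are invented for illustration; only the three key variables come from the text.

```python
KEY = ("site", "censusid", "family")


def key_of(record):
    """Build the composite household key from a household or individual record."""
    return tuple(record[k] for k in KEY)


# Hypothetical records from one round.
households = [{"site": 1, "censusid": 10, "family": 3, "hh_size": 2}]
individuals = [
    {"site": 1, "censusid": 10, "family": 3, "person": 1},
    {"site": 1, "censusid": 10, "family": 3, "person": 2},
    {"site": 2, "censusid": 11, "family": 1, "person": 1},  # no household row here
]

# Index households by key, then attach household data to each member.
hh_by_key = {key_of(h): h for h in households}
linked = [
    {**hh_by_key[key_of(p)], **p}
    for p in individuals
    if key_of(p) in hh_by_key
]
```

For rounds 18 and 19 the same approach would apply with KEY set to ("region", "family").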
How was the household head assigned?
The head of household is assigned according to the following demographic hierarchy: (1) the oldest working-age male in the household; (2) if there are no working-age males, the oldest working-age female; (3) if there are no working-age females, the youngest retirement-age male; (4) if there are no retirement-age males, the youngest retirement-age female; and finally (5) if there are no retirement-age females, the oldest child.
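The hierarchy can be expressed as an ordered list of rules, as in the sketch below. It assumes each household member has already been classified as "working", "retired", or "child"; the age cut-offs for those groups are not given here, so the classification itself is left outside the code. Field names are hypothetical.

```python
def assign_head(members):
    """Apply the five-step hierarchy and return the head's record (or None)."""
    rules = [
        ("working", "m", max),  # (1) oldest working-age male
        ("working", "f", max),  # (2) oldest working-age female
        ("retired", "m", min),  # (3) youngest retirement-age male
        ("retired", "f", min),  # (4) youngest retirement-age female
        ("child", None, max),   # (5) oldest child, either sex
    ]
    for group, sex, pick in rules:
        candidates = [
            m for m in members
            if m["group"] == group and sex in (None, m["sex"])
        ]
        if candidates:
            return pick(candidates, key=lambda m: m["age"])
    return None


# Hypothetical household with no working-age male.
household = [
    {"name": "grandmother", "group": "retired", "sex": "f", "age": 70},
    {"name": "mother", "group": "working", "sex": "f", "age": 40},
    {"name": "son", "group": "child", "sex": "m", "age": 10},
]
head = assign_head(household)
```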
Longitudinal vs. Cross-Sectional Samples
Why does the RLMS have two types of samples?
The RLMS Phase 2 sample has two parts: (1) the original sample addresses (drawn in 1994) and (2) the follow-up addresses. The original sample addresses make up a representative all-Russia sample, while the follow-up addresses are needed for panel analyses of individual changes. The panel is composed of people who were interviewed as part of the original sample at their original locations for at least one round and who then moved to a new address. When they moved, they left the representative sample. When they were interviewed at their new address, they were retained as part of the panel sample. Follow-ups allow us to observe individual changes over a number of years for more people. They help us to see what has happened by 2000 or 2005 to people with given characteristics in 1994. But for a cross-sectional analysis of a particular round, which requires a representative all-Russia sample for the given year, the follow-ups are not needed.
Are different types of households in the panel?
Yes, there are two types of panel households: (1) "movers," where all previously interviewed household members move to a new location, and (2) "split households," where a previously interviewed household splits into two different households, with "old" household members in both parts, and both parts are interviewed. With split households, only one of the parts can remain in the original sample; the other parts are interviewed as follow-ups.
For example, let's suppose that in Round 5 we interviewed a household that consisted of a couple (parents) and their adult son. Before Round 6 the son marries and starts to live in a separate household. Perhaps he and his wife move away from the parents, or they stay at the parents' address but as two distinct households who pay for food separately. Then, in Round 6 we interview the parents as one household and the son and his wife as another household. Both of these households have previously interviewed people, so both are "old" households. These two households have different BIDs (B identifier) but the same AIDs (A identifier) (as they come from one household of Round 5).
Note that in Round 6 there are no duplicates of the current round identifier, BID, but there are duplicates of the previous round identifier, AID, because of this household split. If you would like to merge a dataset with a previous round's data, use the previous round's *ID to avoid the problems associated with duplicates. For example, if you want to merge Round 6 data with Round 5, use AID (not BID).
How do round identifiers change over time?
In Round 5, AIDs were composed of household numbers from 1 to n within a population point. Starting with Round 6, BID, CID, DID, and later rounds' *IDs have household numbers from 1 to n within a census district. So, for many Round 6 households BID and AID will not match, although the household is the same. We made another global change in numbering in Round 15: two-digit household numbers in *IDs were replaced with three-digit household numbers.
Each data file has IDs for all previous rounds. NEVER calculate a previous round's ID from a later round's ID. Although doing so gives correct results for many cases within Rounds 6 through 14, it also gives wrong results for others. We supply all previous *IDs in each round so that the linking can be done correctly.
How are mover and split household identifiers constructed?
If a household moves as a whole (no split parts), it keeps its previous round *ID (for example, CID=BID). If a household splits, one of the parts keeps the "old" *ID (for example, CID=BID), and the other parts are given new *IDs. As a rule, the household number in a new ID starts at 51. This practice eliminates duplicates in current round IDs. For example, if a person in household 40008 in round 6 moved out of the household and was successfully followed to his new household, he was assigned a new ID, 40051, in round 7.
Are identifiers unique over time?
Only the current round's IDs are unique. For split households, previous rounds' *IDs will be duplicated. But if you use only original sample observations, both the current round's and previous rounds' *IDs will be unique (there are no split parts in the original sample). If you would like to merge a dataset with a previous round's data, use the previous round's *IDs. In SPSS choose the option "table" while matching (that is, link one observation from the previous round to several observations of the current round), using the previous round's *ID as the matching key, e.g., "match files /file=rlms7data /table=rlms6data /by BID."
An individual *ID number is calculated as the household *ID number multiplied by 100 plus the person number (the household member's number in the household roster).
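The individual ID arithmetic can be written out as a short sketch. The function names are hypothetical, and the check that the person number fits in two digits is an assumption implied by the multiplication by 100.

```python
def individual_id(household_id, person_number):
    """Household *ID multiplied by 100, plus the roster person number."""
    if not 1 <= person_number <= 99:
        raise ValueError("person number must fit in two digits")
    return household_id * 100 + person_number


def split_individual_id(ind_id):
    """Recover (household *ID, person number) from an individual *ID."""
    return divmod(ind_id, 100)


# Person 1 in household 40051 of the same round.
example_id = individual_id(40051, 1)
parts = split_individual_id(example_id)
```

Note that this only relates household and individual IDs within the same round; it is not a way to derive one round's ID from another's.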
What is the easiest way to construct a panel file?
There is no "easy" way at the moment. However, we supply a file that allows you to link a single identifier, IDIND, with each individual so that you can link individuals over time. That variable, and the variables needed to link with each cross-sectional file, are in RLMS_longi_identifiers.zip. Look for it at the bottom of the Data Downloads page under Household and Individual data.
That file also includes variables part5, part6, etc., indicating whether the respondent participated in Round 5, Round 6, etc. You can cross tabulate any two part* variables to find out how many respondents you can expect in a panel composed of those two rounds of data.
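A cross-tabulation of two part* variables can be sketched as follows. The records are invented, and the coding of part* as 1 for participation and 0 otherwise is an assumption; check the actual coding in the identifier file.

```python
from collections import Counter

# Hypothetical rows from the identifier file; 1 = participated (assumed coding).
respondents = [
    {"idind": 1, "part5": 1, "part6": 1},
    {"idind": 2, "part5": 1, "part6": 0},
    {"idind": 3, "part5": 0, "part6": 1},
    {"idind": 4, "part5": 1, "part6": 1},
]

# Count respondents in each (part5, part6) cell.
crosstab = Counter((r["part5"], r["part6"]) for r in respondents)

# Respondents present in both rounds form the two-round panel.
panel_size = crosstab[(1, 1)]
```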
To link households across rounds of data, always use the household identifier (AID, BID, etc.) from the earlier round. Here's a simple example linking R14 with R15:
use rnhhhous.dta
merge 1:m jid using rohhhous.dta
Note that this is a "one-to-many" merge. The term "1:m" in the merge statement allows for households in R14 to split in R15 by recognizing that there may be duplicates of the R14 identifier (JID) in R15.
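The same one-to-many logic can be sketched in Python for readers who do not use Stata. The records are invented, "r15_id" is a placeholder for the Round 15 identifier, and, unlike the Stata merge, this sketch keeps only matched households.

```python
# Hypothetical Round 14 households, one row per JID.
round14 = [
    {"jid": 1001, "members_r14": 3},
    {"jid": 1002, "members_r14": 2},
]

# Hypothetical Round 15 households; JID repeats where a household split.
round15 = [
    {"r15_id": 1001, "jid": 1001, "members_r15": 2},  # part that kept the old ID
    {"r15_id": 1051, "jid": 1001, "members_r15": 2},  # split part with a new ID
    {"r15_id": 1002, "jid": 1002, "members_r15": 2},
]

# The earlier round's identifier must be unique on the "one" side.
r14_by_jid = {row["jid"]: row for row in round14}
is_unique = len(r14_by_jid) == len(round14)

# Attach Round 14 data to every Round 15 household sharing that JID.
merged = [
    {**r14_by_jid[row["jid"]], **row}
    for row in round15
    if row["jid"] in r14_by_jid
]
```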
Economic Constructed Variables
How have the income and expenditure variables been adjusted?
In the household file, the income and expenditure variables are provided in both nominal and deflated form. Nominal (non-deflated) variables are marked with 'n' and deflated (real) variables with 'r'; for example, tincm_nm and tincm_rm. The inflation index converts values to June 1992 (the start of the survey). The values have not been adjusted for regional differences. Note that the RLMS sample has NOT been designed to be regionally representative, so the researcher is cautioned not to interpret the data at the regional level. The regional deflator (CPI) for individual earnings can be found at: http://www.gks.ru/bgd/regl/b08_17/IssWWW.exe/Stg/02-06.htm.
The table below gives the CPI for the seven federal districts. Unfortunately, information is available only for 2004 through 2007.
Consumer price index, subjects of the Russian Federation
(December to December of previous year, in %)
| | 2004 | 2005 | 2006 | 2007 |
|---|---|---|---|---|
| Russian Federation | 111.7 | 110.9 | 109.0 | 111.9 |
| Central Federal District | 112.1 | 110.5 | 109.0 | 112.2 |
| Northwestern Federal District | 112.3 | 111.2 | 109.5 | 112.6 |
| Southern Federal District | 112.0 | 112.1 | 109.0 | 112.1 |
| Volga Federal District | 112.4 | 110.2 | 108.7 | 113.1 |
| Urals Federal District | 110.4 | 111.7 | 110.2 | 110.9 |
| Siberian Federal District | 111.2 | 110.5 | 108.6 | 110.8 |
| Far Eastern Federal District | 111.3 | 113.3 | 108.8 | 109.6 |
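Because each figure is a December-to-December index, the yearly values can be chained by multiplication to get a cumulative price change. The sketch below does this for the Russian Federation row; it illustrates the arithmetic only and is not the deflator used to construct the RLMS real-value variables.

```python
# Year-on-year CPI for the Russian Federation, from the table (in percent).
cpi_russia = {2004: 111.7, 2005: 110.9, 2006: 109.0, 2007: 111.9}


def cumulative_index(cpi_by_year, first_year, last_year):
    """Chain December-to-December indices into one cumulative price factor."""
    factor = 1.0
    for year in range(first_year, last_year + 1):
        factor *= cpi_by_year[year] / 100.0
    return factor


# Price level in December 2007 relative to December 2003 (about 1.51).
factor_2004_2007 = cumulative_index(cpi_russia, 2004, 2007)
```

The same function can be applied to any district row to compare cumulative price growth across districts over these four years.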
Has the basket of goods for the inflation index been changed since 1992?
Yes. The basket of goods is calculated by the federal statistics service (http://www.gks.ru/). They use the same goods most of the time. To better reflect the real structure of consumption they can change the basket (to delete, add, or replace goods), but it should not influence the results greatly.
How was each of these variables constructed?
The variables were constructed from a variety of sources. They are expressed as monthly means. While we don't have documentation on these variables, we supply the code, which is well annotated. Please see the file constructed_variable_code.zip on the data downloads page under Household and Individual data.
