How to use the SHRUG location keys

We provide keys to link pairs of location IDs within the SHRUG. These keys also contain the weights (usually population and land area) we used to impute and aggregate data.

The keys have two distinct purposes. First, you can use them to link datasets within the SHRUG. Second, they make it easy to merge in external data that has been linked to the Population Census. For example, you could take shrid-level poverty rates and merge in district-level child anemia rates from the National Family Health Survey 4 (NFHS-4, 2015-16), which was linked to 2011 Population Census districts. For details and a tutorial on linking the SHRUG to external aggregate, locality, or micro-data, see this page.

What's included in each key module

The core key module

This set of keys is included with every download of the SHRUG data.

Population Census - shrid keys: To make it easy to link the SHRUG to the underlying data, we include keys that link shrids to each Population Census in a single step. They are packaged separately by sector ("u" for urban and "r" for rural) and take the form pc[year][u/r]_shrid_key.dta. For example, pc01r_shrid_key.dta links villages in the 2001 Population Census to shrids. Each row is uniquely identified by its Population Census town or village identifiers. The keys are not necessarily unique on shrid, because a shrid can combine several towns or villages.
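As a rough illustration of the single-step link, the sketch below merges toy village-level data onto a toy key and then aggregates to shrid. The identifier names (pc01_state_id, pc01_village_id), the variable schools, and all values are invented for the example; check the actual key file for its identifier variables.

```stata
* toy stand-in for pc01r_shrid_key.dta (all values are invented)
clear
input str2 pc01_state_id str3 pc01_village_id str4 shrid
"01" "001" "s1"
"01" "002" "s1"
"01" "003" "s2"
end
tempfile key
save `key'

* toy village-level data; schools is a hypothetical variable
clear
input str2 pc01_state_id str3 pc01_village_id schools
"01" "001" 2
"01" "002" 1
"01" "003" 4
end

* the key is unique on census village IDs, so this is a 1:1 merge
merge 1:1 pc01_state_id pc01_village_id using `key', generate(_m_key)
keep if _m_key == 3

* several villages can share a shrid, so aggregate to get one row per shrid
collapse (sum) schools, by(shrid)
list
```

Summing is only appropriate for counts; rates and averages would need a weighted aggregation instead.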

These keys also include the population and land area of each census town or village. Population was drawn from the Primary Census Abstract in each census year. When a locality's population was missing in one census year but nonmissing in another, we imputed the missing value by assuming that the population growth rate was constant across localities in the same shrid. Land area was drawn from the Town and Village Directories in each census year, but set to missing when the reported area was erroneously small (less than 0.1 sq. km). We then filled in missing land area by calculating polygon area from the Population Census 2011 locality boundary map, and imputed land area backward for 2001 and 1991.
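One way to read the constant-growth assumption is the toy calculation below. It is only a sketch, not the procedure used to build the keys: the variable names and numbers are invented, and the growth factor is assumed to come from localities in the same shrid that are observed in both years.

```stata
* toy data: village B is missing its 1991 population (all values are invented)
clear
input str4 shrid str1 village pop91 pop01
"s1" "A" 1000 1200
"s1" "B" .    600
end

* shrid-level totals, using only villages observed in both years
egen double tot91 = total(cond(!missing(pop91, pop01), pop91, .)), by(shrid)
egen double tot01 = total(cond(!missing(pop91, pop01), pop01, .)), by(shrid)

* growth factor shared by all villages in the shrid (here 1200 / 1000)
gen double growth = tot01 / tot91

* back out the missing 1991 value from the 2001 value
replace pop91 = pop01 / growth if missing(pop91) & !missing(pop01, growth)
list shrid village pop91 pop01 growth
```

In this example village B is assigned a 1991 population of 600 / 1.2 = 500.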

For data completeness, we dropped all localities still missing population after the imputation process. Thus all localities and shrids have nonmissing and nonzero population. A very small share are missing land area, and the missing rate is highest among 1991 census localities.

Economic Census - shrid keys: These keys link towns and villages in the Economic Census to shrids, for each EC year. They are packaged separately by sector ("u" for urban and "r" for rural) and take the form ec[year][u/r]_shrid_key.dta. For example, ec05u_shrid_key.dta links towns in the 2005 Economic Census to shrids. Each row is uniquely identified by its Economic Census town or village identifiers. The keys are not necessarily unique on shrid.

In the earlier Economic Census years, a large portion of EC towns and villages could not be linked to the Population Census, and therefore to the SHRUG. As a result, the match rate between the Economic Census and shrids is low in those years. Users are cautioned that merging Economic Census data to the SHRUG will drop between 5% and 20% of the EC data.
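Before dropping unmatched rows, it can help to measure how much of the EC data would be lost. The sketch below does this on toy data; the identifier names (ec05_state_id, ec05_town_id), the variable emp_all, and all values are invented for the example.

```stata
* toy stand-in for ec05u_shrid_key.dta (all values are invented)
clear
input str2 ec05_state_id str3 ec05_town_id str4 shrid
"01" "001" "s1"
"01" "002" "s2"
end
tempfile key
save `key'

* toy town-level EC data; town 003 has no row in the key
clear
input str2 ec05_state_id str3 ec05_town_id emp_all
"01" "001" 120
"01" "002" 80
"01" "003" 50
end

merge 1:1 ec05_state_id ec05_town_id using `key', generate(_m_key)

* discard key rows that have no EC data
drop if _m_key == 2

* share of EC towns, and of EC employment, that has no shrid
gen byte unmatched = (_m_key == 1)
summarize unmatched
scalar share_rows = r(mean)
summarize unmatched [aw = emp_all]
scalar share_emp = r(mean)
display "unmatched share of rows: " share_rows ", of employment: " share_emp

keep if _m_key == 3
```

Here one of three towns is unmatched, which accounts for 50 of 250 employees, so the row share and the employment share differ.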

shrid - subdistrict and shrid - district keys: These keys link all shrids to Population Census subdistricts and districts in each census year. The keys take the form shrid_pc[year][subdist/dist]_key.dta. For instance, shrid_pc01subdist_key.dta links shrids to 2001 census subdistricts. Because a small minority of shrids cross subdistrict or district boundaries, the keys are not unique on shrid alone; they are unique only on the combination of shrid and subdistrict/district IDs, i.e. all location identifiers.
shrid - constituency keys: These keys link all shrids to legislative assembly constituencies, separately for pre-2007 ("2007" in the SHRUG) and post-2007 ("2008"). They take the form shrid_frag_con[07/08]_key.dta. Importantly, each row in these keys is a shrid fragment, not a shrid. For a detailed description of shrid fragments and their weights, see this page's section on the shrid-constituency keys. In short, each row is uniquely identified by shrid and shrid fragment ID.
shrid1 - shrid2 key: This key links shrids in SHRUG version 1 to shrids in version 2. Users are cautioned that the match is incomplete and that shrid2 is more granular. For more information, see the page on linking shrid versions.
shrid location names: This is a dataset with the name of the state, district, subdistrict, and town or village associated with each shrid. To avoid confusion, shrids that combine multiple locations have location names ending with “”. For example, if a shrid combines 2 villages named “Attawa” (population 2,000) and “Badheri” (population 1,000), then the shrid village_name variable will be “Attawa”. The same is true in the rare case that a shrid has locations in different subdistricts or districts. The Delhi and Chandigarh shrids, each of which spans the whole state, are exceptions in which the shrid name is just the state name.
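For the shrid - subdistrict and shrid - district keys described above, a quick check like the following shows which shrids cross a boundary. It is a sketch on toy data: the values are invented, and the identifier names follow the pc11_state_id / pc11_district_id pattern as an assumption.

```stata
* toy stand-in for shrid_pc11dist_key.dta (all values are invented)
clear
input str4 shrid str2 pc11_state_id str3 pc11_district_id
"s1" "01" "001"
"s2" "01" "001"
"s2" "01" "002"
end

* the key is unique on the full set of location identifiers
isid shrid pc11_state_id pc11_district_id

* it is not unique on shrid alone: flag shrids that appear in more than one district
duplicates tag shrid, generate(n_extra)
list if n_extra > 0
```

Rows with n_extra greater than zero belong to cross-district shrids, which need extra care in any later aggregation.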

The location weights module

This module contains the location identifiers, population, and land area of all locations in the SHRUG data (shrid, subdistrict/district, and constituency). For shrids, subdistricts, and districts, population and land area are calculated by aggregating those weights from the census towns and villages contained in each unit. For constituencies, population weights are aggregated from shrid fragments (detailed description here), and land area was calculated as polygon area in our proprietary constituency maps. Users are cautioned that although the majority of our constituency population weights are precise, population estimates in urban constituencies can be substantially distorted.
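A typical use of these weights is to aggregate a rate to a larger unit. The sketch below computes a population-weighted district mean from toy shrid-level data; the variable names pov_rate and pop and all values are invented for the example.

```stata
* toy shrid-level data with a rate and a population weight (all values are invented)
clear
input str4 shrid str3 pc11_district_id pov_rate pop
"s1" "001" 0.20 1000
"s2" "001" 0.50 3000
"s3" "002" 0.10 500
end

* population-weighted mean of the rate; rawsum adds up population without weighting
collapse (mean) pov_rate (rawsum) pop [aw = pop], by(pc11_district_id)
list
```

For district 001 the weighted mean is (0.20 * 1000 + 0.50 * 3000) / 4000 = 0.425, compared with an unweighted mean of 0.35.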

Tutorial: how to link locations within SHRUG

The SHRUG was designed for maximum interoperability across geographic units and years. Linking any two locations in the SHRUG can be done in a few lines. In the example below, we link data on rural consumption in 2011 census districts to shrid-level 1998 Economic Census data on firms. The user can then re-collapse the merged shrid-level data to districts, creating custom EC variables.

/* open the dataset of rural consumption in 2011 districts */
use secc_cons_rural_pc11dist.dta, clear

/* merge in the key linking shrids and 2011 districts */
merge 1:m pc11_state_id pc11_district_id using shrid_pc11dist_key.dta, gen(_m_keymatch)
/* note: now the dataset has expanded to shrid-level. More precisely, because
there are a few cross-district shrids, each row is ID'd by shrid and district IDs */

/* merge in the shrid-level dataset of the 1998 Economic Census */
merge m:1 shrid using ec98_shrid.dta, gen(_m_ecshrid) keepusing(...)
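The same steps, including the final re-collapse to district, can be tried on toy data. This is a self-contained sketch: the tempfiles stand in for the SHRUG files, and the variables cons_pc and emp_all and all values are invented.

```stata
* toy stand-in for the shrid - district key (all values are invented)
clear
input str4 shrid str2 pc11_state_id str3 pc11_district_id
"s1" "01" "001"
"s2" "01" "001"
"s3" "01" "002"
end
tempfile key
save `key'

* toy stand-in for shrid-level EC data; emp_all is a hypothetical variable
clear
input str4 shrid emp_all
"s1" 100
"s2" 40
"s3" 70
end
tempfile ec
save `ec'

* toy district-level data; cons_pc is a hypothetical variable
clear
input str2 pc11_state_id str3 pc11_district_id cons_pc
"01" "001" 15000
"01" "002" 18000
end

* expand to shrid level, then attach the shrid-level EC variable
merge 1:m pc11_state_id pc11_district_id using `key', generate(_m_keymatch)
merge m:1 shrid using `ec', generate(_m_ecshrid)

* re-collapse to district: sum the EC variable, carry the district variable along
collapse (sum) emp_all (first) cons_pc, by(pc11_state_id pc11_district_id)
list
```

In real data, a cross-district shrid appears once per district after the first merge, so a plain sum would count it in each of those districts; splitting its values with the population or land area weights is one way to avoid that.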

Note that some datasets within the SHRUG are not unique on their location identifiers. These data are "long": they have one row per location and value of a secondary dimension, such as time. For example, the shrid-level forest cover (VCF) dataset has one row per shrid-year. Such long data cannot be merged under the assumption that the using dataset is unique on shrid, as in the example above; an m:1 merge on shrid would fail.
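One way to handle long data is to keep it in memory and treat the other dataset as the unique side of the merge. The sketch below uses toy data; the variables pop and forest and all values are invented.

```stata
* toy wide data, one row per shrid (all values are invented)
clear
input str4 shrid pop
"s1" 1000
"s2" 3000
end
tempfile wide
save `wide'

* toy long data, one row per shrid-year; forest is a hypothetical variable
clear
input str4 shrid year forest
"s1" 2000 12
"s1" 2010 10
"s2" 2000 30
"s2" 2010 33
end

* the long data is unique on shrid and year, not on shrid alone
isid shrid year

* the wide data is unique on shrid, so it goes on the "1" side of the merge
merge m:1 shrid using `wide', generate(_m_wide)
list
```

If neither dataset is unique on shrid (for example, shrid-district rows and shrid-year rows), options include restricting the long data to a single year first, or using joinby shrid to form every pairing within a shrid.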