Coauthors who participate in the meeting are bolded. For a full schedule, click here.
Anna Aizer, Brown U; Shari Eli, U of Toronto; Adriana Lleras-Muney, UCLA:
Marrying the Right Man: The Effects of Cash Transfers on the Behavior and Outcomes of Poor Mothers
We investigate the effect of the Mothers’ Pension (MP) Program, operating between 1911 and 1935, on the life experience of recipient mothers. We examine women whose outcomes can be traced by linking administrative records of the MP program to the 1940 census, marriage certificates, birth certificates and death certificates. We assess if receiving a pension affected re-marriage rates, time to re-marriage, characteristics of new spouses (new spouses’ education, income and SES background), and subsequent fertility. We also investigate if mothers were subsequently more likely to work, if their own mortality was affected and how the mothers’ decisions relate to their children’s outcomes. Comparing those accepted by MP to the rejected, we find that MP pensions did not affect re-marriage rates, which were low – approximately 25%. However, women receiving transfers took an average of 4.7 years to find a new spouse, 1.7 years longer than those denied an MP pension.
Özgür Akgün, Tom Dalton, Alan Dearle, and Graham Kirby, U of St. Andrews:
Evaluating Data Linkage: Creating longitudinal synthetic data to provide ‘gold-standard’ linked data sets for comprehensive linkage evaluation
‘Gold-standard’ data to evaluate linkage algorithms are rare. Synthetic data have the advantage that all the true links are known. In the domain of population reconstruction, the ability to synthesise populations on demand, with varying characteristics, allows a linkage approach to be evaluated across a wide range of data sets.
We present a micro-simulation model for generating such synthetic populations, taking as input a set of desired statistical properties. We then outline how these desired properties are verified in the generated populations, and the intended approach to using generated populations to evaluate linkage algorithms. We envisage a sequence of experiments where a set of populations are generated to consider how linkage quality varies across different populations: with the same characteristics, with differing characteristics, and with differing types and levels of corruption. The performance of an approach at scale is also considered.
Özgür Akgün, Tom Dalton, Alan Dearle, Eilidh Garrett, and Graham Kirby, U of St Andrews:
Probabilistic linkage of Vital Event Records in Scotland using familial groups
We report on the assembly of longitudinal data from Scottish birth, death and marriage records representing eighteen million individuals. An experimental approach based on familial groups starts by gathering parents and their siblings into bundles with the aim of (as nearly as possible) partitioning the certificates into familial groups. This may be achieved by bundling marriage and birth certificates according to a signature derived from their attributes. This is similar to, but different from, the blocking used in most entity resolution schemes, where certificates of one kind are gathered together. We have experimented with these techniques using hand-coded data from an historic Scottish dataset as a gold standard for comparison. In this paper we will report on our techniques and some preliminary results from our experiments.
Ahmad Alsadeeqi and Alasdair J G Gray, Heriot-Watt U:
Systematically corrupting data to assess data linkage quality
Computer algorithms use string matching techniques to assess how likely two historical records are to be the same. The quality of linkage is unclear without knowing the correct links or ground truth. Synthetically generated datasets for which ground truth is known are helpful but the data typically are too clean to be representative of historical records. We assess data linkage algorithms under different data quality scenarios, e.g. with errors typical of historical transcriptions. A data corrupting model injects four types of mistakes: character level (e.g. an f is represented as an s – OCR Corruptions), attribute level (e.g. male changed to female due to false entry), record level (e.g. missing records), and group of records level (e.g. coffee spilt over a page, lost parish records in fire). We then evaluate record linkage algorithms over synthetically generated datasets with known ground truth and data corruptions matching a given profile.
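The four corruption levels named in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the function names, field names (`sex`, `parish`), confusion table and rates are all hypothetical choices made for illustration.

```python
import random

# Hypothetical look-alike substitutions standing in for OCR confusions.
OCR_CONFUSIONS = {"f": "s", "l": "1", "o": "0"}


def corrupt_characters(value, rng, rate=0.1, confusions=OCR_CONFUSIONS):
    """Character level: replace a character with a look-alike with probability `rate`."""
    return "".join(
        confusions[ch] if ch in confusions and rng.random() < rate else ch
        for ch in value
    )


def corrupt_attribute(record, rng, field="sex", rate=0.05, values=("male", "female")):
    """Attribute level: with probability `rate`, swap a field for a different valid value."""
    out = dict(record)
    if field in out and rng.random() < rate:
        out[field] = rng.choice([v for v in values if v != out[field]])
    return out


def drop_records(records, rng, rate=0.05):
    """Record level: each record is independently lost with probability `rate`."""
    return [r for r in records if rng.random() >= rate]


def drop_group(records, rng, key="parish"):
    """Group level: lose every record sharing one randomly chosen value of `key`."""
    groups = sorted({r[key] for r in records})
    if not groups:
        return list(records)
    lost = rng.choice(groups)
    return [r for r in records if r[key] != lost]
```

Passing a seeded `random.Random` instance as `rng` makes a corruption run repeatable, so the same corrupted dataset can be presented to several linkage algorithms.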
Trygve Andersen, University of Tromsø and Gunnar Thorvaldsen, Norwegian Historical Data Centre:
Linking 19th century individuals and farms for Norway: an update
The principal aim of the national record linkage strategy for the Norwegian Historical Population Register (HPR) is to link as many records as possible, using unstable information like family relations, addresses and occupations (Population Reconstruction, Bloothooft et al 2015). We flag the linkage criteria so that researchers can select links according to their research questions, as in another population register linked along similar lines (Dillon et al, The History of the Family 2017). The HPR database comprises mainly censuses and church records 1801-1964. The censuses 1815-1855 were numeric. Cross-sectional farm tax lists from 1838 contain farm names and numbers, the farmers’ names and taxation. We aim to link these to the nominative censuses in 1801 and 1865. Farmers changed during the periods between the censuses and the tax list. We link the farms using a numbering system and the farm names plus the patronymics corresponding to the first name in the previous source.
Francisco Anguita and Diogo Paiva, International Institute for Social History:
Linking the Historical Sample of the Netherlands into the American censuses, 1850-1940
The Historical Sample of the Netherlands (HSN) contains standardized information on the life histories of a representative portion of the Dutch nineteenth and twentieth century population. These life histories were collected from the Dutch population registers introduced in 1850 for the whole of the Netherlands. They allow us to trace persons from the cradle to the grave. Given the total number of emigrants to the USA we can estimate that about 400 persons from the HSN disappeared from the sample to be found in the American censuses. In our paper we will describe how we have linked the HSN persons with the Dutch-born persons in the American censuses and how far these emigrants reflected the general population of the Netherlands and/or how far they were different. We will compare this group with the already known emigrants to the Dutch East Indies.
Luiza Antonie and Kris Inwood, U of Guelph:
Bias, accuracy and sample size in the systematic linking of historical records
Linking historical records with time-invariant personal characteristics minimizes bias or departures from representativeness even though a wider set of features might generate more links. But there are many dimensions of bias, and even time-invariant criteria typically generate some bias. We illustrate by comparing records from Canadian censuses that have been linked twice: once using time-invariant individual characteristics and then using family information. The latter produces a larger linked sample, lower error rate and different patterns of bias. Both methods understate the Quebec-born, French-ethnicity, the unmarried and adolescents. Unexpectedly, the bias in favour of married people is larger using individual than family information. Family-based linking over-represents young children. These results suggest that neither method will be universally preferable. Rather, the choice of research question may affect the preferred balance of biases and link rate, and hence preferred methodology for record linking.
Peter Baskerville, University of Alberta:
Mobility Studies: The Importance of Timing and Sources
The central contention of this paper is that first arrivals in a frontier community exhibited dramatically higher mobility/lower persistence rates than did those who followed them. This proposition stands on its head a traditional emphasis to the effect that first arrivals enjoyed privileged status and long term power and persistence in the subsequent evolution of frontier communities. Doubtless some did, but most had no desire to stay long enough to reap such rewards. By emphasizing the persistence of a few the central dynamic of frontier development is misunderstood. Focussing on the persisters has allowed Canadian historians to paint a picture of frontier development that was more peaceful, better regulated and less speculative than that of the United States. A closer look however suggests that that frontier was more a site of unregulated, disruptive speculation fueled by hordes of petty capitalists. Evidence for this claim is presented from a study of the persistence rates of first settlers on two agricultural frontiers in nineteenth century Canada: homesteaders in Alberta and settlers in Perth County, Ontario.
In addition to questioning one of the central myths of Canadian history, this paper points to the importance of timing and sources in mobility studies. In the North American context such studies are generally census based. Censuses, however, were most often taken after significant settlement had occurred. Unless a census coincided with the first influx of people into a particular region, assessing the behaviour of first arrivals is problematic. One’s starting point for tracing people matters. Preliminary results strongly suggest that those who came after the first settlers examined in this paper exhibited higher persistence rates. While censuses are of central importance to North American mobility/longitudinal studies, they can mask the degree to which movement took place. We need to search out complementary sources if we are ever to have a convincing sense of the timing and degree of movement in the North American past.
Mats Berggren, National Archives of Sweden; Maria Larsson, Umeå U:
Group linking and the evaluation of multiple linked Swedish Censuses
A common problem encountered when doing census record linkage is ambiguous links. One reason for ambiguous links is name similarity. Methods proposed to overcome the problem of ambiguous links have in common that not only information about individuals is used in the linking process but also information about the households they belong to. We apply individual and group-based linking to the Swedish censuses 1880-1910 for select parishes, which permits a comparison with linked records in the well-known POPUM database at Umeå University. The group-based method increases the number of linked persons; 91% of the new links are confirmed by the POPUM-linking. We then proceed to link the records in the 1910 census to the 1880, 1890 and 1900 censuses with the group-based method, and evaluate. We consider linkage rate, linkage bias and an analysis of possible causes of errors in the links between censuses. It will also be asked how the linkage rate is affected by events such as changed civil status and altered family relations.
Kees Mandemakers, IISH and Gerrit Bloothooft, U. Utrecht:
LINKS. Linking Dutch marriages into pedigrees over five generations, 1795-1938.
As a step toward reconstructing all nineteenth and early twentieth century families in the Netherlands the LINKS database has linked together 24 million marriage certificates containing over 100 million appearances of persons from 1795 to 1938. Coverage is complete for the entire country 1812-1922. Each certificate contains not only the names of marriage partners, but also the names of their parents, places of birth, ages and partly their occupational titles. In this paper we report on the method of linking a marriage certificate with the marriages of the parents of the bride and groom. We explain the cleaning of data, discuss how we link the marriage certificates into pedigrees and summarize national marital patterns. We also describe future enlargements of the system by including birth and death registrations, church registers, address books, tax registers and other large nominal administrative sources.
Jean-Sébastien Bournival and Marc St-Hilaire, U Laval; Hélène Vézina, U du Québec à Chicoutimi:
Comparing information from vital events to census data in the Saguenay region of Quebec (1852-1911): a critical appraisal of two longitudinal datasets
We compare civil registers of the Quebec population with seven Canadian censuses from 1852 to 1911 for the Saguenay region in order to examine unlinked people and their household/family, to better understand who they are and why we could not link them. Preliminary work suggests that a share of these unlinked individuals reflect undercounting in each source. Some characteristics of these individuals point to phenomena otherwise difficult to detect, such as child transfers between families. The comparison of civil records and census data enables us to carry out a critical evaluation of both sources and to comprehensively assess their reliability over a long and significant period of time. Exploring the characteristics of unlinked individuals also provides a valuable insight into potential biases or limitations which cannot be observed in projects relying on only one data source.
Jeanne Cilliers, Lund U; Johan Fourie, Stellenbosch U; and Auke Rijpma, Utrecht U:
Record Linkage in the Cape of Good Hope Panel
We describe probabilistic record linkage of households in the opgaafrollen tax records collected annually between 1663 and 1834 in the Cape Colony. Household-level information includes name and surname of household head and spouse, the number of children present in the household, the number of slaves (and, in some cases, indigenous Khoesan) employed, and several agricultural inputs and outputs, including cattle, sheep, horses, wheat sown, wheat reaped, vines and wine produced. We evaluate a number of statistical models and deterministic algorithms to best identify households over time using a subset of records already linked manually. The preferred model then creates a panel of 42,354 records from the Graaff Reinet opgaafrollen 1787-1828. We compare analyses of the linked panel and original cross-sectional data paying attention to year-to-year heterogeneity in the opgaafrollen and the scalability of this approach to the full Cape population.
Angela R Cunningham, U of Colorado, Boulder:
Enabling geographies of militarism: using individual civilian and First World War military records to link home and front
Military history has been described as the last refuge of the positivist grand narrative, supported by a military geography myopically concerned with terrain. While the project of problematizing this narrative has largely been carried forward through cultural approaches, newly accessible data, adequately processed, holds promise for social science methodologies to inform a more critical understanding of the ways in which military activities and ideologies affect people and locations seemingly far removed from the battlefield. North Dakota’s Great War roster, containing both continuous military service records and biographical details that can be used to link some of these 32,000 veterans to their civilian lives as recorded in the census, will support an examination of how servicemembers’ movements traversed the boundaries between home and front and how their wartime experiences shaped the postwar places and households that they helped to constitute. This paper will first describe the process of constructing and selecting from a pool of candidate matches between the roster and census datasets using a weighted combination of edit distance and boolean comparisons. Informed by constructive critique of previously presented work, data preparation and linkage protocols will be systematically tested. Finally, this paper will present preliminary visual and statistical explorations of the patterns that emerge from individual level processes of mobility and interaction, processes only traceable through linked data. Reconstructing the multifaceted life courses of soldiers themselves will allow more nuanced and complex appreciation of a war that has left living memory while illuminating the insinuation of militarism into everyday life.
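A weighted combination of edit distance and boolean comparisons, as mentioned in this abstract, might look roughly like the sketch below. It does not reproduce the paper's protocol: the field names, the one-year tolerance on birth year, and the weights are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_score(roster, census, weights):
    """Weighted mean of string similarities and boolean agreements, in [0, 1]."""
    score = weights["surname"] * name_similarity(roster["surname"], census["surname"])
    score += weights["forename"] * name_similarity(roster["forename"], census["forename"])
    # Boolean comparisons contribute their full weight or nothing.
    score += weights["birthplace"] * (roster["birthplace"] == census["birthplace"])
    score += weights["birth_year"] * (abs(roster["birth_year"] - census["birth_year"]) <= 1)
    return score / sum(weights.values())
```

Candidate pairs could then be ranked by `match_score`, keeping the best candidate only when it clears a chosen threshold and is clearly ahead of the runner-up.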
E. Engberg and Maria Larsson, Umeå U:
How much do link metrics matter?
When building large and longitudinal databases, some kind of linkage is an indispensable part of the process. The prerequisites for linkage vary between different databases, which has consequences for the methods and strategies used, and in turn for the linkage rates that can be achieved. In Sweden, with its detailed and informative individual level records, the opportunities to reach a very high linkage rate are very good. With a two-step process, combining an automatic and a semi-automatic linkage, the Demographic Data Base has been able to reach very high linkage rates: up to 95% of all records with the automatic linkage alone and above 99% with the semi-automatic step. But what does that actually mean? Is 99% by definition always better than 95%, or maybe even than 80%, when it comes to analysing the data? This paper is an attempt to test and evaluate how different link metrics affect the analysis. Similar data, but with different linkage rates, will be analysed and the outcome will be compared, focusing on different variables. Are there variables that are particularly sensitive to linkage rates, and, the other way round, are there variables where the linkage rate appears to be of less importance?
Björn Eriksson, Lund U, Sweden:
False positives and faulty estimates: Linked census data and bias to estimates of social mobility
This paper investigates how the quality of linked historical data may bias estimates of intergenerational social mobility (defined as differences in occupation between father and son). We address this question by assessing the impact of false positive links. In the analytical section two characteristics of linked historical data are considered that may bias estimates of social mobility. The first relates to the use of randomly drawn samples rather than full count data when linking individuals between different sources. Using samples is problematic because it limits the extent to which alternative links may be identified. The second concerns the detail by which identifying variables used for matching individuals are recorded.
Chad Gaffield, University of Ottawa and President, Royal Society of Canada:
The Deep Complexity of Historical Change, the Deep Complexity of Research Collaboration
Ron Goeken, Yu Na Lee, Tom Lynch, Minnesota Population Center; Diana Magnuson, Bethel U:
Evaluating the Accuracy of Linked U. S. Census Data, 1870-1880: A Household Linking Approach
Most historical census links are constructed without using corroborative evidence derived from co-resident kin and migration status. This minimizes bias and reduces the linkage rate. A potentially more significant concern might be an increase in the error rate. If the true link cannot be identified because of under-enumeration, death, emigration or misreported census characteristics, then any link to this record will be false. In order to assess this risk, we use the presence of common kin and residential stability in successive decennial censuses to supplement similarity at the individual level and thereby establish a set of verified links. We are able to verify many true links although, admittedly, some linked records lack corroborative household or residential information. These verified links allow us to optimize blocking strategies and to test procedures used to classify potential links generated by individual level classifiers, primarily by constructing linkage and error rates.
Chuck Humphrey, University of Alberta and Director of the Portage Network, a data stewardship initiative of the Canadian Association of Research Libraries:
The Long-term Stewardship of Canada’s Historical Census Microdata
The data lifecycle is an amalgamation of project and institutional-level curation activities that together represent the full range of the stewardship responsibilities for data. Until recently, Canada has lagged in the development of institutional data services, which has been a threat to Canada’s historical census microdata. The research data landscape in Canada, however, has undergone substantial changes since the January 2014 Digital Infrastructure Summit in Ottawa. This presentation will describe the evolving mosaic of research data services in Canada and the institutional backing it will provide for valuable historical data collections.
Kris Inwood, U of Guelph; Rebecca Kippen, Monash U; Hamish Maxwell-Stewart, U of Tasmania; Rick Steckel, Ohio State U:
Inter-generational Trajectories of Occupation and Stature for Prisoners and Soldiers
The only systematic sources of historical anthropometric data describe soldiers and prisoners, who typically represent very different selections of society. This complicates inference from the sample to the experience of the population. A comparison of Australian military and prison records demonstrates just how different were the soldiers and prisoners. We link soldiers and prisoners born between 1871 and 1900 to their birth records and thence to the marriages of their parents. This allows us to compare occupation, birthplace, stature and other characteristics across soldiers, prisoners, fathers of soldiers and fathers of prisoners. We also compare intergenerational trajectories for the men who ended up as prisoners and those who ended up as soldiers. This evidence permits a more nuanced understanding than hitherto possible of the use of linked samples to represent the experience of a population.
Martha Bailey, Morgan Henderson and Catherine Massey, U of Michigan:
How well do automated linking methods perform? Evidence from the Life-M Project
Longitudinal data fashioned by linking census data are transforming the study of economic and demographic history. This paper uses two ground truth samples to evaluate four automated record-linking algorithms and two commonly used phonetic name-cleaning methods. We find high match rates for each algorithm, but we document important shortcomings of each. No matched sample is representative of the underlying population. The incidence of type I errors is distressingly high ranging from 19 percent to 81 percent. Phonetic name cleaning increases type I errors by 60 to 100 percent. Erroneous links are strongly correlated with baseline sample characteristics, implying systematic measurement error that will have substantial (and difficult to sign) effects on parameter estimates. As an illustration, we show that different methods lead to diverse estimates of intergenerational income elasticities for the 1920-1940 period. We conclude with constructive suggestions for improving automated methods without clerical review or genealogical methods.
Alexander Persaud, U of Michigan:
Non-western name matching with (proto) administrative data