Incomplete historical data can pose unique challenges for data analysis. However, it is crucial to explore this data for the information it can provide. In this project, personal property tax records from 1867 Virginia are operationalized. The data is cleaned, modeled appropriately, and made publicly accessible through a Flask web application. The project is made scalable by storing the data in a cloud database and writing query and visualization functions that can adapt to new data features. The data is demonstrated to be useful and informative despite its sparse nature. This is done partly with methods that show how incomplete historical data can be leveraged in the broader context of exploratory analysis, and partly by making the data explorable for the public, showing that the data can also be useful for individuals and historians.
As technology advances, historical records are digitized at an increasing rate. Digital access to historical data has allowed companies like Ancestry.com to provide genealogy information to the public. Ancestry manages 10 Terabytes of data on more than 13 billion historical profiles.[1] One Shared Story (OSS), a Virginia nonprofit, builds on this work. In typical historical data, women and people of color are poorly recorded and sometimes altogether excluded. Before the American Civil War, black people in Virginia were enslaved and viewed as property, which excluded them from typical records like census data. This means that black families today have less capability to locate their own family records compared to white families. OSS works to digitize tax records, birth records, and other historical documents from Virginia that can fill this gap in the currently available data.
Data for this project is newly-digitized personal property tax data from five Virginia counties in 1867. OSS’ goal is that this data be accessible to the public and searchable, so that historians can study the data and families can locate their ancestors. In addition, OSS is currently transcribing more data that can be added to the database. The data itself poses a challenge; it is sparsely populated, and its context and features are difficult to interpret.
The contribution of this project is two-fold: first, a database for data storage and a web application for accessing and searching the data must be built. This component of the project addresses OSS’ practical goals. Second, the data must be explored and summarized to show what meaning can be extracted from it. This component situates this project in the broader context of Data Science. Ultimately, this project demonstrates the operationalizability of incomplete historical data through preprocessing, exploration, and user-friendly presentation while working within a scalable pipeline for the client’s data storage and retrieval.
Genealogy data can be searched and represented in various ways. Ancestry.com,[2] the industry standard, allows for searching for individuals by name, location, and relationship to the searcher. Results are displayed as individual names along with other pertinent information, like an individual’s parents’ names, birth date and date of death, and their residence.
The difficulty of statistical analysis for incomplete historical data is widely acknowledged. Data from the incomplete historical record is challenging for social network analysis (Wetherell, 2010)[3] and semantic extraction (Migliorini & Grossi, 2017).[4] In historical data, missing values often reflect either the data collection method or the time period during which the data was collected (Heitkamp, 2018).[5] In this project, missing values both complicate attempts to explore the data and carry meaning themselves.
A previous attempt to standardize and model historical data was made by On These Grounds,[6] a joint project of several universities to evaluate, describe, and make accessible information regarding enslaved persons who labored at institutions of higher education. On These Grounds has developed and introduced a data model for historical event data called the OTG Descriptive Model. This data model organizes information relating to a historical occurrence into four main categories: Event, Person, Organization, and Place. The design of the data model created for OSS’ data is heavily influenced by the OTG Descriptive Model.
The data is composed of 12,363 observations and 64 features. All observations have at least one missing value, denoted by a ‘0’ when the missing feature is numeric and an empty string when the missing feature is non-numeric. These placeholders were added during data cleaning. The raw data also included unique values that shared the same information, like ‘Louisa ’ and ‘Louisa,’ where the values are different because of an extra space in the string. In addition, the data included some string values like ‘####’ in numeric features. These issues were also addressed during data cleaning.
This personal property tax data from 1867 comes from Buckingham, Fluvanna, Louisa, Cumberland, and Orange Counties, which are located in Central Virginia. This was a tumultuous time in history, as black men and women had recently been emancipated from slavery and for the first time were largely recorded as people rather than property. It is reasonable to expect that the effects of slavery would appear as large disparities in wealth by race; however, the data does not have a clear feature representing race. In addition, it was customary for enslaved people to be given their enslaver’s last name. After emancipation, many freedmen and women changed their surnames as an exercise of freedom.[7] However, in Virginia some formerly enslaved people remained on the same plantation where they had been enslaved and worked for their previous enslaver. The data contains the surnames of employees and employers, as well as several features that give clues to the race of the individuals. This contextual information can be leveraged to extract meaning.
Because the raw data was dirty, each feature was inspected with the pandas `value_counts()` method to find values that differed only by formatting, such as trailing whitespace. Where found, these duplicate values were corrected. In addition, erroneous values such as strings in numeric features were removed. Finally, null values were replaced with either 0 or an empty string, depending on the datatype of the feature.
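The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names `County` and `TaxValue`, the helper `clean_frame`, and the choice to coerce bad numeric entries and nulls in a single pass are all assumptions made for the example.

```python
import pandas as pd


def clean_frame(df, numeric_cols):
    """Return a cleaned copy of df (hypothetical helper, for illustration)."""
    out = df.copy()
    for col in out.columns:
        if col in numeric_cols:
            # Strings such as '####' cannot be parsed, so they become NaN;
            # NaN and original nulls are then replaced with the placeholder 0.
            out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)
        else:
            # Nulls become empty strings, and stray whitespace is trimmed so
            # that 'Louisa ' and 'Louisa' collapse into a single value.
            out[col] = out[col].fillna("").astype(str).str.strip()
    return out


# Toy data with the kinds of problems described in the text (made-up values).
raw = pd.DataFrame({
    "County": ["Louisa ", "Louisa", None],
    "TaxValue": ["12", "####", None],
})

# Before cleaning, value_counts() lists 'Louisa ' and 'Louisa' separately.
print(raw["County"].value_counts())

cleaned = clean_frame(raw, numeric_cols={"TaxValue"})
print(cleaned)
```

Running `value_counts()` again on the cleaned column is a quick way to confirm that the near-duplicate values have merged.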
To better communicate the information that each feature contains, a convention for renaming the features was developed in conjunction with OSS. In this project, this is known as the ‘data model.’ It is built on the concept of tags.
A tag indicates an attribute referenced by a feature. The attributes were defined by OSS, and were developed to honor the attributes used by OTG in its data model. The most prominent attributes referenced in the project data model are ‘Person,’ ‘Event,’ ‘Date,’ ‘Source,’ ‘Loc,’ ‘Count’ and ‘Value.’ These are also the tags seen most frequently.
Functionally, tags are assigned based on whether a feature contains information pertaining to that attribute. If a feature pertains to a person, that feature has the tag ‘Person’ in its name. A feature can have multiple tags in its name if the information refers to multiple attributes. Through the implementation of the data model, the features were given new names like ‘PersonGivenNames’ and ‘PersonTaxValueAggregatePersonlProperty.’ As more data is added in the future, this data model will provide an example of feature definition. To document the data model, a data dictionary was created to record further detailed information on each feature.
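As a rough illustration of how tags can be read back out of a feature name, the sketch below splits a name into capitalized tokens and keeps those that match the tag list given above. The function `feature_tags` and the token-matching rule are assumptions for this example; the project may identify tags differently.

```python
import re

# The most prominent tags named in the data model.
TAGS = {"Person", "Event", "Date", "Source", "Loc", "Count", "Value"}


def feature_tags(feature_name):
    """Return the tags found in a feature name, in order of first appearance.

    Assumes names are CamelCase and that a tag must match a whole token.
    """
    tokens = re.findall(r"[A-Z][a-z0-9]*", feature_name)
    found = []
    for token in tokens:
        if token in TAGS and token not in found:
            found.append(token)
    return found


print(feature_tags("PersonGivenNames"))
print(feature_tags("PersonEventRole"))
print(feature_tags("PersonTaxValueAggregatePersonlProperty"))
```

Matching whole tokens rather than substrings keeps a token such as ‘Personl’ from being counted as a second ‘Person’ tag.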
The cleaned and modeled data was then moved to MongoDB, a cloud-hosted, document-oriented (NoSQL) database system, for storage. Because such systems do not enforce a fixed schema, they offer high flexibility and scalability, which will prove useful for OSS’ future endeavors to expand the database with new data containing new features.
The second portion of the project entailed extracting meaning from the data. Given its historical context and its high frequency of missing data, capability for prediction was fairly limited. However, semi-supervised clustering, which entailed choosing the desired clusters, working backwards to check whether the clustering was appropriate, and determining significant differences between clusters based on visual comparison, was used to explore the data that was present.
The feature ‘PersonEventRole’ was identified as a potential differentiating feature with respect to race, because it does not appear in images of the source documents and was not defined by the client. This feature had no missing values, and classified observations as either ‘taxpayer’ or ‘resident and taxpayer.’ Semi-supervised clustering was performed, comparing measures of wealth between the two roles.
The data contains four features specifying race. The features ‘PersonTaxCountNMalesOver16’ and ‘PersonTaxCountWMalesOver16’ contain the counts of black and white males over sixteen years of age reported for each taxpayer. The features ‘PersonsTaxedCountNMalesOver21’ and ‘PersonsTaxedCountWMalesOver21’ contain the counts of black and white males over twenty-one years of age reported. These last two features are defined by the client as the ‘tax record count of males [of that race] 21 years and older covered by this payment.’ One of these two features is generally populated with at least a count of one, and in this project is assumed to represent the race of the taxpayer themself. Semi-supervised clustering was performed with these features as well to see whether the two roles exhibit differences with regard to these race-related variables.
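The assumption that the populated over-21 count indicates the taxpayer's race can be written as a small rule. The sketch below is illustrative only: the function `race_proxy` and the ‘undetermined’ label for records where both or neither count is populated are assumptions, since the text does not say how such records were handled.

```python
def race_proxy(record):
    """Infer a race proxy from the two over-21 count features.

    Missing counts are treated as 0, matching the placeholder used in cleaning.
    """
    n_count = record.get("PersonsTaxedCountNMalesOver21", 0)
    w_count = record.get("PersonsTaxedCountWMalesOver21", 0)
    if n_count > 0 and w_count == 0:
        return "black"
    if w_count > 0 and n_count == 0:
        return "white"
    # Assumption: ambiguous or empty records are left unlabeled.
    return "undetermined"


example = {"PersonsTaxedCountNMalesOver21": 1, "PersonsTaxedCountWMalesOver21": 0}
print(race_proxy(example))
```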
Once a proxy for race was developed, further clustering was performed by dividing the ‘resident and taxpayer’ cluster into two. A dummy variable called ‘FormerlyEnslaved’ was created within the web application that contains the string ‘confirmed’ if an individual’s surname matches their employer’s surname, and the string ‘unconfirmed’ if this is not the case. Then semi-supervised clustering was conducted, comparing wealth across all three clusters to determine whether clustering by name was a meaningful adjustment.
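A minimal sketch of the surname comparison behind ‘FormerlyEnslaved’ is shown below. The function name, the case-insensitive comparison, and the decision to treat an empty (missing) surname as ‘unconfirmed’ are assumptions for illustration, not details confirmed by the project.

```python
def formerly_enslaved_flag(person_surname, employer_surname):
    """Return 'confirmed' when the two surnames match, else 'unconfirmed'."""
    person = person_surname.strip().casefold()
    employer = employer_surname.strip().casefold()
    # Assumption: an empty string marks a missing surname, so two empty
    # values are not counted as a match.
    if person and person == employer:
        return "confirmed"
    return "unconfirmed"


print(formerly_enslaved_flag("Adams", "Adams"))
print(formerly_enslaved_flag("Brackett", "Adams"))
print(formerly_enslaved_flag("", ""))
```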
To make the data accessible and searchable per OSS’ request, a web application was made with the Python package Flask. The application also highlights the data processing and analysis that was conducted. The user view and functionality of the application are discussed in the following Results section, but the underlying functions will be discussed here.
A search function was created for the database search page of the web application. Per the client’s specifications, the search function takes a user-specified given name, surname, date range, and location, as well as a list of columns the user wishes to view. Each user input is compared with all features that contain relevant information. For example, a user input for location must be compared with every feature that holds information about record location, so the input is matched against all features with the tag ‘Loc.’ In this way, the data model developed with OSS is used to select the appropriate features for each user input. The function then generates the corresponding query, sends it to the MongoDB database, and returns a table containing the resulting data. See Figure 1 for the function schema.
A function for generating visualizations from various combinations of user inputs was also created. Because users may select different visualization types, features, and aggregation functions, the function checks the validity of the combination of user inputs before generating the visualization, to avoid unexpected errors.
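The tag-driven query construction might look roughly like the sketch below, which builds a MongoDB filter document using the standard `$and`, `$or`, and `$regex` operators. Everything else is assumed for illustration: the function `build_query`, the feature names other than ‘PersonGivenNames’, the use of a ‘Surname’ marker for name fields, exact case-insensitive matching, and the omission of the given name and date range inputs.

```python
import re

# Hypothetical feature list; only 'PersonGivenNames' appears in the text.
FEATURES = [
    "PersonGivenNames",
    "PersonSurname",
    "EmployerPersonSurname",
    "EventLocCounty",
]


def build_query(features, surname=None, location=None):
    """Build a MongoDB filter that checks each input against every
    feature whose name carries the matching marker."""
    clauses = []
    for value, marker in ((surname, "Surname"), (location, "Loc")):
        if not value:
            continue  # the user left this input blank
        fields = [name for name in features if marker in name]
        if not fields:
            raise ValueError(f"no feature carries the marker {marker!r}")
        # Exact, case-insensitive match; re.escape guards special characters.
        pattern = {"$regex": f"^{re.escape(value)}$", "$options": "i"}
        # A record matches if any one of the relevant features matches.
        clauses.append({"$or": [{name: pattern} for name in fields]})
    return {"$and": clauses} if clauses else {}


query = build_query(FEATURES, surname="Adams", location="Louisa")
print(query)
```

Because the fields are chosen by name rather than hard-coded, a new record type that follows the same naming convention would be searched without changes to the function. The resulting dictionary could be passed as the filter argument of a driver's find call.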
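One way to perform such a check is to test each input against a set of allowed values and return a message instead of attempting an invalid plot. The chart types, aggregation names, and rules below are hypothetical and serve only to illustrate the idea; the feature names are taken from the text.

```python
CHART_TYPES = {"bar", "histogram"}          # assumed options
AGGREGATIONS = {"count", "sum", "mean"}     # assumed options


def check_plot_request(chart, feature, aggregation, numeric_features, all_features):
    """Return an error message for an invalid request, or None if it is valid."""
    if chart not in CHART_TYPES:
        return f"Unsupported chart type: {chart}"
    if feature not in all_features:
        return f"Unknown feature: {feature}"
    if aggregation not in AGGREGATIONS:
        return f"Unsupported aggregation: {aggregation}"
    # Sums and means are only meaningful for numeric features.
    if aggregation in {"sum", "mean"} and feature not in numeric_features:
        return f"Cannot compute {aggregation} of non-numeric feature {feature}"
    return None


ALL_FEATURES = {"PersonEventRole", "PersonTaxCountNMalesOver16"}
NUMERIC_FEATURES = {"PersonTaxCountNMalesOver16"}

print(check_plot_request("bar", "PersonTaxCountNMalesOver16", "sum",
                         NUMERIC_FEATURES, ALL_FEATURES))
print(check_plot_request("bar", "PersonEventRole", "mean",
                         NUMERIC_FEATURES, ALL_FEATURES))
```

Returning a message rather than raising an exception lets the web page show the problem to the user directly.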
After performing the first semi-supervised clustering, Figure 3 was created, representing total tax value compared across roles and grouped by county. Clearly, the ‘taxpayer’ role generally holds more wealth than the ‘resident and taxpayer’ role, and this clustering is appropriate.
Figure 4 shows the count of black males over 21 reported by role and grouped by county. Figure 5 shows the count of white males over 21 reported by role and grouped by county. It is clear that the role ‘resident and taxpayer’ almost exclusively reports on black males, and the role ‘taxpayer’ almost exclusively reports on white males. It is also worth noting that some black males are reported by individuals with the role of ‘taxpayer,’ and investigation of the source documents indicates that a few of these records may be transcriber errors. However, some do not appear to be errors, indicating that the black males reported by ‘taxpayer’ individuals may still have been considered property of their enslavers. Overall, this difference further confirms the clustering that was conducted.
Figure 6 shows the hierarchical structure developed after clustering by ‘confirmed’ or ‘unconfirmed’ former enslavement status. Figure 7 shows the results of comparing these three clusters by wealth. No significant difference in wealth is present between ‘confirmed’ and ‘unconfirmed’ individuals with the role ‘resident and taxpayer,’ which indicates that the second round of clustering is not helpful for extracting meaning. In keeping with historical accounts of the time, this similarity suggests that there is no real difference between the two groups, and that the entire ‘resident and taxpayer’ cluster represents black people who were recently emancipated and had little wealth.
The web application created to display results, make the database searchable, and make sense of the data has four pages. Figure 8 shows the top portion of the home page, which summarizes the meaning of the data as detailed in the previous section. Figure 9 shows the input and output of the searching page, which has input fields aligning with the query construction. Note that it returns two individuals named John, but only one has the surname that was searched - this reflects the query’s functionality of searching all surname fields, including that of an employer and not the main person in the event, when accessing the data. In other words, the John Brackett record has an employer whose surname is Adams, so that record is returned as well. Figure 10 shows the input and output of a user-generated graph, as well as the input and output of an error message in response to a user request that does not produce a valid graph. Figure 11 shows the data dictionary page, which defines all variables per OSS’ specifications.
The project also results in a complete data pipeline. Once cleaned and modeled, the data can be moved to the cloud database. Then the data is accessible via the web application thanks to the scalability built into the search and visualization functions.
In this project, the data was shown to contain meaningful information and was made publicly searchable and explorable. In this way, the data can be considered successfully operationalized. Preprocessing and cleaning made it possible to conduct semi-supervised clustering to extract meaning, demonstrating that even incomplete historical data can provide insightful glimpses into the past. The data was then added to a MongoDB database, and a functional Flask web application was written to access it. The application permits a user to search for particular inputs and visually represent features of their choosing. The final product is scalable, with a well-defined data model and a query that adapts to new record types which follow the data model. It permits OSS to scale the pipeline to meet its needs, and could inform future work. The framework developed in this project can be replicated with other sparsely populated historical data, both in extracting meaning and in making the data searchable and available to the public.
Sincerest thanks to Dr. Judy Fox, Dr. Rafael Alvarado, and Ian Liu for their help with this project.