Data ‘hashing’ improves estimate of the number of victims in databases
Researchers from Rice University and Duke University are using the tools of statistics and data science in collaboration with the Human Rights Data Analysis Group (HRDAG) to accurately and efficiently estimate the number of identified victims killed in the Syrian civil war.
In a paper available online and due for publication in the June issue of the Annals of Applied Statistics, the scientists report on a four-year effort to combine a data-indexing method called “hashing” with statistical estimation. The new method produces real-time estimates of documented, identified victims with a far lower margin of error than existing statistical methods for finding duplicate records in databases.
“Throwing out duplicate records is easy if all the data are clean — names are complete, spellings are correct, dates are exact, etc.,” said study co-author Beidi Chen, a Rice graduate student in computer science. “The war casualty data isn’t like that. People use nicknames. Dates are sometimes included in one database but missing from another. It’s a classic example of what we refer to as a ‘noisy’ dataset. The challenge is finding a way to accurately estimate the number of unique records in spite of this noise.”
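To make the idea concrete, here is a minimal sketch of how hashing can flag likely duplicates among noisy records without comparing every pair. The article does not say which hashing scheme the team used, so the choice of MinHash over character bigrams with banding is an assumption, as are the function names, the parameters and the sample records, all of which are invented for illustration. The sketch covers only the candidate-finding step, not the statistical estimation that the researchers combine with it.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Return the set of lowercase character k-grams of a record.

    Whitespace is collapsed first, so small formatting differences
    do not change the result.
    """
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=32):
    """MinHash signature: for each seed, the smallest hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_hashes))


def candidate_pairs(records, num_hashes=32, rows_per_band=2):
    """Return index pairs whose signatures agree on at least one band.

    Similar records are likely (not guaranteed) to share a band, so only
    these pairs need a closer look instead of every possible pair.
    """
    buckets = defaultdict(set)
    for idx, record in enumerate(records):
        sig = minhash_signature(record, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            buckets[(start, sig[start:start + rows_per_band])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Invented sample records, not taken from any real database.
    sample = [
        "Jonathan Smith 2012-05-01",
        "Jon Smith 2012-05-01",
        "Maria Garcia 2013-11-20",
    ]
    print(sorted(candidate_pairs(sample)))
```

Records that are exact copies always collide, and records that share many bigrams collide with high probability, while unrelated records almost never do. Lowering `rows_per_band` makes the filter more permissive, and raising it makes the filter stricter.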
Using records from four databases of people killed in the Syrian war, Chen, Duke statistician and machine learning expert Rebecca Steorts and Rice computer scientist Anshumali Shrivastava estimated there were 191,874 unique individuals documented from March 2011 to April 2014. That’s very close to the estimate of 191,369 compiled in 2014 by HRDAG, a nonprofit that helps build scientifically defensible, evidence-based arguments of human rights violations.
But while HRDAG’s estimate relied on the painstaking efforts of human workers to carefully weed out potential duplicate records, hashing with statistical estimation proved to be faster, easier and less expensive. The …