Meet the newest ethical and legal challenge of obtaining and using data via the internet: The dark web
Are you afraid of the dark? What thoughts come to mind when you hear the word “dark”? Scary? Hidden? Sinister? Mysterious? Shadowy? The opposite of dark, of course, is “light,” which signifies transparency, full disclosure, openness, and clarity. It is in this context that research professionals increasingly are using what is called the “dark web” to collect data for studies of activities that are often, but not always, illegal. What makes the dark web both intriguing and worrisome for researchers is not the kind of information it includes but rather the methods used to obtain the data, which involve neither the usual search engines nor hacking passwords.
“I have been thinking more and more about the use of the dark web for research in the last year,” says Elizabeth Buchanan, endowed chair in ethics, director of the Center for Applied Ethics, and acting director of the Office of Research and Sponsored Programs, University of Wisconsin-Stout. Buchanan says she is hearing more from colleagues who want to do studies in areas such as illicit sexual behavior, drug trafficking, terrorism, and more. A good way of getting data on these activities is from the dark web.
Buchanan and several colleagues have done some preliminary work on the use of the dark web. “Two key questions researchers should consider if thinking about tapping into the dark web include: How can they assure confidentiality? and Is the data-gathering method itself ethical and legal?” she says. She and her colleagues presented a session on the ethics of using data from the dark web at the annual Advancing Ethical Research Conference held by Public Responsibility in Medicine & Research (PRIM&R) in November 2016.
“In some protocols, the IRB reviewers may need even more information from a study using information from the dark web than from other sources,” says Tani Prestage, director of research compliance administration, University of California, Santa Cruz (formerly of the University of California, Berkeley; UCB). “The IRB must determine whether there is enough benefit to justify the possible risk of a breach of confidentiality.” Prestage also recommends that IRBs invite experts in law and computer science to assist in reviewing any protocol that may include the use of data from secondary sources.
Chris Jay Hoofnagle, professor of law and faculty director of the University of California Berkeley Center for Law and Technology, gave a presentation on the topic of dark web research to Prestage and other UCB colleagues. He is also a member of the UCB Institutional Review Board (IRB) and assisted in the development of guidelines for the secondary use of existing data. “Though such projects do not involve interactions or interventions with humans, they may still require Committee for Protection of Human Subjects (CPHS) review, since the definition of ‘human subjects’ at 45 CFR 46.102(f) includes living individuals about whom an investigator obtains identifiable private information for research purposes,” says the UCB guidance.
The guidance outlines when the use of existing data does not require IRB review, when it is exempt, and when it is non-exempt.
The dark web — definition and usage
Generally, when someone refers to doing online research on a topic or searching for data, search engines such as Google, Yahoo, and Bing come to mind. This is the public web, explains Buchanan, and it is only 4% of web content (~8 billion pages). Because it is available to everyone via a search engine, it also is referred to as the “surface web.” The term “deep web” refers to approximately 96% of the digital universe, which sits on websites that are protected by passwords, she says. The term “dark web” is used to describe not the data itself but how it is obtained.
“There is an extraordinary amount of activity on the dark web,” Hoofnagle says. “These are sites that give users extra levels of anonymity because users’ browsing is obscured by software. The dark web in effect runs on the public web — it is a layer that is obscured,” he explains. “Researchers are interested in the dark web to better understand why people use it, whether it is a more private and secure alternative to the surface or deep web, to understand the cyber-crime networks that can take advantage of its characteristics, and so on,” he says. “The dark web is used by people who are breaking the law by trading illegal pornography, stolen personal information, copyrighted items, and even selling groups,” says Hoofnagle. “However, it is also used by human rights advocates and by intelligence agencies that are trying to avoid surveillance.”
Buchanan says users of the dark web include people who want to browse/chat anonymously, buy/engage in illegal acts, and provide illegal services/goods. Researchers use the dark web, as do hacktivists such as Anonymous and WikiLeaks, terrorists such as ISIS, and even all of us without our knowledge when our credit cards, medical information, social security numbers, and identities are stolen.
Below is some of the information that researchers can get on the dark web:
- Un-indexed materials (Snowden)
- Breached data, stolen data, and data with questionable provenance
- Data dealing with illegal topics and actions
- Large datasets (that might be masked in questionable ways)
- Bitcoin purchases (for example, studying addiction and purchasing habits)
Bitcoin purchases can be done on the dark or surface web, says Hoofnagle. It is a method of exchanging value that enjoys more anonymity than credit payments, but less than cash. Sophisticated law enforcement and researchers can determine payment flows in Bitcoin.
Buchanan also explains that a system called “The Onion Router (TOR)” can be used to browse the web anonymously by bouncing a user’s traffic through a circuit of encrypted connections, which conceals the user’s IP address. TOR is legal, free software [1]. It prevents people from learning someone’s (online) location and/or browsing habits by letting the user communicate anonymously on the internet. Buchanan says TOR was funded and developed initially by the U.S. Naval Research Laboratory.
TOR makes finding a user’s information like finding the proverbial needle in a haystack. Each user hides among all the other users on the network, so the more populous and diverse the user base for TOR is, the more anonymity will be protected, says an overview of the project [2].
This gets to the intersection of data sources and the ethics and legality of using research data from the dark web. What do investigators and IRBs need to know to decide whether the use of data obtained on the dark web is both ethical and legal? Is it appropriate to use a waiver of informed consent? When is the research exempt? Can researchers inadvertently uncover illegal information on the dark web, such as child pornography, and download it onto the university’s server? These are questions that Brenda Curtis, assistant professor, department of psychiatry, Perelman School of Medicine, University of Pennsylvania (UPenn), hopes to answer. Curtis studies illicit drug use and addictions in her work at UPenn and co-presented with Buchanan and Prestage at the PRIM&R conference.
Concerns for researchers and IRBs
“The concerns IRBs should have surround the kinds of data that could be captured when indexing or just generally scanning the dark web,” Hoofnagle advises. “Since cybercriminals and child pornographers are major users of it, there is a risk that investigators will obtain graphic imagery of children or personal information of individuals.”
Prestage and Curtis discussed several cases to illustrate their points at the conference. The data in the first case came from a “cheater” website based out of Canada, “Ashley Madison. Life is short. Have an affair.” The letter “o” in the website name was a symbol of a wedding ring. Ashley Madison (AM) was an online dating service that targeted users currently in a relationship but seeking to “cheat.” AM started in 2002 and was not widely known until 2015, when a hacker group calling itself the Impact Team threatened to leak, and then actually leaked, customer data of more than 30 million users on the dark web, explains Buchanan. The data found on the dark web included personal information of people who had signed up to use AM, including those who had paid a fee to have their information deleted.
Researchers proposed to take the breached AM email addresses obtained on the dark web and then purchase other data on these individuals from a big data company, including spouse/partner information if available. These email addresses were first available on the dark web and are now on the surface web for a fee. Any research use of the AM data is considered secondary use of existing data. Hoofnagle says the secondary use of dark web data by researchers can be problematic. Once individuals are outed by a researcher studying the behavior of people in the leaked data, the information becomes identifiable to spouses/partners.
“For curious minds,” says Buchanan, the CEO of AM stepped down in August 2015 but AM remained a company. In July 2016, the parent company, Avid Life Media, renamed itself Ruby Corp. The tagline “Life is short. Have an affair” was shelved for the milder “Find your moment.” The website swapped the wedding ring image for a red gem, she says.
Case study 2 involves another dating website, OkCupid. The site is semi-public: its profiles are searchable on Google by username, but most profile information required the user to log in to the site. All of this information was available to its users once they were signed in. Researchers publicly released a dataset on nearly 70,000 users of OkCupid. They collected the data using a scraper, an automated tool that saves certain parts of a webpage, from random profiles that had answered a high number of OkCupid’s multiple-choice questions. These included queries about whether they ever do drugs, whether they’d like to be tied up during sex, or what’s their favorite of a series of romantic situations. The dataset also included username, age, gender, location, religious and astrology opinions, and the number of photos, as well as users’ answers to the 2,600 most popular questions on the site. The researchers published the data with the paper but later were forced to retract it when OkCupid said the researchers violated the company’s terms of service. However, the researchers were not charged with breaking copyright law because the data was already public (seen by OkCupid users).
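To make the term “scraper” concrete, here is a minimal sketch in Python that saves only selected parts of a page. It runs on a made-up HTML snippet rather than a live site; the class names (`username`, `age`, `location`, `bio`) and the helper `scrape` are hypothetical and do not reflect OkCupid’s markup or the tool the researchers used.

```python
from html.parser import HTMLParser


class FieldScraper(HTMLParser):
    """Collects the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name of the element being captured
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # Only exact single-class matches are handled in this sketch.
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        if self.current is not None:
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


# Made-up page content, used instead of any network request.
PAGE = """
<div class="profile">
  <span class="username">example_user</span>
  <span class="age">34</span>
  <span class="location">Springfield</span>
  <p class="bio">Likes hiking.</p>
</div>
"""


def scrape(page, wanted):
    """Return a dict holding only the requested fields from `page`."""
    parser = FieldScraper(wanted)
    parser.feed(page)
    return parser.record


record = scrape(PAGE, ["age", "location"])
print(record)
```

Which fields a scraper is told to keep is a design decision: in this sketch the username is simply never requested, so it never enters the saved record.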
What types of data should researchers be allowed to collect, repackage, and distribute? asks Buchanan. Now that the OkCupid data are “out there,” can researchers use them for secondary analysis?
“How often do we ask researchers where and how they have obtained their data?” asks Prestage. She advises IRBs to understand what kind of data their institution’s investigators are after and how they are obtaining it. Consider the risk-benefit for your institution, researchers, subjects, and staff. Let your IT staff and local FBI branch know that you have researchers working on the dark web.
Prestage says that a few years ago, during her tenure at UCB, a researcher came to the IRB with a protocol that involved using the AM data from the dark web. The IRB recommended that the researcher examine another set of data that did not carry the same level of risk. If the researcher still wanted to use the AM data, he or she would have to justify why it was needed as well as develop a risk management plan for data security. This might include dumping data as soon as an email-spouse match was found rather than waiting to destroy linked databases until after the study, assuring that a user’s spouse would never be contacted, and presenting a complete and detailed security plan. Prestage says the UCB IRB also determined that the risk-benefit ratio for using the AM data obtained on the dark web “tipped more heavily toward risk.” The researcher did not provide justification for why email addresses had to be from this particular source for this project, and the IRB also found that if the sensitive data were stolen and breached, spouses and partners might find out.
The IRB reviewers also asked the all-important questions for any submitted protocol: Does it need IRB review? Is this human subjects research? The IRB determined that subjects had an expectation of privacy and decided that they were human subjects as defined in federal regulation as “a living individual about whom an investigator conducting research obtains data through intervention or interaction with the individual or identifiable private information.”
Seek legal guidance and establish guidelines
Prestage recommends that when a researcher seeks IRB approval for a study using dark web research data, the IRB should consult the institutional legal counsel both to make sure that they “are following the law and to explain how to establish guidelines.”
Hoofnagle says that there are two immediate concerns from bulk scanning of dark web sources. The first concerns sensitive personal data: “Do investigators become subject to security breach notification laws when they acquire social security numbers and other sensitive personal information from scanning dark web sources?” To address this risk, investigators may have to take certain precautions such as limiting data storage to work computers, using encryption, and deleting data within some reasonable time after the research is concluded, says Hoofnagle.
“The other risk is the acquisition of child pornography, which is a bigger problem,” he says. “Possession of child pornography is a serious crime and there is no exception for researchers. Under federal law, once one realizes that child pornography has been acquired, the recipient is under legal duty to delete it immediately and inform law enforcement. Individuals who view such material often report being traumatized by seeing it. So the problems span both illegal possession and workplace-harassment-like issues, where employees may be harmed by merely being exposed to it,” says Hoofnagle.
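The storage precautions Hoofnagle lists for the first concern can be partly sketched in code. The Python sketch below is an illustration under stated assumptions, not a security plan: it replaces raw email addresses with keyed-hash pseudonyms (a complementary step, not the encryption he mentions) and computes a deletion deadline. The function names and the 90-day retention period are hypothetical; an actual retention period would come from the approved protocol.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonymize(email: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 pseudonym so the raw address need not be stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def deletion_deadline(study_end: date, retention_days: int = 90) -> date:
    """Date by which study data should be deleted (90 days is an assumption)."""
    return study_end + timedelta(days=retention_days)


def must_delete(today: date, study_end: date, retention_days: int = 90) -> bool:
    """True once the retention period after the end of the study has elapsed."""
    return today >= deletion_deadline(study_end, retention_days)


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-off-the-dataset"  # placeholder key
    print(pseudonymize("Person@Example.org", key))
    print(deletion_deadline(date(2017, 3, 1)))
```

The pseudonym is only as safe as the key: anyone holding both the key and a list of candidate addresses can re-link records, so the key would need the same protection as the original data.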
At Berkeley, stolen personal information that is placed on the web is not considered public. Therefore, a researcher’s proposed use of the data would have to go through an IRB process even if, technically speaking, the corpus of data is online for everyone to see, adds Hoofnagle. The guidance further outlines policy and procedures to address the use of secondarily obtained data. It covers when existing data does not require review, when the use of secondary data is exempt, and when the secondary use of existing data is non-exempt.
UCB provides written guidance to its investigators and IRBs. “In general, the secondary use of existing data does not require review when it does not fall within the regulatory definition of research involving human subjects. Although the definition of a human subject includes only living individuals, thereby excluding decedents, there are cases in which the health information of the deceased and death data files may require IRB review. Public use data sets (such as portions of U.S. Census data, data from the National Center for Educational Statistics, National Center for Health Statistics) are data sets prepared with the intent of making them available to the public. The data available to the public are not individually identifiable and, therefore, their analysis would not involve human subjects.
“In addition to being identifiable, the existing data must include ‘private information’ in order to constitute research involving human subjects. Private information is defined as information which has been provided for specific purposes (e.g., medical or school record). Information that includes identifiers and can be accessed freely by the public (without special permission or application) is not ‘private’ and the research does not therefore involve human subjects.
“There are six categories of research activities involving human subjects that may be exempt from the requirements of the federal regulations at 45 CFR 46. Only Category 4 applies specifically to existing data. If the research is found to be exempt, it need not receive full or subcommittee (expedited) review. In order to qualify for the exempt determination, an eProtocol application must be submitted to the IRB. Research involving collection or study of existing data, documents, and records can be exempted under Category 4 if the sources of the data are publicly available or the information is recorded by the investigator in such a manner that subjects cannot be identified either directly or through identifiers linked to the subjects.
“If secondary analysis of existing data does involve research with human subjects and does not qualify for exempt status, the project must be reviewed either through expedited procedures or by a full convened board and a non-exempt eProtocol must be submitted to the IRB for review. Researchers using data previously collected under another study should consider whether the currently proposed research is ‘compatible use’ with what subjects agreed to in the original consent form.”
Prestage, Hoofnagle, and colleagues at UCB IRB also created a secondary data matrix/worksheet (see Figure 1) to help researchers determine if existing data research meets the definition of human subjects research.
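The guidance quoted above can be read as a small decision procedure. The Python sketch below is one simplified reading of that text, not the UCB worksheet itself; the class, its field names, and the returned labels are hypothetical. It does not cover the decedent health-data caveat or the “compatible use” question, and it is no substitute for an IRB determination.

```python
from dataclasses import dataclass


@dataclass
class ExistingData:
    """Answers a reviewer would supply; all field names are hypothetical."""
    about_living_individuals: bool
    individually_identifiable: bool
    freely_public: bool                 # accessible without special permission
    stolen_or_breached: bool            # e.g., leaked data posted online
    recorded_without_identifiers: bool  # investigator records no identifiers or links


def classify(data: ExistingData) -> str:
    # Per the Berkeley position described in the article, stolen personal
    # information placed on the web is not treated as public.
    public = data.freely_public and not data.stolen_or_breached
    human_subjects = (
        data.about_living_individuals
        and data.individually_identifiable
        and not public
    )
    if not human_subjects:
        return "not human subjects research"
    # In this simplified model, the public-source arm of Category 4 is
    # already handled above, so only the de-identified-recording arm remains.
    if data.recorded_without_identifiers:
        return "exempt (Category 4), eProtocol still required"
    return "non-exempt, expedited or full board review"


if __name__ == "__main__":
    breached_emails = ExistingData(True, True, True, True, False)
    print(classify(breached_emails))
```

Under these assumptions, breached email addresses that are freely downloadable still come out as non-exempt, which matches the outcome described for the Ashley Madison protocol.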
Curtis has yet to actually use the dark web to obtain data for her research but has acquired a broader understanding of what it entails. She shares the following six steps that a researcher should use when contemplating the use of all web data:
- Know the source of the data.
- Check for a terms of service agreement.
- Determine if you can scrap the data after use.
- Make sure that you know how the data were obtained and whether the data include identifiable information.
- Contact your local IRB to determine if the data use would be considered human subjects research.
- Consult your IT department and ask for advice. If you are using the dark web and downloading information, do not use it on the institution’s server because it may affect other studies.
The bottom line for Curtis is this: “If I can do quality research and take the needed precautions, it can be an asset to my research.”
Hoofnagle cautions that the federal government and other law enforcement agencies have taken steps to stop illegal use of the dark web. “The most notable cases surround takedowns of the ‘Silk Road,’ a popular eBay-like marketplace for drugs, identification documents, and other illegal products,” he says.
The headlines in today’s news are replete with stories about hacked computer data, private servers, identity theft, breaches of confidentiality, and more. Although it may seem from the news reports that all uses of data that are not in the public domain are illegal, that is not always the case. Research practitioners, including investigators and IRBs, should educate themselves and their institutions about ethical and legal issues involving the use of the dark web to obtain and use data for a research study. These issues include a determination of whether the proposed research is considered human subjects research, whether the data are identifiable, and whether the discovery of illegal data like child pornography is in and of itself illegal. To make sure that you are taking all appropriate precautions, contact institutional IT experts as well as experts in ethics, law, and technology.
One final note: Yes, the federal government does use professional hackers to investigate possible illegal use of data on the dark web.
1. Levine Y. Almost everyone involved in developing Tor was (or is) funded by the U.S. government. Available at: https://pando.com/2014/07/16/tor-spooks/. Accessed Feb. 23, 2017.
2. Tor. Tor: Overview. Available at: http://www.torproject.org/about/overview.html.en. Accessed Feb. 23, 2017.
By Terry Hartnett
This article was reprinted from Research Practitioner, Volume 18, Number 2, March-April 2017.