Toward Realistic and Artifact-Free Insider-Threat Data

Kevin Killourhy
Carnegie Mellon University

Roy Maxion
Carnegie Mellon University

Progress in insider-threat detection is currently limited by a lack of
realistic, publicly available data. For reasons of
privacy and confidentiality, no one wants to expose their sensitive
data to the research community. Data can be sanitized to mitigate
privacy and confidentiality concerns, but does the mere act of
sanitizing the data introduce artifacts that compromise its utility
for research purposes? If sanitization artifacts change the
results of insider-threat experiments, then those results could lead
us to draw conclusions that do not hold in the real world.

The goal of this work is to investigate the actual consequences of
sanitization artifacts on insider-threat detection experiments. We
assemble a suite of tools and present a methodology for collecting and
sanitizing data. Then, we use these tools and methods to replicate an
experimental evaluation of an insider-threat detection system. We
compare the results of the evaluation using raw data to the results
using each of three types of sanitized data to measure the effect of each
sanitization strategy.

We establish that two of the three sanitization strategies actually
alter the results of the experiment. Since these two sanitization
strategies are commonly used in practice, we must be concerned about
the consequences of sanitization artifacts on insider-threat research.
On the other hand, we demonstrate that the third sanitization strategy
addresses these concerns, and that realistic, artifact-free data sets can
be created with appropriate tools and methods.

Keywords: Insider; Data Collection; Sanitization; Evaluation; Sanitization Artifacts
