Command Line Query Changes in 6.1

Carbon Black EDR (Endpoint Detection and Response) is the new name for the product formerly called CB Response.

EDR 6.1 introduces a new way to tokenize and query command lines. This document explains the rationale behind the change, how to enable the new tokenization, and how to use it in your queries. The default behavior is to use the old (5.x) tokenization, so to use the new features described here you must enable the new tokenization in your /etc/cb/cb.conf file.

The new tokenization offers a lot of query power that did not exist before. For example, you can now search for a .exe or .dll as part of a command line query, something that was not possible before. You can also explicitly search for a “/d” command line argument without worrying about the false positives that come from searching for a bare “d”. And you can search for terms like “execute*” to find a specific term passed on the command line, again something that was not possible previously!
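
For example, assuming the standard cmdline field used in process searches, queries like the following become possible once the new tokenization is enabled:

    cmdline:.exe
    cmdline:/d
    cmdline:execute*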

Something else worth noting: this tokenization is, on average, twice as fast as what we had before. Custom code was written to apply these specific rules in the most efficient manner, and as a result tokenizing most command lines takes as little as half the time it did previously. With more predictable and cleaner tokens, the number of unique terms stored in the command line index will be smaller, reducing the disk space and memory needed by that index.

Enabling the new tokenization

This new tokenization is disabled by default. It can be enabled by adding

CurrentEventsSchema=cbevents_v2

to the /etc/cb/cb.conf file. Previously configured Watchlists that use command line tokens may need to be rewritten to take advantage of the new tokenization. It is recommended that you review your Watchlist entries to make sure they return the intended results.
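
As a minimal sketch (assuming a standard server install; paths and service names may differ in your deployment), enabling the feature from a shell might look like this. As with other cb.conf changes, a restart of the Carbon Black services is typically needed for the setting to take effect:

    # Append the setting to cb.conf, then restart the services
    echo 'CurrentEventsSchema=cbevents_v2' >> /etc/cb/cb.conf
    service cb-enterprise restart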

How the tokenization works

When a process launches on an endpoint, the entire command line for that process is sent to the EDR server. The server then breaks that data up into tokens that ideally make it simple for the user to construct queries to find that process and respond accordingly. Until now, however, it was clear that the tokens we created, and the querying they allowed, left much to be desired.

So why has this proven so difficult, and what have been the drawbacks? The problem is that command lines have a lot of elements that make a generalized tokenization solution difficult. There’s the start of the command line, which can optionally have a path followed by the command name. Then there are arguments passed to the command, which can include simple switches or complex expressions with variables, as well as references to other files as part of those switches or as part of redirected or piped output.

So one might ask, why not just store the whole command line as a token and allow the user to query using regular expressions to find exactly what they want? In a perfect world not constrained by performance, this might be the most flexible solution, but indexing in Solr/Lucene (or probably any data store for that matter) makes the performance of this prohibitive. Almost any query would need to start with a “.*” expression, and leading wildcard expressions are the equivalent of a full table scan in SQL: something to avoid at all costs.

Previous Solution

So what solution have we had until now? We converted the path characters “/” and “\” into spaces and then broke the command line up into tokens using white space as a delimiter. This typically did a good job of allowing the user to query for parts of the path or for the command name, as long as the command wasn’t enclosed in quotes. But it did not provide a good way of searching for tokens extracted from the arguments passed to the command.

Let’s take the following example:

C:\Windows\system32\rundll32.exe /d srrstr.dll,ExecuteScheduledSPPCreation

Previously, this would have been broken up into the following tokens, a result of removing all path characters and creating lowercase tokens based on white space:

c:
windows
system32
rundll32.exe
d
srrstr.dll,executescheduledsppcreation
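
As a rough illustration (a sketch of the behavior, not the actual server code), the 5.x tokenization amounted to the following:

    def tokenize_v1(cmdline: str) -> list[str]:
        # 5.x behavior: path characters become spaces, everything is
        # lowercased, and tokens are split on white space.
        return cmdline.lower().replace('/', ' ').replace('\\', ' ').split()

    print(tokenize_v1('C:\\Windows\\system32\\rundll32.exe /d '
                      'srrstr.dll,ExecuteScheduledSPPCreation'))
    # ['c:', 'windows', 'system32', 'rundll32.exe', 'd',
    #  'srrstr.dll,executescheduledsppcreation']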

As the tokens above show, it was not hard to search for elements of the path or the command name itself. What followed, however, was difficult or outright impossible to search for. Searching for “d” would match that switch, but potentially with many false positives, since a “d” token in some other command line could easily be something other than a switch. And searching for the final argument was impossible unless you searched for the exact string, because wildcards in queries were not supported.

Things got even worse if the command line was quoted. Here is the same command line as above, with the command and its path surrounded by quotes:

"C:\Windows\system32\rundll32.exe" /d srrstr.dll,ExecuteScheduledSPPCreation

This would result in the following tokens:

"c:
windows
system32
rundll32.exe"
d
srrstr.dll,executescheduledsppcreation

Not only did this have all the limitations of the prior example, but it also made it more difficult to search for either the drive letter or the command name itself. An escaped double quote had to be included as part of the search, which made simply searching for command names really ugly.

New 6.1 Solution

The new tokenization and query capability being introduced in 6.1 attempts to solve these issues and many others by doing the following:

  • Add extra characters to the list of characters that should always be converted to white space and therefore never be considered a part of a token.

  • Provide special handling to allow users to search for command line switches that start with a / character, instead of blindly assuming / is a path character and converting all such characters to spaces.

  • Add additional tokens for file extensions, to allow searching for a simple file extension in addition to an entire command or file name.

  • Add wildcard support, so that non-leading ? and * characters in queries match a single character and multiple characters within a token, respectively.

With these changes, the set of characters that will be converted to white space for tokenization purposes is the following:

      \ " ' ( ) [ ] { } , = < > & | ;
    

Notice that several important characters are not included in this list. The % and $ characters are often used for variables, so they remain untouched. The -, . and _ characters are often part of file names, so those also remain untouched. Other characters that remain part of tokens include ^, @, #, !, and ?.

As was mentioned before, the / character is now handled specially. If it is at the start of a token, it is assumed to be a command line switch, unless it is at the start of the entire command line, in which case it is assumed to be part of the path. As a result, absolute paths passed as arguments on Linux or Mac are tokenized as if the beginning of the path were a command line switch. So a command line of /bin/ls /tmp/somefile will produce the tokens bin, ls, /tmp and somefile. The parser cannot efficiently distinguish between a command line switch and a Unix-style absolute path, so this is a necessary limitation.

The : character is also handled specially. If it appears at the end of a token, it is assumed to be something the user would want to search for, like a drive letter, so it is kept. If there are multiple colons at the end of a token, or if the colons are not at the end of a token, they are converted to white space for tokenization purposes.

So based on the rules described above, let’s take another look at this command line:

"C:\Windows\system32\rundll32.exe" /d srrstr.dll,ExecuteScheduledSPPCreation

This will now be broken into the following tokens:

c:
windows
system32
rundll32.exe
.exe
/d
srrstr.dll
.dll
executescheduledsppcreation
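
To make the rules concrete, here is a rough Python sketch of the tokenization rules described above (the function names are hypothetical and this is illustrative only, not the actual server implementation):

    import re

    # Characters that are always converted to white space (see the list above).
    SEPARATORS = '\\"\'()[]{},=<>&|;'
    SEP_TABLE = str.maketrans(SEPARATORS, ' ' * len(SEPARATORS))

    def split_slashes(piece: str, first_in_cmdline: bool) -> list[str]:
        # A leading "/" marks a switch and is kept, unless the piece starts
        # the whole command line (then it is treated as a path).
        keep_leading = piece.startswith('/') and not first_in_cmdline
        parts = [p for p in piece.split('/') if p]
        if keep_leading and parts:
            parts[0] = '/' + parts[0]
        return parts

    def split_colons(token: str) -> list[str]:
        # A single trailing colon (e.g. a drive letter) is kept; any other
        # colons are treated as white space.
        trailing = token.endswith(':') and not token.endswith('::')
        body = token[:-1] if trailing else token
        parts = [p for p in body.split(':') if p]
        if parts and trailing:
            parts[-1] += ':'
        return parts

    def tokenize_v2(cmdline: str) -> list[str]:
        tokens = []
        pieces = cmdline.lower().translate(SEP_TABLE).split()
        for i, piece in enumerate(pieces):
            for part in split_slashes(piece, first_in_cmdline=(i == 0)):
                for token in split_colons(part):
                    tokens.append(token)
                    # Emit the file extension as an extra token, if present.
                    m = re.search(r'\.([a-z0-9]+)$', token)
                    if m:
                        tokens.append('.' + m.group(1))
        return tokens

    print(tokenize_v2('"C:\\Windows\\system32\\rundll32.exe" /d '
                      'srrstr.dll,ExecuteScheduledSPPCreation'))
    # ['c:', 'windows', 'system32', 'rundll32.exe', '.exe', '/d',
    #  'srrstr.dll', '.dll', 'executescheduledsppcreation']

    print(tokenize_v2('/bin/ls /tmp/somefile'))
    # ['bin', 'ls', '/tmp', 'somefile']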

These new rules for tokenization came about after much discussion with stakeholders, and many provided examples helped in the development of a concise and predictable set of rules that maximizes the utility and power of querying the command line. I’m certain there is room for some tweaks here and there, but this should make things a lot more useful.

One final note: the wildcard support required some UI and API level changes to allow the *, ? and \\ characters to be passed to the back end unescaped. Any queries in existing watchlists or feeds that contain these characters will need to be changed upon upgrade to 6.1; I would not expect there to be many of these, if any at all, on most systems. To minimize further incompatibility with prior versions, the new tokenization is not turned on by default, and once enabled it applies only to new event partitions.


Last modified on May 5, 2020