With the initial dataset complete, the team began data exploration to understand how the story came together. To inform our perspective, the team reviewed the distribution of every numeric variable, both in aggregate and over time, across three distinct populations: the entire dataset, observations where arbitrage was present, and observations where it was not. This approach provided insight into how each variable was distributed. A significant number of histograms provided minimal insight because of outliers, which prompted us to consider, develop, and review alternative representations such as log, change, and percentage transformations. While some of these have been explored, many remain outstanding and offer an opportunity for continued discussion. We welcome thoughts and encourage questions (Link to GitHub).
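
The alternative representations mentioned above (log, change, and percentage) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. The function names and the `trade_sizes` sample are hypothetical, and `log1p` is an assumed choice of log transform that tolerates zero values.

```python
import math

def log_transform(values):
    """Apply log(1 + v) to each non-negative value to compress a long right tail."""
    return [math.log1p(v) for v in values]

def change(values):
    """Absolute change between consecutive observations."""
    return [b - a for a, b in zip(values, values[1:])]

def pct_change(values):
    """Percentage change between consecutive observations.

    Returns None where the previous value is zero, since the ratio is undefined.
    """
    return [None if a == 0 else (b - a) * 100 / a
            for a, b in zip(values, values[1:])]

# Hypothetical trade sizes with one extreme outlier that would flatten a histogram.
trade_sizes = [10.0, 12.0, 9.0, 11.0, 50000.0]
logged = log_transform(trade_sizes)   # the outlier is now on a comparable scale
deltas = change(trade_sizes)
pct = pct_change(trade_sizes)
```

On the raw scale the outlier forces every other observation into a single histogram bin. After the log transform the values span roughly 2 to 11, so the shape of the bulk of the distribution becomes visible.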

 

In addition, we included a decile perspective on the distribution of values. For those not familiar, a decile analysis divides the sorted dataset into ten equal parts, each representing 10% of the observations, to give a clear picture of where data points cluster and how they spread across the range. We calculated the deciles for each of the populations (Total, Arbitrage, Not Arbitrage). This analysis surfaced some material differences across individual variables and datasets, which we believe will help to inform model creation. We also included visualizations of summary statistics and comparisons across datasets to support and simplify the review. Lastly, we provided a correlation matrix of the variables that were most and least correlated, to help ensure not only that the inclusion of each variable made theoretical sense, but also that the model was theoretically sound.
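
To make the decile and correlation steps concrete, here is a rough sketch using only the Python standard library. It is an illustration under stated assumptions rather than the team's implementation: the row layout, the `is_arbitrage` flag, and all function names are hypothetical, and Pearson correlation is an assumed choice since the text does not name the correlation measure used.

```python
import bisect
import math
import statistics
from itertools import combinations

def decile_cuts(values):
    """Nine cut points that split the values into ten equal-sized groups."""
    return statistics.quantiles(values, n=10)

def assign_decile(value, cuts):
    """Decile label from 1 (lowest 10%) to 10 (highest 10%)."""
    return bisect.bisect_right(cuts, value) + 1

def deciles_by_population(rows, field, flag="is_arbitrage"):
    """Decile cut points for the Total, Arbitrage, and Not Arbitrage populations."""
    populations = {
        "Total": [r[field] for r in rows],
        "Arbitrage": [r[field] for r in rows if r[flag]],
        "Not Arbitrage": [r[field] for r in rows if not r[flag]],
    }
    return {name: decile_cuts(vals) for name, vals in populations.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranked_correlations(columns):
    """All variable pairs, sorted from most to least correlated by absolute value."""
    pairs = [((a, b), pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)
```

Comparing the cut points returned by `deciles_by_population` side by side shows where the arbitrage and non-arbitrage distributions diverge. The head and tail of `ranked_correlations` give the most and least correlated variable pairs.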

 

In addition to reviewing differences across the populations (Arbitrage, No Arbitrage, and Entire Dataset), we sought to understand what other dimensions might show meaningful distinctions and explored them visually. Specifically, we suspected there could be differences based on the direction of trade (Purchase vs. Sale) and on the individual conducting the trade, given the combination of popular-culture relevance and the technical complexity required to understand the entire landscape. To examine this, we created a dataset that aggregated the unique transactions in Pool 1 and Pool 2 at the individual trader level, representing consolidated trader activity. This trader-level perspective distinguished purchases from sales and was used to inform the creation of four user profiles: Retail Trader, Retail Investor, Trader, and Investor (Link to GitHub).
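
The trader-level aggregation described above might look something like the sketch below. It is only an illustration: the transaction fields (`trader`, `pool`, `side`, `amount`) and the function name are hypothetical, and the rules that map a trader to one of the four profiles are not shown because the text does not state them.

```python
from collections import defaultdict

def aggregate_by_trader(transactions):
    """Consolidate Pool 1 and Pool 2 transactions into one record per trader,
    keeping purchases and sales separate."""
    summary = defaultdict(lambda: {
        "purchase_count": 0, "purchase_volume": 0.0,
        "sale_count": 0, "sale_volume": 0.0,
        "pools": set(),
    })
    for tx in transactions:
        side = tx["side"]
        if side not in ("purchase", "sale"):
            raise ValueError(f"unknown trade direction: {side!r}")
        record = summary[tx["trader"]]
        record[f"{side}_count"] += 1
        record[f"{side}_volume"] += tx["amount"]
        record["pools"].add(tx["pool"])   # which pools this trader was active in
    return dict(summary)

# Hypothetical transactions across the two pools.
transactions = [
    {"trader": "0xA", "pool": 1, "side": "purchase", "amount": 100.0},
    {"trader": "0xA", "pool": 2, "side": "sale", "amount": 40.0},
    {"trader": "0xB", "pool": 1, "side": "sale", "amount": 5.0},
]
trader_activity = aggregate_by_trader(transactions)
```

Each resulting record gives per-trader counts and volumes by direction, which is the kind of consolidated view a profile assignment could then be built on.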