Historical: (Part 2) Reference Data - Cross Validation
Finding the maximum likelihood identifier.
We've reached a critical juncture in our project: consolidating the data gathered from various vendors into a dependable, unified dataset. Even top-tier hedge funds employ teams of engineers and data scientists to meticulously integrate and validate diverse data streams. However, our approach must be more streamlined and improvisational due to resource constraints. Picture this task as tackling an intricate puzzle—one that's daunting and possibly insurmountable.
Our sources
Here's a summary table detailing our data sources and what each one offers. We aim to integrate these sources on ticker, exchange, CUSIP6 (the first six characters of the CUSIP, which identify the issuer), and Composite FIGI (which identifies the security at the composite, i.e. country, level). Although not an immediate priority, we'll later map companies to their fundamentals using the CIK. Additionally, note that PIT denotes Point-In-Time data, meaning the data reflects what was known as of a specific date.
This table reveals both redundant information and areas where data is missing. While not explicitly depicted, it's common for data vendors to have their own gaps. When no single source is deemed more reliable than another, how do we determine the most probable identifiers? My solution lies in examining all potential identifiers through a combinatorial search, then selecting the one that appears most frequently for a specific ticker.
To start, we must ensure consistency in terminology across data vendors. This involves extracting CUSIP6 from the CUSIP, standardizing all exchange codes to MIC exchange codes, and normalizing ticker formats. For instance, vendors use different delimiters for share classes, so BRKA, BRK.A, BRK/A, and BRK-A may all refer to the same class A share; I suggest simply removing all non-letter characters (a rough sketch follows below). Additionally, for Point-In-Time (PIT) data, retain entries with known start and end dates within our exchanges of interest (XNYS and XNAS), and discard the rest. Finally, while we do have security types, we'll only carry this field through our dataset without using it to integrate vendor data, because it is noisy and its nomenclature varies across vendors.
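As a rough illustration of that ticker/exchange/CUSIP6 normalization, here is a minimal sketch; the df_vendor frame, its column names, and the exchange-code mapping are hypothetical placeholders for illustration, not any vendor's actual schema.
import re
import pandas as pd

def normalize_ticker(ticker):
    # DROP EVERY NON-LETTER CHARACTER SO BRK.A, BRK/A, AND BRK-A ALL BECOME BRKA
    return re.sub(r"[^A-Za-z]", "", str(ticker)).upper()

def extract_cusip6(cusip):
    # THE FIRST SIX CHARACTERS OF A CUSIP IDENTIFY THE ISSUER
    return str(cusip)[:6]

# HYPOTHETICAL VENDOR FRAME AND EXCHANGE-CODE MAPPING, FOR ILLUSTRATION ONLY
df_vendor = pd.DataFrame({"ticker": ["BRK.A", "BRK-A"],
                          "cusip": ["084670108", "084670108"],
                          "exchange": ["NYSE", "NYSE"]})
mic_map = {"NYSE": "XNYS", "NASDAQ": "XNAS"}

df_vendor["ticker"] = df_vendor["ticker"].map(normalize_ticker)
df_vendor["cusip6"] = df_vendor["cusip"].map(extract_cusip6)
df_vendor["exchange"] = df_vendor["exchange"].map(mic_map)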
Now, we proceed with integrating the data for any given date. Thus, for a specified date (YYYY-MM-DD), we need to subset the Sharadar, Polygon, and Intrinio datasets that are applicable on that particular date. Leveraging our CIK to CUSIP6 mapping retrieved from the SEC (as discussed in the previous post), we should sub-select the most recent mapping available as of the date (YYYY-MM-DD) and merge it into Polygon to acquire the corresponding CUSIP6 entry. Finally, we select only the columns of interest, as outlined in the aforementioned table, and subsequently eliminate any duplicates within those subsets.
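A minimal sketch of that as-of-date subsetting and CIK merge is below, assuming each vendor table carries start_date/end_date columns and the SEC mapping has date, cik, and cusip6 columns; the function and column names are hypothetical, not the vendors' actual schemas.
def active_on(df, as_of):
    # KEEP ROWS WHOSE [start_date, end_date] WINDOW CONTAINS THE AS-OF DATE (ASSUMED COLUMN NAMES)
    return df.loc[(df.start_date <= as_of) & (df.end_date >= as_of)]

def latest_cik_to_cusip6(df_map, as_of):
    # TAKE THE MOST RECENT CIK -> CUSIP6 MAPPING KNOWN AS OF THE DATE
    recent = df_map.loc[df_map.date <= as_of].sort_values("date")
    return recent.groupby("cik", as_index=False).last()[["cik", "cusip6"]]

def snapshot_vendors(df_sharadar_all, df_poly_all, df_intrinio_all, df_cik_map, as_of):
    # SUBSET EACH VENDOR TO THE ROWS VALID ON THE AS-OF DATE (YYYY-MM-DD)
    df_sharadar = active_on(df_sharadar_all, as_of)
    df_poly = active_on(df_poly_all, as_of)
    df_intrinio = active_on(df_intrinio_all, as_of)
    # ATTACH A CUSIP6 TO EACH POLYGON ROW VIA ITS CIK
    df_poly = df_poly.merge(latest_cik_to_cusip6(df_cik_map, as_of), on="cik", how="left")
    return df_sharadar, df_poly, df_intrinio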
import itertools
import numpy as np
import pandas as pd

df_1 = df_sharadar[["ticker","cusip6","exchange","type"]].copy()
df_2 = df_poly[["ticker","cusip6","composite_figi","exchange","type"]].copy()
df_3 = df_intrinio[["ticker","cusip6","composite_figi","type"]].copy()
Combinatorial Search
Now we approach the computationally intensive phase. For every ticker in the union of all tickers across Sharadar, Polygon, and Intrinio on the specified date (YYYY-MM-DD), we execute the following procedure. First, we mark the fields a vendor does not provide as unknown and tag each row with its source; then we combine the vendor datasets into a single table.
# FILL UNKNOWNS
df_1["composite_figi"] = np.nan
df_3["exchange"] = np.nan
# FILL SOURCE
df_1['src'] = 'sharadar'
df_2['src'] = 'poly'
df_3['src'] = 'intrinio'
# COMBINE ALL VENDORS
df = pd.concat([df_1, df_2, df_3]).drop_duplicates()
Next, we'll enumerate all potential combinations and, for each combination, ascertain the level of agreement among vendors.
# CREATE A LIST OF ALL POSSIBLE CUSIP6S, COMPOSITE_FIGIS, AND EXCHANGES
cusip6s = df.loc[~df.cusip6.isna()].cusip6.tolist()
composite_figis = df.loc[~df.composite_figi.isna()].composite_figi.tolist()
exchanges = df.loc[~df.exchange.isna()].exchange.tolist()
# HANDLE MISSING DATA
if len(cusip6s) == 0:
    cusip6s = ["NAN"]
else:
    cusip6s = set(cusip6s)
if len(composite_figis) == 0:
    composite_figis = ["NAN"]
else:
    composite_figis = set(composite_figis)
if len(exchanges) == 0:
    exchanges = ["NAN"]
else:
    exchanges = set(exchanges)
# CREATE A COMBINATORIAL SET OF POTENTIAL CUSIP6, COMPOSITE_FIGI, AND EXCHANGE
potential_lists = [
    cusip6s,
    composite_figis,
    exchanges
]
elements = [x for x in itertools.product(*potential_lists)]
# LOOP THROUGH ALL COMBINATIONS
prv_match = 0
checks = []
for idy, element in enumerate(elements):
    # HOW MANY ROWS WOULD MATCH IF WE ASSUMED THIS CUSIP6, COMPOSITE_FIGI, AND EXCHANGE
    v = df.loc[((df.cusip6 == element[0]) | df.cusip6.isna()) &
               ((df.composite_figi == element[1]) | df.composite_figi.isna()) &
               ((df.exchange == element[2]) | df.exchange.isna())]
    x = len(v)
    # WHICH SOURCES & TYPES WOULD MATCH
    srcs = ",".join([str(s) for s in v.src.tolist()])
    typs = ",".join([str(t) for t in v.type.tolist()])
    # IS THIS MATCH AN OPTIMAL RESULT
    if x > prv_match:
        checks = [list(element) + [srcs, typs]]
        prv_match = x
    elif x == prv_match:
        checks.append(list(element) + [srcs, typs])
The checks variable holds the best-supported combinations of identifiers given the sparse information provided by the data vendors. In other words, the “true” identifiers (or at least the ones we are most comfortable using) for ticker XYZ on date YYYY-MM-DD should be among the entries in checks. Please note that we run this search for every ticker and every combination, so it's crucial to utilize as many cores as possible. We'll delve into this procedure further in the upcoming posts.
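Since the search is independent per ticker, one way to use all available cores is a multiprocessing pool over the ticker universe. This is only a sketch under that assumption: best_identifiers is a hypothetical wrapper around the combinatorial loop above (subset df to one ticker, run the search, return that ticker's checks), not code from this post.
from multiprocessing import Pool, cpu_count

def best_identifiers(ticker):
    # HYPOTHETICAL WRAPPER: SUBSET THE COMBINED VENDOR TABLE TO ONE TICKER,
    # RUN THE COMBINATORIAL SEARCH ABOVE, AND RETURN (ticker, checks)
    ...

if __name__ == "__main__":
    tickers = sorted(df.ticker.unique())
    # ONE WORKER PER AVAILABLE CPU CORE
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(best_identifiers, tickers)
    checks_by_ticker = dict(results)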
Coming up next
Now that we have identified our potential identifiers, we need to utilize additional vendors (such as OpenFIGI) to narrow down to only one option. This process will involve examining all possible corner cases that may arise.

