Historical: (Part 1) Reference Data - Ingestion

Creating a reference (symbology) data set for our hedge fund.

May 16, 2024

Now that we have live data being ingested daily with proper monitoring controls, let's shift our focus to historical data. We'll begin with reference data, as it's essential to understand our tradeable universe before downloading fundamentals, earnings, price, and volume data. Before starting, we need to define our universe. How far back should we look? Which exchanges will we focus on? What securities are we trading?

Let's begin with selecting the region. I've chosen the US, specifically focusing on securities traded on the two major exchanges: NASDAQ and NYSE. Next, we need to determine the time horizon. I've set the start date at January 1, 2017, as it provides a sufficient amount of data and is manageable with our data vendors. The further back we go, the more challenging it becomes, increasing exponentially with each year. Now, let's decide on what we will be trading. I will concentrate on the common shares security class, excluding other types like rights and warrants for now. Additionally, we'll focus on equities and leave ETFs for later.

Finally, let me emphasize this point: we must accept that we won't get all the data we want and that there may be data errors. Think of it like trying to win the Super Bowl. You can't just step onto the field and claim victory. You have to start with the basics. To win games, you need to score touchdowns; to score touchdowns, you need to get first downs; and to get first downs, you need to move the ball forward a few yards at a time. That's what we're doing here—making incremental progress. As we make progress and turn ideas into profits, we can scale up, obtain better data, and aim for bigger goals, like scoring our first touchdown.

So, what is reference data? Let me explain what I'm looking for. On any given day, I want to know:

Which symbols are being traded on the NYSE and NASDAQ.
Why certain symbols that were listed yesterday are no longer there today:
- Delistings
- Mergers & acquisitions
- Ticker changes
Why new symbols appear today that weren't listed yesterday:
- New listings
- Spinoffs
- Ticker changes
The other identifiers associated with these symbols for that day, such as:
- FIGI
- Composite FIGI
- ISIN
- CUSIP
- CIK

If I can get this information for each trading day, I'll be quite satisfied.

Let’s begin

We will rely on five data vendors:

OpenFIGI
SEC
Sharadar
Polygon
Intrinio

The first two are free, and the next two are affordable. Unfortunately, Intrinio is quite expensive, and I am only able to use it at a low cost because of an old subscription. Therefore, we need to find a more affordable alternative for you. Let's review how we are currently using Intrinio data, and then we can look for potential low-cost alternatives that provide similar reference data.

OpenFIGI

This data provider offers a fast and user-friendly API package called openfigipy. First, register on the OpenFIGI website to obtain a free API key. We will be requesting a substantial amount of data to create a mapping between various identifiers (e.g., CUSIP to Composite FIGI) using a large universe of known identifiers (e.g., CUSIP, Ticker, etc.). The identifier is specified by idType, which you can select from the available list provided here.

In the code example below, the query requests all equity entries with the CUSIP equal to XYZ123ABC that are listed or have been delisted and traded in US currency across any possible exchange. This query will return a significant amount of data, which you must store efficiently in your historical data directory. I recommend integrating this code with the API rate limit handling code we discussed earlier to efficiently use the OpenFIGI API.

So, where do we get our CUSIP or other identifiers? These will come from our other data vendors.

import pandas as pd
from openfigipy import OpenFigiClient

ofc = OpenFigiClient('API_KEY')

ofc.connect() # establish a requests session

df = pd.DataFrame({
    'idType': ['ID_CUSIP'],
    'idValue': ['XYZ123ABC'], 
    'currency': ['USD'],
    'marketSecDes': ['Equity'],
    'includeUnlistedEquities': [True],
})

result = ofc.map(df)

print(result.columns.tolist())
# ['q_idType',
#  'q_idValue',
#  'q_currency',
#  'q_marketSecDes',
#  'q_exchCode',         IMPORTANT
#  'query_number',
#  'status_code',
#  'status_message',
#  'figi',               IMPORTANT
#  'name', 
#  'ticker',             IMPORTANT
#  'exchCode',           IMPORTANT
#  'compositeFIGI',      IMPORTANT
#  'securityType',       IMPORTANT
#  'marketSector',
#  'shareClassFIGI',
#  'securityType2',      IMPORTANT
#  'securityDescription']

SEC

We need to use SEC data to map as many identifiers to one another as possible. One useful mapping I've found is the CIK to CUSIP mapping available in leoliu0’s Github Repo repository. I recommend forking this code and modifying it to include the timestamp from the filing. This will help us determine, as accurately as possible, the point in time when the CIK and CUSIP were mapped to each other. You should produce a file similar to cik-cusip-maps.csv, but with an additional column for the filing date. Set this up to run on your machine to start ingesting all the CIK/CUSIP mapping data. If you know of any other identifier mappings we can obtain from the SEC, please let me know so I can share them here.

Using the list of CUSIPs you obtained, you can return to OpenFIGI and ingest their reference mappings.

Sharadar

Sharadar provides an excellent source for a single ZIP file reference dump that doesn't require any special ingestion procedures.

Polygon

Polygon offers something very special: point-in-time reference data, which is extremely valuable for us. However, ingesting data from Polygon is challenging because they limit the number of items we can request with a single API key. For example, if we request the symbol "A," they will provide all tickers starting with that symbol, like AA and AAPL, but only up to a limit of about 1000 symbols. Unfortunately, we don't know our entire universe of tickers, and even if we did, querying each one individually would be inefficient. Therefore, we will query an expanded version of the alphabet to gather the data we need.

We will request symbols in segments such as A, AM, B, BM, C, CM, and so on. By dividing the symbols that start with each letter into two queries (e.g., [A-AM, AM-B]), we ensure that each smaller bucket contains fewer than 1000 symbols. This approach increases our buckets from 26 to 50, allowing us to efficiently retrieve all necessary data without exceeding the query limit.

letters = [x for x in range(51)]

For each bucket (i.e., letter in the code base below), we will define a range using the first and second letters (e.g., if letter = 2, then bucket = [AM, B]). Using the Polygon API, we can query tickers greater than or equal to the starting point (e.g., AM) and less than the next bucket (e.g., B). If the letter is in the last bucket, we don’t need the less than constraint.

alpha = [x for xs in [[x, '{0}M'.format(x)] for x in 
          list(string.ascii_uppercase)] for x in xs]

first_letter = alpha[letter]

if letter == 50:

    url = ("https://api.polygon.io/v3/reference/tickers?market=stocks&date={"
            "0}&limit=1000&ticker.gte={1}&apiKey=
            {2}").format(datesnap.strftime("%Y-%m-%d"), first_letter, apikey)

else:

    second_letter = alpha[letter + 1]
    url = ("https://api.polygon.io/v3/reference/tickers?market=stocks&date={"
            "0}&limit=1000&ticker.gte={1}&ticker.lt={2}&apiKey=
            {3}").format(datesnap.strftime("%Y-%m-%d"), first_letter,  
            second_letter, apikey)

This entire logic snippet will integrate with your API rate limiting code, as we need to query the API for each business day and each bucket individually (i.e., 50 buckets * 255 days a year * 6 years look back ~ 75,000 requests). Below is the argument that feeds into my worker wrapper. Once set up, run the process and save the data accordingly in your historical data directories.

arg = [((ix), {"date": job[0], "letter": job[1]}) for ix, job in
        enumerate(itertools.product(dates, letters))]

Intrinio

It's unfortunate that Intrinio has significantly raised their prices. I'll provide my ingestion routine here, but I don't recommend this dataset at their current price point. Ingesting the data is straightforward: you simply loop through the next page item if it is available and store the data within the securities variable to a file. For the input variables "active" and "delisted," you will have either True or False, resulting in a total of four files.

next_page = ""
while True:
    if next_page == "":
        url = "https://api-v2.intrinio.com/securities?page_size=10000&active=
               {0}&delisted={1}&api_key={2}".format(active, delisted, apikey)
    else:
        url = "https://api-v2.intrinio.com/securities?page_size=10000&active=
               {0}&delisted={1}&next_page={2}&api_key={3}".format(active, 
               delisted, next_page, apikey)

    response = requests.get(url)
    results = response.json()

    next_page = results['next_page']
    securities = results['securities']

    data = data + securities

    if next_page is None:
        break

Coming up next

We will explore how to cross-validate and integrate data from these individual vendors into a single, reliable source for our reference data going forward.

Thank you for reading The Livre (Free) Quant. This post is public so feel free to share it.

The Livre (Free) Quant