Second Update: Gathering Market Data and Writing/Debugging a Computer Program


The reason for my rather long blogging hiatus is that I was taking two English courses at Christ’s College in Cambridge, UK. In my last blog post I said I collected press statements released by the Federal Reserve between the years 2001-2008, which I would like to correct to the years 2000-2008. It was during these years that interest rates went through a full cycle from high to low and then back to high. Specifically, I focused on the yields of six-month, one-year, and two-year U.S. Treasury bonds. I gathered 317 press statements, and each data point had six different yields associated with it: the three aforementioned T-bond yields from the day of the statement’s release, and then the same three yields from exactly one year after the statement’s release. Thus, I have 1,902 pieces of data, which can be very tedious to work with.

Unfortunately, I found that Microsoft Word documents were much more difficult to work with in Python than plain text documents, so I had to convert all 317 statements into plain text documents before opening them in Python. When documents are exported to different formats, problems are inevitable, and in my case, many characters that are not part of the English alphabet appeared randomly throughout the statements. With so many statements and no pattern to the dispersion of the unwanted characters, it would not have been worth my time to scan through every single statement, each hundreds or even thousands of words long, looking for the characters. I decided, then, to use Python to strip the statements of the unwanted characters upon opening and reading them in the program. I used the Natural Language Toolkit and Python documentation as guides when writing my program. I was able to open and read the text in Python and let the computer scan the documents for the characters and remove them when needed, saving time and effort.
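The character-stripping step can be sketched roughly like this. This is an illustration rather than my actual program: the function name `strip_non_ascii` and the sample string are made up, and it simply drops anything outside the ASCII range, which covers the stray non-English characters described above.

```python
def strip_non_ascii(text):
    """Return the text with every non-ASCII character removed.

    The converted statements should contain only English letters,
    digits, and ordinary punctuation, so anything with a code point
    of 128 or above is treated as export debris and dropped.
    """
    return "".join(ch for ch in text if ord(ch) < 128)


# Example: a curly apostrophe (a common conversion artifact) is removed.
cleaned = strip_non_ascii("the Committee\u2019s decision")
```

In practice a function like this would be applied to each file's contents right after reading it, so no statement ever needs to be scanned by hand.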

After opening the files in Python, I needed to get the program to recognize each word individually. The program initially sees a file as one long string of text, with no means of dividing the text into words until you specify that it should split the string at spaces, which yields individual words. As such, I split up the long string of text, which contained everything in a file, at the spaces, so that what emerged was a list of individual words, including repeats and numbers. Another problem arose with periods. Because there is no space between the last word of a sentence and the period that follows it, periods were being included with the words before them when the program split up the text at the spaces. This created a big problem because my goal was to get the frequency distribution of words for each file. For example, “Monroe” would be counted as a different word than “Monroe.”, which renders the frequency distributions inaccurate. Fixing this problem was a little more involved than the aforementioned one, in that each period had to be changed to either a space or an empty character depending on its context. If there was no space between a period and the word after it, I changed the period to a space; if there was already a space after the period, I just removed the period. Other punctuation that did not need to be examined conditionally like the periods, such as commas, parentheses, backslashes, etc., I removed outright.
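The period rule above could be sketched as follows. This is a minimal illustration, not my original code: the function name `tokenize`, the sample sentence, and the exact list of unconditionally stripped punctuation are all assumptions for the example.

```python
import re


def tokenize(text):
    """Split a statement into a list of words, normalizing punctuation first."""
    # A period glued directly to the next word becomes a space,
    # so "steady.The" splits into two words.
    text = re.sub(r"\.(?=\S)", " ", text)
    # Any remaining period already has a space (or nothing) after it,
    # so it can simply be deleted.
    text = text.replace(".", "")
    # Punctuation that never needs to be examined in context is stripped
    # outright (an illustrative subset, not an exhaustive list).
    for ch in ',()\\;:':
        text = text.replace(ch, "")
    return text.split()


words = tokenize("Rates held steady.The Committee met, today.")
```

With this normalization, “Monroe” and “Monroe.” both count toward the same entry in the frequency distribution.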

Sparing the details that took days to sort out and debug, I was able to create a list of tuples, each containing two pieces of data: the name of the file, which is the date of the statement’s release, and then a list of all the words in the file, with one word per entry. After doing this, I ran a frequency distribution on each list of words, which created a list of tuples, each with two pieces of data: a word and then how many times it occurred in that file. Next, I created a list consisting of smaller lists, each with two pieces of data: ‘freqdist’ + the name of the file, and then the frequency distribution of that file, which was the list of tuples described in the previous sentence. The step to follow the creation of this list was to group together the files that had the same five most commonly occurring words and then compare the market data for each group, both within itself and then against the other groups. I will discuss the results in the next blog update.
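In outline, the frequency-distribution and grouping steps might look like the sketch below. It substitutes Python’s `collections.Counter` for NLTK’s `FreqDist` (both report word counts and the most common words), and the filenames and word lists are invented for illustration; the real data would be the 317 tokenized statements.

```python
from collections import Counter

# Hypothetical stand-in for the real data: file name (release date)
# mapped to that file's list of words.
files = {
    "2001-01-03": ["rates", "the", "committee", "the", "growth", "the",
                   "rates", "committee", "inflation", "policy"],
    "2001-01-31": ["the", "rates", "committee", "growth", "the",
                   "inflation", "the", "rates", "committee", "policy"],
}

# One frequency distribution per file: (file name, word -> count).
freqdists = [(name, Counter(words)) for name, words in files.items()]

# Group files whose five most common words are the same set.
groups = {}
for name, dist in freqdists:
    # Sort the top five alphabetically so word order doesn't matter.
    key = tuple(sorted(w for w, _ in dist.most_common(5)))
    groups.setdefault(key, []).append(name)
```

Each group’s file names are release dates, so the market data (the six T-bond yields per statement) can then be compared within and across groups.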