Explored the possibility of predicting stock market swings using sanitized Twitter data, as our first venture into Data Science.
Team Members: Abhigyan Kaustubh, Brennen Smith, Padma Vaithyam, Wenxuan Zheng
Tools: R, Excel
Key Activities: Data sanitization, regression, sentiment analysis, lexical analysis, TF-IDF, Data Visualization (heat maps)
Timeline: 10 weeks
Statistical data concerning people's opinions and emotions has, in some instances, proven accurate in gleaning what will happen in the near or more distant future, when understood and applied correctly. Here, we adopted a 'Data Science' based approach to test our hypothesis that there is a correlation, and possible causation, between the rise and fall of a company's stock price and the corresponding tweets about it on Twitter. Through our research and analysis, we incorporated three different methods of sentiment and lexical analysis with the intention of generating accurate predictions for 5 major companies, using tweet keywords collected from January 2008 through March 2010.
The resulting figures and visualizations showed that there was no correlation between the stock market movement and the Twitter stock handles that we used.
The graphical results (graphs and heat maps) of the analysis can be viewed here.
Primary Methods – Sentiment and Lexical Analysis
Our first step in processing the tweets we had cleaned was determining numeric values for each of the keywords. We took three different approaches in our sentiment analysis, each returning slightly different values.
Our first method was a hybrid of sentiment analysis and word-strength scoring, based solely on each word's definition and its contextual usage in the stock market. We went through our list of keywords, determined whether each word had a negative or positive connotation in the context of stock trading, and assigned it a sign of +1 or −1 accordingly. In addition, we scored each word on a scale from 0 to 3.5, depending on its relevance to the market's movement. We then multiplied these two numbers together to form an aggregate score for the keyword. However, we were skeptical of this approach: the scores might not have been accurate by the norms of the stock market, and by scoring the keywords ourselves we were potentially injecting our own bias into the data. To improve the fidelity, we implemented a second method, TF-IDF.
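The analysis itself was done in R; the following is a minimal Python sketch of this first method. The keywords and their scores here are invented for illustration, not the actual lexicon we built.

```python
# Hypothetical hand-scored lexicon, for illustration only: the sign
# (+1/-1) comes from the word's connotation in stock trading, and the
# strength (0 to 3.5) from its judged relevance to market movement.
lexicon = {
    "rally": (+1, 3.0),
    "surge": (+1, 2.5),
    "crash": (-1, 3.5),
    "dip":   (-1, 1.5),
}

def score(word):
    """Aggregate score = connotation sign * relevance strength."""
    sign, strength = lexicon[word]
    return sign * strength

print(score("crash"))  # -3.5
```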
TF-IDF, or Term Frequency–Inverse Document Frequency, is a numeric value reflecting how important a word is within a document corpus. It increases proportionally with the number of times the word appears in a document, but is offset by the frequency of the word across the corpus as a whole. This keeps words from being given greater weight simply because they are common in the language in general. To compute these values, we first created a Document Term Matrix (DTM), which maps the count of each keyword to each document (tweet) in our corpus. From this matrix, we calculated the TF-IDF values for our dataset and determined each word's strength within the corpus as a whole. We then multiplied each strength by the positive/negative sign determined in the first method to obtain the word's net strength. However, this method weighs word importance by usage within the corpus rather than by the actual meaning of the word, so one word risked being falsely given greater strength than another simply because of how often it appeared in the matrix. This led us to our final method, Boolean sentiment analysis.
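A minimal Python sketch of the DTM and TF-IDF computation described above, using a tiny invented corpus (our actual pipeline was built in R). A rarer word like "crash" ends up with a higher TF-IDF score than a common one like "stock", even at equal counts within a tweet.

```python
import math

# Toy corpus of cleaned tweets (each tweet is a list of tokens).
tweets = [
    ["stock", "rally", "strong"],
    ["stock", "crash"],
    ["rally", "rally", "buy"],
]

# Document Term Matrix: one row per tweet, one column per vocabulary word.
vocab = sorted({w for t in tweets for w in t})
dtm = [[t.count(w) for w in vocab] for t in tweets]

def tf_idf(word, doc_idx):
    j = vocab.index(word)
    tf = dtm[doc_idx][j] / sum(dtm[doc_idx])   # term frequency in this tweet
    df = sum(1 for row in dtm if row[j] > 0)   # tweets containing the word
    idf = math.log(len(tweets) / df)           # inverse document frequency
    return tf * idf

# "crash" (1 tweet) outweighs "stock" (2 tweets) within tweet 1.
print(tf_idf("crash", 1) > tf_idf("stock", 1))  # True
```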
Our final method of sentiment analysis forwent word strength entirely and based each score solely on the meaning of the word in the context of the stock market. Instead of weighing each word on a variable scale, we simply assigned each a value of +1 or −1. While this method loses keyword granularity, it ensures that every word carries equal weight within the corpus. We used it to check whether our strength scoring was itself a confounding variable.
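The Boolean variant reduces to a sign lookup; a small sketch with hypothetical word lists:

```python
# Hypothetical positive/negative keyword lists: every keyword carries
# equal weight, and only its sign distinguishes sentiment.
positive = {"rally", "surge", "gain"}
negative = {"crash", "dip", "loss"}

def boolean_score(word):
    # +1 for a positive connotation, -1 for negative, 0 if not a keyword
    return 1 if word in positive else (-1 if word in negative else 0)

print(boolean_score("surge"))  # 1
```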
Once we had determined keyword values under each method, we mapped the appropriate values onto the keywords in each tweet. For tweets containing multiple keywords, we summed the keywords' values. We deliberated between averaging and summing, but decided that a tweet with more keywords should carry greater weight than one with a single keyword. When each of these three datasets was complete, we imported it into R to perform analysis and comparison against the corresponding stock dataset.
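The aggregation step above can be sketched as follows (Python illustration with invented scores; any of the three scoring methods could supply the per-keyword values):

```python
# Hypothetical per-keyword scores; non-keywords contribute nothing.
keyword_scores = {"rally": 3.0, "surge": 2.5, "crash": -3.5, "dip": -1.5}

def tweet_score(words):
    # Summation, not averaging: a tweet with more keywords should
    # carry greater weight than one with a single keyword.
    return sum(keyword_scores.get(w, 0.0) for w in words)

print(tweet_score(["big", "rally", "after", "dip"]))  # 1.5
```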
To determine whether the movements of the tweets and the stocks were correlated, we wrote an automated function that calculated a linear regression over the stock values and the tweet scores for each month. We then took the slope of each tweet regression and compared it against the slope of the corresponding stock regression. This let us check not only whether the predicted directions of movement matched, but also whether the two datasets' regressions were correlated, and if so, how strongly. Through this experiment, we found that the average correlation between the Twitter and stock regressions was 0.33644. This is well below the conventional threshold of 0.7, so we were unable to reject the null hypothesis. In addition, we found that five of the twelve months had slopes with opposite signs, meaning the tweet data predicted a downward movement when the actual movement was upward, and vice versa.
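The month-level comparison can be sketched as follows; this is a Python illustration with made-up daily values (our implementation used R's regression facilities). We fit a least-squares line to each series, then compare slope signs and the correlation between the series.

```python
def slope(ys):
    # Least-squares slope of ys regressed on the day index 0..n-1.
    n = len(ys)
    mx, my = (n - 1) / 2, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def pearson(a, b):
    # Pearson correlation coefficient between two equal-length series.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

tweet_scores = [1.0, 0.5, -0.2, 0.8, 1.1]    # invented daily tweet scores
stock_close  = [10.0, 10.2, 9.9, 9.7, 9.5]   # invented closing prices

# Opposite slope signs mean the tweets "predicted" the wrong direction.
print(slope(tweet_scores) * slope(stock_close) > 0)  # False
```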
Through several statistical methods, our analysis indicates that there is no correlation between stock tweets and stock market swings. This is in line with a recently published research study (as of March 5th, 2013) by the Pew Research Center, which reveals that comments on Twitter rarely line up with opinions gauged in national polls and surveys. They found that Twitter reactions were, by and large, more liberal and pro-Democratic in tone than the national average (Mike Krumboltz, Yahoo News). For example, Pew found that Twitter responses to Obama's second inaugural address were "not nearly as positive as public opinion" (Mike Krumboltz, Yahoo News). Also, Twitter can be used by people under the age of 18 and by those who live outside of the U.S. Pew concludes that Twitter users aren't representative of the public at large, thus challenging the perception that opinions posted on Twitter reflect the overall population.
In the future, a more representative dataset of the stock market would improve the accuracy and precision of predicting actual stock market rises and falls. Further, access to a dictionary that accurately captures quantitative values for the words used to describe stock market behavior, and for the emotions expressed by Twitter users, could yield results demonstrating correlation, and possibly causation, between Twitter data and stock market movement to a much higher degree.