NLP applied to alternative data

I recently learned that many hedge funds now collect a lot of alternative data to supplement traditional financial data when analyzing stocks, such as social media posts, satellite imagery, and data generated by the Internet of Things (IoT). Much of it is unstructured text. I would like to know whether NLP techniques can be applied to extract information from this text, and whether alternative data really makes analysts’ analysis of a stock and their forecasts of its price more accurate to some extent, rather than becoming a burden. :slightly_smiling_face:


That is a good question. Let’s see if someone has an answer for you; any answer will get five participation points.

Yes, NLP enables us to incorporate textual data in multiple languages from a variety of sources. As I saw in a report, one of the obvious NLP applications is gauging sentiment in text: is the tone of the news articles or research reports being published on a company positive or negative? An extension of this is topic modeling, which summarizes a large body of text into topics and themes that are easily understood by humans but can also be used for systematic analysis in statistical and machine learning applications. For example, what subjects did company management focus on in their earnings call this quarter versus last quarter? The information gauged from the alternative data will affect analysts’ decisions. Hope this helps.
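To make the sentiment idea concrete, here is a minimal sketch of lexicon-based scoring. The word lists, the function name `sentiment_score`, and the example headlines are all made up for illustration; a real system would likely use a finance-specific dictionary or a trained model.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"beat", "growth", "strong", "upgrade", "profit"}
NEGATIVE = {"miss", "decline", "weak", "downgrade", "loss"}

def sentiment_score(text: str) -> float:
    """Return (pos - neg) / (pos + neg), a value in [-1, 1].

    Returns 0.0 when no lexicon word appears in the text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# Made-up headlines, not real news.
headlines = [
    "Strong quarter: earnings beat estimates on revenue growth",
    "Analysts downgrade the stock after weak guidance",
]
scores = [sentiment_score(h) for h in headlines]
print(scores)
```

A word-count approach like this ignores negation and context, which is one reason noisy text can end up being a burden rather than a help.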


Hi Hongzheng,
I read an earlier paper on using Twitter mood to predict the stock market, which models the relationship between seven dimensions of public mood and the DJIA. The paper uses OpinionFinder to extract positive versus negative mood and GPOMS to extract six more specific moods from tweets, and then uses Granger causality analysis and a self-organizing fuzzy neural network (SOFNN), respectively, to analyze the relationship. It reports that including public mood from Twitter can significantly improve the forecast accuracy of the DJIA. Thus, I think the answer to your question is definitely yes. That said, NLP techniques in finance still need to improve at extracting the more significant and influential information and reducing noise, so that the data does not become a burden for analysts. Here is the link to the paper if you need it :slightly_smiling_face:
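For anyone unfamiliar with Granger causality, below is a rough sketch of a one-lag version of the test in plain Python. It is not the paper's code or data: the series are synthetic, the names `ols_rss` and `granger_f_lag1` are my own, and a real analysis would use several lags and a statistics package. The idea is to compare a model that predicts today's return from yesterday's return alone with one that also uses yesterday's mood.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares from least squares of y on the columns of X."""
    k = len(X[0])
    # Normal equations (X'X) beta = X'y as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def granger_f_lag1(x, y):
    """F-statistic for H0: x[t-1] adds nothing to predicting y[t] beyond y[t-1]."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    n, k = len(target), 3  # observations, parameters in the unrestricted model
    return (rss_r - rss_u) / (rss_u / (n - k))  # one restriction tested

# Synthetic data: returns are driven by the previous day's mood plus noise.
random.seed(0)
n = 200
mood = [random.gauss(0, 1) for _ in range(n)]
returns = [random.gauss(0, 0.5)]
for t in range(1, n):
    returns.append(0.8 * mood[t - 1] + random.gauss(0, 0.5))

f_forward = granger_f_lag1(mood, returns)   # does mood help predict returns?
f_reverse = granger_f_lag1(returns, mood)   # do returns help predict mood?
print(f_forward, f_reverse)
```

A large F-statistic in the forward direction and a small one in reverse is the pattern one would hope for; note that Granger causality only measures predictive content, not true causation.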


For sure, NLP does help forecast the future performance of stocks. I will give a specific application example to supplement the answers above regarding the sentiment analysis using NLP.

Of all the textual data in financial markets, the most informative, insightful, and professional is the research report produced by sell-side fundamental analysts. However, a popular stock will be covered by a large number of analysts from different firms. For instance, in China about three thousand analysts from more than a hundred firms publish over 240 thousand research reports each year. Reading all of these reports manually would be impossible, so NLP helps us process them effectively and efficiently. In addition, NLP can capture information beyond upward or downward rating adjustments, giving a more comprehensive view of the company. So, how is that done?

Suppose we train a machine to read a research report on a company written by a fundamental analyst. The first step is to transform the report into structured data, which lets us feed it into the NLP model for further analysis. The second step is to extract the key information from the report, including the stock ticker, core perspectives, attitudes towards future performance (revenue, profit growth rate), and the exact values of valuation metrics (multiples, DCF, FCFF), etc. The last step is to summarize all of these metrics, both qualitative and quantitative, by computing a weighted sentiment score across them, which gives a single sentiment score for the company under this analyst’s report.
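As a rough illustration of the last step, the sketch below combines a few extracted signals into one score. The field names, the weights, the ticker, and the numbers are all hypothetical; in practice each signal would come from an upstream extraction model, and the weights would be tuned.

```python
from dataclasses import dataclass

@dataclass
class ReportSignals:
    """Structured output for one research report; every field is assumed in [-1, 1]."""
    ticker: str
    core_view: float              # tone of the core perspectives
    revenue_outlook: float        # attitude towards future revenue
    profit_growth_outlook: float  # attitude towards profit growth
    valuation_upside: float       # e.g. target price vs. current price, scaled

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "core_view": 0.4,
    "revenue_outlook": 0.2,
    "profit_growth_outlook": 0.2,
    "valuation_upside": 0.2,
}

def report_sentiment(signals: ReportSignals, weights=WEIGHTS) -> float:
    """Weighted average of the signals, normalized by the total weight."""
    total = sum(weights.values())
    return sum(getattr(signals, name) * w for name, w in weights.items()) / total

# Made-up report for a made-up ticker.
example = ReportSignals("XYZ", core_view=0.5, revenue_outlook=0.2,
                        profit_growth_outlook=0.4, valuation_upside=-0.1)
score = report_sentiment(example)
print(round(score, 4))
```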

Now we have a methodology for analyzing a research report, and we can construct a factor investment strategy based on a report sentiment factor. To calculate the factor, we use NLP to score all research reports published within the last 90 days (90 is a hyperparameter) and take the time-decay-weighted average score of the reports in this window. We can compute this factor for every company. Once we have the factor, we should check its correlation with traditional common factors such as style and industry factors. If the correlation is high, we should neutralize the factor against them so that performance is not driven by these general characteristics.
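A minimal sketch of the factor calculation might look like the following. The exponential decay with a 30-day half-life is my assumption (the post only says "time-decaying"), the function names and sample data are invented, and the neutralization shown is the simplest case, demeaning within each industry. Neutralizing against style factors would typically need a cross-sectional regression on style exposures instead.

```python
from collections import defaultdict
from datetime import date

def decayed_factor(reports, today, window_days=90, half_life_days=30.0):
    """Time-decay-weighted average of report scores within the window.

    reports: list of (publish_date, sentiment_score) pairs for one company.
    Returns None when no report falls inside the window.
    """
    num = den = 0.0
    for published, score in reports:
        age = (today - published).days
        if 0 <= age <= window_days:
            weight = 0.5 ** (age / half_life_days)  # weight halves every half-life
            num += weight * score
            den += weight
    return num / den if den else None

def industry_neutralize(factor, industry):
    """Subtract each industry's mean factor value from its member stocks."""
    groups = defaultdict(list)
    for stock, value in factor.items():
        groups[industry[stock]].append(value)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    return {stock: value - means[industry[stock]] for stock, value in factor.items()}

# Made-up reports: today, 30 days old, and one outside the 90-day window.
today = date(2024, 4, 1)
reports = [(date(2024, 4, 1), 1.0), (date(2024, 3, 2), 0.0), (date(2023, 12, 1), -1.0)]
factor_value = decayed_factor(reports, today)

# Made-up cross-section of factor values and industry labels.
raw = {"A": 0.6, "B": 0.2, "C": -0.1}
labels = {"A": "tech", "B": "tech", "C": "bank"}
neutral = industry_neutralize(raw, labels)
print(factor_value, neutral)
```

The window length and half-life are both hyperparameters; a shorter half-life makes the factor react faster to new reports at the cost of more turnover.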

Constructing the analyst sentiment factor for a company is a kind of quantamental investing: we can use NLP to efficiently transform huge amounts of fundamental analysis into a quantitative score. There is no doubt that both discretionary fund managers and factor investing managers can benefit from constructing this factor. Empirical practice suggests that this factor has a high information coefficient and a high Sharpe ratio for a single-factor long-short portfolio.
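For readers new to these two metrics, here is a small sketch of how they are commonly computed. I use the rank (Spearman-style) version of the information coefficient and an annualized Sharpe ratio with the risk-free rate ignored; both are assumptions on my part, and the numbers are invented, not results from the factor above.

```python
import math
import statistics

def ranks(values):
    """Rank positions (0 = smallest); assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def information_coefficient(factor, forward_returns):
    """Rank IC: correlation between factor ranks and next-period return ranks."""
    return pearson(ranks(factor), ranks(forward_returns))

def sharpe_ratio(period_returns, periods_per_year=12):
    """Annualized mean / standard deviation of returns (risk-free rate ignored)."""
    mean = statistics.mean(period_returns)
    stdev = statistics.stdev(period_returns)
    return mean / stdev * math.sqrt(periods_per_year)

# Made-up cross-section: factor values today and returns over the next period.
factor = [0.9, 0.1, 0.5, -0.3]
forward_returns = [0.04, 0.01, -0.01, -0.02]
ic = information_coefficient(factor, forward_returns)

# Made-up monthly returns of a long-short portfolio built on the factor.
long_short = [0.02, -0.01, 0.03, 0.00]
sharpe = sharpe_ratio(long_short)
print(ic, sharpe)
```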

Hope this example helps.


Thanks for the good question. Anyone who asks another good question will get 5 participation points. If you ask two good questions, you get all the participation points.

To add to the discussion, hybrid models built around LSTMs can be beneficial. CNNs, though best known for image tasks, can also be applied to text to extract the essential information, and you can then experiment with that information to see what predictions it supports.


Some players take this seriously, and in some research projects it has gone beyond a hypothesis.

Here is an example:

An AQR whitepaper, “Can Machine Learning Help Manage Climate Risks?” (2021), concludes that:

“Engle, Giglio, Kelly, Lee, and Stroebel in “Hedging Climate Change News”, 2020, where they use textual analysis and machine learning techniques to create a broad climate hedging portfolio based on stocks’ sensitivity to climate news. We find that, subject to further research, these insights could be used as a complement to carbon-aware investing in defending against climate change.”


“Engle et al. (2020) present a novel and effective climate-risk hedging portfolio derived from a combination of economic theory and textual analysis. While their emphasis is on hedging climate change news coverage in The Wall Street Journal, the framework is a general, rigorous methodology. It can flexibly accommodate alternative hedging targets that researchers might hypothesize.”

I guess these have been discussed earlier here somewhere, but anyway, some other examples:

Lasse Heje Pedersen, Abhilash Babu, and Ari Levine (2021), Enhanced Portfolio Optimization, Financial Analysts Journal, 77:2, 124-151, DOI: 10.1080/0015198X.2020.1854543. SSRN: Enhanced Portfolio Optimization by Lasse Heje Pedersen, Abhilash Babu, Ari Levine

Ke, Zheng and Kelly, Bryan T. and Xiu, Dacheng, Predicting Returns with Text Data (September 30, 2020). University of Chicago, Becker Friedman Institute for Economics Working Paper No. 2019-69, Yale ICF Working Paper No. 2019-10, Chicago Booth Research Paper No. 20-37. SSRN: Predicting Returns with Text Data by Zheng Tracy Ke, Bryan T. Kelly, Dacheng Xiu