Market forecast summarization is the task of compiling market forecasts supplied by domain experts. It is an essential task for investment strategists, who base their investment decisions on such forecasts. A rich source of market forecast information is the World Wide Web. Our goal is to develop a Web service that identifies and extracts forecast information for a user-specified market from the World Wide Web and, based on the documents found, generates a graphical summary that can be used for predictions.
How is the future potential of a market estimated? Market research offers a range of methods to answer this question. One of the most popular approaches in recent years is Internet-based literature research: the Internet is a rich source of market forecast statements, and an analyst who collects and interprets these statements may obtain a reasonable idea of the future market volume. However, manually conducted literature research on the Internet is time-consuming and usually not exhaustive, since the human ability to manage the flood of information on the Internet is limited. These facts motivated our research into automating this process.
We developed a four-stage approach to automate the process of market forecast summarization. A user of the system has to specify at least one keyword that characterizes the market of interest. The four stages are (1) a focused search for documents containing market forecasts, (2) an analysis of whether the search results contain forecast information, (3) parsing of time and money information, and (4) identification of significant associations between the extracted time and money values. Finally, a diagram is generated that contains all the forecasts identified for the respective market keyword.
- Collecting Candidate Documents. Internet search engines provide an obvious way to retrieve documents about a certain topic, and we use them in a meta-search manner to compile a first collection of candidate documents. The starting point is a user-specified set of keywords that characterize the market of interest, e.g. "online advertising". These keywords are combined with entries from a hand-crafted vocabulary of typical market analysis terms. Search queries such as "online advertising forecast" are then sent to well-known search engines, and the highly ranked results are downloaded.
- Report Filtering. At this stage the downloaded documents are classified according to whether or not they are likely to contribute valuable information to the market analysis. This is done by genre detection, which differentiates documents with respect to their form, style, or targeted audience. Some well-known genres of Web pages are online shops, help pages, discussion forums, etc. Each genre comes with its own characteristics, as does the genre of "market forecast documents". We identify a set of such characteristics and classify the downloaded documents on their basis using a statistical discriminant analysis. Note that this kind of genre detection automates one of the most time-consuming tasks of a manually conducted forecast summarization.
- Time and Money Identification. Documents that are likely to contain market forecast information are subject to a closer analysis: all time and money phrases are identified. Extracting time and money information from plain text is challenging, since both can appear in very different forms. For example, time information can be expressed as dates, as relative time information like "next year", or as vague time information like "within the past years". Money information can be identified on the basis of trailing or leading currency symbols.
- Phrase Analysis. This stage associates the time and money information extracted in the previous stage by means of natural language processing. The result of this stage is a set of forecast statements in a machine-readable form. To obtain them, each phrase of a candidate document is parsed in order to find out whether it actually is a market forecast or whether it merely contains time and money information in another sense.
- Presentation. The forecast statements identified at the end of stage four can be analyzed and presented according to the needs of a user. Currently each statement is included in a diagram where each point corresponds to one forecast. The x-axis of the diagram shows the time in years and the y-axis shows the turnover a forecast predicts for the respective year.
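The stages above can be illustrated with a minimal Python sketch. It is not the implementation used in the project: the vocabulary `ANALYSIS_TERMS`, the regular expressions, and the function names are assumptions made for illustration, and the pairing rule (each money phrase is matched with the first year that follows it in the sentence, otherwise with the last year before it) is a simple positional heuristic standing in for the natural language parsing described under Phrase Analysis. The sketch covers only leading currency symbols and four-digit years; relative or vague time expressions such as "next year" are not handled.

```python
import re
from itertools import product

# Hypothetical vocabulary of market analysis terms (the project's actual,
# hand-crafted vocabulary is not given in the text).
ANALYSIS_TERMS = ["forecast", "market volume", "revenue projection"]

def build_queries(market_keywords, terms=ANALYSIS_TERMS):
    """Stage 1: combine every market keyword with every analysis term."""
    return [f"{kw} {term}" for kw, term in product(market_keywords, terms)]

# Stage 3 (simplified): a leading currency symbol, a number, and an
# optional magnitude word.
MONEY_RE = re.compile(
    r"[$€£]\s?(?P<num>\d+(?:,\d{3})*(?:\.\d+)?)"
    r"(?:\s?(?P<mag>million|billion|trillion))?",
    re.IGNORECASE,
)
# Four-digit years from 1900 to 2099 that are not part of a money amount.
YEAR_RE = re.compile(r"(?<![$€£])\b(?:19|20)\d{2}\b")
MAGNITUDE = {"million": 1e6, "billion": 1e9, "trillion": 1e12}

def extract_forecasts(sentence):
    """Stages 3 and 4 (heuristic): return (year, amount) pairs of a sentence."""
    years = [(m.start(), int(m.group())) for m in YEAR_RE.finditer(sentence)]
    forecasts = []
    for m in MONEY_RE.finditer(sentence):
        if not years:
            break  # money without any time information is not a forecast
        factor = MAGNITUDE.get((m.group("mag") or "").lower(), 1.0)
        amount = float(m.group("num").replace(",", "")) * factor
        following = [year for pos, year in years if pos >= m.end()]
        # Prefer the first year after the money phrase, else the last one before.
        year = following[0] if following else years[-1][1]
        forecasts.append((year, amount))
    return forecasts

if __name__ == "__main__":
    print(build_queries(["online advertising"]))
    # Invented example sentence, for illustration only.
    print(extract_forecasts("Revenue will grow from $2 billion in 2008 "
                            "to $3 billion in 2010."))
```

Each resulting pair corresponds to one point of the diagram described under Presentation, with the year on the x-axis and the predicted turnover on the y-axis. A real system would additionally need the genre-based report filtering of stage two before sentences are handed to the extraction step.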
Students: Katja Schöllner, Verena Skuk