Workshop on Scholarly Web Mining

Workshop Program

Date: Friday, February 10, 2017

Location: Council Chambers, The Guildhall, Cambridge, UK

Time: 9:00am - 12:30pm

9:00

Workshop introduction

Thomas Potok, Oak Ridge National Laboratory

9:05

Invited talk

Johanna McEntyre, Europe PMC

Europe PMC is a database of abstracts and open access full text articles in the life sciences, covering all of PubMed and PMC. As a core part of the Europe PMC mission, we share this content as widely as possible to support the development of algorithms and text-mining tools based on this content. Recently, we developed a platform called SciLite that enables text miners to upload annotations and highlight them on articles, as a tool to assist readers in browsing, or to discover links to related data. In this presentation I will describe some of the ways that developers can consume textual data and publish results on Europe PMC, and how doing this could contribute to a more connected literature-data ecosystem for the life sciences.

9:25

Effectively identifying users' research interests for scholarly reference management and discovery

Marco Rossetti, Saúl Vargas, Davide Magatti, Benjamin Pettit, Daniel Kershaw, Maya Hristakeva and Kris Jack

Discovering users' interests is essential in order to help them explore resources in large digital repositories. In particular, correctly identifying users' interests is commonly a good approach for organising information and providing personalised recommendations. We consider the case of discovering users' research interests in Mendeley, a research platform for scholarly article management and discovery. Prior work in this area has considered approaches such as matrix factorisation and text-based topic modelling for inferring topics of interest in recommendation scenarios. These approaches present several problems, such as little or no interpretability of the inferred topics and difficulty handling similarities in vocabulary in different research disciplines. We present an effective solution for extracting coherent and interpretable research topics that leverages the reference management data in Mendeley in a three-step approach: 1) a topic model based on the interactions between users and articles rather than article content, 2) keyword extraction to label the topics using article titles and author-declared keywords, and 3) identifying the research interests of users based on the articles that they have added to their libraries. An evaluation comprising a research interest prediction task and a paper recommendation task shows the validity of our proposal in different research disciplines (clearly outperforming a text-based latent topic model) and provides further insights regarding the effect of the number of latent topics in the model and the trade-off between recency and quantity of the users' libraries.

9:40

Scalable Algorithm for Probabilistic Overlapping Community Detection (presentation slides)

Kento Nozawa and Kei Wakabayashi

In the data mining field, community detection, which decomposes a graph into multiple subgraphs, is one of the major techniques to analyze graph data. In recent years, the scalability of the community detection algorithm has been a crucial issue because of the growing size of real-world networks such as the co-author network and web graph. In this paper, we propose a scalable overlapping community detection method by using the stochastic variational Bayesian training of latent Dirichlet allocation (LDA) models, which predicts sets of neighbor nodes with a community mixture distribution. In the experiment, we show that the proposed method is much faster than previous methods and is capable of detecting communities even in a huge network that contains 60 million nodes and 1.8 billion edges. Furthermore, we compared different mini-batch sizes and the number of iterations in stochastic variational Bayesian inference to determine an empirical trade-off between efficiency and quality of overlapping community detection.

9:55

Building recommender systems for scholarly information

Maya Hristakeva, Daniel Kershaw, Marco Rossetti, Petr Knoth, Benjamin Pettit, Saúl Vargas and Kris Jack

The depth and breadth of research now being published is overwhelming for an individual researcher to keep track of, let alone consume. Recommender systems have been developed to make it easier for researchers to discover relevant content. However, these have predominantly taken the form of item-to-item recommendations using citation network features or text similarity features. This paper details how the Mendeley Suggest recommender system has been designed and developed. We show how implicit user feedback (based on activity data from the reference manager) and collaborative filtering (CF) are used to generate the recommendations for Mendeley Suggest. Because collaborative filtering suffers from the cold start problem (the inability to serve recommendations to new users), we developed additional recommendation methods based on user-defined attributes, such as discipline and research interests. Our offline evaluation shows that, where possible, recommendations based on collaborative filtering perform best, followed by recommendations based on recent activity. However, for cold users (for whom collaborative filtering was not possible), recommendations based on discipline performed best. Additionally, when we segmented users by career stage, we found that among senior academics, content-based recommendations from recent activity had comparable performance to collaborative filtering. This justifies our approach of developing a variety of recommendation methods in order to serve a range of users across the academic spectrum.

10:10

Linking Mathematical Expressions to Wikipedia

Giovanni Yoko Kristianto and Akiko Aizawa

This paper addresses the challenge of determining the identity of mathematical expressions in documents by linking these expressions to their corresponding Wikipedia articles. Math expressions are frequently used to describe important concepts in scientific documents; however, particularly in the case of famous or well-established equations, they are often minimally explained within the documents themselves. Linking to Wikipedia allows readers to obtain additional explanation of these math expressions. This paper proposes a learning-based approach to solve this challenge using common features, such as math and text similarities, as well as the importance of the math expression within the document. Further, we develop a dataset that allowed us to train and test our proposed approach. Experimental results show that our learning-based approach achieves a precision of 83.40%, compared with 6.22% for the baseline method (a straightforward application of Math IR).

10:25

Break

10:55

Invited talk

Mads Rydahl, UNSILO

Mads will present the UNSILO Concept Extraction pipeline, describe some of the recent products UNSILO has built for the scientific publishing industry, and outline the future direction of Scientific Text Intelligence.

11:10

Predicting MeSH Beyond MEDLINE (presentation slides)

Adam Kehoe, Vetle Torvik, Matthew Ross and Neil Smalheiser

Medical subject headings (MeSH) are a flexible and useful tool for describing biomedical concepts. Here, we present MeSHier, a tool for assigning MeSH terms to biomedical documents based on abstract similarity and references to MEDLINE records. When applied to PubMedCentral papers, NIH grants, and USPTO patents we find that these two sources of information produce largely disjoint sets of related MEDLINE records, albeit with some overlap in MeSH. When combined they provide an enriched topical annotation that would not have been possible with either alone. MeSHier is available as a demo tool that can take as input IDs of PubMed papers, USPTO patents, and NIH grants: http://abel.lis.illinois.edu/cgi-bin/meshier/search.py

11:25

ScholarBase: Towards a Cross-Domain Knowledgebase for Linked Scholarly Data (presentation slides)

Mahmoud Elbattah

In recent years, Linked Data has gained significant momentum through various initiatives embraced by both academia and industry. In this paper, we introduce ScholarBase, a research project in progress that aims to serve as a Linked Data repository for cross-domain scholarly data. ScholarBase can be conceived as a knowledgebase that weaves links among scholars, institutions, research areas, publications, and geographical locations in a Linked Data fashion. Initially, the primary source of data is Google Scholar, specifically the profile pages of scholars. Subsequently, the collected dataset is transformed into a semantic-based format (i.e. RDF model). The semantified dataset will enable a machine-readable expression of entity relationships, and can in turn be linked to external knowledge bases (e.g. DBpedia). Throughout the paper, we discuss the architecture and implementation of the project. ScholarBase can play an important role in the Linked Data web by opening extended opportunities for interesting applications and exploration of scholarly data.

11:40

Describing Data Processing Pipelines in Scientific Publications for Big Data Injection

Sepideh Mesbah, Alessandro Bozzon, Christoph Lofi and Geert-Jan Houben

The rise of Big Data analytics has been a disruptive game changer for many application domains, allowing the integration into domain-specific applications and systems of insights and knowledge extracted from external big data sets. The effective "injection" of external Big Data demands an understanding of the properties of available data sets, and expertise on the available and most suitable methods for data collection, enrichment and analysis. A prominent knowledge source is scientific literature, where data processing pipelines are described, discussed, and evaluated. Such knowledge is however not readily accessible, due to its distributed and unstructured nature. In this paper, we propose a novel ontology aimed at modeling properties of data processing pipelines, and their related artifacts, as described in scientific publications. The ontology is the result of a requirement analysis that involved experts from both academia and industry. We showcase the effectiveness of our ontology by manually applying it to a collection of Big Data related publications, thus paving the way for future work on more informed Big Data injection workflows.

11:55

Citations and readership are poor indicators of research excellence: Introducing TrueID, a new dataset for validating research evaluation metrics (presentation slides)

Drahomira Herrmannova, Robert Patton, Petr Knoth and Christopher Stahl

In this paper we show that citation counts and Mendeley readership are poor indicators of research excellence. Our experimental design builds on the assumption that a good evaluation metric should be able to distinguish publications that have changed a research field from those that have not. The experiment has been conducted on a new dataset for bibliometric research which we call True Impact Dataset (TrueID). TrueID is a collection of research publications of two types -- research papers which are considered seminal work in their area and papers which provide a survey (a literature review) of a research area. The dataset also contains related metadata, which include DOIs, titles, authors and abstracts. We describe how the dataset was built and provide overview statistics of the dataset. We propose to use the dataset for validating research evaluation metrics. By using this data, we show that widely used research metrics only poorly distinguish excellent research.

12:10

Invited talk

Kiera McNeice, FutureTDM

FutureTDM has spent the last 18 months engaging with stakeholders in all aspects of the text and data mining value chain, developing an understanding of the challenges and opportunities these technologies offer to researchers and industry across Europe. In the final six months of the project we will be developing concrete guidelines for stakeholders in key areas, to support greater adoption of TDM in Europe. In this talk we will introduce some of the guidelines we propose to deliver, addressing issues such as minimising legal risk in TDM, and encouraging universities to support TDM across a broad range of disciplines.

12:25

Closing

12:30

Lunch