RMLDM - Research project

Representation and Machine Learning with Missing Data

PhD student: Richard SERRANO, ED SIS 488 (Science, Engineering, Health)

ABSTRACT

Attributed graphs are a powerful data structure for many image analysis tasks. After segmentation, the nodes of the graph represent the regions of the image and the links represent their relationships. The properties characterizing the regions (colour, texture, size, etc.) define the node attributes, and the types of relationships between regions (connectivity, distance, …) define the link labels. However, because of acquisition defects, many of these feature values can be missing or noisy, which degrades the performance of graph-based mining and learning algorithms and of the whole image analysis process (Little, 2019). Deep learning methods, notably the family of Graph Neural Networks, have proven particularly effective at handling graph data and learning node representations, but they rely strongly on the availability and quality of node and link features. Within the Graph Neural Network literature, only a few works have focused on learning embeddings of graphs with missing attributes (Chen, 2020; Yoon, 2018).
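To make the data structure concrete, here is a minimal sketch of an attributed region graph built from a segmentation, with missing attribute values recorded explicitly. The 4x4 label map, the grey-level image, the choice of attributes (mean intensity and size) and all variable names are hypothetical and chosen only for illustration; they are not taken from the project.

```python
import numpy as np

# Hypothetical 4x4 segmentation: each integer is a region label.
labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 2],
])

# Hypothetical grey-level image; NaN marks pixels lost at acquisition.
image = np.array([
    [0.1, 0.1, 0.8, 0.8],
    [0.1, 0.1, 0.8, 0.8],
    [np.nan, np.nan, 0.8, 0.8],
    [np.nan, np.nan, np.nan, np.nan],
])

n_regions = int(labels.max()) + 1

# Links: two regions are connected if they share a 4-connected pixel border.
adjacency = np.zeros((n_regions, n_regions), dtype=bool)
neighbour_pairs = [
    (labels[:, :-1], labels[:, 1:]),   # horizontal neighbours
    (labels[:-1, :], labels[1:, :]),   # vertical neighbours
]
for a, b in neighbour_pairs:
    border = a != b
    adjacency[a[border], b[border]] = True
    adjacency[b[border], a[border]] = True

# Node attributes: column 0 = mean intensity, column 1 = size in pixels.
attributes = np.full((n_regions, 2), np.nan)
for region in range(n_regions):
    pixels = image[labels == region]
    attributes[region, 1] = pixels.size
    if not np.isnan(pixels).all():
        attributes[region, 0] = np.nanmean(pixels)

# Mask of observed entries; False marks a missing attribute value.
observed = ~np.isnan(attributes)
```

In this toy case region 2 has no valid pixel, so its mean intensity stays missing while its size is still observed. An imputation method would have to fill that entry before a learning algorithm that requires complete features could be applied.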

In this thesis, our objectives are twofold. First, we will study the impact of data incompleteness and bias on the algorithms. Second, we will develop methods for filling missing entries with plausible values, so that downstream machine-learning methods perform better on the completed data. In particular, optimal transport (OT) (Muzellec, 2020) has recently been shown, for tabular data, to be more efficient than classical imputation methods based on low-rank assumptions (Hastie et al., 2015), iterative random forests (Stekhoven & Bühlmann, 2011) or variational autoencoders (Mattei & Frellsen, 2019; Ivanov et al., 2019). We propose to adapt this approach to graph data using the Fused Gromov-Wasserstein metric (Vayer, 2019).
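As an illustration of the quantity involved, the sketch below evaluates a Fused Gromov-Wasserstein objective for a given coupling between two attributed graphs. It assumes squared Euclidean distances between node attributes and squared differences between structure matrices; the function name `fgw_cost`, the trade-off parameter `alpha` and the toy graphs are hypothetical. It only evaluates the objective for a fixed coupling and does not search for the optimal one, nor does it implement the imputation method proposed in the thesis.

```python
import numpy as np

def fgw_cost(X1, C1, X2, C2, T, alpha=0.5):
    """Fused Gromov-Wasserstein objective for a fixed coupling T.

    X1, X2: node attribute matrices, shapes (n1, d) and (n2, d).
    C1, C2: structure matrices (e.g. adjacency), shapes (n1, n1) and (n2, n2).
    T:      coupling matrix of shape (n1, n2) whose entries sum to 1.
    alpha:  trade-off between the feature term (0) and the structure term (1).
    """
    # Feature term: squared Euclidean distance between node attributes.
    M = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    feature_term = (M * T).sum()

    # Structure term: sum over i, j, k, l of
    # (C1[i, k] - C2[j, l])^2 * T[i, j] * T[k, l].
    L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2
    structure_term = np.einsum("ijkl,ij,kl->", L, T, T)

    return (1 - alpha) * feature_term + alpha * structure_term

# Toy example: a 3-node path graph with one scalar attribute per node.
X = np.array([[0.0], [1.0], [2.0]])
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Matching each node to itself gives a zero cost.
T_identity = np.eye(3) / 3
cost_identity = fgw_cost(X, C, X, C, T_identity)

# Spreading mass uniformly ignores both attributes and structure.
T_uniform = np.full((3, 3), 1.0 / 9)
cost_uniform = fgw_cost(X, C, X, C, T_uniform)

# A relabelled copy of the graph, matched with the corresponding permutation.
perm = np.array([2, 0, 1])
X_perm, C_perm = X[perm], C[np.ix_(perm, perm)]
T_perm = np.zeros((3, 3))
T_perm[perm, np.arange(3)] = 1.0 / 3
cost_perm = fgw_cost(X, C, X_perm, C_perm, T_perm)
```

The identity and permutation couplings both give a zero cost, which reflects that the objective compares attributes and structure jointly and does not depend on how nodes are numbered. One possible way to use such a cost for imputation, stated here only as an assumption, is to treat the missing attribute entries as variables and adjust them to reduce the cost between graphs or subgraphs.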

ABOUT the RMLDM project - Thesis certified