Domains of Applications#

Jazz Musicians Network. Source: DataCamp

Social Networks:#

SNAP Datasets: The Stanford Network Analysis Project (SNAP) provides a collection of large network datasets, including friendship networks (e.g., Facebook, Pokec) and citation networks (e.g., cit-HepPh, cit-Patents). These are well suited to recommendation systems, community detection, and information diffusion tasks.

Knowledge Graphs:#

WordNet: A lexical database for English that can be represented as a knowledge graph. It’s useful for tasks like knowledge graph completion and relation extraction.

Freebase: A large collaborative knowledge base containing general facts about the world. Although the service was shut down in 2016, its data dumps and derived benchmarks such as FB15k are still widely used in research.

Chemistry and Drug Discovery:#

PubChem: A public database of chemical molecules and their properties. It’s widely used for molecular property prediction and drug discovery research.

ChEMBL: A manually curated database of bioactive molecules with drug-like properties. It’s valuable for drug-target interaction prediction and other tasks in drug discovery.

Traffic Prediction:#

PeMS: The California Department of Transportation (Caltrans) Performance Measurement System provides traffic data from highway sensors, which can be used for traffic flow forecasting.

Cross-References and Citations:#

Cora Dataset: Commonly used for node classification tasks in GNNs. It consists of 2,708 scientific publications connected by citation links, each assigned to one of seven topics.

Reddit Datasets: Various datasets based on Reddit interactions, which can be used for graph-based machine learning tasks.

Open Graph Benchmark (OGB): A collection of benchmark datasets for graph machine learning tasks, including node classification, link prediction, and graph classification.

Additionally, many research papers in the GNN field provide access to their datasets for reproducibility and further research. You can find these in the supplementary materials or appendices of the papers.

To find these datasets, you can search for them by name on platforms like Google Dataset Search or Kaggle. You can also look for repositories on GitHub where researchers share their datasets.

Citations Application#

Planetoid is a collection of three citation networks (Cora, CiteSeer, PubMed) commonly used for semi-supervised node classification. In Cora, the network used here, each of the 2,708 nodes is a document represented by a 1,433-dimensional bag-of-words feature vector, and edges represent citation links. Only a small subset of nodes is labeled for training; the objective is to predict the class label (one of seven topics) of the remaining documents.
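To make the feature representation concrete, here is a minimal sketch of a binary bag-of-words vector followed by the row normalization that the NormalizeFeatures transform applies (each row is scaled to sum to one). The five-word vocabulary, the token list, and both helper functions are hypothetical and exist only for illustration; Cora's real vocabulary has 1,433 entries.

```python
# Hypothetical five-word vocabulary; Cora's real vocabulary has 1,433 entries.
vocabulary = ["graph", "neural", "network", "protein", "market"]


def bag_of_words(tokens, vocab):
    """Return a binary vector marking which vocabulary words occur in tokens."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]


def normalize_row(row):
    """Scale a feature row so its entries sum to 1 (all-zero rows stay unchanged)."""
    total = sum(row)
    return [value / total for value in row] if total > 0 else list(row)


doc_tokens = ["graph", "neural", "network", "graph"]
x = bag_of_words(doc_tokens, vocabulary)
x_norm = normalize_row(x)

print(x)       # [1.0, 1.0, 1.0, 0.0, 0.0]
print(x_norm)  # three equal non-zero entries that sum to 1
```

Repeated words do not change the vector because the representation only records presence, not counts.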

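from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

# Download Cora on first use and row-normalize each node's feature vector.
dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())

print(f'Dataset: {dataset}:')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.
print(data)
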
Dataset: Cora():
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
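
The edge_index entry in the output above stores the graph in coordinate (COO) form: a [2, num_edges] array whose first row holds source nodes and whose second row holds target nodes. Cora is treated as undirected, so each citation link appears once per direction. The sketch below reproduces that bookkeeping in plain Python on a hypothetical four-node graph; the edge list is made up for illustration.

```python
# Hypothetical undirected toy graph with 4 nodes and 3 edges.
undirected_edges = [(0, 1), (1, 2), (2, 3)]

# Store every undirected edge in both directions, as in Cora's edge_index.
sources, targets = [], []
for u, v in undirected_edges:
    sources += [u, v]
    targets += [v, u]

edge_index = [sources, targets]  # shape [2, num_directed_edges]
num_directed = len(edge_index[0])

print(edge_index)    # [[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]]
print(num_directed)  # 6

# Same arithmetic for Cora: 10,556 directed entries are 5,278 undirected links.
print(10556 // 2)    # 5278
```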
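
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()  # Must run before submodules are registered.
        self.conv1 = GCNConv(dataset.num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        # Dropout is applied only while the model is in training mode.
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)

GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)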
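
For intuition about what each GCNConv layer computes, the following NumPy sketch applies the standard GCN propagation rule, D^-1/2 (A + I) D^-1/2 X W, to a hypothetical three-node path graph. It is a simplified illustration, not the library's implementation: it uses a dense adjacency matrix, omits the bias term, and the function name gcn_layer and all values are assumptions chosen for the example.

```python
import numpy as np


def gcn_layer(adjacency, features, weight):
    """One GCN propagation step: D^-1/2 (A + I) D^-1/2 X W (no bias)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)                         # degrees incl. self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: entry (i, j) is scaled by 1 / sqrt(d_i * d_j).
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return a_norm @ features @ weight


# Hypothetical path graph 0 - 1 - 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
features = np.eye(3)       # one-hot node features
weight = np.ones((3, 2))   # stand-in for a learned weight matrix

out = gcn_layer(adjacency, features, weight)
print(out.shape)  # (3, 2)
print(out)
```

Note that the neighbor weights depend only on node degrees, so they are fixed by the graph structure rather than learned.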
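from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize(h, color):
    # Project the node embeddings to 2D with t-SNE for plotting.
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())

    plt.figure(figsize=(10, 10))
    plt.xticks([])
    plt.yticks([])
    plt.scatter(z[:, 0], z[:, 1], s=80, c=color, cmap="Set2")
    plt.show()

model.eval()  # Disable dropout before computing embeddings.
out = model(data.x, data.edge_index)
visualize(out, color=data.y)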
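model = GCN(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients from the previous step.
    out = model(data.x, data.edge_index)  # Forward pass over the full graph.
    # Compute the loss on the labeled training nodes only.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()  # Backpropagate.
    optimizer.step()  # Update the parameters.
    return loss

def test():
    model.eval()  # Disable dropout for evaluation.
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Predicted class = highest logit.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())
    return test_acc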

for epoch in range(1, 101):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')
Test Accuracy: 0.8220

model.eval()  # Disable dropout before computing embeddings.
out = model(data.x, data.edge_index)
visualize(out, color=data.y)
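
The semi-supervised setup hinges on computing the loss over labeled nodes only, which is what indexing with data.train_mask does in train() above. The plain-Python sketch below spells out that masked cross-entropy on a hypothetical three-node example; the function name, logits, labels, and mask are all made up for illustration.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the nodes whose mask entry is True."""
    losses = []
    for row, label, keep in zip(logits, labels, mask):
        if not keep:
            continue  # Unlabeled nodes contribute nothing to the loss.
        shift = max(row)  # Subtract the max for numerical stability.
        log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Hypothetical logits for 3 nodes and 2 classes; node 1 is not a training node.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
labels = [0, 1, 1]
train_mask = [True, False, True]

loss = masked_cross_entropy(logits, labels, train_mask)
print(round(loss, 4))  # 0.1269
```

Although node 1 is excluded from the loss, in a GNN its features still influence the predictions of its neighbors through message passing.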

Example with Graph Attention Networks#

Graph Attention Networks (GATs)#

GATs are a type of Graph Neural Network (GNN) that leverage the concept of attention mechanisms, borrowed from the field of natural language processing (NLP), to enhance the way information is aggregated from neighboring nodes in a graph.

Key Advantages:#

Attention Mechanism: GATs don’t just treat all neighbors equally. They learn to assign different importance (attention) to different nodes in the neighborhood based on their features and the structure of the graph. This allows the network to focus on the most relevant nodes for a given task.

Improved Performance: Standard graph convolutional networks (GCNs) weight neighbors by a fixed function of node degree. Learning the weights instead can improve performance on various tasks, especially when the graph structure is complex or some neighbors are much more informative than others.

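Flexibility: Because attention scores are computed per edge and normalized over each node’s own neighbors, GATs handle nodes with any number of neighbors and do not require a fixed neighborhood size, unlike some other GNN architectures.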

GATConv (Graph Attentional Convolution Layer)#

GATConv is the fundamental building block of a GAT. It’s the layer that performs the graph attentional convolution operation, where the attention mechanism is applied. Here’s how it works:

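Node Feature Transformation: Each node’s feature vector is multiplied by a shared, learnable weight matrix, which projects it into a new feature space. The new space is not necessarily higher-dimensional: in the example below, 1,433 input features are mapped to 8 features per head.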

Attention Coefficients Calculation: Pairs of nodes (a node and its neighbors) are considered. The attention mechanism computes a score (attention coefficient) for each pair, indicating how much attention the node should pay to its neighbor. This score is usually based on the transformed features of both nodes and can optionally include structural information (e.g., edge features).

Normalization: The attention coefficients are normalized (often using a softmax function) so that they sum to 1 for each node across all its neighbors.

Weighted Aggregation: The transformed features of the neighbors are aggregated using the normalized attention coefficients as weights. This means that the final node representation is a weighted sum of its neighbors’ features, where the weights are learned through the attention mechanism.

Optional Nonlinearity: An activation function (e.g., ReLU) may be applied to the aggregated representation to introduce non-linearity.
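
The steps above can be sketched for a single attention head in NumPy. This is a minimal illustration under stated assumptions, not GATConv's actual implementation: the scoring function follows the original GAT formulation, LeakyReLU(a · [z_i || z_j]) with slope 0.2, the node is included among its own neighbors (a self-loop), and the function names and random toy values are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_node_update(h, neighbors, W, a, i):
    """Attention-weighted update of node i over the given neighbor indices."""
    z = h @ W                                          # 1. linear transformation
    scores = np.array([leaky_relu(a @ np.concatenate([z[i], z[j]]))
                       for j in neighbors])            # 2. raw attention scores
    scores = scores - scores.max()                     # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()      # 3. softmax normalization
    return alpha, alpha @ z[neighbors]                 # 4. weighted aggregation


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5))   # 4 nodes with 5 input features (toy values)
W = rng.normal(size=(5, 3))   # shared weight matrix
a = rng.normal(size=(6,))     # attention vector over [z_i || z_j]
neighbors = [0, 1, 2]         # node 0 attends to itself and to nodes 1 and 2

alpha, h_new = gat_node_update(h, neighbors, W, a, i=0)
print(alpha)         # one non-negative weight per neighbor, summing to 1
print(h_new.shape)   # (3,)
```

An activation such as ELU would then be applied to h_new, corresponding to the optional nonlinearity in the last step.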

Multi-Head Attention:#

GATs often employ multi-head attention, which means that the attentional convolution is performed multiple times independently, each with its own set of parameters. The results of these multiple heads are then combined, usually by concatenation or averaging, to produce the final node representation. This improves the model’s capacity to capture different aspects of the node features and graph structure.
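
The two ways of combining heads differ in the shape of the result, which matters when sizing the next layer. The NumPy sketch below uses random arrays as stand-ins for per-head outputs (hypothetical values, 8 heads with 8 features each) to show the shapes produced by concatenation and by averaging.

```python
import numpy as np

num_nodes, heads, out_channels = 5, 8, 8

# Stand-in for the per-head outputs of an attention layer:
# one [num_nodes, out_channels] block per head (random toy values).
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(heads, num_nodes, out_channels))

# Concatenation: the feature width grows to heads * out_channels.
concatenated = np.concatenate(list(head_outputs), axis=1)

# Averaging: the feature width stays at out_channels.
averaged = head_outputs.mean(axis=0)

print(concatenated.shape)  # (5, 64)
print(averaged.shape)      # (5, 8)
```

Concatenation is the usual choice for hidden layers, while averaging suits an output layer whose width must equal the number of classes.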

Applications Include:#

Node Classification: Predicting the type or label of a node in a graph (e.g., social network analysis, recommender systems).

Link Prediction: Predicting the existence of a connection between two nodes (e.g., knowledge graph completion).

Graph Classification: Predicting the overall type or label of an entire graph (e.g., chemical compound analysis).

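import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, hidden_channels, heads):
        super().__init__()  # Must run before submodules are registered.
        # Hidden layer: head outputs are concatenated (heads * hidden_channels features).
        self.conv1 = GATConv(dataset.num_features, hidden_channels, heads=heads)
        # Output layer: heads are averaged (concat=False) so each node gets num_classes logits.
        self.conv2 = GATConv(heads * hidden_channels, dataset.num_classes, heads=heads, concat=False)

    def forward(self, x, edge_index):
        # Dropout is applied only while the model is in training mode.
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GAT(hidden_channels=8, heads=8)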

optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

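def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients from the previous step.
    out = model(data.x, data.edge_index)  # Forward pass over the full graph.
    # Compute the loss on the labeled training nodes only.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()  # Backpropagate.
    optimizer.step()  # Update the parameters.
    return loss

def test(mask):
    model.eval()  # Disable dropout for evaluation.
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Predicted class = highest logit.
    correct = pred[mask] == data.y[mask]
    acc = int(correct.sum()) / int(mask.sum())
    return acc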

val_acc_all = []
test_acc_all = []

for epoch in range(1, 55):
    loss = train()
    val_acc = test(data.val_mask)
    test_acc = test(data.test_mask)
    # Record the accuracies so they can be plotted after training.
    val_acc_all.append(val_acc)
    test_acc_all.append(test_acc)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')

print(model)

GAT(
  (conv1): GATConv(1433, 8, heads=8)
  (conv2): GATConv(64, 7, heads=8)
)

import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(val_acc_all) + 1), val_acc_all, label='Validation accuracy', c='blue')
plt.plot(np.arange(1, len(test_acc_all) + 1), test_acc_all, label='Testing accuracy', c='red')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right', fontsize='x-large')
plt.show()
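
Tracking validation and test accuracy side by side makes it possible to report the test accuracy at the epoch where validation accuracy peaks, rather than picking the best test value directly. The helper below is a hypothetical sketch of that selection rule in plain Python; the function name and the five-epoch accuracy curves are invented for illustration and are not results from the model above.

```python
def select_by_validation(val_history, test_history):
    """Return (epoch, val_acc, test_acc) at the best validation epoch.

    Epochs are 1-indexed; ties are resolved in favor of the earliest epoch.
    """
    best_idx = max(range(len(val_history)), key=lambda i: val_history[i])
    return best_idx + 1, val_history[best_idx], test_history[best_idx]


# Hypothetical accuracy curves for five epochs.
val_history = [0.52, 0.70, 0.78, 0.76, 0.77]
test_history = [0.50, 0.69, 0.80, 0.81, 0.79]

best_epoch, best_val, test_at_best = select_by_validation(val_history, test_history)
print(best_epoch, best_val, test_at_best)  # 3 0.78 0.8
```

In this toy example the highest test value (0.81 at epoch 4) is deliberately not the one reported, since choosing it would use the test set for model selection.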

model.eval()  # Disable dropout before computing embeddings.
out = model(data.x, data.edge_index)
visualize(out, color=data.y)