VerifiedarXiv:1609.0290728 min
Graph neural networks · Architecture

Semi-Supervised Classification with Graph Convolutional Networks

Average each node with its neighbors, and a few labels classify the whole graph.

Most nodes in a graph carry no label, but every node has neighbors. A graph convolutional network passes features along the edges, so a handful of labels reaches everyone.

Explaining the paperSemi-Supervised Classification with Graph Convolutional NetworksKipf, Welling · University of Amsterdam · ICLR 2017 · arXiv:1609.02907

You are handed a citation network: thousands of papers, an edge wherever one cites another, and a subject label for only twenty papers per subject. Label every other paper.

This is a common shape of problem, and the scarce thing is labels. Pulling the subject of one paper from its text alone is hard and ignores a strong signal sitting right there: papers cite papers on the same subject. The graph of citations is itself a clue, often a better one than any single document. The question this paper answers is how to feed a neural network both at once, the features on each node and the wiring between nodes, and train it when almost none of the nodes are labeled.

The method is one short rule, applied a couple of times: replace every node's feature vector with a normalized blend of its own and its neighbors', multiply by a learned weight matrix, and pass it through a nonlinearity. Two rounds of that, a softmax on top, and the model lands within a few points of the best methods of its day while running an order of magnitude faster. The work is in seeing why that rule is the right one, and where it comes from. Kipf and Welling derive it from spectral graph theory and then show it is also plain neighbor-averaging in disguise. The two readings meet in the middle.

A handful of ideas carry the paper: the graph Laplacian and its "frequencies", what convolution even means on a graph, the one approximation that collapses an expensive spectral filter into a single sparse matrix multiply, and the renormalization trick that keeps it stable. None is heavy on its own.

A few labels, a whole graph

Fix the vocabulary first, because it is load-bearing. A graph G=(V,E)\mathcal{G}=(\mathcal{V},\mathcal{E}) has NN nodes. Each node carries a feature vector; stack them into a matrix XRN×CX\in\mathbb{R}^{N\times C} with CC features per node (for a paper, a bag-of-words vector of its text). The wiring is the adjacency matrix ARN×NA\in\mathbb{R}^{N\times N}, with Aij=1A_{ij}=1 when nodes ii and jj share an edge. A few nodes come with a class label YY; most do not. The job is to predict the labels of the unlabeled nodes.

One detail of the setting shapes everything that follows. The entire graph is visible during training: all the features, all the edges, and the unlabeled nodes themselves. Only the labels for most nodes are missing. This is called transductive learning, which means you are filling in blanks within one fixed, fully-visible graph, not training a model to handle graphs it has never seen. (The follow-up GraphSAGE made the same idea inductive, able to embed brand-new nodes, by learning the aggregation as a reusable function rather than relying on the fixed graph. The original model here does not do that.) Because every node is present, a prediction for one node can lean on its unlabeled neighbors, and that is exactly the leverage a tiny label set needs.

The benchmarks are citation networks. Cora has 2,708 papers and 5,429 citation links across 7 subjects, with a 1,433-word vocabulary; training uses 20 labels per class, 140 in all, against a 1,000-node test set. Citeseer and Pubmed are the same shape at different sizes, and NELL is a knowledge graph pushed to the extreme of one label per class. On Cora that is 140 labeled nodes out of 2,708, a label rate of about 5%. Everything below is built to make those 140 labels go a very long way.

The old fix: bleed labels along edges

Before this paper, the standard way to use the graph was to add a term to the loss that rewards smoothness: nearby nodes should get similar predictions. Write it as a penalty on disagreement across edges:

L=L0+λLreg,Lreg=i,jAijf(Xi)f(Xj)2=f(X)Δf(X)\mathcal{L} = \mathcal{L}_0 + \lambda\,\mathcal{L}_{\text{reg}}, \qquad \mathcal{L}_{\text{reg}} = \sum_{i,j} A_{ij}\,\big\lVert f(X_i) - f(X_j) \big\rVert^2 = f(X)^\top \Delta\, f(X)
(1)

Here L0\mathcal{L}_0 is the ordinary supervised loss on the labeled nodes, and the second term sums the squared difference of predictions over every edge, weighted by AijA_{ij}. If two connected nodes get very different outputs, the penalty grows, so training is pushed toward predictions that vary slowly across the graph. That whole sum collapses into the compact form on the right, where Δ=DA\Delta = D - A is the graph Laplacian: the degree matrix DD (each DiiD_{ii} the number of edges at node ii) minus the adjacency. This is the unnormalized Laplacian, and it is the operator that, at each node, measures how far that node sits from the average of its neighbors. Its quadratic form fΔff^\top\Delta f is precisely the total edge-disagreement, which is why minimizing it smooths the output.2

The trouble is the assumption baked into (1): an edge means similarity, so connected nodes shouldagree. Often an edge means something looser. A paper can cite work it disagrees with, or cite across fields. Forcing agreement along every edge throws away the chance to learn what an edge actually signals. Kipf and Welling drop the regularizer entirely and make a different bet: instead of adding a smoothness penalty beside the model, put the graph inside the model. Let the network be a function f(X,A)f(X,A) of both the features and the adjacency, train only the supervised loss L0\mathcal{L}_0 on the labeled nodes, and let the architecture decide how much neighbors should matter. With the graph wired into ff, the gradient from a single labeled node flows back through its neighbors and shapes their representations too, so the supervision spreads on its own without anyone writing down a penalty for it.

One layer: blend a node with its neighbors

One layer maps every node's features to the next layer like this:

H(l+1)=σ ⁣(D~1/2A~D~1/2H(l)W(l))H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)} \right)
(2)

Read it from the inside out. H(l)RN×DH^{(l)}\in\mathbb{R}^{N\times D} holds the current feature vector for every node, one per row, with H(0)=XH^{(0)}=X. The matrix A^D~1/2A~D~1/2\hat{A}\equiv\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is the graph operator: A~=A+IN\tilde{A}=A+I_N is the adjacency with a self-loop added to every node, and D~\tilde{D} is the degree matrix of A~\tilde{A}. Multiplying A^H(l)\hat{A}H^{(l)} replaces each node's row with a weighted combination of its neighbors' rows and its own. Then W(l)W^{(l)} mixes the feature channels (the learned part), and σ\sigma is a nonlinearity, the ReLU\mathrm{ReLU}. One layer is: gather from neighbors, transform, squash.

The exact weight on each contribution matters, because the obvious guess is wrong. The entry of A^\hat{A} linking nodes ii and jj is 1/d~id~j1/\sqrt{\tilde{d}_i\,\tilde{d}_j}, where d~i\tilde{d}_i is node ii's degree counting its self-loop. So a node's new value is a sum over its closed neighborhood, each neighbor scaled by one over the square root of the two degrees' product. This is not a plain average: the weights coming into a node do not sum to one. The plain row-averaging operator would be D~1A~\tilde{D}^{-1}\tilde{A}, the random walk, and the paper deliberately does not use it. The symmetric form D~1/2A~D~1/2\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is chosen because it stays symmetric, which keeps its eigenvalues real and its behavior stable when you stack layers. The square-root split is what makes a high-degree hub count for less from both ends at once.

Tap a node below and watch what its layer computes. Its edges light up, every neighbor sends in its current value, and the readout shows the weighted sum that becomes the node's new feature. The bridge node, picked by default, ends up blended between the two groups it connects, the mechanism in miniature.

Figure 1 · one layer of aggregation
Each node is colored by a scalar feature (amber positive, teal negative). Tap a node: its new value is the symmetric-normalized sum jhj/d~id~j\sum_j h_j/\sqrt{\tilde{d}_i\tilde{d}_j} over itself and its neighbors. The weights do not sum to one, so this is a degree-weighted blend, not a plain average.

That covers the entire layer. Everything from here is two questions about it. Why call this a convolution, and where do those particular square-root-of-degree weights come from? They are not a guess. They fall out of asking what a convolution should even mean on a graph, and then approximating it twice.

Where the rule comes from: graph frequencies

On an image, convolution is well defined because the pixel grid has a fixed geometry: a 3×3 kernel always sees the same eight neighbors in the same arrangement. A graph has no such grid. Nodes have different numbers of neighbors and no canonical ordering, so "slide a kernel across it" has no obvious meaning. The escape route is the one signal processing already uses: define convolution in the frequency domain.

For that you need a notion of frequency on a graph, and the Laplacian supplies it. Take the normalizedLaplacian (a different object from the Δ=DA\Delta=D-A in the regularizer above, and unfortunately the paper reuses the letter LL):

L=IND1/2AD1/2=UΛUL = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^\top

LL is symmetric, so it has an orthonormal set of eigenvectors, the columns of UU, with eigenvalues Λ\Lambda. Those eigenvectors are the graph's natural vibration patterns, exactly like the pure tones of a Fourier basis. A low-eigenvalue eigenvector changes slowly from node to node (a smooth, low "frequency"); a high-eigenvalue one flips sign across nearly every edge (a jagged, high frequency). Projecting a node signal onto these eigenvectors, UxU^\top x, is the graph Fourier transform, and a spectral filter chooses how much of each frequency to keep:

gθx=Ugθ(Λ)Uxg_\theta \star x = U g_\theta(\Lambda) U^\top x
(3)

The figure makes the frequencies concrete. Step through the eigenvectors and watch the graph recolor. The zeroth is the smoothest pattern, the same sign everywhere. The very next one, the Fiedler vector, splits the two communities into opposite colors at a still-low frequency, which is the first hint of why this matters: the community structure lives in the low frequencies. The high modes just oscillate node to node and carry little but noise.

Figure 2 · graph frequencies and a low-pass filter
k = 1
The toy graph colored by one Laplacian eigenvector (a frequency), with the spectrum below. Eigenvector 1, the Fiedler vector, separates the two groups at low λ\lambda. The teal curve is the filter response g(λ)=1λ/2g(\lambda)=1-\lambda/2: it keeps low frequencies and suppresses high ones. Step the frequency and read its gain.

So far this is exact and also unusable. Forming UU means an eigendecomposition of an N×NN\times N matrix, which costs O(N3)O(N^3) once and an O(N2)O(N^2) dense multiply every time you apply the filter. There is no fast graph version of the FFT to rescue you; UU is a dense matrix on a general graph. And a filter written as a free function of every eigenvalue is not localized: it can mix a node with others arbitrarily far across the graph.

The fix, due to Hammond and to Defferrard's ChebNet, is to insist the filter be a low-degree polynomial of the eigenvalues, expanded in Chebyshev polynomials TkT_k:

gθxk=0KθkTk(L~)xg_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{L})\, x
(5)

Two good things happen at once. A polynomial of the eigenvalues is the same polynomial of the matrix, since (UΛU)k=UΛkU(U\Lambda U^\top)^k = U\Lambda^k U^\top, so you never form UU at all: you just multiply by the sparse Laplacian KK times. And a degree-KK polynomial in LL is exactly KK-hop localized, because (Lk)ij=0(L^k)_{ij}=0 whenever nodes i,ji,j are more than kk steps apart. Polynomial degree plays the role of kernel radius: a degree-KK graph filter sees exactly the KK-hop neighborhood, the way a 3×3 image kernel sees one pixel out. The cost drops to O(E)O(|\mathcal{E}|), linear in the number of edges.3 This is already a usable graph CNN. Kipf and Welling's move is to push the approximation one notch further.

The renormalization trick

Set the polynomial degree to its smallest interesting value, K=1K=1, so the filter is linear in the Laplacian. The Chebyshev rescaling needs the largest eigenvalue, and for the normalized Laplacian λmax2\lambda_{\max}\le 2 always (with equality only for a bipartite graph), so approximate λmax2\lambda_{\max}\approx 2 and trust training to absorb the slack. The two surviving terms give a filter with two free parameters:

gθxθ0x+θ1(LIN)x=θ0xθ1D1/2AD1/2xg_{\theta'} \star x \approx \theta'_0\, x + \theta'_1\,(L - I_N)\,x = \theta'_0\, x - \theta'_1\, D^{-1/2} A D^{-1/2}\, x
(6)

Now constrain the two knobs into one by setting θ=θ0=θ1\theta=\theta'_0=-\theta'_1. The minus sign is the load-bearing part: with it, the two terms add rather than fight, and (6) collapses to a single-parameter filter:

gθxθ(IN+D1/2AD1/2)xg_\theta \star x \approx \theta\,\big( I_N + D^{-1/2} A D^{-1/2} \big)\, x
(7)

This is clean, and it is also a trap. The operator IN+D1/2AD1/2I_N + D^{-1/2}AD^{-1/2} has eigenvalues in the range [0,2][0,2]. Stack a few of these layers and a signal aligned with the top eigenvalue gets multiplied by something near 22 every layer, blowing up, while the bottom of the spectrum is multiplied by something near zero and vanishes. Repeated application is numerically unstable, the deep-network version of exploding and vanishing gradients. The paper's fix is one substitution, the renormalization trick:

IN+D1/2AD1/2    D~1/2A~D~1/2,A~=A+INI_N + D^{-1/2} A D^{-1/2} \;\longrightarrow\; \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}, \qquad \tilde{A}=A+I_N

Read carefully, because this is not the algebraic no-op it first looks like. The left side normalizes by the original degrees DD and adds the identity outside the normalization. The right side folds the self-loop inside: it adds a self-loop first (A~=A+IN\tilde{A}=A+I_N) and then renormalizes by the new, larger degrees D~\tilde{D}. Folding the self-loop in re-weights every single entry, and it pulls the spectrum down: the largest eigenvalue of D~1/2A~D~1/2\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is exactly 11, and the whole spectrum lands in (1,1](-1, 1]. Counting yourself in the average, with the average rescaled to include you, damps the sharpest swings instead of amplifying them. Now you can stack layers without the signal exploding. (The spectral proof that self-loops strictly shrink the top eigenvalue, the "2\approx 2 down to 1.5\approx 1.5 on Cora" result, came later, from the SGC paper; Kipf and Welling introduced the trick and justified it as fixing the instability, then showed it helps.)

The figure tracks that largest eigenvalue on a star graph, the simplest case with a high-degree hub. The raw adjacency's top eigenvalue grows as 1+d1+\sqrt{d}, so a hub of degree dd amplifies more and more; the un-renormalized operator sits pinned at 22; the renormalized one stays at 11. Drag the hub degree and read off the consequence after a few stacked layers: the raw operator explodes, while the renormalized one holds.

Figure 3 · why the renormalization trick is stable
d = 9
Largest eigenvalue of three propagation operators on a star, versus hub degree. Raw A+I grows as 1+d1+\sqrt{d}; the plain first-order filter is flat at 2; the renormalized GCN operator is flat at 1. After a few layers a top-mode signal scales by λlayers\lambda^{\text{layers}}, so only the renormalized one stays bounded.

Generalize the single signal xx to a full feature matrix with CC input channels and FF output filters, and the filter θ\theta becomes a weight matrix ΘRC×F\Theta\in\mathbb{R}^{C\times F}:

Z=D~1/2A~D~1/2XΘZ = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} X \Theta
(8)

Aggregate with A^\hat{A}, mix channels with Θ\Theta, repeat: the layer rule from before, now with its weights accounted for. The square-root-of-degree weights were never arbitrary. They are what a first-order spectral filter looks like once you make it stable.

Two layers and a softmax

Stacking two of these layers gives the model the paper actually runs. Precompute A^=D~1/2A~D~1/2\hat{A}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} once, then:

Z=f(X,A)=softmax ⁣(A^  ReLU(A^XW(0))W(1))Z = f(X,A) = \mathrm{softmax}\!\Big( \hat{A}\;\mathrm{ReLU}\big( \hat{A}\,X\,W^{(0)} \big)\,W^{(1)} \Big)
(9)

W(0)RC×HW^{(0)}\in\mathbb{R}^{C\times H} maps the input features to HH hidden units; W(1)RH×FW^{(1)}\in\mathbb{R}^{H\times F} maps those to FF class scores; the softmax turns each node's row into a distribution over classes. Two applications of A^\hat{A} means each node's final prediction draws on its 2-hop neighborhood: the layout of the local graph, not just the node's own text. Training is the cross-entropy over the labeled nodes only:

L=lYLf=1FYlflnZlf\mathcal{L} = -\sum_{l\in\mathcal{Y}_L}\sum_{f=1}^{F} Y_{lf}\,\ln Z_{lf}
(10)

The sum runs over YL\mathcal{Y}_L, the labeled nodes, and nothing else. That one restriction does all the semi-supervised work, so it pays to walk through it slowly. The loss only scores 140 nodes on Cora. But ZlZ_l, the prediction at a labeled node ll, is computed by A^\hat{A} from the features of ll's neighbors, most of which are unlabeled, and those passed through A^\hat{A} from their neighbors. So the gradient of the loss at one labeled node flows backward through two hops of unlabeled nodes and updates the shared weights W(0),W(1)W^{(0)},W^{(1)} that also produce every other node's prediction. The 140 labels never directly supervise the other 2,568 nodes. They supervise the weights, and the weights, applied through the graph, classify everyone.

Concretely, on Cora: XX is 2708×14332708\times 1433, W(0)W^{(0)} is 1433×161433\times 16, W(1)W^{(1)} is 16×716\times 7, and A^\hat{A} is a sparse 2708×27082708\times 2708 matrix with one entry per edge. Because A^X\hat{A}X is a sparse-times-dense product, the forward pass costs O(ECHF)O(|\mathcal{E}|\,C\,H\,F), linear in the number of edges. The model trains full-batch (the entire graph is one example) with Adam at learning rate 0.01, dropout for regularization, and early stopping. The forward pass and one training step, in code:

# a two-layer GCN on Cora (N=2708 nodes, C=1433 features, F=7 classes)
A_hat = normalize(A + I)        # D~^-1/2 (A + I) D~^-1/2, sparse, precomputed
H = relu(A_hat @ X @ W0)        # X:[N,1433]  W0:[1433,16]  ->  H:[N,16]
Z = softmax(A_hat @ H @ W1)     # W1:[16,7]   ->  Z:[N,7] class scores
loss = cross_entropy(Z[idx_lab], Y[idx_lab])   # only 140 nodes are labeled
loss.backward(); adam.step()    # the gradient reaches W0, W1 through every node

Nothing here is exotic. It is a two-layer network whose only unusual move is the sparse multiply by A^\hat{A} sitting in front of each weight matrix. The graph enters as one fixed matrix you bake once and reuse every step.

What the graph buys you

On the headline numbers, the model wins cleanly. Accuracy on Cora is 81.5%, on Citeseer 70.3%, on Pubmed 79.0%, on NELL 66.0%, the best on every dataset, though by uneven margins: nearly six points clear on Cora (the strongest baselines there were ICA at 75.1% and Planetoid at 75.7%) and four on NELL, but only one to two on Citeseer and Pubmed. And it trains in seconds rather than the minutes those methods needed. But the single most telling number is an ablation, not a baseline. Strip the graph out of the model entirely, leaving a plain multi-layer perceptron on the same features (replace A^\hat{A} with the identity), and Cora accuracy falls from 81.5% to 55.1%. More than twenty-six points of accuracy come from the graph alone, the thesis in a single number.

The same table answers the other question the derivation raised: did the renormalization trick actually earn its place, or was it just for stability? Swap the propagation operator and hold everything else fixed. The figure plots the result. Every operator that uses the graph clears the no-graph MLP by a wide margin, and among them the renormalized form is the best, edging out the raw first-order filters and the heavier Chebyshev models that have more parameters. Cleaner and more stable also turned out to be more accurate.

Figure 4 · the graph does the work
Accuracy when you swap only the propagation operator (Table 3). The no-graph MLP sits far below everything that uses the graph; the renormalized GCN operator is the best of the lot. The bracket marks the accuracy the graph adds. Switch datasets with the buttons.

One honest footnote on the numbers. That 81.5% is the accuracy on a single fixed train/validation/test split, the one the Planetoid benchmark established and everyone reused.7 The paper also reports results over ten random splits, where Cora settles to 80.1% ± 0.5 and the harder NELL drops from 66.0% to 58.4% ± 1.7. The ranking holds, but later work showed that comparing graph methods on one frozen split flatters whichever model was tuned on it, so read the headline as "clearly competitive," not "exactly 6.4 points better forever."

Why GCNs stay shallow

A natural instinct, fresh off ResNets and very deep image networks, is to stack more layers. It does not work here. The paper finds two or three layers best, and beyond about seven the model gets hard to train at all. Kipf and Welling give three concrete reasons, and they are easy to confuse with the explanation that later became popular. First, each layer widens a node's receptive field by one hop, so a deep stack forces every node to integrate an enormous, mostly-irrelevant chunk of the graph. Second, more layers means more parameters and so more overfitting on a tiny label set. Third, the spectral instability: before the renormalization trick, deep stacks of the [0,2][0,2]-spectrum operator blow up.

There is a fourth account you will hear constantly, and its provenance matters. The modern reading is oversmoothing: each layer averages a node toward its neighbors, so after many rounds every node in a connected community drifts toward the same value and the features stop being distinguishable. The figure shows it directly. Start the two groups at +1+1 and 1-1, apply the aggregator layer by layer, and the gap between the groups shrinks every round: by ten layers it has fallen below half of where it started and is still dropping, the two groups bleeding into each other. Stack enough and a node's features no longer say which group it belongs to. The effect is real and it is how people reason about depth today. It is also not what the paper says. The word "oversmoothing" appears nowhere in it; that analysis arrived in 2018 and after. So the figure below is the modern lens applied in hindsight, not a result from 2016.

Figure 5 · stacking layers washes the features out
2
A two-group signal (+1 / −1) under repeated aggregation. The curve tracks the gap between the groups' averages: it falls every layer, dropping below half over the stack as the groups bleed together. Residual connections (the paper's Eq 14) let you go deeper. Drag the layer count.

One precise caveat on the collapse, because the cartoon version overstates it. Under the symmetric normalization GCN uses, repeated aggregation drives features toward a fixed pattern proportional to the square root of each node's degree, not toward a single constant; only the random-walk normalization collapses to a literal constant. Either way the differences between nodes in a community vanish, which is what kills classification. The paper's own remedy is to borrow residual connections from ResNets: add the input back after each layer, H(l+1)=σ(A^H(l)W(l))+H(l)H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)}) + H^{(l)}, which lets gradients and detail survive depth and pushes the workable limit out past ten layers. For these citation graphs, though, two hops is genuinely most of what a node needs to know.

A graph fingerprint you can train

There is a second way to read the layer rule that needs no spectral theory at all, and it explains why such a simple operation captures so much structure. The Weisfeiler-Lehman test is a classic algorithm for telling graphs apart: give every node a label, then repeatedly replace each node's label with a hash of its own and its neighbors' labels. After a few rounds, structurally similar nodes end up with the same fingerprint. Write that update with a trainable linear map and a nonlinearity in place of the hash, and you get exactly the GCN layer:

hi(l+1)=σ ⁣(jNi1cijhj(l)W(l)),cij=didjh_i^{(l+1)} = \sigma\!\left( \sum_{j\in\mathcal{N}_i} \frac{1}{c_{ij}}\, h_j^{(l)}\, W^{(l)} \right), \qquad c_{ij}=\sqrt{d_i d_j}
(12)

So a GCN is a smooth, differentiable version of the Weisfeiler-Lehman fingerprint. That framing predicts something striking: because even the raw fingerprint separates structurally distinct nodes, an untrained GCN with random weights should already place similar nodes near each other. It does. One sharpening is needed, though, and it cuts the other way from how the analogy is often stated: because the smooth aggregation in a GCN averages rather than hashes injectively, it can blur together neighborhoods the discrete test would keep apart. A GCN is therefore at most as discriminating as the Weisfeiler-Lehman test, not more. "Generalizes" here means differentiable and learnable, not strictly more powerful.

The cleanest demonstration is the one the paper closes with: Zachary's karate club, a 34-member social network that famously split into two factions.6 Hand a two-layer GCN no node features at all, just the graph, and exactly one labeled node per faction. Its two-dimensional output starts as an undifferentiated blob. Press play and watch the two factions pull apart, driven entirely by two labels and the wiring. The model has never been told which side anyone is on except those two seeds; the graph carries the rest.

Figure 6 · semi-supervised, on the karate club
A featureless two-layer GCN trained with one labeled node per faction (the ringed nodes). The 2-D embedding, plotted with the graph's edges, separates the two factions as training runs. Real gradient descent, computed live; press play or scrub, and reseed to redraw the start.

What the paper delivered fits in a sentence. One layer rule, motivated two ways that meet in the middle: a first-order spectral filter made stable by a self-loop, and a trainable version of a graph-coloring test. Stack two of them, add a softmax and a cross-entropy over a few labeled nodes, and the gradient spreads that thin supervision through the edges to every node it never directly scores. It was not the first neural network on graphs, but it was the one simple and scalable enough to build on, and most of the graph neural networks that followed are variations on this single equation.

Provenance Verified against primary literature
Hammond et al. (2011)Polynomial (Chebyshev) spectral filters are exactly K-hop localized.
Defferrard et al. (2016)ChebNet: the Chebyshev graph CNN that GCN truncates to K=1.
Bruna et al. (2014)The first spectral graph neural network.
Wu et al. (2019, SGC)Proved the renormalization trick shrinks the spectrum into a low-pass filter.
Weisfeiler & Lehman (1968)The graph-coloring test the layer rule generalizes (Appendix A).
Yang et al. (2016, Planetoid)The Cora / Citeseer / Pubmed / NELL splits and baselines.
correctionThe popular account blames shallow GCNs on 'oversmoothing.' That word never appears in the 2016 paper, which gives three different reasons (growing receptive field, overfitting, spectral instability). Oversmoothing is a 2018-and-later reframing, and we flag it as such where we use it.

Questions you might still have

?

If the loss only scores a few labeled nodes, how does the model learn the rest?
A labeled node’s prediction is computed from its neighbors’ features through A_hat, so the gradient flows back through unlabeled nodes and updates the shared weights that produce every node’s output. The 140 labels supervise the weights; the weights, applied through the graph, classify all 2,708 nodes.

?

Is the aggregation just averaging each node’s neighbors?
Almost, but not exactly. It is a degree-weighted sum with edge weight 1/sqrt(d_i d_j) over the closed neighborhood, and the weights into a node do not sum to one. The plain row-averaging operator is the random walk D~^{-1}A~, a different choice the paper deliberately avoids in favor of the symmetric form, which keeps eigenvalues real and stacking stable.

?

Why not just stack twenty layers like a deep image network?
Each layer widens a node’s receptive field by one hop and adds parameters, and repeated symmetric averaging washes node features toward a degree-weighted consensus until classes blur together. Two or three layers work best; residual connections (Eq 14) push the limit deeper.

?

Does a trained GCN work on a graph it never saw?
Not as published. The original model is transductive: it needs the full graph and all features at training time. GraphSAGE (Hamilton et al., 2017) reworked the idea to be inductive, learning the aggregation as a reusable function that embeds brand-new nodes.

?

Do I actually need the spectral graph theory to implement this?
No. The final layer is a sparse neighbor-averaging matrix multiply you can write in a few lines, no eigenvectors anywhere. The spectral derivation is where the specific normalized weights come from and why the operation counts as a convolution, but the running code never touches the Fourier basis.

Footnotes & further reading

  1. The paper: Thomas N. Kipf and Max Welling, Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017). Reference code: github.com/tkipf/gcn.
  2. A bookkeeping note on Eq (1): the paper prints Lreg=i,jAijfifj2=fΔf\mathcal{L}_{\text{reg}}=\sum_{i,j}A_{ij}\lVert f_i-f_j\rVert^2=f^\top\Delta f, while the exact identity carries a factor of one half, fΔf=12i,jAijfifj2f^\top\Delta f=\tfrac12\sum_{i,j}A_{ij}\lVert f_i-f_j\rVert^2. The constant is absorbed into the weighting factor λ\lambda, so it changes nothing.
  3. K-hop localization of polynomial filters is Hammond, Vandergheynst & Gribonval, Wavelets on Graphs via Spectral Graph Theory (2011); the Chebyshev graph CNN is Defferrard, Bresson & Vandergheynst, ChebNet (2016). As printed, the sums in Eqs (4)–(5) run k=0,,Kk=0,\dots,K (so K+1K+1 terms) while the text writes θRK\theta'\in\mathbb{R}^K; reproduced here as written.
  4. The spectral analysis showing self-loops shrink the largest eigenvalue and make the filter low-pass (with A~=A+γI\tilde{A}=A+\gamma I generalizing the self-loop weight) is Wu et al., Simplifying Graph Convolutional Networks (SGC) (2019), which also shows you can drop the nonlinearities and collapse the GCN into a single fixed low-pass filter plus logistic regression at little cost in accuracy.
  5. The oversmoothing reading of GCN depth: Li, Han & Wu, Deeper Insights into Graph Convolutional Networks (2018); NT & Maehara (2019); Oono & Suzuki (ICLR 2020). The expressivity bound (message-passing GNNs are at most as powerful as Weisfeiler-Lehman, with sum-aggregation needed to match it) is Xu et al., How Powerful Are Graph Neural Networks? (GIN) (2019). The inductive follow-up is Hamilton, Ying & Leskovec, GraphSAGE (2017).
  6. The paper states the karate club has 154 edges; the canonical Zachary (1977) network is 34 nodes and 78 undirected edges (154 or 156 if you double-count directions). The figure here uses the two ground-truth factions and one labeled seed per faction, in the spirit of the paper's Appendix A.2 (which used a four-way split with one label per class).
  7. On split sensitivity, see Shchur et al., Pitfalls of Graph Neural Network Evaluation (2018): rankings across graph methods can reshuffle under different random splits, so a single fixed split is a weak basis for fine-grained comparisons. The benchmark split itself is from Yang, Cohen & Salakhutdinov, Planetoid (2016).