For a minibatch (x, y, τ) the total loss is:
\[ \mathcal{L} = \underbrace{\mathcal{L}_{\text{CE}}(f(x; \theta), y)}_{\text{Classification}} + \dots \]
Hyper‑parameters (λ values, β) are tuned on a held‑out validation task.
Figure 1 (below) illustrates the high‑level flow. The backbone B processes an input image x into a feature map F ∈ ℝ^{C×H×W}. The pipeline then splits into three parallel modules:
The final representation z is obtained by a joint‑junction operation:
\[ z = \underbrace{\text{Norm}\big(\,W_s z_s \oplus W_c z_c\,\big)}_{\text{85JJ}}, \]
where ⊕ denotes concatenation, W_s, W_c are learnable projection matrices, and Norm is a LayerNorm. This joint vector drives the classifier head.
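As a minimal sketch of the joint‑junction operation above, the following plain‑Python example projects two feature vectors, concatenates them, and applies a layer normalization. It assumes that z_s and z_c are the vectors produced by the S‑Junction and C‑Junction, and that Norm carries no learnable gain or bias. The helper names (`project`, `layer_norm`, `joint_junction`), the toy dimensions, and the weight values are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def project(W: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product W v, with W given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def layer_norm(v: Sequence[float], eps: float = 1e-5) -> Vector:
    """LayerNorm without learnable gain/bias (an assumption of this sketch)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def joint_junction(W_s: Matrix, z_s: Sequence[float],
                   W_c: Matrix, z_c: Sequence[float]) -> Vector:
    """z = Norm(W_s z_s (+) W_c z_c), where (+) is concatenation."""
    # List addition concatenates the two projected vectors.
    return layer_norm(project(W_s, z_s) + project(W_c, z_c))


# Toy example: 3-d semantic and 2-d contextual features, each projected to 2-d.
W_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_c = [[0.5, 0.5], [1.0, -1.0]]
z = joint_junction(W_s, [1.0, 2.0, 3.0], W_c, [4.0, 0.0])
```

The resulting z has dimension 4 (two projected components from each stream) with approximately zero mean and unit variance, which is the form in which it would be passed to the classifier head.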
The quest for continual learning—the ability of an artificial system to acquire an open‑ended sequence of tasks—remains a central challenge in modern AI. Classical deep networks excel when trained on a static dataset but suffer from catastrophic forgetting when the data distribution shifts (McCloskey & Cohen, 1989). Recent work has tackled this problem from three complementary angles:
While effective in isolation, these strategies struggle to balance three desiderata simultaneously:
Neuroscientific studies of the hippocampal‑cortical system reveal a joint‑junction mechanism: episodic traces are bound via junction cells that integrate semantic content with contextual metadata (Eichenbaum, 2017). Moreover, lateral inhibition in cortical columns dynamically sharpens representations, ensuring that only task‑relevant neurons remain active (Carandini & Heeger, 2012). These observations motivate a computational analogue: a network that jointly fuses semantic and contextual streams while inhibiting irrelevant pathways.
In this paper we propose ALICE‑85JJ (Adaptive Lateral Inhibition with 85‑Joint‑Junction), a unified framework that operationalizes the joint‑junction principle. The name reflects its two core components:
Our contributions are threefold:
The remainder of the paper is organized as follows: Section 2 surveys related work; Section 3 details the ALICE‑85JJ architecture; Section 4 describes the training protocol; Section 5 reports experimental results; Section 6 discusses limitations and future directions; Section 7 concludes.
We adopt the task‑incremental setting where tasks arrive sequentially, each accompanied by a task descriptor τ (e.g., “classify CIFAR‑10 objects under rainy lighting”). The protocol is:
No replay buffer or external memory is employed; all consolidation occurs via GMC.
Both junctions maintain running importance estimates I_s, I_c using an exponential moving average of gradient magnitudes:
\[ I_s \leftarrow \beta I_s + (1-\beta)\,|\nabla_{\theta_s} \mathcal{L}|, \qquad I_c \leftarrow \beta I_c + (1-\beta)\,|\nabla_{\theta_c} \mathcal{L}|. \]
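The moving‑average update can be illustrated with a short sketch. It assumes the magnitude |∇L| is taken element‑wise, giving one importance score per parameter. The function name `update_importance`, the value β = 0.9, and the gradient values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def update_importance(importance: Sequence[float],
                      grads: Sequence[float],
                      beta: float = 0.9) -> List[float]:
    """One EMA step: I <- beta * I + (1 - beta) * |grad|, element-wise."""
    return [beta * i + (1.0 - beta) * abs(g) for i, g in zip(importance, grads)]


# Hypothetical gradients for three S-Junction weights over two updates.
I_s = [0.0, 0.0, 0.0]
for grads in ([0.5, -2.0, 0.0], [0.5, 1.0, 0.0]):
    I_s = update_importance(I_s, grads, beta=0.9)
```

In this toy run the second weight, which receives the largest gradient magnitudes, accumulates the highest importance, while the weight with zero gradient stays at zero. The same update would apply to I_c with the C‑Junction gradients.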
These scores modulate the gradient‑modulated consolidation (GMC) loss:
\[ \mathcal{L}_{\text{GMC}} = \sum_{p \in \Theta} \big( I_p \cdot \Delta\theta_p \big)^2, \]
where Δθ_p is the parameter change for weight p in the current update, and Θ denotes the union of parameters in B, S‑Junction, and C‑Junction. Intuitively, parameters with high past importance receive a stronger penalty for deviation, thus preserving previously learned knowledge without requiring explicit replay.
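To make the penalty concrete, the sketch below evaluates the GMC loss for a flat list of parameters. The function name `gmc_loss` and the numeric values are illustrative assumptions; in practice Θ would span the parameters of B, the S‑Junction, and the C‑Junction.

```python
from typing import Sequence


def gmc_loss(importance: Sequence[float], delta_theta: Sequence[float]) -> float:
    """L_GMC = sum over p of (I_p * delta_theta_p)^2."""
    return sum((i * d) ** 2 for i, d in zip(importance, delta_theta))


# Same-sized change on two weights; the high-importance one dominates the penalty.
importance = [2.0, 0.1]
delta_theta = [0.5, 0.5]
penalty = gmc_loss(importance, delta_theta)
```

Here both weights move by the same amount, yet the first contributes (2.0 · 0.5)² = 1.0 to the penalty and the second only (0.1 · 0.5)² = 0.0025, which matches the intuition that important parameters are held closer to their previous values.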






