Thử thách Khoa học

Đề thi mẫu Vòng Chung kết Quốc tế IAIO 2026 (bằng tiếng Anh). 8 câu, chỉ đề bài.

  1. Task 1:

    In the context of the k-arm bandit problem, what is the primary challenge faced by the decision-making algorithm? 

    1. Determining the optimal sequence of actions before the first time-frame begins. 

    2. Balancing exploration of unknown actions with exploitation of known high-reward actions. 

    3. Predicting how the reward for each action will change over time. 

    4. Minimizing the total number of actions taken within the time period T.

  2. Task 2:

    Which of the following statements is TRUE?

    1. The kernel or mask of a Convolutional layer is the distance between adjacent receptive fields in a specific direction (horizontal or vertical).

    2. Average pooling is a Convolution operation where all neurons have trainable weights.

    3. A Convolutional layer can have more than one feature maps.

    4. Two neurons A, B within a Convolutional feature map have different weights WA # WB.

  3. Task 3:

    During the development of a Large Language Model (LLM), a team implements "Differential Privacy" (DP) as an ethical design constraint to prevent the leakage of Personally Identifiable Information (PII). This process involves a privacy budget parameter ε.
    Which of the following best describes the resulting impact of this constraint on the system's performance?

    1. The model's utility remains constant, but the training time increases exponentially to handle the privacy layers

    2. There is an inherent trade-off where increasing the privacy guarantee (decreasing ε) necessarily degrades the model's accuracy and convergence rate.

    3. The constraint acts as a regularization term, improving generalization by preventing the model from over-fitting on specific data points.

    4. Ethical constraints like Differential Privacy are purely architectural and do not affect the loss function or the final weights of the model.

  4. Task 4

    Let XX be a finite set of some finite size kk. Let AA be a learner that on a sample

    S=((x1,y1),,(xm,ym))(X×{0,1})mS = ((x_{1},y_{1}),\ldots,(x_{m},y_{m})) \in (X \times \{ 0,1\})^{m} outputs a function A(S)A(S) such that:

    If x=xix = x_{i} for some i{1,,m}i \in \{ 1,\ldots,m\} then A(S)(x)=yiA(S)(x) = y_{i}

    If x{x1,,xm}x \notin \{ x_{1},\ldots,x_{m}\}, then A(S)(x)A(S)(x) is picked randomly with probability 12\frac{1}{2} for each label in {0,1}\left\{ 0,1 \right\}

    Given a function f:X{0,1}f:X \rightarrow \{ 0,1\}, define a probability distribution (U,f)\left( U,f \right) over X×{0,1}X \times \{ 0,1\} by assigning probability 1/k1/k to any pair (x,f(x))\left( x,f(x) \right) and probability 00 to any (x,y){(x,f(x)):xX}(x,y) \notin \{(x,f(x)):x \in X\}.

    Show that for every f:X{0,1}f:X \rightarrow \{ 0,1\} and S=((x1,f(x1)),,(xm,f(xm)))S = ((x_{1},f(x_{1})),\ldots,(x_{m},f(x_{m}))),

    L(U,f)(A(S))LS(A(S))(km)2k.L_{\left( U,f \right)}(A(S)) - L_{S}(A(S)) \geq \frac{\left( k-m \right)}{2k}.
  5. Task 5

    In sequence models like the Transformer, the core self-attention mechanism processes all tokens simultaneously and natively lacks any concept of sequence order. To address this, a positional encoding vector is added to each token's embedding to inject positional information.

    For a token at an integer sequence position pos, its positional encoding is a vector of dimension dd. The values of this vector are constructed using sine and cosine functions. Specifically, for a given dimension index ii (where 0i<d/20 \leq i < d/2), the encoding values for the 2i2i-th and (2i+1)\left( 2i+1 \right)-th dimensions are given by:

    PE(pos,2i)=sin(pos10000id)PE(pos,2i+1)=cos(pos10000id)\begin{aligned} \text{PE}_{(\text{pos},\,2i)} &= \sin\left( \frac{\text{pos}}{10000^{\frac{i}{d}}} \right) \\ \text{PE}_{(\text{pos},\,2i+1)} &= \cos\left( \frac{\text{pos}}{10000^{\frac{i}{d}}} \right) \end{aligned}

    Which one of the following mathematical properties holds true for this specific encoding formulation? Select one correct option and justify your answer.

    1. Orthogonality: The encodings for any two different positions are statistically orthogonal, ensuring that position information does not interfere with the semantic meaning of the word embeddings.

    2. Linear Translation: For any fixed offset kk, the encoding at position pos+k\text{pos} + k can be represented as a linear function (rotation) of the encoding at position pos.

    3. Decay: The magnitude of the encoding vector decays exponentially as the position index increases, naturally biasing the model towards recent context.

    4. Symmetry: The function is perfectly symmetric around the midpoint of the sequence, allowing the model to process bidirectional context with shared weights.

  6. Task 6

    Let us consider a multi-label classification problem. For each class cc, TPcTP_{c} represents the instances that are in class cc correctly identified as belonging to cc; TNcTN_{c} represents the instances that are not in class cc correctly identified as not belonging to cc; FPcFP_{c} represents the instances that are in not class cc incorrectly labeled as belonging to cc; FNcFN_{c} represents the instances that are in class cc incorrectly labeled as not belonging to cc.

    We have the following metrics:

    • per-class recall:
    Recallc=TPcTPc+FNc=P(pred=ctrue=c);\text{Recall}_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}} = P(\text{pred} = c \mid \text{true} = c);
    • per-class precision:
    Precisionc=TPcTPc+FPc=P(true=cpred=c);\text{Precision}_{c} = \frac{TP_{c}}{TP_{c} + FP_{c}} = P(\text{true} = c \mid \text{pred} = c);
    • per-class specificity:
    Specc=TNcTNc+FPc=P(predctruec);\text{Spec}_{c} = \frac{TN_{c}}{TN_{c} + FP_{c}} = P(\text{pred} \neq c \mid \text{true} \neq c);
    • for each class-dependent error metric McM_{c}, the macro average error metric is the arithmetic mean of the class specific scores:
    MacroM=cMcnumber of classes;\text{Macro} - M = \frac{\sum_{c}^{}M_{c}}{\text{number~of~classes}};
    • for a distribution π=(πc)\pi = (\pi_{c}) of the labels, the accuracy:
    acc(π)=cπcRecallc;\text{acc}(\pi) = \sum_{c}^{}\pi_{c}\text{Recall}_{c};
    • if C=(Cij)C = (C_{ij}) is a cost matrix where Cij=Cost(pred=jtrue=i)C_{ij} = \text{Cost}(\text{pred} = j \mid \text{true} = i), the expected cost is
    Eπ(C)=i,jπiP(pred=jtrue=i)Cij.E_{\pi}(C) = \sum_{i,j}^{}\pi_{i}P(\text{pred} = j \mid \text{true} = i)C_{ij}.

    You can assume that every class has nonzero support and the classifier makes both correct and incorrect predictions for every class (non-degeneracy).

    1. Let us assume that you are in a binary classifier, with classes 0 and 1. Prove that, if

    accuracy=precision1=recall1,\text{accuracy} = \text{precision}_{1} = \text{recall}_{1},

    the support of the two classes must be the same.

    For the remaining parts, you can assume the following context.

    Context

    A public-health triage system classifies incoming messages into one of three classes:

    U=Urgent,S=Semi-urgent,N=Non-urgent.U = \text{Urgent},S = \text{Semi-urgent},N = \text{Non-urgent}.

    The training dataset was intentionally enriched with urgent and semi-urgent cases. On the training set, the classifier produced the following confusion matrix (rows correspond to the true label, columns to the predicted label):

    True\PredUSNTotalU2003020250S3027050350N1030360400\begin{array}{c|ccc|c} \text{True} \backslash \text{Pred} & U & S & N & \text{Total} \\ \hline U & 200 & 30 & 20 & 250 \\ S & 30 & 270 & 50 & 350 \\ N & 10 & 30 & 360 & 400 \end{array}

    The corresponding training label distribution is

    πtrain=(P(U)=0.25, P(S)=0.35, P(N)=0.40).\pi_{\text{train}} = (P(U) = 0.25,\ P(S) = 0.35,\ P(N) = 0.40).

    At deployment time, the real-world label distribution is different:

    πdep=(P(U)=0.05, P(S)=0.15, P(N)=0.80).\pi_{\text{dep}} = (P(U) = 0.05,\ P(S) = 0.15,\ P(N) = 0.80).

    You may assume that the classifier's conditional behavior is stable:

    Ptrain(pred=jtrue=i)=Pdep(pred=jtrue=i)for all i,j{U,S,N}.P_{\text{train}}(\text{pred} = j \mid \text{true} = i) = P_{\text{dep}}(\text{pred} = j \mid \text{true} = i)\text{for~all~}i,j \in \{ U,S,N\}.

    We are given the following cost matrix (cost of predicting the column label when the true label is the row label):

    C(pred=jtrue=i)=USNU020100S5010N110C(\text{pred} = j \mid \text{true} = i) = \begin{matrix} & U & S & N \\ U & 0 & 20 & 100 \\ S & 5 & 0 & 10 \\ N & 1 & 1 & 0 \end{matrix}

    2. Compute PrecisionU\text{Precision}_{U}under both πtrain\pi_{\text{train}}and πdep\pi_{\text{dep}}. Explain why precision changes under distribution shift while recall does not.

    3. Consider two strategies for choosing the deployed decision rule:

    1. maximizing training accuracy;

    2. minimizing expected deployment cost.

    Explain why strategy (1) can lead to a higher expected cost at deployment than strategy (2).

  7. Task 7

    1. Express the distance of the image of a point x\mathbf{x} from the centre of mass of a set of examples
    S={x1,,xm}S = \{\mathbf{x}_{1},\ldots,\mathbf{x}_{m}\}

    in a feature space defined by kernel κ(x,z)=ϕ(x),ϕ(z)\kappa(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle in terms of kernel evaluations.

    [2 Marks]

    1. Using the result from item 1, give pseudocode for the novelty detection algorithm that labels a test point novel if it lies outside the smallest sphere centred at the centre of mass that contains all of the training data.

    [3 Marks]

    1. Show that if we remove one point xj\mathbf{x}_{j} from SS, the centre of mass moves away from ϕ(xj)\phi(\mathbf{x}_{j}) on the line from ϕ(xj)\phi(\mathbf{x}_{j}) through the centre of mass by 1m1\frac{1}{m - 1} times the distance between them.

    [3 Marks]

    1. Hence or otherwise show that the novelty detection method using the same centre as in item 2, above but increasing the radius of the sphere by mm2\frac{m}{m - 2} ensures that the leave one out error (an error occurs if the left out point is not inside the sphere) on the training set is at most 1.

    [5 Marks]

  8. Task 8

    Context

    In Statistical Machine Translation, a common sub-problem is Word Ordering. You are given a "bag of words" (a set of scrambled words) and must reconstruct the original sentence order to maximize the probability of the sequence.

    Let W={w1,w2,,wN}W = \{ w_{1},w_{2},\ldots,w_{N}\} be a set of NN unique words. You are given a Bigram Language Model, P(wjwi)P(w_{j} \mid w_{i}), which provides the probability of word wjw_{j} following word wiw_{i}. There are two special tokens: S\langle S\rangle (Start) and E\langle E\rangle (End). The sentence must begin with S\langle S\rangle and end with E\langle E\rangle.

    Objective

    Find a permutation π\pi of the words in WWthat maximizes the sentence probability:

    P(π)=P(wπ1S)×(i=1N1P(wπi+1wπi))×P(EwπN)P(\pi) = P(w_{\pi_{1}} \mid \langle S\rangle) \times \left( \prod_{i = 1}^{N - 1}P(w_{\pi_{i + 1}} \mid w_{\pi_{i}}) \right) \times P(\langle E\rangle \mid w_{\pi_{N}})

    To apply the A* algorithm, we transform this maximization problem into a minimization problem by defining the cost of a transition between word uu and word vvas the negative log-probability:

    C(u,v)=ln(P(vu))C(u,v) = - \ln(P(v \mid u))

    Part A: State Space Formalization

    Formalize this problem as a State Space Graph search suitable for the A* algorithm.

    Specifically, define

    1. The structure of a State nn.

    2. The Start State and Goal State.

    3. The Successor Function (how transitions are generated) and the accumulated cost g(n)g(n).

    [3 Marks]

    Part B: Heuristic Design

    The core difficulty lies in the factorial structure of the search space (N!N!), making a trivial heuristic (e.g., h=0h = 0) computationally intractable for even modest input sizes (e.g., N>15N > 15). Design a non-trivial heuristic h(n)h(n) that estimates the remaining cost to complete the sentence. Your heuristic must meet the following criteria:

    1. It must be admissible.

    2. It must be monotonic (consistent).

    Define your heuristic mathematically. You do not need to prove the properties yet, but you must explain the logic behind your design.

    Hint: Consider the set of words that have not yet been placed. In a valid completed sentence, every one of these words must have exactly one incoming edge.

    [5 Marks]

    Part C: Theoretical Proofs

    Using the heuristic h(n)h(n) you defined in Part B, provide formal proofs for the following properties:

    1. Admissibility: Prove that h(n)h(n)h(n) \leq h^{*}(n), where h(n)h^{*}(n) is the true minimum cost to the goal.

    2. Monotonicity: Prove that h(n)cost(n,n)+h(n)h(n) \leq \text{cost}(n,n') + h(n'), where nn' is a successor of nn.

    [5 Marks]

    [Total 13 Marks]

Nguồn: Đề thi Vòng Chung kết Quốc tế IAIO 2026. Chỉ mang tính tham khảo.