Scientific Challenges
Sample paper from the IAIO 2026 International Round (in English). 8 tasks, questions only.
Task 1:
In the context of the k-arm bandit problem, what is the primary challenge faced by the decision-making algorithm?
-
Determining the optimal sequence of actions before the first time-frame begins.
-
Balancing exploration of unknown actions with exploitation of known high-reward actions.
-
Predicting how the reward for each action will change over time.
-
Minimizing the total number of actions taken within the time period T.
-
Task 2:
Which of the following statements is TRUE?
-
The kernel or mask of a Convolutional layer is the distance between adjacent receptive fields in a specific direction (horizontal or vertical).
-
Average pooling is a Convolution operation where all neurons have trainable weights.
-
A Convolutional layer can have more than one feature maps.
-
Two neurons A, B within a Convolutional feature map have different weights WA # WB.
-
Task 3:
During the development of a Large Language Model (LLM), a team implements "Differential Privacy" (DP) as an ethical design constraint to prevent the leakage of Personally Identifiable Information (PII). This process involves a privacy budget parameter ε.
Which of the following best describes the resulting impact of this constraint on the system's performance?-
The model's utility remains constant, but the training time increases exponentially to handle the privacy layers
-
There is an inherent trade-off where increasing the privacy guarantee (decreasing ε) necessarily degrades the model's accuracy and convergence rate.
-
The constraint acts as a regularization term, improving generalization by preventing the model from over-fitting on specific data points.
-
Ethical constraints like Differential Privacy are purely architectural and do not affect the loss function or the final weights of the model.
-
Task 4
Let be a finite set of some finite size . Let be a learner that on a sample
outputs a function such that:
If for some then
If , then is picked randomly with probability for each label in
Given a function , define a probability distribution over by assigning probability to any pair and probability to any .
Show that for every and ,
Task 5
In sequence models like the Transformer, the core self-attention mechanism processes all tokens simultaneously and natively lacks any concept of sequence order. To address this, a positional encoding vector is added to each token's embedding to inject positional information.
For a token at an integer sequence position pos, its positional encoding is a vector of dimension . The values of this vector are constructed using sine and cosine functions. Specifically, for a given dimension index (where ), the encoding values for the -th and -th dimensions are given by:
Which one of the following mathematical properties holds true for this specific encoding formulation? Select one correct option and justify your answer.
-
Orthogonality: The encodings for any two different positions are statistically orthogonal, ensuring that position information does not interfere with the semantic meaning of the word embeddings.
-
Linear Translation: For any fixed offset , the encoding at position can be represented as a linear function (rotation) of the encoding at position pos.
-
Decay: The magnitude of the encoding vector decays exponentially as the position index increases, naturally biasing the model towards recent context.
-
Symmetry: The function is perfectly symmetric around the midpoint of the sequence, allowing the model to process bidirectional context with shared weights.
-
Task 6
Let us consider a multi-label classification problem. For each class , represents the instances that are in class correctly identified as belonging to ; represents the instances that are not in class correctly identified as not belonging to ; represents the instances that are in not class incorrectly labeled as belonging to ; represents the instances that are in class incorrectly labeled as not belonging to .
We have the following metrics:
- per-class recall:
- per-class precision:
- per-class specificity:
- for each class-dependent error metric , the macro average error metric is the arithmetic mean of the class specific scores:
- for a distribution of the labels, the accuracy:
- if is a cost matrix where , the expected cost is
You can assume that every class has nonzero support and the classifier makes both correct and incorrect predictions for every class (non-degeneracy).
1. Let us assume that you are in a binary classifier, with classes 0 and 1. Prove that, if
the support of the two classes must be the same.
For the remaining parts, you can assume the following context.
Context
A public-health triage system classifies incoming messages into one of three classes:
The training dataset was intentionally enriched with urgent and semi-urgent cases. On the training set, the classifier produced the following confusion matrix (rows correspond to the true label, columns to the predicted label):
The corresponding training label distribution is
At deployment time, the real-world label distribution is different:
You may assume that the classifier's conditional behavior is stable:
We are given the following cost matrix (cost of predicting the column label when the true label is the row label):
2. Compute under both and . Explain why precision changes under distribution shift while recall does not.
3. Consider two strategies for choosing the deployed decision rule:
-
maximizing training accuracy;
-
minimizing expected deployment cost.
Explain why strategy (1) can lead to a higher expected cost at deployment than strategy (2).
Task 7
- Express the distance of the image of a point from the centre of mass of a set of examples
in a feature space defined by kernel in terms of kernel evaluations.
[2 Marks]
- Using the result from item 1, give pseudocode for the novelty detection algorithm that labels a test point novel if it lies outside the smallest sphere centred at the centre of mass that contains all of the training data.
[3 Marks]
- Show that if we remove one point from , the centre of mass moves away from on the line from through the centre of mass by times the distance between them.
[3 Marks]
- Hence or otherwise show that the novelty detection method using the same centre as in item 2, above but increasing the radius of the sphere by ensures that the leave one out error (an error occurs if the left out point is not inside the sphere) on the training set is at most 1.
[5 Marks]
Task 8
Context
In Statistical Machine Translation, a common sub-problem is Word Ordering. You are given a "bag of words" (a set of scrambled words) and must reconstruct the original sentence order to maximize the probability of the sequence.
Let be a set of unique words. You are given a Bigram Language Model, , which provides the probability of word following word . There are two special tokens: (Start) and (End). The sentence must begin with and end with .
Objective
Find a permutation of the words in that maximizes the sentence probability:
To apply the A* algorithm, we transform this maximization problem into a minimization problem by defining the cost of a transition between word and word as the negative log-probability:
Part A: State Space Formalization
Formalize this problem as a State Space Graph search suitable for the A* algorithm.
Specifically, define
-
The structure of a State .
-
The Start State and Goal State.
-
The Successor Function (how transitions are generated) and the accumulated cost .
[3 Marks]
Part B: Heuristic Design
The core difficulty lies in the factorial structure of the search space (), making a trivial heuristic (e.g., ) computationally intractable for even modest input sizes (e.g., ). Design a non-trivial heuristic that estimates the remaining cost to complete the sentence. Your heuristic must meet the following criteria:
-
It must be admissible.
-
It must be monotonic (consistent).
Define your heuristic mathematically. You do not need to prove the properties yet, but you must explain the logic behind your design.
Hint: Consider the set of words that have not yet been placed. In a valid completed sentence, every one of these words must have exactly one incoming edge.
[5 Marks]
Part C: Theoretical Proofs
Using the heuristic you defined in Part B, provide formal proofs for the following properties:
-
Admissibility: Prove that , where is the true minimum cost to the goal.
-
Monotonicity: Prove that , where is a successor of .
[5 Marks]
[Total 13 Marks]
-
Source: IAIO 2026 International Round paper. For reference only.
