The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (2021) by A. Szabó, MFF UK.
The contracts were obtained from the Hlídač Státu web portal. Labels in the training and development sets were assigned automatically using the keyword method described in the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz, J. Maroušek (2020), MFF UK. Because these automatic labels contain a certain amount of noise, the goal of classification is not to reach 100% on the development set. The test set is manually annotated. The dataset contains 97493 contracts in total.
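To illustrate why keyword-based labels are noisy, here is a minimal sketch of a keyword labeller. It is not the procedure from the cited thesis: the category names, the keyword lists, and the "most keyword hits wins" rule are all assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical categories and keywords, chosen only for illustration;
# the real ones are defined in the cited thesis.
KEYWORDS = {
    "construction": ["stavba", "rekonstrukce"],
    "it_services": ["software", "licence"],
}


def keyword_label(text: str) -> Optional[str]:
    """Return the category with the most keyword hits, or None if nothing matches."""
    lowered = text.lower()
    # Count how often each category's keywords occur in the text.
    counts = {
        label: sum(lowered.count(kw) for kw in kws)
        for label, kws in KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

A contract that mentions a keyword only in passing receives that category's label regardless of its actual subject, which is one way such labels can end up wrong.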
This communication gives some extensions of the original Bühlmann model. The paper is devoted to semi-linear credibility, where one examines functions of the random variables representing claim amounts rather than the claim amounts themselves. The main purpose of semi-linear credibility theory is to estimate $\mu_0(\theta) = E[f_0(X_{t+1}) \mid \theta]$ (the net premium for a contract with risk parameter $\theta$) by a linear combination of given functions of the observable variables $X' = (X_1, X_2, \ldots, X_t)$. The estimators mainly considered here are therefore linear combinations of several functions $f_1, f_2, \ldots, f_n$ of the observable random variables. The approximation to $\mu_0(\theta)$ based on prescribed approximating functions $f_1, f_2, \ldots, f_n$ leads to the optimal non-homogeneous linearized estimator for the semi-linear credibility model. We also discuss the case $f_p = f$ for all $p$, where the aim is to find the optimal function $f$. It should be noted that the approximation to $\mu_0(\theta)$ based on a unique optimal approximating function $f$ is always better than the one in the semi-linear credibility model based on prescribed approximating functions $f_1, f_2, \ldots, f_n$. The advantage of the latter approximation is that it is easy to apply, since it is sufficient to know estimates of the structure parameters appearing in the credibility factors. We therefore give some unbiased estimators for the structure parameters. For this purpose we embed the contract in a collective of contracts, all providing independent information on the structure distribution. We close the paper with the semi-linear hierarchical model used in the applications chapter.
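As a sketch of what a non-homogeneous estimator that is linear in prescribed functions of the observations looks like, one common way to write it is given below. The coefficient names $\alpha_0$ and $\alpha_{pr}$ are notation assumed here for illustration and are not taken from the paper; the mean-squared-error criterion is the usual one in credibility theory.

```latex
% Assumed notation: \alpha_0 is the constant term, \alpha_{pr} weights f_p(X_r).
\[
  \hat{\mu}_0(\theta) \;=\; \alpha_0 + \sum_{p=1}^{n} \sum_{r=1}^{t} \alpha_{pr}\, f_p(X_r),
\]
% The coefficients are chosen to minimize the expected squared error:
\[
  \min_{\alpha_0,\,\alpha_{pr}} \;
  E\!\left[\Bigl(\mu_0(\theta) - \alpha_0 - \sum_{p=1}^{n} \sum_{r=1}^{t} \alpha_{pr}\, f_p(X_r)\Bigr)^{2}\right].
\]
```

Setting $n = 1$ and $f_0(x) = f_1(x) = x$ gives an estimator that is linear in the claim amounts themselves, which is the setting of the original Bühlmann model.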