Research into sequence prediction using neural networks holds the key to major advances in speech recognition, machine translation, and language modeling. However, building and training such models is computationally demanding and expensive. For instance, recent advances in language modeling were achieved only with massive models trained on large GPU clusters for weeks at a time.

But these models are too expensive for typical academic institutions and impractical for use in most enterprises. To address this issue, Facebook AI Research (FAIR) has developed a softmax function approximation for GPUs, enabling researchers to more efficiently train neural network-based language models and cut down on the computational load.

Dubbed “adaptive softmax”, this approach is more effective for language modeling because it removes the softmax layer’s linear dependency on vocabulary size, instead clustering words by how frequently they occur. The model thus allocates its computational budget according to the word frequency distribution, rather than treating every word in the vocabulary uniformly. In practice, this means the model delivers faster access to the most frequent classes and also devotes more resources to them.
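To make the clustering idea concrete, here is a toy sketch (not FAIR’s implementation; `partition_vocab` and its parameters are hypothetical) of splitting a vocabulary into a small head of frequent words and a larger tail of rare ones, based on observed counts:

```python
from collections import Counter

def partition_vocab(token_stream, head_size):
    """Toy illustration: split the vocabulary into a small 'head' of
    frequent words and a larger 'tail' of rare words, so a model can
    spend most of its budget on the head."""
    counts = Counter(token_stream)
    ranked = [w for w, _ in counts.most_common()]
    return ranked[:head_size], ranked[head_size:]

head, tail = partition_vocab(
    "the cat sat on the mat the cat".split(), head_size=2)
# 'the' and 'cat' occur most often, so they land in the head
```

A real adaptive softmax chooses the cluster sizes to balance the GPU’s computation time against each cluster’s total probability mass, but the frequency-based split is the core idea.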

Language modeling seeks to find the “probability distribution over a sequence of words from a given dictionary.” This joint distribution is a product of the conditional distribution of words given their past usage.

For example, the probability of a sequence of T words w[1],…, w[T] is given as:

**P**(w[1],…, w[T]) = **P**(w[T]|w[T-1],…, w[1])…**P**(w[1]).
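This chain-rule factorization can be sketched directly in code. The `cond_log_prob` callback below stands in for whatever model supplies the conditional probabilities; it and the uniform toy model are illustrative assumptions, not part of FAIR’s work:

```python
import math

def sequence_log_prob(words, cond_log_prob):
    """Chain rule: log P(w1..wT) = sum over t of log P(w_t | w_1..w_{t-1}).
    `cond_log_prob(word, history)` is a hypothetical model callback."""
    total = 0.0
    for t, w in enumerate(words):
        total += cond_log_prob(w, words[:t])
    return total

# Toy uniform model over a 4-word dictionary: every conditional is 1/4
uniform = lambda w, history: math.log(0.25)
lp = sequence_log_prob(["the", "cat", "sat"], uniform)
# exp(lp) recovers the joint probability, here 0.25 ** 3
```

Working in log space avoids numerical underflow, which is standard practice when the sequence is long.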

In addition to adopting a labor-saving approach to language modeling, adaptive softmax also increases efficiency by “leveraging the specificities of modern architectures and matrix-matrix vector operations both at train and test time”, making it a custom solution for GPUs. The model learns a k-way hierarchical softmax that is based on the architecture of GPUs, in order to use its computational budget judiciously.
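A minimal two-level version of such a hierarchy can be sketched as follows. This is an illustrative simplification, not torch-rnnlib code: the sizes, weight matrices, and the single tail cluster are all assumptions, whereas a learned k-way version would tune the cluster boundaries to the GPU’s actual matrix-multiplication costs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 frequent "head" words, 3 rare "tail" words.
d, n_head, n_tail = 8, 2, 3
W_head = rng.normal(size=(n_head + 1, d))  # +1 logit gates the tail cluster
W_tail = rng.normal(size=(n_tail, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_probs(h):
    """Two-level softmax: frequent words are scored directly in the
    cheap head; rare words pay an extra matrix product, weighted by
    the probability of their cluster."""
    head = softmax(W_head @ h)               # head words + cluster gate
    tail = head[-1] * softmax(W_tail @ h)    # P(cluster) * P(word | cluster)
    return np.concatenate([head[:-1], tail])

p = adaptive_probs(rng.normal(size=d))
# p is a distribution over the full 5-word vocabulary and sums to 1
```

Because most words in running text come from the head, most predictions need only the small `W_head` product, which is where the speedup comes from.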

The result: when adaptive softmax was applied to standard benchmarks with large vocabularies, such as EuroParl and One Billion Word, FAIR was able to process 12,500 words per second on a single GPU, with accuracy only slightly below that of the full softmax.

**Building recurrent models with torch-rnnlib**

FAIR has also open sourced its library (torch-rnnlib) to allow researchers to build and test recurrent model prototypes on GPUs, and to access fast baselines through torch.cudnn bindings. Recurrent networks, as interpreted by Facebook, model sequences of variables over discrete time steps. This framing is predicated on the Markov property, which holds that the network’s future state depends only on its present state.
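The recurrence underlying this view can be sketched without any library at all. The following toy Elman-style update (the weights and dimensions are arbitrary assumptions, not torch-rnnlib internals) shows how each new hidden state is computed from only the current input and the previous state:

```python
import math
import random

random.seed(0)

def rnn_step(x, h, Wx, Wh, b):
    """One Elman-style recurrent update: the new hidden state depends
    only on the current input x and the previous state h (the Markov
    property). tanh keeps every component in (-1, 1)."""
    n = len(h)
    return [math.tanh(sum(Wx[i][j] * x[j] for j in range(len(x)))
                      + sum(Wh[i][j] * h[j] for j in range(n))
                      + b[i])
            for i in range(n)]

# Toy 2-dim inputs, 3-dim hidden state, random weights
Wx = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
Wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
b = [0.0] * 3
h = [0.0] * 3
for x in ([1.0, 0.0], [0.0, 1.0]):   # unroll over a length-2 sequence
    h = rnn_step(x, h, Wx, Wh, b)
```

Stacking such updates over time, and replacing the plain tanh cell with an LSTM or GRU, is exactly the kind of prototyping a library like torch-rnnlib is meant to make fast on GPUs.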

Details on how to design new recurrent networks in the library are available here.