Add a cls token and apply a spatio-temporal positional encoding to a tensor.

:param embed_dim: Embedding dimension of the input sequence.
:type embed_dim: int
:param patch_embed_shape: The number of patches along each dimension (T, H, W) after patch embedding.
:type patch_embed_shape: Tuple
:param sep_pos_embed: If True, one positional encoding is used for the spatial patches and a separate one for the temporal sequence. Otherwise, a single positional encoding is used for all patches.
:type sep_pos_embed: bool
:param has_cls: If True, a cls token is added at the beginning of each input sequence.
:type has_cls: bool
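
The following is a minimal sketch of how the parameters above might determine the positional-encoding shapes and the output sequence length. The helper names `positional_encoding_shapes` and `output_sequence_length`, the dictionary keys, and the leading broadcast dimension of 1 are illustrative assumptions, not part of the documented API.

```python
from typing import Dict, Tuple


def positional_encoding_shapes(
    embed_dim: int,
    patch_embed_shape: Tuple[int, int, int],
    sep_pos_embed: bool = False,
    has_cls: bool = True,
) -> Dict[str, Tuple[int, int, int]]:
    """Return the assumed shape of each learnable positional encoding.

    Hypothetical helper: the key names and the leading dimension of 1
    (for broadcasting over the batch) are assumptions for illustration.
    """
    t, h, w = patch_embed_shape
    num_spatial = h * w  # patches per frame
    num_temporal = t     # temporal positions

    shapes: Dict[str, Tuple[int, int, int]] = {}
    if sep_pos_embed:
        # One encoding over the H*W spatial patches, another over the T steps.
        shapes["pos_embed_spatial"] = (1, num_spatial, embed_dim)
        shapes["pos_embed_temporal"] = (1, num_temporal, embed_dim)
        if has_cls:
            # Assumed: the cls token gets its own encoding in this mode.
            shapes["pos_embed_class"] = (1, 1, embed_dim)
    else:
        # A single encoding covering every patch, plus the cls token if present.
        num_tokens = num_spatial * num_temporal + (1 if has_cls else 0)
        shapes["pos_embed"] = (1, num_tokens, embed_dim)
    return shapes


def output_sequence_length(
    patch_embed_shape: Tuple[int, int, int], has_cls: bool = True
) -> int:
    """Number of tokens per sequence after the optional cls token is prepended."""
    t, h, w = patch_embed_shape
    return t * h * w + (1 if has_cls else 0)
```

For example, with `patch_embed_shape=(8, 56, 56)` and `has_cls=True`, each sequence has 8 * 56 * 56 + 1 = 25089 tokens. Under these assumptions, separate encodings would store 56 * 56 + 8 + 1 = 3145 position vectors instead of 25089.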