A multiscale vision transformer block. Each block contains a multiscale
attention layer and a Mlp layer.

::

                              Input
                                |-------------------+
                                ↓                   |
                              Norm                  |
                                ↓                   |
                      MultiScaleAttention         Pool
                                ↓                   |
                             DropPath               |
                                ↓                   |
                            Summation ←-------------+
                                |
                                |-------------------+
                                ↓                   |
                              Norm                  |
                                ↓                   |
                               Mlp                Proj
                                ↓                   |
                             DropPath               |
                                ↓                   |
                            Summation ←------------+

:param dim: Input feature dimension.
:type dim: int
:param dim_out: Output feature dimension.
:type dim_out: int
:param num_heads: Number of heads in the attention layer.
:type num_heads: int
:param mlp_ratio: MLP ratio, which controls the feature dimension of the
    hidden layer of the MLP block.
:type mlp_ratio: float
:param qkv_bias: If set to False, the qkv layer will not learn an additive
    bias.
:type qkv_bias: bool
:param dropout_rate: Dropout rate. If set to 0, dropout is disabled.
:type dropout_rate: float
:param droppath_rate: DropPath rate. If set to 0, DropPath is disabled.
:type droppath_rate: float
:param activation: Activation layer used in the MLP layer.
:type activation: nn.Module
:param norm_layer: Normalization layer.
:type norm_layer: nn.Module
:param kernel_q: Pooling kernel size for q. If the pooling kernel size is 1
    for all the dimensions, pooling is not used.
:type kernel_q: _size_3_t
:param kernel_kv: Pooling kernel size for kv. If the pooling kernel size is 1
    for all the dimensions, pooling is not used.
:type kernel_kv: _size_3_t
:param stride_q: Pooling kernel stride for q.
:type stride_q: _size_3_t
:param stride_kv: Pooling kernel stride for kv.
:type stride_kv: _size_3_t
:param pool_mode: Pooling mode.
:type pool_mode: nn.Module
:param has_cls_embed: If set to True, the first token of the input tensor is
    treated as a cls token; pooling is not applied to the cls token.
    Otherwise, the input tensor does not contain a cls token.
:type has_cls_embed: bool
:param pool_first: If set to True, pooling is applied before the qkv
    projection. Otherwise, pooling is applied after the qkv projection.
:type pool_first: bool
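The residual structure in the diagram above can be sketched as follows. This is a minimal, simplified illustration, not the actual implementation: it uses a 1-D token sequence with ``nn.MaxPool1d`` standing in for the 3-D spatiotemporal pooling, standard ``nn.MultiheadAttention`` in place of ``MultiScaleAttention``, and it omits DropPath and dropout. The class name ``SimpleMultiScaleBlock`` is hypothetical. The key points it shows are the two pieces the diagram emphasizes: the skip connection around attention is *pooled* so its shape matches the pooled attention output, and the skip connection around the MLP is *projected* from ``dim`` to ``dim_out`` before the summation.

```python
# Minimal sketch of the block's residual structure (an assumption-laden
# simplification, not the real MultiScaleBlock implementation).
import torch
import torch.nn as nn


class SimpleMultiScaleBlock(nn.Module):
    def __init__(self, dim, dim_out, num_heads=2, mlp_ratio=4.0, stride_q=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for MultiScaleAttention: plain multi-head attention whose
        # query sequence has been pooled (shortened) before attending.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Pool" branch: shortens the token sequence on both the attention
        # path (via the pooled query) and the residual path, so the first
        # Summation adds tensors of matching shape.
        self.pool = nn.MaxPool1d(kernel_size=stride_q, stride=stride_q)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim_out)
        )
        # "Proj" branch: matches the residual's channel dimension to dim_out
        # before the second Summation.
        self.proj = nn.Linear(dim, dim_out) if dim != dim_out else nn.Identity()

    def _pool_tokens(self, x):
        # (B, N, C) -> (B, C, N) for MaxPool1d, then back to (B, N', C).
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x):
        # First residual: Norm -> attention with pooled query, plus pooled skip.
        y = self.norm1(x)
        q = self._pool_tokens(y)
        attn_out, _ = self.attn(q, y, y)
        x = self._pool_tokens(x) + attn_out
        # Second residual: Norm -> Mlp, plus projected skip.
        return self.proj(x) + self.mlp(self.norm2(x))
```

With an input of shape ``(batch, tokens, dim)``, the output has ``tokens // stride_q`` tokens and ``dim_out`` channels, e.g. ``SimpleMultiScaleBlock(16, 32)`` maps ``(2, 8, 16)`` to ``(2, 4, 32)``.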