towhee.models.clip.clip

Functions

create_model

Create a CLIP model.
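
A hypothetical usage sketch: the model name string and the keyword arguments shown below are assumptions about create_model's signature and are not documented on this page.

    # Hypothetical usage sketch: the model name and keyword arguments are assumptions
    # about create_model's signature, not taken from this page.
    from towhee.models.clip.clip import create_model

    model = create_model(model_name="clip_vit_b32", pretrained=True)  # assumed arguments
    model.eval()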

Classes

AttentionPool2d

Attention module for the modified ResNet. Parameters:
spacial_dim (int): spatial dimension
embed_dim (int): embedding dimension
num_heads (int): number of heads
output_dim (int): output dimension
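
A simplified re-implementation of the attention-pooling idea, using nn.MultiheadAttention for brevity; the constructor arguments mirror the parameter list above, but this is only a sketch, not towhee's code.

    import torch
    from torch import nn

    # Sketch of QKV attention pooling: flatten the spatial grid, prepend the mean token,
    # let it attend over all positions, and project to the output dimension.
    class TinyAttentionPool2d(nn.Module):
        def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int):
            super().__init__()
            self.pos = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) * 0.02)
            self.attn = nn.MultiheadAttention(embed_dim, num_heads)
            self.proj = nn.Linear(embed_dim, output_dim)

        def forward(self, x):                                    # x: (N, C, H, W)
            x = x.flatten(2).permute(2, 0, 1)                    # -> (HW, N, C)
            x = torch.cat([x.mean(0, keepdim=True), x], dim=0)   # prepend mean token
            x = x + self.pos[:, None, :]                         # add positional embedding
            pooled, _ = self.attn(x[:1], x, x)                   # mean token attends over all positions
            return self.proj(pooled.squeeze(0))                  # (N, output_dim)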

Bottleneck

ResNet bottleneck block. Parameters:
inplanes (int): number of input planes
planes (int): number of planes
stride (int): stride
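
An illustrative construction, assuming Bottleneck can be built directly with the keyword arguments listed above; the 4x channel expansion noted in the comment is how the standard CLIP bottleneck behaves and is not stated on this page.

    import torch
    from towhee.models.clip.clip import Bottleneck

    # Illustrative only: keyword names follow the parameter list above.
    block = Bottleneck(inplanes=64, planes=64, stride=1)
    out = block(torch.randn(1, 64, 56, 56))   # expected (1, 256, 56, 56) with 4x expansion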

CLIP

CLIP model. Parameters:
embed_dim (int): embedding dimension
image_resolution (int): image resolution
vision_layers (Union[Tuple[int, int, int, int], int]): configuration of the vision transformer layers
vision_width (int): width of the vision transformer
vision_patch_size (int): patch size of the vision transformer
multilingual_model (str): configuration of the multilingual model
context_length (int): context length
vocab_size (int): vocabulary size
transformer_width (int): width of the transformer
transformer_heads (int): number of transformer heads
transformer_layers (int): number of transformer layers
clip4clip (bool): whether the model is CLIP4Clip
vis (bool): visualization flag
is_bridge_former (bool): whether the model is BridgeFormer
is_bridge_former_video (bool): whether to use the text transformer or the visual transformer for a single frame
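
An illustrative instantiation using ViT-B/32-style hyperparameters. The keyword names come from the parameter list above; the numeric values, and the assumption that the remaining flags (multilingual_model, clip4clip, vis, is_bridge_former, is_bridge_former_video) have usable defaults, are not confirmed by this page.

    from towhee.models.clip.clip import CLIP

    # Values mirror the common ViT-B/32 CLIP configuration and are illustrative only;
    # the remaining flags are assumed to have defaults.
    model = CLIP(
        embed_dim=512,
        image_resolution=224,
        vision_layers=12,
        vision_width=768,
        vision_patch_size=32,
        context_length=77,
        vocab_size=49408,
        transformer_width=512,
        transformer_heads=8,
        transformer_layers=12,
    )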

LayerNorm

Subclass torch's LayerNorm to handle fp16.
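
The usual trick behind this, also used in the reference CLIP code, is to run the normalization in float32 and cast the result back to the input dtype; a minimal sketch:

    import torch
    from torch import nn

    class LayerNormFP16Safe(nn.LayerNorm):
        """Sketch of the fp16-handling trick: normalize in float32, return the original dtype."""

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            orig_dtype = x.dtype
            out = super().forward(x.type(torch.float32))
            return out.type(orig_dtype)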

ModifiedResNet

A ResNet class similar to torchvision's, with modifications such as a 3-convolution "stem" (instead of a single convolution) followed by an average pool instead of a max pool.
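
A sketch of the modified stem described above, assuming an illustrative base width of 64: three 3x3 convolutions followed by an average pool in place of the usual single convolution and max pool.

    import torch
    from torch import nn

    # Sketch of the 3-convolution stem with an average pool instead of a max pool;
    # `width` is illustrative, and the real class wraps this in the full ResNet.
    width = 64
    stem = nn.Sequential(
        nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(width // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(width // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(2),
    )
    out = stem(torch.randn(1, 3, 224, 224))   # -> (1, 64, 56, 56)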

QuickGELU
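
QuickGELU has no summary on this page; in the reference CLIP implementation it is the sigmoid-based GELU approximation x * sigmoid(1.702 * x). A minimal sketch of that activation:

    import torch
    from torch import nn

    class QuickGELUSketch(nn.Module):
        """Sketch of the QuickGELU activation used by CLIP: x * sigmoid(1.702 * x)."""

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * torch.sigmoid(1.702 * x)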

ResidualAttentionBlock

Residual attention block. Parameters:
d_model (int): model dimension
n_head (int): number of heads
attn_mask (Union[torch.Tensor, Callable]): attention mask
vis (bool): visualization flag
patch_nums (int): number of patches
is_bridge_former_video (bool): whether to use the text transformer or the visual transformer for a single frame
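
A simplified sketch of the pre-norm residual pattern this block implements (an attention sublayer, then an MLP sublayer, each wrapped in a residual connection); attn_mask, vis, patch_nums, and the BridgeFormer options are omitted.

    import torch
    from torch import nn

    # Sketch only: the real block also threads an attention mask, visualization hooks,
    # and BridgeFormer-specific behaviour through the forward pass.
    class TinyResidualAttentionBlock(nn.Module):
        def __init__(self, d_model: int, n_head: int):
            super().__init__()
            self.ln_1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_head)
            self.ln_2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.GELU(),                      # the reference block uses QuickGELU here
                nn.Linear(d_model * 4, d_model),
            )

        def forward(self, x):                   # x: (seq_len, batch, d_model)
            y = self.ln_1(x)
            x = x + self.attn(y, y, y, need_weights=False)[0]
            x = x + self.mlp(self.ln_2(x))
            return x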

Transformer

Transformer for CLIP. Parameters:
width (int): width
layers (int): number of layers
heads (int): number of heads
attn_mask (Union[torch.Tensor, Callable]): attention mask
vis (bool): visualization flag
patch_nums (int): number of patches
is_bridge_former_video (bool): whether to use the text transformer or the visual transformer for a single frame
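
A minimal sketch of what the summary describes, reusing the TinyResidualAttentionBlock sketch from the entry above: the transformer is essentially a stack of `layers` residual attention blocks of the given width and head count.

    from torch import nn

    # Sketch only: stacks `layers` copies of the TinyResidualAttentionBlock sketch above;
    # attn_mask, vis, and the BridgeFormer options are omitted.
    class TinyTransformer(nn.Module):
        def __init__(self, width: int, layers: int, heads: int):
            super().__init__()
            self.resblocks = nn.Sequential(
                *[TinyResidualAttentionBlock(width, heads) for _ in range(layers)]
            )

        def forward(self, x):                   # x: (seq_len, batch, width)
            return self.resblocks(x)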

VisionTransformer

Vision Transformer (ViT) for CLIP. Parameters:
input_resolution (int): input resolution
patch_size (int): patch size
width (int): width
layers (int): number of layers
heads (int): number of heads
output_dim (int): output dimension
vis (bool): visualization flag
is_bridgeformer (bool): whether the model is BridgeFormer
is_bridge_former_video (bool): whether to use the text transformer or the visual transformer for a single frame
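
A simplified sketch of the CLIP-style ViT forward pass that the parameters above describe; nn.TransformerEncoder stands in for this module's own Transformer, and the BridgeFormer and visualization options are omitted.

    import torch
    from torch import nn

    # Sketch only: patch embedding via a strided conv, a class token, positional
    # embeddings, a pre-norm transformer stack, and a final projection.
    class TinyVisionTransformer(nn.Module):
        def __init__(self, input_resolution: int, patch_size: int, width: int,
                     layers: int, heads: int, output_dim: int):
            super().__init__()
            num_patches = (input_resolution // patch_size) ** 2
            self.conv1 = nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)
            self.class_embedding = nn.Parameter(torch.zeros(width))
            self.positional_embedding = nn.Parameter(torch.zeros(num_patches + 1, width))
            self.ln_pre = nn.LayerNorm(width)
            layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads,
                                               dim_feedforward=width * 4,
                                               activation="gelu", norm_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
            self.ln_post = nn.LayerNorm(width)
            self.proj = nn.Parameter(torch.randn(width, output_dim) * width ** -0.5)

        def forward(self, x):                                    # x: (N, 3, H, W)
            x = self.conv1(x).flatten(2).permute(0, 2, 1)        # (N, num_patches, width)
            cls = self.class_embedding.expand(x.shape[0], 1, -1)
            x = torch.cat([cls, x], dim=1) + self.positional_embedding
            x = self.ln_pre(x).permute(1, 0, 2)                  # (seq, N, width)
            x = self.transformer(x).permute(1, 0, 2)
            return self.ln_post(x[:, 0, :]) @ self.proj          # (N, output_dim)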