InternVL <box>

Spatial Layout Projector (SLP)

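The "Spatial Layout Projector (SLP)" converts the four-dimensional spatial coordinates [x1,y1,x2,y2] of a bounding box into a single token embedding. Note that the passage quoted below attributes the SLP to LayTextLLM, not to InternVL: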

"A key innovation in LayTextLLM is the Spatial Layout Projector (SLP), which transforms a spatial layout into a singular bounding box token. This enhancement enables the model to process both spatial layouts and textual inputs simultaneously. To be specifically, each OCR-derived spatial layout is represented by a bounding box defined by four-dimensional coordinates [x1,y1,x2,y2]..."

This approach does represent each bounding box as a single token, unlike the earlier "coordinate-as-tokens" scheme, which turns the coordinates into multiple tokens:

"Compared to the coordinate-as-tokens scheme, the SLP represents each bounding box with a single token. This approach significantly reduces the number of input tokens..."

This single-token representation is computed by projecting the coordinates into a high-dimensional embedding space:

"The process can be computed as z=W⋅c+b, where c∈ℝ^4 is the vector of the bounding box coordinates. W∈ℝ^(d×4) is a weight matrix with d represents the dimension of the embedding, b∈ℝ^(d×1) is a bias vector, z is the resulting bounding box token represented as an d-dimensional embedding."

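The formula in the quoted passage can be illustrated with a small sketch. It is a pure-Python toy, not LayTextLLM's actual code: the names `make_slp` and `project_box`, the random initialization, and the embedding size of 8 are assumptions made for illustration. In a real model, W and b would be learned parameters.

```python
import random


def make_slp(embed_dim, seed=0):
    """Build toy parameters for z = W @ c + b: W is d x 4, b has length d."""
    rng = random.Random(seed)
    # Assumption: small random weights stand in for learned parameters.
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(embed_dim)]
    bias = [0.0] * embed_dim
    return weight, bias


def project_box(box, weight, bias):
    """Map one box [x1, y1, x2, y2] to a single d-dimensional embedding."""
    if len(box) != 4:
        raise ValueError("expected [x1, y1, x2, y2]")
    # One dot product per output dimension, plus the bias term.
    return [sum(w * c for w, c in zip(row, box)) + b for row, b in zip(weight, bias)]


weight, bias = make_slp(embed_dim=8)
token = project_box([0.1, 0.2, 0.5, 0.6], weight, bias)
print(len(token))  # length equals embed_dim: one vector, i.e. one token, per box
```

However many digits the coordinates contain, each box yields exactly one d-dimensional vector, which is where the token savings over the coordinate-as-tokens scheme come from.
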
Box-related token definitions in InternVL

The box-related tokens in InternVL are defined mainly in the following code snippets.

internvl_chat/internvl/train/constants.py defines the box-related special tokens:

python
BOX_START_TOKEN = '<box>'
BOX_END_TOKEN = '</box>'

These tokens mark the start and end of a bounding box within the text.

Both internvl_chat/internvl/train/internvl_chat_mpo.py and internvl_chat/internvl/train/internvl_chat_pretrain.py import these tokens:

python
from internvl.train.constants import (BOX_END_TOKEN, BOX_START_TOKEN,
                                     IMG_CONTEXT_TOKEN, IMG_END_TOKEN,
                                     IMG_START_TOKEN, QUAD_END_TOKEN,
                                     QUAD_START_TOKEN, REF_END_TOKEN,
                                     REF_START_TOKEN)

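These results show that InternVL wraps bounding-box information in <box> and </box> tokens, but they do not state the encoding format used inside the box.

In addition, several functions that handle bounding boxes offer clues:

The transform_bbox function (in internvl_chat/eval/domain_specific/rs_det/caculate.py) maps box coordinates from a 0-1000 range back to pixel coordinates and clamps each value to the image bounds, which implies that the coordinates are normalized to 0-1000:

python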
def transform_bbox(bbox, image_size):
    x1, y1, x2, y2 = bbox
    W, H = image_size
    x1 = min(max(x1 / 1000 * W, 0), W)
    x2 = min(max(x2 / 1000 * W, 0), W)
    y1 = min(max(y1 / 1000 * H, 0), H)
    y2 = min(max(y2 / 1000 * H, 0), H)
    return [x1, y1, x2, y2]

drivelm/evaluate.py scales coordinates from the 0-1000 range back to pixels in a similar way:

python
cx = (cx / 1000) * whole_img_width
cy = (cy / 1000) * whole_img_height

This suggests that InternVL normalizes bounding-box coordinates to the 0-1000 range and converts them to the actual image dimensions when needed.
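
Assuming the 0-1000 convention inferred above, the mapping in both directions might look like the following sketch. `normalize_bbox` and `denormalize_bbox` are hypothetical helper names, not functions from the InternVL repository, and rounding to integers when normalizing is an assumption.

```python
def _clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return min(max(value, low), high)


def normalize_bbox(bbox, image_size):
    """Pixel box -> integer coordinates on a 0-1000 grid (rounding is an assumption)."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    return [
        _clamp(round(x1 / width * 1000), 0, 1000),
        _clamp(round(y1 / height * 1000), 0, 1000),
        _clamp(round(x2 / width * 1000), 0, 1000),
        _clamp(round(y2 / height * 1000), 0, 1000),
    ]


def denormalize_bbox(bbox, image_size):
    """0-1000 box -> pixel coordinates, clamped to the image bounds."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    return [
        _clamp(x1 / 1000 * width, 0, width),
        _clamp(y1 / 1000 * height, 0, height),
        _clamp(x2 / 1000 * width, 0, width),
        _clamp(y2 / 1000 * height, 0, height),
    ]


box_1000 = normalize_bbox([64, 48, 320, 240], (640, 480))
box_px = denormalize_bbox(box_1000, (640, 480))
print(box_1000, box_px)
```

Because the normalized values are independent of resolution, the same box description stays valid when the image is resized; only the final multiplication by width and height changes.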

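BOX_START_TOKEN (<box>) and BOX_END_TOKEN (</box>) mainly mark the start and end of bounding-box information in the InternVL project. The code, in particular the find_bounding_boxes function in streamlit_demo/app.py, shows how they are used:

Bounding-box format and encoding:
- In model output, a bounding box usually appears as <ref>object name</ref><box>[[x1, y1, x2, y2]]</box>
- <ref> and </ref> mark the object name; <box> and </box> mark the box coordinates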
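
As a minimal sketch of this format, a string in the shape described above can be produced as follows. `format_grounding` is a hypothetical helper, not part of InternVL, and the token values are copied here as local constants rather than imported from the repository.

```python
# Local copies of the token strings, for illustration only.
REF_START_TOKEN = '<ref>'
REF_END_TOKEN = '</ref>'
BOX_START_TOKEN = '<box>'
BOX_END_TOKEN = '</box>'


def format_grounding(label, box):
    """Render one object as <ref>label</ref><box>[[x1, y1, x2, y2]]</box>."""
    x1, y1, x2, y2 = (int(value) for value in box)
    coords = f"[[{x1}, {y1}, {x2}, {y2}]]"
    return f"{REF_START_TOKEN}{label}{REF_END_TOKEN}{BOX_START_TOKEN}{coords}{BOX_END_TOKEN}"


print(format_grounding("cat", [100, 100, 500, 500]))
# <ref>cat</ref><box>[[100, 100, 500, 500]]</box>
```

The coordinates are ordinary text between the two markers, so the model needs no dedicated coordinate embedding to produce them.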

Bounding-box coordinate normalization:

Bounding-box coordinates are normalized to the 0-1000 range, as the code in streamlit_demo/app.py shows:

python
coordinates = [(float(x[0]) / 1000, float(x[1]) / 1000, float(x[2]) / 1000, float(x[3]) / 1000) for x in coordinates]
coordinates = [(int(x[0] * width), int(x[1] * height), int(x[2] * width), int(x[3] * height)) for x in coordinates]

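This shows that, for display on the actual image, the coordinates are first converted from the 0-1000 range to the 0-1 range and then multiplied by the image's actual width and height.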

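End-to-end flow:
- The training scripts register these tokens with the tokenizer as special tokens, and model_worker.py lists them in tokens_to_keep = ['<box>', '</box>', '<ref>', '</ref>'] so that they are kept at inference time
- When the model generates a response that contains a bounding box, it wraps the box information in these tokens
- In the application (for example, in find_bounding_boxes), a regular expression extracts the tokens and the content between them from the response
- The extracted coordinates are converted to actual image coordinates and used to draw boxes and labels on the image

This design lets the model emit visual grounding information in a text format that is easy to process, while the special tokens make the start and end of each box explicit for downstream processing and visualization.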
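
The extraction and conversion steps above can be sketched end to end as follows. This is not the repository's find_bounding_boxes implementation: `extract_boxes` is a hypothetical name, and parsing the coordinate list with `ast.literal_eval` is an assumption. Only the regular expression and the divide-by-1000 scaling follow the snippets quoted in this article.

```python
import ast
import re

# Same pattern as the one quoted from streamlit_demo/app.py.
PATTERN = re.compile(r'<ref>\s*(.*?)\s*</ref>\s*<box>\s*(\[\[.*?\]\])\s*</box>')


def extract_boxes(response, width, height):
    """Return (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    results = []
    for label, raw_coords in PATTERN.findall(response):
        # raw_coords looks like "[[x1, y1, x2, y2]]" and may hold several boxes.
        for x1, y1, x2, y2 in ast.literal_eval(raw_coords):
            results.append((
                label,
                (
                    int(float(x1) / 1000 * width),
                    int(float(y1) / 1000 * height),
                    int(float(x2) / 1000 * width),
                    int(float(y2) / 1000 * height),
                ),
            ))
    return results


reply = "There is <ref>a dog</ref><box>[[250, 500, 750, 1000]]</box> in the image."
print(extract_boxes(reply, width=800, height=600))
# [('a dog', (200, 300, 600, 600))]
```

Text without a matching `<ref>...</ref><box>...</box>` pair simply yields an empty list, so the same function can be run on every response.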

How InternVL handles the `<box>` token

Special token definitions:
- internvl_chat/internvl/train/constants.py defines BOX_START_TOKEN and BOX_END_TOKEN as <box> and </box>
- These tokens mark the start and end of bounding-box information in the text

Adding the tokens to the model vocabulary:

The training scripts internvl_chat/internvl/train/internvl_chat_finetune.py, internvl_chat_pretrain.py, and internvl_chat_mpo.py add these tokens to the tokenizer's vocabulary:

python
token_list = [IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN,
             QUAD_START_TOKEN, QUAD_END_TOKEN, REF_START_TOKEN,
             REF_END_TOKEN, BOX_START_TOKEN, BOX_END_TOKEN]
num_new_tokens = tokenizer.add_tokens(token_list, special_tokens=True)

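Keeping the tokens at inference time:

streamlit_demo/model_worker.py filters these four tokens out of the tokenizer's additional_special_tokens list, apparently so that decoding that skips special tokens still leaves them in the output text:

python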
tokens_to_keep = ['<box>', '</box>', '<ref>', '</ref>']
tokenizer.additional_special_tokens = [item for item in tokenizer.additional_special_tokens if item not in tokens_to_keep]
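
The likely effect of this filtering can be shown with a toy model. The `decode` function below is a simplified stand-in written for this illustration, not the Hugging Face implementation, and the `<eos>` token and sample token list are assumptions.

```python
def decode(tokens, special_tokens, skip_special_tokens=True):
    """Toy decoder: join tokens, optionally dropping those flagged as special."""
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in special_tokens]
    return "".join(tokens)


generated = ["<ref>", "cat", "</ref>", "<box>", "[[1, 2, 3, 4]]", "</box>", "<eos>"]
all_special = ["<box>", "</box>", "<ref>", "</ref>", "<eos>"]

# Mirror the filtering step: remove the grounding tokens from the special list.
tokens_to_keep = ["<box>", "</box>", "<ref>", "</ref>"]
filtered_special = [item for item in all_special if item not in tokens_to_keep]

print(decode(generated, all_special))       # cat[[1, 2, 3, 4]]
print(decode(generated, filtered_special))  # <ref>cat</ref><box>[[1, 2, 3, 4]]</box>
```

In this toy setup, the markers survive decoding only once they are no longer flagged as special, which is what downstream parsing of the response relies on.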

Handling bounding boxes in the output:
- The find_bounding_boxes function in streamlit_demo/app.py shows how bounding-box information is extracted from model output:
```python
pattern = re.compile(r'<ref>\s*(.*?)\s*</ref>\s*<box>\s*(\[\[.*?\]\])\s*</box>')
```
- This shows that the box format is <ref>object name</ref><box>[[x1, y1, x2, y2]]</box>

Coordinate normalization:

Bounding-box coordinates are expressed in the 0-1000 range and converted to actual pixel coordinates for display:

python
coordinates = [(float(x[0]) / 1000, float(x[1]) / 1000, float(x[2]) / 1000, float(x[3]) / 1000) for x in coordinates]
coordinates = [(int(x[0] * width), int(x[1] * height), int(x[2] * width), int(x[3] * height)) for x in coordinates]

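Training data and returned results

The code analysis above supports the following conclusions:

The training data likely contains these tokens:
- The training scripts explicitly add the tokens to the tokenizer, which suggests that the training data uses them
- These special tokens let the model learn how to mark up and output bounding-box information

InternVL can return results that contain these tokens:
- model_worker.py takes care to keep these tokens when it configures the tokenizer
- The find_bounding_boxes function in app.py exists specifically to process model responses that carry box information wrapped in these tokens
- When the model needs to point out the location of a specific object in an image, it uses the format <ref>object name</ref><box>[[x1, y1, x2, y2]]</box>

This design lets InternVL refer to and locate objects in an image naturally within a conversation, giving the model visual grounding and referring capabilities. It is particularly useful in applications that need precise object localization.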
