The difference between network virtualization and virtual networks
Q: What is the difference between network virtualization and a virtual network?
GPT A: Network virtualization and virtual networks are related but clearly distinct concepts:
1. Network Virtualization
Definition: Network virtualization is a technique that abstracts, aggregates, or partitions physical network resources (switches, routers, links, bandwidth, etc.) into one or more logical networks (virtual networks).
Goal: Let network resources be allocated, isolated, and managed as flexibly as compute resources, improving network flexibility, scalability, and resource utilization.
Examples: SDN (Software Defined Networking), VXLAN, NVGRE, OpenFlow, and so on.
Characteristics:
Multiple mutually isolated virtual networks can be created on the same physical network.
Advanced capabilities such as dynamic adjustment, migration, and automated deployment are supported.
Management and control are usually centralized in a controller.
2. Virtual Network
Definition: A virtual network is a logical network entity built on top of a physical network, with its own topology, address space, and routing rules.
Goal: Provide users or applications with a logically independent, isolated network environment.
Examples: VLAN (Virtual LAN), VPN (Virtual Private Network), a cloud VPC (Virtual Private Cloud), and so on.
Characteristics:
Focuses on the "network consumer" perspective, giving tenants or applications a dedicated network space.
A virtual network can be realized through network virtualization techniques.
The implementation can be simple (VLAN) or complex (overlay networks).
Summary of the differences:
Network virtualization is the means and the underlying technology, focused on abstracting, isolating, and managing resources.
A virtual network is the product or manifestation of network virtualization, focused on giving users or applications a virtual network environment.
Network virtualization is the "how"; a virtual network is the "result".
veth
veth communication experiment
$ sudo ip link add veth0 type veth peer name veth1
$ sudo ip addr add 192.168.10.1/24 dev veth0
$ sudo ip addr add 192.168.10.2/24 dev veth1
$ sudo ip link set veth0 up
$ sudo ip link set veth1 up
$ ip addr
#...
4: veth1@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.2/24 scope global veth1
valid_lft forever preferred_lft forever
inet6 fe80::2068:39ff:fe16:5ba3/64 scope link
valid_lft forever preferred_lft forever
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9e:72:41:a4:b6:fc brd ff:ff:ff:ff:ff:ff
inet 192.168.10.1/24 scope global veth0
valid_lft forever preferred_lft forever
inet6 fe80::9c72:41ff:fea4:b6fc/64 scope link
valid_lft forever preferred_lft forever
Then apply the necessary settings (relax reverse-path filtering and accept locally-originated packets, otherwise the kernel drops this self-addressed traffic):
# echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
# echo 0 > /proc/sys/net/ipv4/conf/veth0/rp_filter
# echo 0 > /proc/sys/net/ipv4/conf/veth1/rp_filter
# echo 1 > /proc/sys/net/ipv4/conf/veth1/accept_local
# echo 1 > /proc/sys/net/ipv4/conf/veth0/accept_local
Try pinging:
# ping 192.168.10.2 -I veth0
PING 192.168.10.2 (192.168.10.2) from 192.168.10.1 veth0: 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=0.041 ms
64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=0.049 ms
^C
--- 192.168.10.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3073ms
rtt min/avg/max/mdev = 0.033/0.047/0.067/0.012 ms
The veth devices we created look much like any other NIC; what makes them special is that they always come as a pair.
Source code implementation
Kernel module initialization
veth exists as a kernel module:
# modinfo veth
filename: /lib/modules/6.11.0-28-generic/kernel/drivers/net/veth.ko.zst
alias: rtnl-link-veth
license: GPL v2
description: Virtual Ethernet Tunnel
The source lives in drivers/net/veth.c:
module_init(veth_init);
static __init int veth_init(void)
{
return rtnl_link_register(&veth_link_ops);
}
static struct rtnl_link_ops veth_link_ops = {
.kind = DRV_NAME,
.priv_size = sizeof(struct veth_priv),
.setup = veth_setup,
.validate = veth_validate,
.newlink = veth_newlink,
.dellink = veth_dellink,
.policy = veth_policy,
.maxtype = VETH_INFO_MAX,
.get_link_net = veth_get_link_net,
};
It calls rtnl_link_register to register veth_link_ops on link_ops. When a new interface is requested (for example, userspace sends RTM_NEWLINK), rtnl_newlink calls the functions in veth_link_ops to create and configure the device. Other virtual NICs, such as openvswitch internal ports and tun/tap devices, also register themselves through rtnl_link_register, so any virtual NIC that can be added on the fly should be able to respond to RTM_NEWLINK. A minimal sketch of this registration pattern follows.
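As a hedged illustration of the same pattern (the "mydev" link type below is made up, and no transmit handler is provided, so the resulting device would not be functional), a module can register an rtnl_link_ops so that an RTM_NEWLINK with type mydev reaches its setup callback:
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <net/rtnetlink.h>

/* no handlers: registration works, but the device cannot move traffic */
static const struct net_device_ops mydev_netdev_ops = { };

static void mydev_setup(struct net_device *dev)
{
	ether_setup(dev);                     /* Ethernet-style defaults */
	dev->netdev_ops = &mydev_netdev_ops;
}

static struct rtnl_link_ops mydev_link_ops __read_mostly = {
	.kind  = "mydev",     /* matched against "ip link add ... type mydev" */
	.setup = mydev_setup, /* invoked from rtnl_create_link()->alloc_netdev_mqs() */
};

static int __init mydev_init(void)
{
	return rtnl_link_register(&mydev_link_ops);
}

static void __exit mydev_exit(void)
{
	rtnl_link_unregister(&mydev_link_ops);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");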
Device setup
Inside rtnl_newlink, rtnl_create_link is (indirectly) called first, which in turn calls alloc_netdev_mqs:
static int __rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
struct nlattr **attr, struct netlink_ext_ack *extack)
{
//...
dev = rtnl_create_link(link_net ? : dest_net, ifname,
name_assign_type, ops, tb, extack);
if (IS_ERR(dev)) {
err = PTR_ERR(dev);
goto out;
}
dev->ifindex = ifm->ifi_index;
if (ops->newlink) {
err = ops->newlink(link_net ? : net, dev, tb, data, extack);
//...
}
struct net_device *rtnl_create_link(struct net *net, const char *ifname,
unsigned char name_assign_type,
const struct rtnl_link_ops *ops,
struct nlattr *tb[],
struct netlink_ext_ack *extack)
{
//...
dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type,
ops->setup, num_tx_queues, num_rx_queues);
if (!dev)
return ERR_PTR(-ENOMEM);
dev_net_set(dev, net);
dev->rtnl_link_ops = ops;
dev->rtnl_link_state = RTNL_LINK_INITIALIZING;
//...
}
/**
* alloc_netdev_mqs - allocate network device
* @sizeof_priv: size of private data to allocate space for
* @name: device name format string
* @name_assign_type: origin of device name
* @setup: callback to initialize device
* @txqs: the number of TX subqueues to allocate
* @rxqs: the number of RX subqueues to allocate
*
* Allocates a struct net_device with private data area for driver use
* and performs basic initialization. Also allocates subqueue structs
* for each queue on the device.
*/
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned char name_assign_type,
void (*setup)(struct net_device *),
unsigned int txqs, unsigned int rxqs)
{
//...
setup(dev);
//...
}
This is where ops->setup is invoked. veth's setup implementation:
static void veth_setup(struct net_device *dev)
{
ether_setup(dev);
//...
dev->netdev_ops = &veth_netdev_ops;
dev->ethtool_ops = &veth_ethtool_ops;
//...
}
Device creation
After rtnl_create_link returns, the newlink callback is invoked:
static int veth_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
{
int err;
struct net_device *peer;
struct veth_priv *priv;
char ifname[IFNAMSIZ];
struct nlattr *peer_tb[IFLA_MAX + 1], **tbp;
unsigned char name_assign_type;
struct ifinfomsg *ifmp;
struct net *net;
//...
net = rtnl_link_get_net(src_net, tbp);
if (IS_ERR(net))
return PTR_ERR(net);
//create the peer first
peer = rtnl_create_link(net, ifname, name_assign_type,
&veth_link_ops, tbp, extack);
//...
//register the peer
err = register_netdevice(peer);
//...
//register this device
err = register_netdevice(dev);
//...
/*
 * tie the devices together
*/
//tie the two devices together
priv = netdev_priv(dev);
rcu_assign_pointer(priv->peer, peer);
priv = netdev_priv(peer);
rcu_assign_pointer(priv->peer, dev);
//...
}
struct veth_priv {
struct net_device __rcu *peer;
atomic64_t dropped;
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
};
Data transmission and reception
The transmit function is veth_xmit:
static const struct net_device_ops veth_netdev_ops = {
.ndo_init = veth_dev_init,
.ndo_open = veth_open,
.ndo_stop = veth_close,
.ndo_start_xmit = veth_xmit,
.ndo_get_stats64 = veth_get_stats64,
.ndo_set_rx_mode = veth_set_multicast_list,
.ndo_set_mac_address = eth_mac_addr,
.ndo_get_iflink = veth_get_iflink,
.ndo_fix_features = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom = veth_set_rx_headroom,
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
};
static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
int rxq;
rcu_read_lock();
//fetch the peer
rcv = rcu_dereference(priv->peer);
if (unlikely(!rcv)) {
kfree_skb(skb);
goto drop;
}
rcv_priv = netdev_priv(rcv);
rxq = skb_get_queue_mapping(skb);
if (rxq < rcv->real_num_rx_queues) {
rq = &rcv_priv->rq[rxq];
rcv_xdp = rcu_access_pointer(rq->xdp_prog);
}
skb_tx_timestamp(skb);
//send
if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
if (!rcv_xdp)
dev_lstats_add(dev, length);
} else {
drop:
atomic64_inc(&priv->dropped);
}
if (rcv_xdp)
__veth_xdp_flush(rq);
rcu_read_unlock();
return NETDEV_TX_OK;
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
struct veth_rq *rq, bool xdp)
{
//hand to XDP or deliver up the protocol stack
return __dev_forward_skb(dev, skb) ?: xdp ?
veth_xdp_rx(rq, skb) :
netif_rx(skb);
}
int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
int ret = ____dev_forward_skb(dev, skb);
if (likely(!ret)) {
skb->protocol = eth_type_trans(skb, dev);
skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
}
return ret;
}
/**
* eth_type_trans - determine the packet's protocol ID.
* @skb: received socket data
* @dev: receiving network device
*
* The rule here is that we
* assume 802.3 if the type field is short enough to be a length.
* This is normal practice and works for any 'now in use' protocol.
*/
__be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
//...
//replace skb->dev with the peer's dev
skb->dev = dev;
//...
}
So for veth, transmission amounts to fetching the peer's net_device and calling netif_rx to place the skb on softnet_data's input_pkt_queue; from there on it is the ordinary receive path.
A borrowed diagram of the veth implementation:
network namespace and vrf
Network namespaces logically give different namespaces independent protocol stacks, including devices, routing tables, ARP tables, iptables, and sockets, achieving isolation.
Namespace communication experiment
First create a network namespace:
$ sudo ip netns add net1
$ lsns -t net
NS TYPE NPROCS PID USER NETNSID NSFS COMMAND
4026531840 net 80 1322 bentutu unassigned /usr/bin/pipewire
4026532383 net 0 root /run/netns/net1
After creation it is ready to use, and an extra namespace shows up.
Enter it with ip netns exec or nsenter:
$ sudo nsenter --net=/run/netns/net1 bash
# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
At this point there is only a loopback interface. Back in the default namespace, create a veth pair and move one end into the namespace:
$ sudo ip link add veth1 type veth peer name veth1_p
$ sudo ip link set veth1 netns net1
$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 54:ee:75:4c:62:fa brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 34:e6:ad:38:6e:27 brd ff:ff:ff:ff:ff:ff
6: veth1_p@if7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether a6:00:93:fb:53:66 brd ff:ff:ff:ff:ff:ff link-netns net1
$ sudo ip netns exec net1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
7: veth1@if6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
veth1 has now been moved into the net1 namespace.
$ sudo ip addr add 192.168.10.1/24 dev veth1_p
$ sudo ip netns exec net1 ip addr add 192.168.10.2/24 dev veth1
$ sudo ip link set dev veth1_p up
$ sudo ip netns exec net1 ip link set dev veth1 up
$ sudo ip netns exec net1 ping 192.168.10.1 -I veth1
PING 192.168.10.1 (192.168.10.1) from 192.168.10.2 veth1: 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.066 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.047 ms
^C
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2034ms
rtt min/avg/max/mdev = 0.040/0.051/0.066/0.011 ms
The ping succeeds, and the interfaces, iptables rules, routes, and so on in the two namespaces are independent:
$ ip addr
#...
6: veth1_p@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a6:00:93:fb:53:66 brd ff:ff:ff:ff:ff:ff link-netns net1
inet 192.168.10.1/24 scope global veth1_p
valid_lft forever preferred_lft forever
inet6 fe80::a400:93ff:fefb:5366/64 scope link
valid_lft forever preferred_lft forever
$ ip route
192.168.1.0/24 dev enp3s0 proto kernel scope link src 192.168.1.2 metric 100
192.168.10.0/24 dev veth1_p proto kernel scope link src 192.168.10.1
$ sudo ip netns exec net1 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
7: veth1@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.10.2/24 scope global veth1
valid_lft forever preferred_lft forever
inet6 fe80::2068:39ff:fe16:5ba3/64 scope link
valid_lft forever preferred_lft forever
$ sudo ip netns exec net1 ip route
192.168.10.0/24 dev veth1 proto kernel scope link src 192.168.10.2
Source code implementation
Data structures
The net_ns above is the network namespace. Every net_device and socket belongs to a namespace, reachable through a member variable, and each namespace has its own routing tables, iptables state, and kernel parameters. loopback_dev is a loopback device owned by every struct net, which is why the freshly created namespace in the experiment above already contains a lo. A small sketch of the accessors follows.
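As a hedged sketch of how kernel code reads that membership back, the namespace pointer is reached through small accessors (the helper name same_netns below is made up for illustration):
#include <linux/netdevice.h>
#include <net/net_namespace.h>
#include <net/sock.h>

/* true if a device and a socket live in the same network namespace */
static inline bool same_netns(const struct net_device *dev,
			      const struct sock *sk)
{
	return net_eq(dev_net(dev), sock_net(sk));
}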
Creating the default namespace
The kernel boot flow was walked through earlier in 树莓派编译uboot及内核&&整体启动流程: from the kernel entry point, execution reaches start_kernel via primary_entry->__primary_switch->__primary_switched->start_kernel.
__primary_switched sets up the descriptor of process 0, init_task, whose definition contains:
/*
* Set up the first task table, touch at your own risk!. Base=0,
* limit=0x1fffff (=2MB)
*/
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
__init_task_data
#endif
__aligned(L1_CACHE_BYTES)
= {
//...
.nsproxy = &init_nsproxy,
//...
}
nsproxy holds the various namespaces:
struct nsproxy init_nsproxy = {
.count = ATOMIC_INIT(1),
.uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
.ipc_ns = &init_ipc_ns,
#endif
.mnt_ns = NULL,
.pid_ns_for_children = &init_pid_ns,
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
#ifdef CONFIG_CGROUPS
.cgroup_ns = &init_cgroup_ns,
#endif
#ifdef CONFIG_TIME_NS
.time_ns = &init_time_ns,
.time_ns_for_children = &init_time_ns,
#endif
};
Among them is net_ns, the network namespace, which is created during subsystem initialization:
struct net init_net = {
.count = REFCOUNT_INIT(1),
.dev_base_head = LIST_HEAD_INIT(init_net.dev_base_head),
#ifdef CONFIG_KEYS
.key_domain = &init_net_key_domain,
#endif
};
static int __init net_ns_init(void)
{
struct net_generic *ng;
//...
down_write(&pernet_ops_rwsem);
//initialize the default network namespace
if (setup_net(&init_net, &init_user_ns))
panic("Could not setup the initial network namespace");
init_net_initialized = true;
up_write(&pernet_ops_rwsem);
if (register_pernet_subsys(&net_ns_ops))
panic("Could not register network namespace subsystems");
//register netlink handlers for creating and querying namespace ids
rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL,
RTNL_FLAG_DOIT_UNLOCKED);
rtnl_register(PF_UNSPEC, RTM_GETNSID, rtnl_net_getid, rtnl_net_dumpid,
RTNL_FLAG_DOIT_UNLOCKED);
return 0;
}
pure_initcall(net_ns_init);
Namespace creation for new processes
When a new process is forked, the kernel decides whether to create new namespaces:
SYSCALL_DEFINE0(fork)->kernel_clone->copy_process->copy_namespaces->create_new_namespaces->copy_net_ns
copy_process calls dup_task_struct to copy the process descriptor and then performs a series of copy operations.
If CLONE_NEWNET was specified, create_new_namespaces is called to create new namespaces and assign them to the new process descriptor's nsproxy:
struct net *copy_net_ns(unsigned long flags,
struct user_namespace *user_ns, struct net *old_net)
{
struct ucounts *ucounts;
struct net *net;
int rv;
//CLONE_NEWNET not specified: take a reference on the old namespace and reuse it
if (!(flags & CLONE_NEWNET))
return get_net(old_net);
ucounts = inc_net_namespaces(user_ns);
if (!ucounts)
return ERR_PTR(-ENOSPC);
net = net_alloc();
if (!net) {
rv = -ENOMEM;
goto dec_ucounts;
}
refcount_set(&net->passive, 1);
net->ucounts = ucounts;
get_user_ns(user_ns);
rv = down_read_killable(&pernet_ops_rwsem);
if (rv < 0)
goto put_userns;
//initialize the newly created namespace
rv = setup_net(net, user_ns);
up_read(&pernet_ops_rwsem);
if (rv < 0) {
put_userns:
#ifdef CONFIG_KEYS
key_remove_domain(net->key_domain);
#endif
put_user_ns(user_ns);
net_drop_ns(net);
dec_ucounts:
dec_net_namespaces(ucounts);
return ERR_PTR(rv);
}
return net;
}
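A hedged user-space sketch of the path that ends in copy_net_ns: unshare(CLONE_NEWNET) also goes through create_new_namespaces, after which the process sees a fresh namespace containing only a downed lo:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* requires CAP_SYS_ADMIN */
	if (unshare(CLONE_NEWNET) != 0) {
		perror("unshare(CLONE_NEWNET)");
		return EXIT_FAILURE;
	}
	/* list links in the new namespace: only lo, state DOWN */
	execlp("ip", "ip", "link", (char *)NULL);
	perror("execlp");
	return EXIT_FAILURE;
}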
Subsystem initialization inside a network namespace
Each subsystem of a namespace is initialized when setup_net is called, including routing tables, TCP's proc pseudo-files, iptables rule loading, and so on.
Because of the complexity of the kernel networking code, it is split into subsystems, each of which defines an init function and an exit function in a pernet_operations structure.
struct pernet_operations {
struct list_head list;
/*
* Below methods are called without any exclusive locks.
* More than one net may be constructed and destructed
* in parallel on several cpus. Every pernet_operations
* have to keep in mind all other pernet_operations and
* to introduce a locking, if they share common resources.
*
* The only time they are called with exclusive lock is
* from register_pernet_subsys(), unregister_pernet_subsys()
* register_pernet_device() and unregister_pernet_device().
*
* Exit methods using blocking RCU primitives, such as
* synchronize_rcu(), should be implemented via exit_batch.
* Then, destruction of a group of net requires single
* synchronize_rcu() related to these pernet_operations,
* instead of separate synchronize_rcu() for every net.
* Please, avoid synchronize_rcu() at all, where it's possible.
*
* Note that a combination of pre_exit() and exit() can
* be used, since a synchronize_rcu() is guaranteed between
* the calls.
*/
int (*init)(struct net *net);
void (*pre_exit)(struct net *net);
void (*exit)(struct net *net);
void (*exit_batch)(struct list_head *net_exit_list);
unsigned int *id;
size_t size;
};
register_pernet_device and register_pernet_subsys both call register_pernet_operations to register the pernet_operations:
/**
* register_pernet_subsys - register a network namespace subsystem
* @ops: pernet operations structure for the subsystem
*
* Register a subsystem which has init and exit functions
* that are called when network namespaces are created and
* destroyed respectively.
*
* When registered all network namespace init functions are
* called for every existing network namespace. Allowing kernel
* modules to have a race free view of the set of network namespaces.
*
* When a new network namespace is created all of the init
* methods are called in the order in which they were registered.
*
* When a network namespace is destroyed all of the exit methods
* are called in the reverse of the order with which they were
* registered.
*/
int register_pernet_subsys(struct pernet_operations *ops)
{
int error;
down_write(&pernet_ops_rwsem);
error = register_pernet_operations(first_device, ops);
up_write(&pernet_ops_rwsem);
return error;
}
/**
* register_pernet_device - register a network namespace device
* @ops: pernet operations structure for the subsystem
*
* Register a device which has init and exit functions
* that are called when network namespaces are created and
* destroyed respectively.
*
* When registered all network namespace init functions are
* called for every existing network namespace. Allowing kernel
* modules to have a race free view of the set of network namespaces.
*
* When a new network namespace is created all of the init
* methods are called in the order in which they were registered.
*
* When a network namespace is destroyed all of the exit methods
* are called in the reverse of the order with which they were
* registered.
*/
int register_pernet_device(struct pernet_operations *ops)
{
int error;
down_write(&pernet_ops_rwsem);
error = register_pernet_operations(&pernet_list, ops);
//point at the first device entry
if (!error && (first_device == &pernet_list))
first_device = &ops->list;
up_write(&pernet_ops_rwsem);
return error;
}
first_device serves as the head of the device portion of the list, so both functions add entries onto the same list:
static struct list_head *first_device = &pernet_list;
register_pernet_operations hangs the pernet_operations onto the list:
static int register_pernet_operations(struct list_head *list,
struct pernet_operations *ops)
{
//...
error = __register_pernet_operations(list, ops);
//...
}
static int __register_pernet_operations(struct list_head *list,
struct pernet_operations *ops)
{
//...
list_add_tail(&ops->list, list);
if (ops->init || (ops->id && ops->size)) {
/* We held write locked pernet_ops_rwsem, and parallel
* setup_net() and cleanup_net() are not possible.
*/
for_each_net(net) {
error = ops_init(ops, net);
if (error)
goto out_undo;
list_add_tail(&net->exit_list, &net_exit_list);
}
}
//...
}
During registration, every existing namespace is iterated over and initialized; net_namespace_list links all struct net instances (see the data structures section above):
#define for_each_net(VAR) \
list_for_each_entry(VAR, &net_namespace_list, list)
static int ops_init(const struct pernet_operations *ops, struct net *net)
{
int err = -ENOMEM;
void *data = NULL;
if (ops->id && ops->size) {
data = kzalloc(ops->size, GFP_KERNEL);
if (!data)
goto out;
err = net_assign_generic(net, *ops->id, data);
if (err)
goto cleanup;
}
err = 0;
if (ops->init)
err = ops->init(net);
if (!err)
return 0;
cleanup:
kfree(data);
out:
return err;
}
register_pernet_subsys example
Many of the files under /proc/net are in fact implemented through register_pernet_subsys; take /proc/net/protocols as an example.
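A hedged sketch of that pattern (the file name "example" and its contents are made up): a pernet init creates the file under each namespace's proc_net, and the exit removes it:
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <net/net_namespace.h>

static int example_show(struct seq_file *seq, void *v)
{
	seq_puts(seq, "hello from this netns\n");
	return 0;
}

static int __net_init example_net_init(struct net *net)
{
	/* /proc/net/example, created once per namespace */
	if (!proc_create_net_single("example", 0444, net->proc_net,
				    example_show, NULL))
		return -ENOMEM;
	return 0;
}

static void __net_exit example_net_exit(struct net *net)
{
	remove_proc_entry("example", net->proc_net);
}

static struct pernet_operations example_net_ops = {
	.init = example_net_init,
	.exit = example_net_exit,
};

static int __init example_init(void)
{
	return register_pernet_subsys(&example_net_ops);
}

static void __exit example_exit(void)
{
	unregister_pernet_subsys(&example_net_ops);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");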
register_pernet_device example
The loopback device that every net owns, mentioned earlier, is registered via register_pernet_device:
/* Registered in net/core/dev.c */
struct pernet_operations __net_initdata loopback_net_ops = {
.init = loopback_net_init,
};
/* Setup and register the loopback device. */
static __net_init int loopback_net_init(struct net *net)
{
struct net_device *dev;
int err;
err = -ENOMEM;
dev = alloc_netdev(0, "lo", NET_NAME_UNKNOWN, loopback_setup);
if (!dev)
goto out;
dev_net_set(dev, net);
err = register_netdev(dev);
if (err)
goto out_free_netdev;
BUG_ON(dev->ifindex != LOOPBACK_IFINDEX);
//store it on the namespace
net->loopback_dev = dev;
return 0;
out_free_netdev:
free_netdev(dev);
out:
if (net_eq(net, &init_net))
panic("loopback: Failed to register netdevice: %d\n", err);
return err;
}
So every newly created namespace comes with a lo:
$ ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Initializing a newly created namespace
The initialization above happens at boot; a namespace added later has to be initialized itself. Back to the setup_net mentioned earlier:
/*
* setup_net runs the initializers for the network namespace object.
*/
static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
{
//...
//run every registered subsystem's init against this namespace
list_for_each_entry(ops, &pernet_list, list) {
error = ops_init(ops, net);
if (error < 0)
goto out_undo;
}
//...
}
Moving a device to another namespace
A device starts out in the default namespace when created, but this can be changed, as in the experiment section above:
sudo ip link set veth1 netns net1
netlink's RTM_SETLINK is what makes this possible:
void __init rtnetlink_init(void)
{
//...
rtnl_register(PF_UNSPEC, RTM_SETLINK, rtnl_setlink, NULL, 0);
//...
}
static int rtnl_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
{
struct net *net = sock_net(skb->sk);
struct ifinfomsg *ifm;
struct net_device *dev;
int err;
struct nlattr *tb[IFLA_MAX+1];
char ifname[IFNAMSIZ];
//...
err = -EINVAL;
ifm = nlmsg_data(nlh);
//look up by ifindex
if (ifm->ifi_index > 0)
dev = __dev_get_by_index(net, ifm->ifi_index);
//look up by interface name
else if (tb[IFLA_IFNAME] || tb[IFLA_ALT_IFNAME])
dev = rtnl_dev_get(net, NULL, tb[IFLA_ALT_IFNAME], ifname);
else
goto errout;
if (dev == NULL) {
err = -ENODEV;
goto errout;
}
//do the work
err = do_setlink(skb, dev, ifm, extack, tb, ifname, 0);
errout:
return err;
}
static int do_setlink(const struct sk_buff *skb,
struct net_device *dev, struct ifinfomsg *ifm,
struct netlink_ext_ack *extack,
struct nlattr **tb, char *ifname, int status)
{
const struct net_device_ops *ops = dev->netdev_ops;
int err;
err = validate_linkmsg(dev, tb);
if (err < 0)
return err;
if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD] || tb[IFLA_TARGET_NETNSID]) {
const char *pat = ifname && ifname[0] ? ifname : NULL;
struct net *net = rtnl_link_get_net_capable(skb, dev_net(dev),
tb, CAP_NET_ADMIN);
if (IS_ERR(net)) {
err = PTR_ERR(net);
goto errout;
}
err = dev_change_net_namespace(dev, net, pat);
put_net(net);
if (err)
goto errout;
status |= DO_SETLINK_MODIFIED;
}
//...
}
/**
 * dev_change_net_namespace - move device to different network namespace
* @dev: device
* @net: network namespace
* @pat: If not NULL name pattern to try if the current device name
* is already taken in the destination network namespace.
*
* This function shuts down a device interface and moves it
* to a new network namespace. On success 0 is returned, on
 * a failure a negative errno code is returned.
*
* Callers must hold the rtnl semaphore.
*/
int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
{
struct net *net_old = dev_net(dev);
int err, new_nsid, new_ifindex;
//...
/*
* And now a mini version of register_netdevice unregister_netdevice.
*/
/* If device is running close it first. */
//bring the interface down
dev_close(dev);
/* And unlink it from device chain */
unlist_netdevice(dev);
synchronize_net();
/* Shutdown queueing discipline. */
dev_shutdown(dev);
/* Notify protocols, that we are about to destroy
* this device. They should clean all the things.
*
* Note that dev->reg_state stays at NETREG_REGISTERED.
* This is wanted because this way 8021q and macvlan know
* the device is just moving and can keep their slaves up.
*/
//notify that the device is about to be unregistered
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
rcu_barrier();
new_nsid = peernet2id_alloc(dev_net(dev), net, GFP_KERNEL);
//allocate an ifindex
/* If there is an ifindex conflict assign a new one */
if (__dev_get_by_index(net, dev->ifindex))
new_ifindex = dev_new_index(net);
else
new_ifindex = dev->ifindex;
//announce the link deletion
rtmsg_ifinfo_newnet(RTM_DELLINK, dev, ~0U, GFP_KERNEL, &new_nsid,
new_ifindex);
/*
* Flush the unicast and multicast chains
*/
dev_uc_flush(dev);
dev_mc_flush(dev);
/* Send a netdev-removed uevent to the old namespace */
kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
netdev_adjacent_del_links(dev);
/* Move per-net netdevice notifiers that are following the netdevice */
//move the notifier chains
move_netdevice_notifiers_dev_net(dev, net);
/* Actually switch the network namespace */
//point at the new net
dev_net_set(dev, net);
dev->ifindex = new_ifindex;
/* Send a netdev-add uevent to the new namespace */
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);
/* Fixup kobjects */
err = device_rename(&dev->dev, dev->name);
WARN_ON(err);
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/
err = netdev_change_owner(dev, net_old, net);
WARN_ON(err);
/* Add the device back in the hashes */
list_netdevice(dev);
/* Notify protocols, that a new device appeared. */
call_netdevice_notifiers(NETDEV_REGISTER, dev);
/*
* Prevent userspace races by waiting until the network
* device is fully setup before sending notifications.
*/
rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
synchronize_net();
err = 0;
out:
return err;
}
static inline
void dev_net_set(struct net_device *dev, struct net *net)
{
write_pnet(&dev->nd_net, net);
}
The namespace a socket belongs to
The socket creation path was already covered in TCP/IP实现浅析:
int sock_create(int family, int type, int protocol, struct socket **res)
{
return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}
Socket creation uses the network namespace of the current process. At the end, sk_alloc records the namespace on the socket:
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
* @family: protocol family
* @priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
* @prot: struct proto associated with this new sock instance
* @kern: is this to be a kernel socket?
*/
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
struct proto *prot, int kern)
{
struct sock *sk;
sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
if (sk) {
//...
sock_net_set(sk, net);
//...
}
return sk;
}
static inline
void sock_net_set(struct sock *sk, struct net *net)
{
write_pnet(&sk->sk_net, net);
}
Multi-namespace socket programming
Going further, if a single process needs sockets from different namespaces, it can use setns to switch to the target namespace first, create the socket there (so it inherits the new namespace), and then switch back (a usage sketch follows the excerpt):
int pal_sock_set_ns(fib_id_t fib)
{
s_int32_t fd;
char netns_file_name[PATH_MAX];
s_int32_t ret;
/* /var/run/netns/nosfib256 */
PAL_VRF_NS_PATH(netns_file_name, fib);
fd = open(netns_file_name, O_RDONLY, 0);
if (fd < 0) {
return -1;
}
ret = setns(fd, CLONE_NEWNET);
close(fd);
return ret;
}
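A hedged usage sketch of the switch/create/switch-back idea (the helper name socket_in_netns is made up): the socket keeps the namespace it was created in, as set by sk_alloc above, regardless of later setns calls:
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* create a socket inside the namespace at netns_path, e.g. /run/netns/net1 */
int socket_in_netns(const char *netns_path, int domain, int type, int proto)
{
	int orig = open("/proc/self/ns/net", O_RDONLY); /* remember current ns */
	int target = open(netns_path, O_RDONLY);
	int sk = -1;

	if (orig >= 0 && target >= 0 && setns(target, CLONE_NEWNET) == 0) {
		sk = socket(domain, type, proto); /* inherits the target ns */
		setns(orig, CLONE_NEWNET);        /* switch back */
	}
	if (orig >= 0)
		close(orig);
	if (target >= 0)
		close(target);
	return sk;
}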
Namespace use on the send/receive path
Take transmission as an example. ip_queue_xmit was analyzed in TCP/IP实现浅析; its route lookup chain is ip_route_output_ports->ip_route_output_flow->__ip_route_output_key->ip_route_output_key_hash->ip_route_output_key_hash_rcu->fib_lookup
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
{
return __ip_queue_xmit(sk, skb, fl, inet_sk(sk)->tos);
}
/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
__u8 tos)
{
struct inet_sock *inet = inet_sk(sk);
struct net *net = sock_net(sk);
//...
rt = ip_route_output_ports(net, fl4, sk,
daddr, inet->inet_saddr,
inet->inet_dport,
inet->inet_sport,
sk->sk_protocol,
RT_CONN_FLAGS_TOS(sk, tos),
sk->sk_bound_dev_if);
//...
}
static inline
struct net *sock_net(const struct sock *sk)
{
return read_pnet(&sk->sk_net);
}
Combined with the assignment made in sk_alloc above, the net passed to ip_route_output_ports is the one recorded when the socket was created.
In fib_lookup:
static inline int fib_lookup(struct net *net, struct flowi4 *flp,
struct fib_result *res, unsigned int flags)
{
//...
tb = rcu_dereference_rtnl(net->ipv4.fib_main);
//...
tb = rcu_dereference_rtnl(net->ipv4.fib_default);
//...
}
This lines up with the structures from the data structures section: each socket carries its own network namespace and therefore uses its own routing tables.
vrf and network namespace
A switch VRF (virtual routing and forwarding instance) can be implemented with network namespaces, since each newly created VRF then gets Layer 3 isolation.
How VM and container namespace interconnection schemes differ
For interconnecting namespaces, virtual machines use tun/tap devices while containers use veth devices.
This is likely because a VM has to attach to a physical NIC, so a tun/tap device is used: one end plugs into the kernel, the other leads toward the physical NIC.
Containers, by contrast, only need to interconnect with each other, so a veth pair suffices. A sketch of the tap side follows.
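A hedged sketch of how a hypervisor-side process typically obtains a tap device, for contrast with the veth pair used by containers (the helper name open_tap is made up):
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* returns an fd whose read()/write() carry Ethernet frames for the tap */
int open_tap(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI; /* L2 frames, no extra header */
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) { /* create/attach the device */
		close(fd);
		return -1;
	}
	return fd;
}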
bridge
bridge communication experiment
Related commands:
brctl addbr — add a bridge
brctl addif — attach a NIC to a bridge
brctl show — show which NICs are attached to a bridge
Create two namespaces and two veth pairs, then configure them:
$ sudo ip netns add net1
$ sudo ip link add veth1 type veth peer name veth1_p
$ sudo ip link set veth1 netns net1
$ sudo ip netns add net2
$ sudo ip link add veth2 type veth peer name veth2_p
$ sudo ip link set veth2 netns net2
$ sudo ip netns exec net1 ip addr add 192.168.10.1/24 dev veth1
$ sudo ip netns exec net2 ip addr add 192.168.10.2/24 dev veth2
$ sudo ip netns exec net1 ip link set dev veth1 up
$ sudo ip netns exec net2 ip link set dev veth2 up
Check the result:
$ sudo ip netns exec net1 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: veth1@if10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.10.1/24 scope global veth1
valid_lft forever preferred_lft forever
$ sudo ip netns exec net2 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: veth2@if8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
link/ether da:3f:28:00:b0:69 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.10.2/24 scope global veth2
valid_lft forever preferred_lft forever
Now create the bridge:
$ sudo brctl addbr br0
$ sudo ip link set dev veth1_p master br0
$ sudo ip link set dev veth2_p master br0
$ sudo ip addr add 192.168.10.3/24 dev br0
$ sudo ip link set veth1_p up
$ sudo ip link set veth2_p up
$ sudo ip link set br0 up
Check the configuration:
$ brctl show
bridge name bridge id STP enabled interfaces
br0 8000.fac06bb1b35b no veth1_p
veth2_p
Now ping:
$ sudo ip netns exec net1 ping 192.168.10.2 -I veth1
PING 192.168.10.2 (192.168.10.2) from 192.168.10.1 veth1: 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.072 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.041 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.041/0.056/0.072/0.015 ms
Source code implementation
Initialization and the bridge creation functions
The bridge code lives in net/bridge/br.c:
module_init(br_init)
static int __init br_init(void)
{
int err;
//...
err = stp_proto_register(&br_stp_proto);
if (err < 0) {
pr_err("bridge: can't register sap for STP\n");
return err;
}
err = br_fdb_init();
if (err)
goto err_out;
//destroy devices when the namespace is destroyed
err = register_pernet_subsys(&br_net_ops);
if (err)
goto err_out1;
err = br_nf_core_init();
if (err)
goto err_out2;
err = register_netdevice_notifier(&br_device_notifier);
if (err)
goto err_out3;
err = register_switchdev_notifier(&br_switchdev_notifier);
if (err)
goto err_out4;
err = br_netlink_init();
if (err)
goto err_out5;
brioctl_set(br_ioctl_deviceless_stub);
//...
return 0;
//...
}
int __init br_netlink_init(void)
{
int err;
br_mdb_init();
br_vlan_rtnl_init();
rtnl_af_register(&br_af_ops);
err = rtnl_link_register(&br_link_ops);
if (err)
goto out_af;
return 0;
//...
}
rtnl_link_register was already seen in the veth analysis: it lets the driver respond to RTM_NEWLINK to create a device, with setup and newlink called in turn; by the time newlink runs, the net_device has already been created (in rtnl_create_link):
struct rtnl_link_ops br_link_ops __read_mostly = {
.kind = "bridge",
.priv_size = sizeof(struct net_bridge),
.setup = br_dev_setup,
.maxtype = IFLA_BR_MAX,
.policy = br_policy,
.validate = br_validate,
.newlink = br_dev_newlink,
.changelink = br_changelink,
.dellink = br_dev_delete,
.get_size = br_get_size,
.fill_info = br_fill_info,
.fill_linkxstats = br_fill_linkxstats,
.get_linkxstats_size = br_get_linkxstats_size,
.slave_maxtype = IFLA_BRPORT_MAX,
.slave_policy = br_port_policy,
.slave_changelink = br_port_slave_changelink,
.get_slave_size = br_port_get_slave_size,
.fill_slave_info = br_port_fill_slave_info,
};
void br_dev_setup(struct net_device *dev)
{
struct net_bridge *br = netdev_priv(dev);
eth_hw_addr_random(dev);
ether_setup(dev);
//includes the transmit/receive functions
dev->netdev_ops = &br_netdev_ops;
dev->needs_free_netdev = true;
//ethtool
dev->ethtool_ops = &br_ethtool_ops;
SET_NETDEV_DEVTYPE(dev, &br_type);
dev->priv_flags = IFF_EBRIDGE | IFF_NO_QUEUE;
dev->features = COMMON_FEATURES | NETIF_F_LLTX | NETIF_F_NETNS_LOCAL |
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX;
dev->hw_features = COMMON_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX;
dev->vlan_features = COMMON_FEATURES;
br->dev = dev;
//...
}
static int br_dev_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
{
struct net_bridge *br = netdev_priv(dev);
int err;
err = register_netdevice(dev);
if (err)
return err;
if (tb[IFLA_ADDRESS]) {
spin_lock_bh(&br->lock);
br_stp_change_bridge_id(br, nla_data(tb[IFLA_ADDRESS]));
spin_unlock_bh(&br->lock);
}
err = br_changelink(dev, tb, data, extack);
if (err)
br_dev_delete(dev, NULL);
return err;
}
Attaching a device to the bridge
The br_netdev_ops implementation:
static const struct net_device_ops br_netdev_ops = {
.ndo_open = br_dev_open,
.ndo_stop = br_dev_stop,
.ndo_init = br_dev_init,
.ndo_uninit = br_dev_uninit,
.ndo_start_xmit = br_dev_xmit,
.ndo_get_stats64 = br_get_stats64,
.ndo_set_mac_address = br_set_mac_address,
.ndo_set_rx_mode = br_dev_set_multicast_list,
.ndo_change_rx_flags = br_dev_change_rx_flags,
.ndo_change_mtu = br_change_mtu,
.ndo_do_ioctl = br_dev_ioctl,
.ndo_add_slave = br_add_slave,
.ndo_del_slave = br_del_slave,
.ndo_fix_features = br_fix_features,
.ndo_fdb_add = br_fdb_add,
.ndo_fdb_del = br_fdb_delete,
.ndo_fdb_dump = br_fdb_dump,
.ndo_fdb_get = br_fdb_get,
.ndo_bridge_getlink = br_getlink,
.ndo_bridge_setlink = br_setlink,
.ndo_bridge_dellink = br_dellink,
.ndo_features_check = passthru_features_check,
};
Its ndo_add_slave is what attaches a device to the bridge:
static int br_add_slave(struct net_device *dev, struct net_device *slave_dev,
struct netlink_ext_ack *extack)
{
struct net_bridge *br = netdev_priv(dev);
return br_add_if(br, slave_dev, extack);
}
/* called with RTNL */
int br_add_if(struct net_bridge *br, struct net_device *dev,
struct netlink_ext_ack *extack)
{
struct net_bridge_port *p;
int err = 0;
unsigned br_hr, dev_hr;
bool changed_addr, fdb_synced = false;
/* Don't allow bridging non-ethernet like devices. */
if ((dev->flags & IFF_LOOPBACK) ||
dev->type != ARPHRD_ETHER || dev->addr_len != ETH_ALEN ||
!is_valid_ether_addr(dev->dev_addr))
return -EINVAL;
/* Also don't allow bridging of net devices that are DSA masters, since
* the bridge layer rx_handler prevents the DSA fake ethertype handler
* to be invoked, so we don't get the chance to strip off and parse the
* DSA switch tag protocol header (the bridge layer just returns
* RX_HANDLER_CONSUMED, stopping RX processing for these frames).
* The only case where that would not be an issue is when bridging can
* already be offloaded, such as when the DSA master is itself a DSA
* or plain switchdev port, and is bridged only with other ports from
* the same hardware device.
*/
if (netdev_uses_dsa(dev)) {
list_for_each_entry(p, &br->port_list, list) {
if (!netdev_port_same_parent_id(dev, p->dev)) {
NL_SET_ERR_MSG(extack,
"Cannot do software bridging with a DSA master");
return -EINVAL;
}
}
}
/* No bridging of bridges */
if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) {
NL_SET_ERR_MSG(extack,
"Can not enslave a bridge to a bridge");
return -ELOOP;
}
/* Device has master upper dev */
if (netdev_master_upper_dev_get(dev))
return -EBUSY;
/* No bridging devices that dislike that (e.g. wireless) */
if (dev->priv_flags & IFF_DONT_BRIDGE) {
NL_SET_ERR_MSG(extack,
"Device does not allow enslaving to a bridge");
return -EOPNOTSUPP;
}
//allocate a net_bridge_port
p = new_nbp(br, dev);
if (IS_ERR(p))
return PTR_ERR(p);
//notify the slave device that it is joining
call_netdevice_notifiers(NETDEV_JOIN, dev);
err = dev_set_allmulti(dev, 1);
if (err) {
br_multicast_del_port(p);
kfree(p); /* kobject not yet init'd, manually free */
goto err1;
}
err = kobject_init_and_add(&p->kobj, &brport_ktype, &(dev->dev.kobj),
SYSFS_BRIDGE_PORT_ATTR);
if (err)
goto err2;
err = br_sysfs_addif(p);
if (err)
goto err2;
err = br_netpoll_enable(p);
if (err)
goto err3;
//register the device's frame receive handler, which is br_handle_frame
err = netdev_rx_handler_register(dev, br_get_rx_handler(dev), p);
if (err)
goto err4;
dev->priv_flags |= IFF_BRIDGE_PORT;
err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL, extack);
if (err)
goto err5;
err = nbp_switchdev_mark_set(p);
if (err)
goto err6;
dev_disable_lro(dev);
//add to the bridge's list of ports in use
list_add_rcu(&p->list, &br->port_list);
nbp_update_port_count(br);
if (!br_promisc_port(p) && (p->dev->priv_flags & IFF_UNICAST_FLT)) {
/* When updating the port count we also update all ports'
* promiscuous mode.
* A port leaving promiscuous mode normally gets the bridge's
* fdb synced to the unicast filter (if supported), however,
* `br_port_clear_promisc` does not distinguish between
* non-promiscuous ports and *new* ports, so we need to
* sync explicitly here.
*/
fdb_synced = br_fdb_sync_static(br, p) == 0;
if (!fdb_synced)
netdev_err(dev, "failed to sync bridge static fdb addresses to this port\n");
}
netdev_update_features(br->dev);
br_hr = br->dev->needed_headroom;
dev_hr = netdev_get_fwd_headroom(dev);
if (br_hr < dev_hr)
update_headroom(br, dev_hr);
else
netdev_set_rx_headroom(dev, br_hr);
if (br_fdb_insert(br, p, dev->dev_addr, 0))
netdev_err(dev, "failed insert local address bridge forwarding table\n");
if (br->dev->addr_assign_type != NET_ADDR_SET) {
/* Ask for permission to use this MAC address now, even if we
* don't end up choosing it below.
*/
err = dev_pre_changeaddr_notify(br->dev, dev->dev_addr, extack);
if (err)
goto err7;
}
//initialize VLAN state
err = nbp_vlan_init(p, extack);
if (err) {
netdev_err(dev, "failed to initialize vlan filtering on this port\n");
goto err7;
}
spin_lock_bh(&br->lock);
changed_addr = br_stp_recalculate_bridge_id(br);
if (netif_running(dev) && netif_oper_up(dev) &&
(br->dev->flags & IFF_UP))
br_stp_enable_port(p);
spin_unlock_bh(&br->lock);
//notify userspace
br_ifinfo_notify(RTM_NEWLINK, NULL, p);
if (changed_addr)
call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
br_mtu_auto_adjust(br);
br_set_gso_limits(br);
kobject_uevent(&p->kobj, KOBJ_ADD);
return 0;
//...
}
Linking the device to the bridge
The key function that ties the added device to the bridge is new_nbp:
/* called with RTNL but without bridge lock */
static struct net_bridge_port *new_nbp(struct net_bridge *br,
struct net_device *dev)
{
//allocate the port object
struct net_bridge_port *p;
int index, err;
index = find_portno(br);
if (index < 0)
return ERR_PTR(index);
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (p == NULL)
return ERR_PTR(-ENOMEM);
//associate the port object with the bridge
p->br = br;
dev_hold(dev);
//associate the port object with the enslaved device
p->dev = dev;
p->path_cost = port_cost(dev);
p->priority = 0x8000 >> BR_PORT_BITS;
//save the port number
p->port_no = index;
p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
br_init_port(p);
br_set_state(p, BR_STATE_DISABLED);
br_stp_port_timer_init(p);
err = br_multicast_add_port(p);
if (err) {
dev_put(dev);
kfree(p);
p = ERR_PTR(err);
}
return p;
}
/* find an available port number */
static int find_portno(struct net_bridge *br)
{
int index;
struct net_bridge_port *p;
unsigned long *inuse;
//allocate a bitmap
inuse = bitmap_zalloc(BR_MAX_PORTS, GFP_KERNEL);
if (!inuse)
return -ENOMEM;
//mark bit 0 as used
set_bit(0, inuse); /* zero is reserved */
list_for_each_entry(p, &br->port_list, list) {
//mark the port numbers in use
set_bit(p->port_no, inuse);
}
//find an unused index
index = find_first_zero_bit(inuse, BR_MAX_PORTS);
bitmap_free(inuse);
return (index >= BR_MAX_PORTS) ? -EXFULL : index;
}
Registering the frame receive handler
The receive handler is registered with netdev_rx_handler_register (a sketch of a handler's shape follows the excerpt):
/**
* netdev_rx_handler_register - register receive handler
* @dev: device to register a handler for
* @rx_handler: receive handler to register
* @rx_handler_data: data pointer that is used by rx handler
*
* Register a receive handler for a device. This handler will then be
* called from __netif_receive_skb. A negative errno code is returned
* on a failure.
*
* The caller must hold the rtnl_mutex.
*
* For a general description of rx_handler, see enum rx_handler_result.
*/
int netdev_rx_handler_register(struct net_device *dev,
rx_handler_func_t *rx_handler,
void *rx_handler_data)
{
if (netdev_is_rx_handler_busy(dev))
return -EBUSY;
if (dev->priv_flags & IFF_NO_RX_HANDLER)
return -EINVAL;
//store the handler and its data
/* Note: rx_handler_data must be set before rx_handler */
rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
rcu_assign_pointer(dev->rx_handler, rx_handler);
return 0;
}
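The general shape of an rx_handler can be sketched as below (a hypothetical handler, illustrative only): __netif_receive_skb_core invokes it with a pointer to the skb pointer, and returning RX_HANDLER_CONSUMED stops normal stack delivery:
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static rx_handler_result_t my_rx_handler(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;
	/* a real handler (like the bridge) reads its port from rx_handler_data */
	void *priv = rcu_dereference(skb->dev->rx_handler_data);

	(void)priv;
	kfree_skb(skb);              /* swallow the frame */
	return RX_HANDLER_CONSUMED;  /* do not deliver it up the stack */
}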
Packet processing
As seen in the veth send/receive analysis, the receiving end of a veth looks up its peer's net_device and calls netif_rx, eventually reaching __netif_receive_skb_core. Unlike the ordinary path, a veth enslaved to a bridge has had an rx_handler installed via netdev_rx_handler_register above, so its packets are not delivered up the protocol stack:
static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
struct packet_type **ppt_prev)
{
//tcpdump capture happens here
//...
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler) {
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = NULL;
}
switch (rx_handler(&skb)) {
case RX_HANDLER_CONSUMED:
ret = NET_RX_SUCCESS;
goto out;
case RX_HANDLER_ANOTHER:
goto another_round;
case RX_HANDLER_EXACT:
deliver_exact = true;
case RX_HANDLER_PASS:
break;
default:
BUG();
}
}
//...
//deliver to the protocol stack
//...
out:
/* The invariant here is that if *ppt_prev is not NULL
* then skb should also be non-NULL.
*
* Apparently *ppt_prev assignment above holds this invariant due to
* skb dereferencing near it.
*/
*pskb = skb;
return ret;
}
After the tcpdump capture point, rx_handler is processed; if the result is RX_HANDLER_CONSUMED, the function returns without delivering to the protocol stack.
Bridge frame handling
Now look at the implementation of br_handle_frame:
/*
* Return NULL if skb is handled
* note: already called with rcu_read_lock
*/
static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
{
struct net_bridge_port *p;
struct sk_buff *skb = *pskb;
const unsigned char *dest = eth_hdr(skb)->h_dest;
if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
return RX_HANDLER_PASS;
if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
goto drop;
skb = skb_share_check(skb, GFP_ATOMIC);
if (!skb)
return RX_HANDLER_CONSUMED;
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
//fetch the rx_handler_data stored earlier by netdev_rx_handler_register, i.e. the net_bridge_port
p = br_port_get_rcu(skb->dev);
if (p->flags & BR_VLAN_TUNNEL) {
if (br_handle_ingress_vlan_tunnel(skb, p,
nbp_vlan_group_rcu(p)))
goto drop;
}
//...
forward:
switch (p->state) {
case BR_STATE_FORWARDING:
case BR_STATE_LEARNING:
//the destination MAC equals the bridge's own address: the frame is addressed to us
if (ether_addr_equal(p->br->dev->dev_addr, dest))
skb->pkt_type = PACKET_HOST;
return nf_hook_bridge_pre(skb, pskb);
default:
drop:
kfree_skb(skb);
}
return RX_HANDLER_CONSUMED;
}
static int nf_hook_bridge_pre(struct sk_buff *skb, struct sk_buff **pskb)
{
#ifdef CONFIG_NETFILTER_FAMILY_BRIDGE
//...
frame_finish:
net = dev_net(skb->dev);
br_handle_frame_finish(net, NULL, skb);
#else
//look at the simple branch
br_handle_frame_finish(dev_net(skb->dev), NULL, skb);
#endif
return RX_HANDLER_CONSUMED;
}
/* note: already called with rcu_read_lock */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
//get the bridge port this veth is attached to
struct net_bridge_port *p = br_port_get_rcu(skb->dev);
enum br_pkt_type pkt_type = BR_PKT_UNICAST;
struct net_bridge_fdb_entry *dst = NULL;
struct net_bridge_mdb_entry *mdst;
bool local_rcv, mcast_hit = false;
struct net_bridge *br;
u16 vid = 0;
u8 state;
if (!p || p->state == BR_STATE_DISABLED)
goto drop;
state = p->state;
if (!br_allowed_ingress(p->br, nbp_vlan_group_rcu(p), skb, &vid,
&state))
goto out;
nbp_switchdev_frame_mark(p, skb);
/* insert into forwarding database after filtering to avoid spoofing */
//L2 forwarding table learning
br = p->br;
if (p->flags & BR_LEARNING)
br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, 0);
local_rcv = !!(br->dev->flags & IFF_PROMISC);
if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
/* by definition the broadcast is also a multicast address */
if (is_broadcast_ether_addr(eth_hdr(skb)->h_dest)) {
pkt_type = BR_PKT_BROADCAST;
local_rcv = true;
} else {
pkt_type = BR_PKT_MULTICAST;
if (br_multicast_rcv(br, p, skb, vid))
goto drop;
}
}
if (state == BR_STATE_LEARNING)
goto drop;
BR_INPUT_SKB_CB(skb)->brdev = br->dev;
BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);
if (IS_ENABLED(CONFIG_INET) &&
(skb->protocol == htons(ETH_P_ARP) ||
skb->protocol == htons(ETH_P_RARP))) {
br_do_proxy_suppress_arp(skb, br, vid, p);
} else if (IS_ENABLED(CONFIG_IPV6) &&
skb->protocol == htons(ETH_P_IPV6) &&
br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) &&
pskb_may_pull(skb, sizeof(struct ipv6hdr) +
sizeof(struct nd_msg)) &&
ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
struct nd_msg *msg, _msg;
msg = br_is_nd_neigh_msg(skb, &_msg);
if (msg)
br_do_suppress_nd(skb, br, vid, p, msg);
}
switch (pkt_type) {
case BR_PKT_MULTICAST:
mdst = br_mdb_get(br, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(br, eth_hdr(skb))) {
if ((mdst && mdst->host_joined) ||
br_multicast_is_router(br)) {
local_rcv = true;
br->dev->stats.multicast++;
}
mcast_hit = true;
} else {
local_rcv = true;
br->dev->stats.multicast++;
}
break;
case BR_PKT_UNICAST:
//unicast FDB lookup
dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, vid);
default:
break;
}
if (dst) {
unsigned long now = jiffies;
if (test_bit(BR_FDB_LOCAL, &dst->flags))
return br_pass_frame_up(skb);
if (now != dst->used)
dst->used = now;
//forward to the port
br_forward(dst->dst, skb, local_rcv, false);
} else {
if (!mcast_hit)
br_flood(br, skb, pkt_type, local_rcv, false);
else
br_multicast_flood(mdst, skb, local_rcv, false);
}
if (local_rcv)
return br_pass_frame_up(skb);
out:
return 0;
drop:
kfree_skb(skb);
goto out;
}
Forwarding path
/**
* br_forward - forward a packet to a specific port
* @to: destination port
* @skb: packet being forwarded
* @local_rcv: packet will be received locally after forwarding
* @local_orig: packet is locally originated
*
* Should be called with rcu_read_lock.
*/
void br_forward(const struct net_bridge_port *to,
struct sk_buff *skb, bool local_rcv, bool local_orig)
{
//...
if (should_deliver(to, skb)) {
if (local_rcv)
deliver_clone(to, skb, local_orig);
else
//forward
__br_forward(to, skb, local_orig);
return;
}
out:
if (!local_rcv)
kfree_skb(skb);
}
static void __br_forward(const struct net_bridge_port *to,
struct sk_buff *skb, bool local_orig)
{
struct net_bridge_vlan_group *vg;
struct net_device *indev;
struct net *net;
int br_hook;
vg = nbp_vlan_group_rcu(to);
skb = br_handle_vlan(to->br, to, vg, skb);
if (!skb)
return;
indev = skb->dev;
//point skb->dev at the destination dev
skb->dev = to->dev;
if (!local_orig) {
//...
net = dev_net(indev);
} else {
//...
net = dev_net(skb->dev);
indev = NULL;
}
NF_HOOK(NFPROTO_BRIDGE, br_hook,
net, NULL, skb, indev, skb->dev,
br_forward_finish);
}
int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
skb->tstamp = 0;
return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
net, sk, skb, NULL, skb->dev,
br_dev_queue_push_xmit);
}
int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
{
skb_push(skb, ETH_HLEN);
if (!is_skb_forwardable(skb->dev, skb))
goto drop;
br_drop_fake_rtable(skb);
if (skb->ip_summed == CHECKSUM_PARTIAL &&
(skb->protocol == htons(ETH_P_8021Q) ||
skb->protocol == htons(ETH_P_8021AD))) {
int depth;
if (!__vlan_get_protocol(skb, skb->protocol, &depth))
goto drop;
skb_set_network_header(skb, depth);
}
dev_queue_xmit(skb);
return 0;
//...
}
Eventually dev_queue_xmit is called; for a veth this lands in its ndo_start_xmit, veth_xmit, covered earlier, which fetches the peer at the other end and sends the packet across.
The whole bridge implementation lives at the device layer, which confirms that it is a Layer 2 device.
bridge vlan support
The bridge supports VLANs. As the figure below illustrates, with the right configuration a physical NIC eth0 can be split into two subdevices, eth0.10 and eth0.20, attached to two VLANs on the bridge with VLAN ids 10 and 20. Likewise the two VMs' virtual NICs vnet0 and vnet1 are attached to the corresponding VLANs. The net effect is that VM1 and VM2 can no longer talk to each other: isolation is achieved.
Borrowed figure:
docker networking
Docker builds its networking from veth + bridge + network namespaces.
A bridge is generally used to plug tap/tun or veth pairs together, tying a group of containers or VMs into one network; Docker uses a bridge to connect all the containers on a host so they can reach each other. The host's physical NIC can also be added to the bridge, letting the host's VMs talk to the outside world.
For containers, the host's original eth0 is usually not added to the bridge, whereas for VMs it usually is. (Containers leave it out, first because the IPs of many containers could collide with those of sibling VMs on the host's network, and second because container networking offers multiple modes, which preserves flexibility.)
See 《跟唐老师学习云网络》 - Docker网络实现 for details.
Container-to-external communication
Extension 1: Open vSwitch and OpenFlow
The openvswitch kernel module lives in net/openvswitch.
OpenFlow
Definition
OpenFlow is the communication protocol between the control plane and the forwarding plane in the SDN (Software Defined Network) architecture; through a standardized open interface it separates the two planes. OpenFlow lets a controller directly access and manipulate the forwarding plane of network devices, whether those devices are physical or virtual.
Purpose
With the rapid growth of server virtualization on physical network infrastructure, the number of virtual machines keeps rising, network management grows ever more complex, and rolling out new services gets slower. Devices therefore need to be simple and flexible to operate, highly scalable, and centrally manageable in their forwarding behavior. Traditional devices, however, integrate the control and forwarding planes, scale poorly, have long technology refresh cycles, and are hard to control centrally or to use for fast service rollout. SDN separates the control plane from the forwarding plane, and OpenFlow provides the communication between them, enabling centralized control of the network's forwarding and rapid deployment of new services.
Extension 2: the k8s network model, CNI, and cilium
Below is GPT's comparison of Docker networking versus k8s networking:

| Aspect | Docker networking | K8s networking (Pod network) |
|---|---|---|
| Network model | Multiple network drivers, limited to a single host or overlay | Flat network model, global direct connectivity |
| IP allocation | Containers usually reach outside via NAT or a bridge | Each Pod gets its own IP, directly reachable within the cluster |
| Communication isolation | Inter-container traffic needs port mapping or custom networks | Pods are mutually reachable by default (restrictable via NetworkPolicy) |
| Multi-host communication | Requires an overlay network, complex to configure | Handled automatically by the CNI plugin |
| Implementation | network namespace + veth + bridge/overlay | CNI plugins, a higher abstraction |
| Network policy | Limited support | Fine-grained control via NetworkPolicy |
Extension 3: OpenStack
References
《跟唐老师学习云网络》 series
《跟唐老师学云网络》—— 目录
《跟唐老师学习云网络》 - 网络命名空间 Network Namespace
《跟唐老师学习云网络》 - Veth网线
《跟唐老师学习云网络》 - TUN/TAP网线
《跟唐老师学习云网络》 - Bridge网桥
《跟唐老师学习云网络》 - OVS交换机
《跟唐老师学习云网络》 - Docker网络实现
《跟唐老师学习云网络》 - Kubernetes网络实现
《跟唐老师学习云网络》 - OpenStack网络实现
开发内功修炼 series
轻松理解 Docker 网络虚拟化基础之 veth 设备!
聊聊 Linux 上软件实现的“交换机” - Bridge!
动手实验+源码分析,彻底弄懂 Linux 网络命名空间
手工模拟实现 Docker 容器网络!
极客时间 series
Others
从 Bridge 到 OVS,探索虚拟交换机
实战演练:Linux Bridge 与 Open vSwitch 对比配置全过程
Open vSwitch 文档(中文)
Openvswitch原理与代码分析