Introduction

The original cluster, MicroK8s running on Ubuntu VMs, failed after a power outage and could not be recovered. After researching the options, I found Talos simple, well-suited, and low-maintenance, so I chose it as the redeployment plan.

Part 1: Cluster Setup and Base Configuration

Prerequisites

The following infrastructure is assumed:

  1. VM setup: the VMs run on Proxmox and fnOS (飞牛), with the following nodes deployed; for configuration details see Proxmox - Sidero Documentation

    10.1.1.101 master-cuipi
    10.1.1.102 node-jiangjiang
    10.1.1.103 node-peento
    10.1.1.104 node-poplar
    
  2. Internal registry proxy: Harbor is used to speed up image pulls, with the proxy projects configured as follows (a sample pull through the proxy follows the list):

    harbor.fuhao.tech/dockerhub -> hub.docker.com, docker.io
    harbor.fuhao.tech/gcr -> gcr.io
    harbor.fuhao.tech/ghcr -> ghcr.io
    harbor.fuhao.tech/k8s -> registry.k8s.io
    harbor.fuhao.tech/quay -> quay.io
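With this layout, an image that would normally come from an upstream registry is pulled through the matching Harbor project instead. A quick sanity check (assuming anonymous pull is enabled on the proxy projects):

# Upstream image:           registry.k8s.io/pause:3.10
# Through the Harbor proxy: harbor.fuhao.tech/k8s/pause:3.10
docker pull harbor.fuhao.tech/k8s/pause:3.10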
    

Generate the Deployment Configuration

export CONTROL_PLANE_IP=10.1.1.101
talosctl gen config talos-proxmox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir . --install-image harbor.fuhao.tech/ghcr/siderolabs/installer:v1.12.1 \
  --registry-mirror docker.io=https://harbor.fuhao.tech/v2/dockerhub/ \
  --registry-mirror gcr.io=https://harbor.fuhao.tech/v2/gcr/ \
  --registry-mirror ghcr.io=https://harbor.fuhao.tech/v2/ghcr/ \
  --registry-mirror registry.k8s.io=https://harbor.fuhao.tech/v2/k8s/
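The command writes three files into the output directory, which the rest of this walkthrough uses:

ls .
# controlplane.yaml  talosconfig  worker.yaml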

Adjust the Registry Mirror Configuration

Edit the generated machine configs and add overridePath: true to every mirror endpoint so the paths match the Harbor proxy layout:

apiVersion: v1alpha1
kind: RegistryMirrorConfig
name: registry.k8s.io
endpoints:
    - url: https://harbor.fuhao.tech/v2/k8s/
      overridePath: true
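overridePath: true tells Talos to use the endpoint's path exactly as written instead of appending the standard /v2 API prefix itself, which is what Harbor's /v2/<project>/ layout requires. A quick way to sanity-check the path the mirror will request (again assuming anonymous access to the Harbor project):

# HEAD request for the manifest of registry.k8s.io/pause:3.10 via the mirror
curl -fsSI \
  -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json" \
  https://harbor.fuhao.tech/v2/k8s/pause/manifests/3.10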

Reference: RegistryMirrorConfig - Sidero Documentation

Node Configuration

Apply the control plane configuration

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file controlplane.yaml

Apply the worker configurations

talosctl apply-config --insecure --nodes 10.1.1.102 --file worker.yaml
talosctl apply-config --insecure --nodes 10.1.1.103 --file worker.yaml
talosctl apply-config --insecure --nodes 10.1.1.104 --file worker.yaml

Bootstrap the Cluster

export TALOSCONFIG="talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP
talosctl bootstrap
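Bootstrapping is a one-time operation, and the control plane takes a few minutes to settle afterwards; progress can be watched with:

talosctl health
talosctl get members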

Update the NTP Servers

The default time servers turned out to be unreachable:

talosctl dmesg
10.1.1.101: user: warning: [2026-01-10T13:53:57.73712814Z]: [talos] time query error with server "162.159.200.1" {"component": "controller-runtime", "controller": "time.SyncController", "error": "read udp 10.1.1.101:54898->162.159.200.1:123: i/o timeout"}

Edit the machine config in place:

talosctl edit machineconfig

Append the time section:

machine:
    time:
        servers:
            - ntp.aliyun.com

Reference: NTP Configuration - Sidero Documentation

Verify that the configuration took effect:

talosctl dmesg
...
10.1.1.101: user: warning: [2026-01-15T01:09:11.755102049Z]: [talos] setting time servers {"component": "controller-runtime", "controller": "network.TimeServerSpecController", "addresses": ["ntp.aliyun.com"]}
...

Generate the kubeconfig

talosctl kubeconfig .
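This writes a kubeconfig file into the current directory; point kubectl at it and confirm that all four nodes have registered:

export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes -o wide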

Part 2: Kubernetes Recovery

If the terminal needs to go through a proxy, set the proxy environment variables first:

export all_proxy=http://10.1.1.90:1080
export http_proxy=http://10.1.1.90:1080
export https_proxy=http://10.1.1.90:1080
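One addition worth considering (an assumption, not part of the original setup): exclude the node subnet and the cluster's service CIDR so kubectl and talosctl traffic to 10.1.1.x does not get routed through the proxy. Whether CIDR notation is honored depends on the client:

# 10.96.0.0/12 is the Kubernetes default service CIDR (adjust if customized)
export no_proxy=localhost,127.0.0.1,10.1.1.0/24,10.96.0.0/12
export NO_PROXY=$no_proxy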

NFS Configuration

Install the NFS CSI driver via Helm

Talos supports NFS out of the box, so no extra OS-level packages are needed.

# Add the repo
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts

# Install the CSI driver
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --version 4.12.0

# Verify the installation status
kubectl --namespace=kube-system get pods --selector="app.kubernetes.io/instance=csi-driver-nfs" --watch -owide

Create a StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.1.1.201
  share: /data/k8s
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
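New workloads can then provision volumes dynamically with nothing more than a PVC referencing this class; a minimal sketch (the claim name is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-client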

To reuse the pre-outage directories, define PVs that point at them directly:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: nfs.csi.k8s.io
  name: share-data-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: nfs.csi.k8s.io
    volumeAttributes:
      server: 10.1.1.201
      share: /data/k8s
      subdir: default/share-data-pv
    volumeHandle: 10.1.1.201#data/k8s#default#share-data-pv##
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-client
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: share-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client
  volumeMode: Filesystem
  volumeName: share-data-pv
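Any pod in the default namespace can then mount the restored data through the claim; a minimal sketch (pod name and mount path are illustrative, and the image is pulled through the docker.io mirror configured earlier):

apiVersion: v1
kind: Pod
metadata:
  name: share-data-test
  namespace: default
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "ls /mnt/share && sleep 3600"]
      volumeMounts:
        - name: share-data
          mountPath: /mnt/share
  volumes:
    - name: share-data
      persistentVolumeClaim:
        claimName: share-data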

Cert-Manager Configuration

Install cert-manager

helm install \
  cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --version v1.19.2 \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

Install the AliDNS Webhook

# Add the repo
helm repo add cert-manager-alidns-webhook https://devmachine-fr.github.io/cert-manager-alidns-webhook
helm repo update

# Install the webhook
helm install -n cert-manager --set groupName=fuhao.tech alidns-webhook cert-manager-alidns-webhook/alidns-webhook

# Create the credentials secret
kubectl -n cert-manager create secret generic alidns-secrets --from-literal="access-token=xxxx" --from-literal="secret-key=xxxx"

Configure the resources

Define a ClusterIssuer for convenient cluster-wide certificate issuance, using the just-installed alidns webhook as the Let's Encrypt DNS-01 solver. The domain is my own, fuhao.tech:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    email: hackstep@qq.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt
    solvers:
    - dns01:
        webhook:
          groupName: fuhao.tech
          solverName: alidns-solver
          config:
            regionId: cn-hangzhou
            accessTokenSecretRef:
              name: alidns-secrets
              key: access-token
            secretKeySecretRef:
              name: alidns-secrets
              key: secret-key

Issue a test certificate to verify that everything is wired up:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kube-fuhao-tech-tls
spec:
  secretName: kube-fuhao-tech-tls
  commonName: kube.fuhao.tech
  dnsNames:
  - kube.fuhao.tech
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
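Issuance progress can be followed with kubectl; the certificate turns Ready once the DNS-01 challenge completes, and while a challenge is pending its status shows under the challenges resource:

kubectl get clusterissuer letsencrypt
kubectl get certificate kube-fuhao-tech-tls -w
kubectl describe challenges.acme.cert-manager.io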

LoadBalancer Configuration

Install MetalLB as a bare-metal load balancer

# Add the repo
helm repo add metallb https://metallb.github.io/metallb

# Install MetalLB
helm install metallb metallb/metallb -n metallb-system --create-namespace

Configure the resources

Layer 2 mode is used here. To keep DNS mapping simple, the control plane node serves as the single entry point. Create an IPAddressPool and an L2Advertisement. Note: an L2Advertisement without an IPAddressPool selector is interpreted as being associated with every available IPAddressPool.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.1.1.101/32
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system

Ingress-NGINX Configuration

Install ingress-nginx

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
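With MetalLB in place, the controller's LoadBalancer Service should pick up the pool address right away:

kubectl get svc -n ingress-nginx ingress-nginx-controller
# EXTERNAL-IP should show 10.1.1.101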

Configuration and Verification

Create an Ingress to verify that ingress-nginx, the load balancer, and certificate issuance all work together:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: code-server
  namespace: dev
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: "nginx"
  tls:
    - hosts:
        - code.kube.fuhao.tech
      secretName: code-kube-fuhao-tech-certs
  rules:
    - host: code.kube.fuhao.tech
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: code-server
                port:
                  number: 8080

Point the DNS record for code.kube.fuhao.tech at the control plane IP 10.1.1.101, then verify:

curl -v https://code.kube.fuhao.tech

Everything works as expected.

All that remains is to apply the rest of the Kubernetes resources one by one, and the recovery is complete.

References

Talos

Storage

Certificate management

Load balancing

Ingress