Introduction
The old cluster, MicroK8s running on Ubuntu virtual machines, broke after a power outage and could not be recovered. After researching alternatives, Talos turned out to be simple, well-suited, and low-maintenance, so it was chosen as the redeployment solution.

1. Cluster Setup: Basic Configuration
Prerequisites
The following infrastructure is assumed:
- VM setup: Proxmox and fnOS (飞牛) virtual machines host the nodes below; for the hypervisor side, see Proxmox - Sidero Documentation.
  10.1.1.101 master-cuipi
  10.1.1.102 node-jiangjiang
  10.1.1.103 node-peento
  10.1.1.104 node-poplar
- Internal registry proxy: Harbor serves as a pull-through cache to speed up deployment, mapped as follows:
  harbor.fuhao.tech/dockerhub -> hub.docker.com, docker.io
  harbor.fuhao.tech/gcr -> gcr.io
  harbor.fuhao.tech/ghcr -> ghcr.io
  harbor.fuhao.tech/registry -> registry.k8s.io
  harbor.fuhao.tech/quay -> quay.io
Generate the deployment configuration
export CONTROL_PLANE_IP=10.1.1.101
talosctl gen config talos-proxmox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir . --install-image harbor.fuhao.tech/ghcr/siderolabs/installer:v1.12.1 \
--registry-mirror docker.io=https://harbor.fuhao.tech/v2/dockerhub/ \
--registry-mirror gcr.io=https://harbor.fuhao.tech/v2/gcr/ \
--registry-mirror ghcr.io=https://harbor.fuhao.tech/v2/ghcr/ \
--registry-mirror registry.k8s.io=https://harbor.fuhao.tech/v2/k8s/
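If generation succeeds, three files land in the current directory; a quick sanity check:
ls -1 controlplane.yaml worker.yaml talosconfig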
Adjust the registry mirror configuration
Edit the generated machine configs and add overridePath: true to every mirror so that the paths match Harbor's proxy-project layout:
apiVersion: v1alpha1
kind: RegistryMirrorConfig
name: registry.k8s.io
endpoints:
  - url: https://harbor.fuhao.tech/v2/k8s/
    overridePath: true
Reference: RegistryMirrorConfig - Sidero Documentation
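A minimal reachability check against the Harbor registry API (either HTTP 200 or 401 means the endpoint is alive; authentication is a separate concern):
curl -sI https://harbor.fuhao.tech/v2/ | head -n 1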
Node configuration
Apply the control-plane configuration
talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file controlplane.yaml
Apply the worker configuration
talosctl apply-config --insecure --nodes 10.1.1.102 --file worker.yaml
talosctl apply-config --insecure --nodes 10.1.1.103 --file worker.yaml
talosctl apply-config --insecure --nodes 10.1.1.104 --file worker.yaml
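The three worker applies can equally be written as a loop:
for ip in 10.1.1.102 10.1.1.103 10.1.1.104; do
  talosctl apply-config --insecure --nodes "$ip" --file worker.yaml
done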
Bootstrap the cluster
export TALOSCONFIG="talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP
talosctl bootstrap
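Once bootstrap returns, talosctl health waits for etcd and the Kubernetes control plane to come up (a basic invocation; it relies on the endpoint and node configured above):
talosctl health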
Update the NTP time servers
The default time server turned out to be unreachable:
talosctl dmesg
10.1.1.101: user: warning: [2026-01-10T13:53:57.73712814Z]: [talos] time query error with server "162.159.200.1" {"component": "controller-runtime", "controller": "time.SyncController", "error": "read udp 10.1.1.101:54898->162.159.200.1:123: i/o timeout"}
Edit the machine config directly:
talosctl edit machineconfig
Append the time section:
machine:
  time:
    servers:
      - ntp.aliyun.com
Reference: NTP configuration - Sidero Documentation
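The same change can also be applied non-interactively as a machine config JSON patch (a sketch, assuming the endpoint and node set up above):
talosctl patch machineconfig -p '[{"op": "add", "path": "/machine/time", "value": {"servers": ["ntp.aliyun.com"]}}]'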
Verify that the change took effect:
talosctl dmesg
...
10.1.1.101: user: warning: [2026-01-15T01:09:11.755102049Z]: [talos] setting time servers {"component": "controller-runtime", "controller": "network.TimeServerSpecController", "addresses": ["ntp.aliyun.com"]}
...
Generate the kubeconfig
talosctl kubeconfig .
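This writes a file named kubeconfig into the current directory; a quick check that the API server answers:
kubectl --kubeconfig ./kubeconfig get nodes -o wide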
2. K8s Recovery
If the terminal needs to go through a proxy, configure the proxy environment variables:
export all_proxy=http://10.1.1.90:1080
export http_proxy=http://10.1.1.90:1080
export https_proxy=http://10.1.1.90:1080
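It is usually worth keeping local and in-cluster traffic off the proxy as well; Go-based tools such as kubectl and helm honor CIDR entries in no_proxy (the service CIDR below assumes the Kubernetes default):
export no_proxy=localhost,127.0.0.1,10.1.1.0/24,10.96.0.0/12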
NFS Setup
Install the NFS CSI driver via Helm
Talos already supports NFS, so no extra OS-level packages are required.
# Add the repo
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
# Install the CSI driver
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --version 4.12.0
# Verify the installation
kubectl --namespace=kube-system get pods --selector="app.kubernetes.io/instance=csi-driver-nfs" --watch -owide
Create a StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.1.1.201
  share: /data/k8s
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
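Apply it and confirm it is registered (the manifest file name here is illustrative):
kubectl apply -f nfs-storageclass.yaml
kubectl get storageclass nfs-client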
Reuse the previous directory by pointing a static PV at it:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: nfs.csi.k8s.io
  name: share-data-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: nfs.csi.k8s.io
    volumeAttributes:
      server: 10.1.1.201
      share: /data/k8s
      subdir: default/share-data-pv
    volumeHandle: 10.1.1.201#data/k8s#default#share-data-pv##
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-client
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: share-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client
  volumeMode: Filesystem
  volumeName: share-data-pv
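After applying both manifests, the claim should bind to the static PV:
kubectl get pv share-data-pv
kubectl get pvc share-data -n default
# STATUS should read Bound for both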
Cert-Manager Setup
Install cert-manager
helm install \
cert-manager oci://quay.io/jetstack/charts/cert-manager \
--version v1.19.2 \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true
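Wait for the three cert-manager pods (controller, webhook, cainjector) to become Ready:
kubectl get pods -n cert-manager --watch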
Install the AliDNS webhook
# Add the repo
helm repo add cert-manager-alidns-webhook https://devmachine-fr.github.io/cert-manager-alidns-webhook
helm repo update
# Install the webhook
helm install -n cert-manager --set groupName=fuhao.tech alidns-webhook cert-manager-alidns-webhook/alidns-webhook
# Create the credentials secret
kubectl -n cert-manager create secret generic alidns-secrets --from-literal="access-token=xxxx" --from-literal="secret-key=xxxx"
Configure the resources
Define a cluster-wide issuer for convenience. It uses the alidns-webhook installed above as the DNS-01 solver for Let's Encrypt; the domain is my own fuhao.tech.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    email: hackstep@qq.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt
    solvers:
      - dns01:
          webhook:
            config:
              accessTokenSecretRef:
                key: access-token
                name: alidns-secrets
              regionId: cn-hangzhou
              secretKeySecretRef:
                key: secret-key
                name: alidns-secrets
            groupName: fuhao.tech
            solverName: alidns-solver
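The issuer should report Ready once the ACME account is registered:
kubectl get clusterissuer letsencrypt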
Issue a test certificate to verify that the pieces fit together:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kube-fuhao-tech-tls
spec:
  secretName: kube-fuhao-tech-tls
  commonName: kube.fuhao.tech
  dnsNames:
    - kube.fuhao.tech
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
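Issuance via DNS-01 can take a few minutes; the certificate's events show the challenge progress:
kubectl get certificate kube-fuhao-tech-tls
kubectl describe certificate kube-fuhao-tech-tls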
LoadBalancer Setup
Install MetalLB as the bare-metal load balancer
# Add the repo
helm repo add metallb https://metallb.github.io/metallb
# Install MetalLB
helm install metallb metallb/metallb -n metallb-system --create-namespace
Configure the resources
Layer 2 mode is used, and for easy DNS mapping the control-plane node doubles as the single entry point. Create an IPAddressPool and an L2Advertisement. Note: an L2Advertisement with no IPAddressPool selector is interpreted as being associated with all available IPAddressPool instances.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.1.1.101/32
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
Ingress Nginx Setup
Install ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx \
--repo https://kubernetes.github.io/ingress-nginx \
--namespace ingress-nginx --create-namespace
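The controller Service should immediately pick up the MetalLB address:
kubectl get svc -n ingress-nginx ingress-nginx-controller
# EXTERNAL-IP should show 10.1.1.101 from the pool above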
Configuration and verification
Create an Ingress to verify that ingress-nginx, the load balancer, and certificate issuance all work together:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: code-server
  namespace: dev
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: "nginx"
  tls:
    - hosts:
        - code.kube.fuhao.tech
      secretName: code-kube-fuhao-tech-certs
  rules:
    - host: code.kube.fuhao.tech
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: code-server
                port:
                  number: 8080
Point the DNS record for code.kube.fuhao.tech at the control-plane node IP 10.1.1.101, then verify with curl -v https://code.kube.fuhao.tech; everything works as expected.
All that remains is to apply the remaining Kubernetes resources one by one, and the recovery is complete.